9.7 Privacy and Security


9.7.1 Preventing Disclosure of Confidential Information

The statistical databases built by government agencies and non-profit organizations often contain confidential information such as income, credit ratings, types of disease, or test scores of individuals. In corporate DWs, some strategic figures that are not related to individuals, like sales for recently launched products, may also be confidential. Whenever sensitive data is exchanged, it must be transmitted over a secure channel like the Secure Sockets Layer (SSL), see Netscape (1996), in order to prevent unauthorized use of the system. For the purposes of this chapter, we assume that adequate measures for security and access control are in place, see Stallings (1999).

However, even if the information in the statistical database safely reaches the correct addressee, the system has to ensure that the released information does not compromise the privacy of individuals or other confidential information. Privacy breaches occur not only as obvious disclosures of individual values in single queries. Often, the combination of multiple non-confidential query results allows for the inference of new confidential facts that were formerly unknown.

We give an example. From Table 9.1, we take the total sales for ''Tennis Shoes'' ($28{,}500$), ''Tennis Balls'' ($450$), ''Tennis Nets'' ($9450$) and a fourth, new product (''Tennis Socks'', $500$). We assume that sum queries for groups of products are allowed but that single, product-specific sales values are confidential. After querying the sum for balls and shoes ($28{,}950$) and for balls and socks ($950$), the user can infer an interval of $[28{,}000;\ 28{,}950]$ for the sales of shoes, as sales cannot be negative. The length of the interval, which is the maximum error of the user's estimation of the confidential shoe sales, is only $3.3\,{\%}$ of the actual value. This particular case of disclosure is called interval inference, see Li et al. (2002). Other types of inference include exact inference (concluding the exact value of $28{,}500$ for shoe sales) and statistical inference (inferring estimates like the mean $\bar{x}_{\text{Tennis Shoes}} = 30{,}000$ and the standard deviation $s_{\text{Tennis Shoes}} = 5000$).
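To make the arithmetic explicit, the following short Python sketch reproduces the interval inference above; the variable names are ours and the figures are taken from Table 9.1.

    # Answers to two permitted sum queries (figures from Table 9.1):
    sum_balls_shoes = 28_950   # Tennis Balls + Tennis Shoes
    sum_balls_socks = 950      # Tennis Balls + Tennis Socks

    # Sales cannot be negative, so 0 <= balls <= sum_balls_socks.
    shoes_lower = sum_balls_shoes - sum_balls_socks   # balls at its maximum
    shoes_upper = sum_balls_shoes                     # balls at its minimum

    actual_shoes = 28_500
    max_error = (shoes_upper - shoes_lower) / actual_shoes
    print(shoes_lower, shoes_upper)   # 28000 28950
    print(f"{max_error:.1%}")         # 3.3%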

If a researcher is granted ad-hoc access to a statistical database, there are basically two different approaches to protect information that is private and confidential from being revealed by a malevolent snooper, see Adam and Wortmann (1989), Agrawal and Srikant (2000), Fig. 9.9. In the first approach, the kind and number of queries that a researcher poses to the statistical database is restricted (query restriction). In the second approach, the entire database is subject to a manipulation that protects single values but preserves the statistical properties which are of interest to the user. Then the perturbed database can be accessed by a researcher without restrictions (data perturbation). In the following, we give an overview of disclosure protection techniques of this kind.

Figure 9.9: (a) Query set restriction and (b) data perturbation. Source: Adam and Wortmann (1989)

9.7.2 Query Set Restriction

With this approach, a query is either denied or answered exactly, as the upper sketch in Fig. 9.9 indicates.

Query Set Size Control.

This method, introduced by Fellegi (1972), works by setting lower and upper bounds for the size of the query answer set, based on the properties of the database and on the preferences fixed by the database administrator. If the number of returned records does not lie within these bounds, the query is rejected and the answer is denied. As queries that are issued sequentially by one user often have a large number of entities in common, an improvement is to restrict the number of entities that any two queries may share, see Dobkin et al. (1979). Although popular, this method is not robust enough as a stand-alone solution, see Denning (1982).
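As a minimal sketch of query set size control, assume the database is a list of record dictionaries and the administrator has fixed the bounds; the function and parameter names below are ours.

    LOWER, UPPER = 5, 95   # bounds fixed by the database administrator (assumed values)

    def answer_sum(records, predicate, attribute):
        """Answer a SUM query only if the query set size lies within the bounds."""
        query_set = [r for r in records if predicate(r)]
        if not (LOWER <= len(query_set) <= UPPER):
            raise PermissionError("query denied: answer set too small or too large")
        return sum(r[attribute] for r in query_set)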

Auditing.

Auditing involves keeping up-to-date logs of all queries made by each user and constantly checking for possible disclosures whenever a new query is issued. One major drawback of this method is that it requires huge amounts of storage and CPU time to keep these logs updated. A well-known implementation of such an audit system is Audit Expert by Chin and Özsoyoglu (1982). It uses binary matrices, see bitmap indexes in Sect. 9.4.3, to indicate whether or not a record was involved in a query.
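The following simplified Python sketch conveys the idea of such a binary query log; it only catches the elementary compromise where two answered sum queries differ by a single record, whereas Audit Expert analyses the full system of linear equations. All names are ours.

    import numpy as np

    class SimpleAuditor:
        def __init__(self, n_records):
            self.n = n_records
            self.log = []                 # one 0/1 incidence row per answered query

        def permit(self, query_set):
            """Deny a SUM query if it differs from a logged one by one record."""
            row = np.zeros(self.n, dtype=int)
            row[list(query_set)] = 1
            for old in self.log:
                if np.abs(row - old).sum() == 1:   # sums differ in a single record,
                    return False                   # so answering discloses its value
            self.log.append(row)
            return True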

Cell Suppression.

Cell suppression, see Cox (1980), is an important method for categorical databases when information is published in tabular form. Census bureaus in particular often make use of tabular data and publish counts of individuals based on different categories. One of the main privacy objectives is to avoid answers of small size. For example, if a snooper knows somebody's residence, age and employer, he can issue a query for (ZIP = 10178, Age = 57, Employer = 'ABC'). If the answer is one entity, the snooper can go on and query for (ZIP = 10178, Age = 57, Employer = 'ABC', Diagnosis = 'Depression'). If the answer is one again, the database is compromised and the person with the diagnosis identified. Such cells have to be suppressed. A common criterion to decide whether or not to suppress a cell is the N-k rule, where a cell is suppressed if the top $N$ respondents contribute at least $k\,{\%}$ of the cell total. $N$ and $k$ are parameters that are fixed by the database administrator, e.g. the census bureau. In the exemplary case of $N=2$ and $k = 10\,{\%}$, a cell indicating the aggregated income ($10M) of $100$ individuals would have to be suppressed if the top two earners' aggregate income exceeded $1M.
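A possible implementation of the N-k rule is the few lines of Python below; the function name and the test figures are ours and purely hypothetical.

    def suppress_cell(contributions, N=2, k=10.0):
        """N-k rule: suppress the cell if the top N respondents
        contribute at least k percent of the cell total."""
        total = sum(contributions)
        top_n = sum(sorted(contributions, reverse=True)[:N])
        return total > 0 and top_n >= k / 100.0 * total

    # Hypothetical income cell: the top two earners exceed 10% of the total.
    incomes = [1_200_000, 900_000] + [79_592] * 98
    print(suppress_cell(incomes))   # True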

9.7.3 Data Perturbation

In the query restriction approach, either exact data is delivered from the original database or the query is denied. As depicted in the lower sketch of Fig. 9.9, an alternative is to perturb the original values such that confidential individual data become useless for a snooper while the statistical properties of the attribute are preserved. The manipulated data is stored in a second database, which the users can then access without restrictions.

If, in Table 9.1, we permute the sales of tennis balls, tennis nets and tennis shoes, the individual sales figures are no longer correct, but the arithmetic average and the standard deviation of the attribute sales stay the same. This procedure is called data swapping, see Denning (1982).
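A sketch of data swapping, assuming the sales of Table 9.1 are held in a Python dictionary (a real implementation would of course operate on database tables):

    import random, statistics

    sales = {"Tennis Shoes": 28_500, "Tennis Balls": 450, "Tennis Nets": 9_450}

    values = list(sales.values())
    random.shuffle(values)                 # permute the sales across the products
    swapped = dict(zip(sales, values))

    # Individual values are now wrong, but the statistics are preserved:
    assert statistics.mean(swapped.values()) == statistics.mean(sales.values())
    assert statistics.stdev(swapped.values()) == statistics.stdev(sales.values())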

Noise addition for numerical attributes, see Traub et al. (1984), means adding a disturbing term to each value: $Y_{k} = X_{k} + e_{k}$, where $X_{k}$ is the original value and $e_{k}$ adheres to a given probability distribution with mean zero. As the perturbation $e_{k}$ is fixed for every value $X_{k}$, conducting multiple queries does not refine the snooper's search for confidential single values.
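A sketch of such a fixed perturbation; seeding a generator with the record key (our choice of mechanism, not prescribed by the source) makes $e_{k}$ deterministic, so repeated queries return the identical perturbed value.

    import random

    def perturb(record_id, x, sigma=100.0):
        """Y_k = X_k + e_k with zero-mean Gaussian noise that is fixed
        per record: re-querying never draws a fresh e_k."""
        e = random.Random(record_id).gauss(0.0, sigma)   # deterministic per record
        return x + e

    # Averaging repeated answers does not help the snooper:
    assert perturb(42, 28_500) == perturb(42, 28_500)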

A hybrid approach is random-sample queries, see Denning (1982), where a sample is drawn from the query set in such a way that each entity of the complete set is included in the sample with probability $P$. If, for example, the sample of a COUNT query has $n$ entities, then the size of the unperturbed query set can be estimated as $n/P$. If $P$ is large, there should be a set-size restriction to avoid small query sets where all entities are included.
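A sketch of a random-sample COUNT query; to prevent refinement by repetition, the inclusion decision is made a fixed function of the entity, analogous to the fixed noise above (the hashing trick is our assumption, not part of the original proposal).

    import hashlib

    def sampled_count(record_ids, P=0.5):
        """Each entity enters the sample with fixed probability P;
        n / P estimates the size of the unperturbed query set."""
        def included(rid):                      # deterministic per entity
            h = hashlib.sha256(str(rid).encode()).digest()
            return h[0] / 256.0 < P
        n = sum(1 for rid in record_ids if included(rid))
        return n / P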

9.7.4 Disclosure Risk versus Data Utility

All methods presented in the preceding sections aim at lowering the disclosure risk for data that is private and confidential. But at the same time, each of these methods reduces, in some way, the utility of the data for the legitimate user. Duncan and Keller-McNulty (2001) present a formal framework to measure this trade-off between disclosure risk and data utility, the Risk-Utility (R-U) map. There are numerous measures for disclosure risk, see Domingo-Ferrer et al. (2002) for an excellent overview. We already gave an intuitive measure for interval inference: the sales for tennis shoes were predicted with a maximum error of only $3.3\,{\%}$, see Sect. 9.7.1.

However, it is far more difficult to measure data utility because it strongly depends on the varying preferences of the data user. For this reason especially, ranking the statistical disclosure control methods presented here on an absolute scale is almost impossible.

