

9.1 Introduction

Most data collected in statistics and science is still stored in simple flat files, usually data matrices with rows identified by a case identifier (case_id) and columns corresponding to attributes (variables). Each row (tuple) carries, besides the case_id, the coded values of the attributes. Because a suitable encoding maps each natural domain to a numerical one, all matrix entries have a numeric data type, although the scales of the attributes may of course be quite different.
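The coding step can be illustrated by a minimal Python sketch; the attribute names and code tables below are hypothetical, chosen only to mirror the census example:

```python
# Map categorical attribute values to numeric codes so that every
# matrix entry has a numeric data type (hypothetical code tables).
GENDER_CODES = {"female": 1, "male": 2}
AGE_GROUP_CODES = {"0-17": 1, "18-64": 2, "65+": 3}

def encode_row(case_id, age_group, gender):
    """Return a coded tuple (case_id, age_group_code, gender_code)."""
    return (case_id, AGE_GROUP_CODES[age_group], GENDER_CODES[gender])

# A tiny data matrix: rows are cases, columns are coded attributes.
matrix = [
    encode_row(1, "18-64", "female"),
    encode_row(2, "65+", "male"),
]
print(matrix)  # every entry is numeric; the scales differ per column
```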

A simple example is given by census data stored at statistical offices in files according to a schema like

census_questionnaire (case_id, age-group, gender,
  profession,...).
While the sciences obtain their data from experiments, statistical agencies still collect their data mostly off-line from surveys, reports or censuses. Industry and services get their data on-line from their business process management, i.e., from their logistical, production and administrative transactions. A typical example is sales data, which may be represented by a schema like
sales (transaction_id, customer_id, date,
  product_name, quantity, price, amount).
Such data is called microdata, since it is kept in its original form and is atomic, i.e., not further divisible. In the business area such data is labeled on-line transaction data, because it is subject to frequent updates and is the basis of continuous business transactions. A simple file system is rarely a good choice for storing microdata, because it lacks safety and integrity guarantees and offers only poor retrieval facilities. Such data should rather be stored as tables of a relational database; a database management system (DBMS) ensures safety, integrity and retrieval flexibility. For instance, a query like ''Find prices and amounts of all sales since the year 2001 in which customer 007 and product 4711 are involved'' can be stated simply in the structured query language (SQL) as
SELECT price, amount FROM sales
WHERE date >= '2001-01-01'
AND customer_id = 007
AND product_name = '4711';
It is interesting to note that SQL provides a set of query operators that is relationally complete. One may thus process any reasonable query as long as it does not involve ''transitive closure'', i.e. a potentially unbounded iteration based on some logical inference (such as a part-of hierarchy).
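The query can be tried out with an embedded SQL engine; the following is a minimal sketch using Python's sqlite3 module, with the sales schema from above and invented sample rows:

```python
import sqlite3

# In-memory database with the sales schema from the text
# (the three sample rows are invented for illustration).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales (
    transaction_id INTEGER, customer_id INTEGER, date TEXT,
    product_name TEXT, quantity INTEGER, price REAL, amount REAL)""")
con.executemany("INSERT INTO sales VALUES (?,?,?,?,?,?,?)", [
    (1, 7, "2001-03-15", "4711", 2, 9.50, 19.00),
    (2, 7, "2000-12-01", "4711", 1, 9.50, 9.50),   # before 2001
    (3, 8, "2002-06-20", "4711", 3, 9.50, 28.50),  # other customer
])

rows = con.execute("""SELECT price, amount FROM sales
                      WHERE date >= '2001-01-01'
                        AND customer_id = 7
                        AND product_name = '4711'""").fetchall()
print(rows)  # -> [(9.5, 19.0)]
```

Storing dates as ISO-8601 strings makes the range predicate a simple lexicographic comparison, which is the usual convention in SQLite.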

Macrodata is derived from microdata by applying statistical functions, aggregation and grouping, and consequently has a larger granularity. For example, a business analyst might be interested in a three-way table (data cube) of total sales classified by month and year, customer_id and product_name. Such a retrieval can be achieved on sales by the command:

SELECT SUM(amount), date.month, date.year, customer_id,
  product_name
FROM sales GROUP BY date.month, date.year,
  customer_id, product_name;
This type of activity is termed on-line analytical processing (OLAP), a name that clearly expresses its objective: the statistical analysis of data for planning, decision support, and controlling.
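Since `date.month` and `date.year` are a conceptual notation for a hierarchical attribute rather than standard SQL, the aggregation can be sketched in sqlite3 by extracting the date parts with `strftime` (sample rows again invented):

```python
import sqlite3

# Aggregate invented microdata into macrodata: total amount grouped
# by month, year, customer and product (schema from the text).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sales (
    transaction_id INTEGER, customer_id INTEGER, date TEXT,
    product_name TEXT, quantity INTEGER, price REAL, amount REAL)""")
con.executemany("INSERT INTO sales VALUES (?,?,?,?,?,?,?)", [
    (1, 7, "2001-03-15", "4711", 2, 9.5, 19.0),
    (2, 7, "2001-03-28", "4711", 1, 9.5, 9.5),
    (3, 8, "2001-04-02", "0815", 3, 4.0, 12.0),
])

cube = con.execute("""
    SELECT SUM(amount) AS total_sales,
           strftime('%m', date) AS month, strftime('%Y', date) AS year,
           customer_id, product_name
    FROM sales
    GROUP BY year, month, customer_id, product_name
    ORDER BY year, month, customer_id""").fetchall()
print(cube)  # one row per (month, year, customer, product) cell
```

The two March transactions of customer 7 collapse into a single cell of the cube with total_sales 28.5, illustrating the coarser granularity of macrodata.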

As we shall see later, there is no clear boundary between retrieval and statistical modeling. A statistical function like sum (or average) must be selected for a specific query, and this selection already implies modeling; consequently, there is no closed set of operators on such multi-way tables. Moreover, two further problems are involved: first, which data structures are efficient for this kind of data, and second, what background information is needed to assist the management and interpretation of real data? This leads us to discuss metadata, i.e., data about real data and functions. Modern database management systems encapsulate metadata in a repository (an integrated metadata database).
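What such metadata might look like can be sketched as follows; the repository structure, field names and entries are hypothetical, meant only to illustrate data about data:

```python
# A minimal sketch of a metadata repository for the sales table:
# per-attribute descriptions of type, unit and aggregation behavior
# (all field names and entries are hypothetical).
metadata = {
    "sales": {
        "amount": {"type": "numeric", "unit": "EUR",
                   "summarizable": True},
        "price":  {"type": "numeric", "unit": "EUR per unit",
                   "summarizable": False},  # averaging, not summing
        "date":   {"type": "temporal",
                   "hierarchy": ["day", "month", "year"]},
    }
}

def summable(table, attribute):
    """Look up in the repository whether SUM over an attribute is meaningful."""
    return metadata[table][attribute].get("summarizable", False)

print(summable("sales", "amount"))  # True: amounts may be totaled
print(summable("sales", "price"))   # False: unit prices should not be summed
```

A repository of this kind lets software check, before aggregating, whether a requested operation is statistically meaningful, which anticipates the summarizability problem discussed below.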

In the following we are first concerned with some fundamentals of data management. We then turn to the architecture of a statistical database or a data warehouse (DW) and some related concepts. We pay special attention to conceptual data structures and the operators involved, the summarizability problem, and hierarchical attributes. We discuss metadata, access methods, and ''extraction, transformation and loading'' (ETL), and we close with metadata and the extensible markup language (XML), and privacy.

