next up previous contents index
Next: 9.3 Architectures, Concepts and Up: 9. Statistical Databases Previous: 9.1 Introduction

Subsections


9.2 Fundamentals of Data Management

We start our discussion with file systems, have a look at database systems (DBSs) useful to store transaction or microdata, and finally turn to DWs which host macrodata either in a real (materialized) or virtual form.

9.2.1 File Systems

Data is classically stored in files. Files can be viewed as a conceptually related set R of records, which are represented by a given record type, see Wirth (1986), and an access mode (direct or sequential). If the records have a numeric type for each of its fields and the mode is sequential, then a data matrix can be stored in a sequential file. A collection of such files is called a file system (FS), if there exist logical relations between the files f$ \in$FS, a set of constraints on FS and application software. Typical applications in statistics are simple surveys like price surveys, where in most cases only one file is needed. A more complex file system is compulsory if, for instance, stratified or panel sampling designs are considered, where various sampling periods, areas, objects and units (carriers of interest) are involved. Moreover, relational data mining, as described by Dzeroski and Lavrac (2001) and Wrobel (2001), is devoted to such data structures.

File systems are appropriate if only single user-access and weakly logically connected files with simple constraints are effective. Note that application programs must be specially tailored to execute queries, and to achieve data safety and security. This implies data dependence between the software and the files referenced, which reduces the program's flexibility with respect to structural changes of the data structure. These pitfalls can be overcome by DBSs.


9.2.2 Relational Database Systems (RDBS)

Multi-user access, complex data structures and logical restrictions ask for a relational database system (RDBS). It consists of a set T of relations (flat tables) together with a set S of corresponding schemas and a set C of constraints, a database management system and application software. A database schema describes the attributes (variables) of a specific table, its data types and roles. To avoid redundancy and anomalies during insert, delete or update transactions, those tables should be transformed into a ''normal form'', see Elmasri and Navathe (1999). As an example, we take a Census. When we look at the RDBS `Census' from a conceptual point of view, there are four table schemas involved: Census-questionnaire, household, dwelling, and employment. We shall consider only the first two in some detail, and select only some few attributes for the sake of brevity. The first schema is

census_questionnaire(case_id, age-group, gender,
  profession, ...).
Its first three attributes are numeric and the fourth one is of type `string'. The attribute case_id acts as a primary key, i.e., the remaining attributes are functionally dependent on it. Because a key attribute uniquely identifies any tuple (record) of the corresponding table (set of tuples), there is one constraint among others saying that duplicates in a given table are not allowed. In order to mention just one further constraint, the domain of the identifier case_id may be restricted to the set of positive integers.

The next schema is

household (household_id, case-id, role,...).
The first two attributes have a numeric domain, while role is of type `string' with the value set {''member'', ''owner''}. Of course, we have again the constraint that duplicates are not allowed, but we need at least one further restriction to ensure reference integrity, i.e., whenever there exist entries of people grouped together in a household, each of their corresponding records in census_questionnaire has to exist.

Last but not least, we reconsider our sales example from the introduction. The schema is

sales (transaction_id, customer_id, date, product_name,
  quantity, price, amount)
The primary key is transaction_id, which implies that only one product can be part of any transaction. Evidently, this scheme is not normalized, because price depends on product_name besides of transaction_id, and $ {\text{{\tt amount}}} = {\text{{\tt quantity}}} \times
{\text{{\tt price}}}$. The relation itself is of degree (number of attributes) seven. The six attributes customer_id, date, ..., amount span a six-dimensional data space, where each tuple has six data elements, and is identified by its corresponding transaction_id. We represent four tuples in Table 9.1 to illustrate the difference between a schema and its corresponding relation (table). We use abbreviations in the header of the table sales.


Table 9.1: The relational table sales of degree 7 and cardinality 4
Transaction_id customer_id Date Product_name Quantity Price Amount
$ 015$ A 4 Jan 97 Tennis Shoes $ 200$ 95 19,000.00
$ 018$ A 4 Jan 97 Tennis Balls $ 300$  1.50   450.00
$ 004$ A 3 Jan 97 Tennis Nets $ 350$ 27  9450.00
$ 009$ C 3 Jan 97 Tennis Shoes $ 100$ 95  9500.00
$ \ldots$ $ \ldots$ $ \ldots$ $ \ldots$ $ \ldots$ $ \ldots$ $ \ldots$

The need of various users for different data can be satisfied by the concept of virtual relations (views), which can be created on top of an existing DBS.

Note that the term ''table'' used in a relational database to store such information is quite different from the tables statisticians use for the same purpose. Table 9.2 shows the representation of the same information in a different table structure that allows the natural computation of aggregates along rows and columns (''margin sums'' etc.). Note that this table structure cannot be mapped directly into a relational database context due to the margins (Total or ALL), see Gray et al. (1996).


Table 9.2: sales data in the form of the three-way statistical table total_sales
3 Jan 1997 Tennis shoes Tennis balls Tennis nets
Customer A   0 0 $ 350$
Customer B   0 0   0
Customer C $ 100$ 0   0
Total $ 100$ 0 $ 350$
4 Jan 1997 Tennis shoes Tennis balls Tennis nets
Customer A  $ 300$  $ 400$  $ 450$
Customer B $ 1100$ $ 1100$  $ 800$
Customer C  $ 600$ $ 1600$  $ 350$
Total $ 2350$ $ 3400$ $ 1900$

Let us close this example with a discussion of the background information needed. We mentioned above metadata like schema names, attribute names, data types, roles (key versus non key) of attributes, constraints etc. All this can be considered as technical metadata. Moreover, we need further metadata of a semantic and statistical type. Take for instance the attributes quantity, price and amount. What is their definition? As far as amount is concerned we have ''amount = quantity * price''. Furthermore, we need the corresponding measurement units which may be units, EUR/unit and EUR. As far as data collection at Statistical Offices is concerned, we may need information about the periodicity of data surveys like `annual', `quarterly' or `monthly'. With respect to data analysis we may be interested in the measurement scale. While product_name has a nominal scale allowing only operations like `equal' and `not equal', the attributes quantity, price and amount have a metric scale allowing for all basic numerical operations. There exist further ambiguities. For example, the generation mode of the attribute sales may have the categories `real', `simulated' or `forecasted'. There may exist further vagueness about sales of category `real' unless its update state is set to `final', and not to `provisional'.

9.2.3 Data Warehouse Systems (DWS)

data warehouse system (DWS) consists of a (replicated) micro database, a set of materialized or virtual multi-way tables (data cubes) needed to represent macro (pre-aggregated and grouped) data, a data warehouse management system (DWMS), and a repository, which stores all required technical, statistical and semantic metadata.

As an example of a data cube, we remind the reader of the three-way table presented above:

total_sales (date.month, date.year, customer_id,
             product_name, sum(sales)).
This table is represented in a relational form, where date, customer_id and product_name are concatenated as a primary key. These attributes are called dimensions. Evidently, the non-key attribute sum(sales) is fully dependent upon this key, i.e. given the values of date.month, date.year, customer_id, product_name there exist one and only one value of sum(sales) if missing values (null values) are excluded.

Views are useful again and can be provided by joining cubes or sub-cubes in combination with table projections to lower dimensions. It is worthwhile considering separately the attributes sum(sales), date and product_name. The first attribute is sometimes called summary attribute and is composed of the statistical sum applied to the attribute sales, see Shoshani (1997). This operation is feasible because the function sum and the attribute sales have an identical data type, i.e., a metric type. Moreover, the attribute sales is of attribute type flow, but not stock. While summarizing over flows (rates) is reasonable, such an operation over stocks like `number of customers' is nonsense. Evidently, such and further integrity constraints must be effective for a DWS, in order to protect the naive user from nonsense queries. This is extremely important for data warehousing, because contrary to database queries, in DWS the application of statistical functions is an inherent part of any query.

Furthermore, there exists a specific problem related to date. This attribute can be decomposed into month and year but these components are functionally dependent, i.e., for a given month of a calendar year the year is fixed. We thus have (month, year) -> year as a functional dependency. Therefore only one dimension called date is used for the two attributes month and year in the data cube above. There may be further temporal levels like hour, day, month, quarter and year. Such hierarchical attributes are called taxonomies and need special attention, see Lehner et al. (1998). It is quite remarkable that all dimensions can be allocated to three principal groups: time, location and subject. This is called the 3D-principle, see Lenz (1994).

Let us have a further look at taxonomies that are unbalanced and asymmetric. This may happen in case of a product or regional hierarchy. In our running example the subgroups tennis shoes and balls may be grouped together as product_group1, while tennis nets build-up product_group2, but are free of sub-grouping. Both groups 1 and 2 build the root group product_all. As subgroups exist only for shoes and balls, subgroups are no longer functionally dependent on product_name, but only weakly functionally dependent, see Lehner et al. (1998), Fig. 9.1. This implies that queries, which involve sub-grouping over products, are not feasible and must be refused. Further pitfalls of operations on a data-cube are given in Lenz and Shoshani (1997) and Lenz and Thalheim (2001).

Figure 9.1: The product taxonomy with a weak functional dependency
\includegraphics[width=8.2cm]{text/2-9/II1.eps}

This discussion shows that real data without metadata is more or less useless especially for on-line analytical processing (OLAP). A repository with metadata has become a prerequisite of any DBS engineering and sound data analysis.


next up previous contents index
Next: 9.3 Architectures, Concepts and Up: 9. Statistical Databases Previous: 9.1 Introduction