next up previous contents index
Next: 9.4 Access Methods Up: 9. Statistical Databases Previous: 9.2 Fundamentals of Data

Subsections


9.3 Architectures, Concepts and Operators

We first consider the architecture of micro or operational data used for online transaction processing (OLTP), and then illustrate the different architecture of macro or analytical data used for decision support and its relation to operational data, see OLAP. We note that the key features of a DBS for OLTP data are: transaction-oriented, measurement- or record-based, real time processing of inserts, deletes and updates of records. In contrary, a DWS for OLAP data is characterized by the features: subject-oriented, integrated and aggregated, calendar or fiscal period related, and non-volatile, see Inmon (1992).

9.3.1 Architecture of a Database System for OLTP

The architecture of DBS can be represented by the quintuple (data sources, application server, DB server with a DBMS, application server, DB and repository); see also Fig. 9.2. As mentioned above, business processes act as data sources in commercial systems, while at statistical offices data is supplied by surveys, periodic reports or a census. Similarly, in science the data is generated by observations or measurements collected by field or simulation experiments. We represent the architecture in Fig. 9.2.

Figure 9.2: Architecture of a DBS used to manage and query operational data
\includegraphics[width=7.8cm]{text/2-9/II2.eps}

As an example from business we consider a company, which manages wages and salaries of its employees. The data is generated by bookkeeping, the DBMS administers the real and metadata, processes queries, and controls transactions. The application server is responsible for running the software for wage and salary computation, while the client is used as a presentation layer.

9.3.2 Architecture of a Data Warehouse

The main components of the architecture of any OLAP application are heterogeneous data sources S like internal or external databases or files, an OLAP server with DWMS, DW, Repository and Data Marts, and OLAP clients. The DWMS is responsible for load management, query management and warehouse management.

Figure 9.3: DW Architecture
\includegraphics[width=9.2cm]{text/2-9/II3.eps}

The DW (see Fig. 9.3) incorporates data replications, archived data and aggregated data stored as data cubes. The departmental view on the whole data is given by subsets of the data cube, called data marts.

As can be seen from Fig. 9.3, analytical processing is concerned with data from various data sources, i.e., external or internal (operational) data. These sources are integrated by ETL in data marts in an unified manner. The data marts can be viewed as collections of data cubes.

There exist two types of OLAP clients:

  1. stand-alone applications like spreadsheets with a DW interface, and

  2. Web clients that use an Internet browser and often applets.

9.3.3 Concepts (ROLAP, MOLAP, HOLAP, Cube Operators)

As we have seen above, the schema of a data cube consists of a cube identifier (name), a list of identifying attributes called dimensions and a statistical function like min, max, count (frequency), sum, avg (arithmetic mean) applied to a summary attribute. Furthermore, the data types of the attributes and integrity constraints must be given. As an example we take from above the data cube ''sales cross-classified by (month, year), customer and product'':

total_sales (date.month, date.year, customer_id,
  product_name, sum(sales)).
Evidently, the dimensions span a three-dimensional space on which the statistical function sum(sales) is defined. The corresponding data types are date (mm.yyyy), integer, string and decimal.

9.3.3.1 Relational OLAP (ROLAP)

In the following we turn to the conceptual mapping of a data cube into a relational database schema. This approach is called ROLAP for Relational OLAP, see Raden (1996). There exist two schemas, star and snowflake schemas. As illustrated in Fig. 9.4, the star schema uses two different types of schemas, which refer to two types of corresponding tables:

  1. fact table with a primary key reference to each dimension and the facts which are composed of at least a statistical function and a summary attribute.

  2. dimension table for each dimension with a primary key and a level indicator for each entry of a hierarchical attribute.

Figure 9.4: Star schema of a three-dimensional data cube (one fact table, three dimension tables; the product hierarchy is assumed to have two levels)
\includegraphics[width=7.2cm]{text/2-9/II4.eps}

The star schema models all kind of hierarchical attributes including parallel hierarchies, see Lehner et al. (1997). The schema is not normalized as becomes obvious, for example, from the dimension table Date. The attributes month and year are nested, which implies some redundancy. For small or medium-sized data volumes, such schemas have a sufficient performance because join operations are only necessary between the fact table and the related dimension tables.

In order to normalize tables by level attributes, the snowflake schema was introduced. Instead of modelling each dimension by one table, a table is created for each level of a hierarchical attribute. The schemas involved are related by identifiers, which play the role either of a primary or a foreign key. In Fig. 9.5 we display only the normalized dimension tables Month and Year and the fact table Sales. The identifiers are month-no in the fact table and dimension table Month and year-no in the dimension table Year.

Figure 9.5: Data cube Sales represented (fractionally) as a snowflake schema
\includegraphics[width=6.1cm]{text/2-9/II5.eps}

It can be shown that the normalization is lossless by applying an inner join to the tables of a snowflake schema.


9.3.3.2 Other Storage Modes (MOLAP, HOLAP)

The above conceptual model of a star or snowflake schema may lead to the wrong conclusion that data cubes are exclusively represented by a relational data model approach. There exist further storage modes, which are in use.

The main advantage of ROLAP lies in the reliability, security and ease of loading of the DW based on Relational DBMS (RDBMS) technology. As was mentioned above, this is achieved due to the mapping of facts into a normalized relation and dimension into a mostly non-normalized relation of a relational database. As the set of statistical functions in SQL is too restrictive, some of the functionality of OLAP must be added to the application server. An example is to find the top-ten among all products sold in a given period.

Multi-dimensional OLAP (MOLAP) makes use of specially tailored data structures like arrays and associated dimension lists or bitmaps. The operational data is extracted and stored as aggregates in those structures. The performance is acceptable for up to medium-sized data sets ( $ < 1\,$Gbyte). There exists a multi-dimensional query language called MDX (Multidimensional Expressions), see Microsoft (1998). ''XML for Analysis'' defines a standardized programming interface for an OLAP server, see http://www.xmla.org. An OLAP client encodes a query of a data cube and inserts it into a XML document, which specifies the method ''execute'' and the accompanying parameters according to the ''Simple Object Access Protocol'' (SOAP). This document is transmitted over the Internet based on the ''Hypertext Transfer Protocol'' (HTTP). After decoding the OLAP server executes the query, and sends the data back in a XML document to the client according to SOAP. For further details see Messerschmidt and Schweinsberg (2003). MOLAP has the disadvantage of ''miss hits'' if a data cube cannot be stored fully in-core and an access to a second storage device is necessary. Moreover, array compression or sparse array handling is needed because mostly the data cube or, equivalently, the arrays are sparse.

Hybrid OLAP (HOLAP) tries to combine the advantages of relational and multi-dimensional database technology. The relational model is used to store replicated and low-level aggregates, while the multi-dimensional model is responsible for high-level aggregates.

9.3.3.3 Data Cube Operators

Data cubes are used for analytical purposes and not for (simple) transaction processing. Therefore there does not exist a clear boundary between data extraction or retrieval and data analysis. Therefore there does not exist a minimal, closed and complete set of OLAP operators. The mostly built-in operators on data cubes in commercial DWs are the following, see Shoshani (1997) and Jarke et al. (2000).

Slicing $ \sigma_{\text{c}}$(T) is to select data from a cube T according to a fixed condition c. This operation is called in Statistics conditioning if only frequencies (counts) applied to multi-way tables are considered. For example, we can retrieve data from total_sales according to $ \sigma_{\text{{\tt product\_id, customer\_id,
month, year==97}}}$ (total_sales).

Dicing $ \pi_{\text{c}}$(T) is table projection on T by selecting a sub-cube T$ ^{\prime}$ of some lower dimension c than the original cube T has. This operation is equivalent to marginalization in Statistics, i.e. projection of a data space into a lower dimension. For instance $ \pi_{\text{{\tt date,
customer\_no}}}$(total_sales) retrieves a sub-cube of total sales cross-classified by date and customer.

Table aggregation (roll-up) and disaggregation (drill-down) are operations on data cubes if at least on dimension is hierarchical. For example $ \rho_{\text{{\tt year, customer\_no,
product\_no}}}$(total_sales) is a query for less fine-grained data, i.e. for years and summarizing over all months per year. This specific operation is called temporal aggregation. We observe that such an operation is not allowable if a type conflict happens with respect to the summary attribute. This is the case if the attribute `sales' is substituted by `no of employees', see Lenz and Shoshani (1997).

Drill-across $ \delta_{\text{level, node, attribute}}$(T) is a navigation on the same level through the various subtrees of a hierarchical attribute starting at a given node. For example, retrieving products from level 1 (product_group) with start at product_group1 (shoes and balls) of the taxonomy ''Product'' delivers data about tennis nets.

In order to compute ratios, products etc. of data cubes the join operator $ \gamma_{\otimes}$(T$ _{1}$,T$ _{2}$) is needed. For instance, as sales = turnover * price we have sales:= $ \gamma$ * (turnover, price).

We note that there exist further operators like pivot (rotation of a cube), see Jarke et al. (2000), or cube, which was introduced by Gray et al. (1996). It delivers the margins ALL for any subset of dimensions.

9.3.4 Summarizability and Normal Forms

The main objective of summarizability is to guarantee correct results of the cube operation roll-up and the utilization of statistical (aggregation) functions like min, max, avg, sum and count under all circumstances, see Lenz and Shoshani (1997). The corresponding integrity constraints are non-overlapping levels of dimensions, completeness and type compatibility. The first condition assures that each node of a taxonomy has at most one preceding node except for the root node. The second one ascertains that any node on a low level granularity corresponds to at least one node of a higher granularity. Type compatibility guarantees that the application of any statistical function on a summary attribute is sound. In a preceding section we mentioned the unfeasibility of aggregation of stocks over time. Another example is the misuse of the sum operator applied to code numbers of professions.

As Lehner et al. (1998) pointed out, the integrity constraint of completeness may turn out to be too restrictive. This happens if there exist structural missing values (null values) in taxonomies. For example, the German state Bavaria is divided into regions called ''Kreise''. Berlin is a city as well as an autonomous German state. It is not divided into regions, but into suburbs called ''Bezirke''. In such cases a context sensitive summarizability constraint is appropriate. The authors consequently proposed three multi-dimensional normal forms for fact tables. Lechtenbörger and Vossen (2001) improved the design of these normal forms.

9.3.5 Comparison of Terminologies

To sum up this chapter, Tables 9.3 and 9.4 compare the terminology of statistical databases and OLAP, see Shoshani (1997).


Table 9.3: Comparison of concepts
Statistical databases OLAP
Categorical attribute Dimension
Structural attribute Dimension hierarchy
Category Dimension value
Summary attribute Fact
Statistical object, multidimensional table Data cube
Cross product Multidimensionality


Table 9.4: Comparison of operators
Statistical databases OLAP
Table projection Dice
Table selection Slice
Table aggregation Roll-up
Table disaggregation Drill-down
Table join term missing
Term missing Drill across
Viewing pivoting


next up previous contents index
Next: 9.4 Access Methods Up: 9. Statistical Databases Previous: 9.2 Fundamentals of Data