next up previous contents index
Next: 9.5 Extraction, Transformation and Up: 9. Statistical Databases Previous: 9.3 Architectures, Concepts and

Subsections


9.4 Access Methods

9.4.1 Views (Virtual Tables)

Statistical databases are often accessed by different users with different intentions and different access rights. As already indicated in Sect. 9.2.2, these different requirements can be accounted for by using views. These views are derived virtual tables, which are computed from the (actually stored) base tables, see Elmasri and Navathe (1999). There are two main purposes for the use of views.

  1. It makes the use of the DBS or DW more convenient for the user by providing only customized parts of the whole data cube.

  2. It enforces security constraints by restricting operations on the base tables and by granting users access to their specific views only.
The following SQL statement creates a view for the manager of the product ''Tennis Nets'' from our example in Table 9.1. It only permits to look up the revenues for ''Tennis Nets'' while for all other products, viewing the sales and modifying the corresponding base tables is not possible.
CREATE VIEW tennis_nets_manager AS
  SELECT date.month, date.year, customer_id,
  sum(sales)
  FROM total_sales WHERE product_name = "Tennis
  Nets";
Views can never contain information that is not present in the base tables as the DBS translates all view queries into equivalent queries that refer only to base tables.

Base tables of a DW may contain millions of tuples. Scanning these tables can be time-consuming and may slow down the interaction between the decision support system and the user significantly. One strategy to speed up the access to aggregated data is to pre-compute a range of probable queries and to store the results in materialized views, see Gupta et al. (1997). The access to these materialized views is then much faster than computing data on demand. Yet there are drawbacks to this strategy. The pre-computed data need space, the prediction of the users' queries is difficult, and each change in the base table requires an update of the materialized view also. This is known as the view maintenance problem, see Huyn (1997).

9.4.2 Tree-based Indexing

The tables of a DW can physically be accessed either by a sequential scan or by random access. With today's hard disks, a sequential scan is $ 10$ to $ 20$ times faster than random access, see Jürgens (2002). That means if more than approximately $ 5\,{\%}$ to $ 10\,{\%}$ of the data has to be accessed in a table, it is faster to scan the entire table than addressing specific tuples via random access. In order to avoid full table scans, the number of tuples involved in the result computation has to be reduced. This can be achieved via index structures, which permit a fast look-up of specific tuples.

The best-known index structure for one-dimensional data (i.e. data with just one key such as product_id) is the B-tree, see Bayer and McCreight (1972), Comer (1979). Pointers to the data items are stored in the leaf nodes of a balanced tree. The B-tree is a very general and flexible index structure, yet in some specific cases it may be outperformed by different kinds of hashing, see Gaede and Günther (1998).

The universal B-tree (UB-tree, see Bayer, 1997) is an extension of the B-tree for indexing multidimensional data such as total_sales (date.month, date.year, customer_id, product_name, sum(sales)). The approach partitions the multidimensional data space into squares each of which is captured by a space-filling Z-curve, see Fig. 9.6. For each record, the Z-address of the square, which contains the key values is computed. These Z-addresses are one-dimensional and serve as the new primary keys for the records, which can then be indexed with a standard B-tree.

Figure 9.6: The UB-tree: partition and capture of multidimensional space with the Z-curve
\includegraphics[width=5.1cm]{text/2-9/II6.eps}

Another approach for indexing multidimensional data is the R-tree, see Guttman (1984). It uses rectangles to represent multidimensional intervals. The leaf rectangles correspond to entries in the database. The parent nodes contain all child nodes and the minimal bounding rectangle. The root rectangle covers the entire query space. An example of how to store sales indexes in an R-tree when product_name and customer_id build the concatenated primary key is shown in Fig. 9.7. The minimal bounding rectangle of the dashed-line rectangles A, B and C constitutes the entire search space.

Figure 9.7: An exemplary R-tree
\includegraphics[width=5.8cm]{text/2-9/II7.eps}

Refinements are the R$ ^{+}$-tree of Sellis et al. (1985), the R$ ^{\ast}$-tree of Beckmann et al. (1990) and a slightly improved version called $ R_{{\text{a}}}^{\ast}$-tree of Jürgens (2002).


9.4.3 Bitmap Index Structures

An important alternative to tree index structures is bitmap indexing. For each value of an attribute, a bitmap vector indicates whether or not it is assumed in the records of the table, see Chan and Ioannidis (1998), O'Neil and Quass (1997), Wu and Buchmann (1998). Table one shows a bitmap index for the attribute product_name corresponding to the example presented in Table 9.5.


Table 9.5: Bitmap index for the attribute product_name
transaction_id Tennis balls Tennis nets Tennis shoes
$ 015$ 0 0 $ 1$
$ 018$ $ 1$ 0 0
$ 004$ 0 $ 1$ 0
$ 009$ 0 0 $ 1$

The bitmap vector for the attribute value ''Tennis Balls'' is $ (0,
1, 0, 0)^{\text{T}}$. Such a set of bitmap vectors is created for all dimensions. In our total_sales example, bitmap indexes have to be created further for (date.month, date.year) and customer_id.

The size of the bitmap index depends on the number of tuples and on the cardinality of the attribute domain. As the required operations on bitmaps are simple they are very fast. Thus loading blocks from disc and performing the basic Boolean operations is efficient, especially if the number of dimension is high, see Jürgens (2002). As bitmaps are often sparse, they are well suited for compression techniques. This is the reason why many commercial DBSs are implemented using bitmaps. However, standard bitmaps indexes become space consuming for high attribute's domain cardinality, and they are not very efficient for (low dimensional) range queries, which are typical for DW systems.

Several approaches have been proposed to overcome these drawbacks like the multi-component equality encoded bitmap index, see Chan and Ioannidis (1998). The basic idea is to compress bitmap indexes by encoding all values into a smaller number system by applying modular multiplication. This significantly reduces the space requirements for attributes of high cardinality.

To summarize, bitmaps are more suited for high-dimensional queries with low attribute cardinality. Tree index structures are better for low-dimensional range queries with attributes of high cardinality.


next up previous contents index
Next: 9.5 Extraction, Transformation and Up: 9. Statistical Databases Previous: 9.3 Architectures, Concepts and