Informatica Tutorials

Big Data Analytics

Typical Data Warehouse Integrity Constraints

This section assumes that you are familiar with the typical use of constraints. That is,constraints that are both enabled and validated. For data warehousing, many users
have discovered that such constraints may be prohibitively costly to build and
maintain. The topics discussed are:

■ UNIQUE Constraints in a Data Warehouse
■ FOREIGN KEY Constraints in a Data Warehouse
■ RELY Constraints
■ NOT NULL Constraints
■ Integrity Constraints and Parallelism
■ Integrity Constraints and Partitioning
■ View Constraints

UNIQUE Constraints in a Data Warehouse
A UNIQUE constraint is typically enforced using a UNIQUE index. However, in a data
warehouse whose tables can be extremely large, creating a unique index can be costly
both in processing time and in disk space.

Suppose that a data warehouse contains a table sales, which includes a column
sales_id. sales_id uniquely identifies a single sales transaction, and the data
warehouse administrator must ensure that this column is unique within the data
warehouse.
One way to create the constraint is as follows:
ALTER TABLE sales ADD CONSTRAINT sales_uk
UNIQUE (prod_id, cust_id, promo_id, channel_id, time_id);
By default, this constraint is both enabled and validated. Oracle implicitly creates a
unique index on sales_id to support this constraint. However, this index can be
problematic in a data warehouse for three reasons:
■ The unique index can be very large, because the sales table can easily have
millions or even billions of rows.
■ The unique index is rarely used for query execution. Most data warehousing
queries do not have predicates on unique keys, so creating this index will
probably not improve performance.
■ If sales is partitioned along a column other than sales_id, the unique index
must be global. This can detrimentally affect all maintenance operations on the
sales table.
A unique index is required for unique constraints to ensure that each individual row
modified in the sales table satisfies the UNIQUE constraint.
For data warehousing tables, an alternative mechanism for unique constraints is
illustrated in the following statement:
ALTER TABLE sales ADD CONSTRAINT sales_uk
UNIQUE (prod_id, cust_id, promo_id, channel_id, time_id) DISABLE VALIDATE;
This statement creates a unique constraint, but, because the constraint is disabled, a
unique index is not required. This approach can be advantageous for many data
warehousing environments because the constraint now ensures uniqueness without
the cost of a unique index.
However, there are trade-offs for the data warehouse administrator to consider with
DISABLE VALIDATE constraints. Because this constraint is disabled, no DML
statements that modify the unique column are permitted against the sales table. You
can use one of two strategies for modifying this table in the presence of a constraint:

■ Use DDL to add data to this table (such as exchanging partitions).
■ Before modifying this table, drop the constraint. Then, make all necessary data
modifications. Finally, re-create the disabled constraint. Re-creating the constraint
is more efficient than re-creating an enabled constraint. However, this approach
does not guarantee that data added to the sales table while the constraint has
been dropped is unique.

FOREIGN KEY Constraints in a Data Warehouse

In a star schema data warehouse, FOREIGN KEY constraints validate the relationship
between the fact table and the dimension tables. A sample constraint might be:
ALTER TABLE sales ADD CONSTRAINT sales_time_fk
FOREIGN KEY (time_id) REFERENCES times (time_id)
ENABLE VALIDATE;

However, in some situations, you may choose to use a different state for the FOREIGN
KEY constraints, in particular, the ENABLE NOVALIDATE state. A data warehouse
administrator might use an ENABLE NOVALIDATE constraint when either:
■ The tables contain data that currently disobeys the constraint, but the data
warehouse administrator wishes to create a constraint for future enforcement.
■ An enforced constraint is required immediately.
Suppose that the data warehouse loaded new data into the fact tables every day, but
refreshed the dimension tables only on the weekend. During the week, the dimension
tables and fact tables may in fact disobey the FOREIGN KEY constraints. Nevertheless,
the data warehouse administrator might wish to maintain the enforcement of this
constraint to prevent any changes that might affect the FOREIGN KEY constraint
outside of the ETL process. Thus, you can create the FOREIGN KEY constraints every
night, after performing the ETL process, as shown in the following:

ALTER TABLE sales ADD CONSTRAINT sales_time_fk
FOREIGN KEY (time_id) REFERENCES times (time_id)
ENABLE NOVALIDATE;

ENABLE NOVALIDATE can quickly create an enforced constraint, even when the
constraint is believed to be true. Suppose that the ETL process verifies that a FOREIGN KEY constraint is true. Rather than have the database re-verify this FOREIGN KEY constraint, which would require time and database resources, the data warehouse
administrator could instead create a FOREIGN KEY constraint using ENABLE
NOVALIDATE.

RELY Constraints

The ETL process commonly verifies that certain constraints are true. For example, it
can validate all of the foreign keys in the data coming into the fact table. This means that you can trust it to provide clean data, instead of implementing constraints in the data warehouse. You create a RELY constraint as follows:

ALTER TABLE sales ADD CONSTRAINT sales_time_fk
FOREIGN KEY (time_id) REFERENCES times (time_id)
RELY DISABLE NOVALIDATE;

This statement assumes that the primary key is in the RELY state. RELY constraints,
even though they are not used for data validation, can:

■ Enable more sophisticated query rewrites for materialized views.
■ Enable other data warehousing tools to retrieve information regarding constraints
directly from the Oracle data dictionary.
Creating a RELY constraint is inexpensive and does not impose any overhead during
DML or load. Because the constraint is not being validated, no data processing is
necessary to create it.

NOT NULL Constraints
When using query rewrite, you should consider whether NOT NULL constraints are
required. The primary situation where you will need to use them is for join back query rewrite.


Integrity Constraints and Parallelism

All constraints can be validated in parallel. When validating constraints on very large tables, parallelism is often necessary to meet performance goals. The degree of
parallelism for a given constraint operation is determined by the default degree of
parallelism of the underlying table.

Integrity Constraints and Partitioning

You can create and maintain constraints before you partition the data. Later chapters
discuss the significance of partitioning for data warehousing. Partitioning can improve constraint management just as it does to management of many other operations. For example, "Maintaining the Data Warehouse" provides a scenario creating
UNIQUE and FOREIGN KEY constraints on a separate staging table, and these
constraints are maintained during the EXCHANGE PARTITION statement.

View Constraints

You can create constraints on views. The only type of constraint supported on a view
is a RELY constraint.
This type of constraint is useful when queries typically access views instead of base
tables, and the database administrator thus needs to define the data relationships
between views rather than tables. View constraints are particularly useful in OLAP
environments, where they may enable more sophisticated rewrites for materialized
views.

Related Posts Plugin for WordPress, Blogger...

Please Share

Twitter Delicious Facebook Digg Stumbleupon Favorites More

 
Follow TutorialBlogs
Share on Facebook
Tweet this Blog
Add Blog to Technorati
Home