Informatica Tutorials

Big Data Analytics

Optimizing DML Operations During Refresh

You can optimize DML performance through the following techniques:

■ Implementing an Efficient MERGE Operation
■ Maintaining Referential Integrity
■ Purging Data


Implementing an Efficient MERGE Operation
Commonly, the data that is extracted from a source system is not simply a list of new
records that needs to be inserted into the data warehouse. Instead, this new data set is a combination of new records as well as modified records. For example, suppose that most of data extracted from the OLTP systems will be new sales transactions. These records will be inserted into the warehouse's sales table, but some records may
reflect modifications of previous transactions, such as returned merchandise or
transactions that were incomplete or incorrect when initially loaded into the data
warehouse. These records require updates to the sales table.

As a typical scenario, suppose that there is a table called new_sales that contains
both inserts and updates that will be applied to the sales table. When designing the
entire data warehouse load process, it was determined that the new_sales table
would contain records with the following semantics:

■ If a given sales_transaction_id of a record in new_sales already exists in
sales, then update the sales table by adding the sales_dollar_amount and
sales_quantity_sold values from the new_sales table to the existing row in
the sales table.
■ Otherwise, insert the entire new record from the new_sales table into the sales
table.

This UPDATE-ELSE-INSERT operation is often called a merge. A merge can be
executed using one SQL statement.


Example MERGE Operation

MERGE INTO sales s USING new_sales n
ON (s.sales_transaction_id = n.sales_transaction_id)
WHEN MATCHED THEN
UPDATE SET s.sales_quantity_sold = s.sales_quantity_sold + n.sales_quantity_sold,
s.sales_dollar_amount = s.sales_dollar_amount + n.sales_dollar_amount
WHEN NOT MATCHED THEN INSERT (sales_transaction_id, sales_quantity_sold,
sales_dollar_amount)

In addition to using the MERGE statement for unconditional UPDATE ELSE INSERT
functionality into a target table, you can also use it to:
■ Perform an UPDATE only or INSERT only statement.
■ Apply additional WHERE conditions for the UPDATE or INSERT portion of the
MERGE statement.
■ The UPDATE operation can even delete rows if a specific condition yields true.


Example Omitting the INSERT Clause
In some data warehouse applications, it is not allowed to add new rows to historical
information, but only to update them. It may also happen that you don't want to
update but only insert new information. The following example demonstrates
INSERT-only with UPDATE-only functionality:

MERGE USING Product_Changes S -- Source/Delta table
INTO Products D1 -- Destination table 1
ON (D1.PROD_ID = S.PROD_ID) -- Search/Join condition
WHEN MATCHED THEN UPDATE -- update if join
SET D1.PROD_STATUS = S.PROD_NEW_STATUS


Example Omitting the UPDATE Clause
The following statement illustrates an example of omitting an UPDATE:

MERGE USING New_Product S -- Source/Delta table
INTO Products D2 -- Destination table 2
ON (D2.PROD_ID = S.PROD_ID) -- Search/Join condition
WHEN NOT MATCHED THEN -- insert if no join
INSERT (PROD_ID, PROD_STATUS) VALUES (S.PROD_ID, S.PROD_NEW_STATUS)

When the INSERT clause is omitted, Oracle performs a regular join of the source and
the target tables. When the UPDATE clause is omitted, Oracle performs an antijoin of
the source and the target tables. This makes the join between the source and target
table more efficient.

Example Skipping the UPDATE Clause
In some situations, you may want to skip the UPDATE operation when merging a given
row into the table. In this case, you can use an optional WHERE clause in the UPDATE
clause of the MERGE. As a result, the UPDATE operation only executes when a given
condition is true. The following statement illustrates an example of skipping the
UPDATE operation:

MERGE
USING Product_Changes S -- Source/Delta table
INTO Products P -- Destination table 1
ON (P.PROD_ID = S.PROD_ID) -- Search/Join condition
WHEN MATCHED THEN
UPDATE -- update if join
SET P.PROD_LIST_PRICE = S.PROD_NEW_PRICE
WHERE P.PROD_STATUS <> "OBSOLETE" -- Conditional UPDATE

This shows how the UPDATE operation would be skipped if the condition P.PROD_
STATUS <> "OBSOLETE" is not true. The condition predicate can refer to both the
target and the source table.


Example Conditional Inserts with MERGE Statements
You may want to skip the INSERT operation when merging a given row into the table.
So an optional WHERE clause is added to the INSERT clause of the MERGE. As a result,
the INSERT operation only executes when a given condition is true. The following
statement offers an example:

MERGE USING Product_Changes S -- Source/Delta table
INTO Products P -- Destination table 1
ON (P.PROD_ID = S.PROD_ID) -- Search/Join condition
WHEN MATCHED THEN UPDATE -- update if join
SET P.PROD_LIST_PRICE = S.PROD_NEW_PRICE
WHERE P.PROD_STATUS <> "OBSOLETE" -- Conditional
WHEN NOT MATCHED THEN
INSERT (PROD_ID, PROD_STATUS, PROD_LIST_PRICE) -- insert if not join
VALUES (S.PROD_ID, S.PROD_NEW_STATUS, S.PROD_NEW_PRICE)
WHERE S.PROD_STATUS <> "OBSOLETE"; -- Conditional INSERT


This example shows that the INSERT operation would be skipped if the condition
S.PROD_STATUS <> "OBSOLETE" is not true, and INSERT will only occur if the
condition is true. The condition predicate can refer to the source table only. The
condition predicate can only refer to the source table.


Example Using the DELETE Clause with MERGE Statements
You may want to cleanse tables while populating or updating them. To do this, you
may want to consider using the DELETE clause in a MERGE statement, as in the
following example:

MERGE USING Product_Changes S
INTO Products D ON (D.PROD_ID = S.PROD_ID)
WHEN MATCHED THEN
UPDATE SET D.PROD_LIST_PRICE =S.PROD_NEW_PRICE, D.PROD_STATUS = S.PROD_NEWSTATUS
DELETE WHERE (D.PROD_STATUS = "OBSOLETE")
WHEN NOT MATCHED THEN
INSERT (PROD_ID, PROD_LIST_PRICE, PROD_STATUS)
VALUES (S.PROD_ID, S.PROD_NEW_PRICE, S.PROD_NEW_STATUS);

Thus when a row is updated in products, Oracle checks the delete condition
D.PROD_STATUS = "OBSOLETE", and deletes the row if the condition yields true.
The DELETE operation is not as same as that of a complete DELETE statement. Only
the rows from the destination of the MERGE can be deleted. The only rows that will be
affected by the DELETE are the ones that are updated by this MERGE statement. Thus,
although a given row of the destination table meets the delete condition, if it does not join under the ON clause condition, it will not be deleted.


Example Unconditional Inserts with MERGE Statements

You may want to insert all of the source rows into a table. In this case, the join between the source and target table can be avoided. By identifying special constant join conditions that always result to FALSE, for example, 1=0, such MERGE statements will be optimized and the join condition will be suppressed.

MERGE USING New_Product S -- Source/Delta table
INTO Products P -- Destination table 1
ON (1 = 0) -- Search/Join condition
WHEN NOT MATCHED THEN -- insert if no join
INSERT (PROD_ID, PROD_STATUS) VALUES (S.PROD_ID, S.PROD_NEW_STATUS)


Maintaining Referential Integrity
In some data warehousing environments, you might want to insert new data into
tables in order to guarantee referential integrity. For example, a data warehouse may
derive sales from an operational system that retrieves data directly from cash
registers. sales is refreshed nightly. However, the data for the product dimension
table may be derived from a separate operational system. The product dimension
table may only be refreshed once for each week, because the product table changes
relatively slowly.

If a new product was introduced on Monday, then it is possible for
that product's product_id to appear in the sales data of the data warehouse before
that product_id has been inserted into the data warehouses product table.
Although the sales transactions of the new product may be valid, this sales data will
not satisfy the referential integrity constraint between the product dimension table
and the sales fact table. Rather than disallow the new sales transactions, you might
choose to insert the sales transactions into the sales table. However, you might also
wish to maintain the referential integrity relationship between the sales and
product tables. This can be accomplished by inserting new rows into the product
table as placeholders for the unknown products.

As in previous examples, we assume that the new data for the sales table will be
staged in a separate table, new_sales. Using a single INSERT statement (which can
be parallelized), the product table can be altered to reflect the new products:

INSERT INTO product
(SELECT sales_product_id, 'Unknown Product Name', NULL, NULL ...
FROM new_sales WHERE sales_product_id NOT IN
(SELECT product_id FROM product));


Purging Data
Occasionally, it is necessary to remove large amounts of data from a data warehouse.
A very common scenario is the rolling window discussed previously, in which older
data is rolled out of the data warehouse to make room for new data.

However, sometimes other data might need to be removed from a data warehouse.
Suppose that a retail company has previously sold products from XYZ Software, and
that XYZ Software has subsequently gone out of business. The business users of the
warehouse may decide that they are no longer interested in seeing any data related to
XYZ Software, so this data should be deleted.

One approach to removing a large volume of data is to use parallel delete as shown in
the following statement:

DELETE FROM sales WHERE sales_product_id IN (SELECT product_id
FROM product WHERE product_category = 'XYZ Software');

This SQL statement will spawn one parallel process for each partition. This approach
will be much more efficient than a serial DELETE statement, and none of the data in
the sales table will need to be moved. However, this approach also has some
disadvantages. When removing a large percentage of rows, the DELETE statement will
leave many empty row-slots in the existing partitions. If new data is being loaded
using a rolling window technique (or is being loaded using direct-path INSERT or
load), then this storage space will not be reclaimed. Moreover, even though the
DELETE statement is parallelized, there might be more efficient methods. An
alternative method is to re-create the entire sales table, keeping the data for all
product categories except XYZ Software.

CREATE TABLE sales2 AS SELECT * FROM sales, product
WHERE sales.sales_product_id = product.product_id
AND product_category <> 'XYZ Software'
NOLOGGING PARALLEL (DEGREE 8)
#PARTITION ... ; #create indexes, constraints, and so on
DROP TABLE SALES;
RENAME SALES2 TO SALES;

This approach may be more efficient than a parallel delete. However, it is also costly in terms of the amount of disk space, because the sales table must effectively be instantiated twice.

An alternative method to utilize less space is to re-create the sales table one partition at a time:

CREATE TABLE sales_temp AS SELECT * FROM sales WHERE 1=0;
INSERT INTO sales_temp PARTITION (sales_99jan)
SELECT * FROM sales, product
WHERE sales.sales_product_id = product.product_id
AND product_category <> 'XYZ Software';

ALTER TABLE sales EXCHANGE PARTITION sales_99jan WITH TABLE sales_temp;

Continue this process for each partition in the sales table.

Related Posts Plugin for WordPress, Blogger...

Please Share

Twitter Delicious Facebook Digg Stumbleupon Favorites More

 
Follow TutorialBlogs
Share on Facebook
Tweet this Blog
Add Blog to Technorati
Home