Informatica Tutorials

Big Data Analytics

Data Warehousing Extraction Examples

You can extract data in two ways:

■ Extraction Using Data Files
■ Extraction Through Distributed Operations

Extraction Using Data Files

Most database systems provide mechanisms for exporting or unloading data from the
internal database format into flat files. Extracts from mainframe systems often use
COBOL programs, but many databases, as well as third-party software vendors,
provide export or unload utilities.
Data extraction does not necessarily mean that entire database structures are unloaded in flat files. In many cases, it may be appropriate to unload entire database tables or objects.

In other cases, it may be more appropriate to unload only a subset of a given table such as the changes on the source system since the last extraction or the results of
joining multiple tables together. Different extraction techniques vary in their
capabilities to support these two scenarios.

When the source system is an Oracle database, several alternatives are available for
extracting data into files:

■ Extracting into Flat Files Using SQL*Plus
■ Extracting into Flat Files Using OCI or Pro*C Programs
■ Exporting into Export Files Using the Export Utility
■ Extracting into Export Files Using External Tables


Extracting into Flat Files Using SQL*Plus

The most basic technique for extracting data is to execute a SQL query in SQL*Plus
and direct the output of the query to a file. For example, to extract a flat file,
country_city.log, with the pipe sign as delimiter between column values,
following SQL script could be run:

SET echo off SET pagesize 0 SPOOL country_city.log
SELECT distinct t1.country_name ||'|'|| t2.cust_city
FROM countries t1, customers t2 WHERE t1.country_id = t2.country_id
AND t1.country_name= 'United States of America';
SPOOL off

The exact format of the output file can be specified using SQL*Plus system variables.
This extraction technique offers the advantage of storing the result in a customized
format. Note that, using the external table data pump unload facility, you can also
extract the result of an arbitrary SQL operation. The example previously extracts the
results of a join.

This extraction technique can be parallelized by initiating multiple, concurrent
SQL*Plus sessions, each session running a separate query representing a different
portion of the data to be extracted. For example, suppose that you wish to extract data from an orders table, and that the orders table has been range partitioned by
month, with partitions orders_jan1998, orders_feb1998, and so on. To extract a
single year of data from the orders table, you could initiate 12 concurrent SQL*Plus
sessions, each extracting a single partition. The SQL script for one such session could be:

SPOOL order_jan.dat
SELECT * FROM orders PARTITION (orders_jan1998);
SPOOL OFF

These 12 SQL*Plus processes would concurrently spool data to 12 separate files. You
can then concatenate them if necessary (using operating system utilities) following the extraction. If you are planning to use SQL*Loader for loading into the target, these 12 files can be used as is for a parallel load with 12 SQL*Loader sessions.

Even if the orders table is not partitioned, it is still possible to parallelize the
extraction either based on logical or physical criteria. The logical method is based on logical ranges of column values, for example:

SELECT ... WHERE order_date
BETWEEN TO_DATE('01-JAN-99') AND TO_DATE('31-JAN-99');

The physical method is based on a range of values. By viewing the data dictionary, it is possible to identify the Oracle Database data blocks that make up the orders table.

Using this information, you could then derive a set of rowid-range queries for
extracting data from the orders table:

SELECT * FROM orders WHERE rowid BETWEEN value1 and value2;

Parallelizing the extraction of complex SQL queries is sometimes possible, although
the process of breaking a single complex query into multiple components can be
challenging. In particular, the coordination of independent processes to guarantee a
globally consistent view can be difficult. Unlike the SQL*Plus approach, using the
external table data pump unload functionality provides transparent parallel
capabilities.

Note that all parallel techniques can use considerably more CPU and I/O resources on
the source system, and the impact on the source system should be evaluated before
parallelizing any extraction technique.


Extracting into Flat Files Using OCI or Pro*C Programs
OCI programs (or other programs using Oracle call interfaces, such as Pro*C
programs), can also be used to extract data. These techniques typically provide
improved performance over the SQL*Plus approach, although they also require
additional programming. Like the SQL*Plus approach, an OCI program can extract the
results of any SQL query. Furthermore, the parallelization techniques described for the SQL*Plus approach can be readily applied to OCI programs as well.

When using OCI or SQL*Plus for extraction, you need additional information besides
the data itself. At minimum, you need information about the extracted columns. It is
also helpful to know the extraction format, which might be the separator between
distinct columns.


Exporting into Export Files Using the Export Utility

The Export utility allows tables (including data) to be exported into Oracle Database
export files. Unlike the SQL*Plus and OCI approaches, which describe the extraction
of the results of a SQL statement, Export provides a mechanism for extracting database objects. Thus, Export differs from the previous approaches in several important ways:

■ The export files contain metadata as well as data. An export file contains not only
the raw data of a table, but also information on how to re-create the table,
potentially including any indexes, constraints, grants, and other attributes
associated with that table.
■ A single export file may contain a subset of a single object, many database objects,
or even an entire schema.
■ Export cannot be directly used to export the results of a complex SQL query.
Export can be used only to extract subsets of distinct database objects.
■ The output of the Export utility must be processed using the Import utility.
Oracle provides the original Export and Import utilities for backward compatibility
and the data pump export/import infrastructure for high-performant, scalable and
parallel extraction.


Extracting into Export Files Using External Tables
In addition to the Export Utility, you can use external tables to extract the results from any SELECT operation. The data is stored in the platform independent, Oracle-internal data pump format and can be processed as regular external table on the target system.

The following example extracts the result of a join operation in parallel into the four specified files. The only allowed external table type for extracting data is the
Oracle-internal format ORACLE_DATAPUMP.

CREATE DIRECTORY def_dir AS '/net/dlsun48/private/hbaer/WORK/FEATURES/et';
DROP TABLE extract_cust;
CREATE TABLE extract_cust
ORGANIZATION EXTERNAL
(TYPE ORACLE_DATAPUMP DEFAULT DIRECTORY def_dir ACCESS PARAMETERS
(NOBADFILE NOLOGFILE)
LOCATION ('extract_cust1.exp', 'extract_cust2.exp', 'extract_cust3.exp',
'extract_cust4.exp'))
PARALLEL 4 REJECT LIMIT UNLIMITED AS
SELECT c.*, co.country_name, co.country_subregion, co.country_region
FROM customers c, countries co where co.country_id=c.country_id;


The total number of extraction files specified limits the maximum degree of
parallelism for the write operation. Note that the parallelizing of the extraction does not automatically parallelize the SELECT portion of the statement.

Unlike using any kind of export/import, the metadata for the external table is not part of the created files when using the external table data pump unload. To extract the appropriate metadata for the external table, use the DBMS_METADATA package, as
illustrated in the following statement:

SET LONG 2000
SELECT DBMS_METADATA.GET_DDL('TABLE','EXTRACT_CUST') FROM DUAL;


Extraction Through Distributed Operations
Using distributed-query technology, one Oracle database can directly query tables
located in various different source systems, such as another Oracle database or a
legacy system connected with the Oracle gateway technology. Specifically, a data
warehouse or staging database can directly access tables and data located in a
connected source system. Gateways are another form of distributed-query technology.
Gateways allow an Oracle database (such as a data warehouse) to access database
tables stored in remote, non-Oracle databases. This is the simplest method for moving
data between two Oracle databases because it combines the extraction and
transformation into a single step, and requires minimal programming. However, this
is not always feasible.

Suppose that you wanted to extract a list of employee names with department names
from a source database and store this data into the data warehouse. Using an Oracle
Net connection and distributed-query technology, this can be achieved using a single
SQL statement:

CREATE TABLE country_city AS SELECT distinct t1.country_name, t2.cust_city
FROM countries@source_db t1, customers@source_db t2
WHERE t1.country_id = t2.country_id
AND t1.country_name='United States of America';

This statement creates a local table in a data mart, country_city, and populates it
with data from the countries and customers tables on the source system.
This technique is ideal for moving small volumes of data. However, the data is
transported from the source system to the data warehouse through a single Oracle Net
connection. Thus, the scalability of this technique is limited. For larger data volumes, file-based data extraction and transportation techniques are often more scalable and thus more appropriate.

Related Posts Plugin for WordPress, Blogger...

Please Share

Twitter Delicious Facebook Digg Stumbleupon Favorites More

 
Follow TutorialBlogs
Share on Facebook
Tweet this Blog
Add Blog to Technorati
Home