I/O performance should always be a key consideration for data warehouse designers
and administrators. The typical workload in a data warehouse is especially I/O
intensive, with operations such as large data loads and index builds, creation of
materialized views, and queries over large volumes of data. The underlying I/O
system for a data warehouse should be designed to meet these heavy requirements.
In fact, one of the leading causes of performance issues in a data warehouse is poor
I/O configuration. Database administrators who have previously managed other
systems will likely need to pay more careful attention to the I/O configuration for a
data warehouse than they may have previously done for other environments.
This chapter provides the following five high-level guidelines for data-warehouse I/O
configurations:
■ Configure I/O for Bandwidth not Capacity
■ Stripe Far and Wide
■ Use Redundancy
■ Test the I/O System Before Building the Database
■ Plan for Growth
The I/O configuration used by a data warehouse will depend on the characteristics of
the specific storage and server capabilities, so the material in this chapter is only
intended to provide guidelines for designing and tuning an I/O system.
Configure I/O for Bandwidth not Capacity
Storage configurations for a data warehouse should be chosen based on the I/O
bandwidth that they can provide, and not necessarily on their overall storage capacity.Buying storage based solely on capacity has the potential for making a mistake,especially for systems less than 500GB is total size. The capacity of individual disk drives is growing faster than the I/O throughput rates provided by those disks, leading to a situation in which a small number of disks can store a large volume of data, but cannot provide the same I/O throughput as a larger number of small disks.
As an example, consider a 200GB data mart. Using 72GB drives, this data mart could
be built with as few as six drives in a fully-mirrored environment. However, six drives might not provide enough I/O bandwidth to handle a medium number of concurrent
users on a 4-CPU server. Thus, even though six drives provide sufficient storage, a
larger number of drives may be required to provide acceptable performance for this
system.
While it may not be practical to estimate the I/O bandwidth that will be required by a data warehouse before a system is built, it is generally practical with the guidance of the hardware manufacturer to estimate how much I/O bandwidth a given server can potentially utilize, and ensure that the selected I/O configuration will be able to successfully feed the server. There are many variables in sizing the I/O systems, but one basic rule of thumb is that your data warehouse system should have multiple disks for each CPU (at least two disks for each CPU at a bare minimum) in order to achieve optimal performance.
Stripe Far and Wide
The guiding principle in configuring an I/O system for a data warehouse is to
maximize I/O bandwidth by having multiple disks and channels access each database
object. You can do this by striping the datafiles of the Oracle Database. A striped file is a file distributed across multiple disks. This striping can be managed by software (such as a logical volume manager), or within the storage hardware. The goal is to ensure that each tablespace is striped across a large number of disks (ideally, all of the disks) so that any database object can be accessed with the highest possible I/O bandwidth.
Use Redundancy
Because data warehouses are often the largest database systems in a company, they
have the most disks and thus are also the most susceptible to the failure of a single
disk. Therefore, disk redundancy is a requirement for data warehouses to protect
against a hardware failure. Like disk-striping, redundancy can be achieved in many
ways using software or hardware.
A key consideration is that occasionally a balance must be made between redundancy
and performance. For example, a storage system in a RAID-5 configuration may be
less expensive than a RAID-0+1 configuration, but it may not perform as well, either.
Redundancy is necessary for any data warehouse, but the approach to redundancy
may vary depending upon the performance and cost constraints of each data
warehouse.
Test the I/O System Before Building the Database
The most important time to examine and tune the I/O system is before the database is
even created. Once the database files are created, it is more difficult to reconfigure the files. Some logical volume managers may support dynamic reconfiguration of files, while other storage configurations may require that files be entirely rebuilt in order to reconfigure their I/O layout. In both cases, considerable system resources must be devoted to this reconfiguration.When creating a data warehouse on a new system, the I/O bandwidth should be tested before creating all of the database datafiles to validate that the expected I/O levels are being achieved. On most operating systems, this can be done with simple scripts to measure the performance of reading and writing large test files.
Plan for Growth
A data warehouse designer should plan for future growth of a data warehouse. There
are many approaches to handling the growth in a system, and the key consideration is
to be able to grow the I/O system without compromising on the I/O bandwidth. You
cannot, for example, add four disks to an existing system of 20 disks, and grow the
database by adding a new tablespace striped across only the four new disks. A better
solution would be to add new tablespaces striped across all 24 disks, and over time
also convert the existing tablespaces striped across 20 disks to be striped across all 24 disks.
Storage Management
Two features to consider for managing disks are Oracle Managed Files and Automatic
Storage Management. Without these features, a database administrator must manage
the database files, which, in a data warehouse, can be hundreds or even thousands of
files. Oracle Managed Files simplifies the administration of a database by providing
functionality to automatically create and manage files, so the database administrator
no longer needs to manage each database file. Automatic Storage Management
provides additional functionality for managing not only files but also the disks.
With Automatic Storage Management, the database administrator would administer a small number of disk groups. Automatic Storage Management handles the tasks of striping and providing disk redundancy, including rebalancing the database files when new disks are added to the system.
Big Data Analytics
Hardware and I/O Considerations in Data Warehouses
2:59 AM
divjeev