Extraction, Transformation and Load: Data Warehouse


ETL: Data Warehouse

The data that is to be stored in a data warehouse emanates from multiple, heterogeneous sources. Data in each of the operational sources is distinct in characteristics and form, and its quality is often inadequate and inconsistent. These source systems present data in various forms that cannot be transferred directly into a data warehouse. Data warehouses are built to provide managers with vital information for strategic decision making, and the data can only be vital and relevant if it is in a consistent and homogeneous form. ETL (Extraction, Transformation and Loading) is the process of reshaping the data into useful information that is to be stored in a data warehouse (Ponniah, 2003, p. 257).

Figure 1. Steps of building a data warehouse: the ETL process

Data Extraction

The process of extraction can be very complex because you need to extract a sheer volume of data that lies on disparate and diverse systems. The sources of data can be ASCII files, VSAM files, legacy databases and systems, mainframe computers and other proprietary systems. Source identification is a critical issue that needs to be considered during the data extraction phase. You need to identify all the relevant and proper sources of data; do not overload your data warehouse with an unnecessary pile of data. As a first step, build your dimension and fact tables and include all the target data items (attributes and metrics) that are required. For each target data item find the source data item, and if there are multiple sources for one target data item then select the preferred source. For missing values, inspect the source data thoroughly.

Data in the operational systems is in a state of constant flux; business transactions keep changing the data. For example, the value of a bank account balance increases or decreases as you make a transaction. This makes it necessary for the data warehouse to keep up with the current state of the data as well. There are two ways of capturing the data as it changes state:

1. Immediate Data Extraction
2. Deferred Data Extraction

Under the first option the data extraction is real time. It is generally divided into three categories:

1. Capture through transaction logs
2. Capture through database triggers
3. Capture through source applications

The first method uses the transaction logs of the database management system. As each transaction adds, updates or deletes data, the DBMS immediately writes entries to the log file; the extraction routine reads the log file and selects all the updated data. This option only works for a DBMS, so if your source data lies in flat files this method will not work. The second method uses triggers, which are special stored procedures kept in the database and fired whenever a predefined event takes place. Under this method, trigger programs are created for all events for which data is to be captured. For instance, if you need to capture changes to a Product table, you create a trigger program which captures all the changes made to that table. This method is also applicable only to a DBMS. The last method uses the source application itself to capture the changes in data. Unlike the previous two methods, this procedure allows you to capture the data whether it resides in a flat file or a DBMS. A sketch of the trigger-based approach appears below.
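The sketch below is purely illustrative and not part of the paper's sources. It assumes Python and SQLite (the paper does not name a DBMS) and a hypothetical Product table: triggers copy every change into a change-capture table that the extraction job later reads.

```python
# Illustrative sketch of trigger-based change capture using SQLite.
# The schema, table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect("source_system.db")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE IF NOT EXISTS product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    price      REAL
);
CREATE TABLE IF NOT EXISTS product_changes (
    change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER,
    operation  TEXT,                      -- 'INSERT', 'UPDATE' or 'DELETE'
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Triggers fire on every change and write a row into the capture table.
CREATE TRIGGER IF NOT EXISTS product_ins AFTER INSERT ON product
BEGIN
    INSERT INTO product_changes (product_id, operation) VALUES (NEW.product_id, 'INSERT');
END;
CREATE TRIGGER IF NOT EXISTS product_upd AFTER UPDATE ON product
BEGIN
    INSERT INTO product_changes (product_id, operation) VALUES (NEW.product_id, 'UPDATE');
END;
CREATE TRIGGER IF NOT EXISTS product_del AFTER DELETE ON product
BEGIN
    INSERT INTO product_changes (product_id, operation) VALUES (OLD.product_id, 'DELETE');
END;
""")

# The extraction job simply reads the capture table after changes occur.
cur.execute("INSERT INTO product (name, price) VALUES ('Widget', 9.99)")
cur.execute("UPDATE product SET price = 10.49 WHERE name = 'Widget'")
conn.commit()
for row in cur.execute("SELECT product_id, operation, changed_at FROM product_changes"):
    print(row)
conn.close()
```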
Deferred data extraction methods do not capture the data in real time. They are divided into two types:

1. Capture based on date and time stamp
2. Capture through comparing files

Under the first option, every time a source record is created or updated it is marked with a stamp showing the date and time of the change. The time stamp provides a basis for selecting records for extraction: if you run an extraction program once a day, it selects only those records whose time stamp is later than that of the previous run. Under the second method you do a full file comparison between today's copy of the data and the previous copy, and then capture the differences between the two copies. A minimal sketch of the time-stamp approach follows.
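Purely as an illustration (the orders table, its columns and the cutoff file are assumptions, not taken from the paper), a time-stamp-based deferred extraction in Python might remember the time stamp of its previous run and select only rows updated since then.

```python
# Sketch of time-stamp-based incremental (deferred) extraction.
# Assumes a source table 'orders' with a 'last_updated' column; names are hypothetical.
import sqlite3
from datetime import datetime, timezone

CUTOFF_FILE = "last_extract_cutoff.txt"

def read_last_cutoff() -> str:
    """Return the time stamp of the previous extraction run (epoch start if none)."""
    try:
        with open(CUTOFF_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return "1970-01-01T00:00:00+00:00"

def extract_changed_rows(conn: sqlite3.Connection) -> list:
    """Select only the rows updated since the previous run, then record a new cutoff."""
    cutoff = read_last_cutoff()
    rows = conn.execute(
        "SELECT order_id, customer_id, amount, last_updated "
        "FROM orders WHERE last_updated > ?",
        (cutoff,),
    ).fetchall()
    with open(CUTOFF_FILE, "w") as f:
        f.write(datetime.now(timezone.utc).isoformat())
    return rows

if __name__ == "__main__":
    # In-memory sample data so the sketch is self-contained.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL, last_updated TEXT);
    INSERT INTO orders VALUES (1, 101, 250.0, '2024-01-05T10:00:00+00:00');
    INSERT INTO orders VALUES (2, 102, 75.5,  '2024-01-06T09:30:00+00:00');
    """)
    for row in extract_changed_rows(conn):
        print(row)
    conn.close()
```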
Data Integration, Cleansing and Transformation

After extraction, the data cannot be loaded directly into a data warehouse. All the extracted data must be made usable for analysis by the decision makers. You need to enrich and enhance the quality of the extracted data before handing it over to the users of the data warehouse; the improved data then becomes relevant and can be used for strategic decision making. Data warehouses are used for making crucial decisions, which is why correctness and high quality of data are vital to avoid wrong conclusions. Data quality problems are the biggest challenge faced by data warehouse builders. Let us analyze a few of them.

1. The entity identification problem becomes obvious when you are capturing data from disparate systems. For instance, if three different legacy applications were introduced in your organization at different times in the past, you will have three customer files, one in each application. One of them may belong to the order entry system, the second to the customer service system and the third to the marketing system. Each customer file will have a unique identification number (primary key), but the identifier for the same customer may be different in each application. This is an identification problem in which you do not know which records relate to the same customer (Ponniah, 2003, p. 276).

2. Records can have the same primary key but contain different data. This can occur if two units or organizations have merged. Another example appears when a legacy system stores its customer data under a fixed-length primary key (suppose a five-digit key). If the number of customers exceeds 99,999, the legacy system may solve the problem by archiving the records of old customers and reassigning a primary key of 1 to a new customer. This creates a complication for the data warehouse, because when you capture data from both the archived file and the new file the duplicated primary keys become an issue.

3. Sources might contain invalid or dummy data. For instance, if you are registering on a website for some sort of membership and it is mandatory to enter a zip code, you might fill the zip code field with an invalid number such as 99999 just to pass the numerical edits. The same problem occurs in operational systems where some fields are filled with invalid data.

4. Data fields are often left blank and contain null values. In operational systems, users are only concerned with the fields needed to mail a billing statement or call the customer (phone number, address and email address). They do not pay attention to the demographic attributes of the customer (income, age range and gender), which are essential for analysis and decision making in a data warehouse.

5. Data records can have inconsistent values and can be encoded in different ways in different systems. The value of gender in one system may be specified as (M=Male and F=Female) whereas another source system uses (0=Male and 1=Female).

6. There are often multiple ways to represent the same piece of information. For instance, some source systems might use "Oxford University", some might use "OA" for the same data field, and others might use "University of Oxford". In the data warehouse we need to store a single way of representing the data.

7. Two fields in the source system might depend on each other and can contradict each other. For example, you cannot have a state of California with a zip code of 08817 (New Jersey).

8. Fields can have different data but the same name. "Total Sales" can mean fiscal-year sales in one source and calendar-year sales in another.

9. Data can even violate business rules. In a bank loan system, the minimum interest rate cannot be higher than the maximum rate for a variable-rate loan.

Data cleansing deals with detecting and removing these errors in order to improve the quality of the data (Rahm & Do, 2000). These quality issues can be tackled with various techniques such as statistical tools, clustering, and a reference system. Statistical tools identify outlier fields and records; clustering identifies outlier records using algorithms; a reference system is a central repository of data standards, distributed to each data source, which then imposes the standards on itself (Boyno, 2003).

Data Transformation is the process of applying a wide variety of manipulations to change the extracted source data into usable information to be stored in the data warehouse. We will describe some of the functions of the transformation process.

1. Data formats are amended by changing the data types (char, date, int) and the lengths of fields.

2. Single large fields are split into smaller fields. For instance, the source system may have a single field named "Address" which contains the address, city, state and country. For the data warehouse we need to split that field into the individual components Address, City, State and Country.

3. Information is also merged during this phase. For example, the "Product" dimension table has attributes from various sources: some of the basic attributes can be captured from the source "Product" table, while other data can be captured from the "Package Type" and "Brand" tables.

4. Values of some fields are calculated during the transformation process. Some metrics that belong in the fact table are not present in the source data and need to be derived; if we need the profit margin, we can simply use the total sales and total cost from the source data to calculate our target data item.

5. Units of measurement are converted as per the requirements of the data warehouse. If your company has overseas operations, you need to convert the units of the records so that all the numbers are expressed in one standard unit of measurement.

6. Records in the data warehouse are often transformed into summarized form. For example, if a retail organization wants to analyze its sales patterns, it may not be necessary to store data about each individual sales transaction; instead, you might store summary-level data such as the sales of each week.

7. Key restructuring is an essential function of the data transformation process. For instance, suppose you have a Product ID = W1M531234, where W1 is the warehouse code, M53 is the sales territory and 1234 is the product number. Such built-in primary keys can create problems: when you load the data into the warehouse using that primary key and the warehouse for that product later changes to W2, the new Product ID becomes W2M531234, so for incremental loads the same product would be stored under a different primary key. To avoid such problems you should never use primary keys with built-in meanings; transform such keys into generic keys generated by the system itself (Ponniah, 2003, p. 275). A small sketch of field splitting, code standardization and surrogate-key generation is given after this list.
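The following Python sketch is intended only as an illustration of the transformation functions described above; the field names, code tables and key format are assumptions rather than anything taken from the paper. It splits a combined address field, standardizes gender codes coming from two differently encoded sources, and replaces a meaningful product key with a system-generated surrogate key.

```python
# Illustrative transformation sketch: field splitting, code standardization
# and surrogate-key generation. All names and formats are hypothetical.
from itertools import count

# 1. Split a single "Address" field into its component target fields.
def split_address(address: str) -> dict:
    """Split 'street, city, state, country' into separate target fields."""
    street, city, state, country = [part.strip() for part in address.split(",")]
    return {"street": street, "city": city, "state": state, "country": country}

# 2. Standardize gender codes coming from differently encoded source systems.
GENDER_MAP = {"M": "Male", "F": "Female", "0": "Male", "1": "Female"}

def standardize_gender(value: str) -> str:
    return GENDER_MAP.get(str(value).strip().upper(), "Unknown")

# 3. Replace a smart key such as 'W1M531234' with a generic surrogate key
#    keyed only on the product number, so warehouse changes do not alter it.
_surrogate_counter = count(start=1)
_surrogate_keys: dict = {}

def surrogate_key(product_number: str) -> int:
    """Assign one stable, meaningless key per product number (e.g. '1234')."""
    if product_number not in _surrogate_keys:
        _surrogate_keys[product_number] = next(_surrogate_counter)
    return _surrogate_keys[product_number]

if __name__ == "__main__":
    print(split_address("12 High Street, Oxford, Oxfordshire, United Kingdom"))
    print(standardize_gender("0"), standardize_gender("F"))
    # The same product keeps the same surrogate key across incremental loads.
    print(surrogate_key("1234"), surrogate_key("1234"))
```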
Data Loading

As soon as the transformation functions are finished, the next major function, data loading, takes the prepared data and stores it in the data warehouse. The loading process takes an excessive amount of time and places a heavy burden on the system, so it can affect the performance of the data warehouse; the warehouse therefore has to be offline while loading runs. You need to find a window of time when the loads can be scheduled without affecting the end users.

References

Ponniah, P. (2004). Data Warehousing Fundamentals. India: Replica Press.

Boyno, A. E. (2003). Extraction, Transformation, and Loading in a Data Warehouse Course. Retrieved March 28, 2008, from http://www.isedj.org/isecon/2003/3521/ISECON.2003.Boyno.pdf

Rahm, E. & Do, H. H. (2000). Data Cleaning: Problems and Current Approaches. Retrieved March 28, 2008, from http://www.homepages.inf.ed.ac.uk/wenfei/tdd/reading/cleaning.pdf

Maletic, J. I. & Marcus, A. (2000). Data Cleansing: Beyond Integrity Analysis. Retrieved March 28, 2008, from http://www.cs.wayne.edu/~amarcus/papers/IQ2000.pdf