17
Getting the data into the warehouse: extract, transform, load MIS2502 Data Analytics

GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Embed Size (px)

Citation preview

Page 1: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Getting the data into the warehouse: extract, transform, loadMIS2502

Data Analytics

Page 2: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Getting the information into the data mart

Now let’s address this part…

Page 3: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Extract, Transform, Load (ETL)

• The process of copying data from the transactional database to the analytical database

• Going from relational to dimensional

• Basically, it’s a matter of identifying where the data should come from to fill the data mart

Page 4: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

ETL Defined

Page 5: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

The Actual Process

Transactional Database 2

Transactional Database 1

Data Mart

Query

Query

Data conversion

Data conversion

Query

Query

Extract Transform Load

Page 6: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Main ETL Issues: Conversion Stage

Page 7: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Data Consistency: The Problem with Legacy Systems

• An IT infrastructure evolves over time• Systems are created and acquired by different people using different specifications

This can happen through:•Changes in management•Mergers & Acquisitions•Externally mandated standards•General poor planning

Page 8: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

This leads to many issues

• Redundant data across the organization• Customer record maintained by accounts

receivable and marketing

• The same data element stored in different formats• Social Security number (123-45-6789 versus

123456789)

• Different naming conventions• “Doritos” versus “Frito-Lay’s Doritos” verus

“Regular Doritos”

• Different unique identifiers used• Account_Number versus Customer_ID

What are the problems with each of these

?

Page 9: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

What’s the big deal?

• This is a fundamental problem for creating data cubes

• We often need to combine information from several transactional databases

• How do we know if we’re talking about the same customer or product?

Page 10: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Now think about this scenarioHotel Reservation Database Café Database

CustomerCustomer_numberCustomer_nameCustomer_addressCustomer_cityCustomer_zipcode

OrderOrder_numberCustomer_numberHotel_idFood_item_idOrder_dateOrder_timeTable_number

Food itemOrder numberFood_item_idOrder_dateOrder_time

HotelsHotel_idCountry_codeHotel_nameHotel_addressHotel_cityHotel_zipcode

HotelsHotel_idCountry_codeHotel_nameHotel_addressHotel_cityHotel_zipcode

CountriesCountry_codeCountry_currencyCountry_name

Hotel roomsRoom_numberHotel_idRoom_typeRoom_floor

Room typesRoom_type_codeRoom_standard_rateRoom_descriptionSmoking_YN

Room BookingsBooking_idRoom_type_codeHotel_idCheckin_dateNumber_of_daysRoom_count

Guest BookingsBooking_idGuest_number

GuestsGuest_numberGuest_firstnameGuest_lastnameGuest_addressGuest_cityGuest_zipcodeGuest_email

Hotel Amenities LookupCharacteristic_idCharacteristic_description

Hotel AmenitiesCharacteristic_idHotel_id

Page 11: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Solution: “Single view” of data

• The entire organization understands a unit of data in the same way

• It’s both a business goal and a technology goal

and really more this…

..than this

Page 12: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Closer look at the Guest/CustomerGuests

Guest_numberGuest_firstnameGuest_lastnameGuest_addressGuest_cityGuest_zipcodeGuest_email

CustomerCustomer_numberCustomer_nameCustomer_addressCustomer_cityCustomer_zipcode

vs.vs.

Page 13: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Organizational issues

• Why might there be resistance to data standardization?

• Is it an option to just “fix” the transactional databases?

• If two data elements conflict, who’s standard “wins?”

Page 14: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Data Quality

• The degree to which the data reflects the actual environment

Page 15: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Finding the right data

Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf

Page 16: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Ensuring accuracy

Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf

Page 17: GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Reliability of the collection process

Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf