MIS2502: Data Analytics Extract, Transform, Load David Schuff [email protected] http://community.mis.temple.edu/dschuff


Page 1: MIS2502: Data Analytics Extract, Transform, Load

MIS2502: Data Analytics

Extract, Transform, Load

David Schuff

[email protected]

http://community.mis.temple.edu/dschuff

Page 2: MIS2502: Data Analytics Extract, Transform, Load

Where we are…

Data entry → Transactional Database → Data extraction → Analytical Data Store → Data analysis

• Transactional Database: stores real-time transactional data

• Analytical Data Store: stores historical transactional and summary data

Now we’re here…

Page 3: MIS2502: Data Analytics Extract, Transform, Load

Getting the information into the data mart

The data in the operational database…

…is put into a data warehouse…

…which feeds the data mart…

…and is analyzed as a cube.

Now let’s address this part…

Page 4: MIS2502: Data Analytics Extract, Transform, Load

Extract, Transform, Load (ETL)

Extract data from the operational data store

Transform data into an analysis-ready format

Load it into the analytical data store
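The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used in the course: both stores are in-memory SQLite databases, and the table names (`sales`, `daily_sales`), columns, and sample rows are hypothetical.

```python
import sqlite3

def extract(source):
    """Extract: pull raw rows out of the operational (transactional) store."""
    return source.execute(
        "SELECT order_date, product, quantity, unit_price FROM sales"
    ).fetchall()

def transform(rows):
    """Transform: clean product names and summarize revenue per (date, product)."""
    totals = {}
    for order_date, product, quantity, unit_price in rows:
        key = (order_date, product.strip().title())
        totals[key] = totals.get(key, 0.0) + quantity * unit_price
    return [(d, p, round(t, 2)) for (d, p), t in sorted(totals.items())]

def load(target, summary):
    """Load: write the analysis-ready rows into the analytical store."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales "
        "(order_date TEXT, product TEXT, revenue REAL)"
    )
    target.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", summary)
    target.commit()

# Hypothetical operational data, made up for this sketch.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE sales "
    "(order_date TEXT, product TEXT, quantity INTEGER, unit_price REAL)"
)
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01", "doritos", 2, 1.50),
        ("2024-01-01", "Doritos ", 1, 1.50),
        ("2024-01-02", "doritos", 4, 1.50),
    ],
)

target = sqlite3.connect(":memory:")
load(target, transform(extract(source)))
result = target.execute(
    "SELECT order_date, product, revenue FROM daily_sales ORDER BY order_date"
).fetchall()
print(result)
```

Keeping each step in its own function makes it easier to test the transformation separately from the database access.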

Page 5: MIS2502: Data Analytics Extract, Transform, Load

The Actual Process

Extract → Transform → Load

Transactional Database 1 → Query → Data conversion → Query → Data Mart

Transactional Database 2 → Query → Data conversion → Query → Data Mart

Relational database (source) → Dimensional database (data mart)

Page 6: MIS2502: Data Analytics Extract, Transform, Load

ETL’s Not That Easy!

Data Consistency

• What if the data is in different formats?

Data Quality

• How do we know it’s correct?

• What if there is missing data?

• What if the data we need isn’t there?
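The "missing data" question can be made concrete with a simple count of empty values per field. The sketch below is an assumption-laden illustration: the rows are plain dictionaries, the field names are borrowed from the Guests table later in the deck, the sample values are invented, and `missing_counts` is a hypothetical helper.

```python
def missing_counts(rows, required_fields):
    """Count, per field, how many rows lack a usable value."""
    counts = {field: 0 for field in required_fields}
    for row in rows:
        for field in required_fields:
            value = row.get(field)
            # Treat absent keys, None, and blank strings as missing.
            if value is None or str(value).strip() == "":
                counts[field] += 1
    return counts

# Invented sample rows for illustration only.
rows = [
    {"Guest_number": 1, "Guest_email": "a@example.com", "Guest_zipcode": "19122"},
    {"Guest_number": 2, "Guest_email": "", "Guest_zipcode": None},
    {"Guest_number": 3, "Guest_zipcode": "19104"},
]
report = missing_counts(rows, ["Guest_email", "Guest_zipcode"])
print(report)
```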

Page 7: MIS2502: Data Analytics Extract, Transform, Load

Data Consistency: The Problem with Legacy Systems

• An IT infrastructure evolves over time

• Systems are created and acquired by different people using different specifications

This can happen through:

• Changes in management

• Mergers & Acquisitions

• Externally mandated standards

• Generally poor planning

Page 8: MIS2502: Data Analytics Extract, Transform, Load

This leads to many issues

Redundant data across the organization
• Customer record maintained by accounts receivable and marketing

The same data element stored in different formats
• Social Security number (123-45-6789 versus 123456789)

Different naming conventions
• “Doritos” versus “Frito-Lay’s Doritos” versus “Regular Doritos”

Different unique identifiers used
• Account_Number versus Customer_ID

What are the problems with each of these?
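Two of these issues, inconsistent formats and inconsistent names, are usually handled in the Transform step by converting every value to one agreed form. The following is a minimal sketch; the function names and the alias table are hypothetical, and a real alias list would have to be agreed on by the business.

```python
import re

def normalize_ssn(raw):
    """Return the nine digits of an SSN-like string, or None if malformed."""
    digits = re.sub(r"\D", "", raw)  # drop dashes, spaces, and other non-digits
    return digits if len(digits) == 9 else None

# Hypothetical lookup table mapping known variants to one canonical name.
PRODUCT_ALIASES = {
    "doritos": "Doritos",
    "frito-lay's doritos": "Doritos",
    "regular doritos": "Doritos",
}

def normalize_product(name):
    """Map known spelling variants to one canonical product name."""
    key = name.strip().lower().replace("\u2019", "'")  # curly to straight apostrophe
    return PRODUCT_ALIASES.get(key, name.strip())

print(normalize_ssn("123-45-6789"), normalize_ssn("123456789"))
print(normalize_product("Frito-Lay\u2019s Doritos"))
```

Unknown product names pass through unchanged, so new variants show up in the output instead of being silently merged.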

Page 9: MIS2502: Data Analytics Extract, Transform, Load

What’s the big deal?

This is a fundamental problem for creating data cubes

We often need to combine information from several transactional databases

How do we know if we’re talking about the same customer or product?

Page 10: MIS2502: Data Analytics Extract, Transform, Load

Now think about this scenario

What are the differences between a “guest” and a “customer”?

Is there any way to know if a customer of the café is staying at the hotel?

Café Database

• Customer: Customer_number, Customer_name, Customer_address, Customer_city, Customer_zipcode

• Order: Order_number, Customer_number, Hotel_id, Food_item_id, Order_date, Order_time, Table_number

• Food item: Order number, Food_item_id, Order_date, Order_time

Hotel Reservation Database

• Hotels: Hotel_id, Country_code, Hotel_name, Hotel_address, Hotel_city, Hotel_zipcode

• Countries: Country_code, Country_currency, Country_name

• Hotel rooms: Room_number, Hotel_id, Room_type, Room_floor

• Room types: Room_type_code, Room_standard_rate, Room_description, Smoking_YN

• Room Bookings: Booking_id, Room_type_code, Hotel_id, Checkin_date, Number_of_days, Room_count

• Guest Bookings: Booking_id, Guest_number

• Guests: Guest_number, Guest_firstname, Guest_lastname, Guest_address, Guest_city, Guest_zipcode, Guest_email

• Hotel Amenities Lookup: Characteristic_id, Characteristic_description

• Hotel Amenities: Characteristic_id, Hotel_id

Page 11: MIS2502: Data Analytics Extract, Transform, Load

Solution: “Single view” of data

• The entire organization understands a unit of data in the same way

• It’s both a business goal and a technology goal

but it’s really more this…

...than this

Page 12: MIS2502: Data Analytics Extract, Transform, Load

Closer look at the Guest/Customer

Guests: Guest_number, Guest_firstname, Guest_lastname, Guest_address, Guest_city, Guest_zipcode, Guest_email

vs.

Customer: Customer_number, Customer_name, Customer_address, Customer_city, Customer_zipcode

Getting to a “single view” of data:

How would you represent “name?”

What would you use to uniquely identify a guest/customer?

Would you include email address?

How do you figure out if you’re talking about the same person?
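One way to approach the question of whether two records describe the same person is to build a comparison key from fields both tables share. The sketch below assumes that a normalized full name plus a five-digit ZIP code is good enough to suggest a match. That is a heuristic chosen for illustration, not a rule from the slides, and it can produce both false matches and misses. The helper names and sample records are hypothetical; the field names come from the Guests and Customer tables.

```python
def match_key(full_name, zipcode):
    """Build a comparison key: lowercased name with collapsed spaces, 5-digit ZIP."""
    name = " ".join(full_name.lower().split())
    return (name, zipcode.strip()[:5])

def find_matches(guests, customers):
    """Return (Guest_number, Customer_number) pairs that share a match key."""
    by_key = {}
    for g in guests:
        full_name = f"{g['Guest_firstname']} {g['Guest_lastname']}"
        key = match_key(full_name, g["Guest_zipcode"])
        by_key.setdefault(key, []).append(g["Guest_number"])
    pairs = []
    for c in customers:
        key = match_key(c["Customer_name"], c["Customer_zipcode"])
        for guest_number in by_key.get(key, []):
            pairs.append((guest_number, c["Customer_number"]))
    return pairs

# Invented sample records for illustration only.
guests = [
    {"Guest_number": 1, "Guest_firstname": "Ana", "Guest_lastname": "Lopez",
     "Guest_zipcode": "19122"},
    {"Guest_number": 2, "Guest_firstname": "Bo", "Guest_lastname": "Chen",
     "Guest_zipcode": "19104"},
]
customers = [
    {"Customer_number": "C7", "Customer_name": "ana  LOPEZ",
     "Customer_zipcode": "19122-1234"},
    {"Customer_number": "C8", "Customer_name": "Bo Chen",
     "Customer_zipcode": "10001"},
]
matches = find_matches(guests, customers)
print(matches)
```

Note that the café's Customer table has no email column, so email cannot serve as the shared key here even though it would often be a stronger identifier.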

Page 13: MIS2502: Data Analytics Extract, Transform, Load

Organizational issues

Why might there be resistance to data standardization?

Is it an option to just “fix” the transactional databases?

If two data elements conflict, whose standard “wins”?

Page 14: MIS2502: Data Analytics Extract, Transform, Load

Data Quality

The degree to which the data reflects the actual environment

Do we have the right data?

Is the collection process reliable?

Is the data accurate?

Page 15: MIS2502: Data Analytics Extract, Transform, Load

Finding the right data

Choose data consistent with the goals of analysis

Verify that the data really measures what it claims to measure

Include the analysts in the design process

Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf

Page 16: MIS2502: Data Analytics Extract, Transform, Load

Ensuring accuracy

Know where the data comes from

Manual verification through sampling

Use of knowledge experts

Verify calculations for derived measures

Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf
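Two of the accuracy checks above, sampling for manual verification and verifying derived measures, lend themselves to small scripts. The following sketch is an illustration written for this section, not taken from the cited source. It assumes each order stores a total alongside its line items; the field names, the tolerance, and the sample orders are all hypothetical.

```python
import random

def recompute_total(line_items):
    """Recompute an order total from its (quantity, unit_price) line items."""
    return round(sum(q * p for q, p in line_items), 2)

def find_mismatches(orders, tolerance=0.005):
    """Return ids of orders whose stored total disagrees with the recomputed one."""
    return [
        o["order_id"]
        for o in orders
        if abs(o["stored_total"] - recompute_total(o["items"])) > tolerance
    ]

def sample_for_review(orders, k, seed=0):
    """Pick up to k orders at random for manual checking (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(orders, min(k, len(orders)))

# Invented sample data; the second stored total is wrong on purpose.
orders = [
    {"order_id": 1, "items": [(2, 1.50), (1, 3.25)], "stored_total": 6.25},
    {"order_id": 2, "items": [(3, 2.00)], "stored_total": 5.00},
]
print(find_mismatches(orders))
print(len(sample_for_review(orders, 1)))
```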

Page 17: MIS2502: Data Analytics Extract, Transform, Load

Reliability of the collection process

Build fault tolerance into the process

Periodically run reports, check logs, and verify results

Keep up with (and communicate) changes

Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf
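Fault tolerance and log checking can be illustrated with a conversion step that sets bad rows aside instead of stopping the whole run. This is a minimal sketch written for this section, not taken from the cited source. The row layout, the `convert_row` and `convert_all` helpers, and the logger name are assumptions.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def convert_row(row):
    """Convert one raw row; raises KeyError or ValueError on bad input."""
    return {"order_id": int(row["order_id"]), "amount": float(row["amount"])}

def convert_all(rows):
    """Convert every row, logging and setting aside the ones that fail."""
    good, rejected = [], []
    for row in rows:
        try:
            good.append(convert_row(row))
        except (KeyError, ValueError) as err:
            log.warning("Rejected row %r: %s", row, err)
            rejected.append(row)
    # A summary line gives the periodic log review something to compare run to run.
    log.info("Converted %d rows, rejected %d", len(good), len(rejected))
    return good, rejected

# Invented sample input for illustration only.
raw_rows = [
    {"order_id": "1", "amount": "9.99"},
    {"order_id": "2", "amount": "n/a"},  # bad amount
    {"amount": "4.00"},                  # missing order_id
]
good, rejected = convert_all(raw_rows)
```

Keeping the rejected rows, rather than discarding them, lets someone review and correct them later, which supports the "verify results" step.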