Java/XML ETL Engine By Bob Timlin

Java/XML ETL Engine By Bob Timlin. Outline Data Extraction, Transformation, and Loading (ETL). Java & XML Meta-Data Mapping Data from Source to Target

Java/XML ETL Engine

By

Bob Timlin

Outline

• Data Extraction, Transformation, and Loading (ETL).

• Java & XML

• Meta-Data

• Mapping Data from Source to Target

Outline

• Proposed XML Usage.

• XML for Meta-Data

• Challenges/Issues

• Sample XML Data File

• Sample XML Meta-Data File

Extract/Transform/Load (ETL):

The process of getting data from the source system(s) into the data-warehouse is easily 80% of the effort of the entire data-warehouse project. This is because of the complexity of the source systems, the cleansing or transformation process, and all of the prep work needed to turn detailed operational data into summary data-warehouse data. The more source systems you have, the harder this process becomes, and the difficulty increases exponentially.

Cleaning/transforming the data is probably the most complicated part of this process.

Transformations can be done on either the source system or the target system.

Java & XML

Currently, ETL processes are mostly written in COBOL and C with embedded SQL. There are many GUI tools available to streamline this process. These tools mostly generate proprietary code that is then executed by a scheduling program.

All of the big vendors in this field are pushing XML as the language for storing transformation meta-data, and all of the big players, sans Microsoft, are backing Java as the language for implementing transformations. For some weird reason, Microsoft doesn’t seem to like Java.

The major vendors include IBM, Oracle, and Microsoft.

Meta-Data

Data about data. In a data warehouse, meta-data stores information about the structures of both the source and destination data and about how to extract, transform, and load that data. It may also hold network configuration information such as IP addresses and ports. The Meta Data Coalition (http://www.mdcinfo.com/) recently merged with the Object Management Group (OMG, http://www.omg.org). They are backed by many heavy-hitters, including Oracle, IBM, and Microsoft. The industry seems to be moving towards using XML for storing meta-data, which makes the meta-data standardized and portable.

Mapping Data from Source to Target:

Target:

• Name: The name of the logical table in the data-warehouse.

• Source: The table name in the XML data file.

• Driver: The JDBC driver name.

• Url: The path to the data-warehouse.

• Username: The username used to connect to the data-warehouse.

• Password: The password used to connect to the data-warehouse.

Mapping (continued)

Column:

• Name: The name of the logical column in the data warehouse.

• Type: The data type of the logical column in the data warehouse.

• Key: Whether this column is a primary key; if so, the engine will use it in the WHERE clause.

• Source: The name of the column in the XML data file.
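As a rough illustration of the Key attribute, the sketch below holds the column attributes in a small Java class and builds a parameterized UPDATE in which key columns land in the WHERE clause. The class and method names (MappingSql, ColumnMapping, buildUpdate) are hypothetical and not part of the engine described in these slides; the choice of UPDATE as the generated statement is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical helper that turns column mappings into SQL text. */
final class MappingSql {
    /** Builds a parameterized UPDATE: non-key columns go in SET, key columns in WHERE. */
    static String buildUpdate(String table, List<ColumnMapping> columns) {
        StringJoiner set = new StringJoiner(", ");
        StringJoiner where = new StringJoiner(" AND ");
        for (ColumnMapping c : columns) {
            (c.key ? where : set).add(c.name + " = ?");
        }
        if (set.length() == 0 || where.length() == 0) {
            throw new IllegalArgumentException("Need at least one key and one non-key column");
        }
        return "UPDATE " + table + " SET " + set + " WHERE " + where;
    }

    public static void main(String[] args) {
        List<ColumnMapping> cols = new ArrayList<>();
        cols.add(new ColumnMapping("id", "number", true, "ID"));
        cols.add(new ColumnMapping("age", "number", false, "Age"));
        System.out.println(buildUpdate("admits", cols));
    }
}

/** Hypothetical holder for one column mapping; fields mirror the attributes listed above. */
final class ColumnMapping {
    final String name;    // logical column name in the data warehouse
    final String type;    // data type of the logical column, e.g. "number"
    final boolean key;    // true if the column is part of the primary key
    final String source;  // column name in the XML data file

    ColumnMapping(String name, String type, boolean key, String source) {
        this.name = name;
        this.type = type;
        this.key = key;
        this.source = source;
    }
}
```

The "?" placeholders are meant for a JDBC PreparedStatement, so values read from the data file are bound rather than concatenated into the SQL text.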

Proposed XML Usage

• For meta-data about the ETL processing. This will contain all information about mapping source to target, including transformation rules.

• As a data file to store data from databases.

XML for Meta-Data

The specification is designed to be flexible enough to support many protocols; for our project, however, we will implement only two: 1. XML data file, 2. JDBC.

The protocol is part of the url attribute of the target or source node. Every transformation has a source and a target.

<source url="xml://localhost/tmp/test.xml">
…
<target url="jdbc:oracle:thin:@localhost:1521:timlin"
        driver="oracle.jdbc.driver.OracleDriver"
        username="scott" password="tiger" name="srctest">
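One way the engine might pick a protocol from the url attribute is sketched below. It assumes the protocol is simply the text before the first colon, and that an xml:// url carries a host followed by the file path; both are guesses based on the two example urls above, and the class name UrlProtocol is hypothetical.

```java
import java.util.Locale;

/** Hypothetical helper for reading the protocol out of a source/target url attribute. */
final class UrlProtocol {
    /** Returns the text before the first ':' in lower case, e.g. "xml" or "jdbc". */
    static String protocolOf(String url) {
        int colon = url.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("No protocol in url: " + url);
        }
        return url.substring(0, colon).toLowerCase(Locale.ROOT);
    }

    /** For xml://host/path urls, returns the path after the host (an assumed interpretation). */
    static String xmlFilePath(String url) {
        String prefix = "xml://";
        if (!url.startsWith(prefix)) {
            throw new IllegalArgumentException("Not an xml url: " + url);
        }
        int slash = url.indexOf('/', prefix.length());
        if (slash < 0) {
            throw new IllegalArgumentException("No file path in url: " + url);
        }
        return url.substring(slash);
    }

    public static void main(String[] args) {
        System.out.println(protocolOf("xml://localhost/tmp/test.xml"));
        System.out.println(protocolOf("jdbc:oracle:thin:@localhost:1521:timlin"));
        System.out.println(xmlFilePath("xml://localhost/tmp/test.xml"));
    }
}
```

With the two urls above, the sketch yields "xml" and "jdbc" as protocols and "/tmp/test.xml" as the file path.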

The basic construct of an XML meta-data file is:

<translation>
  <source url="…", etc>
    <column name="…"> [<rule language="…"> </rule>] </column>
    [<column name="…">[<rule></rule>]</column>]
  </source>
  <target url="…", etc.>
    <column name="…", etc.>[<rule></rule>]</column>
    [<column name="…", etc.>[<rule></rule>]</column>]
  </target>
</translation>
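A file of this shape can be read with the DOM parser that ships with Java. The sketch below is only an illustration: the class name MetaDataReader, the method targetColumns, and the small inline document are assumptions modelled on the construct above, not the engine's actual code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader for a <translation> meta-data document. */
final class MetaDataReader {
    /** Returns "name <- source" for each <column> under the first <target>. */
    static List<String> targetColumns(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element target = (Element) doc.getElementsByTagName("target").item(0);
            if (target == null) {
                throw new IllegalArgumentException("No <target> element found");
            }
            NodeList columns = target.getElementsByTagName("column");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < columns.getLength(); i++) {
                Element column = (Element) columns.item(i);
                result.add(column.getAttribute("name") + " <- " + column.getAttribute("source"));
            }
            return result;
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            // Parser configuration, I/O, and SAX errors are wrapped for brevity.
            throw new IllegalArgumentException("Could not parse meta-data", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative document only; the attribute values are made up for this sketch.
        String xml = "<translation>"
                + "<source url=\"xml://localhost/tmp/test.xml\"/>"
                + "<target url=\"jdbc:mysql://localhost:3306/test\" name=\"admits\">"
                + "<column name=\"id\" source=\"ID\"/>"
                + "<column name=\"age\" source=\"Age\"/>"
                + "</target></translation>";
        System.out.println(targetColumns(xml));
    }
}
```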

Challenges/Issues

• Mapping multiple sources to multiple targets.

• Transformations can involve very complex coding, especially when eliminating duplicates, merging, and purging data. These transformations usually involve “fuzzy” logic.
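To give a feel for what “fuzzy” duplicate detection can look like, the sketch below compares two names by Levenshtein edit distance after normalizing case and whitespace. This is one common technique chosen purely for illustration; the slides do not say which matching method the engine uses, and the names FuzzyMatch, probablySame, and maxEdits are hypothetical.

```java
import java.util.Locale;

/** Hypothetical fuzzy matcher based on Levenshtein edit distance. */
final class FuzzyMatch {
    /** Dynamic-programming edit distance: insertions, deletions, substitutions each cost 1. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Lower-cases, trims, and collapses runs of whitespace before comparing. */
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    /** Treats two values as likely duplicates if they differ by at most maxEdits edits. */
    static boolean probablySame(String a, String b, int maxEdits) {
        return distance(normalize(a), normalize(b)) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(probablySame("Timlin, Bob", "timlen,  bob", 1));
        System.out.println(probablySame("Smith, Ann", "Jones, Ann", 1));
    }
}
```

A fixed edit threshold is a blunt instrument; real merge/purge logic typically combines several fields and tuned thresholds, which is part of why this step is hard to express as simple mapping rules.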

<target url="jdbc:oracle:thin:@64.130.33.125:1521:timlin"
        driver="oracle.jdbc.driver.OracleDriver"
        username="scott" password="tiger" name="srctest">
  <!-- As the target, connect to the database using JDBC and
       insert the data from the source XML file and rules that follow -->
  <table name="patients">
    <column name="lname" source="fullname">
      <rule language="java"> source.replace("'", "") </rule>
      <rule language="sql"> INITCAP(SUBSTR(source, 1, INSTR(source, ',') -1)) </rule>
    </column>

    <column name="fname" source="fullname">
      <rule language="java"> source.replace("'", "") </rule>
      <rule language="sql"> INITCAP(SUBSTR(source, INSTR(source, ',') +1)) </rule>
    </column>
    <column name="dob" source="dob">
      <rule language="sql"> TO_DATE(source, 'DD/MM/YYYY') </rule>
    </column>
  </table>
</target>
</translation>
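To make the effect of the lname and fname rules concrete, here is a plain-Java approximation of what they compute for a "Last, First" value. The initCap method is only a rough stand-in for SQL INITCAP, the class name NameRules is hypothetical, and throwing on a value without a comma is a choice made for this sketch rather than the database's behavior.

```java
/** Hypothetical Java rendering of the lname/fname rules shown above. */
final class NameRules {
    /** Mirrors the java rule: removes apostrophes from the source value. */
    static String stripApostrophes(String source) {
        return source.replace("'", "");
    }

    /** Rough stand-in for INITCAP: upper-cases the first character of each word, lower-cases the rest. */
    static String initCap(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean startOfWord = true;
        for (char ch : s.toCharArray()) {
            if (Character.isLetterOrDigit(ch)) {
                out.append(startOfWord ? Character.toUpperCase(ch) : Character.toLowerCase(ch));
                startOfWord = false;
            } else {
                out.append(ch);
                startOfWord = true;
            }
        }
        return out.toString();
    }

    /** Position of the comma separating last name from first name. */
    static int commaIndex(String s) {
        int comma = s.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("Expected 'Last, First' but got: " + s);
        }
        return comma;
    }

    /** Text before the comma, like SUBSTR(source, 1, INSTR(source, ',') - 1). */
    static String lname(String fullname) {
        String s = stripApostrophes(fullname);
        return initCap(s.substring(0, commaIndex(s)));
    }

    /** Text after the comma, like SUBSTR(source, INSTR(source, ',') + 1); no trimming is done. */
    static String fname(String fullname) {
        String s = stripApostrophes(fullname);
        return initCap(s.substring(commaIndex(s) + 1));
    }

    public static void main(String[] args) {
        System.out.println("[" + lname("O'BRIEN, mary") + "] [" + fname("O'BRIEN, mary") + "]");
    }
}
```

For the input "O'BRIEN, mary" this gives "Obrien" and " Mary". The leading space in the first name survives because, like the SQL rule, the sketch does not trim.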

<translation>
  <!-- From Database to XML -->
  <source url="jdbc:oracle:thin:@64.130.33.125:1521:timlin"
          driver="oracle.jdbc.driver.OracleDriver"
          username="scott" password="tiger" name="targetTest">
    <table name="patients">
      <column name="lname" source="fullname">
        <rule language="java"> source.replace("'", "") </rule>
        <rule language="sql"> INITCAP(SUBSTR(source, 1, INSTR(source, ',') -1)) </rule>
      </column>

      <column name="fname" source="fullname">
        <rule language="java"> source.replace("'", "") </rule>
        <rule language="sql"> INITCAP(SUBSTR(source, INSTR(source, ',') +1)) </rule>
      </column>
      <column name="dob" source="dob">
        <rule language="sql"> TO_DATE(source, 'DD/MM/YYYY') </rule>
      </column>
    </table>
  </source>

  <target url="xml://localhost/tmp/test.xml">
    <table name="patients">
      <column name="fullname"></column>
      <column name="street"></column>
      <column name="city"></column>
      <column name="state"></column>
      <column name="zip"></column>
      <column name="dob"></column>
      <column name="balance"></column>
    </table>
  </target>
</translation>

Sample XML Data File

<Record TableName="table1">
  <Column1>data for column 1</Column1>
  <Column2>data for column 2</Column2>
</Record>
<Record TableName="table1">
  <Column1>data for column 1</Column1>
  <Column2>data for column 2</Column2>
</Record>
<Record TableName="table2">
  <Column1>data for column 1</Column1>
  <Column2>data for column 2</Column2>
</Record>

<column name="month_admitted" type="number" source="Month_Admitted"></column>

<column name="year_admitted" type="number" source="Year_Admitted"></column>

<column name="source_of_admission" type="number" source="Source_Of_Admissions"></column>

<column name="disposition" type="number" source="Disposition"></column>

<column name="charges" type="number" source="Charges"></column>

<column name="drg" type="number" source="Diagnosis_Related_Group"></column>

<column name="rec_link_no" type="varchar" source="Record_Linkage_Number" key="yes"></column>

</target>

<Record TableName="patient">
  <ID>1</ID>
  <Facility>10735</Facility>
  <Age>67</Age>
  <Sex>2</Sex>
  <Ethnicity>2</Ethnicity>
  <Race>2</Race>
  <ZIP>946</ZIP>
  <Length_Of_Stay>18</Length_Of_Stay>
  <Month_Admitted>12</Month_Admitted>
  <Year_Admitted>1995</Year_Admitted>
  <Source_of_admission>512</Source_of_admission>
  <Disposition>11</Disposition>
  <Charges>36948</Charges>
  <Diagnosis_Related_Group>202</Diagnosis_Related_Group>
  <Record_Linkage_Number>FRFSFEM1E</Record_Linkage_Number>
</Record>

<Record TableName="drg">

<Diagnosis_Related_Group>1</Diagnosis_Related_Group>

<Major_Diagnostic>1</Major_Diagnostic>

<Category>S</Category>

<Description><![CDATA[CRANIOTOMY, AGE >17 EXCEPT FOR TRAUMA]]></Description>

</Record>
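Records like the ones above could be loaded into memory with the standard DOM parser, as in the sketch below. It assumes the Record elements sit inside a single root element, here called Data, which the slides do not show; the class name DataFileReader and the method recordsFor are likewise hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical reader for the XML data file format shown above. */
final class DataFileReader {
    /** Returns one column-name-to-value map per <Record> whose TableName matches. */
    static List<Map<String, String>> recordsFor(String xml, String tableName) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            // Parser configuration, I/O, and SAX errors are wrapped for brevity.
            throw new IllegalArgumentException("Could not parse data file", e);
        }
        List<Map<String, String>> result = new ArrayList<>();
        NodeList records = doc.getElementsByTagName("Record");
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            if (!tableName.equals(record.getAttribute("TableName"))) {
                continue;
            }
            Map<String, String> columns = new LinkedHashMap<>();
            NodeList children = record.getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    // getTextContent also returns the text held in CDATA sections.
                    columns.put(child.getNodeName(), child.getTextContent());
                }
            }
            result.add(columns);
        }
        return result;
    }

    public static void main(String[] args) {
        // The <Data> root element is an assumption of this sketch.
        String xml = "<Data>"
                + "<Record TableName=\"drg\">"
                + "<Diagnosis_Related_Group>1</Diagnosis_Related_Group>"
                + "<Description><![CDATA[CRANIOTOMY, AGE >17 EXCEPT FOR TRAUMA]]></Description>"
                + "</Record>"
                + "<Record TableName=\"patient\"><ID>1</ID></Record>"
                + "</Data>";
        System.out.println(recordsFor(xml, "drg"));
    }
}
```

Keeping the values as strings leaves type conversion to the column type declared in the meta-data.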

Sample XML Meta-Data

<target name="admits" source="patient" driver="org.gjt.mm.mysql.Driver" url="jdbc:mysql://localhost:3306/test" username="test" password="">
  <column name="id" type="number" key="yes" source="ID"></column>
  <column name="facility" type="number" key="yes" source="Facility"></column>
  <column name="age" type="number" source="Age"></column>
  <column name="gender" type="number" source="Sex"></column>
  <column name="ethnicity" type="number" source="Ethnicity"></column>
  <column name="race" type="number" source="Race"></column>
  <column name="length_of_stay" type="number" source="Length_Of_Stay"></column>
  <column name="day_admitted" type="number" source="day_admitted"></column>

Thank You