XML Web Services for Data Mining and Repository: US EPA Toxics Release Inventory

Brand L. Niemann, Computer Scientist and XML and Web Services Specialist, US Environmental Protection Agency

Data Mining Technology for Military and Government Applications, Hotel Watergate, February 25-26, 2003

Disclaimer: Any reference to or depiction of the commercial product of any vendor is for illustrative purposes only and does not constitute an endorsement by EPA or the author.

XML Web Services in Support of e-Gov and the EPA Geospatial




Abstract

• The eXtensible Markup Language (XML) promotes information sharing and reuse and enables enterprise integration in XML repositories. The Toxics Release Inventory (TRI), published by the U.S. EPA, is a valuable source of information regarding toxic chemicals that are being used, manufactured, treated, transported, or released into the environment. The TRI database is about 8 GB and requires industrial-strength tools and analyses for data mining, indexing, conversion to XML, and storage and retrieval with XML Web Services. This pilot demonstrated that large EPA databases can be “data mined” and repurposed into XML repositories.


Overview

• 1. The eXtensible Markup Language (XML)

• 2. The Toxics Release Inventory (TRI)

• 3. Tools and Analyses

• 4. Questions and Answers


1. The eXtensible Markup Language (XML)

• Five years ago, the World Wide Web Consortium (W3C) published XML 1.0 as a Recommendation on February 10, 1998.

• The eXtensible Markup Language (XML) has become pervasive nearly everywhere that information is managed and has changed not only the way people publish documents on the Web but also the way people manage information internal to their enterprise.

• XML has emerged as the standard platform for convergence of information.

See http://www.w3.org/2003/02/xml-at-5.html


2. The Toxics Release Inventory (TRI)

• 2.1 Background

• 2.2 Web PDF and CD-ROM Raw DAT

• 2.3 Strategy:

– Data mine large DAT files.

– Index Web documents for searching.

– Convert large DAT files to XML.

– Store, search, and retrieve XML from a repository.
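The conversion step in this strategy can be sketched in a few lines. This is a minimal illustration, not the tooling used in the pilot: it assumes a DAT file is delimited text with a header row, and the delimiter, column names, and sample values below are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def dat_to_xml(dat_text, record_tag, delimiter="\t"):
    """Convert delimited text (header row plus records) into an XML element tree."""
    root = ET.Element(record_tag + "_LIST")
    reader = csv.DictReader(io.StringIO(dat_text), delimiter=delimiter)
    for row in reader:
        record = ET.SubElement(root, record_tag)
        for field, value in row.items():
            # One child element per column, named after the header field.
            ET.SubElement(record, field).text = value
    return root


# Hypothetical two-record sample; real TRI column names and layout may differ.
SAMPLE_DAT = (
    "FACILITY_ID\tCHEMICAL\tRELEASE_LBS\n"
    "F001\tTOLUENE\t1200\n"
    "F002\tLEAD\t35\n"
)

tree_root = dat_to_xml(SAMPLE_DAT, "TRI_RELEASE_QTY")
xml_text = ET.tostring(tree_root, encoding="unicode")
print(xml_text)
```

For files of the sizes listed in Section 2.2, writing records out as they are read would likely be preferable to building the whole tree in memory.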


2. The Toxics Release Inventory (TRI)

• 2.1 Background:

– The Toxics Release Inventory (TRI), published by the U.S. EPA, is a valuable source of information regarding toxic chemicals that are being used, manufactured, treated, transported, or released into the environment.

– Two statutes, Section 313 of the Emergency Planning and Community Right-To-Know Act (EPCRA) and section 6607 of the Pollution Prevention Act (PPA), mandate that a publicly accessible toxic chemical database be developed and maintained by US EPA. This database, known as the Toxics Release Inventory (TRI), contains information concerning waste management activities and the release of toxic chemicals by facilities that manufacture, process, or otherwise use said materials. Using this information, citizens, businesses, and governments can work together to protect the quality of their land, air and water.


2. The Toxics Release Inventory (TRI)

• 2.2 Web PDF (121 files/14.2 MB):

– 2000 TRI Executive Summary -- a short overview of the 2000 TRI data that provides a national overview of reporting trends including summary tables and charts. (PDF Format, 351KB)

– Press Materials -- information that is provided to the press to quickly understand the TRI data, including a background document, charts, and tables. (45 PDF Format, 423KB)

– TRI Overview -- a general overview of the TRI Program, factors to consider in using the TRI data, and the scope of the data. (PDF Format, 93KB)

– Q&A's -- general and specific questions relating to the 2000 TRI data and certain data trends. (PDF Format, 50KB)

– Public Data Release Report Archive -- access to past TRI Public Data Release reports (64 PDF Format, 8.3 MB)

http://epa.gov/tri/tridata/tri00/index.htm


2. The Toxics Release Inventory (TRI)

• 2.2 Web EXE:

– File Type 1: Facility, Chemical, Releases and Other Waste Management Summary Information. This file contains facility information (Part I on Form R and Form A) as well as most chemical information (Part II on Form R and Form A). Data elements are reported individually. The information is also disaggregated based on Waste Management code (i.e., "M" code), and aggregated up to On-site Releases, Off-site Releases, Other On-site Waste Management, and Transfers Off-site for Further Waste Management categories. (84,079 records)

– File Type 2: Detailed Waste Management and Source Reduction Activities. This file contains facility information (Part I on Form R and Form A) as well as the detailed information regarding source reduction and recycling activities (Part II, Section 8 on Form R) and on-site waste treatment methods (Part II, Section 7 on Form R). (84,079 records)

– File Type 3A: Details of Transfers Off-site. This file contains facility information (Part I on Form R and Form A) as well as details of individual transfers off-site (Part II, Section 6.2 on Form R). (100,033 records)

– File Type 3B: Details of Transfers to Publicly Owned Treatment Works (POTW). This file contains facility information (Part I on Form R and Form A) as well as a list of POTWs (Part II, Section 6.1.B on Form R). (84,079 records)

http://epa.gov/tri/tridata/tri00/data/index.htm


2. The Toxics Release Inventory (TRI)

• 2.2 CD-ROM Raw DAT:

– public_2000 14 files/246.2 MB:

• TRI_CHEM_ACTIVITY 2.2 MB
• TRI_ENERGY_RECOVERY 1.4 MB
• TRI_OFF_SITE_TRANSFER_LOCATION 33.6 MB
• TRI_ONSITE_WASTE_TREATMENT_MET 7.3 MB
• TRI_ONSITE_WASTESTREAM 4.1 MB
• TRI_POTW_LOCATION 15.5 MB
• TRI_RECYCLING_PROCESS 1.4 MB
• TRI_RELEASE_QTY 43.5 MB
• TRI_REPORTING_FORM 52.6 MB
• TRI_SOURCE_REDUCT_METHOD 3.8 MB
• TRI_SOURCE_REDUCT_QTY 49.2 MB
• TRI_SUBMISSION_SIC 5.7 MB
• TRI_TRANSFER_QTY 18.0 MB
• TRI_WATER_STREAM 7.9 MB
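As a quick consistency check on the public_2000 listing, the per-file sizes can be summed and compared with the stated total. The sketch below only restates the numbers from the slide; the dictionary name is an illustrative choice.

```python
# Sizes in MB, copied from the public_2000 listing on this slide.
PUBLIC_2000_MB = {
    "TRI_CHEM_ACTIVITY": 2.2,
    "TRI_ENERGY_RECOVERY": 1.4,
    "TRI_OFF_SITE_TRANSFER_LOCATION": 33.6,
    "TRI_ONSITE_WASTE_TREATMENT_MET": 7.3,
    "TRI_ONSITE_WASTESTREAM": 4.1,
    "TRI_POTW_LOCATION": 15.5,
    "TRI_RECYCLING_PROCESS": 1.4,
    "TRI_RELEASE_QTY": 43.5,
    "TRI_REPORTING_FORM": 52.6,
    "TRI_SOURCE_REDUCT_METHOD": 3.8,
    "TRI_SOURCE_REDUCT_QTY": 49.2,
    "TRI_SUBMISSION_SIC": 5.7,
    "TRI_TRANSFER_QTY": 18.0,
    "TRI_WATER_STREAM": 7.9,
}

file_count = len(PUBLIC_2000_MB)
# Round to one decimal place to absorb floating-point noise in the sum.
total_mb = round(sum(PUBLIC_2000_MB.values()), 1)
largest_file = max(PUBLIC_2000_MB, key=PUBLIC_2000_MB.get)

print(file_count, "files,", total_mb, "MB; largest:", largest_file)
```

The sum agrees with the 14 files/246.2 MB stated in the listing, and three files (TRI_REPORTING_FORM, TRI_SOURCE_REDUCT_QTY, TRI_RELEASE_QTY) account for more than half of it.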


2. The Toxics Release Inventory (TRI)

• 2.2 CD-ROM Raw DAT (continued):

– Internal 30 files/4.4 GB:

• TRI_CHEM_ACTIVITY 35.2 MB
• tri_chem_info 75 KB
• TRI_CODE_DESC 570 KB
• TRI_COUNTY 1.5 MB
• TRI_ENERGY_RECOVERY 16.0 MB
• TRI_FACILITY 31.6 MB
• TRI_FACILITY_DB 1.7 MB
• TRI_FACILITY_DB_HISTORY 6.5 MB
• TRI_FACILITY_HISTORY 124 MB
• TRI_FACILITY_NPDES 1.1 MB
• TRI_FACILITY_NPDES_HISTORY 4.2 MB
• TRI_FACILITY_RCRA 1.8 MB
• TRI_FACILITY_RCRA_HISTORY 7.1 MB
• TRI_FACILITY_SIC 1.6 MB
• TRI_FACILITY_SIC_HISTORY 6.8 MB


2. The Toxics Release Inventory (TRI)

• 2.2 CD-ROM Raw DAT (continued):

– Internal 30 files/4.4 GB (continued):

• TRI_FACILITY_UIC 789 KB
• TRI_FACILITY_UIC_HISTORY 2.1 MB
• TRI_OFF_SITE_TRANSFER_LOCATION 603 MB
• TRI_ONSITE_WASTE_TREATMENT_MET 106 MB
• TRI_ONSITE_WASTESTREAM 73.1 MB
• TRI_POTW_LOCATION 216 MB
• TRI_RECYCLING_PROCESS 15.2 MB
• TRI_RELEASE_QTY 773 MB
• TRI_REPORTING_FORM 1.1 GB
• TRI_SOURCE_REDUCT_METHOD 41.8 MB
• TRI_SOURCE_REDUCT_QTY 800 MB
• TRI_SUBMISSION_SIC 106 MB
• TRI_TRANSFER_QTY 248 MB
• TRI_WATER_STREAM 167 MB
• TRI_ZIP_CODE 4.5 MB


2. The Toxics Release Inventory (TRI)

• 2.2 CD-ROM Raw DAT (continued):

– public_87_99 30 files/3.3 GB:

• TRI_CHEM_ACTIVITY 28.5 MB
• tri_chem_info 75 KB
• TRI_CODE_DESC 570 KB
• TRI_COUNTY 1.5 MB
• TRI_ENERGY_RECOVERY 11.8 MB
• TRI_FACILITY 28.5 MB
• TRI_FACILITY_DB 1.7 MB
• TRI_FACILITY_DB_HISTORY 6.5 MB
• TRI_FACILITY_HISTORY 111 MB
• TRI_FACILITY_NPDES 1.1 MB
• TRI_FACILITY_NPDES_HISTORY 4.2 MB
• TRI_FACILITY_RCRA 1.8 MB
• TRI_FACILITY_RCRA_HISTORY 7.1 MB
• TRI_FACILITY_SIC 1.6 MB
• TRI_FACILITY_SIC_HISTORY 6.8 MB


2. The Toxics Release Inventory (TRI)

• 2.2 CD-ROM Raw DAT (continued):

– public_87_99 30 files/3.3 GB (continued):

• TRI_FACILITY_UIC 789 KB
• TRI_FACILITY_UIC_HISTORY 2.1 MB
• TRI_OFF_SITE_TRANSFER_LOCATION 497 MB
• TRI_ONSITE_WASTE_TREATMENT_MET 81.9 MB
• TRI_ONSITE_WASTESTREAM 59.6 MB
• TRI_POTW_LOCATION 177 MB
• TRI_RECYCLING_PROCESS 11.0 MB
• TRI_RELEASE_QTY 636 MB
• TRI_REPORTING_FORM 622 MB
• TRI_SOURCE_REDUCT_METHOD 31.0 MB
• TRI_SOURCE_REDUCT_QTY 646 MB
• TRI_SUBMISSION_SIC 86.1 MB
• TRI_TRANSFER_QTY 200 MB
• TRI_WATER_STREAM 142 MB
• TRI_ZIP_CODE 4.5 MB
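The abstract describes the TRI database as about 8 GB. Assuming that figure refers to the three CD-ROM collections listed in Section 2.2, and treating 1 GB as 1000 MB, the stated collection sizes can be added up as a rough cross-check:

```python
# Collection sizes as stated on the slides, converted to GB (1 GB = 1000 MB assumed).
COLLECTIONS_GB = {
    "public_2000": 246.2 / 1000,
    "Internal": 4.4,
    "public_87_99": 3.3,
}

combined_gb = sum(COLLECTIONS_GB.values())
print(f"{combined_gb:.1f} GB across {len(COLLECTIONS_GB)} collections")
```

The combined size comes to just under 8 GB, in line with the abstract.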


3. Tools and Analyses

• 3.1 Insightful’s I-Miner

• 3.2 NextPage’s NXT 3

• 3.3 FileMaker’s FileMaker Pro 6

• 3.4 SoftwareAG’s Tamino 4.1


3.1 Insightful’s I-Miner

• 3.1.1 I-Miner Supports Each Step of the Data Mining Life Cycle.

• 3.1.2 I-Miner Components.

• 3.1.3 I-Miner on TRI 2000 Public Release Data.

• 3.1.4 XML Features


3.1 Insightful’s I-Miner

• 3.1.1 I-Miner Supports Each Step of the Data Mining Life Cycle (diagram not reproduced):

http://www.insightful.com/products/product.asp?PID=26


3.1 Insightful’s I-Miner

• 3.1.2 I-Miner Components:

– Full life-cycle data mining workbench, from data access through deployment.

– Scalable, extensible and affordable toolset that enables both new data miners and skilled modelers to solve their toughest analytic challenges with a best-of-breed approach.

– Advanced pipeline architecture and analytics are built to scale to the data size problems of today and far into the future.

– Embedded data analysis language that allows it to adapt to the changing business needs of its users.


3.1 Insightful’s I-Miner

3.1.3 I-Miner on TRI 2000 Public Release Data

Rationale: Toxic chemical releases to different media should be correlated; outliers suggest a need to follow up with the reporting facilities.
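The rationale can be illustrated with a small sketch. It is not the I-Miner workflow shown on the slides: it fits an ordinary least-squares line between releases to two media and flags facilities whose residual exceeds a multiple of the residual spread. The facility IDs, quantities, and the threshold of 2.0 are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical (air_lbs, water_lbs) releases per facility; not real TRI data.
RELEASES = {
    "F001": (100.0, 52.0),
    "F002": (200.0, 98.0),
    "F003": (300.0, 151.0),
    "F004": (400.0, 199.0),
    "F005": (500.0, 251.0),
    "F006": (600.0, 900.0),  # deliberately inconsistent with the others
    "F007": (700.0, 349.0),
    "F008": (800.0, 402.0),
}


def flag_outliers(data, threshold=2.0):
    """Return facility ids whose residual from a least-squares fit is unusually large."""
    xs = [x for x, _ in data.values()]
    ys = [y for _, y in data.values()]
    x_bar, y_bar = mean(xs), mean(ys)
    # Ordinary least-squares slope and intercept for y = intercept + slope * x.
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    residuals = {fid: y - (intercept + slope * x) for fid, (x, y) in data.items()}
    spread = pstdev(list(residuals.values()))
    return [fid for fid, r in residuals.items() if abs(r) > threshold * spread]


flagged = flag_outliers(RELEASES)
print(flagged)
```

Because a single extreme value also pulls the fitted line toward itself, a production analysis would probably use a more robust fit; the sketch only shows the idea of treating departures from the expected correlation as follow-up candidates.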


3.1 Insightful’s I-Miner

• 3.1.4 XML Features:

– Improving the Effectiveness of Statistical Computing, John Chambers, Lucent Technologies, 2001 S-PLUS International User Conference, October 18, 2001:

• We need to import other software effectively, and export our software to new users.

• Inter-system interfaces using an object-based model can contribute to both efforts, using our time efficiently.

• Object-oriented programming is a useful extension to the S language.

• Other efforts, such as data standardization using tools such as XML, support the same goals.

– See http://www.omegahat.org


3.2 NextPage’s NXT 3

• NextPage NXT 3.4 - Integration of content navigation and searching with XML:

– Proprietary (Word, PDF, etc.) to XIL*.

– Native XML to XIL.

– Relational to XML and then XIL.

– Unstructured (HTML, Text, etc.) to XIL.

*eXtensible Indexing Language (XIL) - NextPage uses the standard Simple Object Access Protocol (SOAP) to exchange and normalize information between local content directories, assembling meta-indexes (XIL based on XSLT) so that users can search or manipulate content transparently, regardless of physical location.

http://www.nextpage.com/section.asp?f=toc&section=Products&path=Products/products/nxt3


3.2 NextPage’s NXT 3

• Directory Structure: PDF files in hierarchical folders.

• NXT 3 Content Network Manager: Build indexed collection of files.


3.2 NextPage’s NXT 3

• Advanced Search for: “toxic chemicals”

• Search Results: Hits highlighted within short list of words in over 500 PDF files.
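The index-then-search pattern on these slides can be sketched generically. The code below is not NextPage's XIL or its API: it is a plain inverted index over already-extracted text, followed by a phrase check that brackets each hit. The file paths and document text are hypothetical stand-ins for the indexed PDF collection.

```python
import re
from collections import defaultdict

# Hypothetical extracted text keyed by file path; the pilot indexed PDF files.
DOCS = {
    "tri00/overview.pdf": "The TRI Program tracks toxic chemicals released by facilities.",
    "tri00/press/background.pdf": "Background on reporting trends for 2000.",
    "tri00/qa.pdf": "Questions about toxic releases and chemicals in the 2000 data.",
}


def tokenize(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for path, text in docs.items():
        for word in tokenize(text):
            index[word].add(path)
    return index


def search_phrase(docs, index, phrase):
    """Return {path: text with the phrase bracketed} for documents containing the phrase."""
    words = tokenize(phrase)
    if not words:
        return {}
    # The index narrows the candidates to documents containing every word.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    pattern = re.compile(
        r"\b" + r"\W+".join(re.escape(w) for w in words) + r"\b", re.IGNORECASE
    )
    hits = {}
    for path in sorted(candidates):
        # Confirm the words appear as an adjacent phrase, then highlight them.
        if pattern.search(docs[path]):
            hits[path] = pattern.sub(lambda m: "[" + m.group(0) + "]", docs[path])
    return hits


word_index = build_index(DOCS)
phrase_hits = search_phrase(DOCS, word_index, "toxic chemicals")
for path, highlighted in phrase_hits.items():
    print(path, "->", highlighted)
```

The index lookup keeps the expensive phrase check off documents that cannot match, which is the property that matters when the collection grows to hundreds of files.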


3.3 FileMaker’s FileMaker Pro 6

• Most data is in relational databases, so they need to be XML-enabled:

– Main players: Oracle, SQL Server, DB2, Sybase, Access, Objectivity, FileMaker*, and FoxPro.

• Native XML databases are beginning to come on strong:

– Main players: eXcelon and Tamino* (commercial) and Xindice, eXist, 4Suite, and ozone (Open Source).

• Also middleware products that transfer data to/from relational databases:

– Main players: JAXB, .NET, Delphi, and WebSphere (commercial) and Castor, JXQuick, Zeus, and Zope (Open Source).

*Used in this paper.

Source: Ronald Bourret, XML and Databases, XML 2002 Conference Tutorial, December 9th. http://www.rpbourret.com
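What "transferring to a relational database" means in practice can be shown with a small sketch. It does not use any of the products named above; it uses Python's standard sqlite3 and ElementTree modules, and the element names, table name, and values are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML document; element names are illustrative, not a TRI schema.
XML_TEXT = """
<releases>
  <release><facility_id>F001</facility_id><chemical>TOLUENE</chemical><pounds>1200</pounds></release>
  <release><facility_id>F002</facility_id><chemical>LEAD</chemical><pounds>35</pounds></release>
  <release><facility_id>F001</facility_id><chemical>LEAD</chemical><pounds>10</pounds></release>
</releases>
"""


def load_releases(conn, xml_text):
    """Map each <release> element to one row of a relational table."""
    conn.execute(
        "CREATE TABLE tri_release (facility_id TEXT, chemical TEXT, pounds REAL)"
    )
    rows = [
        (r.findtext("facility_id"), r.findtext("chemical"), float(r.findtext("pounds")))
        for r in ET.fromstring(xml_text.strip()).iter("release")
    ]
    conn.executemany("INSERT INTO tri_release VALUES (?, ?, ?)", rows)
    return len(rows)


db = sqlite3.connect(":memory:")
loaded = load_releases(db, XML_TEXT)
# Once the data is relational, ordinary SQL aggregation applies.
totals = db.execute(
    "SELECT chemical, SUM(pounds) FROM tri_release GROUP BY chemical ORDER BY chemical"
).fetchall()
print(loaded, totals)
```

This flat mapping works because every record has the same shape; documents with nested or optional structure are where native XML storage becomes attractive.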


3.3 FileMaker’s FileMaker Pro 6

• Leading workgroup database software (8.5 million units shipped worldwide); runs cross-platform on Windows and Mac, and on Linux (FileMaker Server only).

• XML import/export and free XSLT Stylesheet library.

• Integrate with major desktop applications and with XML-based instant messaging platforms (e.g. ServiceObjects - Web sites that contain real-time data drawn from different sources or systems that automatically update as the data changes) and deliver content to a wide range of devices.

• Serve data-centric Web pages at faster speed with less load on the server in XML and/or HTML format that is more customizable with stylesheets or JavaScript.

http://www.filemaker.com/xml/


3.4 SoftwareAG’s Tamino 4.1

• Why store XML?

– Single source publishing.

– Effective searching.

– eBusiness messages.

– XML-driven Web sites.

– Web Services*.

– Mobile Communication devices.

– Office 11

*Longer message life-cycles – moving from simple service invocations to long-running stateful interactions and the need for message management.

Source: Mike Champion and Steve Hamby, Native XML Database Applications Development, XML 2002 Conference Tutorial, December 9th.


3.4 SoftwareAG’s Tamino 4.1

• Native XML storage.

• Store for any type of data.

• Extensible by definition.

• Consolidates data from various sources in one place.

• Find-Engine for fast retrieval of XML-based content.

• Built-in full-text retrieval at no extra cost.

• Multi-channel output formatting capabilities.

• Server extensions for custom functionality and application integration.
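The store, structured-query, and full-text capabilities in this list can be illustrated with a toy in-memory stand-in. It is not Tamino's API or query language: documents are kept as parsed trees and queried with the limited path syntax of Python's ElementTree. The class name, document ids, and content are hypothetical.

```python
import xml.etree.ElementTree as ET


class XmlStore:
    """Toy in-memory document store; a stand-in for a native XML database."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, xml_text):
        """Store a document as-is, without mapping it to tables first."""
        self._docs[doc_id] = ET.fromstring(xml_text)

    def query(self, path):
        """Return (doc_id, element) pairs matching an ElementTree path expression."""
        return [
            (doc_id, element)
            for doc_id, root in self._docs.items()
            for element in root.findall(path)
        ]

    def full_text(self, word):
        """Return ids of documents whose text content contains the word."""
        needle = word.lower()
        return [
            doc_id
            for doc_id, root in self._docs.items()
            if needle in " ".join(root.itertext()).lower()
        ]


store = XmlStore()
store.put(
    "form-1",
    "<form><facility>F001</facility>"
    "<release><chemical>LEAD</chemical><pounds>10</pounds></release></form>",
)
store.put(
    "form-2",
    "<form><facility>F002</facility>"
    "<release><chemical>TOLUENE</chemical><pounds>1200</pounds></release></form>",
)

# Structured query: pounds reported for LEAD releases, in any stored document.
lead_matches = store.query("./release[chemical='LEAD']/pounds")
# Full-text query: documents mentioning a word anywhere in their content.
toluene_docs = store.full_text("toluene")
print([(doc_id, el.text) for doc_id, el in lead_matches], toluene_docs)
```

A real native XML database adds persistence, indexing, and a full query language on top of this idea; the sketch only shows that documents are stored and queried in their XML form.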


3.4 SoftwareAG’s Tamino 4.1

http://www.softwareag.com/tamino/


4. Questions and Answers

• Brand Niemann, Ph.D.:

– Computer Scientist and XML and Web Services Specialist, Office of Environmental Information, US Environmental Protection Agency:

• 202-566-1657
• [email protected]
• http://www.sdi.gov

– Chair, Federal Chief Information Officer Council’s XML Web Services Working Group:

• [email protected]
• http://web-services.gov