DATA CENTER - Massachusetts Institute of Technology · 2013-10-30 · kbml lacito landxml ledes...

DATA CENTER AN MIT RESEARCH PROGRAM

David Brock The Data Center Program Massachusetts Institute of Technology

Optimizing Resources through Connected Data Systems

Integrating Information in Industry and Government

Data Explosion

Data doubles every 18 months 1

1. Gantz, John and Reinsel, David, “As the Economy Contracts, the Digital Universe Expands,” IDC – Multimedia White Paper, May 2009. (http://www.emc.com/digital_universe)

[Chart: data volume in exabytes by year, 2009 to 2013; vertical axis from 0 to 2,500 exabytes]

Data Explosion

The world's digital content is equivalent to a stack of books stretching from the Earth to Pluto and back. 1

Data Explosion

30 Tons

If we stick with the book analogy, then the digital universe in 2010 is equivalent to 30 tons of books for every man, woman and child on the Earth. 1

1. Gantz, John and Reinsel, David, “As the Economy Contracts, the Digital Universe Expands,” IDC – Multimedia White Paper, May 2009. (http://www.emc.com/digital_universe)

Internet Usage

Online banking: 75% of US customers use online banking 1

Electronic commerce: $32.4 billion 2

Social networking: 500 million active users on Facebook 1

1. ComScore.com

2. U.S. Census Bureau's Quarterly Retail E-Commerce Sales, 2nd Quarter 2009

Academic Research

“. . . people were drowning in scholarly information, and drowning in information in general. So it takes twice as much time for people to begin their research.” - Damon Zucca, Oxford University Press, Executive Editor

Next Generation

•  The average teen texter exchanges 1,500 texts each month, with an average of 50 per day. 1

•  21% of teenagers access the Internet only from their cell phones.

1. Pew Research Center's Internet & American Life Project.

Machine Understanding

Yet almost none of this information is understandable by a computer.

Extensible Markup Language

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Twilight</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
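A minimal sketch of what a program can do with the bookstore document, using Python's standard `xml.etree.ElementTree`. The function name `read_books` is ours, not part of the presentation. The parser recovers the structure (elements, attributes, text), but nothing in the document tells it what "price" means or which currency applies.

```python
import xml.etree.ElementTree as ET

# The bookstore document from the slide, as bytes so the declared
# ISO-8859-1 encoding is honoured by the parser.
BOOKSTORE_XML = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Twilight</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""


def read_books(xml_bytes):
    """Return one dict per <book>; read_books is a hypothetical helper name."""
    root = ET.fromstring(xml_bytes)
    books = []
    for book in root.findall("book"):
        title = book.find("title")
        books.append({
            "title": title.text,
            "lang": title.get("lang"),
            # The number is recoverable; its unit (currency) is not stated.
            "price": float(book.findtext("price")),
        })
    return books


books = read_books(BOOKSTORE_XML)
for entry in books:
    print(entry)
```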

4ML AML AML AML AML AML AML ABML ABML ACML ACML ACAP ACS X12 ADML AECM AFML AGML AHML AIML AIML AIF AL3 ANML ANNOTEA ANATML APML APPML AQL APPEL ARML ARML ASML ASML ASTM ARML ARML ASML

ARML ARML ASML ASML ASTM ATML ATML ATML ATML AWML AXML AXML AXML AXML BML BML BML BML BML BML BannerML BCXML BEEP BGML BHTML BIBLIOML BIOML BIPS BizCodes BLM XML BPML BRML BSML BCXML BEEP BGML BHTML

BiblioML BCXML BEEP BGML BHTML BIBLIOML BIOML BIPS BizCodes BLM XML BPML BRML BSML CML xCML CaXML CaseXML xCBL CBML CDA CDF CDISC CELLML ChessGML ChordML ChordQL CIM CIML CIDS CIDX xCIL CLT CNRP ComicsML CIM CIML CIDS

CIDX xCIL CLT CNRP ComicsML Covad xLink CPL CP eXchange CSS CVML CWMI CycML DML DAML DaliML DaqXML DAS DASL DCMI DOI DeltaV DIG35 DLML DMML DocBook DocScope DoD XML DPRL DRI DSML DSD DXS EML EML DLML EAD ebXML

eBIS-XML ECML eCo EcoKnow edaXML EMSA eosML ESML ETD-ML FieldML FINML FITS FIXML FLBC FLOWML FPML FSML GML GML GML GXML GAME GBXML GDML GEML GEDML GEN GeoLang GIML GXD GXL Hy XM HITIS HR-XML HRMML HTML HTTPL

HTTP-DRP HumanML HyTime IML ICML IDE IDML IDWG IEEE DTD IFX IMPP IMS Global InTML IOTP IRML IXML IXRetail JabberXML JDF JDox JECMM JLife JSML JSML JScoreML KBML LACITO LandXML LEDES LegalXML Life Data LitML LMML LogML LogML LTSC XML MAML

MatML MathML MBAM MISML MCF MDDL MDSI-XML Metarule MFDX MIX MMLL MML MML MML MoDL MOS MPML MPXML MRML MSAML MTML MTML MusicXML NAML xNAL NAA Ads Navy DTD NewsML NML NISO DTB NITF NLMXML NVML OAGIS OBI OCF ODF

ODRL OeBPS OFX OIL OIM OLifE OML ONIX DTD OOPML OPML OpenMath Office XML OPML OPX OSD OTA PML PML PML PML PML PML PML PML P3P PDML PDX PEF XML PetroML PGML PhysicsML PICS PMML PNML PNML PNG PrintML

PrintTalk ProductionML PSL PSI QML QAML QuickData RBAC RDDL RDF RDL RecipeML RELAX RELAX NG REXML REPML ResumeXML RETML RFML RightsLang RIXML RoadmOPS RosettaNet PIP RSS RuleML SML SML SML SML SAML SABLE SAE J2008 SBML Schematron SDML SearchDM-XML SGML

SHOE SIF SMML SMBXML SMDL SDML SMIL SOAP SODL SOX SPML SpeechML SSML STML STEP STEPML SVG SWAP SWMS SyncML TML TML TML TalkML TaxML TDL TDML TEI ThML TIM TIM TMML TMX TP TPAML TREX TxLife

UML UBL UCLP UDDI UDEF UIML ULF UMLS UPnP URI/URL UXF VML vCalendar vCard VCML VHG VIML VISA XML VMML VocML VoiceXML VRML WAP WDDX WebML WebDAV WellML WeldingXML Wf-XML WIDL WITSML WorldOS WSML WSIA XML XML Court XML EDI

XML F XML Key XMLife XML MP XML News XML RPC XML Schema XML Sign XML Query XML P7C XML TP XMLVoc XML XCI XAML XACML XBL XSBEL XBN XBRL XCFF XCES Xchart Xdelta XDF XForms XGF XGL XGMML XHTML XIOP XLF XLIFF XLink XMI XMSG XMTP XNS

Standards

Thousands of incompatible standards make interoperability practically impossible!

<alert xmlns="http://www.incident.com/cap/1.1">
  <identifier>23e090317621</identifier>
  <sender>hsas@dhs.gov</sender>
  <sent>2007-10-09T14:39:01-05:00</sent>
  <status>Actual</status>
  <msgType>Alert</msgType>
  <scope>Public</scope>
  <info>
    <category>Security</category>
    <event>Security Update</event>
    <urgency>Past</urgency>
    <severity>Moderate</severity>
    <certainty>Observed</certainty>
    <senderName>Department of Homeland Security</senderName>
    <headline>Suspicious car spotted / Tucson, AZ</headline>
    <description>Blue sedan, license plate number 61SN52, caught on red light camera on intersection of Main St. and Pleasant St.</description>
    <area>
      <areaDesc>Tucson, AZ</areaDesc>
      <circle>32.1995,-110.8925 0</circle>
    </area>
  </info>
</alert>

Common Alerting Protocol (CAP)

The Common Alerting Protocol (CAP) is an XML-based data format for exchanging public warnings and emergencies between alerting technologies. CAP implementations have been demonstrated by agencies and companies including: United States Department of Homeland Security; National Weather Service; United States Geological Survey; California Office of Emergency Services; and many others.

CAP is the foundation technology for the proposed "Integrated Public Alert and Warning System," an all-hazard, all-media national warning architecture being developed by DHS, the National Weather Service and the Federal Communications Commission.
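As a rough illustration of reading a CAP message, the sketch below parses a shortened copy of the alert with Python's standard `xml.etree.ElementTree`. The namespace URI is copied from the slide; the helper name `summarize_alert` and the choice of fields are our assumptions. Note that every tag name must be known in advance: the code works only because a person read the CAP vocabulary and hard-coded it.

```python
import xml.etree.ElementTree as ET

# Namespace URI exactly as it appears in the slide's example.
CAP_NS = {"cap": "http://www.incident.com/cap/1.1"}

# Shortened copy of the alert shown on the slide.
ALERT_XML = """<alert xmlns="http://www.incident.com/cap/1.1">
  <identifier>23e090317621</identifier>
  <sender>hsas@dhs.gov</sender>
  <status>Actual</status>
  <msgType>Alert</msgType>
  <scope>Public</scope>
  <info>
    <category>Security</category>
    <event>Security Update</event>
    <severity>Moderate</severity>
    <headline>Suspicious car spotted / Tucson, AZ</headline>
    <area>
      <areaDesc>Tucson, AZ</areaDesc>
      <circle>32.1995,-110.8925 0</circle>
    </area>
  </info>
</alert>"""


def summarize_alert(xml_text):
    """Pull a few fields out of a CAP alert (hypothetical helper)."""
    root = ET.fromstring(xml_text)

    def text(path):
        # findtext returns the default when the element is missing.
        return root.findtext(path, default="", namespaces=CAP_NS).strip()

    return {
        "sender": text("cap:sender"),
        "msgType": text("cap:msgType"),
        "severity": text("cap:info/cap:severity"),
        "headline": text("cap:info/cap:headline"),
        "area": text("cap:info/cap:area/cap:areaDesc"),
    }


summary = summarize_alert(ALERT_XML)
print(summary)
```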

Compound Tag Names

<senderName>Department of Homeland Security</senderName>

Abbreviations

<msgType>Alert</msgType>

Embedded Data Structures

<circle>32.1995,-110.8925 0</circle>
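The `<circle>` value packs three numbers into one string with two different separators, so XML parsing alone does not expose them. A minimal sketch of the extra, format-specific parsing step a consumer has to write follows; the names `Circle` and `parse_circle` are ours, and the unit of the radius is not stated on the slide.

```python
from typing import NamedTuple


class Circle(NamedTuple):
    latitude: float
    longitude: float
    radius: float  # unit is not given on the slide


def parse_circle(text: str) -> Circle:
    """Split 'lat,lon radius' into its three numeric parts."""
    point, radius = text.split()   # whitespace separates point and radius
    lat, lon = point.split(",")    # a comma separates latitude and longitude
    return Circle(float(lat), float(lon), float(radius))


circle = parse_circle("32.1995,-110.8925 0")
print(circle)
```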

Natural Language

<headline>Suspicious car spotted / Tucson, AZ</headline>
<description>Blue sedan, license plate number 61SN52, caught on red light camera on intersection of Main St. and Pleasant St.</description>

<Event xmlns:gml="http://www.opengis.net/gml/3.2" id="ForestFire">
  <Descriptor>Forest Fire</Descriptor>
  <Identifier label="Type">Pine Forest</Identifier>
  <SimpleProperty label="Containment">5%</SimpleProperty>
  <Location id="FF-Location">
    <Boundary><Polygon>
      <gml:LinearRing>
        <gml:pos>39.09 -105.91</gml:pos>
        . . .
        <gml:pos>39.09 -105.91</gml:pos>
      </gml:LinearRing>
    </Polygon></Boundary>
  </Location>
  <FireInformation>
    <Type>Pine Forest</Type>
    <Containment>5%</Containment>
    <Cause>Believed natural; arson unlikely</Cause>
  </FireInformation>
  <Conditions>
    <ReportedAt>2008-04-14T18:00:00Z</ReportedAt>
    <Temperature>90.2F</Temperature>
    <Wind>NW 3MPH</Wind>
    <Humidity>15%</Humidity>
    <Pressure>1014.8 hPa</Pressure>
    <Visibility>5 mi</Visibility>
  </Conditions>
</Event>

DoD Universal Core (UCore)

Universal Core (UCore) is a national security and preparedness initiative led jointly by the Department of Defense (DoD), Department of Justice (DOJ), Department of Homeland Security (DHS), and the Office of the Director of National Intelligence (ODNI).

Composite Schema

<gml:LinearRing> <gml:pos>39.09 -105.91</gml:pos> . . . <gml:pos>39.09 -105.91</gml:pos> </gml:LinearRing>

Embedded Units

<Temperature>90.2F</Temperature>
<Wind>NW 3MPH</Wind>
<Humidity>15%</Humidity>
<Pressure>1014.8 hPa</Pressure>
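When the unit is fused into the value, each consumer has to guess the convention. The sketch below is one possible way to separate value and unit for the strings shown here, using a regular expression; the pattern and the name `split_quantity` are assumptions of ours and would not cover every format found in practice.

```python
import re

# Optional leading word (e.g. a wind direction), then a number, then a unit.
_QUANTITY = re.compile(
    r"^\s*(?:(?P<prefix>[A-Za-z]+)\s+)?"
    r"(?P<value>-?\d+(?:\.\d+)?)\s*"
    r"(?P<unit>[A-Za-z%]+)\s*$"
)


def split_quantity(text):
    """Return (prefix, value, unit) for strings such as '90.2F' or 'NW 3MPH'."""
    match = _QUANTITY.match(text)
    if match is None:
        raise ValueError(f"unrecognized quantity: {text!r}")
    return match.group("prefix"), float(match.group("value")), match.group("unit")


for raw in ["90.2F", "NW 3MPH", "15%", "1014.8 hPa"]:
    print(raw, "->", split_quantity(raw))
```

Note how inconsistent the four inputs are: no space before `F`, a direction prefix before the wind speed, a symbol for percent, and a space before `hPa`.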

<current_observation version="1.0">
  <location>Boston, Logan International Airport, MA</location>
  <latitude>42.38</latitude>
  <longitude>-71.03</longitude>
  <observation_time>Jul 25 2009, 3:54 pm EDT</observation_time>
  <weather>Partly Cloudy</weather>
  <temp_f>84.0</temp_f>
  <temp_c>28.9</temp_c>
  <relative_humidity>44</relative_humidity>
  <wind_dir>West</wind_dir>
  <wind_degrees>270</wind_degrees>
  <wind_mph>11.5</wind_mph>
  <wind_kt>10</wind_kt>
  <pressure_mb>1013.1</pressure_mb>
  <pressure_in>29.92</pressure_in>
  <dewpoint_f>60.1</dewpoint_f>
  <dewpoint_c>15.6</dewpoint_c>
  <visibility_mi>10.00</visibility_mi>
</current_observation>

National Weather Service (NOAA NWS)

NWS offers hourly weather observations formatted with xml tags to aid in the parsing of the information by automated programs used to populate databases, display information on web pages or other similar applications
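In the NWS feed the unit lives in the tag name (`temp_f`, `temp_c`), so the same quantity appears several times and software must know each suffix. The sketch below splits tag names into quantity and unit and checks that the redundant pairs in the sample agree. The suffix set, the helper names, and the tolerance are our assumptions; the conversion factors are the standard ones (1 mph = 0.868976 kt, 1 mb = 0.02953 inHg).

```python
# Values copied from the sample observation.
OBSERVATION = {
    "temp_f": 84.0, "temp_c": 28.9,
    "wind_mph": 11.5, "wind_kt": 10.0,
    "pressure_mb": 1013.1, "pressure_in": 29.92,
    "dewpoint_f": 60.1, "dewpoint_c": 15.6,
}

# Unit suffixes seen in the sample (an assumed, incomplete list).
UNIT_SUFFIXES = {"f", "c", "mph", "kt", "mb", "in", "mi", "degrees"}


def split_tag(tag):
    """Split 'temp_f' into ('temp', 'f'); tags without a unit keep unit None."""
    quantity, _, suffix = tag.rpartition("_")
    if quantity and suffix in UNIT_SUFFIXES:
        return quantity, suffix
    return tag, None


def f_to_c(fahrenheit):
    return (fahrenheit - 32.0) * 5.0 / 9.0


CONVERSIONS = {
    ("temp_f", "temp_c"): f_to_c,
    ("dewpoint_f", "dewpoint_c"): f_to_c,
    ("wind_mph", "wind_kt"): lambda mph: mph * 0.868976,
    ("pressure_mb", "pressure_in"): lambda mb: mb * 0.0295300,
}


def find_mismatches(obs, tolerance=0.05):
    """Return the tag pairs whose redundant values disagree beyond tolerance."""
    bad = []
    for (source, target), convert in CONVERSIONS.items():
        if abs(convert(obs[source]) - obs[target]) > tolerance:
            bad.append((source, target))
    return bad


print(split_tag("temp_f"), split_tag("wind_dir"))
print(find_mismatches(OBSERVATION))
```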

M Language

Common Vocabulary and Data Format

•  Dictionary – Repository of well-defined terms

•  Ontologies – Connections between terms

•  Grammar – Rules for composing terms into documents and messages

Dictionary

Repository of “concepts”

token: cell

definition: the basic structural and functional unit of all organisms.

Dictionary

Keyed concepts

cell.0 (“cell.biology”) - the basic structural and functional unit of all organisms.

cell.1 (“cell.phone”) - a hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver.

cell.2 (“cell.manufacturing”) - a group of workers and/or machines working together as a team to produce a dedicated set of products or assemblies.
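One way to picture a keyed dictionary is a map from a numbered key to a definition, plus a readable alias for each key. The sketch below is only an illustration of that idea; the `Concept` class, the `senses` and `resolve` helpers, and the storage layout are our assumptions, not the M Language implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    key: str         # numbered key, e.g. "cell.1"
    alias: str       # readable alias, e.g. "cell.phone"
    definition: str


_CONCEPTS = (
    Concept("cell.0", "cell.biology",
            "the basic structural and functional unit of all organisms"),
    Concept("cell.1", "cell.phone",
            "a hand-held mobile radiotelephone"),
    Concept("cell.2", "cell.manufacturing",
            "a group of workers and/or machines working together as a team"),
)

DICTIONARY = {c.key: c for c in _CONCEPTS}
ALIASES = {c.alias: c.key for c in _CONCEPTS}


def senses(token):
    """List every keyed concept that shares the plain token, e.g. 'cell'."""
    return sorted(k for k in DICTIONARY if k.rsplit(".", 1)[0] == token)


def resolve(name):
    """Look up a concept by numbered key or by alias."""
    return DICTIONARY[ALIASES.get(name, name)]


print(senses("cell"))
print(resolve("cell.phone").definition)
```

The plain token "cell" is ambiguous, while each key names exactly one meaning.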

Compose Concepts - Words

cell.0 → cell

cell.0 plural.0 + → cells

cell.0 modifier.0 + → cellular

Compose Concepts - Phrases

Phrases – sequences of concepts

design.0 cell.1 + research.0 + → cellular phone design research

company.0 chemical.0 + description.0 + → chemical company descriptions

annual.0 maximum.0 + temperature.0 + → maximum annual temperature

Compose Concepts – Structured Documents

<Event id="ForestFire"> <Description>Forest Fire</Description> <FireInformation> <Type>Pine Forest</Type> <Containment>5%</Containment> <Cause> Believed natural; arson unlikely </Cause> </FireInformation> <Conditions> <ReportedAt> 2008-04-14T18:00:00Z </ReportedAt> <Temperature>90.2F</Temperature> <Wind>NW 3MPH</Wind> <Humidity>15%</Humidity> <Pressure>1014.8 hPa</Pressure> <Visibility>5 mi</Visibility> </Conditions>

event

forest.trees fire.event

containment.bound

5 percent.unit

cause.effect

weather.conditions

2008-04-14T18:00:00Z

wind.air speed.velocity

northwest.direction

3 mph.unit

wind.air direction.angle

date.time

Believed natural

Ontology

Connections between concepts

iron.0 type-of heavy metal.0

cobalt.0 type-of heavy metal.0

manganese.0 type-of heavy metal.0

mercury.0 type-of heavy metal.0

lead.0 type-of heavy metal.0
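The type-of connections can be sketched as a small directed graph. The code below stores the five edges from the slide and answers two simple questions over them; the function names and the dictionary representation are our assumptions for illustration.

```python
# Edges taken from the slide: child concept -> parent concept.
TYPE_OF = {
    "iron.0": "heavy metal.0",
    "cobalt.0": "heavy metal.0",
    "manganese.0": "heavy metal.0",
    "mercury.0": "heavy metal.0",
    "lead.0": "heavy metal.0",
}


def is_type_of(concept, ancestor):
    """Follow type-of links upward and report whether ancestor is reached."""
    seen = set()
    current = concept
    while current in TYPE_OF and current not in seen:
        seen.add(current)          # guard against accidental cycles
        current = TYPE_OF[current]
        if current == ancestor:
            return True
    return False


def members(ancestor):
    """All concepts that are (directly or indirectly) a type of ancestor."""
    return sorted(c for c in TYPE_OF if is_type_of(c, ancestor))


print(is_type_of("lead.0", "heavy metal.0"))
print(members("heavy metal.0"))
```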

M Language

Challenge

Transform disparate structured and unstructured data into the M Language

Transform Structured and Unstructured Data

<Event id="ForestFire"> <Description>Forest Fire</Description> <FireInformation> <Type>Pine Forest</Type> <Containment>5%</Containment> <Cause> Believed natural; arson unlikely </Cause> </FireInformation> <Conditions> <ReportedAt> 2008-04-14T18:00:00Z </ReportedAt> <Temperature>90.2F</Temperature> <Wind>NW 3MPH</Wind> <Humidity>15%</Humidity> <Pressure>1014.8 hPa</Pressure> <Visibility>5 mi</Visibility> </Conditions>

event

fire information

cause

believed natural

Natural Language Processing

Pre-process Text → Tokens

- load text - spell check - clean white space - remove meta characters - process capital letters - process apostrophes - tokenize

Generic Token Labeler → Labeled Tokens

- adjectives - adverbs - articles - auxiliary verbs - conjunctions - idioms - interrogatives - names (first) - names (last) - nouns - numbers - prepositions - pronouns - punctuation - verbs

Specialized Token Labeler → Labeled Tokens

- times / dates - locations - personalities - organizations - products - numbers - equations / formulas - acronyms / codes

Domain Specific Token Labeler → Labeled Tokens

- organizational labels - proprietary labels

Token Process → Processed Tokens

- compound words - word function culling - token ranking

Token Cluster → Token graph

- noun phrases - verb phrases - prepositional phrases - conjunction - grammatical partitions - tree formation

Meta-data Analysis → Token graph + meta-data

- source analysis - context analysis - tone analysis

Disambiguation → Concept graph

- meta-data - word association - context analysis - concept abstraction - pattern abstraction

Link Process → Concept graph

- modifier references - prepositional phrase references - pronoun references

Knowledge Storage → M Language Knowledge Database

- knowledge storage

M Language Dictionary - generic concepts and representations

M Language Dictionary - specialized concepts and representations

M Language Dictionary - proprietary concepts and representations
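The first two stages of the pipeline can be sketched in a few lines. This is a toy illustration under our own assumptions: the regular expressions, the tiny `GENERIC_LABELS` table, and the function names are hypothetical, spell checking is skipped, and a real labeler would draw on a full dictionary rather than a handful of words.

```python
import re

# Tiny stand-in for a generic dictionary (assumed entries, for illustration).
GENERIC_LABELS = {
    "a": "article", "the": "article",
    "and": "conjunction",
    "of": "preposition", "on": "preposition",
    "was": "auxiliary verb", "is": "auxiliary verb",
}


def preprocess(text):
    """Clean white space, remove meta characters, fold capitals, tokenize."""
    text = re.sub(r"[^\w\s'.,;:!?%-]", " ", text)   # remove meta characters
    text = re.sub(r"\s+", " ", text).strip()         # clean white space
    text = text.lower()                              # process capital letters
    # Numbers, words (keeping internal apostrophes), then single punctuation.
    return re.findall(r"\d+(?:\.\d+)?|\w+(?:'\w+)*|[^\w\s]", text)


def label_generic(tokens):
    """Attach a generic label to each token where the table knows one."""
    labeled = []
    for token in tokens:
        if re.fullmatch(r"\d+(?:\.\d+)?", token):
            label = "number"
        elif re.fullmatch(r"[^\w\s]", token):
            label = "punctuation"
        else:
            label = GENERIC_LABELS.get(token, "unlabeled")
        labeled.append((token, label))
    return labeled


tokens = preprocess("**Pirates** boarded  a large Japanese fishing vessel.")
labeled = label_generic(tokens)
print(labeled)
```

Tokens left as "unlabeled" here are the ones the later specialized and domain-specific labelers would be expected to handle.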

RDF Triple Extraction

M Language to RDF Concept Map

RDF Triples

M Language

past.tense state.0

Fire natural believed

natural.state

thin perception. attribute

attribute.0 attribute.0

attribute.0 + attribute.0

Attribute.0

is

past.tense + be.state

past.tense + concept.state

fire.event

event.object

Emergency. event

object.0

object.0 → state.0 → attribute.0

statement.0

past.tense + believe.state

Transform Structured and Unstructured Data

<Event id="ForestFire"> <Description>Forest Fire</Description> <FireInformation> <Type>Pine Forest</Type> <Containment>5%</Containment> <Cause> Believed natural; arson unlikely </Cause> </FireInformation> <Conditions> <ReportedAt> 2008-04-14T18:00:00Z </ReportedAt> <Temperature>90.2F</Temperature> <Wind>NW 3MPH</Wind> <Humidity>15%</Humidity> <Pressure>1014.8 hPa</Pressure> <Visibility>5 mi</Visibility> </Conditions>

event

fire information

cause

believed natural

event.0

fire.1

cause.0

believe.0 past.0 natural.0

M Language

Applications

[Diagram: GMT, BFA, DIS and TDL each connect to a shared network through an M adapter]

Interoperation of existing government data protocols

Real-time Data Translation

From point-to-point to source translation - “n² to n”
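The "n² to n" claim can be checked with simple counting. With point-to-point translation every protocol needs a translator to every other protocol, which is n(n-1) directed translators and grows as n². With a common language each protocol needs only one adapter. The sketch below works this out for the four protocols named in this section; counting directed translators, and one adapter per protocol, are our modelling assumptions.

```python
def point_to_point_translators(n):
    """Each of n protocols translates directly into each of the other n - 1."""
    return n * (n - 1)


def common_language_adapters(n):
    """Each protocol needs a single adapter to and from the common language."""
    return n


protocols = ["GMT", "BFA", "DIS", "TDL"]
n = len(protocols)
print(n, point_to_point_translators(n), common_language_adapters(n))

# Adding a fifth protocol: 8 new translators versus 1 new adapter.
added = point_to_point_translators(n + 1) - point_to_point_translators(n)
print(added, common_language_adapters(n + 1) - common_language_adapters(n))
```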

[Diagram: DOJ, DHS, FBI and IBIS each connect to a shared network through an M adapter]

Fusion of existing government data sources

Data Fusion

Integrate and fuse data without changing database

Data System Integration

Connect disparate systems without changing protocol

Parts

Gas

Food M-EDI

Adapter

Processing Flow in M

M-KML Adapter

Google Earth

Request Processor

Inventory Service

Requisition Algorithm

Procurement Processing System

Order Service

Fulfillment Report

EDI

KML

Vendors DC Local Govt /

Vendor Systems

M Adapters

Data Flow Management

Procurement Tracking System

M Adapters

M Adapters

M Dict.

& Svcs

Enterprise Integration

Dedicated Solution Network

Command Centers

Oracle Fusion Middleware / Enterprise Service Bus Services: Security, Identity Management, SSO, UDDI, Others

Network

•  COP / SA •  Portals •  C2I Apps •  Intelligence

ISR Sensors Network

Mobile Users

External Users

External Organizations, Networks & Simulators

SOA

Web Portal

External Interfaces Public & Private Feeds

RSS CAP etc.

Public & Private Data Sources

ISR Sensors Network

Data Sources

Data Integration Layer (Oracle / AquaLogic)

Adaptors

Web Services Web Services

Web Portal Enhanced Data Interfaces

Simulated Sensor Network

Simulated Data

Sources

Smart Data Archives

Smart Dynamic Repositories

•  Portals •  Mash-ups •  Wiki •  Web 2

Web Center

Pre-built Component

Collaboration •  IM •  Online

Meetings

Intelligent Services (M Language Apps.) •  Information Fusion •  Monitors & Triggers

Rapid Composition Simulation Subsystem •  Scenario Generation •  Data Generation •  Simulation Control •  Monitors •  Training

Business Intelligence

•  Dashboards •  Ad-Hoc

BI Data Warehouse

Operations •  C2I •  Decision Support •  Operations Mgmt.

Business Systems •  HRMS / Training •  Performance Based

Logistics

Enterprise Search •  COTS Tools

Future

Development

•  Vocabulary acquisition – automatically build vocabulary, definitions and relationships from parsed text

•  Context - use document context to improve parsing

•  Pattern extraction – identify common structural patterns in parsed data to improve processing

•  Semi-Structured data – apply parsing approach to semi-structured data such as generic web pages and scripting languages

Thank you

M Language

cell.biology -- the basic structural and functional unit of all organisms; they may exist as independent units of life (as in monads) or may form colonies or tissues as in.

cell.electric, electric_cell.1 -- A device that delivers an electric current as the result of a chemical reaction.

cell.jail, jail_cell.1, prison_cell.1 -- A room where a prisoner is kept.

cell.room, cubicle.3 -- Small room in which a monk or nun lives.

cell.compartment -- Any small compartment; the cells of a honeycomb.

cell.telephone, cellular_telephone.1, cellular_phone.1, cellphone.1, mobile_phone.1 -- A hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver.

cell.political, cadre.2 -- A small unit serving as part of or as the nucleus of a larger political movement.

cell.manufacturing -- Manufacturing cell, in which a group of workers and/or machines work together as a team to produce a dedicated set of products or assemblies.

Dictionary Data Structure

computer

IP Address

18.78.14.156

Location

Latitude

145.681

Unstructured Data

<Shipping_Incidents>
  <Incident>
    <Date>04/26/2007</Date>
    <Reference_No>2007-130</Reference_No>
    <Subregion>93</Subregion>
    <Geolocation>9.566666667,111.9166667</Geolocation>
    <Aggressor>PIRATES</Aggressor>
    <Victim>FISHING VESSEL</Victim>
    <Description>Spratly Islands, SOUTH CHINA SEA: Pirates boarded a large Japanese fishing vessel. The vessel was robbed of its catch, while it was taking shelter due to engine trouble. The master informed his family about the robbery and that another vessel was approaching it. All contact with the fishing vessel was lost, since the master's last call. The fate of the vessel and its crewmembers are unknown.</Description>
  </Incident>
</Shipping_Incidents>

M Query

[e.g. relation abbreviation:depth:+/-,relation abbreviation:depth:+/-,...] XML Query Language + M Language Predicates

Common Alerting Protocol (CAP)

Issues illustrated by the CAP message: abbreviations, CamelCase, unstructured data, natural language, embedded data structures

<Event id="ForestFire"> <Descriptor>Forest Fire</Descriptor> <Identifier label="Type">Pine Forest</Identifier> <SimpleProperty label="Containment">5%</SimpleProperty> <Location id="FF-Location"> <GeoLocation><Polygon><exterior><LinearRing> <pos>39.09 -105.91</pos> . . . <pos>39.09 -105.91</pos> </LinearRing></exterior></Polygon> </GeoLocation> </Location> <FireInformation> <Type>Pine Forest</Type> <Containment>5%</Containment> <Cause>Believed natural; arson unlikely</Cause> </FireInformation> <Conditions> <ReportedAt>2008-04-14T18:00:00Z</ReportedAt> <Temperature>90.2F</Temperature> <Wind>NW 3MPH</Wind> <Humidity>15%</Humidity> <Pressure>1014.8 hPa</Pressure> <Visibility>5 mi</Visibility> </Conditions>

DoD Universal Core (UCore)

Issues illustrated by the UCore message: abbreviations, CamelCase, unstructured data, natural language, phrases, embedded units of measure

National Weather Service (NOAA NWS)

Issues illustrated by the NWS observation: phrases, unstructured data, heterogeneous data structures, embedded units of measure

Unstructured Data – Grammatical Representation

“Pirates boarded a large Japanese fishing vessel.”

noun phrase: pirate (noun, plural)

verb phrase: board (verb, past tense)

noun phrase: a (indefinite article), large (adjective), Japanese (adjective), fishing vessel (noun)

M Language

“Jeff threw the red ball”

Jeff → boy.person → person.entity → entity.object → object.0

threw → past.tense + throw.move → past.tense + action.0 → action.0

the red ball → the.article red.color ball.object → reference.article color.attribute thing.object → attribute.0 attribute.0 object.0 → attribute.0+ → object.0 → object.0

object.0 → action.0 → object.0

statement.0

Outline

•  Data Volume

•  Information Systems

•  on-line banking, commerce, social networking

•  XML and Web Services

•  Standardization

•  Mashups

•  Issues

•  M Language (un)-(semi)-structured data

•  Future
