DATA CENTER AN MIT RESEARCH PROGRAM
David Brock The Data Center Program Massachusetts Institute of Technology
Optimizing Resources through Connected Data Systems
Integrating Information in Industry and Government
Data Explosion
Data doubles every 18 months1
1. Gantz, John and Reinsel, David, “As the Economy Contracts, the Digital Universe Expands,” IDC – Multimedia White Paper, May 2009. (http://www.emc.com/digital_universe)
[Chart: size of the digital universe in exabytes, 2009 to 2013; vertical axis from 0 to 2,500 exabytes]
Data Explosion
The world's digital content is equivalent to a stack of books stretching from the Earth to Pluto and back.1
Data Explosion
30 Tons
If we stick with the book analogy, then the digital universe in 2010 is equivalent to 30 tons of books for every man, woman and child on the Earth. 1
Internet Usage
Online banking: 75% of US customers use online banking. 1
Electronic commerce: $32.4 billion in quarterly US retail e-commerce sales. 2
Social networking: 500 million active users on Facebook. 1
1. ComScore.com
2. U.S. Census Bureau, Quarterly Retail E-Commerce Sales, 2nd Quarter 2009
Academic Research
“. . . people were drowning in scholarly information, and drowning in information in general. So it takes twice as much time for people to begin their research.” - Damon Zucca, Executive Editor, Oxford University Press
Next Generation
• The average teen texter exchanges 1,500 texts each month, with an average of 50 per day. 1
• 21% of teenagers access the Internet only from their cell phones.
1. Pew Research Center's Internet & American Life Project.
Machine Understanding
Yet almost none of this information is understandable by a computer.
Extensible Markup Language
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Twilight</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
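As a minimal sketch of what a program can and cannot do with such a document, the bookstore sample can be read with Python's standard `xml.etree.ElementTree` module. The function name `read_books` is illustrative, and the XML declaration is omitted from the embedded string for brevity. The parser recovers the structure, but nothing in it knows what a "title" or a "price" means.

```python
import xml.etree.ElementTree as ET

BOOKSTORE_XML = """<bookstore>
  <book>
    <title lang="en">Twilight</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""


def read_books(xml_text):
    """Return (title, language, price) for every <book> element."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.findall("book"):
        title = book.find("title")
        price = float(book.find("price").text)
        books.append((title.text, title.get("lang"), price))
    return books


books = read_books(BOOKSTORE_XML)
for title, lang, price in books:
    print(f"{title} ({lang}): {price:.2f}")
```

The tag names are hard-coded in the reader, so a second bookstore that used `<cost>` instead of `<price>` would need its own code.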
4ML AML AML AML AML AML AML ABML ABML ACML ACML ACAP ACS X12 ADML AECM AFML AGML AHML AIML AIML AIF AL3 ANML ANNOTEA ANATML APML APPML AQL APPEL ARML ARML ASML ASML ASTM ARML ARML ASML
ARML ARML ASML ASML ASTM ATML ATML ATML ATML AWML AXML AXML AXML AXML BML BML BML BML BML BML BannerML BCXML BEEP BGML BHTML BIBLIOML BIOML BIPS BizCodes BLM XML BPML BRML BSML BCXML BEEP BGML BHTML
BiblioML BCXML BEEP BGML BHTML BIBLIOML BIOML BIPS BizCodes BLM XML BPML BRML BSML CML xCML CaXML CaseXML xCBL CBML CDA CDF CDISC CELLML ChessGML ChordML ChordQL CIM CIML CIDS CIDX xCIL CLT CNRP ComicsML CIM CIML CIDS
CIDX xCIL CLT CNRP ComicsML Covad xLink CPL CP eXchange CSS CVML CWMI CycML DML DAML DaliML DaqXML DAS DASL DCMI DOI DeltaV DIG35 DLML DMML DocBook DocScope DoD XML DPRL DRI DSML DSD DXS EML EML DLML EAD ebXML
eBIS-XML ECML eCo EcoKnow edaXML EMSA eosML ESML ETD-ML FieldML FINML FITS FIXML FLBC FLOWML FPML FSML GML GML GML GXML GAME GBXML GDML GEML GEDML GEN GeoLang GIML GXD GXL Hy XM HITIS HR-XML HRMML HTML HTTPL
HTTP-DRP HumanML HyTime IML ICML IDE IDML IDWG IEEE DTD IFX IMPP IMS Global InTML IOTP IRML IXML IXRetail JabberXML JDF JDox JECMM JLife JSML JSML JScoreML KBML LACITO LandXML LEDES LegalXML Life Data LitML LMML LogML LogML LTSC XML MAML
MatML MathML MBAM MISML MCF MDDL MDSI-XML Metarule MFDX MIX MMLL MML MML MML MoDL MOS MPML MPXML MRML MSAML MTML MTML MusicXML NAML xNAL NAA Ads Navy DTD NewsML NML NISO DTB NITF NLMXML NVML OAGIS OBI OCF ODF
ODRL OeBPS OFX OIL OIM OLifE OML ONIX DTD OOPML OPML OpenMath Office XML OPML OPX OSD OTA PML PML PML PML PML PML PML PML P3P PDML PDX PEF XML PetroML PGML PhysicsML PICS PMML PNML PNML PNG PrintML
PrintTalk ProductionML PSL PSI QML QAML QuickData RBAC RDDl RDF RDL RecipeML RELAX RELAX NG REXML REPML ResumeXML RETML RFML RightsLang RIXML RoadmOPS RosettaNet PIP RSS RuleML SML SML SML SML SAML SABLE SAE J2008 SBML Schemtron SDML SearchDM-XML SGML
SHOE SIF SMML SMBXML SMDL SDML SMIL SOAP SODL SOX SPML SpeechML SSML STML STEP STEPML SVG SWAP SWMS SyncML TML TML TML TalkML TaxML TDL TDML TEI ThML TIM TIM TMML TMX TP TPAML TREX TxLife
UML UBL UCLP UDDI UDEF UIML ULF UMLS UPnP URI/URL UXF VML vCalendar vCard VCML VHG VIML VISA XML VMML VocML VoiceXML VRML WAP WDDX WebML WebDAV WellML WeldingXML Wf-XML WIDL WITSML WorldOS WSML WSIA XML XML Court XML EDI
XML F XML Key XMLife XML MP XML News XML RPC XML Schema XML Sign XML Query XML P7C XML TP XMLVoc XML XCI XAML XACML XBL XSBEL XBN XBRL XCFF XCES Xchart Xdelta XDF XForms XGF XGL XGMML XHTML XIOP XLF XLIFF XLink XMI XMSG XMTP XNS
Standards
Thousands of incompatible standards make interoperability practically impossible!
Common Alerting Protocol (CAP)
<alert xmlns="http://www.incident.com/cap/1.1">
  <identifier>23e090317621</identifier>
  <sender>hsas@dhs.gov</sender>
  <sent>2007-10-09T14:39:01-05:00</sent>
  <status>Actual</status>
  <msgType>Alert</msgType>
  <scope>Public</scope>
  <info>
    <category>Security</category>
    <event>Security Update</event>
    <urgency>Past</urgency>
    <severity>Moderate</severity>
    <certainty>Observed</certainty>
    <senderName>Department of Homeland Security</senderName>
    <headline>Suspicious car spotted / Tucson, AZ</headline>
    <description>Blue sedan, license plate number 61SN52, caught on red light camera on intersection of Main St. and Pleasant St.</description>
    <area>
      <areaDesc>Tucson, AZ</areaDesc>
      <circle>32.1995,-110.8925 0</circle>
    </area>
  </info>
</alert>
The Common Alerting Protocol (CAP) is an XML-based data format for exchanging public warnings and emergencies between alerting technologies. CAP implementations have been demonstrated by agencies and companies including: United States Department of Homeland Security; National Weather Service; United States Geological Survey; California Office of Emergency Services; and many others.
CAP is the foundation technology for the proposed "Integrated Public Alert and Warning System," an all-hazard, all-media national warning architecture being developed by DHS, the National Weather Service and the Federal Communications Commission.
Compound Tag Names
<senderName>Department of Homeland Security</senderName>
Abbreviations
<msgType>Alert</msgType>
Embedded Data Structures
<circle>32.1995,-110.8925 0</circle>
Natural Language
<headline>Suspicious car spotted / Tucson, AZ</headline>
<description>Blue sedan, license plate number 61SN52, caught on red light camera on intersection of Main St. and Pleasant St.</description>
DoD Universal Core (UCore)
<Event xmlns:gml="http://www.opengis.net/gml/3.2" id="ForestFire">
  <Descriptor>Forest Fire</Descriptor>
  <Identifier label="Type">Pine Forest</Identifier>
  <SimpleProperty label="Containment">5%</SimpleProperty>
  <Location id="FF-Location">
    <Boundary><Polygon>
      <gml:LinearRing>
        <gml:pos>39.09 -105.91</gml:pos>
        . . .
        <gml:pos>39.09 -105.91</gml:pos>
      </gml:LinearRing>
    </Polygon></Boundary>
  </Location>
  <FireInformation>
    <Type>Pine Forest</Type>
    <Containment>5%</Containment>
    <Cause>Believed natural; arson unlikely</Cause>
  </FireInformation>
  <Conditions>
    <ReportedAt>2008-04-14T18:00:00Z</ReportedAt>
    <Temperature>90.2F</Temperature>
    <Wind>NW 3MPH</Wind>
    <Humidity>15%</Humidity>
    <Pressure>1014.8 hPa</Pressure>
    <Visibility>5 mi</Visibility>
  </Conditions>
</Event>
Universal Core (UCore) is a national security and preparedness initiative led jointly by the Department of Defense (DoD), Department of Justice (DOJ), Department of Homeland Security (DHS), and the Office of the Director of National Intelligence (ODNI).
Composite Schema
<gml:LinearRing> <gml:pos>39.09 -105.91</gml:pos> . . . <gml:pos>39.09 -105.91</gml:pos> </gml:LinearRing>
Embedded Units
<Temperature>90.2F</Temperature>
<Wind>NW 3MPH</Wind>
<Humidity>15%</Humidity>
<Pressure>1014.8 hPa</Pressure>
National Weather Service (NOAA NWS)
<current_observation version="1.0">
  <location>Boston, Logan International Airport, MA</location>
  <latitude>42.38</latitude>
  <longitude>-71.03</longitude>
  <observation_time>Jul 25 2009, 3:54 pm EDT</observation_time>
  <weather>Partly Cloudy</weather>
  <temp_f>84.0</temp_f>
  <temp_c>28.9</temp_c>
  <relative_humidity>44</relative_humidity>
  <wind_dir>West</wind_dir>
  <wind_degrees>270</wind_degrees>
  <wind_mph>11.5</wind_mph>
  <wind_kt>10</wind_kt>
  <pressure_mb>1013.1</pressure_mb>
  <pressure_in>29.92</pressure_in>
  <dewpoint_f>60.1</dewpoint_f>
  <dewpoint_c>15.6</dewpoint_c>
  <visibility_mi>10.00</visibility_mi>
</current_observation>
NWS offers hourly weather observations formatted with XML tags to aid parsing by automated programs that populate databases, display information on web pages, or serve similar applications.
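The two weather samples encode the same quantity in different ways: the UCore-style document embeds the unit in the value (`90.2F`), while the NWS document embeds it in the tag name (`temp_f`). The following is a rough sketch, assuming only those two conventions, of the per-format code a consumer has to write today. The function names are illustrative and not part of either standard.

```python
import re
import xml.etree.ElementTree as ET


def fahrenheit_to_celsius(deg_f):
    return (deg_f - 32.0) * 5.0 / 9.0


def temperature_from_embedded_unit(text):
    """Parse a value whose unit is embedded in the text, e.g. '90.2F'."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*([FC])\s*", text)
    if match is None:
        raise ValueError(f"unrecognized temperature: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    return fahrenheit_to_celsius(value) if unit == "F" else value


def temperature_from_tag_suffix(element):
    """Parse a value whose unit is part of the tag name, e.g. <temp_f>."""
    value = float(element.text)
    if element.tag == "temp_f":
        return fahrenheit_to_celsius(value)
    if element.tag == "temp_c":
        return value
    raise ValueError(f"unrecognized temperature tag: {element.tag!r}")


ucore_style = ET.fromstring("<Temperature>90.2F</Temperature>")
nws_style = ET.fromstring("<temp_f>84.0</temp_f>")

ucore_celsius = round(temperature_from_embedded_unit(ucore_style.text), 1)
nws_celsius = round(temperature_from_tag_suffix(nws_style), 1)
print(ucore_celsius, nws_celsius)
```

Each new source format needs another special-case reader of this kind, which is the interoperability cost the slides point to.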
M Language
Common Vocabulary and Data Format
• Dictionary – Repository of well-defined terms
• Ontologies – Connections between terms
• Grammar – Rules for composing terms into documents and messages
Dictionary
Repository of “concepts”
token: cell
definition: the basic structural and functional unit of all organisms.
Dictionary
Keyed concepts
cell.0 (“cell.biology”) - the basic structural and functional unit of all organisms.
cell.1 (“cell.phone”) - A hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver.
cell.2 (“cell.manufacturing”) - A group of workers and/or machines that work together as a team to produce a dedicated set of products or assemblies.
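One way to picture the keyed dictionary is as a lookup table from concept keys to definitions, with a readable alias for each key. This is only a sketch: the `Concept` class, the `senses` and `resolve` helpers, and the storage layout are assumptions for illustration, not the M Language implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    key: str         # indexed key, e.g. "cell.1"
    alias: str       # readable key, e.g. "cell.phone"
    definition: str


DICTIONARY = {
    c.key: c
    for c in [
        Concept("cell.0", "cell.biology",
                "the basic structural and functional unit of all organisms"),
        Concept("cell.1", "cell.phone",
                "a hand-held mobile radiotelephone"),
        Concept("cell.2", "cell.manufacturing",
                "a group of workers and/or machines working as a team"),
    ]
}


def senses(token):
    """Return every keyed concept for a token, ordered by index."""
    prefix = token + "."
    found = [c for key, c in DICTIONARY.items() if key.startswith(prefix)]
    return sorted(found, key=lambda c: int(c.key.rsplit(".", 1)[1]))


def resolve(name):
    """Look a concept up by indexed key ('cell.1') or alias ('cell.phone')."""
    if name in DICTIONARY:
        return DICTIONARY[name]
    for concept in DICTIONARY.values():
        if concept.alias == name:
            return concept
    raise KeyError(name)


for concept in senses("cell"):
    print(concept.key, concept.alias, "-", concept.definition)
```

Because every sense has its own key, a document that says `cell.1` cannot be misread as biology or manufacturing.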
Compose Concepts - Words
Compose concepts into words
cell.0 → cell
cell.0 + plural.0 → cells
cell.0 + modifier.0 → cellular
Compose Concepts - Phrases
Phrases – sequences of concepts
cell.1 + design.0 + research.0 → cellular phone design research
chemical.0 + company.0 + description.0 → chemical company descriptions
maximum.0 + annual.0 + temperature.0 → maximum annual temperature
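Composition can be sketched as a table from concept sequences to surface forms, with phrases built by joining words. The `WORD_FORMS` table below is a hand-written stand-in containing only forms that appear on the slides. A real system would presumably generate forms by rule, which is not shown here.

```python
# Hypothetical surface-form table; entries mirror the slide examples only.
WORD_FORMS = {
    ("cell.0",): "cell",
    ("cell.0", "plural.0"): "cells",
    ("cell.0", "modifier.0"): "cellular",
    ("cell.1",): "cellular phone",
    ("design.0",): "design",
    ("research.0",): "research",
    ("maximum.0",): "maximum",
    ("annual.0",): "annual",
    ("temperature.0",): "temperature",
}


def render_word(concepts):
    """Render one word from a base concept plus optional modifiers."""
    return WORD_FORMS[tuple(concepts)]


def render_phrase(words):
    """Render a phrase from a sequence of words, each a list of concepts."""
    return " ".join(render_word(word) for word in words)


print(render_word(["cell.0", "plural.0"]))
print(render_phrase([["cell.1"], ["design.0"], ["research.0"]]))
print(render_phrase([["maximum.0"], ["annual.0"], ["temperature.0"]]))
```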
Compose Concepts – Structured Documents
<Event id="ForestFire">
  <Description>Forest Fire</Description>
  <FireInformation>
    <Type>Pine Forest</Type>
    <Containment>5%</Containment>
    <Cause>Believed natural; arson unlikely</Cause>
  </FireInformation>
  <Conditions>
    <ReportedAt>2008-04-14T18:00:00Z</ReportedAt>
    <Temperature>90.2F</Temperature>
    <Wind>NW 3MPH</Wind>
    <Humidity>15%</Humidity>
    <Pressure>1014.8 hPa</Pressure>
    <Visibility>5 mi</Visibility>
  </Conditions>
</Event>
event
forest.trees fire.event
containment.bound
5 percent.unit
cause.effect
weather.conditions
2008-04-14T18:00:00Z
wind.air speed.velocity
northwest.direction
3 mph.unit
wind.air direction.angle
date.time
Believed natural
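Ontology
Connections between concepts
iron.0 → type-of → heavy metal.0
cobalt.0 → type-of → heavy metal.0
manganese.0 → type-of → heavy metal.0
mercury.0 → type-of → heavy metal.0
lead.0 → type-of → heavy metal.0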
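The type-of connections can be sketched as a parent map that is walked upward to answer membership questions. The link from `heavy metal.0` to `metal.0` is a hypothetical addition, included only to show that the relation chains. The helper names are likewise illustrative.

```python
# Child concept -> parent concept for the "type-of" relation.
TYPE_OF = {
    "iron.0": "heavy metal.0",
    "cobalt.0": "heavy metal.0",
    "manganese.0": "heavy metal.0",
    "mercury.0": "heavy metal.0",
    "lead.0": "heavy metal.0",
    "heavy metal.0": "metal.0",  # hypothetical extra level, not from the slide
}


def ancestors(concept):
    """Follow type-of links upward and return every ancestor in order."""
    chain, seen = [], {concept}
    while concept in TYPE_OF:
        concept = TYPE_OF[concept]
        if concept in seen:  # guard against accidental cycles
            break
        seen.add(concept)
        chain.append(concept)
    return chain


def is_type_of(concept, category):
    return category in ancestors(concept)


def members(category):
    """Return the direct children of a category, sorted by key."""
    return sorted(child for child, parent in TYPE_OF.items() if parent == category)


print(ancestors("mercury.0"))
print(members("heavy metal.0"))
```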
M Language
Challenge
Transform disparate structured and unstructured data into the M Language
Transform Structured and Unstructured Data
<Event> → event
<FireInformation> → fire information
<Cause> → cause
Believed natural; arson unlikely → believed natural
Natural Language Processing
Pre-process Text → Tokens
- load text - spell check - clean white space - remove meta characters - process capital letters - process apostrophes - tokenize
Generic Token Labeler → Labeled Tokens
- adjectives - adverbs - articles - auxiliary verbs - conjunctions - idioms - interrogatives - names (first) - names (last) - nouns - numbers - prepositions - pronouns - punctuation - verbs
Specialized Token Labeler → Labeled Tokens
- times / dates - locations - personalities - organizations - products - numbers - equations/formulas - acronyms / codes
Domain Specific Token Labeler → Labeled Tokens
- organizational labels - proprietary labels
Token Process → Processed Tokens
- compound words - word function culling - token ranking
Token Cluster → Token graph
- noun phrases - verb phrases - prepositional phrases - conjunction - grammatical partitions - tree formation
Meta-data Analysis → Token graph + meta-data
- source analysis - context analysis - tone analysis
Disambiguation → Concept graph
- meta-data - word association - context analysis - concept abstraction - pattern abstraction
Link Process → Concept graph
- modifier references - prepositional phrase references - pronoun references
Knowledge Storage → M Language Knowledge Database
- knowledge storage
M Language Dictionary - generic concepts and representations
M Language Dictionary - specialized concepts and representations
M Language Dictionary - proprietary concepts and representations
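The first two stages of the pipeline above, pre-processing and generic token labeling, can be sketched in a few lines. The word list in `GENERIC_LABELS` is a tiny hypothetical stand-in for the generic dictionary. Spell checking and the later stages (clustering, disambiguation, linking) are not attempted here.

```python
import re

# Hypothetical miniature word list standing in for the generic dictionary.
GENERIC_LABELS = {
    "a": "article", "the": "article",
    "large": "adjective", "japanese": "adjective",
    "pirates": "noun", "fishing": "noun", "vessel": "noun",
    "boarded": "verb",
}


def preprocess(text):
    """Remove meta characters, clean white space, lower-case, and tokenize."""
    text = re.sub(r"[^\w\s'.,-]", " ", text)   # remove meta characters
    text = re.sub(r"\s+", " ", text).strip()    # clean white space
    return re.findall(r"[A-Za-z0-9']+|[.,]", text.lower())


def label_generic(tokens):
    """Attach a generic part-of-speech style label to every token."""
    labeled = []
    for token in tokens:
        if token in {".", ","}:
            label = "punctuation"
        elif token.isdigit():
            label = "number"
        else:
            label = GENERIC_LABELS.get(token, "unknown")
        labeled.append((token, label))
    return labeled


tokens = preprocess("  Pirates boarded a large   Japanese fishing vessel. ")
labeled = label_generic(tokens)
for token, label in labeled:
    print(f"{token:10s} {label}")
```

Tokens that fall through as `unknown` are the ones a specialized or domain-specific labeler would pick up in the later stages.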
RDF Triple Extraction
M Language to RDF Concept Map
RDF Triples
M Language
past.tense state.0
Fire natural believed
natural.state
thin perception. attribute
attribute.0 attribute.0
attribute.0 + attribute.0
Attribute.0
is
past.tense + be.state
past.tense + concept.state
fire.event
event.object
emergency.event
object.0
object.0 → state.0 → attribute.0
statement.0
past.tense + believe.state
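A statement of the form object.0 → state.0 → attribute.0 lines up naturally with an RDF subject, predicate, and object. The sketch below assumes that direct mapping, uses a made-up namespace (`http://example.org/m#`), and drops the tense marker. None of this is taken from the actual M Language to RDF concept map.

```python
from typing import NamedTuple

# Hypothetical namespace for M Language concepts.
M_NAMESPACE = "http://example.org/m#"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def statement_to_triple(object_concept, state_concept, attribute_concept):
    """Map an object -> state -> attribute statement onto one RDF triple."""
    return Triple(
        M_NAMESPACE + object_concept,
        M_NAMESPACE + state_concept,
        M_NAMESPACE + attribute_concept,
    )


def to_ntriples(triple):
    """Serialize a triple as one N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


triple = statement_to_triple("fire.event", "believe.state", "natural.state")
print(to_ntriples(triple))
```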
Transform Structured and Unstructured Data
event → event.0
fire information → fire.1
cause → cause.0
believed natural → believe.0 past.0 natural.0
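For structured input, the first step of this transformation can be approximated by splitting compound tag names into words and looking the resulting phrase up in the dictionary. The `CONCEPTS` table is a hypothetical three-entry dictionary using the keys shown above. The splitting rule is an assumption, not the documented M Language procedure.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical phrase-to-concept table using the keys from the slide.
CONCEPTS = {
    "event": "event.0",
    "fire information": "fire.1",
    "cause": "cause.0",
}

SAMPLE = (
    "<Event id='ForestFire'><FireInformation>"
    "<Cause>Believed natural; arson unlikely</Cause>"
    "</FireInformation></Event>"
)


def tag_to_phrase(tag):
    """Split a CamelCase tag name into lower-case words."""
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", tag)
    return " ".join(word.lower() for word in words)


def tags_to_concepts(xml_text):
    """Return (tag, phrase, concept key or None) in document order."""
    root = ET.fromstring(xml_text)
    return [
        (element.tag, tag_to_phrase(element.tag),
         CONCEPTS.get(tag_to_phrase(element.tag)))
        for element in root.iter()
    ]


mapping = tags_to_concepts(SAMPLE)
for tag, phrase, concept in mapping:
    print(f"<{tag}> -> {phrase} -> {concept}")
```

The free-text content of `<Cause>` is not handled by this step. It would go through the natural language pipeline instead.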
M Language
Applications
Real-time Data Translation
Interoperation of existing government data protocols
From point-to-point to source translation - “n² to n”
[Diagram: GMT, BFA, DIS and TDL each connect to a shared network through an M adapter]
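The “n² to n” claim can be checked with a short count. With n protocols, direct translation needs a translator for every ordered pair, while a shared intermediate language needs one adapter per protocol. The sketch below counts both and shows a toy hub. The `Hub` class and the `NAME|payload` message format are invented for illustration and say nothing about the real GMT, BFA, DIS, or TDL formats.

```python
def point_to_point_translators(n):
    """Directed translators needed when every protocol talks to every other."""
    return n * (n - 1)


def hub_adapters(n):
    """Adapters needed when every protocol translates to and from M."""
    return n


class Hub:
    """Routes messages between protocols through a shared representation."""

    def __init__(self):
        self._adapters = {}

    def register(self, protocol, to_m, from_m):
        self._adapters[protocol] = (to_m, from_m)

    def translate(self, message, source, target):
        to_m, _ = self._adapters[source]
        _, from_m = self._adapters[target]
        return from_m(to_m(message))


def make_toy_adapter(name):
    """Build an adapter pair for a made-up 'NAME|payload' message format."""
    def to_m(message):
        return message.split("|", 1)[1]

    def from_m(concept):
        return f"{name}|{concept}"

    return to_m, from_m


hub = Hub()
for protocol in ["GMT", "BFA", "DIS", "TDL"]:
    hub.register(protocol, *make_toy_adapter(protocol))

print(point_to_point_translators(4), hub_adapters(4))
print(hub.translate("GMT|fire.event", "GMT", "TDL"))
```

Adding a fifth protocol costs one new adapter in the hub design, against eight new directed translators in the point-to-point design.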
Data Fusion
Fusion of existing government data sources
Integrate and fuse data without changing the database
[Diagram: DOJ, DHS, FBI and IBIS each connect to a shared network through an M adapter]
Data System Integration
Connect disparate systems without changing protocol
[Diagram: Processing Flow in M. Labels: Parts, Gas, Food; Vendors; DC Local Govt; Vendor Systems; EDI; M-EDI Adapter; KML; M-KML Adapter; Google Earth; Request Processor; Inventory Service; Requisition Algorithm; Procurement Processing System; Order Service; Fulfillment Report; Data Flow Management; Procurement Tracking System; M Adapters; M Dict. & Svcs]
Enterprise Integration
Dedicated Solution Network
Command Centers
Oracle Fusion Middleware / Enterprise Service Bus Services: Security, Identity Management, SSO, UDDI, Others
Network
• COP / SA • Portals • C2I Apps • Intelligence
ISR Sensors Network
Mobile Users
External Users
External Organizations, Networks & Simulators
SOA
Web Portal
External Interfaces Public & Private Feeds
RSS CAP etc.
Public & Private Data Sources
ISR Sensors Network
Data Sources
Data Integration Layer (Oracle / AquaLogic)
Adaptors
Web Services Web Services
Web Portal Enhanced Data Interfaces
Simulated Sensor Network
Simulated Data Sources
Smart Data Archives
Smart Dynamic Repositories
• Portals • Mash-ups • Wiki • Web 2
Web Center
Pre-built Component
Collaboration • IM • Online Meetings
Intelligent Services (M Language Apps.) • Information Fusion • Monitors & Triggers
Rapid Composition Simulation Subsystem • Scenario Generation • Data Generation • Simulation Control • Monitors • Training
Business Intelligence
• Dashboards • Ad-Hoc
BI Data Warehouse
Operations • C2I • Decision Support • Operations Mgmt.
Business Systems • HRMS / Training • Performance Based Logistics
Enterprise Search • COTS Tools
Future
Development
• Vocabulary acquisition – automatically build vocabulary, definitions and relationships from parsed text
• Context - use document context to improve parsing
• Pattern extraction – identify common structural patterns in parsed data to improve processing
• Semi-Structured data – apply parsing approach to semi-structured data such as generic web pages and scripting languages
Thank you
M Language
cell.biology -- the basic structural and functional unit of all organisms; they may exist as independent units of life (as in monads) or may form colonies or tissues.
cell.electric, electric_cell.1 -- A device that delivers an electric current as the result of a chemical reaction.
cell.jail, jail_cell.1, prison_cell.1 -- A room where a prisoner is kept.
cell.room, cubicle.3 -- Small room in which a monk or nun lives.
cell.compartment -- Any small compartment; the cells of a honeycomb.
cell.telephone, cellular_telephone.1, cellular_phone.1, cellphone.1, mobile_phone.1 -- A hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver.
cell.political, cadre.2 -- A small unit serving as part of or as the nucleus of a larger political movement.
cell.manufacturing -- Manufacturing cell, in which a group of workers and/or machines work together as a team to produce a dedicated set of products or assemblies.
Dictionary Data Structure
computer
IP Address
18.78.14.156
Location
Latitude
145.681
Unstructured Data
<Shipping_Incidents>
  <Incident>
    <Date>04/26/2007</Date>
    <Reference_No>2007-130</Reference_No>
    <Subregion>93</Subregion>
    <Geolocation>9.566666667,111.9166667</Geolocation>
    <Aggressor>PIRATES</Aggressor>
    <Victim>FISHING VESSEL</Victim>
    <Description>Sprattly Islands, SOUTH CHINA SEA: Pirates boarded a large Japanese fishing vessel. The vessel was robbed of its catch, while it was taking shelter due to engine trouble. The master informed his family about the robbery and that another vessel was approaching it. All contact with the fishing vessel was lost, since the master's last call. The fate of the vessel and its crewmembers are unknown.</Description>
  </Incident>
</Shipping_Incidents>
M Query
[e.g. relation abbreviation:depth:+/-,relation abbreviation:depth:+/-,...] XML Query Language + M Language Predicates
Common Alerting Protocol (CAP)
Issues in the CAP example: abbreviations, CamelCase, unstructured data, natural language, embedded data structures
<Event id="ForestFire"> <Descriptor>Forest Fire</Descriptor> <Identifier label="Type">Pine Forest</Identifier> <SimpleProperty label="Containment">5%</SimpleProperty> <Location id="FF-Location"> <GeoLocation><Polygon><exterior><LinearRing> <pos>39.09 -105.91</pos> . . . <pos>39.09 -105.91</pos> </LinearRing></exterior></Polygon> </GeoLocation> </Location> <FireInformation> <Type>Pine Forest</Type> <Containment>5%</Containment> <Cause>Believed natural; arson unlikely</Cause> </FireInformation> <Conditions> <ReportedAt>2008-04-14T18:00:00Z</ReportedAt> <Temperature>90.2F</Temperature> <Wind>NW 3MPH</Wind> <Humidity>15%</Humidity> <Pressure>1014.8 hPa</Pressure> <Visibility>5 mi</Visibility> </Conditions>
DoD Universal Core (UCore)
Issues in the UCore example: abbreviations, CamelCase, unstructured data, natural language, phrases, embedded units of measure
National Weather Service (NOAA NWS)
Issues in the NWS example: phrases, unstructured data, heterogeneous data structures, embedded units of measure
Unstructured Data – Grammatical Representation
“Pirates boarded a large Japanese fishing vessel.”
noun phrase: Pirates
- noun: pirate - plural
verb phrase: boarded
- verb: board - past tense
noun phrase: a large Japanese fishing vessel
- indefinite article: a - adjective: large - adjective: Japanese - noun: fishing vessel
M Language
“Jeff threw the red ball”
Jeff: boy.person → person.entity → entity.object → object.0
threw: past.tense + throw.move → past.tense + action.0 → action.0
the red ball: the.article red.color ball.object → reference.article color.attribute thing.object → attribute.0 attribute.0 object.0 → attribute.0+ → object.0 → object.0
object.0 → action.0 → object.0
statement.0
Outline
• Data Volume
• Information Systems
• on-line banking, commerce, social networking
• XML and Web Services
• Standardization
• Mashups
• Issues
• M Language (un)-(semi)-structured data
• Future