IBM - Dublin Research Lab
City Search: An evolution of search to incorporate city data
Veli Bicer
IBM Research
Outline
• A Planet of Smarter Cities
• City Data and Information
• Making City Search Smarter
• Managing City Data
• Searching City Data
• Conclusion
A Planet of Smarter Cities
“Cities have the capability of providing something for everybody, only because, and only when, they are created by everybody.”
Jane Jacobs
A planet of smarter cities: In 2007, for the first time in history, the majority of the world’s population—3.3 billion people—lived in cities. By 2050, city dwellers are expected to make up 70% of
Earth’s total population, or 6.4 billion people.
Smarter Cities Technology Centre (diagram)
• Dublin Test Bed: Instrumented, Interconnected, Intelligent
• Seed Projects (Energy, Movement, Water): Real World Insight | Data Sets | Devices
• Intelligent Urban and Environmental Analytics and Systems: Optimization, Predictive Modelling, Forecasting, Simulation
• Smart City Solutions: Integrated Cross Domain Solutions, City Fabric
• Solutions that Sustain Economic Development: Driving New Economic Models, Significant Collaborative R&D, Skills Development & Growth, Competitive Advantage
• Collaboration and Access to Local, Regional & Worldwide Network: SMEs | MNCs | Universities | Public Sector | VC Community
City Data and Information
“The country places and the trees don’t teach me anything, but the people in the
city do”
Socrates
City of Data and Information: Many Areas
• Large, open and continuous data environment from heterogeneous domains:
– Transportation, Social Media, Energy Management, City Management, Region, Supply Chain, Food System, HealthCare, Water Management
– and even more…
Some Traffic-related Data Sets from Dublin
• Big data
• Heterogeneous data
• Static and continuous data
• Not all open yet
• Not linked yet
• Noisy data (inconsistent, imprecise)
Dublinked - outcomes
• Publish and put into context (100’s datasets, 1000’s of files)
• Create innovation ecosystem
• Dataset categories: Waste Collection, Property Management, Environment, Demographics, Business & Retail, Commercial Valuations and Rates, Tourism, Transport & Access, Crime, Heritage, Mapping, Housing, Water, Fault Reporting, Events, Health, Planning
• Pool resources, share results
Making City Search Smarter
“We cannot afford merely to sit down and deplore the evils of city life as inevitable, when cities are constantly growing,
both absolutely and relatively. We must set ourselves vigorously about the task of improving them; and this task
is now well begun.”
Theodore Roosevelt
City Search: Not a revolution, but evolution
(Diagram labels: "Genes" for the three dimensions below, "Genetic Drift" for the shift from Web Search to City Search.)
Web Search
– Information need: Ordinary Web users want to locate information (e.g. documents) on the Web.
– Search relevance: Models relevance using IR models based on content, Web popularity, clickthrough data, etc.
– Information source: Uses Web-based information such as document content, the Web graph, search engine logs, etc.
Local Search
– Information need: Mainly targets queries to locate businesses within a geographic area.
– Search relevance: Extends Web-based relevance with spatiotemporal relevance.
– Information source: Enhanced with location and time information collected from mobile sensors, IP addresses, etc.
City Search
– Information need: Concerns all types of (complex) queries encountered in everyday city life, e.g. city events.
– Search relevance: Relevance is highly dimensional and context-dependent, and leverages more city-specific information than Web information.
– Information source: City data provides a unique source of information to understand the city context.
What do people search for?
Top 8 categories according to user scores [Kukka, PUC, 2013]
Relevance
• Coffee? ☺
• Costa? Starbucks?
– Does the distance really matter?
Relevance
• Local vs. Web popularity
• What else?
• More information → more relevance to the user!
Source: Foursquare, July 2013
Relevance
• Dublin Trips Data:
– Journey times throughout the city
– Real-time data, updated every minute
– Historical data is available for every day since 9/7/2012
– Mined from SCATS-based (Sydney Coordinated Adaptive Traffic
System) intelligent transportation system for 500+ sites around Dublin
• Accessible from:
– http://dublinked.ie/datastore/datasets/dataset-215.php
• Visualization
– http://www.dublinked.ie/traffic/
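One way journey-time data like this could become a relevance signal is to compare the latest reading for a route with its historical norm. The following is a minimal sketch under stated assumptions: the readings are invented, and the function name `delay_ratio` is hypothetical. It does not reflect the actual layout of the Dublinked files.

```python
from statistics import median


def delay_ratio(history_s, current_s):
    """Return the current journey time divided by the historical median.

    A value near 1.0 means a normal journey; larger values mean slower traffic.
    """
    if not history_s:
        raise ValueError("need at least one historical reading")
    return current_s / median(history_s)


# Hypothetical journey times (seconds) for one route at the same time of day.
history = [300, 320, 310, 305, 315]
ratio = delay_ratio(history, 620)
print(f"current journey takes {ratio:.1f}x the usual time")
```

The median is used rather than the mean so that a few extreme historical readings (for example, from earlier incidents) do not distort the baseline.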
Relevance
• More transportation data
– Public Transport Route Networks
• http://dublinked.ie/datastore/datasets/dataset-258.php
– Dublin Bus GPS Data
• http://dublinked.com/datastore/datasets/dataset-304.php
– Dublin Bus GTFS data
• http://dublinked.ie/datastore/datasets/dataset-254.php
– Accessible Parking Places
• http://dublinked.com/datastore/datasets/dataset-049.php
– Roads and Streets in Dublin City
• http://dublinked.com/datastore/datasets/dataset-123.php
Relevance: Buying your dream house
Finding the houses?
Is the price reasonable?
How is the neighborhood?
Perfect match!!
Relevance
• Property Register Index: ~52,000 property sales
– Available at http://kdeg.cs.tcd.ie/propertyPriceMap/
Relevance
• More city data:
– Amenities & Recreation
• http://dublinked.ie/datastore/by-category/amenities-recreation.php
– Schools
• http://dublinked.com/datastore/datasets/dataset-099.php
– Key developing areas
• http://dublinked.ie/datastore/datasets/dataset-134.php
– Air pollution monitoring data
• http://dublinked.ie/datastore/datasets/dataset-185.php
We actually have this information. But we require novel search and exploration methods.
Business case: traffic diagnosis
Problem: diagnosis and reasoning
How can we provide city decision makers with explanations and diagnoses for events by applying machine reasoning techniques to a fusion of massive, rich, complex and dynamic data? How can we move from explanation to prediction?
Challenges
• Identifying relevant data and information
• Capturing and representing anomalies
• Correlating time-evolving knowledge on heterogeneous data sources
• Advanced fusion of data
Detection to Diagnosis?
• Anomaly detected: delayed buses, congested roads
• Diagnosis: a music concert next to Canal Road at 3PM
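The step from detection to diagnosis can be illustrated by looking for city events that are close to an anomaly in both space and time. This is only a sketch of that idea, not the semantic reasoning approach used in the talk: the `Observation` type, the thresholds, and all coordinates and timestamps below are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass
class Observation:
    label: str
    lat: float
    lon: float
    time: datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def candidate_causes(anomaly, events, max_km=1.0, max_gap=timedelta(hours=2)):
    """Return events near the anomaly in space and time, nearest first."""
    hits = []
    for ev in events:
        dist = haversine_km(anomaly.lat, anomaly.lon, ev.lat, ev.lon)
        if dist <= max_km and abs(anomaly.time - ev.time) <= max_gap:
            hits.append((dist, ev))
    return [ev for _, ev in sorted(hits, key=lambda h: h[0])]


# Hypothetical data: one traffic anomaly and two city events.
anomaly = Observation("delayed buses", 53.3331, -6.2489, datetime(2013, 7, 1, 15, 20))
events = [
    Observation("football match", 53.3607, -6.2511, datetime(2013, 7, 1, 15, 0)),
    Observation("music concert", 53.3335, -6.2480, datetime(2013, 7, 1, 15, 0)),
]
print([ev.label for ev in candidate_causes(anomaly, events)])
```

Proximity alone only produces candidate explanations; deciding which candidate actually caused the anomaly is where the reasoning over fused data comes in.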
Information Sources
Web Context
• Web documents
• User Queries
• Clickthrough Data
• Hyperlinks
City Context
• Structured City Data
• City-specific Web Documents
• Sensor Data
• Social Media, Check-ins
• Road Network
• Transportation Data
• City Events
• Regional Information
• Municipality
• Crime and safety
and much more…
Overview of Components in City Search (diagram)
• Search Interfaces: Traditional, Map, Mobile
• Contextual Access for Users and Applications
• Query Processing: understanding the query characteristics via classification, rewriting, expansion and semantic translation
• Retrieval and Ranking: multiple aspects of relevance for various ranking signals and aggregation
• Model Learning
• Indexing: multiple indexes needed to support different lookup patterns, e.g. text, structured, real-time and spatial indexes
• Search Indexes
• Semantic Virtual Views on Data
• Data Access and Management Layer
• Data types: Textual Data, Structured Data, Stream Data, Geospatial Data, Search Logs
• Example sources: Queries, Clickthrough Data; Social, Web, Businesses, Events, …; Sensors, Transportation, Weather, GPS, …; Municipality, Public Services, …; Road Network, POI, …
• Geographic Crawling
• Legend: IR, DM, Other
• Related tutorials: [Russell-Rose, SIGIR 2013], [Liu, SIGIR 2008], [Zhang & Jin, SIGIR 2008], [Mika & Tran, SemTech 2013]
Managing City Data
“True genius resides in the capacity for
evaluation of uncertain, hazardous, and
conflicting information.”
Winston Churchill
Smarter Cities share data … Open Urban Data is at the center of a new wave of opportunity (*)
(*) “Driving Innovation with Open Data”, Jeanne Holm, Data.gov, Feb. 9th, 2012 (Presentation to Ontology 2012)
• More than 150 city agencies and authorities worldwide have already made over 1M datasets available through open data portals.
• Open data is generating new business: McKinsey & Company estimates the economic value of big, open health data at approximately $350B annually.
Technical challenges: outline
• Creating good quality Linked Data at enterprise scale from silos of raw city data with minimal entry cost
• Next-generation analytics for diagnosis, based on Semantic Reasoning
• Understanding context and human interaction with the system and data and aggregating relevant information into views
• Discovering relevant connections across data
Approach – Graph based model [QuerioCity – Lopez et al., ISWC’12]
Pay-as-you-go: scale in terms of gain for the effort spent
• Publish, discover and search structured metadata -> Queries over metadata
• Convert files into a standard representation and extract entities -> Queries over data
• Link and partially integrate schemata -> Queries across datasets
• Integrate globally -> Queries across Web data
• Create meaningful views -> Merging data views by users
Documents + Metadata -> Structure -> Entities -> Links -> Views -> Insight
Tabular Graph
– C1 a Cell
– C1 inRow r1
– C1 value “name”
– …
Entity Graph
– e1 a Entity
– e1 inRow r1
– e1 inCol c2
– …
Annotation Graph
– e1 a Entity
– e1 rdfs:label “name”
– e1 addr “X st”
– e1 lat “53.23”
– …
Mapping Graph
– e1 a Entity
– e1 sameAs e2
– …
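The first lifting step, from a raw file to a tabular graph, can be sketched as emitting one small group of statements per cell. This is an illustration under assumptions rather than the QuerioCity implementation: the cell identifier scheme (`C<row>_<col>`), the use of `inCol` on cells, and the sample CSV are all hypothetical, and plain tuples stand in for RDF triples.

```python
import csv
import io


def tabular_graph(csv_text):
    """Emit (subject, predicate, object) triples for every cell of a CSV text."""
    triples = []
    rows = csv.reader(io.StringIO(csv_text))
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row, start=1):
            cell = f"C{r}_{c}"  # hypothetical identifier: row and column position
            triples += [
                (cell, "a", "Cell"),
                (cell, "inRow", f"r{r}"),
                (cell, "inCol", f"c{c}"),
                (cell, "value", value),
            ]
    return triples


# Hypothetical two-row file: a header row and one record.
sample = "name,addr\nCity Hall,X st\n"
for triple in tabular_graph(sample)[:4]:
    print(triple)
```

Because every cell keeps its row and column, later stages (entity extraction, annotation, mapping) can add statements about the same data without rewriting the tabular layer, which fits the pay-as-you-go idea.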
SPUD architecture (diagram) [Kotoulas et al, SWC @ISWC’12]
• Internet: External Dataset, External Catalog, External Ontology / vocab., External SPARQL Endpoint, Social Media
• IBM middleware: IBM Tivoli Access Manager and WebSEAL, IBM Tivoli Directory Server, IBM HTTP Server, IBM WebSphere Application Server, IBM Storage (SAN), IBM DB2 Enterprise Edition
• Publishing Container: Datasets, Upload, SFTP Server, Publish & Sync
• QuerioCity Enterprise Application: GUI, Download, Search, Query, Browse Catalog, Publish
• Data Layer: SPARQL Endpoint, REST APIs, Files Index, RDF Index, Vocabularies, TripleStore
• Libraries: Lucene, Apache PDFBox, Apache POI, Custom Code (Code / Jar)
• Data Fusion and Semantic Annotation
• Enterprise Apps: Diagnoser, Social Media Ingester, Trajectory Miner, Privacy and Anonymizer, Taxonomizer, …
• Roles: Publisher, Consumer
SPUD
The SPUD demonstrator won 2nd prize in the ISWC Semantic Web Challenge 2012!
[Kotoulas et al, SWC @ISWC’12 ]
Other systems of City Data Management
• STAR-CITY: Semantic Traffic Analytics and Reasoning for CITY
– http://dublinked.ie/sandbox/star-city/
• EXSED: Spatio-temporal topic detection for cities
– http://www.youtube.com/watch?v=Q_orwDfU8Yw&feature=youtu.be
• Link2Outcome: Coordinating Social Care and Health Care
– http://researcher.ibm.com/researcher/view_project.php?id=5034
…and many more at:
– http://dublinked.ie/sandbox/
Searching City Data
“We are searching for some kind of harmony between two intangibles: a form which we have not yet designed and a context which we cannot
properly describe.”
Christopher Alexander
Technical challenges: outline
Indexing
• City data comes in different forms and shapes
• Standard indexing techniques (e.g. Lucene) are useful for handling textual content
• Multiple indexes are needed to support different lookup patterns
• Real-time data
• Spatial indexing
• Handling structured data
Query Processing
• How to classify/rewrite/expand a city query?
• How to understand query context/semantics?
• Understanding query context for different domains
• Query semantics
Retrieval & Ranking
• Relevance is a tradeoff between multiple relevance aspects
• How to model relevance: e.g. “reputation” is more important for “coffee shops”, but distance for a “bank office”
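The coffee shop versus bank office example can be made concrete with a per-category weighted sum over relevance aspects. This is a minimal sketch, assuming hand-picked weights and aspect scores in [0, 1]; the names `WEIGHTS`, `score` and `rank` and all numbers are hypothetical, and a real system would learn the weights rather than fix them.

```python
# Hypothetical per-category weights over two relevance aspects.
WEIGHTS = {
    "coffee shop": {"reputation": 0.8, "proximity": 0.2},
    "bank office": {"reputation": 0.2, "proximity": 0.8},
}


def score(category, aspects):
    """Weighted sum of aspect scores, each expected to lie in [0, 1]."""
    weights = WEIGHTS[category]
    return sum(weights[name] * aspects[name] for name in weights)


def rank(category, candidates):
    """Sort (name, aspects) pairs by descending combined score."""
    return sorted(candidates, key=lambda c: score(category, c[1]), reverse=True)


places = [
    ("near but mediocre", {"reputation": 0.4, "proximity": 0.9}),
    ("far but excellent", {"reputation": 0.9, "proximity": 0.3}),
]
print([name for name, _ in rank("coffee shop", places)])
print([name for name, _ in rank("bank office", places)])
```

The same two candidates come out in opposite orders for the two categories, which is the tradeoff the slide describes: the query category decides which aspect dominates.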
State-of-the-art
• Indexing:
– Real-time: Partial Indexing
• Indexing only the records with high chance of being queried [Chen et al,
SIGMOD’11]
– Spatial: Hybrid index (IR-Tree) combining R-tree for spatial DBs with
inverted index [Cong et al, VLDB’09]
– Structure: Data partitioning, structural similarity [Tran et al, TKDE’12]
• Query Processing:
– Classification: Using query logs [Jones et al, WWW’06]
– Context: Model learning for different domains [Bai et al, SIGIR’07]
– Semantics: Top-k graph exploration over RDF [Tran et al, ICDE’09]
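To show what a combined spatial and textual lookup involves, the sketch below keeps one inverted index per spatial cell. It deliberately swaps the R-tree of the cited IR-tree for a uniform grid, which is much simpler and less capable; the class name `GridTextIndex`, the cell size, and the sample places are hypothetical.

```python
from collections import defaultdict


class GridTextIndex:
    """Uniform grid of cells, each holding its own inverted index (term -> doc ids)."""

    def __init__(self, cell_deg=0.01):
        self.cell_deg = cell_deg
        self.cells = defaultdict(lambda: defaultdict(set))

    def _cell(self, lat, lon):
        # Floor division maps a coordinate to its grid row/column.
        return (int(lat // self.cell_deg), int(lon // self.cell_deg))

    def add(self, doc_id, lat, lon, text):
        postings = self.cells[self._cell(lat, lon)]
        for term in text.lower().split():
            postings[term].add(doc_id)

    def search(self, lat, lon, query):
        """Docs in the query cell or its 8 neighbours that contain all query terms."""
        row, col = self._cell(lat, lon)
        terms = query.lower().split()
        found = set()
        for r in range(row - 1, row + 2):
            for c in range(col - 1, col + 2):
                postings = self.cells.get((r, c))
                if postings is None or not terms:
                    continue
                found |= set.intersection(*(postings.get(t, set()) for t in terms))
        return found


# Hypothetical places around Dublin city centre.
index = GridTextIndex()
index.add("cafe-1", 53.3498, -6.2603, "coffee shop")
index.add("bank-1", 53.3501, -6.2610, "bank office")
index.add("cafe-2", 53.4000, -6.1000, "coffee roastery")
print(index.search(53.3495, -6.2605, "coffee"))
```

Pruning by cell first means the text lookup only touches postings for nearby objects. An IR-tree achieves the same kind of pruning hierarchically and also supports ranked top-k retrieval, which this sketch does not attempt.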
State-of-the-art
• Relevance and ranking:
– Content:
• Structured relevance models [Bicer et al, CIKM’11]
– Structure: ObjectRank [Hristidis et al, TODS’08]
– Local popularity: Check-ins [Cheng et al, CIKM’11]
– Distance: Relative distance [Berberich et al, SIGIR’11]
– Context (Temporal, Weather, Spatial, Personal):
• Hapori [Lane et al, Ubicomp’10]
• Aggregation of multiple aspects:
– [Kang et al, WSDM’12]
STAR-CITY core features
• Analysis: Real-time interpretation of traffic data
• Diagnosis: Identifying nature and cause of anomalies
• Exploration: Contextual search and navigation
• Prediction: Context-aware traffic prediction
http://dublinked.ie/sandbox/star-city/
Exploration
• Search anomalies and events in similar context:
– Event types, e.g. roadworks vs. social
– Location, time
– Weather conditions
– Venues, tags and more…
• Multiple relevance aspects
Wrap Up
Cities and City Data
• The majority of the world's population lives in cities
• Cities are dynamic entities combining people, systems, infrastructure and businesses
• More and more city data is becoming available, enabling more insight
• City data is heterogeneous, multi-domain, noisy and big
Building the solution
• Managing City Data
– Characteristics and types of city data
– Semantic processing and lifting
– Analytics and insight
• Searching City Data
– Challenges for the IR community
– Synergy of different approaches
– Future directions
City Search
• City search is an evolution of search into the city context
• Characterized by the specific information needs of people in everyday city life
• Drift in search relevance from the Web context to the city context
• Drift in the information sources used to drive the search process
References
• Marty Himmelstein, Local search: The internet is the yellow pages, IEEE Computer, 2005
• Klaus Berberich, Arnd C. Konig, Dimitrios Lymberopoulos, Peixiang Zhao, Improving local search ranking through external logs, SIGIR 2011.
• Hannu Kukka, Vassilis Kostakos, Timo Ojala, Johanna Ylipulli, Tiina Suopajarvi, Marko Jurmu, Simo Hosio, This is not classified: everyday information seeking and encountering in smart urban spaces, Personal and Ubiquitous Computing, 2013
• Spink, A., Wolfram, D., Jansen, M. B., & Saracevic, T. (2001). Searching the web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226-234.
• Zhang, Wei Vivian, Benjamin Rey, Eugene Stipp, and Rosie Jones. Geomodification in Query Rewriting. In GIR. 2006.
References
QuerioCity / Urban Data
• V. Lopez, S. Kotoulas, M. L. Sbodio, M. Stephenson, A. Gkoulalas-Divanis, P. Mac Aonghusa. QuerioCity: A Linked Data Platform for Urban Information Management. In Use track at ISWC 2012.
• V. Lopez, S. Kotoulas, M. L. Sbodio, R. Lloyd. Guided exploration and integration of urban data. Hypertext’13.
Reasonable City
• Freddy Lecue, Jeff Z. Pan. Predicting Knowledge in an Ontology Stream. In Proc. of IJCAI 2013.
• Elizabeth M. Daly, Freddy Lecue, Veli Bicer. Westland Row Why So Slow? Fusing Social Media and Linked Data Sources for Understanding Real-Time Traffic Conditions. In Proc. of IUI 2013.
• Freddy Lecue, Anika Schumann, Marco Luca Sbodio. Applying Semantic Web Technologies for Diagnosing Road Traffic Congestions. In Proc. of ISWC 2012.
Social City
• Elizabeth M. Daly, Giusy Di Lorenzo, Daniele Quercia, Michael Muller. When the City Meets the Citizen. In Proc. of ICWSM 2012.
• Giusy Di Lorenzo, Marco Luca Sbodio, Vanessa Lopez, Raymond Lloyd. EXSED: an intelligence tool for Exploration of Social Event Dynamics. In Proc. of MDM 2013.
Stream City
• Simone Tallevi, Spyros Kotoulas, Luca Foschini, Freddy Lecue, Antonio Corradi. Real-time Urban Monitoring in Dublin using Semantic and Stream Technologies. In Use track at ISWC 2013.
Care City
• Spyros Kotoulas, Vanessa Lopez, Martin Stephenson et al. Coordinating social care and health care using Semantic Web technologies. Demo session at ISWC 2013 (submitted).
SPUD: Semantic Processing of Urban Data – Demo: www.dublinked.ie/sandbox/SemanticWebChall
• Spyros Kotoulas, Vanessa Lopez, Raymond Lloyd, Marco Luca Sbodio, Freddy Lecue, Martin Stephenson, Elizabeth Daly, Veli Bicer, Aris Gkoulalas-Divanis, Giusy Di Lorenzo, Anika Schumann, Denis Patterson, and Pol Mac Aonghusa
References
Processing and publishing Linked urban Data
• [Maali, ESWC’12] Maali, F., Cyganiak, R., Peristeras, V.: A publishing pipeline for linked government data. Proc. of ESWC, 2012.
• [Datalift] Scharffe, F., Atemezing, G., Troncy, R., Gandon, F., et al.: Enabling linked-data publication with the Datalift platform. In (AAAI'12) Workshop on Semantic Cities, 2012.
• [TWC LOGD] Ding, L., Lebo, T., Erickson, J.S. et al.: TWC LOGD: A portal for linked open government data ecosystems. Web Semantics, 2011.
• IBM City Forward: http://cityforward.org
Semantic Lifting
• [RDF123] Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A.: RDF123: From Spreadsheets to RDF. Proc. of ISWC 2008.
• [Csv2rdf4lod] http://data-gov.tw.rpi.edu/wiki/Csv2rdf4lod
• Skjæveland, M.G., Lian, E. H., Horrocks, I.: Publishing the Norwegian Petroleum Directorate’s FactPages as Semantic Web Data. In Use track at ISWC’13.
Web Tables
• Cafarella, M.J., Halevy, A., Madhavan, J.: Structured data on the Web. Communications of the ACM, 2011.
• Sarma, A., Fang, L., Gupta, N., Halevy, A., et al.: Finding Related Tables. SIGMOD '12.
Urban Dynamics
• Kling, F., Pozdnoukhov, A.: When a city tells a story. In ACM SIGSPATIAL GIS, 2012.
Evaluation Campaigns
• Blanco et al.: Repeatable and Reliable Search System Evaluation using Crowd-Sourcing. SIGIR 2011.
• [QALD, JWS’13] Lopez, V., Unger, C., Cimiano, P., Motta, E.: Evaluating Question Answering over Linked Data. Journal of Web Semantics, 2013. http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
References
• Indexing
– Chun Chen, Feng Li, Beng C. Ooi, Sai Wu. TI: an efficient indexing mechanism for real-time search on tweets. SIGMOD 2011.
– Thanh Tran, Günter Ladwig, Sebastian Rudolph. RDF Data Partitioning and Query Processing Using Structure Indexes. Transactions on Knowledge and Data Engineering, 2012.
– Cong, Gao, Christian S. Jensen, and Dingming Wu. Efficient retrieval of the top-k most relevant spatial web objects. Proceedings of the VLDB Endowment 2.1 (2009): 337-348.
• Query Processing
– Zhang, Wei Vivian, Benjamin Rey, Eugene Stipp, and Rosie Jones. Geomodification in Query Rewriting. In GIR. 2006.
– Gravano, Luis, Vasileios Hatzivassiloglou, and Richard Lichtenstein. Categorizing web queries according to geographical locality. Proceedings of the twelfth international conference on Information and knowledge management. ACM, 2003.
– Jones, Rosie, Benjamin Rey, Omid Madani, and Wiley Greiner. Generating query substitutions. In Proceedings of the 15th international conference on World Wide Web, pp. 387-396. ACM, 2006.
– Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano. Top-k exploration of query candidates for efficient keyword search on graph-shaped (RDF) data. ICDE, 2009.
– Bai, Jing, Jian-Yun Nie, Guihong Cao, and Hugues Bouchard. Using query contexts in information retrieval. In SIGIR, 2007.
References
• Retrieval
– Changsung Kang, Xuanhui Wang, Yi Chang, Belle Tseng. Learning to rank with multi-aspect relevance for vertical search. WSDM 2012.
– Nicholas D. Lane, Dimitrios Lymberopoulos, Feng Zhao, Andrew T. Campbell. Hapori: context-based local search for mobile phones using community behavioral modeling and similarity. Ubicomp, 2010.
– Klaus Berberich, Arnd C. Konig, Dimitrios Lymberopoulos, Peixiang Zhao. Improving local search ranking through external logs. SIGIR 2011.
– Cheng, Zhiyuan, et al. Toward traffic-driven location-based Web search. CIKM, 2011.
– Hristidis, Vagelis, Heasoo Hwang, and Yannis Papakonstantinou. Authority-based keyword search in databases. ACM Transactions on Database Systems (TODS) 33, no. 1 (2008): 1.
– Li, Guoliang, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, and Lizhu Zhou. EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD, 2008.
– Guo, Lin, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. XRANK: ranked keyword search over XML documents. In SIGMOD, 2003.
– Bicer, Veli, Thanh Tran, and Radoslav Nedkov. Ranking support for keyword search on structured data using relevance models. In CIKM, 2011.
References
• Other Tutorials
– Tony Russell-Rose, Designing Search Usability, Tutorial at SIGIR 2013
– Tie-Yan Liu, Learning to Rank for Information Retrieval, Tutorial at SIGIR 2008
– Yi Zhang and Rong Jin, Supervised and Semi-Supervised Learning for IR, Tutorial at SIGIR 2008
– Peter Mika, Thanh Tran, Semantic Search on the Rise, Tutorial at SemTech 2013
– Thanh Tran, Semantic Search - Focus: IR on Structured Data, ESSIR, 2011
– Yi Chen, Keyword Search on Structured and Semi-structured Data, Tutorial at SIGMOD 2009