Presentation at the Semantic Web meetup in Seattle, WA, USA, in March 2012: http://www.meetup.com/Semantically-Webbed-Seattle-Meetup-Group/events/52635992/
© Copyright 2012 SEEKDA GmbH – www.seekda.com
seekda’s Web Service Search Engine
Nathalie Steinmetz
seekda GmbH
seekda Web Service Search Engine
Motivation
“Web of services”
  Growing number of public services & data on the Web
  Problem: How do I find the service I need?
    General search engine: services hard to identify, not much information on results page
    Specific portals: access to restricted sets of registered and editorially maintained services
Use semantic technologies for a better search experience
  No to heavy-weight, expressive semantic web service languages such as OWL-S or WSML
  Yes to simple, light-weight semantic annotations in RDF
  Scalability!
Outline
Web Service search engine - basics
  Focused Crawling
  WSDL-based services
  Web APIs
Seekda’s search engine & experimental prototype
Crowdsourcing Web Service annotations
  Web Service Annotation wizard
  Amazon Mechanical Turk crowdsourcing
Service ontologies
Service Location
Locating Web Services on the Web (approach adopted by the European projects Service-Finder & SOA4All)
  Crawling the Web for services
  Aggregate information
  Annotate services
Supported services:
  WSDL descriptions
  Web APIs (a.k.a. RESTful services)
Service Crawler Architecture
[Figure: crawler architecture diagram. Recoverable labels: Collecting Seeds, Crawl Operator, Crawling, Data Post-Processing, ARCs, Index, RDF meta-data, Configuration & Monitoring]
Crawling the Web for Services
Basic crawling process:
  Start with a set of seed URLs
  Check whether a page should be fetched or not
  Fetch the document the URL points to
  Extract links from the fetched document
  Decide whether or not to store fetched documents
  Feed crawler queues with newly extracted links
  Assign costs/priorities to single URLs and queues
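The basic crawling process can be sketched in a few lines of Python. This is a minimal illustration and not seekda's crawler: the names `crawl`, `should_fetch`, `should_store` and the in-memory `PAGES` dictionary are assumptions made so the example runs without network access, and the cost/priority step is left out (a plain FIFO queue is used instead).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, should_fetch, should_store, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    queue = deque(seeds)          # start with a set of seed URLs
    seen = set(seeds)
    stored = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        if not should_fetch(url):  # check whether the page should be fetched
            continue
        html = fetch(url)          # fetch the document the URL points to
        fetched += 1
        if html is None:
            continue
        parser = LinkExtractor()   # extract links from the fetched document
        parser.feed(html)
        if should_store(url, html):  # decide whether to store the document
            stored[url] = html
        for href in parser.links:    # feed the queue with new links
            link = urljoin(url, href)
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
    return stored


# Hypothetical in-memory "web" used instead of real HTTP requests.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/svc?wsdl">S</a>',
    "http://example.org/a": '<a href="/">home</a>',
    "http://example.org/svc?wsdl": "<definitions/>",
}
result = crawl(
    ["http://example.org/"],
    fetch=PAGES.get,
    should_fetch=lambda url: url.startswith("http://example.org/"),
    should_store=lambda url, html: url.endswith("?wsdl"),
)
```

Here all three pages are fetched, but only the document whose URL ends in `?wsdl` is stored.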
Focused Crawling Techniques
Seed Collection
  Collecting seeds from specialized portals
  Reuse known Web Service descriptions and related documents
URL Scheduling
  Use clever means to prioritize URLs to focus the crawls on the relevant part of the Web
  Assign costs that influence the priority of a URL in a queue
  Based on:
    Building term vectors of pages to assess similarity to WS domain
    URL characteristics
Queue Scheduling
  One queue per host
  Prioritize queues with low-cost URLs
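One possible reading of the URL and queue scheduling described above, as a sketch. The reference vocabulary `WS_TERMS`, the cost weights, and the class name `HostQueues` are all hypothetical; the slides do not state how costs are actually computed.

```python
import heapq
import math
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical reference vocabulary for the Web Service domain.
WS_TERMS = Counter("web service wsdl soap api endpoint operation".split())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def url_cost(url, source_page_text):
    """Lower cost = higher priority. The weights are illustrative only."""
    page_vector = Counter(source_page_text.lower().split())
    cost = 1.0 - cosine_similarity(page_vector, WS_TERMS)
    if "wsdl" in url.lower():  # URL characteristic hinting at a service
        cost -= 0.5
    return cost


class HostQueues:
    """One priority queue per host; the queue with the cheapest URL is served first."""

    def __init__(self):
        self.queues = defaultdict(list)

    def push(self, url, cost):
        heapq.heappush(self.queues[urlparse(url).netloc], (cost, url))

    def pop(self):
        if not self.queues:
            raise IndexError("no URLs queued")
        host = min(self.queues, key=lambda h: self.queues[h][0])
        cost, url = heapq.heappop(self.queues[host])
        if not self.queues[host]:
            del self.queues[host]
        return url


queues = HostQueues()
queues.push("http://a.example/about",
            url_cost("http://a.example/about", "company news and press"))
queues.push("http://b.example/svc?wsdl",
            url_cost("http://b.example/svc?wsdl", "soap web service api"))
first = queues.pop()
```

The URL found on a page whose text resembles the Web Service vocabulary, and which itself contains `wsdl`, gets the lower cost and is served first.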
Identify WSDLs and Related Information
WSDL identification
  Check whether a fetched page is XML and valid WSDL
Related documents identification
  Definition of related document:
    Inlink to the WSDL
    Outlink from the WSDL
    Associated by term vector similarity
  Task split between crawl run-time and post-processing of the crawl data
  Task implies the deeper crawling of service provider domains
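A lightweight version of the WSDL identification step might look as follows. It only checks that the document is well-formed XML with a WSDL 1.1 or WSDL 2.0 root element; it does not validate against the WSDL schema, and the function name `looks_like_wsdl` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Root elements defined by WSDL 1.1 and WSDL 2.0 respectively.
WSDL_ROOTS = {
    "{http://schemas.xmlsoap.org/wsdl/}definitions",
    "{http://www.w3.org/ns/wsdl}description",
}


def looks_like_wsdl(document: str) -> bool:
    """True if the text parses as XML and has a WSDL root element."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return False  # not well-formed XML at all
    return root.tag in WSDL_ROOTS


sample = '<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" name="IP2Geo"/>'
is_wsdl = looks_like_wsdl(sample)
is_html_wsdl = looks_like_wsdl("<html><body>hi</body></html>")
```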
Unique Service Objects
Building unique service objects
  Collect all similar WSDLs (deduplication)
  One service = all WSDLs with same provider and service
  Example:
    Unique Service: http://seekda.com/providers/cdyne.com/IP2Geo
    Endpoint: http://ws.cdyne.com/ip2geo/ip2geo.asmx
    Provider: cdyne.com
    Service: IP2Geo
    WSDLs:
      http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl
      http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl
      ...
  Create unique service identifiers: http://seekda.com/providers/<providerName>/<serviceName>
  Assemble related information
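The grouping of WSDLs into unique service objects can be illustrated with the example above. This sketch assumes the provider and service names have already been extracted from each WSDL (how that is done is not covered here); `service_uri` and `group_wsdls` are hypothetical helper names.

```python
from collections import defaultdict
from urllib.parse import quote

BASE = "http://seekda.com/providers"


def service_uri(provider: str, service: str) -> str:
    """Build the identifier <BASE>/<providerName>/<serviceName>."""
    return f"{BASE}/{quote(provider, safe='')}/{quote(service, safe='')}"


def group_wsdls(records):
    """records: iterable of (provider, service, wsdl_url) tuples."""
    services = defaultdict(list)
    for provider, service, wsdl_url in records:
        uri = service_uri(provider, service)
        if wsdl_url not in services[uri]:  # skip exact duplicate WSDL URLs
            services[uri].append(wsdl_url)
    return dict(services)


services = group_wsdls([
    ("cdyne.com", "IP2Geo", "http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl"),
    ("cdyne.com", "IP2Geo",
     "http://miki2005.uda.ad/p1net/Web%20References/com.cdyne.ws/ip2geo.wsdl"),
    ("cdyne.com", "IP2Geo", "http://ws.cdyne.com/ip2geo/ip2geo.asmx?wsdl"),
])
```

Both WSDL copies end up under the single identifier `http://seekda.com/providers/cdyne.com/IP2Geo`.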
Search Results
Service Overview
seekda Web Service Search Engine
WSDL ONLY
Why crawl for Web APIs?
Significant growth of Web APIs
  > 5,400 Web APIs on ProgrammableWeb (including SOAP and REST APIs) [end of 2009: ca. 1,500 Web APIs]
  > 6,500 Mashups on ProgrammableWeb (combining Web APIs from one or more sources)
SOAP services are only a small part of the overall available public services
Web API – Example (1/3)
Web API – Example (2/3)
Web API – Example (3/3)
Problem: Web APIs are described by regular HTML pages
  No standardized structure that helps with the identification
Web API Identification
Solution: Crawl for Web APIs
  Approach 1: Manual Feature Identification
    Taking into account HTML structure (e.g., title, mark-up), syntactical properties of the language used (e.g., camel-cased words), and link properties of pages (ratio of external links to internal links)
  Approach 2: Automatic Classification
    Text classification, supervised learning (Support Vector Machine model)
    Training set: APIs from ProgrammableWeb
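To make Approach 1 concrete, the sketch below computes three features of the kind listed: a title hint, the share of camel-cased words, and the ratio of external to internal links. The exact features, the regular expression, and the names `PageFeatures` and `extract_features` are assumptions; the slides do not specify them, and the SVM of Approach 2 is not shown.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Words such as getLocation: lower-case start followed by capitalised parts.
CAMEL_CASE = re.compile(r"\b[a-z]+(?:[A-Z][a-z0-9]+)+\b")


class PageFeatures(HTMLParser):
    """Collects title text, body text and link counts from one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.in_title = False
        self.title = ""
        self.text = []
        self.internal = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                target = urlparse(urljoin(self.page_url, href)).netloc
                if target == self.host:
                    self.internal += 1
                else:
                    self.external += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        self.text.append(data)


def extract_features(page_url, html):
    parser = PageFeatures(page_url)
    parser.feed(html)
    text = " ".join(parser.text)
    words = text.split()
    camel = CAMEL_CASE.findall(text)
    return {
        "title_mentions_api": "api" in parser.title.lower(),
        "camel_case_ratio": len(camel) / len(words) if words else 0.0,
        "external_to_internal": parser.external / max(parser.internal, 1),
    }


features = extract_features(
    "http://example.org/api",
    "<title>Geo API reference</title><p>Call getLocation or lookupAddress</p>"
    '<a href="/docs">d</a><a href="http://other.example/x">x</a>',
)
```

Feature dictionaries like this one could be thresholded by hand, or fed to a classifier trained on known API pages.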
Unique Service Objects – Web APIs
Create unique identifiers:
  Again using the provider name (from the Web API homepage)
  We do not know the service name, so a hash value of the URL is used instead
  http://seekda.com/providers/<providerName>/<hashValueOfURL>
But: human confirmation is still needed to be sure
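A hash-based identifier for a Web API could be built as follows. The choice of SHA-1 and the derivation of the provider name from the host name are assumptions made for this sketch; the slides name neither the hash function nor the exact provider extraction.

```python
import hashlib
from urllib.parse import urlparse

BASE = "http://seekda.com/providers"


def web_api_uri(api_url: str) -> str:
    """Build <BASE>/<providerName>/<hashValueOfURL> for a Web API page."""
    provider = urlparse(api_url).netloc.lower()
    if provider.startswith("www."):  # assumption: drop a leading "www."
        provider = provider[4:]
    # Assumption: SHA-1 of the full URL stands in for the unknown service name.
    digest = hashlib.sha1(api_url.encode("utf-8")).hexdigest()
    return f"{BASE}/{provider}/{digest}"


uri = web_api_uri("http://www.example.com/developers/api")
```

The same URL always maps to the same identifier, while two different documentation pages of one provider get different ones.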
New Search Engine Prototype
Prototype – User Contributions
Web API – yes/no: confirmation from human needed!
Other annotations that help improve the search for Web Services
  Categories
  Tags
  Natural language descriptions
  Cost: free or paid service
Problem - User Contribution
Problem: Users/developers don’t contribute enough
  Hard to motivate them to provide annotations
  Community recognition or peer respect not enough
Solution: crowdsourcing the annotations, pay people to provide annotations
  Use Amazon Mechanical Turk
  Bootstrap annotations quickly and cheaply
Service Annotation Wizard (1/4)
Service Annotation Wizard (2/4)
Service Annotation Wizard (3/4)
Service Annotation Wizard (4/4)
Amazon Mechanical Turk – Iteration 1
Annotation Wizard
  Web API Yes/No
  Assign a category
  Assign tags
  Provide a natural language description
  Determine whether page is documentation, pricing or listing
  Rate the service

Number of submissions: 70
Reward per task: $0.10
Restrictions: none
Amazon Mechanical Turk – Iteration 1
Results
  21 APIs correctly identified as APIs
  28 Web documents (non-APIs) correctly identified as non-APIs
  49/70 correctly identified (70% accuracy)
  Average task completion time: 2:20 min
But, only:
  4 well done & complete annotations
  8 acceptable annotations (incomplete)
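The accuracy figure follows directly from the counts above, as this short check shows. Summing the complete and the acceptable annotations into one "usable" number is an assumption made here for illustration; the slides report the two counts separately.

```python
api_hits = 21        # APIs correctly identified as APIs
non_api_hits = 28    # non-API documents correctly identified as non-APIs
submissions = 70

correct = api_hits + non_api_hits
accuracy = correct / submissions

usable = 4 + 8       # assumption: complete plus acceptable annotations
usable_share = usable / submissions

print(f"{correct}/{submissions} correct = {accuracy:.0%}")
print(f"{usable}/{submissions} usable annotations = {usable_share:.0%}")
```

So although 70% of the yes/no decisions were right, only about one submission in six carried an annotation of acceptable quality.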
Amazon Mechanical Turk – Iterations 2 & 3
Annotation Wizard
  Removed page type identification & service rating
  For a task to be accepted:
    At least one category must be assigned
    At least 2 tags must be provided
    A meaningful description must be provided
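The acceptance rules can be expressed as a small validation function. The `Annotation` record and the word-count threshold `MIN_DESCRIPTION_WORDS` are hypothetical: the slides do not say how "meaningful" was judged, so a minimum length is used here only as a stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    is_web_api: bool
    categories: list = field(default_factory=list)
    tags: list = field(default_factory=list)
    description: str = ""


MIN_DESCRIPTION_WORDS = 5  # hypothetical proxy for "meaningful"


def is_acceptable(annotation: Annotation) -> bool:
    """Apply the three acceptance rules to one submitted annotation."""
    has_category = len(annotation.categories) >= 1
    # Count distinct, non-empty tags so "maps, maps" does not pass as two.
    distinct_tags = {t.strip().lower() for t in annotation.tags if t.strip()}
    has_tags = len(distinct_tags) >= 2
    has_description = len(annotation.description.split()) >= MIN_DESCRIPTION_WORDS
    return has_category and has_tags and has_description


good = Annotation(True, ["Geography"], ["geocoding", "maps"],
                  "Looks up the location of an IP address")
poor = Annotation(True, ["Geography"], ["maps", "Maps "], "Geo API")
accepted = [is_acceptable(good), is_acceptable(poor)]
```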
                        Iteration 2   Iteration 3
Number of submissions   100           150
Reward per task         $0.20         $0.20
Restrictions            yes           yes
Amazon Mechanical Turk – Iterations 2 & 3
Results Iterations 2 & 3:
  Ca. 80% of documents correctly identified
  Very satisfying annotations
  Average completion time: 2:36 min
Amazon Mechanical Turk – Survey
48 survey submissions
  Female 18, Male 30
  Most popular origins: India (27) and USA (9)
  Popular age groups:
    15-22 (12)
    23-30 (18)
    31-50 (16)
  Most of them worked in some IT profession
    Provided best quality annotations
Amazon Mechanical Turk
Recommendations for further improvement:
  Improve task description, especially ‘what is a Web API’
  Better examples (e.g., hinting what makes a false page false)
  Allow assignment of multiple categories
  Restrict to workers in IT professions?
Conclusion:
  Very positive results: a good way to get quality annotations
  Results will help provide a better search experience to users
  Results can be used as a positive set for automatic classification
Service Ontologies (1/2)
Service Ontologies (2/2)
http://www.service-finder.eu/ontologies/ServiceCategories
Questions?