
Deep Information and Extraction Tool


Thomas Martinuzzo, Jr. Eng.


What is DIET?
DIET is an information extraction and manipulation tool.
DIET can extract information from the DEEP web by understanding page structures.

Surface web: 20 billion pages indexed by search engines

DEEP web: more than 600 billion pages

"The 60 largest Deep Web sources contain 84 billion pages of content. That's about 750 terabytes of information, sufficient by themselves to exceed the size of the surface Web by 40 times." (Brightplanet.com)

Pic from Maxumowners.org


DIET Features & Benefits
Uses artificial intelligence to build wrappers automatically
No to minimal user intervention
Users can easily extract and manipulate information


Car website
Characteristics: list of cars by name with description, date, price, picture ... Over 100 pages of data!
Problem: no local search engine.
But ... I am looking for an Acura MDX 2005 or something like that!

Job website
Characteristics: list of jobs by title with a short description, salary, and city. Over 800 jobs. Local search engine. Sort capabilities.
Problem: only 10 jobs are shown per page. Unable to search by salary range. Unable to sort by city.
But ... I want to see all jobs over $75,000 on one single page and save it for future consultation.
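
The job-site wish above can be sketched in a few lines once the listings exist as structured records. This is only an illustration, not DIET code: the `jobs` list, its field names, and the `jobs_over` helper are all hypothetical stand-ins for whatever an extraction step would produce.

```python
# Hypothetical records, standing in for job listings already extracted
# from every result page of the site (not real data).
jobs = [
    {"title": "Software developer", "salary": 82000, "city": "Montreal"},
    {"title": "Technical writer", "salary": 61000, "city": "Quebec"},
    {"title": "Systems analyst", "salary": 90000, "city": "Laval"},
]


def jobs_over(records, minimum_salary):
    """Return every record paying more than minimum_salary, sorted by city."""
    selected = [job for job in records if job["salary"] > minimum_salary]
    return sorted(selected, key=lambda job: job["city"])


# One combined "page" of results: the salary filter and the city sort
# that the site itself does not offer.
single_page = jobs_over(jobs, 75000)
for job in single_page:
    print(f'{job["city"]}: {job["title"]} (${job["salary"]})')
```

The point of the sketch is that both missing features (salary range, sort by city) become trivial once the data is no longer locked inside paginated HTML.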


DIET Technologies
DIET Core Web Services: access restricted to certified clients
DIET Web Application: user and service managers; web-based application (JSP/Servlet/JavaServer Faces/JavaBean)
Based on Java EE 5/GlassFish/MySQL technology


Univalor Website
List of new technologies grouped by domain
Simple search engine available


Using DIET
We want to extract and then manipulate all available technologies.
Give the Univalor technologies URL to DIET:

http://www.univalor.ca/companies_available_technologies.asp


Wrappers are generated
DIET creates a Wrapper by learning the structure of the Univalor web pages.
DIET extracts data through the Wrapper.
DIET displays the results.
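
To make the idea of a wrapper concrete, here is a minimal sketch of what applying one might look like. It is not DIET's implementation: DIET learns its wrappers automatically, whereas this one is written by hand, and the `RowWrapper` class, the `extract` helper, the field names, and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class RowWrapper(HTMLParser):
    """Hand-written wrapper: maps each <td> of a <tr> to a named field."""

    def __init__(self, fields):
        super().__init__()
        self.fields = fields    # field names, in column order
        self.records = []       # extracted rows, one dict per <tr>
        self._cells = None      # cells of the row being read, or None
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._in_cell = True
            self._cells.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Keep only rows that match the structure the wrapper expects.
            if len(self._cells) == len(self.fields):
                cells = (cell.strip() for cell in self._cells)
                self.records.append(dict(zip(self.fields, cells)))
            self._cells = None


def extract(html, fields):
    """Run the wrapper over one page and return the extracted records."""
    wrapper = RowWrapper(fields)
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records


# Made-up page with a repeated row structure (not the real Univalor markup).
SAMPLE_PAGE = """
<table>
  <tr><td>Technology A</td><td>Domain X</td></tr>
  <tr><td>Technology B</td><td>Domain Y</td></tr>
</table>
"""

records = extract(SAMPLE_PAGE, ["title", "domain"])
```

The wrapper is nothing more than knowledge of the page's repeated structure; what DIET adds, according to the slides, is learning that structure instead of having someone encode it by hand.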


Manipulate information with DIET
Once the information has been extracted, it can be manipulated.


Plug-in opportunity
DIET Core Web Services can be used by third-party clients
Internet Explorer and Mozilla Firefox integration

Export capabilities
Extracted information can be exported to multiple storage formats
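
As a rough illustration of exporting to several formats, the sketch below serializes the same records as CSV and as JSON using only the Python standard library. The slides do not say which formats DIET supports, so both formats, the `to_csv` and `to_json` helpers, and the sample records are assumptions.

```python
import csv
import io
import json

# Hypothetical extracted records (not real Univalor data).
records = [
    {"title": "Technology A", "domain": "Domain X"},
    {"title": "Technology B", "domain": "Domain Y"},
]


def to_csv(rows):
    """Serialize a list of dicts as CSV text, with a header row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize the same records as indented JSON text."""
    return json.dumps(rows, indent=2)


csv_text = to_csv(records)
json_text = to_json(records)
```

Because both exporters take the same list of dicts, adding another format only means adding another function with the same input.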

And more ...
Users can create their own Wrappers
DIET can be the perfect tool for DEEP search


Research and Development: Samuel Pierre, [email protected]

Commercialization and licensing

Didier Leconte, [email protected]

Thomas Martinuzzo, Jr. [email protected]

 

Thanks!