Abe Lederman, President and CTO
Deep Web Technologies, Inc.
ScienceEducation.gov Meeting
National Academy of Sciences, March 18, 2009
A Look at the Technology
Under the Hood
Content Integration Technologies for ScienceEducation.gov
• Crawling and Indexing (Part of Science.gov, E-Print Network)
• Federated Search (Science.gov, WorldWideScience.org)
ScienceEducation.gov needs to successfully integrate content from a
variety of websites and databases, which requires custom tools that other search engines are unable to provide.
Drawing on the Experience of the E-Print Network
Gateway to 30,000 websites and databases worldwide, containing over 5 million e-prints in basic and applied sciences.
Drawing on the Experience of the E-Print Network
• Initially developed in 2001
• Crawls and indexes 30,000 websites
• Uses sophisticated filters to ensure that only quality e-prints are included in the Network
• Contains full-text index of over 1.5 million e-prints
• Uses an Admin Tool to manage websites in the E-Print Network
What is Federated Search?
Federated Search is an application or service that allows a user to submit a
search in parallel to multiple, distributed information sources
and retrieve aggregated, ranked and de-duped results.
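The definition above can be illustrated with a minimal sketch. It is not the Science.gov or WorldWideScience.org implementation: the `Result` type, the score scale, the URL-based de-dupe key, and the two fake sources are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, NamedTuple


class Result(NamedTuple):
    url: str
    title: str
    score: float  # higher is better; the scale is an assumption


# A source is any callable that takes a query and returns results.
Source = Callable[[str], List[Result]]


def federated_search(query: str, sources: Dict[str, Source]) -> List[Result]:
    """Send one query to every source in parallel, then merge, de-dupe, rank."""
    if not sources:
        return []
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        per_source = list(pool.map(lambda search: search(query), sources.values()))

    best: Dict[str, Result] = {}
    for results in per_source:
        for r in results:
            key = r.url.rstrip("/").lower()  # naive de-dupe key (assumption)
            if key not in best or r.score > best[key].score:
                best[key] = r
    # Rank the merged list by score, best first.
    return sorted(best.values(), key=lambda r: r.score, reverse=True)


# Hypothetical stand-ins for real agency connectors.
def fake_source_a(query: str) -> List[Result]:
    return [Result("http://example.org/a", "A", 0.9),
            Result("http://example.org/b", "B", 0.4)]


def fake_source_b(query: str) -> List[Result]:
    return [Result("http://example.org/b/", "B", 0.7)]


merged = federated_search("oceans", {"A": fake_source_a, "B": fake_source_b})
for r in merged:
    print(r.score, r.url)
```

Real sources return scores on different scales, so a production system would normalize or re-rank results rather than compare raw scores as this sketch does.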
In Other Words… One Search, Many Sources
[Diagram: one Search box connected to DOD, EPA, NASA, FDA, NIH, DOE, NSF, and Other Agencies]
Assembling the ScienceEducation.gov Search Engine - Part I
Assemble Starting URLs
Education Experts
Assembling the ScienceEducation.gov Search Engine - Part II
Starting URLs
Crawl Websites
Filter Bad URLs and Remove Duplicates
Build Index
Assign Learning Levels
ScienceEducation.gov Index
Challenges Ahead
• Determining what sites to crawl
• Filtering undesirable URLs
• Assigning appropriate learning level to content
• Categorizing content
To Crawl or Not To Crawl?
[Figure labels: "Would miss these", "Don't crawl these pages", "Will crawl these"]
Filtering Undesirable URLs
All Crawled URLs → Filter → Good URLs
Filter terms: Calendar, Contact, Feedback, Housing, ..., Registration, Survey
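One way such a filter could work is a simple term match on each URL's path. The sketch below assumes case-insensitive substring matching, which the slide does not specify, and it uses only the terms the slide names (the slide's list is elided, so this set is partial). The example URLs are hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urlparse

# Terms named on the slide; the slide's full list is elided, so this is partial.
UNDESIRABLE_TERMS = ("calendar", "contact", "feedback", "housing",
                     "registration", "survey")


def is_good_url(url: str, terms: Iterable[str] = UNDESIRABLE_TERMS) -> bool:
    """Return False if the URL path contains any undesirable term."""
    path = urlparse(url).path.lower()
    return not any(term in path for term in terms)


def filter_urls(urls: Iterable[str]) -> List[str]:
    """Keep only the URLs that pass the term filter, preserving order."""
    return [u for u in urls if is_good_url(u)]


# Hypothetical crawled URLs.
crawled = [
    "http://example.org/lessons/volcanoes.html",
    "http://example.org/contact.html",
    "http://example.org/events/Registration/",
]
good = filter_urls(crawled)
print(good)
```

Substring matching is blunt: it would also reject a legitimate page whose path happens to contain one of the terms, so a real filter would likely need exceptions or per-site rules.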
Removing Duplicate Web Pages
URL: http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/education_threats.html
DUP: http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/ocean_planet_book_threats.html
TITLE: Ocean Planet: Threats
SNIPPET: Threats to the health of the oceans Oil spills account for only about five percent of the oil entering the oceans The Coast Guard estimates that for United States waters sewage treatment plants discharge twice as much oil each year as tanker spills Each year industrial household cleaning gardening and automotive products pollute water About 65 000 chemicals are used commercially in the United States today with about 1 000 new ones added each year Only about 300 have been extensively tested for toxicity It is estimated that medical waste that washed up onto Long Island and New Jersey beaches in the summer of 1988 cost as much as 3 billion in lost revenue from tourism and recreation.
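The slides do not say how duplicates are detected. A minimal sketch, assuming exact-duplicate detection by hashing normalized page text, is shown below; the page text is a shortened stand-in for the snippet above, and the function names are hypothetical.

```python
import hashlib
import re
from typing import Dict, List, Tuple


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing punctuation/whitespace."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages: Dict[str, str]) -> List[Tuple[str, str]]:
    """Given url -> page text, return (kept_url, duplicate_url) pairs."""
    seen: Dict[str, str] = {}
    duplicates: List[Tuple[str, str]] = []
    for url, text in pages.items():
        fp = fingerprint(text)
        if fp in seen:
            duplicates.append((seen[fp], url))  # first URL seen is kept
        else:
            seen[fp] = url
    return duplicates


base = "http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/"
pages = {
    base + "education_threats.html": "Threats to the health of the oceans.",
    base + "ocean_planet_book_threats.html": "Threats to the health of the  oceans",
}
dups = find_duplicates(pages)
print(dups)
```

This only catches pages whose text is identical after normalization. Pages that differ by a banner or footer would need a near-duplicate technique such as shingling, which is outside this sketch.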
Learning Level Stratification
Categorizing Content
• Audience: Student or Teacher
• Grade Level: K-3, 4-6, 7-9, 10-12, College
• Content Type: Interactive Activities, Lesson Plans, Reference Materials, Science Fair Projects, Videos
• Subject Area: Chemistry, Computer Science, Energy, Life Sciences, Mathematics, Physics
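The four facets above could be represented as a small validated record. This is a sketch only: the class and field names are hypothetical, and it assumes each resource carries exactly one value per facet, which the slide does not state.

```python
from dataclasses import dataclass

# Allowed values, copied from the slide.
AUDIENCES = ("Student", "Teacher")
GRADE_LEVELS = ("K-3", "4-6", "7-9", "10-12", "College")
CONTENT_TYPES = ("Interactive Activities", "Lesson Plans", "Reference Materials",
                 "Science Fair Projects", "Videos")
SUBJECT_AREAS = ("Chemistry", "Computer Science", "Energy", "Life Sciences",
                 "Mathematics", "Physics")


@dataclass(frozen=True)
class ResourceTags:
    """Category tags for one indexed resource (one value per facet, by assumption)."""
    audience: str
    grade_level: str
    content_type: str
    subject_area: str

    def __post_init__(self) -> None:
        # Reject any value that is not in the slide's list for its facet.
        checks = {
            "audience": (self.audience, AUDIENCES),
            "grade_level": (self.grade_level, GRADE_LEVELS),
            "content_type": (self.content_type, CONTENT_TYPES),
            "subject_area": (self.subject_area, SUBJECT_AREAS),
        }
        for name, (value, allowed) in checks.items():
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {allowed}")


tags = ResourceTags("Teacher", "4-6", "Lesson Plans", "Energy")
print(tags)
```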
Thank you!
Abe Lederman