Enhanced Content Delivery
Action 2: Mine the Web
Industrial Day
Rome, 10 June 2004
ECD - Industrial Day, Roma 10 Giugno 2004
Action 2 - Partners
ICAR-CNR, Cosenza
KDD & HPC Labs ISTI-CNR, Pisa
Dipartimento di Informatica, Università di Pisa
Action 2 – Mine the Web
The project: four Work Packages (Action Coordinator: Dott. Fosca Giannotti, ISTI-CNR)
• Work Package 2.1. Web Mining (UNIPI, ISTI, ICAR)
  WP Coordinator: Dott. Salvatore Ruggieri, Dip. Informatica
• Work Package 2.2. Indexing and compression (UNIPI)
  WP Coordinator: Prof. Paolo Ferragina, Dip. Informatica
• Work Package 2.3. Managing Terabytes (ISTI, ICAR)
  WP Coordinator: Dott. Raffaele Perego, ISTI-CNR
• Work Package 2.4. Participatory Search Services (UNIPI)
  WP Coordinator: Prof. Maria Simi, Dip. Informatica
Action 2 – Mine the Web
The main goals of the ECD Project, content enhancement and delivery, are pursued here in a way complementary to Action 1
The focus is on delivering enhanced Web contents to (communities of) users:
• Exploiting Web Mining to extract knowledge/models that can be used to enhance the efficacy and efficiency of the various phases of the information search process
• Designing, validating and providing efficient and scalable solutions for retrieving, storing, and delivering Web contents to users
Motivations
On-line data grows rapidly:
• 50+M new pages/day (source: IBM)
• 100+k news, articles/day (source: IBM)
• Databases, digital libraries, etc.
Internet use tracking produces additional interesting data:
• Server logs, WSE logs, network traffic logs
Goldman Sachs estimates (2002): “between 80 and 90 percent of information on the Internet and corporate networks is unstructured”
Motivations
The limits of the current means of access to web contents are becoming clear:
• Low precision and quality, difficulty of matching users’ subjective relevance
  - over-abundance of low-quality web material
• Low coverage and freshness
  - much relevant information is in the hidden web
  - ranking mechanisms penalize important pages that have just entered the scene
• Difficulties in
  - managing size, complexity, heterogeneity
  - identifying patterns and trends within huge amounts of unstructured contents
Web Mining plays an important role: it makes it possible to synthesize and extract precious information and knowledge
Web Mining
Data Mining: the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other repositories
Web Mining: exploiting Data Mining techniques with data coming from the Web
Goal: assist users or site owners in finding something useful/interesting/relevant
User-Centric View (Client-Side)
• discovery of documents on a subject
• discovery of semantically related documents or document segments
• extraction of relevant knowledge about a subject from multiple sources
Owner-Centric View (Server-Side)
• increasing contact / conversion efficiency (Web marketing)
• targeted promotion of goods, services, products, ads
• measuring effectiveness of site content / structure
• providing dynamic personalized services or content
Web Mining Taxonomy
Web Mining
• Web Usage Mining
• Web Content Mining
• Web Structure Mining
Example of usage data (web server log):
131.114.21.41 - - [27/May/2004:19:24:00 +0200] "GET /images/finger.jpg HTTP/1.1" 304 -
131.114.21.41 - - [27/May/2004:19:24:00 +0200] "GET /images/logokdd.jpg HTTP/1.1" 304 -
131.114.21.41 - - [27/May/2004:19:24:09 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 200 131072
131.114.21.41 - - [27/May/2004:19:24:12 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 206 196608
131.114.21.41 - - [27/May/2004:19:24:13 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 206 338224
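Log lines like the ones above are the raw material of Web Usage Mining. As a minimal sketch, assuming the server writes the common access-log layout seen in the sample (host, identity, user, timestamp, request, status, size), one line could be turned into a structured record as follows. The names `LOG_LINE` and `parse_log_line` are illustrative and not part of the project.

```python
import re
from datetime import datetime

# Pattern for the access-log layout shown in the sample lines (an assumption
# about the server configuration, not something stated on the slide).
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict describing one request, or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" size means that no body was sent (e.g. status 304).
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = ('131.114.21.41 - - [27/May/2004:19:24:09 +0200] '
          '"GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 200 131072')
record = parse_log_line(sample)
```

Records of this kind can then be grouped by host and time window to approximate user sessions.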
Web Mining Applications
Web Usage Mining
• discovering customer preference and behavior
• Web personalization / collaborative filtering
• adaptive Web sites / improving Web site organization
• e-business intelligence, etc.
Web Content Mining
• information filtering / knowledge extraction
• Web document categorization
• discovery of ontologies on the Web, etc.
Web Structure Mining
• finding "quality" or "authoritative" sites based on linkage and citations
  - IBM CLEVER project
  - Google
• Etc.
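To make "authoritative sites based on linkage" concrete, here is a minimal sketch of a PageRank-style power iteration over a small link graph. It is a simplified textbook formulation and not the actual algorithm used by Google or by the CLEVER project. The function name, the damping value and the toy graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by link structure; `links` maps a page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph: three pages cite "hub", which links back to "a" only.
toy_links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
scores = pagerank(toy_links)
most_authoritative = max(scores, key=scores.get)
```

In this toy graph the page cited by all the others ends up with the highest score.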
Some related projects
• WebFountain - IBM
• WebBase - Stanford DBGroup
WebFountain
Input sources:
• World-Wide Web, News
• Forums, Weblogs, etc.
• Newspapers, Magazines, etc.
• Customer Electronic Text
WebFountain: Infrastructure for Advanced Text Analytics
Finds patterns, trends and relationships in text
Application Examples:
• Marketing
• Intelligence
• Research
(source: IBM)
WebFountain: an infrastructure for Advanced Text Analytics applications
[Figure: WebFountain architecture. Components shown: Customer DBs, 3rd Party DBs, Application Server, Customers, Internet, Intranets, News Feeds, Crawler, Structured Data Gatherer, Data Store, Information Miners, Communications Infrastructure, Index(es), Cluster Management System]
PROJECT WF INFRASTRUCTURE
• Cluster capacity: ½ petabyte
• Number of pages in store: 2,000,000,000
• Number of pages crawled per day: 25,000,000
• Number of pages mined per second: 10,000
• Number of 73 GB hard drives: 3674
• Number of CPUs: 1231
• Number of scientists and researchers who have contributed to WebFountain technology: 250
• Patents pending: 100
• Patents issued: 75
• Traffic coming in from the Internet: 70 megabytes/sec
• Time to complete query: 5 minutes, 22 seconds
• Number of countries contributing to technology: 5
WebFountain: Reputation Tracking
WebBase - Stanford DBGroup
WebBase Challenges
• Scalability: crawling, archive distribution, index construction, storage
• Consistency: freshness, versions
• Dissemination
• Archiving: “units”, coordination
• IP Management: copy access, link access, access control
• Hidden Web
• Topic-Specific Collection Building
Action 2 – Mine the Web: application scenario
So far, hardly any approach analyzes how a given group of users accesses the Web, with the aim of exploiting usage information to provide the users of that group with enhanced access to web resources
We think that it is possible to learn, from the usage data of a group of web users, new models and patterns that, in combination with document content and structure, may yield enhanced content access and delivery:
• better search services
• better categorization and document classification services
• better question answering services
Action 2 – Mine the Web
Ambitious objective: exploit the combination of Web data about
USAGE, STRUCTURE, CONTENT
originated/accessed by a Virtual Organization, to improve the efficacy and efficiency of the knowledge extraction process from the users' point of view
Developing solutions that are:
• Innovative w.r.t. the state of the art
• Appropriate for the Web domain
Virtual Organizations
[Figure: a Virtual Community and the Internet]
Tracking Virtual Organizations
Tracking the interaction of the virtual community with the Internet allows us to collect a variety of interesting information
Network traffic data provide detailed information about:
• Usage
  - Preferred sites, user sessions
• Content
  - Accessed documents
• Structure
  - From client sessions we can build the usage Web subgraph
  - By parsing the retrieved documents we can build the corresponding link graph
[Figure: Virtual Community]
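The two graphs mentioned under "Structure" can be sketched in a few lines. The example below is only an illustration under simple assumptions: a session is an ordered list of visited URLs, and a document is an HTML string keyed by its URL. The function and class names are hypothetical and do not come from the WOS code.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the absolute targets of <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def link_graph(documents):
    """Link graph: URL -> sorted list of URLs found in the retrieved document."""
    graph = {}
    for url, html in documents.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        graph[url] = sorted(set(parser.links))
    return graph

def traffic_graph(sessions):
    """Usage subgraph: (source, destination) -> number of observed transitions."""
    edges = Counter()
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            edges[(src, dst)] += 1
    return edges

documents = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="http://example.org/b.html">B</a>',
}
sessions = [["x", "y", "z"], ["x", "y"]]
links = link_graph(documents)
traffic = traffic_graph(sessions)
```

Overlaying `traffic` on `links` shows which hyperlinks are actually followed by the community and which transitions happen without a link.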
Tracking Virtual Organizations
[Figure: three views of the Virtual Community's Web subgraph: Link graph, Traffic graph, Link and Traffic graph]
We need an infrastructure: the Web Object Store (WOS)
A Web Data Management System optimized to efficiently handle content, usage, and structure web data
Purpose: Enable (possibly) innovative Web IR and Web Mining research by locally providing a small, but significant, portion of the Web built according to our user-centric view
Manage large collections of
• Web pages
• Preprocessed usage data
• Structure data
collected within our virtual community
WOS and related activities
Related activities:
- Clustering emails
- Caching of documents and of query results
- Efficient and scalable pattern mining and clustering algorithms
- Enhanced compression methods
- Clustering/categorizing query result snippets
- Clustering XML documents
- Etc.
WOS layers:
Clustering/Pattern/Classification Web Mining algorithms
Efficient and scalable access methods:
• IXE B-trees, full-text indexes
• search in compressed data
Data cleaning, preprocessing, filtering
Population:
• raw traffic data of our community
• IXE Crawler
• Participatory search
Efficient and scalable storage:
• IXE persistent objects
• compression
• distributed architecture
Persistent store of objects
• Web data management system for web content, structure and usage data
• Management of data at many abstraction levels
Fast development of new applications
• Easy C++ annotation of new persistent objects
• Read and write data in tables
WOS applications
Some innovative applications are currently being pursued within our project:
• Characterization, on the basis of usage only or of usage + contents + structure, of new important emerging sites, or of irrelevant sites (e.g., advertising sites)
  - crucial to steer the crawler of the community web repository towards fresh, relevant documents while avoiding unimportant ones
• Page ranking based also on usage information, to achieve a more accurate and dynamic measurement of document relevance
• Recommendation of similar/related documents and keywords, on the basis of combined usage/content analysis
• Caching and clustering of web search results
WOS population: usage data (WP 2.1)
• Many-to-many interactions
• Inter-site user sessions
• Massive data
  - millions of HTTP requests per day
  - ~1 GB/day of raw data
We collected proxy-level IP traffic over long periods, originated from the SERRA network (domain unipi.it), i.e. the whole University of Pisa
WOS population: content data (WP 2.4)
Methods to gather contents to populate the Web Object Store:
• IXE Crawler
• Participatory Search System (main activity this year)
• Hidden Web Search
WOS population: content data (WP 2.4)
[Figure: IXE crawler loop. Steps: init (from the initial urls), get next url, get page (web pages fetched from the Internet), extract urls]
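The loop in the figure (init, get next url, get page, extract urls) can be sketched as below. This is a single-threaded illustration, not the IXE crawler: the `fetch` function is injected by the caller so that the example runs without network access, and the regular expression is a deliberately crude stand-in for a real HTML parser. All names are assumptions.

```python
import re
from collections import deque
from urllib.parse import urljoin

HREF = re.compile(r'href="([^"]+)"')  # crude link extraction, for illustration only

def crawl(initial_urls, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns the page text, or None on failure."""
    frontier = deque(initial_urls)      # init: seed the queue with the initial urls
    seen = set(initial_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()        # get next url
        html = fetch(url)               # get page
        if html is None:
            continue
        pages[url] = html
        for href in HREF.findall(html):  # extract urls
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# A tiny in-memory "web" standing in for the Internet.
fake_web = {
    "http://example.org/": '<a href="/a">a</a><a href="/b">b</a><a href="/missing">m</a>',
    "http://example.org/a": '<a href="/">home</a>',
    "http://example.org/b": "",
}
pages = crawl(["http://example.org/"], fake_web.get)
```

Replacing `fake_web.get` with a function that performs real HTTP requests would turn the sketch into a working, if slow, crawler.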
IXE Crawler
• Parallel/distributed crawler
• High performance through:
  - asynchronous I/O (500 connections/thread)
  - asynchronous DNS resolution
  - keep-alive connections
  - multi-threading
  - URL compression
• 9 Mb/sec transfer rate (7 times the nutch.org crawler)
Participatory search: the idea
Participatory search:
• each participant builds an index of its local contents and sends it to a central server
• the central server implements a community search service by collecting and merging the participants' indexes
A model that fits community needs for dedicated search services
A trade-off between a centralized search model (e.g., Google) and a distributed approach (e.g., Gnutella, Kazaa)
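The division of labour described above can be illustrated with a toy inverted index: each participant indexes its own documents, and the central server only merges the indexes it receives and answers queries. This is a minimal sketch, not the project's actual index format or protocol. The function names and the whitespace tokenization are assumptions.

```python
from collections import defaultdict

def build_index(documents):
    """Participant side: map each term to the ids of the local documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def merge_indexes(participant_indexes):
    """Server side: merge the received indexes, tagging each document with its owner."""
    merged = defaultdict(set)
    for participant, index in participant_indexes.items():
        for term, doc_ids in index.items():
            for doc_id in doc_ids:
                merged[term].add((participant, doc_id))
    return merged

def search(merged, query):
    """Return the (participant, document) pairs containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(merged.get(terms[0], set()))
    for term in terms[1:]:
        results &= merged.get(term, set())
    return results

site_a = build_index({"a1": "web mining course", "a2": "data compression"})
site_b = build_index({"b1": "web search engines", "b2": "mining usage data"})
community = merge_indexes({"site_a": site_a, "site_b": site_b})
hits = search(community, "web mining")
```

The server never crawls the participants: it sees only the indexes they choose to publish.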
Participatory Search
[Figure: comparison of Centralized, Participatory and Distributed search, showing where the Crawler (C), Indexer (I) and Search Engine (S) components run. Labels: Search, Index, Search results, Documents]
Legend: C - Crawler, I - Indexer, S - Search Engine
Participatory Search: benefits
Participants are in charge of selecting
• what to index and to publish
• when to publish (no need for coordination with an external crawler)
Control over index update and freshness
Publishing of Hidden Web content
Storage and access methods: compression (WP 2.2)
Our technique takes a poor compressor A and turns it into a compressor Aboost with a better performance guarantee
[Figure: A compresses s into c; the Booster, using A, compresses s into c’]
• The better A is, the better Aboost is
• The more compressible s is, the better Aboost is
Qualitatively, we show that
• c’ is shorter than c, if s is compressible
• Time(Aboost) = Time(A), i.e. no slowdown
• A is used as a black box
Key components: Burrows-Wheeler Transform, Suffix Tree, and a greedy processing of them
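Of the key components named on this slide, the Burrows-Wheeler Transform is the easiest to show in isolation. The sketch below is the naive textbook construction by sorting rotations, with its inverse. It illustrates only the transform, not the boosting technique itself, and it is far slower than the suffix-based constructions a real implementation would use. The `$` sentinel and the function names are assumptions.

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler Transform by sorting all rotations (quadratic space, demo only)."""
    assert sentinel not in s, "the sentinel must not occur in the input"
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column, sentinel="$"):
    """Rebuild the original string by repeatedly prepending and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]

transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform tends to place equal characters next to each other, which is the kind of local regularity that a downstream compressor can exploit.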
Storage and access methods (WP 2.1 and 2.2)
Repository of URLs
• Compressed
• Prefix and suffix search within URLs
• Search by hostname, path, file extension, …
  select count(*) from … where url LIKE 'http://%.it/%.asp'
• Up to two orders of magnitude faster than sequential scan and B-tree
• Space occupancy << B-tree
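Prefix and suffix search over a URL collection can be illustrated with two sorted arrays and binary search: one holding the URLs, the other holding the reversed URLs. This is only an uncompressed sketch of the query semantics, not the compressed index presented on the slide. The class name `UrlIndex` and its methods are hypothetical.

```python
import bisect

class UrlIndex:
    """Sorted-array index answering prefix and suffix queries by binary search."""
    def __init__(self, urls):
        self.forward = sorted(set(urls))
        # Reversed copies turn a suffix query into a prefix query.
        self.backward = sorted(url[::-1] for url in self.forward)

    @staticmethod
    def _range(sorted_keys, prefix):
        lo = bisect.bisect_left(sorted_keys, prefix)
        # Upper bound: assumes no key contains the code point U+10FFFF.
        hi = bisect.bisect_left(sorted_keys, prefix + "\U0010ffff")
        return sorted_keys[lo:hi]

    def with_prefix(self, prefix):
        return self._range(self.forward, prefix)

    def with_suffix(self, suffix):
        return sorted(key[::-1] for key in self._range(self.backward, suffix[::-1]))

index = UrlIndex([
    "http://www.unipi.it/index.asp",
    "http://www.unipi.it/home.html",
    "http://example.org/a.asp",
    "https://www.cnr.it/x.asp",
])
asp_pages = index.with_suffix(".asp")
unipi_pages = index.with_prefix("http://www.unipi.it/")
```

A query such as the `LIKE` pattern above could then be answered by intersecting a prefix result with a suffix result and checking the host part.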
Storage and access methods: index compression (WP 2.3)
Assigning DocIDs in a clever way can improve the compression factor of traditional variable-[bit/byte] encoding methods by increasing the number of small DGaps.
Clustering property: within each posting list there are dense zones (i.e., many small DGaps).
Our problem consists of enhancing the clustering property of posting lists.
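The effect of small DGaps on a variable-byte code can be checked with a short experiment. The sketch below uses one common variable-byte layout (7 payload bits per byte, high bit set on the last byte of each number). The two posting lists are invented for illustration, and the project's actual encoder may differ.

```python
def d_gaps(doc_ids):
    """Turn a sorted posting list into the gaps between consecutive DocIDs."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def vbyte_encode(numbers):
    """Variable-byte code: 7 bits per byte, high bit marks the final byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:                 # final byte of the current number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

scattered = [3, 700, 40000, 90000, 250000]   # same term, DocIDs far apart
clustered = [1, 2, 3, 4, 5]                  # same term after a better DocID assignment
scattered_bytes = vbyte_encode(d_gaps(scattered))
clustered_bytes = vbyte_encode(d_gaps(clustered))
```

Here the scattered list needs 12 bytes and the clustered one 5, although both contain five postings.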
Compression Enhancement
Content delivery (WP 2.1, 2.2 and 2.3)
• Web Caching
  - Mining of web/proxy server requests aimed at improving LRU-based document caching (WP 2.1)
• Recommendation system (on-line/off-line)
  - Mining of web sessions aimed at profiling users and recommending related pages to them (WP 2.1, 2.3)
• Transactional Clustering
  - Clustering specialized for transactional data, aimed at categorizing web pages, user sessions, snippet sequences, search engine results (WP 2.1, 2.2)
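For reference, the plain LRU policy that the web caching work aims to improve can be sketched as follows. This is only the baseline, with hit and miss counters so that a mined policy could be compared against it on the same request trace. The class name and the toy trace are assumptions.

```python
from collections import OrderedDict

class LRUCache:
    """Document cache that evicts the least recently requested URL."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def request(self, url):
        """Record a request; return True on a cache hit."""
        if url in self.items:
            self.items.move_to_end(url)     # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.items[url] = True
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry
        return False

cache = LRUCache(capacity=2)
for url in ["a", "b", "a", "c", "b"]:
    cache.request(url)
```

Replaying a proxy log through such a cache gives the baseline hit ratio against which a usage-aware replacement policy can be measured.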
Content delivery (WP 2.3)
SUGGEST: a recommendation system made up of two distinct modules
• Offline: performs model extraction by means of a clustering algorithm which partitions the Usage Graph
• Online: performs user classification and suggestion generation
The WOS remarkably shortened implementation time (< 500 C++ lines)
We used three WOS objects to produce a persistent clustering structure
[Figure labels: Citation, PageView, Sessions, Cluster]
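The offline/online split can be illustrated with a deliberately simple stand-in. Offline, pages that co-occur in sessions are linked in a usage graph, weak edges are dropped, and the connected components become clusters. Online, the active session is matched to the cluster it overlaps most and the unvisited pages of that cluster are suggested. This is not the SUGGEST algorithm; the clustering rule, the `min_weight` threshold and all names are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

def usage_graph(sessions):
    """Offline, step 1: edge weight = number of sessions in which two pages co-occur."""
    weights = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            weights[(a, b)] += 1
    return weights

def cluster_pages(weights, min_weight=2):
    """Offline, step 2: connected components of the graph restricted to strong edges."""
    neighbours = defaultdict(set)
    for (a, b), weight in weights.items():
        if weight >= min_weight:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters, visited = [], set()
    for start in sorted(neighbours):
        if start in visited:
            continue
        component, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(neighbours[page] - component)
        visited |= component
        clusters.append(component)
    return clusters

def suggest(clusters, active_session):
    """Online: suggest the unvisited pages of the best-matching cluster."""
    seen = set(active_session)
    best = max(clusters, key=lambda cluster: len(cluster & seen), default=set())
    if not best & seen:
        return []
    return sorted(best - seen)

sessions = [["/a", "/b", "/c"], ["/a", "/b"], ["/b", "/c"], ["/x", "/y"], ["/x", "/y"]]
clusters = cluster_pages(usage_graph(sessions))
suggestions = suggest(clusters, ["/a"])
```

In a persistent setting the clusters would be stored once offline and only `suggest` would run per request.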
Content delivery (WP 2.2)
Goal: retrieve the pages which match the user's needs.
This is a very difficult task, given that:
• the size of the Web is increasing, and so is the number of answers
• Web coverage is a problem for any single search engine
• Web pages are heterogeneous
• user needs are subjective and time-varying
• the “list of keywords” paradigm for a user query may be ambiguous
SnakeT: clusters the web snippets returned by many search engines into hierarchically labeled folders, which are created on the fly to capture the various meanings of the answers returned for a user query
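The idea of grouping snippets into labeled folders can be shown with a very small, flat (single-level) example. It labels each snippet with the most widely shared term it contains, ignoring the query terms and a few stopwords. This is not SnakeT's algorithm, which builds a hierarchy. The stopword list, the labeling rule and the sample snippets are all assumptions.

```python
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "is", "on"}

def tokens(snippet, query_terms):
    """Lower-cased words of a snippet, without stopwords and query terms."""
    words = (word.strip(".,;:!?\"'()").lower() for word in snippet.split())
    return {w for w in words if w and w not in STOPWORDS and w not in query_terms}

def cluster_snippets(snippets, query, min_size=2):
    """Group snippets into folders labeled by a term shared by at least `min_size` snippets."""
    query_terms = set(query.lower().split())
    token_sets = [tokens(snippet, query_terms) for snippet in snippets]
    doc_freq = Counter(term for terms in token_sets for term in terms)
    folders = defaultdict(list)
    for snippet, terms in zip(snippets, token_sets):
        candidates = [t for t in terms if doc_freq[t] >= min_size]
        # Most widely shared term wins; ties are broken alphabetically.
        label = min(candidates, key=lambda t: (-doc_freq[t], t)) if candidates else "other"
        folders[label].append(snippet)
    return dict(folders)

snippets = [
    "Jaguar car dealer",
    "New jaguar car models",
    "The jaguar is a big cat",
    "Jaguar cat habitat in the rainforest",
    "Jaguar operating system",
]
folders = cluster_snippets(snippets, "jaguar")
```

Even this crude rule separates two meanings of an ambiguous one-word query, which is the effect the folders are meant to achieve.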
SnakeT: An example of use
Look at the DEMO
Content delivery (WP 2.1)
Clustering of
• E-mails
• XML documents
Ongoing and future activities
Work in progress
• Pursuing our goal of exploiting USAGE, STRUCTURE, CONTENT Web data to improve efficacy and efficiency in the interaction of the user with the Web
• Implementation of additional WOS layers
  - Compression booster, XML clustering
Future work (medium-long term)
• WOS, final version
• Community-oriented ranking
• Content (news, XML, ...) clustering
• Cooperation with Nutch.org (Doug Cutting in Pisa next October)
• etc.
Deployment scenarios
Concerning the role of the WOS and of the ECD applications, three (non-exclusive) deployment scenarios can be envisaged:
• The WOS is a research infrastructure, in the spirit of the WebBase project at Stanford University
• The WOS is an infrastructure for web analytics services to be offered to third parties, in a spirit close to IBM's WebFountain project
• The WOS can become a product for Web Data Management Systems aimed at developing and engineering web mining ECD applications, again in a spirit close to WebBase
Demo Session
Three demos here:
• WOS: browsing usage data (Mirko Nanni, Vincenzo Bacarella)
• SnakeT: Web snippets clustering (Paolo Ferragina, Antonio Gullì)
• ANTIX: Participatory Search System (Andrea Esuli)
Some other activities are described in the posters
More information
Interested people can find these slides, more information, documents and the full list of publications at the address:
http://ecd.isti.cnr.it