Syntactical Integration of Product Information From Semi-Structured Sources



    Department of Computer Science, Institute for Systems Architecture, Chair of Computer Networks

    Diplomarbeit

    SYNTACTICAL INTEGRATION OF PRODUCT INFORMATION FROM SEMI-STRUCTURED SOURCES

    Ludwig Hähne
    Mat.-Nr.: 2959267

    Supervised by:

    Dipl.-Medieninf. Maximilian Walther
    Prof. Dr. rer. nat. habil. Dr. h. c. Alexander Schill

    Submitted on July 16, 2009


    ABSTRACT

    This thesis presents a novel product information retrieval and extraction system. The

    goal is to provide a solution which automatically locates the manufacturer's page of a

    given product and extracts relevant product attributes. The document retrieval subsys-

    tem exploits multiple web search services and uses various heuristics to improve the

    ranking. The unsupervised extraction of product attributes is based on syntactic fea-

    tures of the product pages. XPath queries are used to cluster and select genuine product

    attributes from web documents. Three different extraction rule induction algorithms are

    presented. One variant uses multiple training documents, another incorporates already

    extracted data, and a supervised solution falls back on user-supplied examples. A web
    crawler was developed which automatically retrieves pages sharing common underlying
    page templates.

    The implementation extends an experimental federated search engine developed at

    the TU Dresden. The extracted product attributes are meant to enrich already available

    data with first-hand information gathered from the respective manufacturer sites. The

    system was evaluated according to a gold standard. Considering the low expenses in

    terms of user guidance effort and execution time, the system exhibits good precision
    and recall metrics.


    CONFIRMATION

    I confirm that I independently prepared the thesis and that I used only the references

    and auxiliary means indicated in the thesis.

    Dresden, July 16, 2009


    CONTENTS

    1 Introduction

    2 State of the Art
      2.1 Document Retrieval
        2.1.1 Document Model
        2.1.2 Retrieval Effectiveness
        2.1.3 Web Crawler
        2.1.4 Summary
      2.2 Information Extraction
        2.2.1 Data Model
        2.2.2 Wrapper Induction
        2.2.3 Supervised Information Extraction
        2.2.4 Semi-Supervised Information Extraction
        2.2.5 Unsupervised Information Extraction
        2.2.6 Case Studies
        2.2.7 Summary
      2.3 Information Integration
      2.4 Legal Considerations
      2.5 Fedseeko
        2.5.1 Producer Information Integration
      2.6 Summary

    3 Requirements
      3.1 Information Description
        3.1.1 Product Pages
      3.2 Functional Description
      3.3 Behavioral Description
      3.4 Validation Criteria
      3.5 Summary

    4 Design
      4.1 Data Design
        4.1.1 Retrieving Product Pages
        4.1.2 Information Extraction from Product Pages
      4.2 Architectural Design
        4.2.1 Fedseeko Integration
      4.3 Summary

    5 Implementation
      5.1 Product Page Retrieval
        5.1.1 Locating the Producer Site
        5.1.2 Locating the Product Page
        5.1.3 Crawling Related Product Pages
        5.1.4 Locator Architecture
      5.2 Information Extraction Prototype
        5.2.1 Data Regions
        5.2.2 Phrase Matching
        5.2.3 Phrase Clustering
        5.2.4 XPath Query Generalization
        5.2.5 Wrapper Induction
        5.2.6 Conclusion
      5.3 Information Extraction Implementation
        5.3.1 Wrapper Induction
        5.3.2 Attribute Extraction
        5.3.3 Selecting a Wrapper
        5.3.4 Architecture of the Web IE Subsystem
      5.4 Fedseeko Integration
      5.5 Summary

    6 Evaluation
      6.1 Feature Comparison
      6.2 Effectiveness and Performance Evaluation
        6.2.1 Test Products
        6.2.2 Product Page Retrieval Effectiveness
        6.2.3 Related Page Crawling Effectiveness
        6.2.4 Information Extraction Effectiveness
      6.3 Summary

    7 Conclusion
      7.1 Future Work

    A Glossary


    LIST OF FIGURES

    2.1 Interplay of document retrieval, information extraction and integration in web data extraction
    2.2 Template-driven web page creation from database records
    2.3 Different wrapper induction strategies [CKGS06]
    2.4 General tree mapping example [ZL05]
    2.5 Iterative partial tree alignment example [ZL05]
    2.6 Wrapper induction example for RoadRunner [CMM01]
    2.7 Input pages in ExAlg [AGM03]
    2.8 Generalized nodes and data regions in DEPTA [ZL05]
    2.9 Fedseeko architecture [WSS09]
    3.1 Overview of information flow
    3.2 Product page example with the extraction targets being highlighted
    4.1 Information flow during extraction
    4.2 Selecting a product page from a set of candidates using multiple techniques
    4.3 Navigating to a related product page (Nikon D90 to Nikon D3X)
    4.4 Examples of specification data embedded in different containers
    4.5 Clustering text nodes from multiple documents
    4.6 Source code of the two pages from figure 4.5
    4.7 Architecture overview of the complete system
    5.1 Ranking a set of candidate documents using multiple techniques
    5.2 Architecture of the DR subsystem
    5.3 Supervised retrieval and extraction
    5.4 Architecture of the IE subsystem
    5.5 Fedseeko product administration view
    6.1 Word cloud visualizing the most common terms in key phrases
    6.2 Effectiveness of locating the right producer sites and product pages
    6.3 Product page retrieval runtime performance distribution
    6.4 Number of successful operations of each isolated component
    6.5 Correctness and completeness of extraction results
    6.6 Example of a nested template page
    6.7 Example of a specification page for multiple products
    6.8 Information extraction runtime performance


    1 INTRODUCTION

    The World Wide Web is a place where millions if not billions of products are marketed,

    searched, sold, bought and reviewed. Potential customers have a multitude of different

    sources at their disposal to facilitate a purchase decision. There are various product

    review sites, web shops provide product descriptions, blogs gain popularity as informa-

    tion resources, and there is the information published by the product's manufacturer. An

    important factor is the reliability of the individual information sources. When it comes

    to buying an expensive product, a customer probably prefers to resort to the most reli-

    able source of information. However, it is getting increasingly difficult to find first-hand

    product information via a simple web search. To reach potential customers, manufactur-

    ers have to compete with many other information providers in order to receive attention
    and a good search engine rank.

    Nowadays, web search engines are the single point of contact interfacing to the exu-
    berant wealth of information in the World Wide Web. However, today's web search engines
    predominantly only inform about the whereabouts of data and still cannot answer complex

    queries. It is very difficult to do better as long as the web content is not semantically

    interwoven.

    Not only Tim Berners-Lee believes the Semantic Web to be the future of the Internet

    [BLHL01]. Instead of phrasing keyword queries and wading through search results to

    find relevant information, the vision is to let the Semantic Web answer actual ques-
    tions. In the context of product information retrieval one might want to ask questions

    like: How much power does the latest Siemens refrigerator consume compared to its

    predecessor and the new flagship product of Penguin Electrics? As old as this vision

    is, it still has a long way to go. Web developers are required to semantically describe

    their data in languages that may seem too complex and lavish to pick up easily. Espe-

    cially the lack of obvious short term benefits may impede the adoption of Semantic Web

    technologies. It is not helpful either that a semantic query system needs a somewhat

    complete knowledge base in the target domain to be valuable for a potential user. But

    what if semantic data could be condensed out of existing web pages?


    One idea is to bridge the gap between the "syntactic" Web and the Semantic Web

    by automatically transferring information from traditional web pages into a semantic

    context with the help of information extraction techniques. Admittedly, information

    extraction systems will not immediately provide the anticipated power of the Semantic

    Web without further efforts. But these systems might help to facilitate the migration
    process in some well-defined domains, one of which might be product information

    extraction.

    With an automatic product information extraction and integration system at hand, it

    would be possible to find similar products based on all kinds of feature-related criteria.

    It would also relieve the customer of manually retrieving the producer information
    for the products of interest. Furthermore, such a system would be manufacturer- and

    vendor-independent.

    This work presents a novel approach towards automatic Web information extraction

    and strives to become an enabling technology for product information integra-
    tion. A prototype implementation was developed and integrated into a federated search

    engine, demonstrating the practical viability for product information integration and its

    inherent challenges: locating product pages, automatically collecting training data for

    pattern mining and identifying and extracting valuable product data.

    The product page location component resorts to multiple web search services and

    incorporates various heuristics to optimize the retrieval precision. The extraction

    exploits structural characteristics of template-generated web pages. Extraction rules

    are stored as XPath queries in the system. A low-complexity clustering algorithm is

    utilized to derive these extraction rules. Three algorithms are proposed, corresponding

    to different degrees of automation.

    Chapter two provides theoretical background and discusses the state of the art in
    Web information extraction and related fields of research. In chapter three the

    requirements of the novel product information extraction system are analyzed. The

    subsequent chapters deal with the design and implementation of the software system.

    Chapter six dissects the advantages and drawbacks of the presented solution and eval-

    uates the system according to a gold standard. Finally, a summary and an outlook are

    given.


    2 STATE OF THE ART

    Integrating information from the World Wide Web into a local database relies on three

    major components, as depicted in figure 2.1. In this chapter, important concepts of

    document retrieval and information extraction are outlined and an overview of the state

    of the art in each field is given. This work strongly focuses on information extraction

    and thus presents a selection of existing information extraction systems. Information

    integration is covered briefly for the sake of completeness. The chapter closes with the

    presentation of Fedseeko, the system into which the new information extraction system

    shall be integrated.

    Figure 2.1: Interplay of document retrieval, information extraction and integration in

    web data extraction


    2.1 DOCUMENT RETRIEVAL

    "Knowledge is of two kinds. We know a subject ourselves, or we know where we can

    find information upon it." Samuel Johnson

    Information retrieval (IR) is often only loosely defined. Moreover, in the context of

    most retrieval systems, information retrieval actually refers to document retrieval. In effect,
    information retrieval shall be synonymous with document retrieval (DR) in this thesis;
    being less ambiguous, the latter term is preferred. Lancaster gives the following

    definition of IR that also draws a dividing line separating related fields of research like

    fact retrieval or question answering [Lan68]:

    Definition 1 (Information Retrieval)

    An information retrieval system does not inform (i.e. change the knowledge of) the user on the

    subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts
    of documents relating to his request.

    Document retrieval aims to find relevant information from a large corpus of docu-

    ments. Given a user query, traditional DR systems identify and rank documents in a

    corporate or library network or on a single host (e.g. desktop search). In the context of

    the Internet, DR is an important foundation of web search technologies with web pages

    building the document corpus. Due to the vast amount of web content with trillions of

    web pages, web search systems have different requirements than traditional DR systems.

    User queries normally are lists of words. Based on a query, the DR system finds

    relevant documents by matching the query tokens with the documents' contents. In

    the simplest case, each word occurring in the query must also occur in the document.

    Phrase queries are also a very common instrument in IR. In addition, the query may

    contain Boolean operators or means to express that two tokens must occur near each

    other. However, complex query constructs are rarely used in practice as those make the

    DR task more difficult for the users.

    In the following, DR document models, effectiveness metrics and web crawlers are

    discussed.

    2.1.1 Document Model

    The document model specifies how the documents and queries are represented and

    governs how the relevance of a document in respect to a query is computed.

    A document can be modeled in many different ways. It is common to most mod-

    els that documents and queries are treated as a "bag of words or terms" in which term

    sequence and position are ignored [Liu06]. An important characteristic of document

    models is whether and how term-interdependencies are modeled. In the simplest case,

    each word is treated independently. According to Kuropka, the various approaches can

    be divided into set-theoretic models (e.g. the Boolean model), algebraic models (e.g. the
    vector-space model) and probabilistic models [Kur04]. The different models will be
    briefly presented in the following.

    In the Boolean model each term is only checked for its presence or absence in a

    document. A query in a Boolean retrieval system can be given as a logical equation

    combining terms with logic operators, e.g. "James Joyce" AND Trieste. A document

    is relevant in respect to the query if the contained set of terms make the query logically

    true. Boolean models have the disadvantage that no ranking can be derived from the

    simple definition of the problem. Neither is the term frequency examined, nor does the
    model permit inexact matches.
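
    As an illustration, a conjunctive Boolean query reduces to simple set operations. The
    following minimal sketch (an invented example, not part of the original thesis) treats
    each document as a set of terms:

        def boolean_match(doc_terms, query_terms):
            # A document satisfies a conjunctive query if it contains every term.
            return query_terms <= doc_terms

        docs = {
            "d1": {"james", "joyce", "trieste", "ulysses"},
            "d2": {"james", "joyce", "dublin"},
        }
        query = {"james", "joyce", "trieste"}  # "James Joyce" AND Trieste
        print([d for d, t in docs.items() if boolean_match(t, query)])  # ['d1']

    Every matching document is equally "relevant" here, which illustrates why no ranking
    can be derived from the Boolean model alone.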

    In the vector-space model a document is represented by an n-dimensional vector,

    in which each dimension represents a distinct term of the vocabulary from the whole

    document corpus. The weight of the term is computed from its occurrence characteristic

    in the document. The query is also modeled as such a vector. Now the relevance of the

    document with respect to the query can be computed as the cosine of the angle between
    the two vectors, defined as the cosine similarity (see equation (2.1)).

    \[
    \cos\theta = \frac{\vec{d} \cdot \vec{q}}{\|\vec{d}\|\,\|\vec{q}\|} \tag{2.1}
    \]
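
    As an illustration of equation (2.1), the following sketch (hypothetical, using raw term
    frequencies as vector weights; a real system would typically use a weighting such as
    tf-idf) computes the cosine similarity of a document and a query:

        import math
        from collections import Counter

        def cosine_similarity(doc_tokens, query_tokens):
            # Term-frequency vectors over the shared vocabulary (equation 2.1).
            d, q = Counter(doc_tokens), Counter(query_tokens)
            dot = sum(d[t] * q[t] for t in d.keys() & q.keys())
            norm_d = math.sqrt(sum(v * v for v in d.values()))
            norm_q = math.sqrt(sum(v * v for v in q.values()))
            return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

        print(cosine_similarity("james joyce wrote ulysses".split(),
                                "james joyce".split()))  # 0.7071...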

    Language models are an example of a probabilistic approach; they were first proposed
    for document retrieval by Ponte and Croft [PC98]. In a statistical language model,
    a probability distribution over n-grams is computed for each document in the corpus.
    The idea is to derive the ranking of a document $d_i$ with respect to a query $q$ from
    the a posteriori probability $P(d_i \mid q)$. This is essentially the likelihood of the
    query being generated by the respective document's language model.

    The ranking derived from the degree of relevance is governed by the internal doc-

    ument model. It may poorly reflect the actual relevance of documents as perceived by

    the user. Thus, effectiveness metrics are required to evaluate the performance of a DR

    system.

    2.1.2 Retrieval Effectiveness

    Numerous metrics have been proposed to measure the performance of DR systems. The

    most commonly used metrics are precision and recall. Assuming a document is either

    relevant or irrelevant in respect to a query, precision is the fraction of relevant documents

    in the set of retrieved documents. In contrast, recall is the ratio of the number of relevant

    documents retrieved to the total number of relevant documents (including those that

    were not retrieved). Both metrics are related and are most often examined in context

    of each other. For example, it is trivial to achieve 100% recall by just returning all

    documents for every query. However, the precision metric would immediately reveal

    the deficiency of such an approach. Another commonly used metric is the F-score (or

    F-measure) which is defined as the weighted harmonic mean of precision and recall.

    Web search engines typically present search results in buckets of around ten docu-
    ments. Users, however, rarely consider search results beyond the first few result pages.


    In effect, a relevant but very low-ranked document is essentially useless from the user's
    perspective (depending on the user's web browsing behavior and motivation, he or she
    might wade through one hundred search results, but more likely no more than ten will
    be considered). Therefore, the ranking is also considered in the performance evaluation
    by only examining the first $i$ search results.

    Let $D$ be the whole document corpus. A query is submitted to a given DR system.
    $D_{retrieved} \subseteq D$ is an ordered set of all retrieved documents, while
    $D^i_{retrieved}$ contains the $i$ top-ranked documents returned by the system.
    $D_{relevant} \subseteq D$ is the set of all relevant documents. The effectiveness
    metrics can be computed according to equations (2.2).

    \begin{align*}
    \text{precision}(i) &= \frac{|D_{relevant} \cap D^i_{retrieved}|}{|D^i_{retrieved}|} \\
    \text{recall}(i) &= \frac{|D_{relevant} \cap D^i_{retrieved}|}{|D_{relevant}|} \tag{2.2} \\
    \text{F-score}(i) &= \frac{2 \cdot \text{precision}(i) \cdot \text{recall}(i)}{\text{precision}(i) + \text{recall}(i)}
    \end{align*}
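
    A direct transcription of equations (2.2) into code (a sketch with made-up document
    identifiers for illustration):

        def effectiveness(retrieved, relevant, i):
            # Precision, recall and F-score at cut-off i (equations 2.2).
            top_i = retrieved[:i]
            hits = len(set(top_i) & relevant)
            precision = hits / len(top_i) if top_i else 0.0
            recall = hits / len(relevant) if relevant else 0.0
            f_score = (2 * precision * recall / (precision + recall)
                       if precision + recall else 0.0)
            return precision, recall, f_score

        ranked = ["d3", "d1", "d7", "d2"]   # ordered result list of the DR system
        gold = {"d1", "d2", "d5"}           # all relevant documents
        print(effectiveness(ranked, gold, i=3))  # (0.333..., 0.333..., 0.333...)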

    In order to identify all relevant documents, the DR system first needs to be aware of

    the existence and whereabouts of the individual web pages. The gathering of web pages
    is performed by a web crawler, which is presented in the following section.

    2.1.3 Web Crawler

    Web IR systems have to gather web pages to build a document index. This non-trivial

    task is performed by a web crawler, also known as spider or robot. Web crawlers recur-sively follow links in web pages to build a document index. As the Internet is constantly

    evolving, web sites need to be visited regularly to account for new or changed content.

    Definition 2 (Spider [FOL09])

    A program that automatically explores the World-Wide Web by retrieving a document and

    recursively retrieving some or all the documents that are referenced in it.

    The best known web crawlers are universal crawlers which operate on behalf of web

    search engines collecting the data for the document index. According to Brin and Page,

    the crawler is the most fragile component of a search engine because it has to interact
    with millions of remote servers, all beyond the control of the system [BP98]. Thus, a

    crawler has to be very robust and handle a multitude of corner cases even if that might

    affect only a single page.

    Crawlers may impose a huge stress on the resources of the respective hosts if the
    request rate is not limited, amounting to a denial-of-service attack in the worst case.
    Furthermore, crawlers should identify themselves and comply with the robots exclusion
    standard, a de facto standard described at http://www.robotstxt.org/wc/robots.html.
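
    A minimal sketch of such a well-behaved crawler follows (an illustration using only
    Python's standard library; the user-agent string "example-crawler" and all limits are
    placeholder assumptions for this example, not part of the original thesis):

        import time
        import urllib.robotparser
        from html.parser import HTMLParser
        from urllib.parse import urljoin, urlparse
        from urllib.request import urlopen

        class LinkParser(HTMLParser):
            # Collects the href targets of all anchor tags on a page.
            def __init__(self):
                super().__init__()
                self.links = []

            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    self.links.extend(v for k, v in attrs if k == "href" and v)

        def crawl(seed, max_pages=20, delay=1.0):
            # Breadth-first crawl of one site, honoring robots.txt and a rate limit.
            site = "{0.scheme}://{0.netloc}".format(urlparse(seed))
            robots = urllib.robotparser.RobotFileParser(site + "/robots.txt")
            robots.read()
            queue, seen, pages = [seed], {seed}, {}
            while queue and len(pages) < max_pages:
                url = queue.pop(0)
                if not robots.can_fetch("example-crawler", url):
                    continue  # comply with the robots exclusion standard
                try:
                    html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
                except OSError:
                    continue  # be robust against unreachable or broken servers
                pages[url] = html
                parser = LinkParser()
                parser.feed(html)
                for link in parser.links:
                    absolute = urljoin(url, link)
                    if absolute.startswith(site) and absolute not in seen:
                        seen.add(absolute)
                        queue.append(absolute)
                time.sleep(delay)  # limit the request rate to spare the host
            return pages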


    In addition to universal crawlers, there are focused and topical crawlers exploring

    the Web based on user preferences. These try to find only pages that relate to a category
    of interest or are similar to a set of seed pages. Focused crawlers differ from universal
    crawlers in their strategy for picking the URLs to be visited.

    2.1.4 Summary

    Important concepts of IR have been briefly presented. Information retrieval in terms

    of document retrieval is just a first step in obtaining and processing knowledge in an

    information system. The next step of information processing is the extraction of more

    fine-grained information from the retrieved documents.

    2.2 INFORMATION EXTRACTION

    "Get your facts first, and then you can distort them as much as you please." Mark

    Twain

    Roughly speaking, information extraction (IE) aims to condense knowledge about a

    specific domain of interest. Attributes of the domain's entities or facts are distilled from

    one or more input documents. The goal of IE is enabling the information system to

    reason based on the extracted data. For example, an IE system that collects facts on the

    world's countries may extract attributes such as population, capital or natality [UTF08].

    Definition 3 (Information Extraction [SAI01])

    Rather than indicating which documents need to be read by a user, [Information Extraction]

    extracts pieces of information that are salient to the user's needs.

    IE produces structured data from unstructured and semi-structured documents.

    Semi-structured data typically refers to tables and lists, which are characteristic for web

    pages. Whether a document is perceived as structured or unstructured depends on the

    research domain. Databases are typically regarded as structured data while free text

    is commonly classified as unstructured. The classification, however, cannot be solely

    based on the data format. It is quite possible to dump a whole unstructured document

    into a single database record, or strictly format a text file as a sequence of key-value

    tuples. Similarly, an HTML body may contain an unstructured stream of free text or a

    fine-grained table. Nevertheless, in the IE community, HTML is commonly classified as

    semi-structured data, while XML documents with available meta-data are considered

    being structured [CKGS06]. The dividing line between semi-structured and structured

    data is drawn between documents containing some kind of syntactic structuring ele-

    ments (e.g. HTML tags) and semantic tags of the data.

    While IE for unstructured documents like free text has been thoroughly investigated

    during the last decades, as indicated by the success of the Message Understanding

    Conferences [Gri97], IE for semi-structured documents has received growing interest
    from researchers in recent years. For the respective tasks, different techniques

    are required. Traditional IE needs to extract knowledge from human language texts and

    typically uses lexicons and grammars to achieve this goal. Web IE takes advantage of

    the fact that web pages are often automatically generated from (structured) database

    records. Because web pages are created from static templates, machine learning and
    pattern recognition techniques can be applied to analyze the syntactic structure of the
    documents. Web scraping and screen scraping are used synonymously with Web IE. The
    Jargon File ("a comprehensive compendium of hacker slang illuminating many aspects
    of hackish tradition, folklore, and humor" [Jar03]) gives the following definition of
    screen scraping, stressing the unintended usage of the medium.

    Definition 4 (Screen scraping)

    The act of capturing data from a system or program by snooping the contents of some display

    that is not actually intended for data transport or inspection by programs. [...] it often refers

    to parsing the HTML in generated web pages with programs designed to mine out particular

    patterns of content.

    Chang et al. give an overview of contemporary Web information extraction sys-

    tems and categorize those based on task difficulty, extraction technique and degree of

    automation [CKGS06].

    2.2.1 Data Model

    In the following, a generic IE data model is described informally. The chosen model is

    derived from the data model known from relational databases and is also referenced by

    other IE researchers [AGM03, Liu06]. According to this model, the data is structured as

    nested relations made up of basic types arranged in tuples and sets. A basic type B is anatomic entity, typically a string in the context of web pages. The tuple type T1, T2,..., Tn

    is an ordered collection of other types Ti. Tuples map to data records in a database

    context. Set types {T} are constructed by multiple elements of the same type T, like a

    list of equally typed tuples.

    Let S be the schema of a book description. The data record (tuple) describing a book

    might comprise the title, a set of authors, the publisher of the book and the number of

    pages. Then the schema can be described as S = Btitle, {Bname}authors ,Bpublisher ,Bpages.

    An instance ofS is the value x = "Ulysses",{"James Joyce"}, "Penguin", 1040.

    A template-based semi-structured page is created from one or more data records
    stored in a database and a template, as illustrated in figure 2.2. A template maps
    instances of a certain schema to a web page. More formally, an encoded web page $P$
    is created from a data record $x$ and a template $T$ via a template mapping
    function $\lambda$. Thus, the page creation process can be modeled as $P = \lambda(T, x)$.

    The IE task is to extract $x$ from $P$ with $T$ being unknown. If $\lambda_T^{-1}$
    is the extraction function associated with template $T$, the extractor performs
    $x' = \lambda_T^{-1}(P)$. The schema of the extracted data rarely matches the model of
    the original data schema: either the IE system is not able to extract all data fields,
    or only a subset of the data fields is required.


    Figure 2.2: Template-driven web page creation from database records

    Therefore, $x'$ is generally an incomplete approximation of the original data
    record $x$. For example, the schema for the extracted data in the running example might
    be $S' = \langle B_{title}, B_{authors}, B_{publisher} \rangle$, with the nested data
    records for the authors collapsed into a single data field and the page count omitted.
    In practice, many IE systems use simpler data models for the extraction targets than
    the one described. In particular, nesting of set and tuple types is not supported by
    the majority of the available IE systems.

    An example template for the running example is given in listing 2.1 using a pseudo
    template language.

        <html><body>
        <h1>Books</h1>
        <ul>
        <li>Title: <i>{title}</i></li>
        <li>Author: {authors}</li>
        <li>Publisher: {publisher}</li>
        <li>{pages} pages</li>
        </ul>
        </body></html>

    Listing 2.1: Template example
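
    To make the page creation model $P = \lambda(T, x)$ concrete, the following hypothetical
    sketch instantiates a template of this shape with the book record from the data model
    example (the placeholder syntax and names are assumptions of this illustration):

        # P = λ(T, x): instantiate template T with data record x.
        TEMPLATE = """<html><body>
        <h1>Books</h1>
        <ul>
        <li>Title: <i>{title}</i></li>
        <li>Author: {authors}</li>
        <li>Publisher: {publisher}</li>
        <li>{pages} pages</li>
        </ul>
        </body></html>"""

        record = {"title": "Ulysses", "authors": "James Joyce",
                  "publisher": "Penguin", "pages": 1040}

        page = TEMPLATE.format(**record)  # the encoded web page P

    The IE task described above corresponds to the inverse direction: recovering the record
    from pages like this one while the template itself is unknown.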

    So far, the web page has been assumed to be a static document. However, techniques
    such as Ajax allow performing operations asynchronously, for example the deferred
    loading of additional content using XMLHttpRequest [Gar05]. This poses new
    challenges for DR and IE systems if relevant information only becomes available after
    performing a certain action, like clicking a link or button on the page. A potential
    solution to remedy this problem is to drive a full-fledged web browser with a JavaScript
    interpreter and use a plug-in like Watir (an open-source library for automating web
    browsers: http://wtr.rubyforge.org/index.html) to store static snapshots of the
    dynamic page.
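
    In the same spirit, a hypothetical Python sketch using the Selenium browser automation
    library (an assumption of this example; the thesis itself refers to Watir) could capture
    such a snapshot:

        from selenium import webdriver

        def snapshot(url):
            # Render the page in a real browser, return the DOM after scripts ran.
            driver = webdriver.Firefox()
            try:
                driver.get(url)
                return driver.page_source  # static snapshot of the dynamic page
            finally:
                driver.quit()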


    As has already been identified in this section, the goal of the Web IE system is to

    extract data embedded in web pages created from a template. This task is performed by

    a wrapper program which may be hand-crafted or automatically generated. Wrapper

    generation techniques are discussed in the following section.

    2.2.2 Wrapper Induction

    According to a very general definition, a wrapper provides an interface to an entity and

    allows it to be treated as if being something else. In the Web IE context, a wrapper allows

    to regard a web page as a database record. Consequently, the wrapper is responsible for

    extracting one or more data records from web pages.

    Early IE systems were programmed manually (special-purpose IE tasks are often still
    conducted manually, e.g. extracting the links from a web search result page). A set of
    web documents is examined and common patterns have to be identified by a human
    operator. Recurrent patterns enable the programmer to write a wrapper for extracting
    the target data, either manually or aided by pattern specification languages. The
    hand-crafted wrapper should then be able to extract data from documents sharing the
    same template.

        <html><body>
        <h1>Books</h1>
        <ul>
        <li>Title: <i>Ulysses</i></li>
        <li>Author: James Joyce</li>
        <li>Publisher: Penguin</li>
        <li>1040 pages</li>
        </ul>
        </body></html>

    Listing 2.2: Sample web page

    Listing 2.2 shows a simple web page generated from the aforementioned template.

    Assuming the extraction task is to extract the book's title, the programmer might write a
    program that skips to the <i> tag and extracts the text that follows until the closing
    </i> tag. Alternatively, regular expressions or XPath queries could be used. The different
    variants to represent extraction rules are discussed in the section on extraction rules below.

    Manually programmed wrappers are prone to failure when templates change, require
    knowledge of the employed technologies and are very labor-intensive. In con-

    trast to manually specifying extraction rules, wrapper induction systems derive these

    from a set of training documents with various degrees of automation.

    Regardless of how the wrapper was generated, Web IE systems have to deal with the
    problems of wrapper verification and wrapper repair. A wrapper relies on the extraction
    targets being encoded in a certain way. However, web pages are subject to change
    and information providers may choose to replace their templates at any time. This
    causes hardship for wrapper maintenance.


    Figure 2.3: Different wrapper induction strategies [CKGS06]

    The detection of whether the wrapper is suited to extract data from a presented page
    is called the wrapper verification problem (verification is also needed if the IE system
    may be confronted with ineligible pages, i.e. pages created from different templates).
    Adapting the wrapper to a changed template is called the wrapper repair problem. A way
    to approach both problems is to learn and verify characteristic patterns of the target
    data. In case of failure, the patterns can be used in attempting to adapt the wrapper to
    the new template. However, both tasks are very difficult to solve and are still an active
    research area [Liu06].

    The goal of wrapper induction is to derive the encoding template from a collection

    of encoded instances of the same type. Repeated patterns in HTML documents can be

    detected with string or tree matching and alignment techniques. These will be discussed

    in the next sections.

    String Matching

    String matching reveals to what extent two character strings resemble each
    other. The Levenshtein distance is a commonly used measure of the similarity

    of two strings [Lev65]. It is defined as the minimum number of operations to transform

    one string into the other. These operations are inserting, deleting or replacing a single

    character in the string. The edit distance can be computed using dynamic programming.

    Let $s_1$ and $s_2$ be the input strings and $n$ and $m$ the respective character
    counts. The table $D$ of dimension $(n + 1) \times (m + 1)$ is initialized with
    $D_{i,0} = i$ and $D_{0,j} = j$. The remaining cells are computed using equation (2.3).

    \[
    \forall i \in [1..n],\, j \in [1..m]:\quad
    D_{i,j} = \min
    \begin{cases}
    D_{i-1,j-1} & \text{same character} \\
    D_{i-1,j-1} + 1 & \text{replace} \\
    D_{i,j-1} + 1 & \text{insert} \\
    D_{i-1,j} + 1 & \text{delete}
    \end{cases} \tag{2.3}
    \]


    The final edit distance is retrieved from the bottom right corner cell $D_{n,m}$. An
    alignment path can be traced back through the matrix, illustrating the operations. The
    time complexity of the algorithm is $O(nm)$. Table 2.1 shows an example matrix for
    the comparison of the character strings sheep and shepard, yielding a Levenshtein
    distance of 4. For similarity computations, the edit distance can be normalized by
    dividing it by the length of the longer string, $\max(n, m)$.

    Table 2.1: Edit distance matrix of the strings "shepard" and "sheep"

                 s  h  e  p  a  r  d
              0  1  2  3  4  5  6  7
          s   1  0  1  2  3  4  5  6
          h   2  1  0  1  2  3  4  5
          e   3  2  1  0  1  2  3  4
          e   4  3  2  1  1  2  3  4
          p   5  4  3  2  1  2  3  4
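
    A straightforward transcription of equation (2.3) into code (a sketch reproducing the
    example of table 2.1):

        def levenshtein(s1, s2):
            # Dynamic-programming edit distance, O(n*m) time (equation 2.3).
            n, m = len(s1), len(s2)
            D = [[0] * (m + 1) for _ in range(n + 1)]
            for i in range(n + 1):
                D[i][0] = i
            for j in range(m + 1):
                D[0][j] = j
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    if s1[i - 1] == s2[j - 1]:
                        D[i][j] = D[i - 1][j - 1]           # same character
                    else:
                        D[i][j] = 1 + min(D[i - 1][j - 1],  # replace
                                          D[i][j - 1],      # insert
                                          D[i - 1][j])      # delete
            return D[n][m]

        assert levenshtein("shepard", "sheep") == 4
        # Normalized similarity: divide by the length of the longer string.
        print(levenshtein("shepard", "sheep") / max(7, 5))  # 0.571...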

    Tree Matching

    String matching across non-trivial Web documents is a complex and expensive oper-

    ation considering the average document length in terms of characters. There are no

    pre-determined boundaries and the content and length of the data may differ across

    multiple documents or records. The semi-structured nature of Web documents led to

    the application of tree matching to conduct IE tasks. Tree matching compares the struc-

    ture of two trees and computes a cost of pairing the vertices. In the context of Web IE,

    the DOM tree or parts thereof are commonly compared, using the element tags as the
    vertices' labels.

    Tree matching computes a minimum-cost mapping for two ordered labeled trees.

    According to the general definition, each node appears no more than once and the

    order and hierarchical relations among nodes are preserved. Figure 2.4 illustrates such
    a mapping. Tai presented the first polynomial algorithm for computing the tree edit
    distance, based on dynamic programming [Tai79]. The algorithm has a complexity of
    $O(n_1 n_2 h_1 h_2)$ in time and space, with $n_1$ and $n_2$ being the numbers of
    nodes and $h_1$ and $h_2$ the heights of the respective trees.

    Cost functions are assigned to the editing operations transforming one tree into

    another, i.e. relabeling, deleting and inserting nodes. Relabeling is of special interest

    as it lends itself to identifying recurrent patterns in similarly structured documents. More

    elaborate cost functions for the relabel operation may exploit syntactic (e.g. string edit

    distance) or semantic (e.g. feature vector) similarities. Zigoris et al. propose using sup-

    port vector machines to learn the parameters of the cost function for semantic matching.

    The preliminary results, however, indicated no performance gain in comparison to sim-

    pler cost functions [ZEZ06].


    Figure 2.4: General tree mapping example [ZL05]

    A more restrictive variant of tree matching was defined by Selkow in 1977 [Sel77].

    According to Selkow's definition, insertion and deletion are limited to the leaf nodes and
    node replacement is not supported. In effect, the aim of tree matching is to find the
    maximum matching where every node pair has the same parent nodes. This definition
    has been found to fit web documents better because structural (i.e. level-crossing)
    changes are not generally applicable to DOM trees [CAM01]. Simple tree matching (STM)

    is an algorithm solving this problem in quadratic time [Yan91]. It is again based on

    dynamic programming and shown in listing 2.3.

        STM(A, B)
            if A.root ≠ B.root then
                return 0
            else
                m ← |A.children|
                n ← |B.children|
                M[i][0] ← 0 for all i ∈ [0..m]
                M[0][j] ← 0 for all j ∈ [0..n]
                for i = 1 to m do
                    for j = 1 to n do
                        M[i][j] ← max(M[i][j-1], M[i-1][j],
                                      M[i-1][j-1] + STM(A_i, B_j))
                return M[m][n] + 1

    Listing 2.3: Simple tree matching algorithm
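
    For illustration, a runnable transcription of listing 2.3 follows (a sketch; representing
    trees as (label, children) tuples is an assumption of this example, not the thesis's
    data structure):

        def simple_tree_match(a, b):
            # Size of the maximum top-down mapping of two trees [Yan91]:
            # no relabeling, insertions/deletions restricted to leaves.
            if a[0] != b[0]:
                return 0
            m, n = len(a[1]), len(b[1])
            M = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    M[i][j] = max(M[i][j - 1], M[i - 1][j],
                                  M[i - 1][j - 1]
                                  + simple_tree_match(a[1][i - 1], b[1][j - 1]))
            return M[m][n] + 1

        t1 = ("ul", [("li", []), ("li", [("i", [])])])
        t2 = ("ul", [("li", [("i", [])])])
        print(simple_tree_match(t1, t2))  # 3: ul, one li and the nested i match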

    Multiple Alignment

    In order to identify patterns in case more than two strings or trees are involved, mul-

    tiple sequence alignment (MSA) techniques can be applied. Multiple alignment has its

    foundation in molecular biology where it is used to identify similarities of sequences

    (e.g. proteins). Given a set of similar sequences, MSA tries to find an optimal align-

    ment by inserting gaps into the sequences. Carrillo and Lipman presented an algorithm

    based on multidimensional dynamic programming that yields optimal results but has

    an exponential time complexity [CL88]. Hence, various heuristic methods have been

    proposed, amongst which the center star method has found its way into IE systems.


    In this method, a center sequence $c$ is selected from a set of sequences $X$,
    minimizing the summed pair-wise distance to the other sequences:

    \[
    c = \arg\min_{x_c \in X} \sum_{x_i \in X} d(x_i, x_c) \tag{2.4}
    \]

    Afterwards, the alignments with the remaining sequences are computed and gaps are

    inserted into the center string where necessary. The time complexity of the center star

    method is $O(n^2 k^2)$ for $n$ sequences of length $k$. While this is of polynomial
    complexity, the character sequence lengths of HTML pages still incur excessive runtime
    behavior in IE systems.
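
    The selection step of equation (2.4) is compact in code; the sketch below is
    illustrative only (it uses a trivial length-difference stand-in where a real system
    would use the edit distance, and it omits the subsequent gap-insertion step):

        def center_star(sequences, distance):
            # Pick the sequence minimizing the summed distance to all others (2.4).
            return min(sequences,
                       key=lambda c: sum(distance(c, s) for s in sequences))

        seqs = ["abcd", "abd", "abcde"]
        print(center_star(seqs, lambda a, b: abs(len(a) - len(b))))  # 'abcd'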

    Partial Tree Alignment

    Partial tree alignment was specifically crafted to solve the multiple alignment problem

    in an IE context [ZL05]. It aligns multiple trees by progressively growing a seed tree.

    The latter is initialized to be the tree with the maximum number of nodes. This way

    it likely aligns well with the other trees. The remaining trees are matched by linking

    matching nodes and trying to insert the nodes for which no match was found into the
    seed tree.

    Nodes are only inserted if a position can be uniquely determined. That is, if the

    neighboring siblings in the source tree are matched with consecutive siblings in the

    seed tree. Figure 2.5 illustrates growing such a seed tree $T_s$ from three input trees.

    Figure 2.5: Iterative partial tree alignment example [ZL05]


    Extraction Rules

    Once the extraction targets are identified, rules to mine the relevant information need

    to be formalized and stored for future use. There are various possibilities ranging from

    first-order logic rules over regular expressions to XPath and CSS selectors. Logic rules
    are primarily used in free-text IE, where common tokens and characteristic delimiters

    facilitating the other approaches are rarely available.

    Regular expressions have been widely adopted for data mining from semi-structured

    documents. In the example in listing 2.2, the title of the book can be mined
    with the regular expression <i>(\w+)</i>. In practice, however, regular expressions

    are not very well suited to match data in HTML documents. To correctly match all

    possible incarnations of a specific HTML tag with a regular expression is a daunting

    task, especially due to the statefulness of the HTML syntax. For example, the given

    expression will not work if the tag contains any attributes and will unintentionally

    match with occurrences of the tag in comments or strings. Therefore, the interest has

    recently shifted to query languages like XPath or CSS selectors which are much more

    suitable to extract information from an HTML or XML document.

    Especially the usage of the XPath language in Web information extraction has gained

    importance with a growing number of libraries supporting this query mechanism. In a

    nutshell, XPath queries provide means to address node-sets or individual nodes in the

    DOM tree of an XML (or HTML) document. For instance, //li/i/text() addresses the

    title phrase of the book in the running example while querying for //ul/li[1] returns

    the node containing the whole book-title attribute. XPath queries are far more power-

    ful than the examples given above. This complexity, however, has caused hardship for
    providing full support of the XPath standard in implementations and an uncertainty

    concerning the complexity of XPath queries in general. Gottlob et al. have shown that

    large fragments of XPath are of LOGCFL (logarithmically reducible to context-free
    languages) complexity and thus can be massively parallelized [GKP03]. A more
    elaborate treatise on XPath can be found in Essential XML Quick Reference [SG01].
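
    Executed against the sample page of listing 2.2, the two queries behave as described.
    The sketch below assumes the third-party lxml library (an assumption of this example;
    the thesis does not prescribe a particular XPath implementation):

        from lxml import html

        PAGE = """<html><body>
        <h1>Books</h1>
        <ul>
        <li>Title: <i>Ulysses</i></li>
        <li>Author: James Joyce</li>
        <li>Publisher: Penguin</li>
        <li>1040 pages</li>
        </ul>
        </body></html>"""

        doc = html.fromstring(PAGE)
        print(doc.xpath("//li/i/text()"))                 # ['Ulysses']
        print(doc.xpath("//ul/li[1]")[0].text_content())  # 'Title: Ulysses'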

    O'Keefe and Trotman examine a number of query languages besides XPath and argue
    that most available solutions are overly complicated [OT03]. On the one hand, the lack

    of comprehensive support of the XPath 1.0 standard in many query libraries backs this

    assumption. On the other hand, in Web IE the expressive power to select the relevant

    parts of the available information with the utmost precision is a more favorable goal

    than a simpler yet inferior solution. CSS selectors, for example, share similar concepts

    with XPath queries but are not quite as powerful.

    After foundational approaches and techniques have been covered, supervised, semi-

    supervised and unsupervised IE system concepts are presented along with a few exem-

    plary case studies.


    2.2.3 Supervised Information Extraction

    Manually observing recurrent patterns in web pages is a rather cumbersome and error-

    prone process which can be alleviated by automatically learning extraction rules from

    labeled training documents. This approach is referred to as supervised IE. As depicted
    in figure 2.3, the user has to label relevant data with the help of a graphical user

    interface (GUI). In the example of the book page, the user may mark "Ulysses" as the

    title of the book, and does the same for a set of other pages. The IE system then tries to derive

    rules from these examples and, depending on the IE system, may suggest additional

    informative pages to be labeled by the user.

    For example, Rapier is a supervised extraction system that uses a relational learning

    algorithm [CM97]. It initializes the system with specific rules to extract the labeled

    data and successively replaces those with more general rules. Syntactic and semantic

    information is incorporated using a part-of-speech (POS) tagger. Extraction rules consist

    of pre-filler, filler and post-filler patterns for each data field. These describe the context

    and syntax of the extraction target. The respective patterns for extracting the publisher

    name in the running example could be "</li>", "<li>", "Publisher:" as pre-filler
    tokens and "</li>", "<li>" as post-filler tokens. Depending on the training data, the

    filler pattern might specify that the publisher name consists of at most two words which

    were labeled as nouns by the POS tagger.
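
    The effect of such context patterns can be approximated with a regular expression.
    The following sketch is a deliberate simplification (Rapier itself learns token-based
    rules with POS constraints rather than raw regular expressions; the page string is
    invented for this example):

        import re

        pre_filler = re.escape("<li>Publisher: ")  # context before the target
        filler = r"(\w+(?:\s\w+)?)"                # at most two words
        post_filler = re.escape("</li>")           # context after the target

        rule = re.compile(pre_filler + filler + post_filler)
        page = "<ul><li>Title: <i>Ulysses</i></li><li>Publisher: Penguin</li></ul>"
        print(rule.search(page).group(1))  # 'Penguin'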

    Other examples of supervised IE systems are SRV [Fre98], WIEN [KWD97], Soft-

    Mealy [HD98], STALKER [MMK99] and DEByE [LRNdS02].

    2.2.4 Semi-Supervised Information Extraction

    Labeling training data in advance is a labor-intensive process limiting the scope of the IE

    system. Instead of requiring labeled data, semi-supervised IE systems extract potentially

    interesting data and let the user decide what shall be extracted. In other words, the user

    provides feedback to the IE system which is incorporated into the wrapper generation

    process.

    In the running example, a semi-supervised system might recover title, author and

    the publisher as extractable data fields from a set of unlabeled book pages. The user

    then selects which fields shall be extracted and how to integrate the information, e.g. by
    labeling the titles as such in the extraction target tuple.

    An example for a semi-supervised system is IEPAD [CL01]. Apart from extraction

    target selection, semi-supervised IE systems are very similar to unsupervised IE systems.

    2.2.5 Unsupervised Information Extraction

    Automatic or unsupervised IE systems extract data from unlabeled training documents.

    The core concept behind all unsupervised IE systems is to identify repetitive patterns in

    the input data and to extract the data items embodied in the recurrent patterns.


    Unsupervised IE systems can be subdivided into page-level extraction systems and

    record-level extraction systems. The former extract data from a page-wide template,

    while the latter assume multiple data records of the same type are available, rendered
    into one page by a common template. In case multiple records exist in a single web

    page, it might be possible to derive extraction rules from that single page, assuming
    the individual data records can be told apart. The record-level extraction task can be

    described as trying to extract various items from a list page (e.g. a product list from a

    web shop). In contrast, page-level extraction tasks require multiple pages (e.g. product

    detail pages) to discover patterns and learn extraction rules.

    Evidently, record-level extraction systems can only operate on documents containing

    multiple data records and require means to identify the data regions describing the

    individual data records. The latter problem can be tackled with string or tree alignment

    techniques. Examples for such systems are DEPTA [ZL05] and NET [LZ05].

    Page-level extraction systems can treat the whole input page as a data region from
    which the data record shall be extracted. However, multiple pages need to be fetched
    in advance for wrapper induction (at least two; depending on the IE system and the
    template, ten or even more training pages may be necessary to successfully derive
    extraction rules). Thus, the problem of collecting training data is

    shifted into the DR domain and is rarely addressed by IE researchers. Examples for

    page-level extraction systems are RoadRunner [CMM01] and ExAlg [AGM03].

    2.2.6 Case Studies

    In the following, a selection of well-known IE systems that try to solve similar
    problems is presented. One semi-supervised and three unsupervised IE systems are pre-
    sented, illustrating various techniques and the constraints associated with solving different

    IE tasks.

    RoadRunner

    RoadRunner is one of the early unsupervised Web IE systems, presented in 2001 by

    Crescenzi, Mecca and Merialdo [CMM01]. It compares multiple pages and generates
    union-free regular expressions (i.e. without disjunctions such as (A|B)) based on the
    identified similarities and differences.

    RoadRunner initializes the wrapper with a random page of the input set and matches

    the remaining pages using an algorithm called ACME matching. The wrapper is
    generalized for every encountered mismatch. Text string mismatches are interpreted as data

    fields, tag mismatches are treated as indicators of optional items and iterators. In the

    RoadRunner data model, individual data items must be separated by HTML tags but

    RoadRunner data model, individual data items must be separated by HTML tags, but
    tags must not occur as part of the data field. Figure 2.6 shows an example of a wrapper
    generated from two input pages.


    Figure 2.6: Wrapper induction example for RoadRunner [CMM01]

    The runtime complexity is exponential in the input string length. Therefore, heuris-

    tics were introduced to limit the exploration space.

    ExAlg

    Arasu and Garcia-Molina propose an IE system automatically deducing the template

    from a set of template-generated pages [AGM03]. ExAlg has a hierarchically structured

    data model and supports optional elements and disjunctions. A web page is modeled

as a list of tokens in which a token might either be an HTML tag or a word from a text

    node. ExAlg builds equivalence classes of the tokens found in the input documents.

    Based on these sets of tokens, the underlying template is deduced.

Figure 2.7 shows four example pages where each template-token is labeled with an index. Tokens with the same occurrence vector across all input documents build an equivalence class. The idea is that tokens emitted from the same template

    constructor will likely occur with the same frequency. Furthermore, ExAlg can detect

tokens with multiple roles, e.g. the token Name in Book Name and Reviewer Name has a different meaning in each occurrence. It differentiates between roles based on the occurrence-path10 and the spans of valid equivalence classes. For instance, an equivalence class in the given example consists of the tokens Reviewer, Name, Rating and Text together with their enclosing HTML tags, and has the occurrence vector ⟨1, 2, 1, 0⟩.

10 The occurrence-path, as defined by Arasu and Garcia-Molina, closely resembles an XPath query.

    ExAlg defines large and frequent equivalence classes (LFEQs) as classes containing

    many tokens which occur in a large fraction of the input documents. The LFEQs are

hierarchically structured and the order of the tokens is preserved. The nesting is governed by the span formed by all tokens in the respective equivalence class. LFEQs are

    passed to the analysis stage in which the template is deduced.

    Figure 2.7: Input pages in ExAlg [AGM03]

Starting from the root LFEQ (the tokens occurring exactly once in all input documents), ExAlg searches for non-empty positions between consecutive tokens and generates type constructors for these locations. Nested LFEQs are recursively visited and

    the types are constructed according to the data model. The generated template can then

    be used to extract data from input pages. For the given example, the original schema

B_Book, {B_Reviewer, B_Score, B_Text} can be recovered by analyzing the four input pages.

ExAlg has a sophisticated data model compared to other automatic IE systems. Moreover, ExAlg operates on the token level, not on the tag level as many other unsupervised extraction systems do, and thus has a chance of extracting attributes embedded in text nodes without any markup. The effectiveness of the extraction tends to improve with the number of input pages. However, experiments indicate that ExAlg works well for collections of under ten input documents, given that the occurrences of the attributes to be extracted exceed the chosen threshold.
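The core of the equivalence-class computation can be sketched in a few lines of Ruby. The sketch assumes the pages are already tokenized and omits ExAlg's role differentiation and LFEQ filtering; the example pages loosely mirror the book-review scenario above.

    # Group tokens by their occurrence vector across all input pages
    # (a much simplified take on ExAlg's equivalence classes).
    def equivalence_classes(pages)
      vectors = Hash.new { |h, token| h[token] = Array.new(pages.size, 0) }
      pages.each_with_index do |tokens, i|
        tokens.each { |token| vectors[token][i] += 1 }
      end
      vectors.group_by { |_token, vector| vector }
             .map { |vector, pairs| [vector, pairs.map(&:first)] }
    end

    pages = [
      %w[Book Reviewer Rating Text],
      %w[Book Reviewer Rating Text Reviewer Rating Text],
      %w[Book Reviewer Rating Text],
      %w[Book]
    ]
    p equivalence_classes(pages)
    # => [[[1, 1, 1, 1], ["Book"]], [[1, 2, 1, 0], ["Reviewer", "Rating", "Text"]]]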


    IEPAD

IEPAD, a semi-supervised IE system, was presented by Chang and Liu in 2001 [CL01]. It

    is capable of extracting homogeneous data records from a set of unlabeled pages. IEPAD

    generates wrappers by discovering repetitive patterns using multiple string alignment.

    The input document is converted to a binary representation of the data. HTML tags

and text elements are mapped to a set of fixed-length binary tokens. A PAT tree, which

    is a binary suffix tree, is created from the binary representation. The PAT tree, in turn, is

    used to find repetitive patterns by recording occurrence count and reference points for

    each recurring pattern. To tolerate inexact matches, the center star algorithm is applied

    to obtain generalized extraction patterns.

    The candidate patterns and the occurrence metrics are presented to the user. Upon

    selection of a pattern, a regular expression is created from the binary representation.

Thus, the wrapper can also operate on web pages without transforming them into the binary representation.
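A small Ruby sketch may illustrate the encoding step. The token alphabet and code width are invented for the example, all text content collapses into a single TEXT token, and the PAT-tree construction and center star alignment are left out.

    # Map each token class to a fixed-length binary code (codes are
    # illustrative only); unknown tokens are treated as text.
    CODES = { '<b>' => '000', '</b>' => '001', '<br>' => '010', 'TEXT' => '111' }

    def encode(tokens)
      tokens.map { |t| CODES.fetch(t, CODES['TEXT']) }.join
    end

    p encode(%w[<b> Congo </b> <br> <b> Egypt </b> <br>])
    # => "000111001010000111001010" -- the repeated substring 000111001010
    #    corresponds to one data record and would surface in the PAT tree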

    DEPTA

    DEPTA stands for Data Extraction based on Partial Tree Alignment and is an unsupervised

    IE system [ZL05]. DEPTA extracts data records from list pages with an algorithm called

MDR, taking advantage of the tree structure of the HTML page. MDR was first presented by Liu et al. in 2003 [LGZ03].

The design of MDR is based on two observations about data records. The first observation states that similar objects are likely located in a contiguous region and formatted with almost identical or at least similar HTML tags. The second observation is that

    similar data records are built by sub-trees of a common parent node.

    The algorithm first builds the DOM-tree for the web page and stores the bounding

    box for each element.11 Adjacent nodes that share the same parent are then compared by

    computing the string edit distance of the tag strings. If the estimated similarity exceeds

    a predefined threshold, the group of nodes is identified as a data region. To account for

    data records that are spread over multiple sibling nodes, the concept of generalized nodes

was introduced. Generalized nodes encompass one or more sibling nodes. Figure 2.8 shows an abstracted tag tree where nodes 5, 6 and 8, 9, 10 build two data regions, as the respective nodes in each region are similar. The combined node

    pairs (14, 15), (16, 17) and (18, 19) are also similar to each other and each pair builds a

    generalized node. Data records are derived from generalized nodes. However, there are

    cases when such a node does not represent a single data record. DEPTA handles some

    special cases to deal with these discontinuities in data records.

    Finally, data fields are extracted from the alleged data records. After all tag-trees

    belonging to the data record are assembled in a new tree, partial tree alignment is

    performed to induce the structure of the data. The idea is to match the fields from all

    data records to build a generalized representation of the data record.

11 The visual information for each tag is supplied by a web browser.


    Figure 2.8: Generalized nodes and data regions in DEPTA [ZL05]

    MDR can handle non-contiguous data records and is capable of extracting data

    records that span multiple sibling nodes. The assumption is made that HTML tags

    are generated by the template and text nodes belong to the data to be extracted. Visual

cues are consulted to distinguish individual data records. However, the extraction is limited to flat data records. Support for nested data records (e.g. two data records sharing

    data items from a common parent data record) was added in a successor system called

    NET [LZ05]. In the latter system, a post-order traversal of the tag tree is performed to

    identify data records at different levels. NET uses simple tree matching to compute the

    tree similarity and aligns the trees whose similarity is above a chosen threshold.
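The similarity test underlying MDR can be sketched as follows. The flattening of sub-trees into tag strings, the normalization and the threshold value are assumptions of this sketch, which also ignores generalized nodes and the visual cues.

    # Token-level Levenshtein distance between two tag strings.
    def edit_distance(a, b)
      d = Array.new(a.size + 1) { |i| Array.new(b.size + 1) { |j| i.zero? ? j : (j.zero? ? i : 0) } }
      (1..a.size).each do |i|
        (1..b.size).each do |j|
          cost = a[i - 1] == b[j - 1] ? 0 : 1
          d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
        end
      end
      d[a.size][b.size]
    end

    # Adjacent sibling sub-trees form a data region if their tag strings
    # are sufficiently similar (normalized similarity above the threshold).
    def data_region?(tag_strings, threshold = 0.7)
      tag_strings.each_cons(2).all? do |s, t|
        a, b = s.split, t.split
        1.0 - edit_distance(a, b).fdiv([a.size, b.size].max) >= threshold
      end
    end

    p data_region?(['tr td td', 'tr td td', 'tr td td b'])  # => true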

    2.2.7 Summary

This section introduced Web IE concepts and techniques and presented a few interesting automatic IE systems from the literature. An information system consisting of a document retrieval and an information extraction component is able to identify relevant Web pages and extract salient data from the respective pages. However, to embed the obtained information into an existing knowledge base, information integration techniques are required.

    2.3 INFORMATION INTEGRATION

    "It is a very sad thing that nowadays there is so little useless information." Oscar

    Wilde

    After retrieving and extracting information from heterogeneous sources, the obtained

    data needs to be related to existing data. The inherent challenges of information integra-

    tion (II) originate in the structural and semantic heterogeneity of the various information

    sources. Data can be laid out and stored in different ways depending on the chosen data

    model leading to structural heterogeneity. Semantic heterogeneity is concerned with the

    content and meaning of the data.


    Wache et al. state the problem of information integration and semantic interoperabil-

    ity as follows [WVV+01]:

    "In order to achieve semantic interoperability in a heterogeneous information sys-

    tem, the meaning of the information that is interchanged has to be understood acrossthe systems. Semantic conflicts occur whenever two contexts do not use the same

    interpretation of the information."

According to Pollock and Hodgson, semantic conflicts can be classified as naming conflicts, scaling and unit conflicts, confounding conflicts, or domain conflicts. Naming conflicts occur in the presence of synonyms and homonyms, i.e. multiple names exist for the same entity or one name denotes different entities. Different units and currencies lead to scaling conflicts. Metrics

    may either be explicitly encoded in the data or implicitly assumed. Confounding con-

    flicts arise when a same-named entity is defined differently by the various information

    providers. Finally, domain conflicts occur when data is modeled with distinct domain-

    specific intentions resulting in overlapping or disjoint concepts [PH04].

    Information integration can be approached with ontology-mapping techniques.

    Ontologies are well suited to model hidden and implicit knowledge for different

domains. Wache et al. give a concise overview of ontology-based information integration techniques [WVV+01].

    2.4 LEGAL CONSIDERATIONS

    Retrieving, extracting and integrating information published by a third party may have

    legal implications. The terms of service of the respective sites apply which may prohibit

web scraping of their content. Although a few precedents exist, this is a grey area of law that has been ruled on differently depending on the jurisdiction and the case. Automatically adhering to the terms of use of a web site visited only by the IR/IE system is not realizable unless the terms can be retrieved and understood by the crawler.

Legal advice should be sought before employing web scraping in a public or commercial software system.

    2.5 FEDSEEKO

Fedseeko is a federated search engine whose goal is to facilitate obtaining product information from the Internet [WSS09]. It uses adapters to access diverse product information providers such as online shopping malls, producer sites and third-party information

    portals like forums or blogs. The information sources are accessed via web services if

    such a possibility exists. For instance, the Amazon Product Advertising API12 provides

    extensive vendor information through a web service. In case no such interface exists,

    the information may be extracted using web scraping techniques. Figure 2.9 depicts the

12 http://docs.amazonwebservices.com/AWSECommerceService/latest/DG/


architecture of Fedseeko and its internal and external interfaces. The reference implementation is based on Ruby on Rails.

    Figure 2.9: Fedseeko architecture [WSS09]

    2.5.1 Producer Information Integration

    In the following section, some important aspects of the original producer information

    extraction implementation will be outlined. As a first step, the manufacturer URL for a

    given product is retrieved by a web search query. The first hit of a web search restricted

to the .com domain is considered to be the producer site and will be the basis of downstream product page searches. The product page is located via a phrase search on the

    suspected producer site.

    Fedseeko uses XPath queries to address the individual nodes associated with a prod-

    uct attribute. The mining of XPath queries requires guidance. An example key/value

    pair needs to be supplied, which is used to locate the proper product URL. Starting

    from the suspected product page, the linked pages are walked and page contents are

    matched via a similarity check with the key/value phrases. The search stops once a

    page with the requested resemblance is found. Once a matching product page is found,

a Scrubyt13 extractor computes the XPath queries for the key, the value and the base query, respectively. The identified XPath queries are associated with the producer, implying

    a single producer-wide template.
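Applying such stored queries conceptually boils down to a few lines of Ruby with Nokogiri. The three XPath expressions below are fabricated for illustration and are not actual rules mined by Fedseeko.

    require 'nokogiri'

    # Hypothetical per-producer extraction rules.
    BASE_XPATH  = "//table[@class='specs']//tr"
    KEY_XPATH   = './th'
    VALUE_XPATH = './td'

    # Apply the base query to locate attribute rows, then the key and value
    # queries relative to each row.
    def extract_attributes(html)
      doc = Nokogiri::HTML(html)
      doc.xpath(BASE_XPATH).map do |row|
        [row.xpath(KEY_XPATH).text.strip, row.xpath(VALUE_XPATH).text.strip]
      end
    end

    html = <<-HTML
      <table class="specs">
        <tr><th>Total Pixels</th><td>12.9 million</td></tr>
        <tr><th>Weight</th><td>485 g</td></tr>
      </table>
    HTML
    p extract_attributes(html)
    # => [["Total Pixels", "12.9 million"], ["Weight", "485 g"]]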

    Fedseeko uses mapping ontologies to relate producer information to available infor-

    mation of similar products by other manufacturers.

The shortcomings of the existing producer information solution are first and foremost the required amount of user supervision. Supplying samples for each attribute

    and producer is a labor-intensive process, especially considering the large variety of

13 scRUBYt! is a Ruby library designed to facilitate web scraping tasks. http://scrubyt.org


    producers and the number of attributes associated with some products14. Furthermore,

    the limitation of one template per producer is an oversimplifying presumption. Large

    producers with a manifold product range may use slightly different templates for dif-

    ferent product categories. A new approach towards producer information retrieval and

    extraction aiming to overcome the deficiencies of the existing implementation will bepresented in this thesis.

    2.6 SUMMARY

    An overview was given covering the research areas information retrieval, information

extraction, and information integration. The brief treatment of IR focused on effectiveness metrics, while an in-depth introduction to Web IE was provided. Important IE techniques have been presented and exemplary IE systems have been examined. Some of

    the methods and techniques will be reused and referenced in the subsequent chapters.

    II was swiftly covered for the sake of completeness but is otherwise outside the scope of

    this thesis.15

    Finally, the federated search engine Fedseeko has been introduced and its producer

    information integration component was evaluated. During the course of this thesis, a

    replacement of this component will be developed.

14 For instance, in the domain of digital cameras more than one hundred attributes may be listed per product.

15 A related work, conducted contemporaneously, revamps the ontology mapping in Fedseeko.


    3 REQUIREMENTS

    The goal of the revised information extraction component is to minimize the effort as

    well as the cost of obtaining and providing first-hand product information. Upon a

    query for a certain product, the system shall extract all available product attributes from

the manufacturer's web site without requiring guidance or supervision. In contrast

    to the existing IE system, web sites based on not yet encountered templates shall be

    analyzed automatically and extraction rules be inferred and stored for future requests.

    A change of a known template requiring different extraction rules should be detected

    and acted upon.

In this chapter, the information flow of the retrieval and extraction system is analyzed. A functional and behavioral description is given. Finally, the validation criteria

    for the software system will be briefly covered.

    3.1 INFORMATION DESCRIPTION

In a nutshell, the information extraction system shall locate product pages on the Internet and extract product attributes without any mandatory user interaction. As depicted in figure 3.1, the only input to the software system is a product descriptor. This product descriptor or identifier may be manually entered or may originate from vendor databases or other sources listing products.

    The input is a tuple comprising a manufacturer name and a product identifier. The latter

    can be decomposed into a list of tokens, where the tokens describe a specific product.

Based on this information, the manufacturer's product page is to be retrieved. An

    example input is Apple Inc. and MacBook Pro.

    The output is an ordered set of attribute tuples extracted from the product page asso-

    ciated with a product. Each attribute tuple consists of a key and a value character string,

    e.g. "Weight", "42 kg".


    Figure 3.1: Overview of information flow

The extracted attributes may be saved in a database, passed to a downstream processor, or presented directly to the user. A product detail view in Fedseeko presents the producer information alongside other related data like product

    reviews. Furthermore, the extracted data is passed to an information integration system

    performing ontology mapping. The latter task is carried out by a separate system which

    will not be discussed herein.

    The source of the attributes to be extracted are product detail pages residing at the

respective manufacturer sites. Empirical observations regarding these pages will be presented in the next section.

    3.1.1 Product Pages

The IE engine shall be able to extract product attributes from a vast number of heterogeneous manufacturer pages. The following empirical observations describe characteristics of typical product pages.

    1. A product page with sufficient information often describes only a single product

    but may contain data for different product variants.

    2. A manufacturer may use more than one template for different product categories

    or families.

    3. There might be very few pages available with a common template.

    4. Multiple description pages with different templates might exist for the same prod-

    uct, e.g. a summary and a specification page.

    These characteristics do not apply to all product domains. Throughout this work

the focus is laid on those kinds of products for which a human operator could easily tell product features apart by looking at the product page. Figure 3.2 shows a product page of a Nikon digital camera for which attributes like

    "Total Pixels", "12.9 million" shall be extracted.


    Figure 3.2: Product page example with the extraction targets being highlighted

    3.2 FUNCTIONAL DESCRIPTION

    The complete product information retrieval system can be decomposed into two major

    components. One component is responsible for the identification of the manufacturer

site as well as the proper product page. The other component's task is to extract product

    attributes from the aforementioned product page.

The document retrieval component locates and fetches the product page from the manufacturer's web site. If multiple pages exist for a single product, the page with the

    most syntactically structured content should be picked. For example, a specifications

    page is better suited for Web IE than a free text summary page.

    The information extraction component extracts attribute tuples from a product page

    of a specific template. Its job is to filter irrelevant data and identify the useful bits

    of information in a given document. Either new rules are derived for identifying the

    extraction targets or already stored ones are used to extract data out of a page created

from a previously encountered template. Extraction from a page based upon a known

    template is an on-line operation1. Therefore, it should deliver results within the time-

1 It shall be performed while the user of the system waits for a response to his request.


    frame given for the overall Fedseeko query to complete. In other words, if a query for

a Fedseeko product detail page should respond within fifteen seconds, the extraction's

    execution time should not exceed this bound in the average case.

    As it might not be possible to select the proper wrapper object to extract data from a

    given document, a wrapper shall be able to detect ineligible input pages. In effect, the

    wrapper verification problem must be solved inside the wrapper object.
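The following Ruby skeleton sketches this idea; the class layout and the presence test on a base XPath are hypothetical simplifications of whatever verification heuristic is finally chosen.

    require 'nokogiri'

    # A wrapper that refuses ineligible input pages before extracting.
    class Wrapper
      def initialize(base_xpath)
        @base_xpath = base_xpath
      end

      # Wrapper verification: the page must contain the template anchor.
      def applicable?(doc)
        !doc.xpath(@base_xpath).empty?
      end

      def extract(html)
        doc = Nokogiri::HTML(html)
        return nil unless applicable?(doc) # detect pages from a foreign template
        doc.xpath(@base_xpath).map { |node| node.text.strip }
      end
    end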

    The wrapper induction component creates extraction rules for one or more pages sharing

    a specific template. Wrapper induction only needs to be executed if a new template is

    discovered or a known template has changed. Thus, the operation may be performed

    off-line on a best effort basis.

    3.3 BEHAVIORAL DESCRIPTION

Most of the system's operations are invisible to the user. Upon requesting detailed

    information for a given product, the system will retrieve the product page and extract

    all product attributes from that page. No user input is required.

    However, the system may not be able to retrieve the proper product page, may fail

    to extract any information or select bogus data. For these cases, the user may intervene

    after the retrieval and extraction steps have been executed. The user shall be given

    means to correct the estimated product page URL. Furthermore, extracted data may be

discarded, whereupon the extraction can be restarted. Should the automatic extraction

    fail to deliver meaningful data, the user may provide hints to facilitate the extraction

    process.

    3.4 VALIDATION CRITERIA

    The software system is evaluated according to a gold standard2. A control group of

    one hundred products from twenty different domains is used to validate the proper

    operation of the system as well as to measure the effectiveness of the retrieval and

    extraction components. In order to spot the cause of extraction failures, the subsystems

    are examined individually.

    The automatic extraction of attributes shall work reliably in the majority of the test

    cases. With additional information, it ought to be possible to successfully extract the

    proper data from four out of five documents.

    For each test product, the proper product URL is gathered manually and a reference

    attribute is recorded. This manually gathered data is matched with the automatically

    computed data during evaluation.

The document retrieval subsystem either succeeds in locating a product page suitable for information extraction or fails to do so. Therefore, the precision metric follows

2 Wikipedia defines a gold standard test as a "diagnostic test or benchmark that is regarded as definitive" [Wik09]. Test results are interpreted such that no false-positive or false-negative results are included.


    the probabilistic interpretation and states the probability that the returned document is

    relevant.

    3.5 SUMMARY

This chapter stated the goal of the software system, and the requirements were analyzed

    from various perspectives. Based on the given problem analysis, a software system will

    be developed. Its design, implementation and evaluation will be presented throughout

    the subsequent chapters.


    4 DESIGN

    The system design is outlined in this chapter. A description of each component required

to solve the problem is provided as a processing narrative and in the context of the architectural design.

    4.1 DATA DESIGN

    The input and output data is depicted in figure 4.1. The key components have been

identified as the product page locator, responsible for DR, and the components revolving

    around the wrapper logic, responsible for Web IE. Both components and their design

    constraints will be exhibited in this section.

[Figure showing the components Product ID, Product Page Locator, Manufacturer Web Site, Product Page, Wrapper Induction, Wrapper Database, Wrapper and Attributes]

Figure 4.1: Information flow during extraction


    4.1.1 Retrieving Product Pages

    The DR component must supply the downstream IE processor with a genuine product

    page. In contrast to the more common DR systems in which a large set of documents

is returned, selecting the proper product page is a binary choice. Either the right product page is identified or the IE component won't be able to extract relevant data. In

    effect, the goal of the document retrieval subsystem is to optimize the precision for

    the top-ranked candidate (i.e. according to the terminology introduced in section 2.1.2,

    precision(1) shall be maximized).

    In a full-fledged product page retrieval system, all manufacturer sites would have

    to be indexed in advance in order to allow the retrieval of subordinate product pages.

However, this work focuses on the information extraction task and only limited

    resources are available. Hence, it was chosen not to build a dedicated document index

for product page retrieval from the World Wide Web. Instead, the results of existing

    web search services are used and combined to pick the product page. The results of

    multiple web search engines such as Google Search, MSN Search and Yahoo! Search

    shall be aggregated to obtain a maximum coverage of the World Wide Web and benefit

    from well-established ranking algorithms used in the respective services.

Product page retrieval is laid out as a two-step process. In a first step, the producer

    page is located and, in a second step, the product page is searched at the producer

site. In this manner, first-hand product information is not intermixed with third-party information like web shop offers or product reviews. If the suspected producer site turns out not to feature the product, the DR component should fall back to another producer candidate.

    Product Page Ranking

    During product page retrieval on the producer site, the DR subsystem tries to pick the

proper page from the top-ranked candidates of multiple web search engines. Considering more than just the single top-ranked candidate improves the chance that a relevant document

    is among the set of retrieved documents. The ranking of the individual search engines

    is combined using Borda ranking, known from social choice theory. In Borda ranking,

named after Jean-Charles de Borda, who proposed it as an election method in 1770,

    every voter announces an ordered list of preferred candidates. If there are n candidates,

the top-ranked candidate of each voter receives n points and each lower-ranked candidate receives a decremented score. Borda ranking and other search result combination methods are discussed in Web Data Mining by Bing Liu [Liu06].
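A compact Ruby sketch of the combination step, assuming each engine contributes an ordered list of result URLs and awarding, as in Table 4.1 below, list-length-minus-position points to each ranked candidate and none to unranked ones:

    # Combine the per-engine rankings with a Borda count.
    def borda_combine(rankings)
      scores = Hash.new(0)
      rankings.each do |ranking|
        ranking.each_with_index do |url, i|
          scores[url] += ranking.size - i # the top candidate earns the most
        end
      end
      scores.sort_by { |_url, score| -score }
    end

    engine_a = %w[/news /detail /index /forum]
    engine_b = %w[/index /reviews /forum /detail]
    p borda_combine([engine_a, engine_b]).first
    # => ["/index", 6] -- matching the Borda ranks shown in Table 4.1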

    Table 4.1 shows the search results of an artificial query. As indicated in the example,

a combined ranking may not suffice to select the proper document from a set of candidates. Therefore, additional metrics are incorporated to refine the original ranking. Figure 4.2 gives an overview of the approaches used to process the candidate list. Some techniques try to identify a page that contains specification information, while other methods scan for references to the searched product. The scores of


Table 4.1: Top four search results of two web search engines

    Document                                 Relevant?   Rank A   Rank B   Borda Rank
    /news/november/the_new_shiny_product     no          1        -        4 + 0 = 4
    /products/detail.html?category=6&id=17   yes         2        4        3 + 1 = 4
    /products/index.html?category=6          no          3        1        2 + 4 = 6
    /forum/show.html?post=42                 no          4        3        1 + 2 = 3
    /reviews/produ