Learning to Extract Symbolic Knowledge from the World Wide Web

Changho Choi

Source: http://www.cs.cmu.edu/~knigam/

Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum

Carnegie Mellon University, J. Stefan Institute

AAAI-98

3/6/2001 Changho Choi, University at Buffalo

Abstract

Information on the Web (understandable to humans) -> extract information -> KB (knowledgeable)


Introduction (#1/4)

Two types of inputs to the information extraction system:

Ontology
Specifies the classes and relations of interest.
For example, a hierarchy of classes including Person, Student, Research.Project, Course, etc.

Training examples
Represent instances of the ontology classes and relations.
For example, a course web page for the Course class, faculty web pages for the Faculty class, a pair of such pages for Courses.Taught.By, etc.


[Figure: example ontology showing classes and relations (relation : value)]


Introduction (#3/4)

Assumptions about the mapping between the ontology and the Web

1. Each instance of an ontology class is a single Web page, a contiguous string of text, or a collection of several Web pages.

2. Each instance of a relation is a segment of hypertext, a contiguous segment of text, or the hypertext segment.


Introduction (#4/4)

Three primary learning tasks are involved in extracting knowledge-base instances from the Web:

1. Recognizing class instances by classifying bodies of hypertext.

2. Recognizing relation instances by classifying chains of hyperlinks.

3. Recognizing class and relation instances by extracting small fields of text from Web pages.


Experimental Testbed

Experiments
Based on the ontology
Classes: department, faculty, staff, student, research_project, course, other
Relations: Instructors.Of.Course (251), Members.Of.Project (392), Department.Of.Person (748)

Data sets
A set of pages (4127) and hyperlinks (10945) from 4 CS departments
A set of pages (4120) from numerous other CS departments

Evaluation
Four-fold cross validation
3 for training, 1 for testing
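The evaluation scheme can be sketched in Python as follows. This is only an illustration: it assumes that each fold corresponds to one of the four departments, and the group names and page identifiers are hypothetical placeholders.

```python
def leave_one_group_out(pages_by_group):
    """Yield (held_out, train, test), holding out one group per fold."""
    for held_out in pages_by_group:
        # Train on every page that does not belong to the held-out group.
        train = [page
                 for group, pages in pages_by_group.items()
                 if group != held_out
                 for page in pages]
        test = list(pages_by_group[held_out])
        yield held_out, train, test


# Hypothetical data: four departments, a few page ids each.
groups = {
    "dept_a": ["a1", "a2"],
    "dept_b": ["b1"],
    "dept_c": ["c1", "c2"],
    "dept_d": ["d1"],
}

folds = list(leave_one_group_out(groups))
for held_out, train, test in folds:
    print(held_out, "train:", len(train), "test:", len(test))
```

With four groups this produces four folds, each training on three groups and testing on the remaining one.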


Statistical Text Classification

Process
Build a probabilistic model of each class using labeled training data.
Classify newly seen pages by selecting the class that is most probable given the evidence of words describing the new page.

Train three classifiers
Full-text
Title/Heading
Hyperlink



Approach: naïve Bayes, with minor modifications

Based on Kullback-Leibler divergence. Given a document d to classify, we calculate a score for each class c as follows:

\[
\mathrm{Score}_c(d) = \frac{\log \Pr(c)}{n} + \sum_{i=1}^{T} \Pr(w_i \mid d)\,\log\!\left(\frac{\Pr(w_i \mid c)}{\Pr(w_i \mid d)}\right)
\]

where
n = the number of words in d
T = the size of the vocabulary
w_i = the ith word in the vocabulary
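A minimal Python sketch of this score, assuming tokenized documents. The Laplace smoothing used to estimate Pr(w|c), the function names, and the toy documents are illustrative assumptions and are not taken from the slides.

```python
import math
from collections import Counter


def train_class_model(docs, vocab, alpha=1.0):
    """Estimate Pr(w|c) from tokenized docs; Laplace smoothing is an assumption."""
    counts = Counter(w for doc in docs for w in doc if w in vocab)
    total = sum(counts.values())
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}


def score(doc, prior, word_probs):
    """log Pr(c)/n + sum_i Pr(w_i|d) * log(Pr(w_i|c) / Pr(w_i|d))."""
    words = [w for w in doc if w in word_probs]
    n = len(words)
    if n == 0:
        raise ValueError("document has no in-vocabulary words")
    total = math.log(prior) / n
    # Words absent from d have Pr(w|d) = 0 and contribute nothing to the sum.
    for w, k in Counter(words).items():
        p_wd = k / n
        total += p_wd * math.log(word_probs[w] / p_wd)
    return total


# Hypothetical toy data.
vocab = {"course", "homework", "professor", "research"}
models = {
    "course": train_class_model([["course", "homework", "course"]], vocab),
    "faculty": train_class_model([["professor", "research", "professor"]], vocab),
}
priors = {"course": 0.5, "faculty": 0.5}

doc = ["homework", "course"]
scores = {c: score(doc, priors[c], models[c]) for c in models}
predicted = max(scores, key=scores.get)
print(predicted)
```

The page is assigned to the class with the highest score. Dividing the prior term by n keeps its weight comparable across documents of different lengths.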



Experimental evaluation

Rows are predicted classes; columns are actual classes.

Predicted          course  student  faculty  staff  research_project  department  other  Accuracy (%)
course                202       17        0      0                 1           0    552          26.2
student                 0      421       14     17                 2           0    519          43.3
faculty                 5       56      118     16                 3           0    264          17.9
staff                   0       15        0      4                 0           0     45           6.2
research_project        8        9       10      5                62           0    384          13.0
department             10        8        3      1                 5           4    209           1.7
other                  19       32        7      3                12           0   1064          93.6
Coverage (%)         82.8     72.4     77.1    8.7              72.9       100.0   35.0


Accuracy/coverage

Coverage
The percentage of pages of a given class that are correctly classified as belonging to the class

Accuracy
The percentage of pages classified into a given class that are actually members of that class
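As a worked example of these two definitions, the sketch below computes accuracy and coverage for one class from a confusion matrix whose rows are predicted classes and whose columns are actual classes. The counts are copied from the experimental evaluation table; the function and variable names are illustrative.

```python
LABELS = ["course", "student", "faculty", "staff",
          "research_project", "department", "other"]

# Rows: predicted class. Columns: actual class.
MATRIX = [
    [202, 17, 0, 0, 1, 0, 552],
    [0, 421, 14, 17, 2, 0, 519],
    [5, 56, 118, 16, 3, 0, 264],
    [0, 15, 0, 4, 0, 0, 45],
    [8, 9, 10, 5, 62, 0, 384],
    [10, 8, 3, 1, 5, 4, 209],
    [19, 32, 7, 3, 12, 0, 1064],
]


def accuracy_and_coverage(matrix, labels, cls):
    """Return (accuracy %, coverage %) for one class."""
    k = labels.index(cls)
    correct = matrix[k][k]
    predicted_total = sum(matrix[k])              # pages classified as cls
    actual_total = sum(row[k] for row in matrix)  # pages that really are cls
    return 100.0 * correct / predicted_total, 100.0 * correct / actual_total


acc, cov = accuracy_and_coverage(MATRIX, LABELS, "course")
print(f"course: accuracy {acc:.1f}%, coverage {cov:.1f}%")
```

For the course class this is 202/772 for accuracy and 202/244 for coverage.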


Accuracy/coverage tradeoff

1. Full-text classifiers

2. Hyperlink classifiers

3. Title/heading classifiers

“Hyperlink information can provide strong knowledge.”


First-Order Text Classification

Second approach for text classification: learn first-order rules for classifying pages

1st-order: rules with variables. Prolog-like, function-free Horn clauses.
FOIL is a well-known algorithm for first-order learning.

0th-order: no variables.
C4.5 is a well-known algorithm for zeroth-order learning.


FOIL’s input for text classification

For each distinct word, a predicate has_word(Page); words are stemmed.

For every hyperlink, link_to(Page, Page)

Training data:
Student(“http://www.cs.buffalo.edu/grads.html”), …
Course(“http://www.cse.buffalo.edu/courses.html”), …
…
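One possible way to hold these relations in memory is sketched below in Python. It is an assumption about representation, not FOIL's actual input format: has_word maps each word to the set of pages containing it, and link_to is a set of (source, target) pairs. Stemming is omitted for brevity, and the example.edu URLs and page texts are hypothetical.

```python
import re


def build_facts(pages):
    """pages: {url: (text, [linked_urls])} -> (has_word, link_to)."""
    has_word = {}    # word -> set of pages for which has_<word>(Page) holds
    link_to = set()  # (source_page, target_page) pairs
    for url, (text, links) in pages.items():
        # Lowercased alphabetic tokens; a real pipeline would also stem them.
        for word in set(re.findall(r"[a-z]+", text.lower())):
            has_word.setdefault(word, set()).add(url)
        for target in links:
            link_to.add((url, target))
    return has_word, link_to


# Hypothetical pages.
pages = {
    "http://example.edu/a": ("Professor of Computer Science",
                             ["http://example.edu/b"]),
    "http://example.edu/b": ("Course homework", []),
}

has_word, link_to = build_facts(pages)
print(sorted(has_word["professor"]))
print(sorted(link_to))
```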


FOIL’s result

Sample learned rules:

Student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)).
Test Set: 126(+), 5(-)

Faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).
Test Set: 18(+), 3(-)
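To show how such a rule is read, here is a hedged Python sketch that evaluates the Faculty rule over a small fact base. The variable B is existentially quantified: the rule fires if some page B links to A and contains the stem "faculti". The page names and facts are hypothetical.

```python
def faculty(a, has_word, link_to):
    """Faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B)."""
    def has(word, page):
        return page in has_word.get(word, set())

    if not (has("professor", a) and has("ph", a)):
        return False
    # B ranges over every page with a hyperlink pointing to A.
    return any(dst == a and has("faculti", src) for src, dst in link_to)


# Hypothetical facts: word -> pages, and (source, target) hyperlinks.
has_word = {
    "professor": {"prof_page", "student_page"},
    "ph": {"prof_page", "student_page"},
    "faculti": {"dept_page"},
}
link_to = {("dept_page", "prof_page"), ("prof_page", "student_page")}

all_pages = ["dept_page", "prof_page", "student_page"]
predicted = sorted(p for p in all_pages if faculty(p, has_word, link_to))
print(predicted)
```

Here student_page has the same words as prof_page but is rejected, because the only page linking to it lacks the stem "faculti". This is the kind of evidence from neighboring pages that a word-only classifier cannot use.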


FOIL’s result

Compared to statistical classification:

More accurate
Less coverage


Classifying Hyperlinks

Use a first-order representation because this task involves discovering hyperlink paths of unknown and variable size, and because we want to find patterns such as the following:

“The ProjectMember(A,B) relation holds if A is a Person, and B is a ResearchProject, and B includes a link to A near the word ‘People’”.


FOIL’s Input for classifying hyperlinks

Predicates:
class(Page)
link_to(Hyperlink, Page, Page)
has_word(Hyperlink)
all_words_capitalized(Hyperlink)
has_alphanumeric_word(Hyperlink)
has_neighborhood_word(Hyperlink)

Training examples:
Department.Of.Person(“CSE”, “Changho Choi”), …
Instructors.Of.Course(“Sargur N. Srihari”, “CSE711”), …


FOIL’s result

Sample learned rules:

members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), neighborhood_word_people(C).
Test Set: 18(+), 0(-)

department_of_person(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), neighborhood_word_graduate(E).
Test Set: 371(+), 4(-)
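The members_of_project rule describes a two-step hyperlink path: the project page A links to an intermediate page D through a hyperlink C that appears near the word "people", and D links to the person page B. A small Python sketch of that check follows, assuming link_to is stored as (hyperlink, source, target) triples; all page and hyperlink identifiers are hypothetical.

```python
def members_of_project(a, b, research_project, person, link_to, near_people):
    """members_of_project(A,B) :- research_project(A), person(B),
    link_to(C,A,D), link_to(E,D,B), neighborhood_word_people(C)."""
    if a not in research_project or b not in person:
        return False
    for c, src, d in link_to:
        # First hop: hyperlink C leaves A and has "people" in its neighborhood.
        if src != a or c not in near_people:
            continue
        # Second hop: some hyperlink E goes from D to B.
        if any(src2 == d and dst2 == b for _e, src2, dst2 in link_to):
            return True
    return False


# Hypothetical facts.
research_project = {"proj"}
person = {"alice", "bob"}
link_to = {
    ("h1", "proj", "people_page"),
    ("h2", "people_page", "alice"),
    ("h3", "proj", "bob"),
}
near_people = {"h1"}

args = (research_project, person, link_to, near_people)
alice_is_member = members_of_project("proj", "alice", *args)
bob_is_member = members_of_project("proj", "bob", *args)
print(alice_is_member, bob_is_member)
```

In this toy data alice is reached through the page linked near "people", while bob is linked directly from the project page and therefore does not match this particular rule.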


FOIL’s result

Fairly high accuracy

Limited coverage, because of the limited coverage of the page classifiers


Extracting Text Fields

Uses a richer set of predicates:
length(Fragment, Relop, N)
some(Fragment, Var, Path, Attr, Value)
position(Fragment, Var, From, Relop, N)
relpos(Fragment, Var1, Var2, Relop, N)

Sample learned rule:

ownername(Fragment) :- some(Fragment, B, [], in_title, true), length(Fragment, <, 3), some(Fragment, B, [prev_token], word, “gmt”), some(Fragment, A, [], longp, true), some(Fragment, B, [], word, unknown), some(Fragment, B, [], quadrupletonp, false).


FOIL’s result


Conclusions

The approach we propose in this paper is to construct a system that can be trained to automatically populate such a KB.

We have presented a variety of approaches that take advantage of the special structure of hypertext by considering relationships among Web pages, their hyperlinks, and specific words on individual pages and hyperlinks.
