KnowItAll April 5 2007 William Cohen. Announcements Reminder: project presentations (or progress...

KnowItAll

April 5 2007

William Cohen

Announcements

• Reminder: project presentations (or progress report)

  – Sign up for a 30min presentation (or else)
  – First pair of slots is April 17
  – Last pair of slots is May 10

• William is out of town April 6-April 9
  – So, no office hours Friday.

• Next week: no critiques assigned
  – But I will lecture

Bootstrapping

BM’98

Brin’98

Hearst ‘92

Scalability, surface patterns, use of web crawlers…

Learning, semi-supervised learning, dual feature spaces…

Deeper linguistic features, free text…

Collins & Singer ‘99

Riloff & Jones ‘99

Cucerzan & Yarowsky ‘99

Etzioni et al 2005

Rosenfeld and Feldman

Stevenson & Greenwood

Clever idea for learning relation patterns & strong experimental results

De-emphasize duality, focus on distance between patterns.

Know It All

Architecture

Set of (disjoint?) predicates to consider + two names for each

~= [H92]

• Context – keywords from user to filter out non-domain pages
• … ?

Architecture

Bootstrapping - 1

“city”

template rule

Bootstrapping - 2

Each discriminator U is a function: fU(x) = hits(“city x”) / hits(“x”)

i.e., fU(“Pittsburgh”) = hits(“city Pittsburgh”) / hits(“Pittsburgh”)

These are then used to create features: fU(x)>θ and fU(x)<θ
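A minimal sketch of one discriminator and its thresholded features. The `hits` lookup is a hypothetical stand-in for a search-engine hit count, and the counts and threshold below are made-up numbers for illustration only.

```python
def discriminator_score(hits, template, instance):
    """f_U(x) = hits(template filled with x) / hits(x).

    `hits` is any callable mapping a query string to a hit count
    (an assumption; the slides use a web search engine here).
    """
    denominator = hits(instance)
    if denominator == 0:
        return 0.0
    return hits(template.format(x=instance)) / denominator


def threshold_features(score, theta):
    """Turn one score into the two boolean features f_U(x) > theta and f_U(x) < theta."""
    return score > theta, score < theta


# Toy hit counts (invented for this example).
TOY_HITS = {"city Pittsburgh": 1000, "Pittsburgh": 100000,
            "city Banana": 2, "Banana": 50000}

def toy_hits(query):
    return TOY_HITS.get(query, 0)

pittsburgh = discriminator_score(toy_hits, "city {x}", "Pittsburgh")
banana = discriminator_score(toy_hits, "city {x}", "Banana")
```

With a hypothetical threshold of 0.001, `threshold_features(pittsburgh, 0.001)` fires the "above" feature and `threshold_features(banana, 0.001)` fires the "below" feature.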

Bootstrapping - 3

1. Submit the queries & apply the rules to produce initial seeds.

2. Evaluate each seed with each discriminator U: e.g., compute PMI stats like: |hits(“city Boston”)| / |hits(“Boston”)|

3. Take the top seeds from each class and call them POSITIVE then use disjointness of classes to find NEGATIVE seeds.

4. Train a NaiveBayes classifier using thresholded U’s as features.
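Step 4 might look roughly like the following: a small Bernoulli Naive Bayes over boolean (thresholded) discriminator features. The Laplace smoothing, the function names, and the toy training set are assumptions of this sketch, not details taken from the KnowItAll paper.

```python
import math

def train_naive_bayes(examples):
    """examples: list of (features, label); features is a tuple of bools,
    label is True for POSITIVE seeds and False for NEGATIVE seeds."""
    n_feats = len(examples[0][0])
    fired = {True: [0] * n_feats, False: [0] * n_feats}
    totals = {True: 0, False: 0}
    for feats, label in examples:
        totals[label] += 1
        for i, f in enumerate(feats):
            fired[label][i] += int(f)
    n = len(examples)
    model = {"prior": {}, "p_fire": {}}
    for label in (True, False):
        # Add-one (Laplace) smoothing keeps every probability away from 0 and 1.
        model["prior"][label] = (totals[label] + 1) / (n + 2)
        model["p_fire"][label] = [(c + 1) / (totals[label] + 2) for c in fired[label]]
    return model

def prob_positive(model, feats):
    """Posterior probability that an instance with these features is in the class."""
    log_score = {}
    for label in (True, False):
        s = math.log(model["prior"][label])
        for p, f in zip(model["p_fire"][label], feats):
            s += math.log(p if f else 1.0 - p)
        log_score[label] = s
    top = max(log_score.values())
    weight = {k: math.exp(v - top) for k, v in log_score.items()}
    return weight[True] / (weight[True] + weight[False])

# Toy seeds: each tuple is (f_U1 > theta1, f_U2 > theta2).
seeds = [((True, True), True), ((True, True), True), ((True, False), True),
         ((False, False), False), ((False, False), False), ((False, True), False)]
model = train_naive_bayes(seeds)
```

Here `prob_positive(model, (True, True))` comes out well above 0.5 and `prob_positive(model, (False, False))` well below it.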

Bootstrapping - 4

Estimate using the classifier based on the previously-trained discriminators

Some ad hoc stopping conditions… (“signal to noise” ratio)

Architecture - 2

Extensions to KnowItAll

• Problem: Unsupervised learning finds clusters. What if the text doesn’t support the clustering we want?
  – E.g., target is “scientist”, but natural clusters are “biologist”, “physicist”, “chemist”

• Solution: subclass extraction
  – Modify template/rule system to extract subclasses of target class (e.g., scientist → chemist, biologist, …)
  – Check extracted subclasses with WordNet and/or PMI-like method (as for instances)
  – Extract from each subclass recursively

Extensions to KnowItAll

• Problem: Set of rules is limited:
  – Derived from fixed set of “templates” (general patterns ~ from H92)

• Solution 1: Pattern learning: augment the initial set of rules derivable from templates
  1. Search for instances I on the web
  2. Generate patterns: some substring of I in context: “b1 … b4 I a1 … a4”
  3. Assume classes are disjoint and estimate recall/precision of each pattern P
  4. Exclude patterns that cover only one seed (very low recall)
  5. Take the top 200 remaining patterns and
     • Evaluate them as extractors “using PMI” (?)
     • Evaluate them as discriminators (in usual way?)

Examples: “headquartered in <city>”, “<city> hotels”, …
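A rough sketch of steps 2-4 of pattern learning. It simplifies in ways the slides do not specify: patterns are one-sided (left context or right context only), tokens are whitespace-split, and precision/recall are counted over distinct seeds, with seeds of other classes treated as negatives. All names and the toy sentences are illustrative.

```python
from collections import defaultdict

def candidate_patterns(sentence, instance, width=4):
    """Yield context patterns such as 'headquartered in <X>' for each
    occurrence of `instance`, using up to `width` tokens on either side."""
    tokens = sentence.split()
    inst = instance.split()
    n = len(inst)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != inst:
            continue
        before = tokens[max(0, i - width):i]
        after = tokens[i + n:i + n + width]
        for b in range(1, len(before) + 1):
            yield " ".join(before[-b:] + ["<X>"])
        for a in range(1, len(after) + 1):
            yield " ".join(["<X>"] + after[:a])

def score_patterns(sentences, positives, negatives, width=4):
    """Return (pattern, precision, recall) tuples, best first."""
    pos_seeds, neg_seeds = defaultdict(set), defaultdict(set)
    for sentence in sentences:
        for seed in positives:
            for p in candidate_patterns(sentence, seed, width):
                pos_seeds[p].add(seed)
        for seed in negatives:
            for p in candidate_patterns(sentence, seed, width):
                neg_seeds[p].add(seed)
    scored = []
    for p, seeds in pos_seeds.items():
        if len(seeds) < 2:          # step 4: drop patterns covering only one seed
            continue
        negatives_hit = len(neg_seeds.get(p, ()))
        precision = len(seeds) / (len(seeds) + negatives_hit)
        recall = len(seeds) / len(positives)
        scored.append((p, precision, recall))
    scored.sort(key=lambda t: (-t[1], -t[2], t[0]))
    return scored

sentences = ["Acme is headquartered in Boston today",
             "Initech is headquartered in Seattle now",
             "Leonardo starred in Titanic"]
ranked = score_patterns(sentences, ["Boston", "Seattle"], ["Titanic"])
```

On this toy input, “headquartered in <X>” ranks first, while the more general “in <X>” survives with lower precision because it also matches the film seed.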

Extensions to KnowItAll

• Solution 2: List extraction: augment the initial set of rules with rules that are local to a specific web page
  1. Search for pages containing small sets of instances (e.g., “London Paris Rome Pittsburgh”)
  2. For each page P:
     • Find subtrees T of the DOM tree that contain >k seeds
     • Find longest common prefix/suffix of the seeds in T
       – [Some heuristics added to generalize this further]
     • Find all other strings inside T with the same prefix/suffix

• Heuristically select the “best” wrapper for a page
  – Wrapper = P, T, prefix, suffix

w1 Italy, Japan, France, Israel, Spain, Brazil, Dog, Cat, Alligator

w3 Italy, Japan, France, Israel, Spain, Brazil

w4 Italy, Japan
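The prefix/suffix step of list extraction can be sketched as below. As a simplification, the subtree T is treated as a flat markup string rather than a DOM tree, only the first occurrence of each seed is used, and the `window` size is an arbitrary choice; none of these details come from the slides.

```python
import os
import re

def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def learn_wrapper(subtree_text, seeds, window=20):
    """Return (prefix, suffix) shared by the contexts of the seeds, or None."""
    lefts, rights = [], []
    for seed in seeds:
        i = subtree_text.find(seed)
        if i < 0:
            continue
        lefts.append(subtree_text[max(0, i - window):i])
        rights.append(subtree_text[i + len(seed):i + len(seed) + window])
    if len(lefts) < 2:
        return None
    prefix = common_suffix(lefts)           # text immediately before every seed
    suffix = os.path.commonprefix(rights)   # text immediately after every seed
    if not prefix or not suffix:
        return None
    return prefix, suffix

def apply_wrapper(subtree_text, prefix, suffix):
    """Find every string in the subtree delimited by the same prefix/suffix."""
    # Lookarounds let adjacent items share delimiter characters.
    pattern = "(?<=" + re.escape(prefix) + ")(.+?)(?=" + re.escape(suffix) + ")"
    return re.findall(pattern, subtree_text)

subtree = "<ul><li>Italy</li><li>Japan</li><li>France</li><li>Spain</li></ul>"
wrapper = learn_wrapper(subtree, ["Italy", "Spain"])
extracted = apply_wrapper(subtree, *wrapper)
```

Applied to a subtree that also lists animals, the same procedure would return them too, which is the situation the w1 example illustrates and the reason a “best” wrapper is selected heuristically.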

Results - City

Results - Film

Results - Scientist

Observations

• Corpus is accessed indirectly thru Google API
  – Only use top k discriminators
  – Run extractors via query keywords & extract
  – Limited by network access time

• Lots of moving parts to engineer
  – Rule templates
  – Signal-to-noise
  – LE wrapper evaluation details
  – Parameters: number of discriminators, number of seeds to keep, number of names per concept, …

KnowItNow: Son of KnowItAll

• Goal: faster results, not better results

• Difference 1:
  – Store documents locally
  – Build local index (Bindings Engine) optimized for finding instances of KnowItAll rules and patterns
    • Based on inverted index: term → (doc, position, contextInfo)

• Difference 2:
  – New model (URNS model) to merge information from multiple extraction rules
  – Intuition: instances generated from each extractor are assumed to be a mixture of two distributions
    1. Random noise from large instance pool
    2. Stuff with known structure (e.g., uniform, Zipf’s law, …)
  – Using EM you can estimate mixture probabilities and parameters of non-noisy data: Prob(x ∈ noise | x extracted)

[Figure: extracted colors. 137 colors = 41% of mass; 15,346 colors = 59% of mass]

• Prob(noise) = 0.59; non-noisy data: uniform over 137 instances
• Prob(noise) = 0.59; non-noisy data: Zipf’s law over >N instances (41% of mass fits power law, 59% of mass doesn’t)
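To make the EM intuition concrete, here is a deliberately simplified stand-in, not the URNS model itself: extraction counts are modelled as a mixture of two Poisson distributions, a low-rate "noise" component and a high-rate "signal" component, and EM estimates the mixing weight and both rates. The Poisson choice, the initialisation, and the toy counts are all assumptions of this sketch.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def prob_noise(count, params):
    """Posterior Prob(x in noise | x extracted `count` times)."""
    pi_noise, lam_noise, lam_signal = params
    n = pi_noise * poisson_pmf(count, lam_noise)
    s = (1.0 - pi_noise) * poisson_pmf(count, lam_signal)
    return n / (n + s) if n + s > 0 else 0.5

def fit_noise_mixture(counts, iters=100):
    """EM for a two-component Poisson mixture over extraction counts."""
    params = (0.5, max(float(min(counts)), 0.5), float(max(counts)))
    for _ in range(iters):
        # E-step: responsibility of the noise component for each count.
        resp = [prob_noise(c, params) for c in counts]
        total = sum(resp)
        # M-step: re-estimate the mixing weight and both rates.
        pi_noise = total / len(counts)
        lam_noise = sum(r * c for r, c in zip(resp, counts)) / max(total, 1e-12)
        lam_signal = (sum((1.0 - r) * c for r, c in zip(resp, counts))
                      / max(len(counts) - total, 1e-12))
        params = (pi_noise, lam_noise, lam_signal)
    return params

# Toy data: eight strings extracted once or twice, four extracted many times.
counts = [1, 1, 1, 2, 1, 1, 2, 1, 20, 25, 18, 22]
params = fit_noise_mixture(counts)
```

On this toy data the fit assigns roughly two thirds of the items to the noise component, so a string seen once gets a high `prob_noise` and a string seen twenty times a low one.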
