Natural Language Processing for Web Applications

Haifeng Wang, Shiqi Zhao

Outline

• Part I–Background

• Part II–New Applications & Trends

• Part III–New Data & Resources

• Part IV–New Difficulties & Challenges

• Part V–New Methodologies & Solutions

Status of the Internet

Search Engine, Union, Provider

Customer

Internet Users

- Any enterprise (hundreds of thousands: finance, travel, shipping)
- Advertisement
- Webmaster (vertical website)
- Press
- Website (1.83M in China in 2011)
- Hidden web

Infrastructure
- Domain name
- Telecommunications
- Mobile
- Broadband
- Hardware
- Internet service provider

- 485M+ users in China; about 1 billion in the future

Trends of the Internet

Internet trends (diagram labels): media, content source, transition, time constraint

• Professional editors → user generated; static → dynamically generated

• Text → image, video, audio, etc.

• Simply surfing → online gaming, video watching, social interaction, all kinds of applications

• Users want real-time information when searching

• The Internet and users have changed a lot in the past decade

Traditional Search Engine

Box

Text-matching-based requirement analysis

All results returned in a similar format

Traditional web search

User

Entrance

Query Information

Presentation

Webpages

Crawler Webpages

Internet

New Generation of Search Engines

User

Entrance Box

Requirement Analysis

Results Integration

Execution

Info Apps

Presentation

Query

• Apps execution
• Rich media analysis

Real-time data

Individual and group developers

Internet: SNS / UGC / hidden web

Open Apps Platform

Open Data Platform

Traditional Web Search

Rich Media Search

Apps; accurate structured results; webpages; rich media

Queries to Baidu Search Engine

听起来欢乐的歌曲 (Joyous song)

现在几点了 (What time is it)

电脑中毒了怎么办 (How to deal with a computer virus)

哪能买到漂亮衣服 (Where could I buy some beautiful clothes)

北京哪能找到女朋友 (Where could I get a girlfriend in Beijing)

令人心情愉快的图片 (Pleasant pictures)

Challenge to NLP

NLP faces four challenges:

A. Requirement Analysis
• Complex queries
• Diverse requirements

B. Knowledge Mining
• Hidden web, hidden knowledge
• Structured, semi-structured, unstructured
• Various levels

C. Human-Computer Interaction
• Suggestion
• Extension
• Interaction

D. Result Presentation
• Direct answer
• Clustering
• Summarization
• Relation graph
• Intelligent push
• Rich media

NLP for Web Applications

Methods: rule-based methods; statistical / ML methods

Resources: dictionary; corpus; web data; logs

Modules

Term
- Granularity: segmentation, unknown word, component
- Property: proper noun, requirement, POS, phonetic
- Relation: collocation, similarity, language model, ontology

Phrase
- Structure: chunking, term importance, trunk parsing
- Transformation: synonym, semantic normalization, correction
- Classification: requirement classification, topic detection

Sentence
- Syntax
- Semantic

Document
- Single document: topic analysis, page value analysis
- Multi-document: classification & clustering, feature extraction

NLP Applications: machine translation, IME, web search, mobile search, vertical search, cQA, Wikipedia, advertising, recommendation & personalization






New Applications & Trends

• NLP for web applications
– Main applications
• Web search, online translation, recommendation system, e-business, social networks, …
– Evolution of traditional topics
• Word segmentation, machine translation, question answering, …
– New research topics

Main Applications (cont.)

• NLP for web search
– Basic applications
• Segmentation
• Query term importance rating
• Query rewriting
• Query intent analysis
– Advanced applications
• Question answering
• Summarization
• Word sense disambiguation
• Clustering
• Information extraction
• Ontology


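• Examples:
– 天龙八步 —> 天龙八部 (corrects a misspelled novel title)
– 怎样能有归一证 —> 怎样能有皈依证 (corrects a homophone error in "how to get a refuge certificate")
– 宝马X6价钱 —> 宝马X6报价 ("BMW X6 price" —> "BMW X6 quotation")
– 成都的哥罢工 —> 成都出租车罢工 ("Chengdu cabbie strike" —> "Chengdu taxi strike")
– 赞颂母爱的现代诗 —> 母爱的现代诗 ("modern poems praising maternal love" —> "modern poems on maternal love")
– 康柏笔记本vista系统一键恢复 —> 康柏vista一键恢复 ("Compaq laptop Vista system one-key recovery" —> "Compaq Vista one-key recovery")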


• NLP for online translation
– Webpage
– Query
– Communication
– Social network
– E-commerce
– Computer-aided translation
– Computer-aided learning
– Mobile


• Recommendation system

– Recommendation within a single product

– Recommendation across products

• Recommendation clue

–User profile

–User log

–Content


• NLP for e-business
– Intent / requirement recognition
• E-business website
– Information extraction
• E.g., extract product information
– Information recommendation
• Maintain user profiles and recommend products to users
• Advertising
– Sentiment analysis
• Analyze comments on products


• NLP for social networks
– User profile
– Recommendation
• Not only information, but also users
– Data mining
• Identify emergency events
• Sentiment analysis
• Polls


Main applications: web search, online translation, recommendation, e-business, social networks

Basic applications (web search): segmentation, query term importance rating, query rewriting, query intent analysis

Advanced applications (web search): question answering, summarization, word sense disambiguation, clustering, information extraction, ontology

Basic Applications for Web Search

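• Word segmentation
– Basic unit for languages like Chinese
– Segmentation for both queries and web documents

功夫熊猫在线观看 ("Kung Fu Panda, watch online")

功夫在线观看熊猫 (the same words in a different order: "Kung Fu, watch online, panda")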

Basic Applications for Web Search (cont.)

• Query segmentation
– Segment a query into chunks
– Scenario: document ranking
• What terms must appear contiguously in the retrieved documents?

功夫在线观看熊猫




• Query term importance rating
– Estimate an importance rating for each query term
– Scenario: document ranking
• Which query terms MUST be matched in the retrieved documents?

term | rating
功夫熊猫 (Kung Fu Panda) | 3
在线 (online) | 2
观看 (watch) | 1


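• Query rewriting
– User queries are never perfect
• Contain errors
– E.g., 牛肉顿萝卜 (a typo for 牛肉炖萝卜, "beef stewed with radish")
• Too short
– E.g., 牛肉萝卜 ("beef radish")
• Too verbose
– E.g., 哪位朋友可以告诉我牛肉炖萝卜该怎么做啊？？？ ("Can any friend tell me how to make beef stewed with radish???")
• Use less common expressions
– E.g., 牛肉烧萝卜 ("beef braised with radish")
• ……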


• Query rewriting (cont.)
– Involves the following NLP techniques:
• Error detection & correction
– Collecting query correction pairs
• Query expansion
– Learning expansion terms from query logs or web docs
• Query reduction
– Query term importance rating
• Query paraphrasing & entailment
– Synonymous resource extraction
• ……


• Query intent analysis

Advanced Applications for Web Search (cont.)

• Question answering

question

answer


• Summarization

Summaries rather than snippets


• Word sense disambiguation


• Search result clustering


• Information extraction


• Ontology







Evolution of Traditional Topics

• Example-1: word segmentation
– Demands from web applications (especially search engines)
• High efficiency
– To process tens of billions of web documents
• Frequent updates
– To recognize new terms / concepts / named entities…
• Flexibility
– Different applications require different segmentation outputs

Evolution of Traditional Topics (cont.)

• Example-1: word segmentation (cont.)
– Solutions:
• Lightweight model
– Efficient
• Learn new terms and NEs
– Mine web corpora and query logs
– Add them easily to the segmentation dictionary
• Various granularities
– Customize for different applications


• Example-2: machine translation
– Machine translation methods
• Rule-based systems
– E.g., Systran
• Statistical machine translation
– Google, Baidu …
– Difficulties
• Data sparseness
• Models are too large
• Not fast enough


• Example-2: machine translation (cont.)
– Online translation

• Web service instead of software

• Collect tens of millions of multilingual parallel / comparable sentence pairs from the web for model training

• Extract translations for idioms, named entities, and new terms from the web

• Both translation model and language model are extremely large


• Example-2: machine translation (cont.)
– Challenges to online translation

• Quality control for the automatically collected data

• Model selection

• Model compression

• Distributed storage and computing

• Model update

• Fast decoding

• Domain adaptation

• ……


• Example-2: machine translation (cont.)

Reordering problem


• Example-3: question answering
– Stage-1:
• Natural-language interface to expert systems
• Within specific domains
– Stage-2: web-based QA
• Open domain
• Built upon large web corpora
• Mainly based on statistical methods
– Correct answers appear more often than incorrect ones


• Example-3: question answering (cont.)
– Stage-2: web-based QA (cont.)

• Main modules

– Question classifier

– Search engine

– Answer extractor

• Deep analysis

– Morphological analysis, syntactic analysis, NER, WSD, coreference resolution, inference, reasoning, Ontology,…

• Usually work well on factoid questions


• Example-3: question answering (cont.)
– Stage-3: community-based QA
• Users ask questions and wait for other users to answer them
• Answers have higher coverage than those from automatic QA systems

• Previously asked questions and answers can be searched by other users

• Can work well on description and subjective questions







New Research Topics

• Sentiment analysis

• Wikipedia-based research

• Microblog-based research

• Crowdsourcing

• ……







New Data & Resources

• Treasures from WWW

Large-scale Web corpora

Query logs with user behaviors

User generated content (UGC)

Large-scale Web Corpora

Large-scale Web Corpora (cont.)

• Large-scale web corpora for NLP
– Statistics of distributions
• N-grams / language model
• Collocations
• Co-occurrence
• ……
– Data mining
• Information extraction / answer extraction for QA
• Paraphrase / entailment rule acquisition
• Bilingual data collection
• ……


• Statistics of distributions
– Example: N-grams / language model
• E.g., Google Web 1T 5-gram
• Source data:
– 1 trillion word tokens from web pages
• Data size:

Item | Count
Number of tokens | 1,024,908,267,229
Number of sentences | 95,119,665,584
Number of unigrams | 13,588,391
Number of bigrams | 314,843,401
Number of trigrams | 977,069,902
Number of fourgrams | 1,313,818,354
Number of fivegrams | 1,176,470,663


• Statistics of distributions (cont.)
– Applications of web n-grams
• Example: lexical substitution
– Substitute words in a sentence with their synonyms that fit in the given context
– Two stages:
» Extract candidate substitutes from thesauri
» Rank candidates according to their fitness in the given context


• Statistics of distributions (cont.)
– Lexical substitution
• Stage-2: candidate ranking [Giuliano et al., 2007]

H_l w H_r (left context, target word w, right context)

H_l e H_r (the same contexts with w replaced by candidate substitute e)

Count the frequency of the generated fragment using Google 5-grams

Giuliano et al. FBK-irst: Lexical Substitution Task Exploiting Domain and Syntagmatic Coherence.
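The ranking stage can be sketched as a lookup of the fragment H_l e H_r in an n-gram count table. This is a minimal illustration only: the table `TOY_NGRAM_COUNTS` and its values are hypothetical stand-ins for the Google 5-gram counts, and the scoring is a plain frequency sort rather than the exact scoring used by Giuliano et al.

```python
from typing import Dict, List, Tuple

# Hypothetical toy counts standing in for a web-scale n-gram table.
TOY_NGRAM_COUNTS: Dict[Tuple[str, ...], int] = {
    ("a", "bright", "student"): 120,
    ("a", "smart", "student"): 300,
    ("a", "shiny", "student"): 2,
}


def rank_substitutes(
    left: List[str],
    right: List[str],
    candidates: List[str],
    counts: Dict[Tuple[str, ...], int],
) -> List[Tuple[str, int]]:
    """Rank each candidate e by the count of the fragment H_l e H_r."""
    scored = []
    for e in candidates:
        fragment = tuple(left + [e] + right)
        scored.append((e, counts.get(fragment, 0)))
    # A more frequent fragment is taken as a better fit for the context.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Target word "bright" in "a bright student"; candidates come from a thesaurus.
ranking = rank_substitutes(["a"], ["student"], ["shiny", "smart"], TOY_NGRAM_COUNTS)
```

Here `ranking` places "smart" ahead of "shiny" because its fragment is far more frequent in the toy table.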


• Data mining
– Example-1: Learning surface patterns for QA [Ravichandran and Hovy, ACL-2002]

Question taxonomy: BIRTHDAY; given seed (Mozart, 1756)

Score | Paraphrase pattern
1.00 | <NAME> ( <ANSWER> - )
0.85 | <NAME> was born on <ANSWER>,
0.60 | <NAME> was born in <ANSWER>
0.59 | <NAME> was born <ANSWER>
0.53 | <ANSWER> <NAME> was born
0.50 | – <NAME> ( <ANSWER>
0.36 | <NAME> ( <ANSWER> -

Ravichandran and Hovy. Learning Surface Text Patterns for a Question Answering System.


• Example-1: Learning surface patterns for QA (cont.)
– Main steps for learning patterns:
• Seed selection
– E.g., Mozart 1756 for BIRTHDAY
• Submit the seed to a search engine
• Download the top-1000 web documents
• Retain sentences containing both the Q and A terms
• Pass each sentence through a suffix tree constructor
• Retain phrases in the suffix tree that contain both the Q and A terms
• Replace the Q term with <NAME> and the A term with <ANSWER>
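The pattern-learning steps above can be sketched on a handful of sentences. This is a simplified illustration, not the method of Ravichandran and Hovy: it replaces the suffix tree with a brute-force enumeration of token spans (which finds the same shared phrases on small inputs but does not scale), and the function name `candidate_patterns` and the sample sentences are hypothetical.

```python
from collections import Counter
from typing import List


def candidate_patterns(sentences: List[str], q_term: str, a_term: str) -> Counter:
    """Count phrases that contain both the Q and A terms, with placeholders."""
    counts: Counter = Counter()
    for sentence in sentences:
        # Retain only sentences containing both terms.
        if q_term not in sentence or a_term not in sentence:
            continue
        tokens = (
            sentence.replace(q_term, "<NAME>").replace(a_term, "<ANSWER>").split()
        )
        seen = set()
        # Brute-force stand-in for the suffix tree: enumerate every token span.
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens) + 1):
                span = tokens[i:j]
                if "<NAME>" in span and "<ANSWER>" in span:
                    seen.add(" ".join(span))
        counts.update(seen)  # each pattern counted once per sentence
    return counts


sentences = [
    "Mozart was born in 1756 in Salzburg",
    "The composer Mozart was born in 1756",
    "Mozart wrote many symphonies",
]
patterns = candidate_patterns(sentences, "Mozart", "1756")
```

The phrase "<NAME> was born in <ANSWER>" is shared by the first two sentences, so it gets the highest count; the third sentence is dropped because it lacks the answer term.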



• Example-1: Learning surface patterns for QA (cont.)
– Calculate the precision of each pattern:
• Query the search engine with the Q term only
– E.g., Mozart
• Download the top-1000 web documents
• Retain sentences that contain the Q term
• For each learnt pattern, compute the fraction of its matches in which the correct answer fills the <ANSWER> slot
• Return only the patterns matching a sufficient number of seeds (>5)
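A pattern's precision can be sketched by turning it into a regular expression and checking what fills the answer slot. The sketch below makes assumptions not stated in the slides: the <ANSWER> slot is matched with a single-word wildcard `(\w+)`, and the helper name `pattern_precision` and the sentences are illustrative only.

```python
import re
from typing import List


def pattern_precision(
    pattern: str, q_term: str, answer: str, sentences: List[str]
) -> float:
    """Fraction of pattern matches whose <ANSWER> slot holds the correct answer."""
    regex = re.escape(pattern)
    # Instantiate <NAME> with the Q term and turn <ANSWER> into a capture group.
    regex = regex.replace(re.escape("<NAME>"), re.escape(q_term))
    regex = regex.replace(re.escape("<ANSWER>"), r"(\w+)")
    total = 0
    correct = 0
    for sentence in sentences:
        for match in re.finditer(regex, sentence):
            total += 1
            if match.group(1) == answer:
                correct += 1
    return correct / total if total else 0.0


retrieved = [
    "Mozart was born in 1756",
    "Mozart was born in Salzburg",
    "It is said that Mozart was born in 1756 .",
]
precision = pattern_precision("<NAME> was born in <ANSWER>", "Mozart", "1756", retrieved)
```

On these three sentences the pattern matches three times and captures the correct year twice, which illustrates why "was born in" scores lower than "was born on" in the table: it also matches birthplaces.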



• Example-2: Mining parallel data
– Mining parallel data from bilingual websites [Shi et al., ACL-2006]
• Basic idea:
– One can identify parallel web pages from bilingual websites, and further extract bilingual sentences from them.
– If two hyperlinks in two parallel web pages are aligned, then their corresponding web pages are also likely to be parallel.

Shi et al. A DOM Tree Alignment Model for Mining Parallel Data from the Web.


• Example of bilingual websites

Parallel web pages


• Example-2: Mining parallel data (cont.)
– Mining parallel data from bilingual websites

• Main steps:

– Identify bilingual websites using trigger words

» E.g., English, English Version, 中文, 中文版

– Identify bilingual web pages with a classifier

» Features: length ratio; HTML tag similarity; sentence alignment score

– DOM tree alignment for bilingual parallel web pages

– Sentence alignment within aligned text chunks

– Recursively mine parallel hyperlinks



• Example-2: Mining parallel data (cont.)
– Mining parallel data from bilingual web pages [Jiang et al., ACL-2009]
• Disadvantage of the above method
– The number of bilingual websites is small, thus the volume of extracted bilingual data is limited.
• Basic idea:
– In many bilingual web pages, bilingual data appear collectively and follow similar surface patterns.

Jiang et al. Mining Bilingual Data from the Web with Adaptively Learnt Patterns.


• Examples of collective bilingual pages

Bilingual terms

Bilingual sentences



• Main steps:




• Preprocessing:

– Parse web documents into DOM trees

– Segment text into snippets according to languages

• Seed mining:

– Judge any adjacent E/C snippet pair

– Compute the likelihood of being a translation pair

» Combines a translation model and a transliteration model




• Pattern learning:

– Candidate pattern extraction

– Pattern selection with an SVM classifier

» Features: generality; average translation score; length; irregularity

• Pattern-based translation mining


7. Don’t worry. 别担心。

[N][P][S][E][P][S][C][P]


• Example-3: Improve parsing
– Web-scale features for parsing [Bansal and Klein, ACL-2011]
• Motivation:
– Web counts are powerful syntactic cues
– E.g., "raising from" is much more frequent on the web than "$ x billion from"
• Basic idea:
– Generate web count features to address the full range of syntactic attachments

Bansal and Klein. Web-Scale Features for Full-Scale Parsing.


• Example-3: Improve parsing (cont.)
– Web-scale features for parsing
• Affinity features
– E.g., lexical co-occurrence counts from large corpora
• Paraphrase features
– Web corpus:
• Google n-gram corpus



• Example-3: Improve parsing (cont.)
– Affinity features
• Given a head-argument pair (h, a), define the adjacent count feature:
– ADJ
» Count of the query q = ha or q = ah in the web corpus
– ADJ ∧ POS(h) ∧ POS(a)
» Specific to each pair of POS tags
– ADJ ∧ POS(h) ∧ POS(a) ∧ b
» b is the binned query count
– Other complex features
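The adjacency features can be sketched as follows. This is an illustration under stated assumptions: the counts in `TOY_COUNTS` are invented, the order-of-magnitude binning in `bin_count` is a simple stand-in for whatever binning the parser actually uses, and the feature-name strings are only one possible encoding.

```python
from typing import Dict, List, Tuple

# Hypothetical bigram counts standing in for the web n-gram corpus.
TOY_COUNTS: Dict[str, int] = {"raising from": 5000, "from raising": 40}


def bin_count(count: int) -> int:
    """Coarse order-of-magnitude bin (assumed scheme): 0-9 -> 0, 10-99 -> 1, ..."""
    b = 0
    while count >= 10:
        count //= 10
        b += 1
    return b


def adjacency_features(
    h: str, a: str, pos_h: str, pos_a: str, counts: Dict[str, int]
) -> Tuple[int, List[str]]:
    """Return the ADJ count and the POS-conjoined feature names for (h, a)."""
    # ADJ: count of the query "h a" or "a h".
    adj = counts.get(f"{h} {a}", 0) + counts.get(f"{a} {h}", 0)
    names = [
        f"ADJ∧{pos_h}∧{pos_a}",                   # specific to the POS pair
        f"ADJ∧{pos_h}∧{pos_a}∧{bin_count(adj)}",  # conjoined with the binned count
    ]
    return adj, names


features = adjacency_features("raising", "from", "VBG", "IN", TOY_COUNTS)
```

Binning keeps the feature space small: two pairs with counts 5,040 and 7,300 fire the same binned feature, so the parser learns one weight per magnitude rather than per raw count.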



• Example-3: Improve parsing (cont.)
– Paraphrase features
• Form:
– PARA ∧ POS(h) ∧ POS(a) ∧ c ∧ p ∧ dir (c: context word; p: its position; dir: attachment direction)
• Example:
– PARA ∧ VBG ∧ IN ∧ it ∧ MIDDLE ∧ →
• Explanation:
– If frequent occurrences of "raising it from" indicated a correct attachment between "raising" and "from", then frequent occurrences of "lowering it with" will indicate the correctness of an attachment between "lowering" and "with".







Query Logs with User Behaviors

• Search engine query logs:
– Log data recording users’ queries along with their corresponding behaviors on the search engine
– What can be mined from query logs?
• Query: keywords that users submit to the search engine
• Clicks: URLs that users click after submitting queries
• Session: a sequence of queries and clicks from the same user within a short time interval
• Other information: search time, user ID, user browsing…
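The notion of a session can be sketched as grouping one user's queries until a time gap is exceeded. The 30-minute threshold and the function name `split_sessions` are assumptions for illustration; the slides only say "a short time interval". The timestamp format follows the log example in this section.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

LogRow = Tuple[str, str, str]  # (timestamp, user id, query)


def split_sessions(
    rows: List[LogRow], gap: timedelta = timedelta(minutes=30)
) -> List[List[str]]:
    """Group each user's queries into sessions separated by more than `gap`."""
    sessions: List[List[str]] = []
    last_seen: Dict[str, Tuple[datetime, int]] = {}  # user -> (last time, session index)
    # Fixed-width timestamps sort chronologically as plain strings.
    for ts, user, query in sorted(rows):
        t = datetime.strptime(ts, "%Y-%m-%d-%H:%M:%S")
        if user in last_seen and t - last_seen[user][0] <= gap:
            idx = last_seen[user][1]
        else:
            sessions.append([])
            idx = len(sessions) - 1
        sessions[idx].append(query)
        last_seen[user] = (t, idx)
    return sessions


rows = [
    ("2011-08-21-09:02:32", "12345", "非诚勿扰"),
    ("2011-08-21-09:02:40", "12345", "非诚勿扰电影"),
    ("2011-08-21-09:05:18", "12346", "姚晨微波"),
    ("2011-08-21-11:00:00", "12345", "非诚勿扰2"),
]
sessions = split_sessions(rows)
```

User 12345's first two queries fall in one session; the query issued almost two hours later starts a new one.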

Query Logs with User Behaviors (cont.)

• Search engine query logs (cont.)
– An example:

time | id | query | click
2011-08-21-09:02:32 | 12345 | 非诚勿扰 | -
2011-08-21-09:02:40 | 12345 | 非诚勿扰电影 | url1, title: 非诚勿扰电影百度视频
2011-08-21-09:04:51 | 12345 | - | url2, title: 非诚勿扰-电影-高清在线观看
2011-08-21-09:05:12 | 12345 | 非诚勿扰2 | url3, title: 非诚勿扰2-高清在线观看
2011-08-21-09:05:18 | 12346 | 姚晨微波 | -
2011-08-21-09:05:39 | 12346 | 姚晨微博 | url4, title: 姚晨的微博新浪微博-随时…
2011-08-21-09:05:43 | 12347 | 中国期刊网 | url5, title: 中国知网首页
2011-08-21-09:05:48 | 12348 | 非诚勿扰2 在线 | url3, title: 非诚勿扰2-高清在线观看

Query expansion

NE extraction

Spelling error correction

Query clustering


• Query logs for IR and NLP
– Query clustering
– Query intent recognition
– Query rewriting
• Expansion, reduction, synonymous reformulation, error correction
– Query suggestion
– Learning-to-rank
– Named entity recognition
– Ontology construction
– ……


• Example-1: Query clustering
– Wen et al., 2002
– Measure query similarity from two aspects:
• Content similarity
– Similarity of content words in two queries
• Click-through similarity
– Similarity of user-clicked URLs for two queries
• Combined similarity

\[ \text{similarity} = \alpha \cdot \text{similarity}_{\text{content}} + \beta \cdot \text{similarity}_{\text{cross\_ref}} \]

Wen et al. Query Clustering Using User Logs.


• Example-1: Query clustering (cont.)
– Content similarity
• Based on keywords or phrases
– Word overlap rate
– Cosine similarity
• Based on edit distance


• Example-1: Query clustering (cont.)
– Click-through similarity
• Through single document
– Overlap rate of the clicked URLs
• Through document hierarchy
– Lowest common parent node
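The combined similarity can be sketched with overlap rates for both components. The normalization by the larger set, the weights α = β = 0.5, and the helper names are assumptions for illustration; the slides do not preserve the exact formulas.

```python
from typing import Set


def overlap_rate(a: Set[str], b: Set[str]) -> float:
    """Shared items divided by the size of the larger set (assumed normalization)."""
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def combined_similarity(
    words_p: Set[str],
    words_q: Set[str],
    clicks_p: Set[str],
    clicks_q: Set[str],
    alpha: float = 0.5,
    beta: float = 0.5,
) -> float:
    """similarity = alpha * content similarity + beta * click-through similarity."""
    content = overlap_rate(words_p, words_q)
    cross_ref = overlap_rate(clicks_p, clicks_q)
    return alpha * content + beta * cross_ref


score = combined_similarity(
    {"kung", "fu", "panda", "online"},
    {"kung", "fu", "panda", "movie"},
    {"url1", "url2"},
    {"url2"},
)
```

With three of four words shared (0.75) and one of two clicked URLs shared (0.5), the equally weighted score is 0.625. Click-through evidence matters most for queries that share no words but lead to the same pages.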


• Example-2: Building NE-query intent taxonomy
– Yin and Shah, WWW-2010
– Mining intent phrases for NE-queries and organizing them in a taxonomy

Yin and Shah. Building Taxonomy of Web Search Intents for Name Entity Queries.


• Example-2: Building NE-query intent taxonomy (cont.)–Three main steps:

• Identify search intents for a class of entities

– E.g., download, mp3, mv for music

• Infer relationship between two intent phrases

– E.g., synonyms: pictures / pics;

hypernyms: wallpapers / pictures

• Organize intent phrases into a tree



• Example-2: Building NE-query intent taxonomy (cont.)
– Intent phrase identification
• Extract phrases that co-appear with NEs of a given class from query logs
– Infer relations between intent phrases
• Basic idea:
– Given an NE e and its intent phrases w1 and w2, if the clicked URLs for “e+w1” also satisfy “e+w2”, then w1 ⊆ w2
– Organize intent phrases
• Three approaches:
– Directed maximum spanning tree; hierarchical agglomerative clustering; Pachinko allocation models
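The subsumption test w1 ⊆ w2 can be sketched with clicked-URL sets. This is a rough approximation under stated assumptions: "the clicked URLs for e+w1 also satisfy e+w2" is approximated as "most of them were also clicked for e+w2", and the 0.8 threshold and the function name `subsumes` are hypothetical.

```python
from typing import Set


def subsumes(clicks_w1: Set[str], clicks_w2: Set[str], threshold: float = 0.8) -> bool:
    """True if w1 is judged to be subsumed by w2 (w1 ⊆ w2)."""
    if not clicks_w1:
        return False
    # Share of the URLs clicked for "e + w1" that were also clicked for "e + w2".
    covered = len(clicks_w1 & clicks_w2) / len(clicks_w1)
    return covered >= threshold


# Hypothetical clicked URLs for "<singer> wallpapers" and "<singer> pictures".
wallpaper_clicks = {"u1", "u2"}
picture_clicks = {"u1", "u2", "u3", "u4"}

wallpapers_in_pictures = subsumes(wallpaper_clicks, picture_clicks)
pictures_in_wallpapers = subsumes(picture_clicks, wallpaper_clicks)
```

The asymmetry is what yields a hierarchy: "wallpapers" is subsumed by "pictures" but not the other way round, matching the hypernym example above. If the test holds in both directions, the two phrases can be treated as synonyms.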


• Example-3: Query substitution
– Jones et al., WWW-2006
– Example:

Query-level substitution

Phrase-level substitution

Jones et al. Generating Query Substitutions.


• Example-3: Query substitution (cont.)
– Collect reformulation query / phrase pairs

• Using query sessions

• Query pair:

– Successive queries issued by a single user

– E.g., britney spears mp3s -> britney spears lyrics

• Phrase pairs:

– Segment queries into phrases

– Retain queries in which only one segment has changed

– E.g., (britney spears) (mp3s) -> (britney spears) (lyrics)

• Measure relatedness of the query / phrase pairs

– Log likelihood ratio (LLR) score
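Extracting a phrase pair from two successive, already segmented queries can be sketched as below. It assumes both queries have the same number of segments in the same positions, which is a simplification; the function name `changed_phrase_pair` is illustrative, and the LLR scoring step is not shown.

```python
from typing import List, Optional, Tuple


def changed_phrase_pair(q1: List[str], q2: List[str]) -> Optional[Tuple[str, str]]:
    """Return (old, new) if exactly one segment differs between the two queries."""
    if len(q1) != len(q2):
        return None
    diffs = [(a, b) for a, b in zip(q1, q2) if a != b]
    # Retain the pair only when a single segment has changed.
    return diffs[0] if len(diffs) == 1 else None


pair = changed_phrase_pair(["britney spears", "mp3s"], ["britney spears", "lyrics"])
no_pair = changed_phrase_pair(["britney spears", "mp3s"], ["madonna", "lyrics"])
```

Counting such pairs over many sessions gives the co-occurrence statistics on which a relatedness score such as LLR is computed.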



• Example-3: Query substitution (cont.)
– Generate substitutions
• Query-level substitution for frequent queries; phrase-level substitution for infrequent ones
– Rank candidate substitutions
• Machine learning methods
– Linear regression
– Binary classification
• Features:
– Query length, #segments, %alphabetic characters, edit distance, #segments substituted, #tokens shared, size of overlapping prefix, LLR, frequency, mutual information, …



• Example-4: Optimizing page rank
– Joachims, SIGKDD-2002

Links 1, 3, 7 are clicked, which yields the preferences:

link3 <r* link2

link7 <r* link2

link7 <r* link4

link7 <r* link5

link7 <r* link6

Joachims. Optimizing Search Engines using Clickthrough Data.
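The preference pairs above follow a simple rule: a clicked link is preferred over every unclicked link ranked above it. A minimal sketch, with the function name `preference_pairs` chosen for illustration:

```python
from typing import List, Set, Tuple


def preference_pairs(ranking: List[str], clicked: Set[str]) -> List[Tuple[str, str]]:
    """Return (preferred, skipped) pairs implied by the clicks."""
    pairs = []
    for i, link in enumerate(ranking):
        if link not in clicked:
            continue
        # Every unclicked link shown above a clicked one was seen and skipped.
        for above in ranking[:i]:
            if above not in clicked:
                pairs.append((link, above))
    return pairs


ranking = [f"link{i}" for i in range(1, 8)]
pairs = preference_pairs(ranking, {"link1", "link3", "link7"})
```

The click on link1 yields no pair because nothing was skipped above it. Clicks therefore give only partial, relative feedback rather than a full target ranking.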


• Example-4: Optimizing page rank (cont.)
– Defined as a ranking problem:
• Given:

\[ (q_1, r_1^*),\ (q_2, r_2^*),\ \ldots,\ (q_n, r_n^*) \]

» n: size of training data; q_i: query; r_i^*: target ranking
• Maximize:

\[ \tau_S(f) = \frac{1}{n} \sum_{i=1}^{n} \tau\left(r_{f(q_i)},\ r_i^*\right) \]

• Kendall’s τ:

\[ \tau(r_a, r_b) = \frac{P - Q}{P + Q} \]

» P: number of concordant pairs; Q: number of discordant pairs
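Kendall's τ can be computed directly from the definition by counting concordant and discordant pairs. The sketch assumes two strict rankings over the same set of at least two items; the function name `kendall_tau` is illustrative.

```python
from itertools import combinations
from typing import List


def kendall_tau(rank_a: List[str], rank_b: List[str]) -> float:
    """tau = (P - Q) / (P + Q) for two strict rankings of the same items."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = 0  # P: pairs ordered the same way in both rankings
    discordant = 0  # Q: pairs ordered differently
    for x, y in combinations(rank_a, 2):
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


tau = kendall_tau(["d1", "d2", "d3", "d4", "d5"], ["d3", "d2", "d1", "d4", "d5"])
```

In this example three of the ten pairs are discordant (d1-d2, d1-d3, d2-d3), so P = 7, Q = 3 and τ = 0.4. Identical rankings give τ = 1 and reversed rankings give τ = -1.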


• Example-4: Optimizing page rank (cont.)
– Ranking SVM, using partial feedback


• Example-5: Ontology construction
– Sekine and Suzuki, WWW-2007
– Basic idea:
• Entities belonging to the same class should appear in similar contexts
– Data:
• Query logs (only queries are used)
– Main steps:
• Step-1: Extract typical contexts for a given class
• Step-2: Find new entities belonging to the class

Sekine and Suzuki. Acquiring Ontological Knowledge from Query Logs.


• Example-5: Ontology construction (cont.)


Examples of class “Award”

Typical context words for “Award”

New entities for the class “Award”


• Example-6: Paraphrase acquisition
– Zhao et al., 2010
– Corpus
• Query logs (queries and titles) of a search engine
– Assumption
• If a query q hits a title t, then q and t are likely to be paraphrases
• If queries q1 and q2 hit the same title t, then q1 and q2 are likely to be paraphrases
• If a query q hits titles t1 and t2, then t1 and t2 are likely to be paraphrases

Zhao et al. Paraphrasing with Search Engine Query Logs.
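The three assumptions translate directly into candidate generation from (query, clicked title) hits. The sketch below only produces candidates; the validation step with a classifier is not shown, and the function name `paraphrase_candidates` is illustrative.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def paraphrase_candidates(hits: List[Pair]) -> Tuple[Set[Pair], Set[Pair], Set[Pair]]:
    """Return candidate <q, t>, <q, q> and <t, t> pairs from (query, title) hits."""
    titles_of: Dict[str, Set[str]] = defaultdict(set)
    queries_of: Dict[str, Set[str]] = defaultdict(set)
    for q, t in hits:
        titles_of[q].add(t)
        queries_of[t].add(q)
    query_title = set(hits)
    # Queries hitting the same title are candidate query-query paraphrases.
    query_query = {
        pair for qs in queries_of.values() for pair in combinations(sorted(qs), 2)
    }
    # Titles hit by the same query are candidate title-title paraphrases.
    title_title = {
        pair for ts in titles_of.values() for pair in combinations(sorted(ts), 2)
    }
    return query_title, query_query, title_title


qt, qq, tt = paraphrase_candidates([("q1", "t1"), ("q1", "t2"), ("q2", "t1")])
```

With q1 hitting t1 and t2, and q2 hitting t1, the function yields three query-title pairs, one query-query pair and one title-title pair.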


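Example:

q1: 关于草原的诗词 ("poems about grasslands")
t1: 描写草原的诗句 ("verses describing grasslands")
t2: 有关草原的诗歌 ("poetry concerning grasslands")
q2: 赞美大草原的诗 ("poems praising the great grasslands")

Paraphrases:
– query-title: <q1, t1>, <q1, t2>, <q2, t1>
– query-query: <q1, q2>
– title-title: <t1, t2>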



• Step-1: extracting <q, t> paraphrases
– Extracting candidate <q, t> pairs from query logs
– Paraphrase validation based on binary classification
• Combining multiple features

• Step-2: extracting <q, q> paraphrases
– Extracting candidate <q, q> pairs from <q, t> paraphrases

• Step-3: extracting <t, t> paraphrases
– Extracting candidate <t, t> pairs from <q, t> paraphrases








User Generated Content (UGC)

UGC

cQA

Blogs / microblogs

Forums

Online encyclopedia

Forums

• Example: Mining QA pairs from forums
– Cong et al., SIGIR-2008
– Find question-answer pairs in forum threads
• Motivation:
– The initiating post usually contains questions, while reply posts may contain answers
• Two main stages
– Question detection
– Answer detection

Cong et al. Finding Question-Answer Pairs from Online Forums.

Forums (cont.)

• Example: Mining QA pairs from forums (cont.)
– Question detection
• Non-trivial problem
– Cannot simply rely on question marks or question words
• Classification-based method
– Feature: Labeled Sequential Patterns (LSPs) extracted from questions and non-questions
» E.g., <what, do, PRP, VB> → Q


Forums (cont.)

• Example: Mining QA pairs from forums (cont.)
– Answer detection
• Problems:
– Multiple questions and answers are interleaved
– 1-question vs. n-answer; n-question vs. 1-answer
• Graph-based method, which considers:
– Rank of candidate answers for a given query
– Relationship of candidate answers
– Forum-specific feature: distance of a candidate answer from the question




Community-based QA (cQA)

• Research on cQA
– Search and recommendation
• cQA retrieval model
• Question similarity computation
• Multi-sentence question segmentation
• ……
– Quality estimation
• Question quality
• Answer quality
• User (questioner / answerer) quality

Community-based QA (cQA) (cont.)

• Example-1: cQA retrieval model
– Xue et al., SIGIR-2008
– Basic idea:
• In cQA retrieval, both question parts and answer parts should be modeled
– Propose a mixed model:
• A translation-based language model for the question part
• A query likelihood approach for the answer part

Xue et al. Retrieval Models for Question and Answer Archives.


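• Example-1: cQA retrieval model (cont.)
– Translation-based language model for the question part

For a user query \mathbf{q} and an archived question-answer pair (q, a):

\[ P(\mathbf{q} \mid (q,a)) = \prod_{w \in \mathbf{q}} P(w \mid (q,a)) \]

Smoothing:

\[ P(w \mid (q,a)) = \frac{|(q,a)|}{|(q,a)| + \lambda} P_{mx}(w \mid (q,a)) + \frac{\lambda}{|(q,a)| + \lambda} P_{ml}(w \mid C) \]

Mixture of a language model and a translation model:

\[ P_{mx}(w \mid (q,a)) = (1 - \beta) P_{ml}(w \mid q) + \beta \sum_{t \in q} P(w \mid t) P_{ml}(t \mid q) \]

» P_{ml}(w | q): language model
» P(w | t): translation model trained with Q-A pairs in cQA archives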


• Example-1: cQA retrieval model (cont.)
– Incorporating the answer part

The question-only mixture

\[ P_{mx}(w \mid (q,a)) = (1 - \beta) P_{ml}(w \mid q) + \beta \sum_{t \in q} P(w \mid t) P_{ml}(t \mid q) \]

is extended with a query likelihood model for the answer part:

\[ P_{mx}(w \mid (q,a)) = \alpha P_{ml}(w \mid q) + \beta \sum_{t \in q} P(w \mid t) P_{ml}(t \mid q) + \gamma P_{ml}(w \mid a) \]
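The extended mixture can be sketched numerically. Everything concrete here is an assumption for illustration: the translation table `trans` (mapping (w, t) to P(w | t)) holds an invented value, the weights are chosen so that α + β + γ = 1, and smoothing against the collection is left out.

```python
from collections import Counter
from typing import Dict, List, Tuple


def p_ml(word: str, tokens: List[str]) -> float:
    """Maximum-likelihood estimate of `word` in a token list."""
    return Counter(tokens)[word] / len(tokens) if tokens else 0.0


def p_mx(
    word: str,
    question: List[str],
    answer: List[str],
    trans: Dict[Tuple[str, str], float],
    alpha: float,
    beta: float,
    gamma: float,
) -> float:
    """alpha*P_ml(w|q) + beta*sum_t P(w|t)*P_ml(t|q) + gamma*P_ml(w|a)."""
    translated = sum(
        trans.get((word, t), 0.0) * p_ml(t, question) for t in set(question)
    )
    return alpha * p_ml(word, question) + beta * translated + gamma * p_ml(word, answer)


question = ["cheap", "laptop"]
answer = ["buy", "a", "notebook", "online"]
trans = {("notebook", "laptop"): 0.5}  # hypothetical P(notebook | laptop)

score = p_mx("notebook", question, answer, trans, alpha=0.2, beta=0.4, gamma=0.4)
```

The query word "notebook" never occurs in the archived question, yet it scores 0.2: 0.1 from the translation component (via "laptop") and 0.1 from the answer part. This is the lexical-gap problem the mixed model is meant to address.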


• Example-2: Answer quality prediction
– Jeon et al., SIGIR-2006
– Background:
• There are plenty of bad answers in cQA archives
– Some users answer with nonsense
– Some answers contain irrelevant advertisements
– Examples:

Q: What is the minimum positive real number in Matlab?
A: Your IQ

Q: What is new in Java2.0?
A: Nothing new

Q: Can I get a router if I have a usb dsl modem?
A: Good question but I do not know

Jeon et al. A Framework to Predict the Quality of Answers with Non-Textual Features.


• Example-2: Answer quality prediction (cont.)
– Measure answer quality with non-textual features

Answerer’s Acceptance Ratio | Answer Length
Questioner’s Self Evaluation | Answerer’s Activity Level
Answerer’s Category Specialty | Print Count
Copy Count | Users’ Recommendation
Editor’s Recommendation | Sponsor’s Answer
Click Counts | Number of Answers
Users’ Dis-Recommendation |


• Example-2: Answer quality prediction (cont.)
– Maximum entropy for answer quality estimation

\[ p(y \mid x) = \frac{1}{Z(x)} \exp\left[ \sum_{i=1}^{13} \lambda_i f_i(x, y) \right] \]

– Integrated into the cQA retrieval model

\[ P(D \mid Q) = P(D) \prod_{w \in Q} P(w \mid D) \]

\[ P(D) = p(y \mid x = D) \]

» P(D): prior probability of D



Online Encyclopedia

• How to make use of online encyclopedias?
– Clean and semi-structured data
• Applications: information extraction, relation extraction, summarization…
– Links among entries
• Applications: WSD, lexical reference rule extraction, NE recognition…
– Revision history
• Applications: sentence compression, sentence simplification…

Online Encyclopedia (cont.)

• Example-1: Information extraction
– Wu and Weld, ACL-2010
– Main idea:
• Generate relation-specific training examples by matching Infobox attribute values to corresponding sentences
• Main modules:
– Preprocessor
– Matcher
– Learner

Wu and Weld. Open Information Extraction using Wikipedia.


• Example-1: Information extraction (cont.)–An example



• Example-1: Information extraction (cont.)
– Preprocessor
• Sentence splitting
• NLP annotation
• Compiling synonyms
– Using Wikipedia redirection pages and backward links
– Matcher
• Match target entity
– Full match, synonym match, partial match, type match, pronoun match…
• Match sentences
– Seek a unique sentence to match the attribute value



• Example-1: Information extraction (cont.)
– Learn two kinds of extractors

• Extractor-1:

– Using dependency parse tree features

– Higher accuracy but slower

• Extractor-2:

– Using only shallow features, such as POS tags

– Lower accuracy but faster



• Example-2: Multilingual NER
– Richman and Schone, ACL-2008
– Main idea:
• English NE categorization
– Use category links
• Foreign NE categorization
– Find the counterpart in English Wikipedia pages
– Use category links for English NE categorization
– A foreign NE and its English counterpart share an identical NE type

Richman and Schone. Mining Wiki Resources for Multilingual Named Entity Recognition.



Example (English Wikipedia): Jacqueline Bhabha

Categories: British lawyers; Jewish American writers; Indian Jews

Related categories: Lawyers by nationality; American writers by ethnic or national origin; Jewish writers; British legal professionals; Indian people by religion; Indian people by ethnic or national origin

Callout: key phrase for PER




Example (French Wikipedia): Erquy

French category | English counterpart
Commune des Côtes-d'Armor | Category:Communes of Côtes-d'Armor
Ville portuaire de France | Category:Port cities and towns in France
Port de plaisance | Category:Marinas
Station balnéaire française | Category:Seaside resorts in France

Related English categories: Towns in Brittany; Cities in France; Coastal construction; Seaside resorts

Callout: easy to identify as GPE



• Example-3: Sentence compression
– Yamangil and Nelken, ACL-2008
– Key problem for sentence compression
• Data sparseness
• Ziff-Davis corpus: 1067 sentence pairs
– Main idea of this work
• Abundant sentence compressions can be extracted from Wikipedia’s revision history and used as training data
• >380,000 sentence pairs are extracted

Yamangil and Nelken. Mining Wikipedia Revision Histories for Improving Sentence Compression.


• Example-3: Sentence compression (cont.)
– Assumption:

• All edits retain the core meaning of the sentence

–Focus only on sentence-level edits that add or drop words

–Train a lexicalized channel model for sentence compression




Blogs / Microblogs

• Research on blogs / microblogs
– Adaptation of conventional NLP techniques
• Tokenization, POS tagging, spelling error correction, …
– User profile learning
– Search and recommendation
– Sentiment analysis
– Monitoring emergency events
– ……

Blogs / Microblogs (cont.)

• Example-1: POS tagging for Twitter
– Gimpel et al., ACL-2011
– Background:
• Conventional NLP tools are typically trained on news texts and perform poorly on Twitter
– This work:
• Produce an English POS tagger that is designed especially for Twitter data

Gimpel et al. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments.


• Example-1: POS tagging for Twitter (cont.)
– Tagset designed for Twitter



• Example-1: POS tagging for Twitter (cont.)
– Twitter-specific features
• TWORTH: Twitter orthography
– Regular expression rules to detect hashtags (#), at-mentions (@), and URLs (U)
• NAMES: Frequently-capitalized tokens
– Whether a token is frequently capitalized
• TAGDICT: Traditional tag dictionary
– Token’s POS tags from conventional corpora
• DISTSIM: Distributional similarity
– Similar words of the token
• METAPH: Phonetic normalization
– E.g., {thanks thangs thanksss …} -> 0NKS



• Example-2: Sentiment classification
– Jiang et al., ACL-2011
– Sentiment classification for Twitter should be target-dependent
• E.g.,

People everywhere love Windows & Vista. Bill Gates

Windows 7 is much better than Vista

Jiang et al. Target-dependent Twitter Sentiment Classification.


• Example-2: Sentiment classification (cont.)
– Sentiment classification for Twitter should be context-aware
• E.g.,

Tweet | Sentiment
[tweet_n] First game: Lakers! | Positive
[tweet_n-1] I love Lakers, I love Kobe!! | Positive
[Retweet] Lakers won the game, Great! | Positive
[Reply] I love Lakers too. | Positive


• Example-2: Sentiment classification (cont.)
– Method of this work

• Target-dependent features

– Use features based on syntactic parse trees

» E.g., verb+target(obj); target(sub)+verb; adjective+target…

– Binary SVM classifier

• Graph-based sentiment optimization

– Three kinds of related tweets:

» Retweets; reply; tweets containing the target from the same person

– Construct a graph with related tweets and classify tweets on the graph



• Example-3: User grouping –Qu and Liu, ACL-2011

–Motivation:• The accounts a Twitter user follows need to be grouped when the following list is long

–Main idea:• Provide some seeding friends for each target class, and

automatically group other friends similar / related to the seeds

Qu and Liu. Interactive Group Suggesting for Twitter.


• Example-3: User grouping (cont.)–Two sub-systems for the task

• Sub-system 1:

– Content based sub-system

» Compute similarity between each friend and the seeding friends based on their tweet contents

• Sub-system 2:

– Friend based sub-system

» Use the count of bi-directional friend relationships and mentions between each friend and seeding friends as the score for ranking

Qu and Liu. Interactive Group Suggesting for Twitter.







New Difficulties & Challenges

• Storage and computing capability

• Rapid evolution of languages

• Noise and errors in the web corpora

• Information credibility

Storage and Computing Capability

• Large-scale data sources– Web page corpora

– Web search query logs

– Language model

– ……

• Two questions:– How to store the data and access it easily?

– How to process such massive data efficiently?

• Solutions:– Pruning and filtering

– Efficient algorithm

– Distributed computing

Storage and Computing Capability (cont.)

• Example-1: machine translation–Scale of data:

• Translation model trained with 100 million sentence pairs

– Size of model: ≈20G

• Language model (5-gram) trained with 100 million sentences

– Size of model: ≈5G

• Models used by online translation systems are usually larger than those above!


• Example-1: machine translation–Pruning phrase table (PT) [Johnson et al., 2007]

• Observation:

– Many phrase pairs in PT are wrong or will never be used

• Disadvantage of large PT

– Requires more resources and time to process

– Requires more features and more sophisticated search

• Work in this paper:

– Prune PT based on significance testing

Johnson et al. Improving Translation Quality by Discarding Most of the Phrasetable.


• Example-1: machine translation–Pruning phrase table (PT) [Johnson et al., 2007]

• Fisher’s exact test:

$$p_h\big(C(\tilde{s},\tilde{t})\big) = \frac{\binom{C(\tilde{s})}{C(\tilde{s},\tilde{t})}\,\binom{N - C(\tilde{s})}{C(\tilde{t}) - C(\tilde{s},\tilde{t})}}{\binom{N}{C(\tilde{t})}}$$

• p-value:

$$\text{p-value}\big(C(\tilde{s},\tilde{t})\big) = \sum_{k = C(\tilde{s},\tilde{t})}^{\infty} p_h(k)$$

Johnson et al. Improving Translation Quality by Discarding Most of the Phrasetable.
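The significance test can be sketched in a few lines of Python. The sketch assumes that N is the number of parallel sentence pairs, that C(s) and C(t) count the pairs containing the source and the target phrase, and that C(s, t) counts the pairs containing both; these readings and the function names are assumptions for illustration. The infinite sum is truncated at min(C(s), C(t)) because the hypergeometric probability is zero beyond that point.

```python
from math import comb


def hypergeom_prob(k, c_s, c_t, n):
    """Probability of exactly k co-occurrences under independence."""
    return comb(c_s, k) * comb(n - c_s, c_t - k) / comb(n, c_t)


def p_value(c_st, c_s, c_t, n):
    """Probability of observing c_st or more co-occurrences by chance."""
    upper = min(c_s, c_t)  # terms with larger k have probability zero
    return sum(hypergeom_prob(k, c_s, c_t, n) for k in range(c_st, upper + 1))


if __name__ == "__main__":
    # Toy counts (hypothetical): a phrase pair seen together 8 times.
    print(p_value(c_st=8, c_s=10, c_t=12, n=1000))
    # A 1-1-1 pair: each phrase occurs once, and they occur together.
    print(p_value(c_st=1, c_s=1, c_t=1, n=1000))
```

A pruning step would then discard phrase pairs whose p-value exceeds a chosen threshold; the threshold itself is a tuning decision not fixed here.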


• Example-2: machine translation–Parallel decoding and distributed language model

• Li et al., 2009

• Parallel decoding

– Exploit multi-core and multi-processor architectures

– Translate multiple sentences in separate threads

– Store the language model and translation grammar in shared memory

• Distributed language model

– Reduce memory pressure

– Use larger language model

Li et al. Decoding in Joshua: Open Source Parsing-based Machine Translation.


• Example-3: paraphrase acquisition–Basic idea:

• Hypothesis: Paraphrases should appear in similar contexts (distributional similarity)

• Extract phrases whose context vectors are similar to each other as paraphrases

– E.g., X acquired Y and X completed the acquisition of Y

• Data scale:

– 150GB monolingual corpus

– Every pair of phrases has to be checked to see whether they are paraphrases

Bhagat and Ravichandran. Large Scale Acquisition of Paraphrases for Learning Surface Patterns.


• Example-3: paraphrase acquisition (cont.)–Apply Locality Sensitive Hashing (LSH) to speed up computation [Bhagat and Ravichandran, 2008]

• Represents a d-dimensional vector by a stream of b bits (b ≪ d) and has the property of preserving the cosine similarity between vectors

• Reduces time complexity from $O(n^2 d)$ to $O(nd)$ (n: #phrases)

Bhagat and Ravichandran. Large Scale Acquisition of Paraphrases for Learning Surface Patterns.
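One LSH family that preserves cosine similarity is random-hyperplane hashing, where each bit records on which side of a random hyperplane a vector falls. The sketch below assumes that family and uses hypothetical function names; it shows only how a d-dimensional vector becomes a b-bit signature and how cosine similarity is estimated from the Hamming distance, not the fast search over signatures.

```python
import math
import random


def make_hyperplanes(d, b, seed=0):
    """Draw b random hyperplanes (normal vectors) in d dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(b)]


def signature(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    return [
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    ]


def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the fraction of differing bits."""
    hamming = sum(x != y for x, y in zip(sig_a, sig_b))
    return math.cos(math.pi * hamming / len(sig_a))


if __name__ == "__main__":
    planes = make_hyperplanes(d=4, b=256, seed=42)
    a = [1.0, 2.0, 0.0, 1.0]
    c = [1.0, 1.5, 0.2, 1.0]
    print(estimated_cosine(signature(a, planes), signature(c, planes)))
```

More bits give a tighter estimate at the cost of longer signatures.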






Rapid evolution of languages

• Languages change rapidly due to WWW:–New words

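• E.g., 给力 ("awesome"), 雷人 ("shocking")

–New senses• E.g., 粉丝 (vermicelli -> "fans"), 玉米 (corn -> fans of the singer Li Yuchun)

–New named entities• E.g., 旭日阳刚 (Xuri Yanggang), 筷子兄弟 (Chopstick Brothers), both music duos

–Chat language• E.g., 818 ("to gossip"), 有木有 (variant of 有没有, "isn't it?")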

Rapid evolution of languages (cont.)

• Challenges posed to NLP:–Word segmentation

• Out-Of-Vocabulary problem

–Named entity recognition• More categories should be covered

–Word sense disambiguation• New senses of words need to be recognized

–Chat language normalization• Normalize chat language to natural language


• Example-1: Entity extraction–Pennacchiotti and Pantel, EMNLP-2009

–Two extractors:• Pattern-based extractor

– Find instances of a given relation with seed instances

– E.g., act-in (Actor, Movie)

• Distributional extractor

– Construct a context vector for each noun

– Sample seed entities of a given entity class C

– Compute context vector similarity with the seeds

– Similar nouns are returned and classified into C

–Take the union of the entities extracted above

Pennacchiotti and Pantel. Entity Extraction via Ensemble Semantics.
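The distributional extractor can be illustrated with a small sketch: build a context vector per noun, sum the seed vectors into a centroid, and rank the remaining nouns by cosine similarity to it. The toy context counts, the centroid step, and the function names are assumptions for illustration; the slide does not specify how similarity to the seeds is aggregated.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(context_vectors, seeds):
    """Rank non-seed nouns by similarity to the summed seed vectors."""
    centroid = Counter()
    for seed in seeds:
        centroid.update(context_vectors[seed])
    scores = {
        noun: cosine(vec, centroid)
        for noun, vec in context_vectors.items()
        if noun not in seeds
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical context counts for the class Actor.
    vectors = {
        "tom hanks": {"starred in": 3, "actor": 2},
        "meryl streep": {"starred in": 2, "actor": 3},
        "brad pitt": {"starred in": 1, "actor": 1},
        "paris": {"capital of": 4, "flights to": 2},
    }
    print(rank_candidates(vectors, seeds={"tom hanks", "meryl streep"}))
```

Nouns above a similarity cutoff would then be classified into C; the cutoff is left out of this sketch.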


• Example-1: Entity extraction–Pennacchiotti and Pantel, EMNLP-2009

–ML-based ranker• Regression model

– Features:• Feature classes:

– Frequency, Co-occurrence, Distributional, Pattern, Termness

• Features are extracted from:

– Web corpus, query logs, web tables, wikipedia



• Example-2: Entity extraction and clustering– Jain and Pennacchiotti, COLING-2010

–Extract entities from web search query logs• Heuristic:

– Query term sequence with uppercase characters

• Filtering:

– Web-based representation score

» Checks if the case-sensitive representation of the candidate is the most likely representation

– Query-log-based standalone score

» Counts the occurrences of the standalone forms of the candidate in the query logs

Jain and Pennacchiotti. Open Entity Extraction from Web Search Query Logs.


• Example-2: Entity extraction and clustering– Jain and Pennacchiotti, COLING-2010

–Entity clustering• Features:

– Context feature space

» Hypothesis: similar queries should appear in similar contexts in the query logs

– Clickthrough feature space

» Hypothesis: similar queries should generate clicks on similar urls

– Hybrid feature space

» Normalized union of the two feature spaces above

Jain and Pennacchiotti. Open Entity Extraction from Web Search Query Logs.


• Example-3: Chinese chat text normalization–Xia et al., ACL-2006

–Problem:• Normalize Chinese chat text to natural language

– E.g., 介里 -> 这里 ("here"); 偶 -> 我 ("I")

–Characteristics:• Anomalous

– Anomalous words or anomalous usage

• Dynamic

– Changes quickly from year to year

Xia et al. A Phonetic-Based Approach to Chinese Chat Text Normalization.


• Example-3: Chinese chat text normalization–Baseline model:

• Source-channel model

– $\hat{C} = \arg\max_{C} p(C \mid T) = \arg\max_{C} \frac{p(T \mid C)\, p(C)}{p(T)}$

• Disadvantages:

– Data sparseness

– Training effectiveness is poor due to the dynamic nature

–Observation:• Most Chinese chat terms are created via phonetic transcription


• Example-3: Chinese chat text normalization–Extended source channel model:

• Inserting phonetic mapping model

• $\hat{C} = \arg\max_{C,M} p(C \mid M, T) = \arg\max_{C,M} \frac{p(T \mid M, C)\, p(M \mid C)\, p(C)}{p(T)}$

• Chat term normalization observation model:

– $p(T \mid M, C) = \prod_{i} p(t_i \mid m_i, c_i)$

• Phonetic mapping model:

– $p(M \mid C) = \prod_{i} p(m_i \mid c_i)$, where $p(m_i \mid c_i)$ is the phonetic mapping probability


• Example-3: Chinese chat text normalization–Phonetic mapping model:

• $\mathrm{Prob}_{pm}(t, c) = \frac{fr_{slc}(c) \times ps(t, c)}{\sum_{i} \big(fr_{slc}(c_i) \times ps(t, c_i)\big)}$

– $fr_{slc}(c)$: frequency of character c in a standard language corpus

– $ps(t, c)$: phonetic similarity between two characters

» $ps(t, c) = sim(py(t), py(c)) = sim(initial(py(t)), initial(py(c))) \times sim(final(py(t)), final(py(c)))$
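The phonetic mapping probability can be worked through on a toy case. The sketch below represents each pinyin syllable as an (initial, final) pair, multiplies the initial and final similarities to get ps(t, c), weights it by the character frequency, and normalizes over the candidate characters. The similarity tables and frequencies are made-up values for illustration, not figures from Xia et al.

```python
# Hypothetical similarity tables; real values would be estimated or tuned.
INITIAL_SIM = {("j", "zh"): 0.5}
FINAL_SIM = {("ie", "e"): 0.5}


def table_sim(a, b, table):
    """Similarity of two pinyin parts: 1.0 if equal, else a table lookup."""
    if a == b:
        return 1.0
    return table.get((a, b), table.get((b, a), 0.0))


def ps(py_t, py_c):
    """Phonetic similarity: initial similarity times final similarity."""
    return table_sim(py_t[0], py_c[0], INITIAL_SIM) * table_sim(
        py_t[1], py_c[1], FINAL_SIM
    )


def phonetic_mapping_probs(py_t, candidates):
    """candidates maps a character to (pinyin, corpus frequency)."""
    scores = {c: fr * ps(py_t, py) for c, (py, fr) in candidates.items()}
    total = sum(scores.values())
    if total == 0:
        return {c: 0.0 for c in scores}
    return {c: s / total for c, s in scores.items()}


if __name__ == "__main__":
    # Chat character 介 (jie) against two standard characters;
    # the frequencies are invented for the example.
    candidates = {"这": (("zh", "e"), 0.006), "借": (("j", "ie"), 0.002)}
    print(phonetic_mapping_probs(("j", "ie"), candidates))
```

The frequency term lets a common character such as 这 compete with an exact homophone, which is the intent of weighting similarity by $fr_{slc}(c)$.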






Noise and Errors in the Web Corpora

• Query logs–Queries are not well-formed sentences

– 10-15% of queries contain misspelled terms [Cucerzan and Brill, 2004]

• UGC data–Noise and errors are common in Forums,

community-based QA, blogs, microblogs

• Presents challenges to NLP research–E.g., word segmentation, POS tagging, parsing

Cucerzan and Brill. Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users.

Noise and Errors in the Web Corpora (cont.)

• Spelling correction –Non-word spelling correction

• Word not found in pre-compiled lexicon

• Words similar to the misspelled word are candidate spelling corrections

• Statistical error models are more effective

–Real-word spelling correction• Incorrect usage of a valid word in a given context

– A ___ of cake (peace / piece)

• Generate candidate corrections with a pre-defined confusion set

• Rank candidates according to contextual information
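Real-word correction with a confusion set can be sketched as follows: for each token that belongs to a confusion set, every member of the set is scored against its left and right neighbours, and the best-scoring member is kept. The confusion sets and bigram counts are hypothetical toy data; a real system would use a language model trained on a large corpus.

```python
# Hypothetical confusion sets and bigram counts for illustration.
CONFUSION_SETS = [{"peace", "piece"}, {"their", "there"}]
BIGRAM_COUNTS = {
    ("a", "piece"): 40,
    ("piece", "of"): 60,
    ("a", "peace"): 3,
    ("peace", "of"): 10,
}


def context_score(prev_word, word, next_word):
    """Score a word by how often it follows prev_word and precedes next_word."""
    return BIGRAM_COUNTS.get((prev_word, word), 0) + BIGRAM_COUNTS.get(
        (word, next_word), 0
    )


def correct(tokens):
    """Replace confusable words when an alternative fits the context better."""
    corrected = list(tokens)
    for i, word in enumerate(tokens):
        prev_word = tokens[i - 1] if i > 0 else "<s>"
        next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
        for conf_set in CONFUSION_SETS:
            if word not in conf_set:
                continue
            best, best_score = word, context_score(prev_word, word, next_word)
            for cand in sorted(conf_set):
                score = context_score(prev_word, cand, next_word)
                if score > best_score:  # keep the original word on ties
                    best, best_score = cand, score
            corrected[i] = best
    return corrected


if __name__ == "__main__":
    print(correct(["a", "peace", "of", "cake"]))
```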


• Query spelling correction–Sun et al., ACL-2010

–Collect training data from clickthrough data of query logs• Collect cases in which a user submits a query and clicks

on the spelling suggestion

– “did you mean” function

– 3 million query-correction pairs are collected in this way

–Use the collected data to train a spelling correction system

Sun et al. Learning Phrase-Based Spelling Error Models from Clickthrough Data.



–Model:• Ranking model

– Two-layer neural net with 5 hidden nodes

• 96 ranking features

– Language model

– Error model

» Edit distance model, phonetic model

– Phrase-based error model

– ……




–Phrase-based error model• Similar to SMT model

• “translate” a correct query C into a misspelled query Q

• Given:– Segment C into K phrases: c1,…,cK

– T: K replacement phrases: q1,…,qK

• Model:

$$P(Q \mid C) \approx \max_{(S,T,M) \in B(C,Q,A^*)} P(T \mid C, S)$$

$$P(T \mid C, S) = \prod_{k=1}^{K} P(q_k \mid c_k)$$

– $A^*$: alignment



–Features:• Phrase transformation feature

$$P(q \mid c) = \frac{N(c, q)}{\sum_{q'} N(c, q')}$$

• Lexical weight feature

$$P_w(q \mid c, A) = \prod_{i=1}^{|q|} \frac{1}{|\{ j \mid (j, i) \in A \}|} \sum_{\forall (i, j) \in A} t(q_i \mid c_j)$$
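The phrase transformation feature is a relative-frequency estimate: the count of a (correct phrase, observed phrase) pair divided by the total count of pairs with the same correct phrase. A minimal sketch follows, assuming the aligned phrase pairs have already been extracted; the function name and the toy pairs are hypothetical.

```python
from collections import Counter


def phrase_transformation_probs(pairs):
    """Estimate P(q | c) = N(c, q) / sum over q' of N(c, q')."""
    pair_counts = Counter(pairs)
    totals = Counter()
    for (c, _q), n in pair_counts.items():
        totals[c] += n
    return {(c, q): n / totals[c] for (c, q), n in pair_counts.items()}


if __name__ == "__main__":
    # Each pair is (correct phrase c, observed phrase q); toy data only.
    pairs = [
        ("britney", "britny"),
        ("britney", "britny"),
        ("britney", "brittany"),
        ("britney", "britney"),
    ]
    for (c, q), p in sorted(phrase_transformation_probs(pairs).items()):
        print(f"P({q} | {c}) = {p}")
```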






Information Credibility

• Credibility problem– It has become easy to produce and disseminate information in the Web 2.0 era• Personal web sites

• Forums

• Blogs / microblogs

• ……

– It becomes difficult to control the authority and credibility of the information (sources)• Deceptive, erroneous, subjective, outdated information

Information Credibility (cont.)

• Aspects of information credibility [Metzger, 2007]–Accuracy

–Authority• trustworthiness, expertise

–Objectivity

–Currency• Up-to-date

–Coverage• Sufficient depth and breadth

Metzger. Making Sense of Credibility on the Web: Models for Evaluating Online Information and Recommendations for Future Research.


• Criteria for information credibility– Information contents

• Detect logical consistency and contradiction

– Information sender• Quality and quantity of the information the sender has produced

–Document style and superficial characteristics• Sentential style, page layout, links in page,…

–Social evaluation• Mining users’ opinions and comments

Information Credibility Criteria Project: http://kc.nict.go.jp/project1/icc-project-description.html


• Example-1: Search engine–Credibility of returned search results:

• Relatedness

• Web site authority

• Inward links

• User clicks

• Anti-cheating

• Currency

• ……


• Example-1: Search engine (cont.)

Information from credible sources is ranked higher


• Example-2: News –Credibility of news:

• Credibility of the information sources

– E.g., Google News

• Example-3: E-commerce–Credibility of products

• Users’ rating

• Users’ comments


• Example-3: E-commerce (cont.)

Users’ rating

Users’ comments







New Methodologies & Solutions

• Response to real-world demands

• Familiar with cutting-edge research

• Balance between data and algorithms

• Experimental platforms for real applications

Response to Real-world Demands

• Research rooted in real applications

• Research beyond engineering

• Feasibility & Expansibility–Easy and robust solutions must come first

• Effectiveness vs. Efficiency–Web-scale applications

• Leverage all available data and resources

Response to Real-world Demands (cont.)

• Example-1: Entity linking–Application of word sense disambiguation (WSD)

–Conventional WSD research:• Disambiguate given words in certain contexts

–Entity linking:• Link entities appearing in documents with their

referents in knowledge bases (e.g., Wikipedia)

• WSD is the key problem in entity linking


• Example-1: Entity linking (cont.)–An example from [Han and Sun, 2011]

Han and Sun. A Generative Entity-Mention Model for Linking Entities with Knowledge Base.


• Example-2: Paraphrase acquisition – Investigate multiple resources [Zhao et al., 2008]

• Thesaurus

• Monolingual parallel corpora

– Multiple translations of the same foreign novel

• Monolingual comparable corpora

– Comparable news articles reporting on the same event

• Bilingual parallel corpora

• Online dictionary definitions

• Query clusters

– Based on click-through information

Zhao et al. Combining Multiple Resources to Improve SMT-based Paraphrasing Model.


• Example-2: Paraphrase acquisition (cont.)–Combine multiple resources [Zhao et al., 2008]

• Train a paraphrase table with each resource and combine them with an SMT model, which is then used in paraphrase generation

• Accuracy of the generated paraphrases is improved when combining multiple resources

Zhao et al. Combining Multiple Resources to Improve SMT-based Paraphrasing Model.


• Example-3: Ensemble Semantics (ES)–Pennacchiotti and Pantel, EMNLP-2009

–Ensemble semantics: • A general framework for modeling information

extraction algorithms that combine multiple sources of information and multiple extractors

• Advantages:

– Multiple sources of knowledge

– Multiple extractors

– Multiple sources of features



• Example-3: Ensemble Semantics (ES)–Pennacchiotti and Pantel, EMNLP-2009








Familiar with Cutting-edge Research

• Cutting-edge research–New problem

• E.g., sentiment analysis, entailment

–New solution• E.g., crowdsourcing

–New application• E.g., microblogs

• In-depth analysis before entering a new field

Familiar with Cutting-edge Research (cont.)

• Example: Crowdsourcing–Definition:

• The act of outsourcing tasks, traditionally performed by an employee or contractor, to an undefined, large group of people or community (a "crowd"), through an open call ----Wikipedia

• …it gathers those who are most fit to perform tasks, solve complex problems and contribute with the most relevant and fresh ideas ---- Jeff Howe

–Amazon’s Mechanical Turk:• A crowdsourcing Internet marketplace that enables

computer programmers (known as Requesters) to co-ordinate the use of human intelligence to perform tasks that computers are unable to do yet ---- Wikipedia


• Crowdsourcing (cont.)–Crowdsourcing for NLP

• Speech recognition [Novotney and Callison-Burch, ’10]

• Machine translation [Zaidan and Callison-Burch, ’11]

• Paraphrase generation [Madnani, ’10]

• Anaphora resolution [Chamberlain et al., ’09]

• Word sense disambiguation [Akkaya, et al., ’10]

• Lexicon construction [Irvine and Klementiev, ’10]

• Named entity recognition [Finin et al., ’10]

• Grammatical error detection [Madnani et al., ’11]

• ……


• Crowdsourcing (cont.)–Advantage:

• Large and low-cost labor force

• Short turnaround time

• Access to foreign markets with native speakers

–Key problem: quality control• Knowledge producers are non-professional

– Post-processing for getting high-quality knowledge

–Machine learning techniques for automatically selecting high-quality knowledge


• Crowdsourcing (cont.)–Example: Collect Urdu-English translations

• Zaidan and Callison-Burch (ACL-2011)

Source sentence -> Collect translations -> Post-edit and rank -> Quality control -> Best translation

Zaidan and Callison-Burch. Crowdsourcing Translation: Professional Quality from Non-Professionals.



• Collect English translations for Urdu sentences

– Turkers are mainly from India and Pakistan

– Input sentences are converted into images to avoid cheating by using an automatic MT system

– Collect multiple translations for each source sentence

• Post-editing and ranking

– Turkers should be native English speakers

– Post-edit: Edit the translations to make them more fluent

– Rank: rank all translations




• Automatically select the best translation from the candidate translations

• Features:

– Sentence-level features

» Language model features, sentence length features, web n-gram match percentage, web n-gram geometric average, edit rate to other translations

– Worker-level features

» Aggregate features, language ability, worker location

– Ranking features

» Average rank, is-best percentage, is-better percentage

– Worker calibration feature




• Evaluation:

– Compute BLEU against professional translations







Balance between Data and Algorithms

• Traditional research–Limited data

– Sophisticated algorithms• Have to mine “lean ore” for knowledge

• NLP for web applications–Large-scale data

• Information redundancy is critical for statistical methods

–Relatively simple algorithms• Easily acquire enough knowledge with lightweight

methods

• Efficiently process large corpora

Balance between Data and Algorithms (cont.)

• Example: Open Information Extraction (OIE)–Banko et al., IJCAI-2007

–Traditional IE systems:• Use small and homogeneous corpora

– E.g., news corpora

• Rely on heavy linguistic technologies

– E.g., parsing, NER

• Relations of interest are specified beforehand

Banko et al. Open Information Extraction from the Web.

Difficult to scale to the massive and heterogeneous web corpora



–The proposed method:• Module-1: self-supervised learner

– Train a classifier with a small corpus, which labels candidate extractions as “trustworthy” or not

• Module-2: single-pass extractor

– Extract candidate extractions from a large corpus, which are then filtered with the learnt classifier

• Module-3: redundancy-based assessor

– Assign a probability to each retained extraction based on a probabilistic model of redundancy in text



• Example: Open Information Extraction (OIE) –Banko et al., IJCAI-2007

–Module-1: self-supervised learner• Parse a small corpus with a syntax parser, and extract

noun phrase pairs with syntax paths

• Automatically label positive / negative examples based on constraints

– Path length, within sentence boundary, not solely pronoun

• Train a Naïve Bayes classifier with the training data




–Module-2: single-pass extractor• Identify noun phrases with a lightweight NP chunker

• Relations are found by examining the text between noun phrases

– Non-essential phrases are filtered

• Candidate tuples are presented to the classifier; those labeled as positive are extracted and stored




–Module-3: redundancy-based assessor• Extraction is performed over the entire corpus

• Merge tuples where both entities and relations are the same and count their occurrences in sentences

• Assign a probability based on the occurrence count
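The merge-and-count step of the assessor can be sketched as below. Identical tuples are merged after lowercasing, and each merged tuple receives a probability that grows with its occurrence count. The noisy-or formula and the `per_occurrence_precision` parameter are stand-ins chosen for this illustration; they are not the probabilistic redundancy model used by Banko et al.

```python
from collections import Counter


def assess(extractions, per_occurrence_precision=0.5):
    """Merge identical (entity, relation, entity) tuples and score them.

    Uses a simple noisy-or: a tuple seen k times is correct unless all
    k occurrences are wrong. This is an illustrative stand-in model.
    """
    counts = Counter(
        (e1.lower(), rel.lower(), e2.lower()) for e1, rel, e2 in extractions
    )
    return {
        tup: 1.0 - (1.0 - per_occurrence_precision) ** k
        for tup, k in counts.items()
    }


if __name__ == "__main__":
    extractions = [
        ("Edison", "invented", "light bulb"),
        ("edison", "invented", "light bulb"),
        ("Paris", "is capital of", "France"),
    ]
    for tup, prob in assess(extractions).items():
        print(tup, prob)
```

The point of the sketch is only that redundancy across a large corpus raises confidence in an extraction, which is why lightweight extractors can work at web scale.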




–Data scale:• Extract facts from a 9 million web page corpus

• 60.5 million tuples were extracted

–Error rate:

System        Average error rate    Correct extractions
TEXTRUNNER    12%                   11,476
KNOWITALL     18%                   11,631






Experimental Platforms for Real Applications

• Academic and industrial circles should work together to establish evaluation platforms that resemble real applications– Industrial circle:

• Collect real-world application requirements

• Release real application data

– E.g., query logs

• Example-1: Yahoo! learning to rank challenge–A platform for learning-to-rank research

–Dataset:• Sampled from Yahoo! query logs

• Contains <query, url, features, relevance judgment>

– Queries, urls, and feature descriptions are not given, only feature values are

• Volume

Experimental Platforms for Real Applications (cont.)

• Example-2: Microsoft learning to rank dataset–Similar to the Yahoo! Dataset

– Sampled from Microsoft Bing query logs• MSLR-WEB30K: more than 30,000 queries

• MSLR-WEB10K: 10,000 queries

• Form

– Query ID, url ID, feature descriptions and feature values are given



• Example-3: Query logs–AOL query logs

• About 20M queries from about 650k users

• Users are represented as IDs

• Both queries and clicked urls are given

– Sogou query logs• 1 month query logs

• Presented info.

– Time, user ID, clicked url (Uc), rank of Uc, sequence number of Uc


• Example-4: other released data–Data from Yahoo! Webscope:

• http://webscope.sandbox.yahoo.com/

–Data from Sogou Labs• http://www.sogou.com/labs/

–Google Web 1T 5-gram corpus

–……


Thanks! Q&A
