Natural Language Processing for Web Applications

Haifeng Wang, Shiqi Zhao

Source: ir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf

Outline

• Part I–Background

• Part II–New Applications & Trends

• Part III–New Data & Resources

• Part IV–New Difficulties & Challenges

• Part V–New Methodologies & Solutions

Status of the Internet

• Search Engine Union
• Customer
  – Any enterprise (hundreds of thousands: finance, travel, shipping)
  – Advertisement
• Provider
  – Webmaster (vertical website)
  – Press
  – Website (1.83M in 2011 in China)
  – Hidden web
• Internet Users
  – 485M+ users in China; about 1 billion in the future
• Infrastructure
  – Domain name, telecommunications, mobile, broadband, hardware, Internet service provider

Trends of the Internet

• Media: Text → Image, Video, Audio, etc.
• Content Source: Professional Editors → User Generated; Static → Dynamically Generated
• Media-Time Transition: Simply Surfing → Online gaming, video-watching, social interaction (all kinds of applications)
• Time Constraint: Users want real-time information when searching
• The Internet and users have changed a lot in the past decade

Traditional Search Engine

• User
• Entrance: Box
• Query: Text matching based requirement analysis
• Information: Webpages
• Presentation: All results returned in similar format
• Traditional web search: Crawler, Webpages, Internet

New Generation of Search Engines

• Stages: User; Entrance (Box); Query; Requirement Analysis; Execution (Info, Apps); Results Integration; Presentation
• Execution: Apps Execution; Rich Media Analysis
• Platforms: Open Apps Platform; Open Data Platform; Traditional Web Search; Rich Media Search
• Sources: Real-time data; Individual and group developers; Internet; SNS/UGC/Hidden web
• Results: Apps; Accurate Structured-results; Webpages; Rich Media

Queries to Baidu Search Engine

• 听起来欢乐的歌曲 (Joyous song)
• 现在几点了 (What time is it)
• 电脑中毒了怎么办 (How to deal with computer virus)
• 哪能买到漂亮衣服 (Where could I buy some beautiful clothes)
• 北京哪能找到女朋友 (Where could I get a girlfriend in Beijing)
• 令人心情愉快的图片 (Pleasant pictures)

Challenge to NLP

• Requirement Analysis
  – Complex query
  – Diversiform requirement
• Knowledge Mining
  – Hidden web, hiding knowledge
  – Structured, semi-structured, unstructured
  – Various levels
• Result Presentation
  – Direct answer
  – Clustering
  – Summarization
  – Relation graph
  – Intelligent push
  – Rich media
• Human-Computer Interaction
  – Suggestion
  – Extension
  – Interaction

NLP for Web Applications

• Methods: Rule-based method; Statistical / ML method
• Resources: Dictionary; Corpus; Web data; Log
• Modules
  – Term
    • Granularity: Segmentation; Unknown word; Component
    • Property: Proper noun; Requirement; POS; Phonetic
    • Relation: Collocation; Similarity; Language model; Ontology
  – Phrase
    • Structure: Chunking; Term importance; Trunk parsing
    • Transformation: Synonym; Semantic normalization; Correction
    • Classification: Requirement classification; Topic detection
  – Sentence: Syntax; Semantic
  – Document
    • Single document: Topic analysis; Page value analysis
    • Multi-document: Classification & clustering; Feature extraction
• NLP Applications: Machine translation; IME; Web search; Mobile search; Vertical search; cQA; Wikipedia; Advertising; Recommendation & personalization

New Applications & Trends

• NLP for web applications
  – Main applications
    • Web search, online translation, recommendation system, e-business, social networks,…
  – Evolution of traditional topics
    • Word segmentation, machine translation, question answering,…
  – New research topics

Main Applications (cont.)

• NLP for web search
  – Basic applications
    • Segmentation
    • Query term importance rating
    • Query rewriting
    • Query intent analysis
  – Advanced applications
    • Question answering
    • Summarization
    • Word sense disambiguation
    • Clustering
    • Information extraction
    • Ontology

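Main Applications (cont.)

• Examples (original query → rewritten query):
  – 天龙八步 → 天龙八部 (wrong character in the novel title 天龙八部 corrected)
  – 怎样能有归一证 → 怎样能有皈依证 ("how to get a refuge certificate"; homophone error corrected)
  – 宝马X6价钱 → 宝马X6报价 ("BMW X6 price" → "BMW X6 quote")
  – 成都的哥罢工 → 成都出租车罢工 ("Chengdu cabbies strike" → "Chengdu taxi strike")
  – 赞颂母爱的现代诗 → 母爱的现代诗 ("modern poems praising maternal love" → "modern poems on maternal love")
  – 康柏笔记本vista系统一键恢复 → 康柏vista一键恢复 ("Compaq laptop Vista system one-key recovery" → "Compaq Vista one-key recovery")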

Main Applications (cont.)

• NLP for online translation
  – Webpage
  – Query
  – Communication
  – Social network
  – E-commerce
  – Computer aided translation
  – Computer aided learning
  – Mobile

Main Applications (cont.)

• Recommendation system
  – Recommendation within a single product
  – Recommendation across products
• Recommendation clues
  – User profile
  – User log
  – Content

Main Applications (cont.)

• NLP for e-business
  – Intent / requirement recognition
    • E-business website
  – Information extraction
    • E.g., extract product information
  – Information recommendation
    • Maintain user profiles and recommend products to users
    • Advertising
  – Sentiment analysis
    • Analyze comments of products

Main Applications (cont.)

• NLP for social networks
  – User profile
  – Recommendation
    • Not only information, but also users
  – Data mining
    • Identify emergency events
    • Sentiment analysis
    • Polls

Main Applications (cont.)

• Main applications
  – Web search
  – Online translation
  – Recommendation
  – E-business
  – Social networks
• Basic applications
  – Segmentation
  – Query term importance rating
  – Query rewriting
  – Query intent analysis
• Advanced applications
  – Question answering
  – Summarization
  – Word sense disambiguation
  – Clustering
  – Information extraction
  – Ontology

Basic Applications for Web Search

• Word segmentation
  – Basic unit for languages like Chinese
  – Segmentation for both queries and web documents
  – Example: 功夫熊猫在线观看 ("Kung Fu Panda watch online")
    • 功夫熊猫 / 在线 / 观看
    • 功夫 / 熊猫 / 在线 / 观看
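The two granularities in the example can be reproduced with a dictionary-based segmenter. The following is only a minimal sketch using forward maximum matching, a common baseline that the slides do not name; the function name `forward_max_match` and both toy lexicons are assumptions made for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    lexicon entry, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

# Toy lexicons (assumed): the coarse one knows the movie title as one term.
coarse_lexicon = {"功夫熊猫", "功夫", "熊猫", "在线", "观看"}
fine_lexicon = {"功夫", "熊猫", "在线", "观看"}

print(forward_max_match("功夫熊猫在线观看", coarse_lexicon))
print(forward_max_match("功夫熊猫在线观看", fine_lexicon))
```

Swapping the lexicon is the only change needed to move between the coarse and the fine output, which is one simple way to serve applications that want different granularities.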

Basic Applications for Web Search (cont.)

• Query segmentation
  – Segment a query into chunks
  – Scenario: document ranking
    • What terms must appear contiguously in the retrieved documents?
  – Example chunkings of 功夫熊猫在线观看:
    • 功夫 / 熊猫 / 在线 / 观看
    • 功夫熊猫 / 在线观看
    • 功夫熊猫在线观看

Basic Applications for Web Search (cont.)

• Query term importance rating
  – Estimate importance rating for each query term
  – Scenario: document ranking
    • Which query terms MUST be matched in the retrieved documents?
  – Example: 功夫熊猫在线观看

    term      rating
    功夫熊猫  3
    在线      2
    观看      1
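One way term ratings could feed document ranking is sketched below. This is an illustrative assumption, not the method of any particular search engine: terms rated at or above a hypothetical `must_threshold` are required, and the remaining terms only add to the score.

```python
def rank_documents(term_ratings, docs, must_threshold=3):
    """Keep documents containing every must-match term; score by matched ratings."""
    must = [t for t, r in term_ratings.items() if r >= must_threshold]
    ranked = []
    for doc_id, text in docs.items():
        if not all(t in text for t in must):
            continue  # a required term is missing, so drop the document
        score = sum(r for t, r in term_ratings.items() if t in text)
        ranked.append((doc_id, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Ratings from the slide; the three documents are made up for the example.
ratings = {"功夫熊猫": 3, "在线": 2, "观看": 1}
docs = {
    "d1": "功夫熊猫 高清 在线 观看",
    "d2": "功夫熊猫 影评",
    "d3": "在线 观看 电影",
}
print(rank_documents(ratings, docs))
```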

Basic Applications for Web Search (cont.)

• Query rewriting
  – User queries are never perfect
    • Contain errors
      – E.g., 牛肉顿萝卜 (wrong character; intended 牛肉炖萝卜, "beef stewed with radish")
    • Too short
      – E.g., 牛肉 萝卜 ("beef radish")
    • Too verbose
      – E.g., 哪位朋友可以告诉我牛肉炖萝卜该怎么做啊??? ("Can any friend tell me how to make beef stewed with radish???")
    • Use less common expressions
      – E.g., 牛肉烧萝卜 ("beef braised with radish")
    • ……

Basic Applications for Web Search (cont.)

• Query rewriting (cont.)
  – Involves the following NLP techniques:
    • Error detection & correction
      – Collecting query correction pairs
    • Query expansion
      – Learning expansion terms from query logs or web docs
    • Query reduction
      – Query term importance rating
    • Query paraphrasing & entailment
      – Synonymous resource extraction
    • ……

Basic Applications for Web Search (cont.)

• Query intent analysis

Advanced Applications for Web Search (cont.)

• Question answering

question

answer

Advanced Applications for Web Search (cont.)

• Summarization

Summaries rather than snippets

Advanced Applications for Web Search (cont.)

• Word sense disambiguation

Advanced Applications for Web Search (cont.)

• Search result clustering

Advanced Applications for Web Search (cont.)

• Information extraction

Advanced Applications for Web Search (cont.)

• Ontology

Evolution of Traditional Topics

• Example-1: word segmentation
  – Demands from web applications (especially search engines)
    • High efficiency
      – To process tens of billions of web documents
    • Frequent updates
      – To recognize new terms / concepts / named entities…
    • Flexibility
      – Different applications require different segmentation outputs

Evolution of Traditional Topics (cont.)

• Example-1: word segmentation (cont.)
  – Solutions:
    • Lightweight model
      – Efficient
    • Learn new terms and NEs
      – Mine web corpora and query logs
      – New entries can easily be added to the segmentation dictionary
    • Various granularities
      – Customize for different applications

Evolution of Traditional Topics (cont.)

• Example-2: machine translation
  – Machine translation methods
    • Rule based systems
      – E.g., Systran
    • Statistical machine translation
      – Google, Baidu …
  – Difficulties
    • Data sparseness
    • Models are too large
    • Not fast enough

Evolution of Traditional Topics (cont.)

• Example-2: machine translation (cont.)
  – Online translation
    • Web service instead of software
    • Collect tens of millions of multilingual parallel / comparable sentence pairs from the web for model training
    • Extract translations for idioms, named entities, and new terms from the web
    • Both translation model and language model are extremely large

Evolution of Traditional Topics (cont.)

• Example-2: machine translation (cont.)
  – Challenges to online translation
    • Quality control for the automatically collected data
    • Model selection
    • Model compression
    • Distributed storage and computing
    • Model update
    • Fast decoding
    • Domain adaptation
    • ……

Evolution of Traditional Topics (cont.)

• Example-2: machine translation (cont.)

Reordering problem

Evolution of Traditional Topics (cont.)

• Example-3: question answering
  – Stage-1:
    • Natural-language interface to expert systems
    • Within specific domains
  – Stage-2: web-based QA
    • Open domain
    • Built upon large web corpora
    • Mainly based on statistical methods
      – Correct answers appear more times than incorrect ones
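The redundancy idea in the last bullet can be sketched as simple frequency voting over candidate answers extracted from many retrieved snippets. The candidate list below is invented for illustration, and the normalization (strip and lowercase) is an assumption rather than part of the slides.

```python
from collections import Counter

def vote_answers(candidates):
    """Rank candidate answer strings by how often they were extracted."""
    counts = Counter(c.strip().lower() for c in candidates if c.strip())
    return counts.most_common()

# Hypothetical candidates for "When was Mozart born?" from several snippets.
candidates = ["1756", "1756 ", "1791", "1756", "Salzburg"]
ranking = vote_answers(candidates)
print(ranking[0])
```

Voting of this kind only helps when the web corpus repeats the right answer often, which is why it suits factoid questions better than descriptive ones.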

Evolution of Traditional Topics (cont.)

• Example-3: question answering (cont.)
  – Stage-2: web-based QA (cont.)
    • Main modules
      – Question classifier
      – Search engine
      – Answer extractor
    • Deep analysis
      – Morphological analysis, syntactic analysis, NER, WSD, coreference resolution, inference, reasoning, ontology,…
    • Usually works well on factoid questions

Evolution of Traditional Topics (cont.)

• Example-3: question answering (cont.)
  – Stage-3: community-based QA
    • Users ask questions and wait for other users to answer them
    • Answers have higher coverage than those from automatic QA systems
    • Previously asked questions and answers can be searched by other users
    • Can work well on descriptive and subjective questions

New Research Topics

• Sentiment analysis

• Wikipedia-based research

• Microblog-based research

• Crowdsourcing

• ……

New Data & Resources

• Treasures from WWW

Large-scale Web corpora

Query logs with user behaviors

User generated content (UGC)

Large-scale Web Corpora

Large-scale Web Corpora (cont.)

• Large-scale web corpora for NLP
  – Statistics of distributions
    • N-grams / Language model
    • Collocations
    • Co-occurrence
    • ……
  – Data mining
    • Information extraction / answer extraction for QA
    • Paraphrase / entailment rules acquisition
    • Bilingual data collection
    • ……

Large-scale Web Corpora (cont.)

• Statistics of distributions
  – Example: N-grams / language model
    • E.g., Google Web 1T 5-gram
    • Source data:
      – 1 trillion word tokens from web pages
    • Data size:

      Number of tokens      1,024,908,267,229
      Number of sentences      95,119,665,584
      Number of unigrams           13,588,391
      Number of bigrams           314,843,401
      Number of trigrams          977,069,902
      Number of fourgrams       1,313,818,354
      Number of fivegrams       1,176,470,663

Large-scale Web Corpora (cont.)

• Statistics of distributions (cont.)
  – Applications of web n-grams
    • Example: lexical substitution
      – Substitute words in a sentence with their synonyms that fit in the given context
      – Two stages:
        » Extract candidate substitutes from thesauri
        » Rank candidates according to their fitness in the given context

Large-scale Web Corpora (cont.)

• Statistics of distributions (cont.)
  – Lexical substitution
    • Stage-2: candidate ranking [Giuliano et al., 2007]
      – Original fragment: H_l w H_r (left context H_l, target word w, right context H_r)
      – Generated fragment: H_l e H_r (target word replaced by candidate e)
      – Count the frequency of the generated fragment using Google 5-grams

Giuliano et al. FBK-irst: Lexical Substitution Task Exploiting Domain and Syntagmatic Coherence.
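The candidate ranking step can be sketched as a lookup of each generated fragment H_l e H_r in an n-gram count table. This is a simplified illustration, not the FBK-irst system: the function name `rank_substitutes` is hypothetical and the counts are made-up toy numbers standing in for Google 5-gram frequencies.

```python
def rank_substitutes(left, right, candidates, ngram_counts):
    """Score each candidate e by the count of the fragment left + [e] + right."""
    scored = []
    for e in candidates:
        fragment = tuple(left + [e] + right)
        scored.append((e, ngram_counts.get(fragment, 0)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy count table (assumed values, keyed by token tuples).
toy_counts = {
    ("a", "brilliant", "idea"): 900,
    ("a", "smart", "idea"): 400,
}
# Target word "bright" in "a bright idea"; candidates would come from a thesaurus.
ranking = rank_substitutes(["a"], ["idea"], ["smart", "brilliant", "shiny"], toy_counts)
print(ranking)
```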

Large-scale Web Corpora (cont.)

• Data mining
  – Example-1: Learning surface patterns for QA [Ravichandran and Hovy, ACL-2002]
    • Question taxonomy: BIRTHDAY
    • Given seed: (Mozart, 1756)
    • Learnt paraphrase patterns with scores:

      Score  Paraphrase pattern
      1.00   <NAME> ( <ANSWER> - )
      0.85   <NAME> was born on <ANSWER>,
      0.60   <NAME> was born in <ANSWER>
      0.59   <NAME> was born <ANSWER>
      0.53   <ANSWER> <NAME> was born
      0.50   – <NAME> ( <ANSWER>
      0.36   <NAME> ( <ANSWER> -

Ravichandran and Hovy. Learning Surface Text Patterns for a Question Answering System.

Large-scale Web Corpora (cont.)

• Example-1: Learning surface patterns for QA (cont.)
  – Main steps for learning patterns:
    • Seed selection
      – E.g., Mozart 1756 for BIRTHDAY
    • Submit the seed to a search engine
    • Download the top-1000 web documents
    • Retain sentences containing the Q and A terms
    • Pass each sentence through a suffix tree constructor
    • Retain phrases in the suffix tree that contain both the Q and A terms
    • Replace the Q term with <NAME> and the A term with <ANSWER>

Ravichandran and Hovy. Learning Surface Text Patterns for a Question Answering System.
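The last three steps can be approximated in a few lines. Note the simplification: instead of a suffix tree, this sketch keeps only the shortest span that covers both terms in each sentence, so it yields one pattern per sentence rather than all shared substrings. The function name and the sample sentences are assumptions for illustration.

```python
def extract_patterns(sentences, q_term, a_term):
    """Turn sentences containing both terms into placeholder patterns."""
    patterns = []
    for s in sentences:
        qi, ai = s.find(q_term), s.find(a_term)
        if qi < 0 or ai < 0:
            continue  # keep only sentences containing both the Q and A terms
        start = min(qi, ai)
        end = max(qi + len(q_term), ai + len(a_term))
        span = s[start:end]
        patterns.append(span.replace(q_term, "<NAME>").replace(a_term, "<ANSWER>"))
    return patterns

sentences = [
    "Mozart (1756 - 1791) was a composer.",
    "Mozart was born in 1756 in Salzburg.",
    "Beethoven admired him.",
]
patterns = extract_patterns(sentences, "Mozart", "1756")
print(patterns)
```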

Large-scale Web Corpora (cont.)

• Example-1: Learning surface patterns for QA (cont.)
  – Calculate precision of each pattern:
    • Query the search engine with the Q term
      – E.g., Mozart
    • Download top-1000 web documents
    • Retain sentences that contain the Q term
    • For each learnt pattern, compute the percentage of matches in which the correct answer occurs in the <ANSWER> slot
    • Return only the patterns matching a sufficient number of seeds (>5)

Ravichandran and Hovy. Learning Surface Text Patterns for a Question Answering System.
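Pattern precision, as described above, is the share of pattern matches whose <ANSWER> slot holds the correct answer. A minimal sketch follows; it assumes the answer slot is a single word (`\w+`), which is a simplification, and the sentences are invented examples.

```python
import re

def pattern_precision(pattern, sentences, q_term, correct_answer):
    """Fraction of matches of `pattern` whose <ANSWER> slot equals correct_answer."""
    regex = re.escape(pattern)
    regex = regex.replace(re.escape("<NAME>"), re.escape(q_term))
    regex = regex.replace(re.escape("<ANSWER>"), r"(\w+)")  # assumed one-word answers
    total = correct = 0
    for sentence in sentences:
        for answer in re.findall(regex, sentence):
            total += 1
            correct += answer == correct_answer
    return correct / total if total else 0.0

retrieved = [
    "Mozart was born in 1756.",
    "Mozart was born in Salzburg.",
    "Wolfgang Amadeus Mozart was born in 1756 to Leopold.",
]
precision = pattern_precision("<NAME> was born in <ANSWER>", retrieved, "Mozart", "1756")
print(round(precision, 2))
```

The example also shows why "was born in" scores lower than "was born on" in the learnt table: the slot after "in" is often filled by a place rather than a date.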

Large-scale Web Corpora (cont.)

• Example-2: Mining parallel data
  – Mining parallel data from bilingual websites [Shi et al., ACL-2006]
    • Basic idea:
      – One can identify parallel web pages from bilingual websites, and further extract bilingual sentences from them.
      – If two hyperlinks in two parallel web pages are aligned, then their corresponding web pages are also likely to be parallel

Shi et al. A DOM Tree Alignment Model for Mining Parallel Data from the Web.

Large-scale Web Corpora (cont.)

• Example of bilingual websites

Parallel web pages

Large-scale Web Corpora (cont.)

• Example-2: Mining parallel data (cont.)
  – Mining parallel data from bilingual websites
    • Main steps:
      – Identify bilingual websites using trigger words
        » E.g., English, English Version, 中文 ("Chinese"), 中文版 ("Chinese version")
      – Identify bilingual web pages with a classifier
        » Features: length ratio; HTML tag similarity; sentence alignment score
      – DOM tree alignment for bilingual parallel web pages
      – Sentence alignment within aligned text chunks
      – Recursively mine parallel hyperlinks

Shi et al. A DOM Tree Alignment Model for Mining Parallel Data from the Web.
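Two of the classifier features listed above, length ratio and HTML tag similarity, can be computed with the standard library. This is a rough sketch and not the feature definitions of Shi et al.: the regex-based tag extraction, the use of `difflib.SequenceMatcher`, and the two tiny pages are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

def tag_sequence(html):
    """Ordered list of tag names (closing tags keep their leading slash)."""
    return re.findall(r"<\s*(/?[a-zA-Z][a-zA-Z0-9]*)", html)

def tag_similarity(html_a, html_b):
    """Similarity of the two tag sequences, in [0, 1]."""
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()

def length_ratio(html_a, html_b):
    """Shorter visible-text length divided by the longer one."""
    text_a = re.sub(r"<[^>]+>", "", html_a)
    text_b = re.sub(r"<[^>]+>", "", html_b)
    longer = max(len(text_a), len(text_b))
    return min(len(text_a), len(text_b)) / longer if longer else 0.0

page_en = "<html><body><h1>News</h1><p>Don't worry.</p></body></html>"
page_zh = "<html><body><h1>新闻</h1><p>别担心。</p></body></html>"
print(tag_similarity(page_en, page_zh), length_ratio(page_en, page_zh))
```

In practice the raw character ratio would need a language-pair correction, since Chinese text is much shorter in characters than its English translation.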

Large-scale Web Corpora (cont.)

• Example-2: Mining parallel data (cont.)
  – Mining parallel data from bilingual web pages [Jiang et al., ACL-2009]
    • Disadvantage of the above method
      – The number of bilingual websites is small, thus the volume of extracted bilingual data is limited.
    • Basic idea:
      – In many bilingual web pages bilingual data appear collectively and follow similar surface patterns.

Jiang et al. Mining Bilingual Data from the Web with Adaptively Learnt Patterns.

Large-scale Web Corpora (cont.)

• Examples of collective bilingual pages

Bilingual terms

Bilingual sentences

Large-scale Web Corpora (cont.)

• Example-2: Mining parallel data (cont.)– Mining parallel data from bilingual web pages

• Main steps:

Jiang et al. Mining Bilingual Data from the Web with Adaptively Learnt Patterns.

Large-scale Web Corpora (cont.)

• Example-2: Mining parallel data (cont.)
  – Mining parallel data from bilingual web pages
    • Preprocessing:
      – Parse web documents into DOM trees
      – Segment text into snippets according to languages
    • Seed mining:
      – Judge any adjacent E/C snippet pair
      – Compute the likelihood of being a translation pair
        » Combines a translation model and a transliteration model

Jiang et al. Mining Bilingual Data from the Web with Adaptively Learnt Patterns.

Large-scale Web Corpora (cont.)

• Example-2: Mining parallel data (cont.)
  – Mining parallel data from bilingual web pages
    • Pattern learning:
      – Candidate pattern extraction
      – Pattern selection with an SVM classifier
        » Features: generality; average translation score; length; irregularity
    • Pattern-based translation mining
      – Example text: 7. Don’t worry. 别担心。
      – Example pattern: [N][P][S][E][P][S][C][P]

Jiang et al. Mining Bilingual Data from the Web with Adaptively Learnt Patterns.

Large-scale Web Corpora (cont.)

• Example-3: Improve parsing
  – Web-scale features for parsing [Bansal and Klein, ACL-2011]
    • Motivation:
      – Web counts are powerful syntactic cues
      – E.g., "raising from" is much more frequent on the web than "$ x billion from"
    • Basic idea:
      – Generate web count features to address the full range of syntactic attachments

Bansal and Klein. Web-Scale Features for Full-Scale Parsing.

Large-scale Web Corpora (cont.)

• Example-3: Improve parsing (cont.)
  – Web-scale features for parsing
    • Affinity features
      – E.g., lexical co-occurrence counts from large corpora
    • Paraphrase features
  – Web corpus:
    • Google n-gram corpus

Bansal and Klein. Web-Scale Features for Full-Scale Parsing.

Large-scale Web Corpora (cont.)

• Example-3: Improve parsing (cont.)
  – Affinity features
    • Given a head-argument pair (h, a), define the adjacent count feature:
      – ADJ
        » Count of query q = "h a" or q = "a h" in the web corpus
      – ADJ ∧ POS(h) ∧ POS(a)
        » Specific to each pair of POS tags
      – ADJ ∧ POS(h) ∧ POS(a) ∧ b
        » b is the binned query count
      – Other complex features

Bansal and Klein. Web-Scale Features for Full-Scale Parsing.
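The feature templates above can be sketched as string-valued indicator features built from a count lookup. Several details here are assumptions rather than the paper's exact definitions: the two word orders are summed into one count, binning uses the number of decimal digits (a log10 bin), and the count table is a made-up stand-in for the Google n-gram corpus.

```python
import math

def adjacency_features(h, a, pos_h, pos_a, bigram_counts):
    """Build ADJ∧POS(h)∧POS(a) and ADJ∧POS(h)∧POS(a)∧b feature names."""
    # Assumption: sum the counts of "h a" and "a h".
    count = bigram_counts.get((h, a), 0) + bigram_counts.get((a, h), 0)
    # Assumption: bin b = 0 for unseen pairs, else floor(log10(count)) + 1.
    b = 0 if count == 0 else int(math.log10(count)) + 1
    return [f"ADJ∧{pos_h}∧{pos_a}", f"ADJ∧{pos_h}∧{pos_a}∧{b}"]

toy_counts = {("raising", "from"): 1200}  # toy value, not a real web count
features = adjacency_features("raising", "from", "VBG", "IN", toy_counts)
print(features)
```

Binning keeps the feature space small: the parser learns one weight per (POS pair, bin) rather than per raw count.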

Large-scale Web Corpora (cont.)

• Example-3: Improve parsing (cont.)
  – Paraphrase features
    • Form:
      – PARA ∧ POS(h) ∧ POS(a) ∧ c ∧ p ∧ dir
    • Example:
      – PARA ∧ VBG ∧ IN ∧ it ∧ MIDDLE ∧ →
    • Explanation:
      – If frequent occurrences of "raising it from" indicated a correct attachment between "raising" and "from", frequent occurrences of "lowering it with" will indicate the correctness of an attachment between "lowering" and "with".

Bansal and Klein. Web-Scale Features for Full-Scale Parsing.

Page 66: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors

• Search engine query logs:

– Log data recording users’ queries along with their corresponding behaviors with the search engine

– What can be mined in query logs?

• Query: keywords that users search with the search engine

• Clicks: urls that users click after submitting queries

• Session: a sequence of queries and clicks from the same user within a short time interval

• Other information: search time, user ID, user browsing…
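As a rough sketch of how sessions might be recovered from such a log, the snippet below groups time-ordered records by user ID and starts a new session whenever the gap between two records of the same user exceeds a threshold. The record layout and the 30-minute threshold are assumptions for illustration, not something the slides specify.

```python
from datetime import datetime, timedelta


def sessionize(records, gap=timedelta(minutes=30)):
    """Group log records into sessions.

    records: iterable of (timestamp, user_id, query, click) tuples,
             sorted by time; timestamps look like 2011-08-21-09:02:32.
    gap:     maximum idle time inside one session (assumed value).
    """
    sessions = []
    last = {}  # user_id -> (time of last record, that user's open session)
    for ts, uid, query, click in records:
        t = datetime.strptime(ts, "%Y-%m-%d-%H:%M:%S")
        if uid in last and t - last[uid][0] <= gap:
            session = last[uid][1]
        else:
            session = []
            sessions.append(session)
        session.append((ts, query, click))
        last[uid] = (t, session)
    return sessions


log = [
    ("2011-08-21-09:02:32", "12345", "query a", "-"),
    ("2011-08-21-09:02:40", "12345", "query a movie", "url1"),
    ("2011-08-21-09:05:18", "12346", "query b", "-"),
    ("2011-08-21-10:30:00", "12345", "query c", "-"),
]
print([len(s) for s in sessionize(log)])
```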

Page 67: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Search engine query logs (cont.)–An example:

time | id | query | click

2011-08-21-09:02:32 | 12345 | 非诚勿扰 | -

2011-08-21-09:02:40 | 12345 | 非诚勿扰电影 | url1, title: 非诚勿扰电影 百度视频

2011-08-21-09:04:51 | 12345 | - | url2, title: 非诚勿扰-电影-高清在线观看

2011-08-21-09:05:12 | 12345 | 非诚勿扰2 | url3, title: 非诚勿扰2-高清在线观看

2011-08-21-09:05:18 | 12346 | 姚晨微波 | -

2011-08-21-09:05:39 | 12346 | 姚晨微博 | url4, title: 姚晨的微博 新浪微博-随时…

2011-08-21-09:05:43 | 12347 | 中国期刊网 | url5, title: 中国知网首页

2011-08-21-09:05:48 | 12348 | 非诚勿扰2 在线 | url3, title: 非诚勿扰2-高清在线观看

Query expansion

NE extraction

Spelling error correction

Query clustering

Page 68: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Query logs for IR and NLP–Query clustering

–Query intent recognition

–Query rewriting

• Expansion, reduction, synonymous reformulation, error correction

–Query suggestion

–Learning-to-rank

–Named entity recognition

–Ontology construction

–……

Page 69: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-1: Query clustering–Wen et al., 2002

–Measure query similarity from two aspects:• Content similarity

– Similarity of content words in two queries

• Click-through similarity

– Similarity of user clicked urls for two queries

• Combined similarity

Wen et al. Query Clustering Using User Logs.

$similarity = \alpha \cdot similarity_{content} + \beta \cdot similarity_{cross\_ref}$

Page 70: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-1: Query clustering (cont.)–Content similarity

• Based on keywords or phrases

• Based on edit distance

Wen et al. Query Clustering Using User Logs.

Word overlap rate

Cosine sim.

Page 71: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-1: Query clustering (cont.)–Click-through similarity

• Through single document

• Through document hierarchy

Wen et al. Query Clustering Using User Logs.

Overlap rate of the clicked URLs

Lowest common parent node
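A minimal sketch of the combined similarity, assuming a simple set-overlap (Jaccard) measure for both the keyword-based content similarity and the single-document click-through similarity. The exact normalization used by Wen et al. may differ, and the weights `alpha` and `beta` are free parameters chosen here only for illustration.

```python
def overlap(a, b):
    """Set-overlap rate (Jaccard) between two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_similarity(q1, q2, clicks1, clicks2, alpha=0.5, beta=0.5):
    """alpha * content similarity + beta * click-through similarity."""
    content = overlap(q1.split(), q2.split())   # shared keywords
    cross_ref = overlap(clicks1, clicks2)       # shared clicked URLs
    return alpha * content + beta * cross_ref


print(combined_similarity(
    "cheap flights paris", "cheap paris hotels",
    {"url1", "url2"}, {"url2", "url3"},
))
```

Two queries with no words in common can still score above zero when users click the same URLs for both, which is the point of adding the click-through term.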

Page 72: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-2: Building NE-query intent taxonomy

–Yin and Shah, WWW-2010

Yin and Shah. Building Taxonomy of Web Search Intents for Name Entity Queries.

Mining intent phrases for NE-

queries and organizing them in a

taxonomy

Page 73: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-2: Building NE-query intent taxonomy (cont.)–Three main steps:

• Identify search intents for a class of entities

– E.g., download, mp3, mv for music

• Infer relationship between two intent phrases

– E.g., synonyms: pictures / pics;

hypernyms: wallpapers / pictures

• Organize intent phrases into a tree

Yin and Shah. Building Taxonomy of Web Search Intents for Name Entity Queries.

Page 74: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-2: Building NE-query intent taxonomy (cont.)– Intent phrase identification

• Extract phrases that co-appear with NEs of a given class from query logs

– Infer relations between intent phrases• Basic idea:

– Given an NE e and its intent phrases w1 and w2, if the clicked urls for “e+w1” also satisfy “e+w2”, then w1 and w2 are taken to be related

–Organize intent phrases• Three approaches:

– Directed maximum spanning tree; hierarchical agglomerative clustering; Pachinko allocation models

Yin and Shah. Building Taxonomy of Web Search Intents for Name Entity Queries.

Page 75: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-3: Query substitution

– Jones et al., WWW-2006

–Example:

Jones et al. Generating Query Substitutions.

Query-level substitution

Phrase-level substitution

Page 76: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-3: Query substitution (cont.)–Collect reformulation query / phrase pairs

• Using query sessions

• Query pair:

– Successive queries issued by a single user

– E.g., britney spears mp3s -> britney spears lyrics

• Phrase pairs:

– Segment queries into phrases

– Retain queries in which only one segment has changed

– E.g., (britney spears) (mp3s) -> (britney spears) (lyrics)

• Measure relatedness of the query / phrase pairs

– Log likelihood ratio (LLR) score

Jones et al. Generating Query Substitutions.
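The pair-collection step can be sketched as follows. The code assumes queries in a session are already time-ordered and that segmentation into phrases has been done elsewhere; both function names are hypothetical, and the LLR scoring step is not shown.

```python
def successive_query_pairs(session):
    """Pair each query with the next one issued in the same session."""
    return list(zip(session, session[1:]))


def phrase_pair(segments_a, segments_b):
    """Return (old phrase, new phrase) if exactly one segment changed.

    Queries whose segment counts differ, or in which zero or several
    segments changed, yield None and are discarded.
    """
    if len(segments_a) != len(segments_b):
        return None
    diffs = [(a, b) for a, b in zip(segments_a, segments_b) if a != b]
    return diffs[0] if len(diffs) == 1 else None


session = ["britney spears mp3s", "britney spears lyrics"]
print(successive_query_pairs(session))
print(phrase_pair(["britney spears", "mp3s"], ["britney spears", "lyrics"]))
```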

Page 77: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-3: Query substitution (cont.)–Generate substitutions

• Query-level substitution for frequent queries; phrase-level substitution for infrequent ones

–Rank candidate substitutions• Machine learning methods

– Linear regression

– Binary classification

• Features:

– Query length, #segments, %Alphabetic characters, edit distance, #segments substituted, #tokens shared, size of prefix overlapping, LLR, frequency, mutual information,…

Jones et al. Generating Query Substitutions.

Page 78: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-4: Optimizing page ranking

– Joachims, SIGKDD-2002

Joachims. Optimizing Search Engines using Clickthrough Data.

link3 <r* link2

link7 <r* link2

link7 <r* link4

link7 <r* link5

link7 <r* link6

Links 1, 3, 7 are clicked
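The preference pairs listed above follow a simple rule: a clicked link is preferred over every unclicked link ranked above it. A minimal sketch of that rule, with link names and the seven-item ranking chosen only for illustration:

```python
def preference_pairs(ranking, clicked):
    """Derive (preferred, skipped) pairs from one result list.

    ranking: links in the order they were shown to the user.
    clicked: the links the user clicked.
    Each pair (a, b) reads "a <r* b": a should rank ahead of b.
    """
    clicked = set(clicked)
    pairs = []
    for i, link in enumerate(ranking):
        if link not in clicked:
            continue
        for earlier in ranking[:i]:
            if earlier not in clicked:
                pairs.append((link, earlier))
    return pairs


ranking = [f"link{i}" for i in range(1, 8)]
for preferred, skipped in preference_pairs(ranking, {"link1", "link3", "link7"}):
    print(f"{preferred} <r* {skipped}")
```

The feedback is partial: nothing is inferred about links ranked below the last click, or about the relative order of two clicked links.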

Page 79: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-4: Optimizing page ranking (cont.)

– Defined as a ranking problem:

• Given:

$(q_1, r_1^*), (q_2, r_2^*), \ldots, (q_n, r_n^*)$

» n: size of training data; q_i: query; r_i^*: target ranking

• Maximize:

$\tau_S(f) = \frac{1}{n} \sum_{i=1}^{n} \tau(r_{f(q_i)}, r_i^*)$

Kendall’s τ: $\tau(r_a, r_b) = \frac{P - Q}{P + Q}$, where P = #concordant pairs and Q = #discordant pairs

Joachims. Optimizing Search Engines using Clickthrough Data.
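To make the objective concrete, here is a small sketch that computes Kendall's τ between two rankings as (P − Q) / (P + Q) and averages it over a training set. It assumes rankings without ties over at least two items, each given as a dict from item to rank position; the function names are illustrative.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items (no ties)."""
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        agreement = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if agreement > 0:
            concordant += 1   # both rankings order x and y the same way
        elif agreement < 0:
            discordant += 1   # the rankings disagree on x and y
    return (concordant - discordant) / (concordant + discordant)


def mean_tau(predicted, targets):
    """Average tau over n (predicted ranking, target ranking) pairs."""
    return sum(kendall_tau(p, t) for p, t in zip(predicted, targets)) / len(targets)


target = {"d1": 1, "d2": 2, "d3": 3}
system = {"d1": 1, "d2": 3, "d3": 2}
print(kendall_tau(system, target))
```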

Page 80: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-4: Optimizing page ranking (cont.)

– Ranking SVM

Joachims. Optimizing Search Engines using Clickthrough Data.

Using partial

feedback

Page 81: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-5: Ontology construction

– Sekine and Suzuki, WWW-2007

– Basic idea:

• Entities belonging to the same class should appear in similar contexts

– Data:

• Query logs (only queries are used)

– Main steps:

• Step-1: Extract typical contexts for a given class

• Step-2: Find new entities belonging to the class

Sekine and Suzuki. Acquiring Ontological Knowledge from Query Logs.

Page 82: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-5: Ontology construction (cont.)

Sekine and Suzuki. Acquiring Ontological Knowledge from Query Logs.

Examples of class “Award”

Typical context words for “Award”

New entities for the class “Award”

Page 83: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Example-6: Paraphrase acquisition

– Zhao et al., 2010

– Corpus

• Query logs (queries and titles) of a search engine

– Assumption

• If a query q hits a title t, then q and t are likely to be paraphrases

• If queries q1 and q2 hit the same title t, then q1 and q2 are likely to be paraphrases

• If a query q hits titles t1 and t2, then t1 and t2 are likely to be paraphrases

Zhao et al. Paraphrasing with Search Engine Query Logs.

Page 84: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

关于 草原 的 诗词

描写 草原 的 诗句

有关 草原 的 诗歌

……

……

q1

t1

t2

赞美 大 草原 的 诗q2

Paraphrases:

<q1, t1>

<q1, t2>

<q2, t1>

<q1,q2>

<t1,t2>

query-title

query-query

title-title

Example:

Zhao et al. Paraphrasing with Search Engine Query Logs.

Page 85: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Query Logs with User Behaviors (cont.)

• Step-1: extracting <q, t> paraphrases

– Extracting candidate <q, t> pairs from query logs

– Paraphrase validation based on binary classification

• Combining multiple features

• Step-2: extracting <q, q> paraphrases– Extracting candidate <q, q> from <q, t> paraphrases

– Paraphrase validation based on binary classification

• Step-3: extracting <t, t> paraphrases – Extracting candidate <t, t> from <q, t> paraphrases

– Paraphrase validation based on binary classification

Zhao et al. Paraphrasing with Search Engine Query Logs.
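The candidate-extraction part of the three steps can be sketched as below. The input is a list of (query, title) hits; the binary classifier that validates each candidate is not shown, so the sets returned here are unfiltered candidates only, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_paraphrases(hits):
    """Return candidate <q,t>, <q,q> and <t,t> pairs from (query, title) hits."""
    hits = set(hits)
    titles_of = defaultdict(set)   # query -> titles it hits
    queries_of = defaultdict(set)  # title -> queries that hit it
    for q, t in hits:
        titles_of[q].add(t)
        queries_of[t].add(q)

    # Queries hitting the same title are candidate query-query paraphrases.
    qq = {tuple(sorted(p)) for qs in queries_of.values() for p in combinations(qs, 2)}
    # Titles hit by the same query are candidate title-title paraphrases.
    tt = {tuple(sorted(p)) for ts in titles_of.values() for p in combinations(ts, 2)}
    return hits, qq, tt


qt, qq, tt = candidate_paraphrases([("q1", "t1"), ("q1", "t2"), ("q2", "t1")])
print(sorted(qt), sorted(qq), sorted(tt))
```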

Page 86: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

New Data & Resources

• Treasures from WWW

Large-scale Web corpora

Query logs with user behaviors

User generated content (UGC)

Page 87: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

User Generated Content (UGC)

UGC

cQA

Blogs / microblogs

Forums

Online encyclopedia


Page 88: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Forums

• Example: Mining QA pairs from forums–Cong et al., SIGIR-2008

–Find question-answer pairs from forum thread• Motivation:

– Initiating post usually contains questions, while reply posts may contain answers

• Two main stages

– Question detection

– Answer detection

Cong et al. Finding Question-Answer Pairs from Online Forums.

Page 89: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Forums (cont.)

• Example: Mining QA pairs from forums (cont.)–Question detection

• Non-trivial problem

– Cannot simply rely on question marks or question words

• Classification-based method

– Feature: Labeled Sequential Patterns (LSPs) extracted from questions and non-questions

» E.g., <what, do, PRP, VB>→Q

Cong et al. Finding Question-Answer Pairs from Online Forums.

Page 90: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Forums (cont.)

• Example: Mining QA pairs from forums (cont.)–Answer detection

• Problems:

– Multiple questions and answers interweaved together

– 1-question vs. n-answer; n-question vs. 1-answer

• Graph-based method, which considers:

– Rank of candidate answers for a given query

– Relationship of candidate answers

– Forum-specific feature: distance of a candidate answer from the question

Cong et al. Finding Question-Answer Pairs from Online Forums.

Page 91: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

User Generated Content (UGC)

UGC

cQA

Blogs / microblogs

Forums

Online encyclopedia

Page 92: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Community-based QA (cQA)

• Research on cQA–Search and Recommendation

• cQA retrieval model

• Question similarity computation

• Multi-sentence question segmentation

• ……

–Quality estimation• Question quality

• Answer quality

• User (questioner / answerer) quality

Page 93: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Community-based QA (cQA) (cont.)

• Example-1: cQA retrieval model

– Xue et al., SIGIR-2008

– Basic idea:

• In cQA retrieval, both question parts and answer parts should be modeled

– Propose a mixed model:

• A translation-based language model for the question part

• A query likelihood approach for the answer part

Xue et al. Retrieval Models for Question and Answer Archives.

Page 94: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Community-based QA (cQA) (cont.)

• Example-1: CQA retrieval model (cont.)–Translation-based language model for the

question part

Xue et al. Retrieval Models for Question and Answer Archives.

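$P(\mathbf{q} \mid (q,a)) = \prod_{w \in \mathbf{q}} P(w \mid (q,a))$

$P(w \mid (q,a)) = \frac{|(q,a)|}{|(q,a)| + \lambda} P_{mx}(w \mid (q,a)) + \frac{\lambda}{|(q,a)| + \lambda} P_{ml}(w \mid C)$

$P_{mx}(w \mid (q,a)) = (1 - \beta) P_{ml}(w \mid q) + \beta \sum_{t \in q} P(w \mid t) P_{ml}(t \mid q)$

Smoothing: the λ-weighted collection term $P_{ml}(w \mid C)$

Language model: $P_{ml}(w \mid q)$

Translation model trained with Q-A pairs in cQA archives: $P(w \mid t)$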

Page 95: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Community-based QA (cQA) (cont.)

• Example-1: CQA retrieval model (cont.)– Incorporating the answer part

Xue et al. Retrieval Models for Question and Answer Archives.

$P_{mx}(w \mid (q,a)) = (1 - \beta) P_{ml}(w \mid q) + \beta \sum_{t \in q} P(w \mid t) P_{ml}(t \mid q)$

$P_{mx}(w \mid (q,a)) = \alpha P_{ml}(w \mid q) + \beta \sum_{t \in q} P(w \mid t) P_{ml}(t \mid q) + \gamma P_{ml}(w \mid a)$

Query likelihood model for the answer part: $\gamma P_{ml}(w \mid a)$
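A small sketch of the mixture with the answer part included. Maximum-likelihood estimates are plain relative frequencies, and the word-to-word translation table `trans` (mapping a pair (w, t) to P(w | t)) is a hand-made stand-in for one trained on Q-A pairs. The weights are illustrative values, and the outer smoothing with the collection model is left out.

```python
from collections import Counter


def p_ml(word, tokens):
    """Maximum-likelihood estimate: relative frequency of word in tokens."""
    return Counter(tokens)[word] / len(tokens) if tokens else 0.0


def p_mx(word, question, answer, trans, alpha, beta, gamma):
    """alpha * P_ml(w|q) + beta * sum_t P(w|t) P_ml(t|q) + gamma * P_ml(w|a)."""
    translated = sum(
        trans.get((word, t), 0.0) * p_ml(t, question) for t in set(question)
    )
    return (
        alpha * p_ml(word, question)
        + beta * translated
        + gamma * p_ml(word, answer)
    )


question = ["fix", "laptop"]
answer = ["replace", "battery"]
trans = {("repair", "fix"): 0.4}  # hypothetical P(repair | fix)
print(p_mx("repair", question, answer, trans, alpha=0.5, beta=0.3, gamma=0.2))
```

The query word "repair" appears in neither the question nor the answer, yet it receives non-zero probability through the translation term, which is how the model bridges the lexical gap between queries and archived questions.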

Page 96: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Community-based QA (cQA) (cont.)

• Example-2: Answer quality prediction– Jeon et al., SIGIR-2006

–Background:• There are plenty of bad answers in cQA archives

– Some users answer nonsense

– Some answers contain irrelevant advertisements

– Examples:

Jeon et al. A Framework to Predict the Quality of Answers with Non-Textual Features.

Q: What is the minimum positive real number in Matlab?

A: Your IQ

Q: What is new in Java2.0?

A: Nothing new

Q: Can I get a router if I have a usb dsl modem?

A: Good question but I do not know

Page 97: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Community-based QA (cQA) (cont.)

• Example-2: Answer quality prediction (cont.)–Measure answer quality with non-textual

features

Jeon et al. A Framework to Predict the Quality of Answers with Non-Textual Features.

Answerer’s Acceptance ratio Answer Length

Questioner’s Self Evaluation Answerer’s Activity Level

Answerer’s Category Specialty Print Count

Copy Count Users’ Recommendation

Editor’s Recommendation Sponsor’s Answer

Click Counts Number of Answers

Users’ Dis-Recommendation

Page 98: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Community-based QA (cQA) (cont.)

• Example-2: Answer quality prediction (cont.)–Maximum entropy for answer quality estimation

– Integrated into the cQA retrieval model

Jeon et al. A Framework to Predict the Quality of Answers with Non-Textual Features.

$p(y \mid x) = \frac{1}{Z(x)} \exp\left[ \sum_{i=1}^{13} \lambda_i f_i(x, y) \right]$

$P(D \mid Q) = P(D) \prod_{w \in Q} P(w \mid D)$

Prior probability of D: $P(D) = p(y \mid x = D)$

Page 99: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

User Generated Content (UGC)

UGC

cQA

Blogs / microblogs

Forums

Online encyclopedia

Page 100: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Online Encyclopedia

• How to make use of online encyclopedia?

– Clean and semi-structured data

• Applications: information extraction, relation extraction, summarization…

– Links among entries

• Applications: WSD, lexical reference rules extraction, NE recognition…

– Revision history

• Applications: Sentence compression, sentence simplification…

Page 101: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Online Encyclopedia (cont.)

• Example-1: Information extraction

– Wu and Weld, ACL-2010

– Main idea:

• Generate relation-specific training examples by matching Infobox attribute values to corresponding sentences

• Main modules:

– Preprocessor

– Matcher

– Learner

Wu and Weld. Open Information Extraction using Wikipedia.

Page 102: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Online Encyclopedia (cont.)

• Example-1: Information extraction (cont.)–An example

Wu and Weld. Open Information Extraction using Wikipedia.

Page 103: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Online Encyclopedia (cont.)

• Example-1: Information extraction (cont.)– Preprocessor

• Sentence splitting

• NLP annotation

• Compiling synonyms

– Using wikipedia redirection pages and backward links

–Matcher• Match target entity

– Full match, synonym match, partial match, type match, pronoun match…

• Match sentences

– Seek a unique sentence to match the attribute value

Wu and Weld. Open Information Extraction using Wikipedia.

Page 104: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Online Encyclopedia (cont.)

• Example-1: Information extraction (cont.)– Learn two kinds of extractors

• Extractor-1:

– Using dependency parse tree features

– Higher accuracy but slower

• Extractor-2:

– Using only shallow features, such as POS tags

– Lower accuracy but faster

Wu and Weld. Open Information Extraction using Wikipedia.

Page 105: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Online Encyclopedia (cont.)

• Example-2: Multilingual NER–Richman and Schone, ACL-2008

–Main idea:• English NE categorization

– Use category links

• Foreign NE categorization

– Find counterpart in English wikipedia pages

– Use category links for English NE categorization

– Foreign NE and its English counterpart share identical NE type

Richman and Schone. Mining Wiki Resources for Multilingual Named Entity Recognition.

Page 106: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Online Encyclopedia (cont.)

Richman and Schone. Mining Wiki Resources for Multilingual Named Entity Recognition.

Jacqueline Bhabha

Categories: British lawyers | Jewish American writers | Indian Jews

Lawyers by nationality

American writers by ethnic or national origin

Jewish writers

British legal professionals

Indian people by religion

Indian people by ethnic or national origin

Key phrase for PER

Page 107: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile


Online Encyclopedia (cont.)

Richman and Schone. Mining Wiki Resources for Multilingual Named Entity Recognition.

Catégories : Commune des Côtes-d'Armor | Ville portuaire de France | Port de plaisance | Station balnéaire française

French English

Category:Communes of Côtes-d'Armor

Category:Port cities and towns in France

Category:Marinas

Category:Seaside resorts in France

Erquy

Category:Towns in Brittany

Category:Cities in France

Category:Coastal construction

Category:Seaside resorts

Easy to identify as GPE

Page 108: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Online Encyclopedia (cont.)

• Example-3: Sentence compression–Yamangil and Nelken, ACL-2008

– Key problem for sentence compression

• Data sparseness

• Ziff-Davis corpus: 1067 sentence pairs

– Main idea of this work

• Abundant sentence compressions can be extracted from Wikipedia’s revision history, and used as training data

• >380,000 sentence pairs are extracted

Yamangil and Nelken. Mining Wikipedia Revision Histories for Improving Sentence Compression.

Page 109: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Online Encyclopedia (cont.)

• Example-3: Sentence compression (cont.)–Assumption:

• All edits retain the core meaning of the sentence

–Focus only on sentence-level edits that add or drop words

–Train a lexicalized channel model for sentence compression

Yamangil and Nelken. Mining Wikipedia Revision Histories for Improving Sentence Compression.
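A rough sketch of how a sentence-level edit that only adds or drops words might be detected: one revision must be a word subsequence of the other. This is an illustrative filter under that assumption, not the extraction pipeline of Yamangil and Nelken, and whitespace tokenization is a simplification.

```python
def is_compression(longer, shorter):
    """True if `shorter` results from `longer` by only dropping words."""
    remaining = iter(longer)
    # `word in remaining` consumes the iterator, so word order is enforced.
    return len(shorter) < len(longer) and all(w in remaining for w in shorter)


def compression_pair(old, new):
    """Return (long sentence, compressed sentence), or None if the edit
    does more than add or drop words."""
    a, b = old.split(), new.split()
    if is_compression(a, b):
        return (old, new)
    if is_compression(b, a):   # words were added: the older text is the short one
        return (new, old)
    return None


print(compression_pair("the very old house was sold", "the house was sold"))
```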

Page 110: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

User Generated Content (UGC)

UGC

cQA

Blogs / microblogs

Forums

Online encyclopedia

Page 111: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Blogs / Microblogs

• Research on blogs / microblogs–Adaptation of conventional NLP techniques

• Tokenization, POS tagging, spelling error correction,…

–User profile learning

–Search and recommendation

–Sentiment analysis

–Monitoring emergency events

–……

Page 112: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Blogs / Microblogs (cont.)

• Example-1: POS tagging for twitter–Gimpel et al., ACL-2011

– Background:

• Conventional NLP tools are typically trained on news texts, and perform poorly on Twitter

– This work:

• Produce an English POS tagger that is designed especially for Twitter data

Gimpel et al. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments.

Page 113: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Blogs / Microblogs (cont.)

• Example-1: POS tagging for twitter (cont.)–Tagset designed for Twitter

Gimpel et al. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments.

Page 114: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Blogs / Microblogs (cont.)

• Example-1: POS tagging for twitter (cont.)–Twitter-specific features

• TWORTH: Twitter orthography

– Regular expression rules to detect hashtags (#), at-mentions (@), and URLs (U)

• NAMES: Frequently-capitalized tokens

– If a token is frequently capitalized

• TAGDICT: Traditional tag dictionary

– Token’s POS tags from conventional corpora

• DISTSIM: Distributional similarity

– Similar words of the token

• METAPH: Phonetic normalization

– E.g., {thanks thangs thanksss …} -> 0NKS

Gimpel et al. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments.
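As an illustration of the TWORTH idea, the sketch below uses three regular expressions to flag hashtags, at-mentions, and URLs, assuming that U denotes URLs. The patterns are simplified assumptions and are not the rules from the tagger of Gimpel et al.

```python
import re

# Each rule pairs a feature label with a pattern over a single token.
TWORTH_RULES = [
    ("#", re.compile(r"^#\w+$")),                   # hashtag
    ("@", re.compile(r"^@\w+$")),                   # at-mention
    ("U", re.compile(r"^(https?://|www\.)\S+$")),   # URL
]


def tworth_feature(token):
    """Return the orthographic label of a token, or None if no rule fires."""
    for label, pattern in TWORTH_RULES:
        if pattern.match(token):
            return label
    return None


for tok in ["#nlp", "@user", "http://t.co/x", "thanks"]:
    print(tok, tworth_feature(tok))
```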

Page 115: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Blogs / Microblogs (cont.)

• Example-2: Sentiment classification– Jiang et al, ACL-2011

– Sentiment classification for Twitter should be target-dependent

• E.g.,

Jiang et al. Target-dependent Twitter Sentiment Classification.

People everywhere love Windows & Vista. Bill Gates

Windows 7 is much better than Vista

Page 116: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Blogs / Microblogs (cont.)

• Example-2: Sentiment classification (cont.)

– Sentiment classification for Twitter should be context-aware

• E.g.,

Jiang et al. Target-dependent Twitter Sentiment Classification.

[tweet_n] First game: Lakers!

[tweet_n-1] I love Lakers, I love Kobe!!

[Retweet] Lakers won the game, Great!

[Reply] I love Lakers too.

Positive

Positive

Positive

Positive

Page 117: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Blogs / Microblogs (cont.)

• Example-2: Sentiment classification (cont.)–Method of this work

• Target-dependent features

– Use features based on syntactic parse trees

» E.g., verb+target(obj); target(sub)+verb; adjective+target…

– Binary SVM classifier

• Graph-based sentiment optimization

– Three kinds of related tweets:

» Retweets; reply; tweets containing the target from the same person

– Construct a graph with related tweets and classify tweets on the graph

Jiang et al. Target-dependent Twitter Sentiment Classification.

Page 118: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Blogs / Microblogs (cont.)

• Example-3: User grouping –Qu and Liu, ACL-2011

– Motivation:

• People in a Twitter user’s following list need to be grouped if the list is long

– Main idea:

• Provide some seeding friends for each target class, and automatically group other friends similar / related to the seeds

Qu and Liu. Interactive Group Suggesting for Twitter.

Page 119: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Blogs / Microblogs (cont.)

• Example-3: User grouping (cont.)–Two sub-systems for the task

• Sub-system 1:

– Content based sub-system

» Compute similarity between each friend and the seeding friends based on their tweet contents

• Sub-system 2:

– Friend based sub-system

» Use the count of bi-directional friend relationships and mentions between each friend and seeding friends as the score for ranking

Qu and Liu. Interactive Group Suggesting for Twitter.

Page 120: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Outline

• Part I–Background

• Part II–New Applications & Trends

• Part III–New Data & Resources

• Part IV–New Difficulties & Challenges

• Part V–New Methodologies & Solutions

Page 121: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

New Difficulties & Challenges

• Storage and computing capability

• Rapid evolution of languages

• Noise and errors in the web corpora

• Information credibility

Page 122: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Storage and Computing Capability

• Large-scale data sources– Web page corpora

– Web search query logs

– Language model

– ……

• Two questions:

– How to store the data and access them easily?

– How to efficiently process such tremendous amounts of data?

• Solutions:– Pruning and filtering

– Efficient algorithm

– Distributed computing

Page 123: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Storage and Computing Capability (cont.)

• Example-1: machine translation–Scale of data:

• Translation model trained with 100 million sentence pairs

– Size of model: ≈20G

• Language model (5-gram) trained with 100 million sentences

– Size of model: ≈5G

• Models used by online translation systems are usually larger than those above!

Page 124: Natural Language Processing for Web Applicationsir.hit.edu.cn/~zhaosq/paper/NLPforWebApp_Haifeng-Shiqi.pdf- Feature extraction Machine Translation IME NLP Applications Web search Mobile

Storage and Computing Capability (cont.)

• Example-1: machine translation

– Pruning the phrase table (PT) [Johnson et al., 2007]

• Observation:

– Many phrase pairs in the PT are wrong or will never be used

• Disadvantages of a large PT:

– Requires more resources and time to process

– Requires more features and a more sophisticated search

• Work in this paper:

– Prune the PT based on significance testing

Johnson et al. Improving Translation Quality by Discarding Most of the Phrasetable.

Storage and Computing Capability (cont.)

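• Example-1: machine translation

– Pruning the phrase table (PT) [Johnson et al., 2007]

• Fisher’s exact test:

$$p_h(C(\tilde{s},\tilde{t})) = \frac{\binom{C(\tilde{s})}{C(\tilde{s},\tilde{t})} \binom{N - C(\tilde{s})}{C(\tilde{t}) - C(\tilde{s},\tilde{t})}}{\binom{N}{C(\tilde{t})}}$$

• p-value:

$$\text{p-value}(C(\tilde{s},\tilde{t})) = \sum_{k = C(\tilde{s},\tilde{t})}^{\infty} p_h(k)$$

where $N$ is the number of sentence pairs in the parallel corpus, $C(\tilde{s})$ and $C(\tilde{t})$ are the numbers of sentence pairs containing the source phrase $\tilde{s}$ and the target phrase $\tilde{t}$, and $C(\tilde{s},\tilde{t})$ is the number of sentence pairs containing both

Johnson et al. Improving Translation Quality by Discarding Most of the Phrasetable.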
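The significance test above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the `threshold` parameter are assumptions, and the infinite sum is truncated at min(C(s), C(t)) because all later terms are zero.

```python
from math import comb, log

def hypergeom_pmf(k, c_s, c_t, n):
    """p_h(k): probability of exactly k co-occurrences if s and t were independent."""
    return comb(c_s, k) * comb(n - c_s, c_t - k) / comb(n, c_t)

def p_value(c_st, c_s, c_t, n):
    """Sum p_h(k) for k >= C(s,t); terms beyond min(C(s), C(t)) are zero."""
    upper = min(c_s, c_t)
    return sum(hypergeom_pmf(k, c_s, c_t, n) for k in range(c_st, upper + 1))

def keep_pair(c_st, c_s, c_t, n, threshold):
    """Keep a phrase pair when -log(p-value) exceeds a chosen (hypothetical) threshold."""
    p = p_value(c_st, c_s, c_t, n)
    # p can underflow to 0.0 for very strongly associated pairs; keep those.
    return p == 0.0 or -log(p) > threshold
```

For a phrase pair whose source and target each occur once, and together, the p-value is 1/N, so `-log(p)` equals `log(N)`; a threshold near that value decides whether such singleton pairs survive pruning.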

Storage and Computing Capability (cont.)

• Example-2: machine translation

– Parallel decoding and distributed language model [Li et al., 2009]

• Parallel decoding

– Exploit multi-core and multi-processor architectures

– Translate multiple sentences in separate threads

– Store the language model and translation grammar in shared memory

• Distributed language model

– Reduce memory pressure

– Use larger language model

Li et al. Decoding in Joshua: Open Source Parsing-based Machine Translation.
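The parallel decoding idea (one sentence per thread, a single read-only model shared by all threads) can be illustrated with a toy sketch. Everything here is hypothetical: `SHARED_MODEL` and `decode` are stand-ins for a real model and decoder, and Joshua itself is a Java system. In CPython the global interpreter lock limits the CPU speed-up from threads, so the sketch shows the structure only.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy read-only "translation model", loaded once and shared by all threads
# (hypothetical entries, for illustration only).
SHARED_MODEL = {"hello": "bonjour", "world": "monde"}

def decode(sentence):
    """Stand-in decoder: word-by-word lookup in the shared model."""
    return " ".join(SHARED_MODEL.get(word, word) for word in sentence.split())

def decode_batch(sentences, workers=4):
    """Translate several sentences in separate threads; output order matches input."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode, sentences))
```

Because the model is never modified during decoding, the threads need no locking, which is what makes the shared-memory design cheap.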

Storage and Computing Capability (cont.)

• Example-3: paraphrase acquisition

– Basic idea:

• Hypothesis: Paraphrases should appear in similar contexts (distributional similarity)

• Extract as paraphrases those phrases whose context vectors are similar to each other

– E.g., X acquired Y and X completed the acquisition of Y

• Data scale:

– 150GB monolingual corpus

– Every pair of phrases has to be checked to see whether it is a paraphrase pair

Bhagat and Ravichandran. Large Scale Acquisition of Paraphrases for Learning Surface Patterns.

Storage and Computing Capability (cont.)

• Example-3: paraphrase acquisition (cont.)

– Apply Locality Sensitive Hashing (LSH) to speed up computation [Bhagat and Ravichandran, 2008]

• Represents a d-dimensional vector by a stream of b bits ($b \ll d$) and has the property of preserving the cosine similarity between vectors

• Reduces time complexity from $O(n^2 d)$ to $O(nd)$, where n is the number of phrases

Bhagat and Ravichandran. Large Scale Acquisition of Paraphrases for Learning Surface Patterns.
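A minimal sketch of the bit-signature part of cosine-preserving LSH, using random hyperplanes. The function names and the seed are assumptions for illustration. The sketch covers only how a d-dimensional vector becomes b bits and how cosine is estimated from the Hamming distance; the fast neighbour search that yields the overall speed-up is not shown.

```python
import math
import random

def make_hyperplanes(d, b, seed=0):
    """Draw b random d-dimensional Gaussian vectors, one per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(b)]

def signature(vec, planes):
    """Bit i is 1 when vec lies on the non-negative side of hyperplane i."""
    return [1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
            for plane in planes]

def estimated_cosine(sig_u, sig_v):
    """The fraction of differing bits estimates angle/pi, so cos(pi * fraction) estimates cosine."""
    hamming = sum(x != y for x, y in zip(sig_u, sig_v))
    return math.cos(math.pi * hamming / len(sig_u))
```

Comparing two signatures costs O(b) bit operations instead of O(d) floating-point operations, and larger b gives a more accurate estimate.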

New Difficulties & Challenges

• Storage and computing capability

• Rapid evolution of languages

• Noise and errors in the web corpora

• Information credibility

Rapid evolution of languages

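• Languages change rapidly due to the WWW:

– New words

• E.g., 给力 ("awesome"), 雷人 ("shocking")

– New senses

• E.g., 粉丝 (literally "vermicelli", now also "fans"), 玉米 (literally "corn", now also the fans of the singer Li Yuchun)

– New named entities

• E.g., 旭日阳刚 (Xuri Yanggang, a singing duo), 筷子兄弟 (Chopsticks Brothers, a music duo)

– Chat language

• E.g., 818 (shorthand for 扒一扒, "to dig up gossip"), 有木有 (a playful spelling of 有没有, "is there or isn't there")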

Rapid evolution of languages (cont.)

• Challenges posed to NLP:

– Word segmentation

• Out-of-vocabulary (OOV) problem

– Named entity recognition

• More categories should be covered

– Word sense disambiguation

• New senses of words need to be recognized

– Chat language normalization

• Normalize chat language to standard language

Rapid evolution of languages (cont.)

• Example-1: Entity extraction

– Pennacchiotti and Pantel, EMNLP-2009

– Two extractors:

• Pattern-based extractor

– Find instances of a given relation with seed instances

– E.g., act-in (Actor, Movie)

• Distributional extractor

– Construct a context vector for each noun

– Sample seed entities of a given entity class C

– Compute context vector similarity with the seeds

– Similar nouns are returned and classified into C

–Take the union of the entities extracted above

Pennacchiotti and Pantel. Entity Extraction via Ensemble Semantics.
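The distributional extractor and the final union step can be sketched as follows. This is an illustration under stated assumptions, not the paper's system: context vectors are plain feature-to-weight dictionaries, similarity to the seeds is measured against their summed (centroid) vector, and `threshold` is a hypothetical cut-off.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def distributional_extract(context_vectors, seeds, threshold):
    """Return nouns whose context vector is close to the centroid of the seed vectors."""
    centroid = {}
    for seed in seeds:
        for feature, weight in context_vectors[seed].items():
            centroid[feature] = centroid.get(feature, 0.0) + weight
    return {noun for noun, vec in context_vectors.items()
            if noun not in seeds and cosine(vec, centroid) >= threshold}

def ensemble(pattern_entities, distributional_entities):
    """Take the union of the candidates produced by the two extractors."""
    return set(pattern_entities) | set(distributional_entities)
```

The union favours recall; in the approach described on these slides, precision is then recovered by a separate ranking step over the combined candidates.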

Rapid evolution of languages (cont.)

• Example-1: Entity extraction

– Pennacchiotti and Pantel, EMNLP-2009

– ML-based ranker

• Regression model

– Features:

• Feature classes:

– Frequency, Co-occurrence, Distributional, Pattern, Termness

• Features are extracted from:

– Web corpus, query logs, web tables, Wikipedia

Pennacchiotti and Pantel. Entity Extraction via Ensemble Semantics.

Rapid evolution of languages (cont.)

• Example-2: Entity extraction and clustering

– Jain and Pennacchiotti, COLING-2010

– Extract entities from web search query logs

• Heuristic:

– Query term sequence with uppercase characters

• Filtering:

– Web-based representation score

» Checks if the case-sensitive representation of the candidate is the most likely representation

– Query-log-based standalone score

» Counts the occurrences of the standalone forms of the candidate in the query logs

Jain and Pennacchiotti. Open Entity Extraction from Web Search Query Logs.
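A rough sketch of the capitalization heuristic and the standalone score described above. The function names are hypothetical and the definitions are simplified readings of the slide: a candidate is a maximal run of tokens starting with an uppercase character, and the standalone score counts queries that consist of the candidate alone.

```python
def candidate_entities(query):
    """Return maximal runs of consecutive tokens that start with an uppercase character."""
    runs, current = [], []
    for token in query.split():
        if token[0].isupper():
            current.append(token)
        elif current:
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs

def standalone_score(candidate, queries):
    """Count the queries in the log that are exactly the candidate string."""
    return sum(1 for q in queries if q.strip() == candidate)
```

A candidate that users often issue as a complete query on its own is more likely to be a real entity than a fragment that only ever appears inside longer queries.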

Rapid evolution of languages (cont.)

• Example-2: Entity extraction and clustering

– Jain and Pennacchiotti, COLING-2010

– Entity clustering

• Features:

– Context feature space

» Hypothesis: similar queries should appear in similar contexts in the query logs

– Clickthrough feature space

» Hypothesis: similar queries should generate clicks on similar URLs

– Hybrid feature space

» Normalized union of the two feature spaces above

Jain and Pennacchiotti. Open Entity Extraction from Web Search Query Logs.

Rapid evolution of languages (cont.)

• Example-3: Chinese chat text normalization

– Xia et al., ACL-2006

– Problem:

• Normalize Chinese chat text to standard language

– E.g., 介里 -> 这里 ("here"); 偶 -> 我 ("I")

– Characteristics:

• Anomalous

– Anomalous words or anomalous usage

• Dynamic

– Changes quickly from year to year

Xia et al. A Phonetic-Based Approach to Chinese Chat Text Normalization.

Rapid evolution of languages (cont.)

• Example-3: Chinese chat text normalization

– Baseline model:

• Source-channel model

$$\hat{C} = \arg\max_{C} p(C \mid T) = \arg\max_{C} \frac{p(T \mid C)\, p(C)}{p(T)}$$

• Disadvantages:

– Data sparseness

– Training effectiveness is poor due to the dynamic nature

– Observation:

• Most Chinese chat terms are created via phonetic transcription

Xia et al. A Phonetic-Based Approach to Chinese Chat Text Normalization.

Rapid evolution of languages (cont.)

• Example-3: Chinese chat text normalization

– Extended source-channel model:

• Inserting a phonetic mapping model

$$\hat{C} = \arg\max_{C,M} p(C \mid M, T) = \arg\max_{C,M} \frac{p(T \mid M, C)\, p(M \mid C)\, p(C)}{p(T)}$$

• Chat term normalization observation model:

$$p(T \mid M, C) = \prod_{i} p(t_i \mid m_i, c_i)$$

• Phonetic mapping model, where $p(m_i \mid c_i)$ is the phonetic mapping probability:

$$p(M \mid C) = \prod_{i} p(m_i \mid c_i)$$

Xia et al. A Phonetic-Based Approach to Chinese Chat Text Normalization.

Rapid evolution of languages (cont.)

• Example-3: Chinese chat text normalization

– Phonetic mapping model:

$$\mathrm{Prob}_{pm}(t, c) = \frac{fr_{slc}(c) \times ps(t, c)}{\sum_{i} \left( fr_{slc}(c_i) \times ps(t, c_i) \right)}$$

– $fr_{slc}(c)$: frequency of character c in a standard language corpus

– $ps(t, c)$: phonetic similarity between two characters

» $ps(t, c) = sim(py(t), py(c)) = sim(initial(py(t)), initial(py(c))) \times sim(final(py(t)), final(py(c)))$

Xia et al. A Phonetic-Based Approach to Chinese Chat Text Normalization.
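The phonetic mapping probability can be computed directly from the two definitions above. The sketch below is illustrative only: pinyin is represented as an (initial, final) pair, and the similarity tables, frequencies and function names are made-up assumptions rather than values from the paper.

```python
def table_sim(table):
    """Build a similarity function: 1.0 for identical units, else a table lookup (default 0.0)."""
    def sim(a, b):
        if a == b:
            return 1.0
        return table.get((a, b), table.get((b, a), 0.0))
    return sim

def phonetic_similarity(py_t, py_c, initial_sim, final_sim):
    """ps(t, c): similarity of the initials times similarity of the finals."""
    (init_t, fin_t), (init_c, fin_c) = py_t, py_c
    return initial_sim(init_t, init_c) * final_sim(fin_t, fin_c)

def mapping_probabilities(py_t, candidates, freq, pinyin, initial_sim, final_sim):
    """Prob_pm(t, c) for every candidate c, normalised over the candidate set."""
    scores = {c: freq[c] * phonetic_similarity(py_t, pinyin[c], initial_sim, final_sim)
              for c in candidates}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else scores
```

With a chat character pronounced "jie", a frequent standard character with a similar pronunciation can outrank a rarer exact homophone, because the score multiplies corpus frequency by phonetic similarity.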

New Difficulties & Challenges

• Storage and computing capability

• Rapid evolution of languages

• Noise and errors in the web corpora

• Information credibility

Noise and Errors in the Web Corpora

• Query logs

– Queries are not well-formed sentences

– 10-15% of queries contain misspelled terms [Cucerzan and Brill, 2004]

• UGC data

– Noise and errors are common in forums, community-based QA, blogs, and microblogs

• These present challenges to NLP research

– E.g., word segmentation, POS tagging, parsing

Cucerzan and Brill. Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users.

Noise and Errors in the Web Corpora (cont.)

• Spelling correction

– Non-word spelling correction

• Word not found in a pre-compiled lexicon

• Words similar to the misspelled word are candidate spelling corrections

• Statistical error models are more effective

– Real-word spelling correction

• Incorrect usage of a valid word in a given context

– A ___ of cake (peace / piece)

• Generate candidate corrections with a pre-defined confusion set

• Rank candidates according to contextual information
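Both kinds of correction can be sketched briefly. This is a simplified illustration with hypothetical names: non-word candidates are lexicon entries within a small Levenshtein distance, and real-word candidates from a confusion set are ranked by bigram counts with the neighbouring words, a stand-in for richer contextual models.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = cur
    return prev[-1]

def nonword_candidates(word, lexicon, max_dist=2):
    """For a word missing from the lexicon, return lexicon entries within max_dist edits."""
    if word in lexicon:
        return []
    return sorted(w for w in lexicon if edit_distance(word, w) <= max_dist)

def realword_correction(left, right, confusion_set, bigram_counts):
    """Pick the confusion-set member that best fits the left and right context words."""
    return max(confusion_set,
               key=lambda w: bigram_counts.get((left, w), 0) + bigram_counts.get((w, right), 0))
```

For "a ___ of cake", the context words "a" and "of" select "piece" over "peace" as long as the bigram counts reflect ordinary usage.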

Noise and Errors in the Web Corpora (cont.)

• Query spelling correction

– Sun et al., ACL-2010

– Collect training data from the clickthrough data of query logs

• Collect cases in which a user submits a query and clicks on the spelling suggestion

– “did you mean” function

– 3 million query-correction pairs are collected in this way

– Use the collected data to train a spelling correction system

Sun et al. Learning Phrase-Based Spelling Error Models from Clickthrough Data.

Noise and Errors in the Web Corpora (cont.)

• Query spelling correction

– Sun et al., ACL-2010

– Model:

• Ranking model

– Two-layer neural net with 5 hidden nodes

• 96 ranking features

– Language model

– Error model

» Edit distance model, phonetic model

– Phrase-based error model

– ……

Sun et al. Learning Phrase-Based Spelling Error Models from Clickthrough Data.

Noise and Errors in the Web Corpora (cont.)

• Query spelling correction

– Sun et al., ACL-2010

– Phrase-based error model

• Similar to an SMT model

• “Translates” a correct query C into a misspelled query Q

• Given:

– S: a segmentation of C into K phrases $c_1, \ldots, c_K$

– T: K replacement phrases $q_1, \ldots, q_K$

• Model, where $A^*$ is the alignment:

$$P(Q \mid C) \approx \max_{(S,T,M) \in B(C,Q,A^*)} P(T \mid C, S)$$

$$P(T \mid C, S) = \prod_{k=1}^{K} P(q_k \mid c_k)$$

Sun et al. Learning Phrase-Based Spelling Error Models from Clickthrough Data.

Noise and Errors in the Web Corpora (cont.)

• Query spelling correction

– Sun et al., ACL-2010

– Features:

• Phrase transformation feature

$$P(q \mid c) = \frac{N(c, q)}{\sum_{q'} N(c, q')}$$

• Lexical weight feature

$$P_w(q \mid c, A) = \prod_{i=1}^{|q|} \frac{1}{|\{ j \mid (j, i) \in A \}|} \sum_{\forall (i, j) \in A} t(q_i \mid c_j)$$

Sun et al. Learning Phrase-Based Spelling Error Models from Clickthrough Data.
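The phrase transformation feature is a relative frequency over extracted phrase pairs, and the probability of one segmentation is the product of these values. The sketch below assumes the (correct phrase, misspelled phrase) pairs have already been extracted; the function names and toy data are hypothetical.

```python
from collections import Counter

def phrase_transformation_probs(pairs):
    """P(q | c) = N(c, q) / sum over q' of N(c, q'), from (c, q) phrase pairs."""
    counts = Counter(pairs)
    totals = Counter()
    for (c, _q), n in counts.items():
        totals[c] += n
    return {(c, q): n / totals[c] for (c, q), n in counts.items()}

def segmentation_prob(c_phrases, q_phrases, probs):
    """P(T | C, S): product of P(q_k | c_k) over the K aligned phrases."""
    p = 1.0
    for c, q in zip(c_phrases, q_phrases):
        p *= probs.get((c, q), 0.0)  # unseen pairs get probability 0 in this sketch
    return p
```

A practical system would smooth unseen pairs rather than assign them zero, which is one reason the model on these slides also uses a lexical weight feature.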

New Difficulties & Challenges

• Storage and computing capability

• Rapid evolution of languages

• Noise and errors in the web corpora

• Information credibility

Information Credibility

• Credibility problem

– It has become easy to produce and disseminate information in the Web 2.0 age

• Personal web sites

• Forums

• Blogs / microblogs

• ……

– It has become difficult to control the authority and credibility of the information (sources)

• Deceptive, erroneous, subjective, outdated information

Information Credibility (cont.)

• Aspects of information credibility [Metzger, 2007]

– Accuracy

– Authority

• Trustworthiness, expertise

– Objectivity

– Currency

• Up-to-date

– Coverage

• Sufficient depth and breadth

Metzger. Making Sense of Credibility on the Web: Models for Evaluating Online Information and Recommendations for Future Research.

Information Credibility (cont.)

• Criteria for information credibility

– Information contents

• Detect logical consistency and contradiction

– Information sender

• Quality and quantity of the information the sender has produced

– Document style and superficial characteristics

• Sentential style, page layout, links in page,…

– Social evaluation

• Mining users’ opinions and comments

Information Credibility Criteria Project: http://kc.nict.go.jp/project1/icc-project-description.html

Information Credibility (cont.)

• Example-1: Search engine

– Credibility of returned search results:

• Relatedness

• Web site authority

• Inward links

• User clicks

• Anti-cheating

• Currency

• ……

Information Credibility (cont.)

• Example-1: Search engine (cont.)

Information from credible sources is ranked higher

Information Credibility (cont.)

• Example-2: News

– Credibility of news:

• Credibility of the information sources

– E.g., Google News

• Example-3: E-commerce

– Credibility of products

• Users’ rating

• Users’ comments

Information Credibility (cont.)

• Example-3: E-commerce (cont.)

Users’ rating

Users’ comments

Outline

• Part I–Background

• Part II–New Applications & Trends

• Part III–New Data & Resources

• Part IV–New Difficulties & Challenges

• Part V–New Methodologies & Solutions

New Methodologies & Solutions

• Response to real-world demands

• Familiar with cutting-edge research

• Balance between data and algorithms

• Experimental platforms for real applications

Response to Real-world Demands

• Research rooted in real applications

• Research beyond engineering

• Feasibility & Expansibility

– Easy and robust solutions must come first

• Effectiveness vs. Efficiency

– Web-scale applications

• Leverage all available data and resources

Response to Real-world Demands (cont.)

• Example-1: Entity linking

– Application of word sense disambiguation (WSD)

– Conventional WSD research:

• Disambiguate given words in certain contexts

– Entity linking:

• Link entities appearing in documents with their referents in knowledge bases (e.g., Wikipedia)

• WSD is the key problem in entity linking

Response to Real-world Demands (cont.)

• Example-1: Entity linking (cont.)

– An example from [Han and Sun, 2011]

Han and Sun. A Generative Entity-Mention Model for Linking Entities with Knowledge Base.

Response to Real-world Demands (cont.)

• Example-2: Paraphrase acquisition

– Investigate multiple resources [Zhao et al., 2008]

• Thesaurus

• Monolingual parallel corpora

– Multiple translations of the same foreign novel

• Monolingual comparable corpora

– Comparable news articles reporting on the same event

• Bilingual parallel corpora

• Online dictionary definitions

• Query clusters

– Based on click-through information

Zhao et al. Combining Multiple Resources to Improve SMT-based Paraphrasing Model.

Response to Real-world Demands (cont.)

• Example-2: Paraphrase acquisition (cont.)

– Combine multiple resources [Zhao et al., 2008]

• Train a paraphrase table with each resource and combine the tables in an SMT model, which is then used for paraphrase generation

• Accuracy of the generated paraphrases is improved when combining multiple resources

Zhao et al. Combining Multiple Resources to Improve SMT-based Paraphrasing Model.

Response to Real-world Demands (cont.)

• Example-3: Ensemble Semantics (ES)

– Pennacchiotti and Pantel, EMNLP-2009

– Ensemble semantics:

• A general framework for modeling information extraction algorithms that combine multiple sources of information and multiple extractors

• Advantages:

– Multiple sources of knowledge

– Multiple extractors

– Multiple sources of features

Pennacchiotti and Pantel. Entity Extraction via Ensemble Semantics.

Response to Real-world Demands (cont.)

• Example-3: Ensemble Semantics (ES)

– Pennacchiotti and Pantel, EMNLP-2009

Pennacchiotti and Pantel. Entity Extraction via Ensemble Semantics.

New Methodologies & Solutions

• Response to real-world demands

• Familiar with cutting-edge research

• Balance between data and algorithms

• Experimental platforms for real applications

Familiar with Cutting-edge Research

• Cutting-edge research

– New problem

• E.g., sentiment analysis, entailment

– New solution

• E.g., crowdsourcing

– New application

• E.g., microblogs

• In-depth analysis before entering a new field

Familiar with Cutting-edge Research (cont.)

• Example: Crowdsourcing

– Definition:

• The act of outsourcing tasks, traditionally performed by an employee or contractor, to an undefined, large group of people or community (a "crowd"), through an open call (Wikipedia)

• …it gathers those who are most fit to perform tasks, solve complex problems and contribute with the most relevant and fresh ideas (Jeff Howe)

– Amazon’s Mechanical Turk:

• A crowdsourcing Internet marketplace that enables computer programmers (known as Requesters) to co-ordinate the use of human intelligence to perform tasks that computers are unable to do yet (Wikipedia)

Familiar with Cutting-edge Research (cont.)

• Crowdsourcing (cont.)

– Crowdsourcing for NLP

• Speech recognition [Novotney and Callison-Burch, ’10]

• Machine translation [Zaidan and Callison-Burch, ’11]

• Paraphrase generation [Madnani, ’10]

• Anaphora resolution [Chamberlain et al., ’09]

• Word sense disambiguation [Akkaya et al., ’10]

• Lexicon construction [Irvine and Klementiev, ’10]

• Named entity recognition [Finin et al., ’10]

• Grammatical error detection [Madnani et al., ’11]

• ……

Familiar with Cutting-edge Research (cont.)

• Crowdsourcing (cont.)

– Advantages:

• Large and low-cost labor force

• Short turnaround time

• Access to foreign markets with native speakers

– Key problem: quality control

• Knowledge producers are non-professional

– Post-processing to obtain high-quality knowledge

– Machine learning techniques for automatically selecting high-quality knowledge

Familiar with Cutting-edge Research (cont.)

• Crowdsourcing (cont.)

– Example: Collect Urdu-English translations

• Zaidan and Callison-Burch (ACL-2011)

• Pipeline: source sentence → collect translations → post-edit and rank → quality control → best translation

Zaidan and Callison-Burch. Crowdsourcing Translation: Professional Quality from Non-Professionals.

Familiar with Cutting-edge Research (cont.)

• Crowdsourcing (cont.)

– Example: Collect Urdu-English translations

• Collect English translations for Urdu sentences

– Turkers are mainly from India and Pakistan

– Input sentences are converted into images to avoid cheating by using an automatic MT system

– Collect multiple translations for each source sentence

• Post-editing and ranking

– Turkers should be native English speakers

– Post-edit: Edit the translations to make them more fluent

– Rank: rank all translations

Zaidan and Callison-Burch. Crowdsourcing Translation: Professional Quality from Non-Professionals.

Familiar with Cutting-edge Research (cont.)

• Crowdsourcing (cont.)

– Example: Collect Urdu-English translations

• Automatically select the best translation from the candidate translations

• Features:

– Sentence-level features

» Language model features, sentence length features, web n-gram match percentage, web n-gram geometric average, edit rate to other translations

– Worker-level features

» Aggregate features, language ability, worker location

– Ranking features

» Average rank, is-best percentage, is-better percentage

– Worker calibration feature

Zaidan and Callison-Burch. Crowdsourcing Translation: Professional Quality from Non-Professionals.

Familiar with Cutting-edge Research (cont.)

• Crowdsourcing (cont.)

– Example: Collect Urdu-English translations

• Evaluation:

– Compute BLEU against professional translations

Zaidan and Callison-Burch. Crowdsourcing Translation: Professional Quality from Non-Professionals.

New Methodologies & Solutions

• Response to real-world demands

• Familiar with cutting-edge research

• Balance between data and algorithms

• Experimental platforms for real applications

Balance between Data and Algorithms

• Traditional research

– Limited data

– Sophisticated algorithms

• Have to mine “lean ore” for knowledge

• NLP for web applications

– Large-scale data

• Information redundancy is critical for statistical methods

– Relatively simple algorithms

• Easily acquire enough knowledge with lightweight methods

• Efficiently process large corpora

Balance between Data and Algorithms (cont.)

• Example: Open Information Extraction (OIE)

– Banko et al., IJCAI-2007

– Traditional IE systems:

• Use small and homogeneous corpora

– E.g., news corpora

• Rely on heavy linguistic technologies

– E.g., parsing, NER

• Relations of interest are specified beforehand

– As a result, they are difficult to scale to the massive and heterogeneous web corpora

Banko et al. Open Information Extraction from the Web.

Balance between Data and Algorithms (cont.)

• Example: Open Information Extraction (OIE)

– Banko et al., IJCAI-2007

– The proposed method:

• Module-1: self-supervised learner

– Train a classifier with a small corpus, which labels candidate extractions as “trustworthy” or not

• Module-2: single-pass extractor

– Extract candidate extractions from a large corpus, which are then filtered with the learnt classifier

• Module-3: redundancy-based assessor

– Assign a probability to each retained extraction based on a probabilistic model of redundancy in text

Banko et al. Open Information Extraction from the Web.

Balance between Data and Algorithms (cont.)

• Example: Open Information Extraction (OIE)

– Banko et al., IJCAI-2007

– Module-1: self-supervised learner

• Parse a small corpus with a syntax parser, and extract noun phrase pairs with syntax paths

• Automatically label positive / negative examples based on constraints

– Path length, within sentence boundary, not solely pronoun

• Train a Naïve Bayes classifier with the training data

Banko et al. Open Information Extraction from the Web.

Balance between Data and Algorithms (cont.)

• Example: Open Information Extraction (OIE)

– Banko et al., IJCAI-2007

– Module-2: single-pass extractor

• Identify noun phrases with a lightweight NP chunker

• Relations are found by examining the text between noun phrases

– Non-essential phrases are filtered

• Candidate tuples are presented to the classifier; those labeled as positive are extracted and stored

Banko et al. Open Information Extraction from the Web.

Balance between Data and Algorithms (cont.)

• Example: Open Information Extraction (OIE)

– Banko et al., IJCAI-2007

– Module-3: redundancy-based assessor

• Extraction is performed over the entire corpus

• Merge tuples whose entities and relation are identical, and count the number of sentences each merged tuple was extracted from

• Assign a probability based on the occurrence count

Banko et al. Open Information Extraction from the Web.
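
The merge-and-count step can be sketched as below. The probability formula here is a simple noisy-or used as a stand-in so the example is runnable; it is not the probabilistic redundancy model used in the paper, and `p_single` is a hypothetical parameter.

```python
from collections import defaultdict


def normalize(tup):
    """Case-fold and trim each part so identical tuples merge."""
    return tuple(part.lower().strip() for part in tup)


def assess(extractions, p_single=0.5):
    """Merge identical tuples and score them by sentence count.

    extractions: iterable of ((np1, relation, np2), sentence_id).
    Returns {tuple: (count, probability)}, where probability is a
    noisy-or stand-in: 1 - (1 - p_single) ** count.
    """
    sentences = defaultdict(set)
    for tup, sent_id in extractions:
        sentences[normalize(tup)].add(sent_id)
    return {
        tup: (len(ids), 1 - (1 - p_single) ** len(ids))
        for tup, ids in sentences.items()
    }
```

Under this stand-in, a tuple seen in more distinct sentences always receives a higher score, which is the qualitative behavior the slide describes.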


Balance between Data and Algorithms (cont.)

• Example: Open Information Extraction (OIE) – Banko et al., IJCAI-2007

– Data scale:

• Extract facts from a 9 million web page corpus

• 60.5 million tuples were extracted

– Error rate:

System        Average error rate    Correct extractions
TEXTRUNNER    12%                   11,476
KNOWITALL     18%                   11,631

Banko et al. Open Information Extraction from the Web.


New Methodologies & Solutions

• Response to real-world demands

• Familiar with cutting-edge research

• Balance between data and algorithms

• Experimental platforms for real applications


Experimental Platforms for Real Applications

• Academia and industry should work together to establish evaluation platforms that resemble real applications

– Industry:

• Collect real-world application requirements

• Release real application data

– E.g., query logs


Experimental Platforms for Real Applications (cont.)

• Example-1: Yahoo! learning to rank challenge

– A platform for learning-to-rank research

– Dataset:

• Sampled from Yahoo! query logs

• Contains <query, url, features, relevance judgment>

– Queries, urls, and feature descriptions are not given, only feature values are

• Volume


Experimental Platforms for Real Applications (cont.)

• Example-2: Microsoft learning to rank dataset

– Similar to the Yahoo! dataset

– Sampled from Microsoft Bing query logs

• MSLR-WEB30K: more than 30,000 queries

• MSLR-WEB10K: 10,000 queries

• Form

– Query ID, url ID, feature descriptions and feature values are given
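
A small sketch of how such learning-to-rank records might be read. It assumes each record is one line of the form `<relevance> qid:<id> <feature>:<value> ...` (the SVMlight-style layout commonly used for learning-to-rank data); the exact layout should be checked against the dataset's own documentation, and the function names are hypothetical.

```python
from collections import defaultdict


def parse_ltr_line(line):
    """Parse one record into (relevance, query_id, {feature_id: value})."""
    parts = line.split("#", 1)[0].split()  # drop any trailing comment
    relevance = int(parts[0])
    qid = parts[1].split(":", 1)[1]
    features = {}
    for item in parts[2:]:
        fid, val = item.split(":", 1)
        features[int(fid)] = float(val)
    return relevance, qid, features


def group_by_query(lines):
    """Collect (relevance, features) pairs under their query ID."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():
            rel, qid, feats = parse_ltr_line(line)
            groups[qid].append((rel, feats))
    return dict(groups)
```

Grouping by query ID matters because ranking metrics are computed per query rather than over the pooled records.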


Experimental Platforms for Real Applications (cont.)

• Example-3: Query logs

– AOL query logs

• About 20M queries from about 650k users

• Users are represented as IDs

• Both queries and clicked urls are given

– Sogou query logs

• One month of query logs

• Information provided:

– Time, user ID, clicked url (Uc), rank of Uc, sequence number of Uc
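
To make the record structure concrete, the sketch below reads click records and tallies clicks by result rank. The tab-separated, six-field layout (time, user ID, query, rank, sequence number, clicked url) is an assumption for illustration, including the presence of a query field; the actual file layout of each released log should be checked before use.

```python
import csv
from collections import Counter
from typing import NamedTuple


class Click(NamedTuple):
    time: str
    user_id: str
    query: str
    rank: int    # rank of the clicked url in the result list
    seq: int     # sequence number of the click
    url: str


def read_clicks(stream):
    """Yield Click records from a tab-separated stream, skipping bad rows."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 6:
            continue  # malformed line under the assumed layout
        time, user_id, query, rank, seq, url = row
        yield Click(time, user_id, query, int(rank), int(seq), url)


def rank_histogram(clicks):
    """Count how many clicks landed on each result rank."""
    return Counter(c.rank for c in clicks)
```

A histogram like this is a typical first look at click logs, since clicks tend to concentrate on the top-ranked results.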


Experimental Platforms for Real Applications (cont.)

• Example-4: other released data

– Data from Yahoo! Webscope:

• http://webscope.sandbox.yahoo.com/

– Data from Sogou Labs:

• http://www.sogou.com/labs/

–Google Web 1T 5-gram corpus

–……


Thanks!

Q&A