Natural Language Processing for Web Applications
Haifeng Wang, Shiqi Zhao
Outline
• Part I–Background
• Part II–New Applications & Trends
• Part III–New Data & Resources
• Part IV–New Difficulties & Challenges
• Part V–New Methodologies & Solutions
Status of the Internet
Search Engine Union
Provider
Customer
Internet Users
- Any enterprise (hundreds of thousands: finance, travel, shipping)
- Advertisement
- Webmaster (vertical website)
- Press
- Websites (1.83M in China in 2011)
- Hidden web
Infrastructure: domain name, telecommunications, mobile, broadband, hardware, Internet service provider
- 485M+ users in China; about 1 billion in the future
Trends of the Internet
Figure: four trends around the Internet (labels: Media; Content Source; Media-Time Transition; Time Constraint)
• Professional editors → user generated; static → dynamically generated
• Simply surfing → online gaming, video watching, social interaction – all kinds of applications
• Text → image, video, audio, etc.
• Users want real-time information when searching
• The Internet and users have changed a lot in the past decade
Traditional Search Engine
Box
Requirement analysis based on text matching
All results returned in a similar format
Traditional web search
User
Entrance
Query Information
Presentation
Webpages
Crawler Webpages
Internet
New Generation of Search Engines
User
Entrance Box
Requirement Analysis
Results Integration
Execution
Info Apps
Presentation
Query
• Apps execution
• Rich media analysis
Real-time data
Individual and group developers
Internet: SNS / UGC / Hidden web
Open Apps Platform
Open Data Platform
Traditional Web Search
Rich Media Search
Apps | Accurate structured results | Webpages | Rich media
Queries to Baidu Search Engine
听起来欢乐的歌曲 (Joyous songs)
现在几点了 (What time is it?)
电脑中毒了怎么办 (How to deal with a computer virus)
哪能买到漂亮衣服 (Where can I buy some beautiful clothes?)
北京哪能找到女朋友 (Where can I find a girlfriend in Beijing?)
令人心情愉快的图片 (Pleasant pictures)
Challenge to NLP
Figure: NLP at the center of four challenges
• Requirement analysis: complex queries; diverse requirements
• Knowledge mining: hidden web, hidden knowledge; structured, semi-structured, and unstructured data; various levels
• Result presentation: direct answer; clustering; summarization; relation graph; intelligent push; rich media
• Human-computer interaction: suggestion; extension; interaction
NLP for Web Applications
Methods: rule-based methods; statistical / ML methods
Resources: dictionary; corpus; web data; logs
Modules
• Term
– Granularity: segmentation, unknown word, component
– Property: proper noun, requirement, POS, phonetic
– Relation: collocation, similarity, language model, ontology
• Phrase
– Structure: chunking, term importance, trunk parsing
– Transformation: synonym, semantic normalization, correction
– Classification: requirement classification, topic detection
• Sentence: syntax, semantics
• Document
– Single document: topic analysis, page value analysis
– Multi-document: classification & clustering, feature extraction
NLP applications: machine translation; IME
Applications: web search; mobile search; vertical search; cQA; Wikipedia; advertising; recommendation & personalization
New Applications & Trends
• NLP for web applications
–Main applications
• Web search, online translation, recommendation system, e-business, social networks,…
–Evolution of traditional topics
• Word segmentation, machine translation, question answering,…
–New research topics
Main Applications (cont.)
• NLP for web search–Basic applications
• Segmentation
• Query term importance rating
• Query rewriting
• Query intent analysis
–Advanced applications• Question answering
• Summarization
• Word sense disambiguation
• Clustering
• Information extraction
• Ontology
Main Applications (cont.)
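• Examples:
–天龙八步 → 天龙八部 (spelling correction of the novel title Tian Long Ba Bu)
–怎样能有归一证 → 怎样能有皈依证 ("how to get a conversion certificate"; homophone correction)
–宝马X6价钱 → 宝马X6报价 ("BMW X6 price" → "BMW X6 quotation")
–成都的哥罢工 → 成都出租车罢工 ("Chengdu cabbies strike" → "Chengdu taxi strike")
–赞颂母爱的现代诗 → 母爱的现代诗 ("modern poems praising maternal love" → "modern poems on maternal love")
–康柏笔记本vista系统一键恢复 → 康柏vista一键恢复 ("Compaq laptop Vista system one-key recovery" → "Compaq Vista one-key recovery")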
Main Applications (cont.)
• NLP for online translation–Webpage
–Query
– communication
–Social network
–E-commerce
–Computer aided translation
–Computer aided learning
–Mobile
Main Applications (cont.)
• Recommendation system
–Recommendation within a single product
–Recommendation across products
• Recommendation clue
–User profile
–User log
–Content
Main Applications (cont.)
• NLP for e-business– Intent / requirement recognition
• e-business website
– Information extraction• E.g., extract product information
– Information recommendation• Maintain user profiles and recommend products to users
• Advertising
– Sentiment analysis• Analyze comments of products
Main Applications (cont.)
• NLP for social networks–User profile
–Recommendation• Not only information, but also users
–Data mining • Identify emergency events
• Sentiment analysis
• Polls
Main Applications (cont.)
Main applications
Web search
Online translation
Recommendation
E-business
Social networks
Segmentation
Query term importance rating
Query rewriting
Query intent analysis
Question answering
Summarization
Word sense disambiguation
Information extraction
Ontology
Basic applications
Advanced applications
Clustering
Basic Applications for Web Search
• Word segmentation –Basic unit for languages like Chinese
–Segmentation for both queries and web documents
功夫熊猫在线观看 ("watch Kung Fu Panda online")
功夫熊猫 在线 观看
功夫 熊猫 在线 观看
Basic Applications for Web Search (cont.)
• Query segmentation–Segment a query into chunks
–Scenario: document ranking• What terms must appear contiguously in the retrieved
documents?
功夫 熊猫 在线 观看
功夫熊猫 在线观看
功夫熊猫在线观看
Basic Applications for Web Search (cont.)
• Query term importance rating–Estimate importance rating for each query term
–Scenario: document ranking• Which query terms MUST be matched in the retrieved
documents?
功夫熊猫在线观看
term  rating
功夫熊猫  3
在线  2
观看  1
Basic Applications for Web Search (cont.)
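• Query rewriting
–User queries are never perfect
• Contain errors
– E.g., 牛肉顿萝卜 (misspelling of 牛肉炖萝卜, "beef stewed with radish")
• Too short
– E.g., 牛肉 萝卜 ("beef radish")
• Too verbose
– E.g., 哪位朋友可以告诉我牛肉炖萝卜该怎么做啊??? ("Can any friend tell me how to cook beef stewed with radish???")
• Use less common expressions
– E.g., 牛肉烧萝卜 ("beef braised with radish")
• ……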
Basic Applications for Web Search (cont.)
• Query rewriting (cont.)– Involve the following NLP techniques:
• Error detection & correction
– Collecting query correction pairs
• Query expansion
– Learning expansion term from query logs or web docs
• Query reduction
– Query term importance rating
• Query paraphrasing & entailment
– Synonymous resource extraction
• ……
Basic Applications for Web Search (cont.)
• Query intent analysis
Advanced Applications for Web Search (cont.)
• Question answering
question
answer
Advanced Applications for Web Search (cont.)
• Summarization
Summaries rather than snippets
Advanced Applications for Web Search (cont.)
• Word sense disambiguation
Advanced Applications for Web Search (cont.)
• Search result clustering
Advanced Applications for Web Search (cont.)
• Information extraction
Advanced Applications for Web Search (cont.)
• Ontology
Evolution of Traditional Topics
• Example-1: word segmentation
–Demands from web applications (especially search engines)
• High efficiency
– To process tens of billions of web documents
• Frequent updates
– To recognize new terms / concepts / named entities…
• Flexibility
– Different applications require different segmentation outputs
Evolution of Traditional Topics (cont.)
• Example-1: word segmentation (cont.)– Solutions:
• Light weight model
– Efficient
• Learn new terms and NEs
– Mine web corpora and query logs
– Be easily added to the segmentation dictionary
• Various granularities
– Customize for different applications
Evolution of Traditional Topics (cont.)
• Example-2: machine translation–Machine translation methods
• Rule based systems
– E.g., Systran
• Statistical machine translation
– Google, Baidu …
–Difficulties
• Data sparseness
• Models are too large
• Not fast enough
Evolution of Traditional Topics (cont.)
• Example-2: machine translation (cont.)–Online translation
• Web service instead of software
• Collect tens of millions of multilingual parallel / comparable sentence pairs from the web for model training
• Extract translations for idioms, named entities, and new terms from the web
• Both translation model and language model are extremely large
Evolution of Traditional Topics (cont.)
• Example-2: machine translation (cont.)–Challenges to online translation
• Quality control for the automatically collected data
• Model selection
• Model compression
• Distributed storage and computing
• Model update
• Fast decoding
• Domain adaptation
• ……
Evolution of Traditional Topics (cont.)
• Example-2: machine translation (cont.)
Reordering problem
Evolution of Traditional Topics (cont.)
• Example-3: question answering–Stage-1:
• Natural-language interface to expert systems
• Within specific domains
– Stage-2: web-based QA• Open domain
• Built upon large web corpora
• Mainly based on statistical methods
– Correct answers appear more times than incorrect ones
Evolution of Traditional Topics (cont.)
• Example-3: question answering (cont.)– Stage-2: web-based QA (cont.)
• Main modules
– Question classifier
– Search engine
– Answer extractor
• Deep analysis
– Morphological analysis, syntactic analysis, NER, WSD, coreference resolution, inference, reasoning, Ontology,…
• Usually work well on factoid questions
Evolution of Traditional Topics (cont.)
• Example-3: question answering (cont.)– Stage-3: community-based QA
• Users ask questions and wait for other users to answer them
• Answers have higher coverage than those from automatic QA systems
• Previously asked questions and answers can be searched by other users
• Can work well on description and subjective questions
New Research Topics
• Sentiment analysis
• Wikipedia-based research
• Microblog-based research
• Crowdsourcing
• ……
New Data & Resources
• Treasures from WWW
Large-scale Web corpora
Query logs with user behaviors
User generated content (UGC)
Large-scale Web Corpora
Large-scale Web Corpora (cont.)
• Large-scale web corpora for NLP–Statistics of distributions
• N-grams / Language model
• Collocations
• Co-occurrence
• ……
–Data mining• Information extraction / answer extraction for QA
• Paraphrase / entailment rules acquisition
• Bilingual data collection
• ……
Large-scale Web Corpora (cont.)
• Statistics of distributions–Example: N-grams / language model
• E.g., Google Web 1-T 5-grams
• Source data:
– 1 trillion word tokens from web pages
• Data size:
Number of tokens 1,024,908,267,229
Number of sentences 95,119,665,584
Number of unigrams 13,588,391
Number of bigrams 314,843,401
Number of trigrams 977,069,902
Number of fourgrams 1,313,818,354
Number of fivegrams 1,176,470,663
Large-scale Web Corpora (cont.)
• Statistics of distributions (cont.)–Applications of web n-grams
•Example: lexical substitution–Substitute words in a sentence with their synonyms
that fit in the given context
–Two stages:
»Extract candidate substitutes from thesauri
»Rank candidates according to their fitness in the given context
Large-scale Web Corpora (cont.)
• Statistics of distributions (cont.)– Lexical substitution
• Stage-2: candidate ranking [Giuliano et al., 2007]
Giuliano et al. FBK-irst: Lexical Substitution Task Exploiting Domain and Syntagmatic Coherence.
H_l w H_r
H_l e H_r
Left context Right context
Count the frequency of the generated fragment using
Google 5-grams
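The candidate-ranking stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the count table is a toy stand-in with made-up numbers (not Google 5-gram data), and the function name `rank_substitutes` is hypothetical.

```python
from typing import Dict, List, Tuple

# Toy stand-in for an n-gram count table (hypothetical counts).
NGRAM_COUNTS: Dict[Tuple[str, ...], int] = {
    ("a", "bright", "student"): 120,
    ("a", "smart", "student"): 300,
    ("a", "shiny", "student"): 2,
}

def rank_substitutes(left: List[str], right: List[str], candidates: List[str],
                     counts: Dict[Tuple[str, ...], int]) -> List[Tuple[str, int]]:
    """Rank each candidate e by the count of the fragment H_l e H_r."""
    scored = []
    for e in candidates:
        fragment = tuple(left + [e] + right)
        scored.append((e, counts.get(fragment, 0)))
    # A higher count suggests the candidate fits the context better.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Substitutes for "bright" in "a bright student".
ranking = rank_substitutes(["a"], ["student"], ["shiny", "smart"], NGRAM_COUNTS)
```

Unseen fragments get a count of zero here; a real system would need some form of smoothing or backoff to shorter contexts.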
Large-scale Web Corpora (cont.)
• Data mining–Example-1: Learning surface patterns for QA
[Ravichandran and Hovy, ACL-2002]
Ravichandran and Hovy. Learning Surface Text Patterns for a Question Answering System.
Question taxonomy: BIRTHDAY
Given seed: (Mozart, 1756)
Score  Paraphrase pattern
1.00  <NAME> ( <ANSWER> - )
0.85  <NAME> was born on <ANSWER>,
0.60  <NAME> was born in <ANSWER>
0.59  <NAME> was born <ANSWER>
0.53  <ANSWER> <NAME> was born
0.50  – <NAME> ( <ANSWER>
0.36  <NAME> ( <ANSWER> -
Large-scale Web Corpora (cont.)
• Example-1: Learning surface patterns for QA (cont.)
–Main steps for learning patterns:• Seed selection
– E.g., Mozart 1756 for BIRTHDAY
• Submit the seed to a search engine
• Download the top-1000 web documents
• Retain sentences containing the Q and A terms
• Pass each sentence through a suffix tree constructor
• Retain phrases in the suffix tree that contain both the Q and A terms
• Replace the Q term with <NAME> and the A term with <ANSWER>
Ravichandran and Hovy. Learning Surface Text Patterns for a Question Answering System.
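The last three steps can be approximated with a short sketch. It is a simplification under stated assumptions: instead of a suffix tree, it keeps only the shortest span that covers both seed terms in each sentence, and the function name `candidate_patterns` and the sample sentences are made up for illustration.

```python
from typing import List, Set

def candidate_patterns(sentences: List[str], q_term: str, a_term: str) -> Set[str]:
    """Return, per sentence, the span covering both seed terms with slots substituted."""
    patterns = set()
    for s in sentences:
        qi, ai = s.find(q_term), s.find(a_term)
        if qi < 0 or ai < 0:
            continue  # retain only sentences containing both the Q and A terms
        start = min(qi, ai)
        end = max(qi + len(q_term), ai + len(a_term))
        span = s[start:end]
        # Replace the seed terms with the generic slots.
        span = span.replace(q_term, "<NAME>").replace(a_term, "<ANSWER>")
        patterns.add(span)
    return patterns

sents = [
    "Mozart was born in 1756 in Salzburg.",
    "Mozart (1756 - 1791) was a composer.",
    "Salzburg is a city.",
]
pats = candidate_patterns(sents, "Mozart", "1756")
```

A suffix tree would additionally yield every repeated substring with its frequency, which this sketch does not attempt.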
Large-scale Web Corpora (cont.)
• Example-1: Learning surface patterns for QA (cont.)
–Calculate precision of each pattern:• Query the search engine with the Q term
– E.g., Mozart
• Download top-1000 web documents
• Retain sentences that contain Q term
• For each learnt pattern, compute the percentage of its matches in which the correct answer occurs in the <ANSWER> slot
• Return only the patterns matching a sufficient number of seeds (>5)
Ravichandran and Hovy. Learning Surface Text Patterns for a Question Answering System.
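The precision computation can be sketched with regular expressions. This is an illustrative approximation: the `<ANSWER>` slot is matched with `\w+` (an assumption, not the paper's matching rule), and the function name and sentences are hypothetical.

```python
import re
from typing import List

def pattern_precision(pattern: str, sentences: List[str], q_term: str, answer: str) -> float:
    """Fraction of pattern matches whose <ANSWER> slot holds the correct answer."""
    regex = re.escape(pattern)
    regex = regex.replace(re.escape("<NAME>"), re.escape(q_term))
    regex = regex.replace(re.escape("<ANSWER>"), r"(\w+)")
    matched = 0
    correct = 0
    for s in sentences:
        for m in re.finditer(regex, s):
            matched += 1
            if m.group(1) == answer:
                correct += 1
    return correct / matched if matched else 0.0

docs = [
    "Mozart was born in 1756.",
    "Mozart was born in Salzburg.",
    "Mozart was born in 1756 to Leopold.",
]
p = pattern_precision("<NAME> was born in <ANSWER>", docs, "Mozart", "1756")
```

The second sentence shows why "was born in" scores below "was born on" in the table of learnt patterns: the slot can also capture a place name.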
Large-scale Web Corpora (cont.)
• Example-2: Mining parallel data –Mining parallel data from bilingual websites [Shi
et al., ACL-2006]
•Basic idea: –One can identify parallel web pages from bilingual
websites, and further extract bilingual sentences from them.
– If two hyperlinks in two parallel web pages are aligned, then their corresponding web pages are also likely to be parallel
Shi et al. A DOM Tree Alignment Model for Mining Parallel Data from the Web.
Large-scale Web Corpora (cont.)
• Example of bilingual websites
Parallel web pages
Large-scale Web Corpora (cont.)
• Example-2: Mining parallel data (cont.) –Mining parallel data from bilingual websites
• Main steps:
– Identify bilingual websites using trigger words
» E.g., English, English Version, 中文, 中文版
– Identify bilingual web pages with a classifier
» Features: length ratio; HTML tag similarity; sentence alignment score
– DOM tree alignment for bilingual parallel web pages
– Sentence alignment within aligned text chunks
– Recursively mine parallel hyperlinks
Shi et al. A DOM Tree Alignment Model for Mining Parallel Data from the Web.
Large-scale Web Corpora (cont.)
• Example-2: Mining parallel data (cont.)– Mining parallel data from bilingual web pages
[Jiang et al., ACL-2009]• Disadvantage of the above method
– The number of bilingual websites is small, thus the volume of extracted bilingual data is limited.
• Basic idea:
– In many bilingual web pages bilingual data appear collectively and follow similar surface patterns.
Jiang et al. Mining Bilingual Data from the Web with Adaptively Learnt Patterns.
Large-scale Web Corpora (cont.)
• Examples of collective bilingual pages
Bilingual terms
Bilingual sentences
Large-scale Web Corpora (cont.)
• Example-2: Mining parallel data (cont.)– Mining parallel data from bilingual web pages
• Main steps:
Jiang et al. Mining Bilingual Data from the Web with Adaptively Learnt Patterns.
Large-scale Web Corpora (cont.)
• Example-2: Mining parallel data (cont.)– Mining parallel data from bilingual web pages
• Preprocessing:
– Parse web documents into DOM trees
– Segment text into snippets according to languages
• Seed mining:
– Judge any adjacent E/C snippet pair
– Compute the likelihood of being a translation pair
» Combines a translation model and a transliteration model
Jiang et al. Mining Bilingual Data from the Web with Adaptively Learnt Patterns.
Large-scale Web Corpora (cont.)
• Example-2: Mining parallel data (cont.)– Mining parallel data from bilingual web pages
• Pattern learning:
– Candidate pattern extraction
– Pattern selection with a SVM classifier
» Features: generality; average translation score; length; irregularity
• Pattern-based translation mining
Jiang et al. Mining Bilingual Data from the Web with Adaptively Learnt Patterns.
7. Don’t worry. 别担心。
[N][P][S][E][P][S][C][P]
Large-scale Web Corpora (cont.)
• Example-3: Improve parsing–Web-scale features for Parsing [Bansal and Klein,
ACL-2011]• Motivation:
– Web counts are powerful syntactic cues
• Basic idea:
– Generate web count features to address the full range of syntactic attachments
Bansal and Klein. Web-Scale Features for Full-Scale Parsing.
raising from is much more frequent on the web than
$ x billion from
Large-scale Web Corpora (cont.)
• Example-3: Improve parsing (cont.)–Web-scale features for Parsing
• Affinity features
– E.g., lexical co-occurrence counts from large corpora
• Paraphrase features
–Web corpus:• Google n-gram corpus
Bansal and Klein. Web-Scale Features for Full-Scale Parsing.
Large-scale Web Corpora (cont.)
• Example-3: Improve parsing (cont.)–Affinity features
• Given a head-argument pair (h,a), define the adjacent count feature:
– ADJ
» Count of query q=ha or q=ah in the web corpus
– ADJ∧POS(h)∧POS(a)
» Specific to each pair of POS tags
– ADJ∧POS(h)∧POS(a)∧b» b is the binned query count
– Other complex features
Bansal and Klein. Web-Scale Features for Full-Scale Parsing.
Large-scale Web Corpora (cont.)
• Example-3: Improve parsing (cont.)– Paraphrase features
• Form:– PARA ∧POS(h) ∧POS(a) ∧c ∧p ∧dir
• Example:– PARA ∧VBG ∧IN ∧it ∧MIDDLE ∧→
• Explanation:
– If frequent occurrences of raising it from indicated a correct attachment between raising and from, frequent occurrences of lowering it with will indicate the correctness of an attachment between lowering and with.
Bansal and Klein. Web-Scale Features for Full-Scale Parsing.
Query Logs with User Behaviors
• Search engine query logs:–Log data recording users’ queries along with
corresponding behaviors with the search engine
–What can be mined in query logs?• Query: keywords that users search with the search
engine
• Clicks: urls that users click after submitting queries
• Session: a sequence of queries and clicks from the same user within a short time interval
• Other information: search time, user ID, user browsing…
Query Logs with User Behaviors (cont.)
• Search engine query logs (cont.)–An example:
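time | id | query | click
2011-08-21-09:02:32 | 12345 | 非诚勿扰 | -
2011-08-21-09:02:40 | 12345 | 非诚勿扰电影 | url1, title: 非诚勿扰电影 百度视频
2011-08-21-09:04:51 | 12345 | - | url2, title: 非诚勿扰-电影-高清在线观看
2011-08-21-09:05:12 | 12345 | 非诚勿扰2 | url3, title: 非诚勿扰2-高清在线观看
2011-08-21-09:05:18 | 12346 | 姚晨微波 | -
2011-08-21-09:05:39 | 12346 | 姚晨微博 | url4, title: 姚晨的微博 新浪微博-随时…
2011-08-21-09:05:43 | 12347 | 中国期刊网 | url5, title: 中国知网首页
2011-08-21-09:05:48 | 12348 | 非诚勿扰2 在线 | url3, title: 非诚勿扰2-高清在线观看
(Glosses: 非诚勿扰 is the film If You Are the One; 电影 "movie"; 在线 "online"; 姚晨微波 is a misspelling of 姚晨微博, "Yao Chen's microblog"; 中国期刊网 "China journal network"; 中国知网 is CNKI.)
Uses illustrated by this log: query expansion; NE extraction; spelling error correction; query clustering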
Query Logs with User Behaviors (cont.)
• Query logs for IR and NLP–Query clustering
–Query intent recognition
–Query rewriting• Expansion, reduction, synonymous reformulation, error
correction
–Query suggestion
–Learning-to-rank
–Named entity recognition
–Ontology construction
–……
Query Logs with User Behaviors (cont.)
• Example-1: Query clustering–Wen et al., 2002
–Measure query similarity from two aspects:• Content similarity
– Similarity of content words in two queries
• Click-through similarity
– Similarity of user clicked urls for two queries
• Combined similarity
– $\text{similarity} = \alpha \cdot \text{similarity}_{\text{content}} + \beta \cdot \text{similarity}_{\text{cross\_ref}}$
Wen et al. Query Clustering Using User Logs.
Query Logs with User Behaviors (cont.)
• Example-1: Query clustering (cont.)
–Content similarity
• Based on keywords or phrases
– Word overlap rate
– Cosine similarity
• Based on edit distance
Wen et al. Query Clustering Using User Logs.
Query Logs with User Behaviors (cont.)
• Example-1: Query clustering (cont.)
–Click-through similarity
• Through single document
– Overlap rate of the clicked URLs
• Through document hierarchy
– Lowest common parent node
Wen et al. Query Clustering Using User Logs.
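A minimal sketch of combining content similarity and click-through similarity into one score. The normalization by the larger set and the default weights of 0.5 are assumptions made for illustration, and the function names are hypothetical.

```python
from typing import Set

def overlap_rate(x: Set[str], y: Set[str]) -> float:
    """Shared items divided by the size of the larger set (one possible normalization)."""
    if not x or not y:
        return 0.0
    return len(x & y) / max(len(x), len(y))

def combined_similarity(kw_p: Set[str], kw_q: Set[str],
                        clicks_p: Set[str], clicks_q: Set[str],
                        alpha: float = 0.5, beta: float = 0.5) -> float:
    """alpha * content similarity + beta * click-through (cross-reference) similarity."""
    content = overlap_rate(kw_p, kw_q)
    cross_ref = overlap_rate(clicks_p, clicks_q)
    return alpha * content + beta * cross_ref

# Two queries with little word overlap but identical clicked URLs.
sim = combined_similarity({"kung", "fu", "panda"}, {"panda", "movie"},
                          {"url1", "url2"}, {"url1", "url2"})
```

The example shows the motivation for the combination: queries that share few words can still be judged similar when users click the same documents.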
Query Logs with User Behaviors (cont.)
• Example-2: Building NE-query intent taxonomy
–Yin and Shah, WWW-2010
Yin and Shah. Building Taxonomy of Web Search Intents for Name Entity Queries.
Mining intent phrases for NE-
queries and organizing them in a
taxonomy
Query Logs with User Behaviors (cont.)
• Example-2: Building NE-query intent taxonomy (cont.)–Three main steps:
• Identify search intents for a class of entities
– E.g., download, mp3, mv for music
• Infer relationship between two intent phrases
– E.g., synonyms: pictures / pics;
hypernyms: wallpapers / pictures
• Organize intent phrases into a tree
Yin and Shah. Building Taxonomy of Web Search Intents for Name Entity Queries.
Query Logs with User Behaviors (cont.)
• Example-2: Building NE-query intent taxonomy (cont.)– Intent phrase identification
• Extract phrases that co-appear with NEs of a given class from query logs
– Infer relations between intent phrases• Basic idea:
– Given an NE e and its intent phrases w1 and w2, if the clicked urls for “e+w1” also satisfy “e+w2”, then w1 ⊆ w2
–Organize intent phrases
• Three approaches:
– Directed maximum spanning tree; hierarchical agglomerative clustering; Pachinko allocation models
Yin and Shah. Building Taxonomy of Web Search Intents for Name Entity Queries.
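A rough sketch of the relation-inference idea. It approximates "the clicked urls for e+w1 also satisfy e+w2" by click overlap, which is an assumption; the threshold of 0.8, the function name, and the URLs are all hypothetical.

```python
from typing import Set

def subsumes(clicks_w1: Set[str], clicks_w2: Set[str], threshold: float = 0.8) -> bool:
    """True if most URLs clicked for "e+w1" are also clicked for "e+w2" (w1 under w2)."""
    if not clicks_w1:
        return False
    return len(clicks_w1 & clicks_w2) / len(clicks_w1) >= threshold

# Hypothetical clicks for "<entity> wallpapers" and "<entity> pictures".
wallpapers = {"u1", "u2"}
pictures = {"u1", "u2", "u3"}
is_sub = subsumes(wallpapers, pictures)
is_reverse = subsumes(pictures, wallpapers)
```

When the test holds in both directions the two phrases behave like synonyms; when it holds in one direction only, the narrower phrase can be placed under the broader one.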
Query Logs with User Behaviors (cont.)
• Example-3: Query substitution
– Jones et al., WWW-2006
–Example:
Jones et al. Generating Query Substitutions.
Query-level substitution
Phrase-level substitution
Query Logs with User Behaviors (cont.)
• Example-3: Query substitution (cont.)–Collect reformulation query / phrase pairs
• Using query sessions
• Query pair:
– Successive queries issued by a single user
– E.g., britney spears mp3s -> britney spears lyrics
• Phrase pairs:
– Segment queries into phrases
– Retain queries in which only one segment has changed
– E.g., (britney spears) (mp3s) -> (britney spears) (lyrics)
• Measure relatedness of the query / phrase pairs
– Log likelihood ratio (LLR) score
Jones et al. Generating Query Substitutions.
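The phrase-pair step ("retain queries in which only one segment has changed") can be sketched as below. The queries are assumed to be segmented already, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

def changed_segment(q1: List[str], q2: List[str]) -> Optional[Tuple[str, str]]:
    """Return (old, new) if two segmented queries differ in exactly one segment."""
    if len(q1) != len(q2):
        return None
    diffs = [(s1, s2) for s1, s2 in zip(q1, q2) if s1 != s2]
    return diffs[0] if len(diffs) == 1 else None

# Successive queries from one session, already segmented into phrases.
pair = changed_segment(["britney spears", "mp3s"], ["britney spears", "lyrics"])
```

Counting such pairs over many sessions gives the co-occurrence statistics that a relatedness score such as the log likelihood ratio is computed from.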
Query Logs with User Behaviors (cont.)
• Example-3: Query substitution (cont.)–Generate substitutions
• Query-level substitution for frequent queries, while phrase-level substitution for infrequent ones
–Rank candidate substitutions• Machine learning methods
– Linear regression
– Binary classification
• Features:
– Query length, #segments, %Alphabetic characters, edit distance, #segments substituted, #tokens shared, size of prefix overlapping, LLR, frequency, mutual information,…
Jones et al. Generating Query Substitutions.
Query Logs with User Behaviors (cont.)
• Example-4: Optimizing page rank– Joachims, SIGKDD-2002
Joachims. Optimizing Search Engines using Clickthrough Data.
Links 1, 3, and 7 are clicked, which yields the preferences:
link3 <r* link2
link7 <r* link2
link7 <r* link4
link7 <r* link5
link7 <r* link6
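The preference pairs above follow a simple rule: a clicked link is preferred over every unclicked link ranked above it. A minimal sketch, with a hypothetical function name:

```python
from typing import List, Set, Tuple

def click_preferences(ranking: List[int], clicked: Set[int]) -> List[Tuple[int, int]]:
    """Return (preferred, skipped) pairs: each clicked link beats unclicked links above it."""
    prefs = []
    for i, link in enumerate(ranking):
        if link in clicked:
            for earlier in ranking[:i]:
                if earlier not in clicked:
                    prefs.append((link, earlier))
    return prefs

# Seven results presented in order; the user clicks links 1, 3, and 7.
prefs = click_preferences([1, 2, 3, 4, 5, 6, 7], {1, 3, 7})
```

Link 1 produces no pair because nothing was ranked above it, which is why the feedback is only partial.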
Query Logs with User Behaviors (cont.)
• Example-4: Optimizing page rank (cont.)–Defined as a ranking problem:
• Given:
– $(q_1, r_1^*), (q_2, r_2^*), \ldots, (q_n, r_n^*)$
» n: size of training data; $q_i$: query; $r_i^*$: target ranking
• Maximize:
– $\tau_S(f) = \frac{1}{n} \sum_{i=1}^{n} \tau(r_{f(q_i)}, r_i^*)$
• Kendall's τ:
– $\tau(r_a, r_b) = \frac{P - Q}{P + Q}$, where P = #concordant pairs and Q = #discordant pairs
Joachims. Optimizing Search Engines using Clickthrough Data.
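Kendall's τ for two rankings of the same items can be computed directly from its definition. This is a quadratic-time sketch for illustration; the function name and the document ids are made up.

```python
from itertools import combinations
from typing import Dict

def kendall_tau(rank_a: Dict[str, int], rank_b: Dict[str, int]) -> float:
    """(P - Q) / (P + Q), with P concordant and Q discordant item pairs."""
    p = q = 0
    for x, y in combinations(list(rank_a), 2):
        agreement = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if agreement > 0:
            p += 1   # both rankings order x and y the same way
        elif agreement < 0:
            q += 1   # the rankings disagree on x and y
    return (p - q) / (p + q) if (p + q) else 0.0

target = {"d1": 1, "d2": 2, "d3": 3}
system = {"d1": 1, "d2": 3, "d3": 2}
tau = kendall_tau(target, system)
```

Identical rankings give 1 and fully reversed rankings give -1; the example has two concordant pairs and one discordant pair.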
Query Logs with User Behaviors (cont.)
• Example-4: Optimizing page rank (cont.)–Ranking SVM
Joachims. Optimizing Search Engines using Clickthrough Data.
Using partial
feedback
Query Logs with User Behaviors (cont.)
• Example-5: Ontology construction–Sekine and Suzuki, WWW-2007
–Basic idea: • Entities belonging to the same class should appear in
similar contexts
–Data: • Query logs (only queries are used)
–Main steps:• Step-1: Extract typical contexts for a given class
• Step-2: Find new entities belonging to the class
Sekine and Suzuki. Acquiring Ontological Knowledge from Query Logs.
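The two steps can be sketched on a toy query list. This is a simplification: contexts are whole queries with the entity replaced by `#`, scored by raw frequency, and the function names, seeds, and queries are hypothetical rather than taken from the paper.

```python
from collections import Counter
from typing import List, Set

def typical_contexts(queries: List[str], seeds: Set[str], top_k: int = 1) -> List[str]:
    """Step 1: replace seed entities with '#' and keep the most frequent contexts."""
    ctx = Counter()
    for q in queries:
        for e in seeds:
            if e in q and q != e:
                ctx[q.replace(e, "#", 1)] += 1
    return [c for c, _ in ctx.most_common(top_k)]

def new_entities(queries: List[str], contexts: List[str], seeds: Set[str]) -> Set[str]:
    """Step 2: fill the '#' slot of each context with unseen strings from the queries."""
    found = set()
    for q in queries:
        for c in contexts:
            prefix, _, suffix = c.partition("#")
            if (q.startswith(prefix) and q.endswith(suffix)
                    and len(q) > len(prefix) + len(suffix)):
                entity = q[len(prefix):len(q) - len(suffix)]
                if entity not in seeds:
                    found.add(entity)
    return found

log = ["oscar winners", "grammy winners", "oscar history", "emmy winners"]
seeds = {"oscar", "grammy"}
contexts = typical_contexts(log, seeds)
entities = new_entities(log, contexts, seeds)
```

Only the query strings are used, which matches the data restriction stated above.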
Query Logs with User Behaviors (cont.)
• Example-5: Ontology construction (cont.)
Sekine and Suzuki. Acquiring Ontological Knowledge from Query Logs.
Examples of class “Award”
Typical context words for “Award”
New entities for the class “Award”
Query Logs with User Behaviors (cont.)
• Example-6: Paraphrase acquisition–Zhao et al., 2010
–Corpus• Query logs (queries and titles) of a search engine
–Assumption• If a query q hits a title t, then q and t are likely to be
paraphrases
• If queries q1 and q2 hit the same title t, then q1 and q2 are likely to be paraphrases
• If a query q hits titles t1 and t2, then t1 and t2 are likely to be paraphrases
Zhao et al. Paraphrasing with Search Engine Query Logs.
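The three assumptions translate directly into candidate generation from (query, clicked title) pairs. A minimal sketch with a hypothetical function name; the candidates would still need validation by a classifier.

```python
from collections import defaultdict
from itertools import combinations
from typing import List, Set, Tuple

Pair = Tuple[str, str]

def paraphrase_candidates(hits: List[Pair]) -> Tuple[Set[Pair], Set[Pair], Set[Pair]]:
    """From (query, title) hits, derive query-title, query-query, and title-title candidates."""
    qt = set(hits)
    by_title = defaultdict(set)
    by_query = defaultdict(set)
    for q, t in qt:
        by_title[t].add(q)
        by_query[q].add(t)
    # Queries hitting the same title, and titles hit by the same query.
    qq = {tuple(sorted(p)) for qs in by_title.values() for p in combinations(qs, 2)}
    tt = {tuple(sorted(p)) for ts in by_query.values() for p in combinations(ts, 2)}
    return qt, qq, tt

qt, qq, tt = paraphrase_candidates([("q1", "t1"), ("q1", "t2"), ("q2", "t1")])
```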
Query Logs with User Behaviors (cont.)
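Example (figure: queries q1, q2 and the titles t1, t2 they hit):
关于 草原 的 诗词 ("poems about the grassland")
描写 草原 的 诗句 ("verses describing the grassland")
有关 草原 的 诗歌 ("poetry concerning the grassland")
赞美 大 草原 的 诗 ("poems praising the great grassland"), q2
……
Paraphrases:
query-title: <q1, t1>, <q1, t2>, <q2, t1>
query-query: <q1, q2>
title-title: <t1, t2>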
Zhao et al. Paraphrasing with Search Engine Query Logs.
Query Logs with User Behaviors (cont.)
• Step-1: extracting <q, t> paraphrases– Extracting candidate <q, t> pairs from query logs
– Paraphrase validation based on binary classification
• Combining multiple features
• Step-2: extracting <q, q> paraphrases– Extracting candidate <q, q> from <q, t> paraphrases
– Paraphrase validation based on binary classification
• Step-3: extracting <t, t> paraphrases – Extracting candidate <t, t> from <q, t> paraphrases
– Paraphrase validation based on binary classification
Zhao et al. Paraphrasing with Search Engine Query Logs.
User Generated Content (UGC)
UGC
cQA
Blogs / microblogs
Forums
Online encyclopedia
Forums
• Example: Mining QA pairs from forums–Cong et al., SIGIR-2008
–Find question-answer pairs from forum thread• Motivation:
– Initiating post usually contains questions, while reply posts may contain answers
• Two main stages
– Question detection
– Answer detection
Cong et al. Finding Question-Answer Pairs from Online Forums.
Forums (cont.)
• Example: Mining QA pairs from forums (cont.)–Question detection
• Non-trivial problem
– Cannot simply rely on question marks or question words
• Classification-based method
– Feature: Labeled Sequential Patterns (LSPs) extracted from questions and non-questions
» E.g., <what, do, PRP, VB>→Q
Cong et al. Finding Question-Answer Pairs from Online Forums.
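Matching an LSP against a POS-tagged sentence can be sketched as an ordered subsequence test. This assumes that pattern items may match either a word or a POS tag and that gaps between items are allowed; the function name and the tagged sentence are hypothetical.

```python
from typing import List, Tuple

def matches_lsp(pattern: List[str], tokens: List[Tuple[str, str]]) -> bool:
    """True if every pattern item matches a word or POS tag, in order, gaps allowed."""
    remaining = iter(tokens)
    # Each item consumes tokens from the shared iterator until it finds a match.
    return all(any(item in (word.lower(), pos) for word, pos in remaining)
               for item in pattern)

sentence = [("What", "WP"), ("do", "VBP"), ("you", "PRP"), ("recommend", "VB")]
is_question_like = matches_lsp(["what", "do", "PRP", "VB"], sentence)
```

In a classifier, each mined pattern would become one binary feature whose value is the result of this test.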
Forums (cont.)
• Example: Mining QA pairs from forums (cont.)–Answer detection
• Problems:
– Multiple questions and answers are interwoven
– 1-question vs. n-answer; n-question vs. 1-answer
• Graph-based method, which considers:
– Rank of candidate answers for a given query
– Relationship of candidate answers
– Forum-specific feature: distance of a candidate answer from the question
Cong et al. Finding Question-Answer Pairs from Online Forums.
Community-based QA (cQA)
• Research on cQA–Search and Recommendation
• cQA retrieval model
• Question similarity computation
• Multi-sentence question segmentation
• ……
–Quality estimation• Question quality
• Answer quality
• User (questioner / answerer) quality
Community-based QA (cQA) (cont.)
• Example-1: CQA retrieval model–Xue et al., SIGIR-2008
–Basic idea: • In cQA retrieval, both question parts and answer parts
should be modeled
–Propose a mixed model:• A translation-based language model for the question
part
• A query likelihood approach for the answer part
Xue et al. Retrieval Models for Question and Answer Archives.
Community-based QA (cQA) (cont.)
• Example-1: CQA retrieval model (cont.)–Translation-based language model for the
question part
Xue et al. Retrieval Models for Question and Answer Archives.
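$$P(\mathbf{q} \mid (q,a)) = \prod_{w \in \mathbf{q}} P(w \mid (q,a))$$
$$P(w \mid (q,a)) = \frac{|(q,a)|}{|(q,a)| + \lambda} P_{mx}(w \mid (q,a)) + \frac{\lambda}{|(q,a)| + \lambda} P_{ml}(w \mid C)$$
$$P_{mx}(w \mid (q,a)) = (1 - \beta) P_{ml}(w \mid q) + \beta \sum_{t \in q} P(w \mid t) P_{ml}(t \mid q)$$
Here $\mathbf{q}$ is the user query, (q, a) is an archived question-answer pair, and C is the collection.
• Second equation: smoothing
• $(1 - \beta)$ term: language model
• $\beta$ term: translation model trained with Q-A pairs in cQA archives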
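A small numeric sketch of the translation-based language model for the question part. The translation table, the token lists, and the values of β and λ are toy assumptions; a real table P(w|t) would be trained on Q-A pairs.

```python
from collections import Counter
from typing import Dict, List

def p_ml(word: str, tokens: List[str]) -> float:
    """Maximum-likelihood estimate of a word in a token list."""
    return Counter(tokens)[word] / len(tokens) if tokens else 0.0

def p_mx(word: str, question: List[str],
         trans: Dict[str, Dict[str, float]], beta: float) -> float:
    """(1 - beta) * P_ml(w|q) + beta * sum over t in q of P(w|t) * P_ml(t|q)."""
    translated = sum(trans.get(t, {}).get(word, 0.0) * p_ml(t, question)
                     for t in set(question))
    return (1 - beta) * p_ml(word, question) + beta * translated

def score(query: List[str], question: List[str], answer: List[str],
          collection: List[str], trans: Dict[str, Dict[str, float]],
          beta: float = 0.5, lam: float = 1.0) -> float:
    """P(query | (q, a)) with collection smoothing, as a product over query words."""
    n = len(question) + len(answer)          # |(q, a)|
    prob = 1.0
    for w in query:
        prob *= (n / (n + lam)) * p_mx(w, question, trans, beta) \
                + (lam / (n + lam)) * p_ml(w, collection)
    return prob

trans = {"fix": {"repair": 0.5}}             # hypothetical P(w | t)
s = score(["repair"], ["fix", "laptop"], ["reboot", "it"],
          ["fix", "laptop", "repair", "phone"], trans)
```

The query word "repair" never occurs in the archived question, yet the pair still receives probability mass through the translation of "fix", which is the lexical-gap problem the model addresses.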
Community-based QA (cQA) (cont.)
• Example-1: CQA retrieval model (cont.)– Incorporating the answer part
Xue et al. Retrieval Models for Question and Answer Archives.
$$P_{mx}(w \mid (q,a)) = (1 - \beta) P_{ml}(w \mid q) + \beta \sum_{t \in q} P(w \mid t) P_{ml}(t \mid q)$$
is extended to
$$P_{mx}(w \mid (q,a)) = \alpha P_{ml}(w \mid q) + \beta \sum_{t \in q} P(w \mid t) P_{ml}(t \mid q) + \gamma P_{ml}(w \mid a)$$
• $\gamma$ term: query likelihood model for the answer part
Community-based QA (cQA) (cont.)
• Example-2: Answer quality prediction– Jeon et al., SIGIR-2006
–Background:• There are plenty of bad answers in cQA archives
– Some users answer nonsense
– Some answers contain irrelevant advertisements
– Examples:
Jeon et al. A Framework to Predict the Quality of Answers with Non-Textual Features.
Q: What is the minimum positive real number in Matlab?
A: Your IQ
Q: What is new in Java2.0?
A: Nothing new
Q: Can I get a router if I have a usb dsl modem?
A: Good question but I do not know
Community-based QA (cQA) (cont.)
• Example-2: Answer quality prediction (cont.)–Measure answer quality with non-textual
features
Jeon et al. A Framework to Predict the Quality of Answers with Non-Textual Features.
Answerer’s Acceptance ratio Answer Length
Questioner’s Self Evaluation Answerer’s Activity Level
Answerer’s Category Specialty Print Count
Copy Count Users’ Recommendation
Editor’s Recommendation Sponsor’s Answer
Click Counts Number of Answers
Users’ Dis-Recommendation
Community-based QA (cQA) (cont.)
• Example-2: Answer quality prediction (cont.)–Maximum entropy for answer quality estimation
– Integrated into the cQA retrieval model
Jeon et al. A Framework to Predict the Quality of Answers with Non-Textual Features.
$$p(y \mid x) = \frac{1}{Z(x)} \exp\left[ \sum_{i=1}^{13} \lambda_i f_i(x, y) \right]$$
$$P(D \mid Q) = P(D) \prod_{w \in Q} P(w \mid D)$$
$$P(D) = p(y \mid x = D)$$
P(D): prior probability of D
Online Encyclopedia
• How to make use of online encyclopedia?–Clean and semi-structured data
• Applications: information extraction, relation extraction, summarization…
–Links among entries• Applications: WSD, lexical reference rules extraction,
NE recognition…
–Revision history• Applications: Sentence compression, sentence
simplification…
Online Encyclopedia (cont.)
• Example-1: Information extraction–Wu and Weld, ACL-2010
–Main idea:• Generate relation-specific training examples by
matching Infobox attribute values to corresponding sentences
• Main modules:
– Preprocessor
– Matcher
– Learner
Wu and Weld. Open Information Extraction using Wikipedia.
Online Encyclopedia (cont.)
• Example-1: Information extraction (cont.)–An example
Wu and Weld. Open Information Extraction using Wikipedia.
Online Encyclopedia (cont.)
• Example-1: Information extraction (cont.)– Preprocessor
• Sentence splitting
• NLP annotation
• Compiling synonyms
– Using wikipedia redirection pages and backward links
–Matcher• Match target entity
– Full match, synonym match, partial match, type match, pronoun match…
• Match sentences
– Seek a unique sentence to match the attribute value
Wu and Weld. Open Information Extraction using Wikipedia.
Online Encyclopedia (cont.)
• Example-1: Information extraction (cont.)– Learn two kinds of extractors
• Extractor-1:
– Using dependency parse tree features
– Higher accuracy but slower
• Extractor-2:
– Using only shallow features, such as POS tags
– Lower accuracy but faster
Wu and Weld. Open Information Extraction using Wikipedia.
Online Encyclopedia (cont.)
• Example-2: Multilingual NER
–Richman and Schone, ACL-2008
–Main idea:
• English NE categorization
– Use category links
• Foreign NE categorization
– Find counterpart in English Wikipedia pages
– Use category links for English NE categorization
– Foreign NE and its English counterpart share identical NE type
Richman and Schone. Mining Wiki Resources for Multilingual Named Entity Recognition.
Online Encyclopedia (cont.)
Example: Jacqueline Bhabha
• Categories: British lawyers | Jewish American writers | Indian Jews
• Further categories: Lawyers by nationality | American writers by ethnic or national origin | Jewish writers | British legal professionals | Indian people by religion | Indian people by ethnic or national origin
• The category names contain key phrases for PER
Online Encyclopedia (cont.)
Example: Erquy (French → English)
• French categories: Commune des Côtes-d'Armor | Ville portuaire de France | Port de plaisance | Station balnéaire française
• English categories: Category:Communes of Côtes-d'Armor | Category:Port cities and towns in France | Category:Marinas | Category:Seaside resorts in France
• Further English categories: Category:Towns in Brittany | Category:Cities in France | Category:Coastal construction | Category:Seaside resorts
• Easy to identify as GPE
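A rough sketch of category-based NE typing, using the two examples above. The key-phrase lists and the voting rule are assumptions chosen for illustration; they are not the lists or the procedure from Richman and Schone.

```python
from collections import Counter

# Hypothetical key phrases per NE type.
KEY_PHRASES = {
    "PER": {"lawyers", "writers", "people", "jews"},
    "GPE": {"communes", "cities", "towns"},
}

def categorize(categories):
    """Vote for an NE type: one vote per category containing a key phrase."""
    votes = Counter()
    for category in categories:
        words = set(category.lower().split())
        for ne_type, phrases in KEY_PHRASES.items():
            if words & phrases:
                votes[ne_type] += 1
    return votes.most_common(1)[0][0] if votes else None

person = categorize(["British lawyers", "Jewish American writers", "Indian Jews"])
place = categorize([
    "Communes of Côtes-d'Armor",
    "Port cities and towns in France",
    "Marinas",
    "Seaside resorts in France",
])
```

For a foreign-language page, the same function would be applied to the categories of its English counterpart, and the resulting type carried back to the foreign entity.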
Online Encyclopedia (cont.)
• Example-3: Sentence compression
–Yamangil and Nelken, ACL-2008
–Key problem for sentence compression
• Data sparseness
• Ziff-Davis corpus: 1067 sentence pairs
–Main idea of this work
• Abundant sentence compressions can be extracted from Wikipedia’s revision history, and used as training data
• >380,000 sentence pairs are extracted
Yamangil and Nelken. Mining Wikipedia Revision Histories for Improving Sentence Compression.
Online Encyclopedia (cont.)
• Example-3: Sentence compression (cont.)
–Assumption:
• All edits retain the core meaning of the sentence
–Focus only on sentence-level edits that add or drop words
–Train a lexicalized channel model for sentence compression
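A minimal sketch of how add/drop edits might be picked out of a revision history. It assumes whitespace tokenization and treats a revision pair as a candidate only when the shorter sentence is obtained from the longer one purely by deleting words; the function names are hypothetical and this is not the authors' extraction pipeline.

```python
def is_subsequence(short, long):
    """True if all tokens of `short` occur in `long` in the same order."""
    remaining = iter(long)
    return all(token in remaining for token in short)

def compression_pair(before, after):
    """Return (long sentence, short sentence) if the edit only adds or
    drops words, otherwise None."""
    a, b = before.split(), after.split()
    if len(a) == len(b):
        return None
    long, short = (a, b) if len(a) > len(b) else (b, a)
    if is_subsequence(short, long):
        return " ".join(long), " ".join(short)
    return None

pair = compression_pair("The very old bridge was closed", "The bridge was closed")
```

Both directions of an edit yield the same pair, since an insertion read backwards is a deletion.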
User Generated Content (UGC)
UGC
cQA
Blogs / microblogs
Forums
Online encyclopedia
Blogs / Microblogs
• Research on blogs / microblogs
–Adaptation of conventional NLP techniques
• Tokenization, POS tagging, spelling error correction,…
–User profile learning
–Search and recommendation
–Sentiment analysis
–Monitoring emergency events
–……
Blogs / Microblogs (cont.)
• Example-1: POS tagging for Twitter
–Gimpel et al., ACL-2011
–Background:
• Conventional NLP tools, typically trained on news text, perform poorly on Twitter
–This work:
• Produce an English POS tagger designed especially for Twitter data
Gimpel et al. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments.
Blogs / Microblogs (cont.)
• Example-1: POS tagging for Twitter (cont.)
–Tagset designed for Twitter
Blogs / Microblogs (cont.)
• Example-1: POS tagging for Twitter (cont.)
–Twitter-specific features
• TWORTH: Twitter orthography
– Regular expression rules to detect the #, @, and U tags (hashtags, at-mentions, and URLs or email addresses)
• NAMES: Frequently-capitalized tokens
– If a token is frequently capitalized
• TAGDICT: Traditional tag dictionary
– Token’s POS tags from conventional corpora
• DISTSIM: Distributional similarity
– Similar words of the token
• METAPH: Phonetic normalization
– E.g., {thanks thangs thanksss …} -> 0NKS
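The orthography feature can be sketched with a few regular expressions. The patterns below are illustrative assumptions, not the rules used by Gimpel et al., and email addresses are left out for brevity.

```python
import re

# Illustrative patterns only (assumed, not taken from the paper).
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def orthography_features(token):
    """Binary orthographic features for a single token."""
    return {
        "hashtag": HASHTAG.fullmatch(token) is not None,
        "at_mention": MENTION.fullmatch(token) is not None,
        "url": URL.fullmatch(token) is not None,
    }

tweet = "@bob check http://t.co/abc #nlp"
features = {token: orthography_features(token) for token in tweet.split()}
```

In a tagger, these booleans would be fed in alongside the other feature families listed above.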
Blogs / Microblogs (cont.)
• Example-2: Sentiment classification
– Jiang et al., ACL-2011
–Sentiment classification for Twitter should be target-dependent
• E.g.,
– "People everywhere love Windows & Vista. Bill Gates"
– "Windows 7 is much better than Vista"
Jiang et al. Target-dependent Twitter Sentiment Classification.
Blogs / Microblogs (cont.)
• Example-2: Sentiment classification (cont.)
– Sentiment classification for Twitter should be context-aware
• E.g.,
– [tweet_n] First game: Lakers! (Positive)
– [tweet_n-1] I love Lakers, I love Kobe!! (Positive)
– [Retweet] Lakers won the game, Great! (Positive)
– [Reply] I love Lakers too. (Positive)
Blogs / Microblogs (cont.)
• Example-2: Sentiment classification (cont.)
–Method of this work
• Target-dependent features
– Use features based on syntactic parse trees
» E.g., verb+target(obj); target(sub)+verb; adjective+target…
– Binary SVM classifier
• Graph-based sentiment optimization
– Three kinds of related tweets:
» Retweets; reply; tweets containing the target from the same person
– Construct a graph with related tweets and classify tweets on the graph
Blogs / Microblogs (cont.)
• Example-3: User grouping
–Qu and Liu, ACL-2011
–Motivation:
• People in a Twitter user’s following list need to be grouped when the list is long
–Main idea:
• Provide some seeding friends for each target class, and automatically group other friends similar / related to the seeds
Qu and Liu. Interactive Group Suggesting for Twitter.
Blogs / Microblogs (cont.)
• Example-3: User grouping (cont.)
–Two sub-systems for the task
• Sub-system 1:
– Content based sub-system
» Compute similarity between each friend and the seeding friends based on their tweet contents
• Sub-system 2:
– Friend based sub-system
» Use the count of bi-directional friend relationships and mentions between each friend and seeding friends as the score for ranking
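The content-based sub-system can be sketched as bag-of-words cosine similarity between each friend's tweets and the seeding friends' tweets. The tokenization, the pooling of all seed tweets into one vector, and the function names are assumptions for illustration, not details from Qu and Liu.

```python
import math
from collections import Counter

def bag_of_words(tweets):
    """Lowercased whitespace tokens pooled over a list of tweets."""
    return Counter(word for tweet in tweets for word in tweet.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_friends(friend_tweets, seed_tweets):
    """Rank friends by similarity of their tweets to the seeds' tweets."""
    seed_vector = bag_of_words(seed_tweets)
    scores = {f: cosine(bag_of_words(t), seed_vector) for f, t in friend_tweets.items()}
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank_friends(
    {"a": ["NLP parsing models"], "b": ["football match tonight"]},
    ["nlp models"],
)
```

The friend-based sub-system would replace the cosine score with counts of bi-directional friend relationships and mentions.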
Outline
• Part I–Background
• Part II–New Applications & Trends
• Part III–New Data & Resources
• Part IV–New Difficulties & Challenges
• Part V–New Methodologies & Solutions
New Difficulties & Challenges
• Storage and computing capability
• Rapid evolution of languages
• Noise and errors in the web corpora
• Information credibility
Storage and Computing Capability
• Large-scale data sources
– Web page corpora
– Web search query logs
– Language model
– ……
• Two questions:
– How to store the data and access them easily?
– How to process such massive data efficiently?
• Solutions:
– Pruning and filtering
– Efficient algorithm
– Distributed computing
Storage and Computing Capability (cont.)
• Example-1: machine translation
–Scale of data:
• Translation model trained with 100 million sentence pairs
– Size of model: ≈20 GB
• Language model (5-gram) trained with 100 million sentences
– Size of model: ≈5 GB
• Models used by online translation systems are usually larger than those above!
Storage and Computing Capability (cont.)
• Example-1: machine translation
–Pruning phrase table (PT) [Johnson et al., 2007]
• Observation:
– Many phrase pairs in PT are wrong or will never be used
• Disadvantage of large PT
– Requires more resources and time to process
– Requires more features and more sophisticated search
• Work in this paper:
– Prune PT based on significance testing
Johnson et al. Improving Translation Quality by Discarding Most of the Phrasetable.
Storage and Computing Capability (cont.)
• Example-1: machine translation
–Pruning phrase table (PT) [Johnson et al., 2007]
• Fisher’s exact test:
$p_h(C(\tilde{s},\tilde{t})) = \frac{\binom{C(\tilde{s})}{C(\tilde{s},\tilde{t})} \binom{N - C(\tilde{s})}{C(\tilde{t}) - C(\tilde{s},\tilde{t})}}{\binom{N}{C(\tilde{t})}}$
• p-value:
$\text{p-value}(C(\tilde{s},\tilde{t})) = \sum_{k = C(\tilde{s},\tilde{t})}^{\infty} p_h(k)$
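The test can be computed directly from four counts. The sketch below assumes N is the number of sentence pairs in the parallel corpus, C(s) and C(t) the numbers of pairs containing the source and target phrase, and C(s,t) their co-occurrence count. Exact integer arithmetic is used for clarity; at real corpus sizes a log-space computation would be preferable.

```python
from math import comb

def hypergeom_pmf(k, c_s, c_t, n):
    """p_h(k): probability of exactly k co-occurrences under independence."""
    return comb(c_s, k) * comb(n - c_s, c_t - k) / comb(n, c_t)

def p_value(c_st, c_s, c_t, n):
    """Sum p_h(k) for k >= C(s,t); terms beyond min(C(s), C(t)) are zero."""
    upper = min(c_s, c_t)
    return sum(hypergeom_pmf(k, c_s, c_t, n) for k in range(c_st, upper + 1))

# A phrase pair seen once, whose source and target phrases each occur once.
singleton = p_value(1, 1, 1, 1000)
```

A phrase pair would be discarded when its p-value is above a chosen threshold, i.e. when the observed co-occurrence is not significant.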
Storage and Computing Capability (cont.)
• Example-2: machine translation
–Parallel decoding and distributed language model
• Li et al., 2009
• Parallel decoding
– Exploit multi-core and multi-processor architectures
– Translate multiple sentences in separate threads
– Store the language model and translation grammar in shared memory
• Distributed language model
– Reduce memory pressure
– Use larger language model
Li et al. Decoding in Joshua: Open Source Parsing-based Machine Translation.
Storage and Computing Capability (cont.)
• Example-3: paraphrase acquisition
–Basic idea:
• Hypothesis: Paraphrases should appear in similar contexts (distributional similarity)
• Extract paraphrase phrases whose contextual vectors are similar to each other
– E.g., X acquired Y and X completed the acquisition of Y
• Data scale:
– 150GB monolingual corpus
– Check every pair of phrases to see whether they are paraphrases
Bhagat and Ravichandran. Large Scale Acquisition of Paraphrases for Learning Surface Patterns.
Storage and Computing Capability (cont.)
• Example-3: paraphrase acquisition (cont.)
–Apply Locality Sensitive Hashing (LSH) to speed up computation [Bhagat and Ravichandran, 2008]
• Represents a d-dimensional vector by a stream of b bits ($b \ll d$) and has the property of preserving the cosine similarity between vectors
• Reduce time complexity from $O(n^2 d)$ to $O(nd)$ (n: #phrases)
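A minimal sketch of the bit-signature idea, using random-hyperplane hashing, a standard LSH scheme for cosine similarity. The function names, the dimensions, and the toy vectors are assumptions, and the fast search over signatures that yields the overall speed-up is not shown.

```python
import math
import random

def make_hyperplanes(d, b, seed=0):
    """b random Gaussian hyperplanes in d dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(b)]

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return [1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0 for plane in planes]

def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the fraction of differing bits."""
    hamming = sum(x != y for x, y in zip(sig_a, sig_b))
    return math.cos(math.pi * hamming / len(sig_a))

d, b = 50, 16
planes = make_hyperplanes(d, b)
u = [float(i) for i in range(1, d + 1)]
v = [-x for x in u]
same = estimated_cosine(signature(u, planes), signature(u, planes))
opposite = estimated_cosine(signature(u, planes), signature(v, planes))
```

Two vectors are then compared on b bits rather than d real-valued dimensions, and the estimate becomes more accurate as b grows.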
New Difficulties & Challenges
• Storage and computing capability
• Rapid evolution of languages
• Noise and errors in the web corpora
• Information credibility
Rapid evolution of languages
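• Languages change rapidly due to the Web:
–New words
• E.g., 给力 ("awesome"), 雷人 ("shocking")
–New senses
• E.g., 粉丝 ("vermicelli", now also "fans"), 玉米 ("corn", now also fans of the singer Li Yuchun)
–New named entities
• E.g., 旭日阳刚 (Xuri Yanggang), 筷子兄弟 (Chopstick Brothers)
–Chat language
• E.g., 818 ("to gossip about"), 有木有 (a variant of 有没有, "is there or not")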
Rapid evolution of languages (cont.)
• Challenges posed to NLP:
–Word segmentation
• Out-Of-Vocabulary problem
–Named entity recognition
• More categories should be covered
–Word sense disambiguation
• New senses of words need to be recognized
–Chat language normalization
• Normalize chat language to natural language
Rapid evolution of languages (cont.)
• Example-1: Entity extraction
–Pennacchiotti and Pantel, EMNLP-2009
–Two extractors:
• Pattern-based extractor
– Find instances of a given relation with seed instances
– E.g., act-in (Actor, Movie)
• Distributional extractor
– Construct a context vector for each noun
– Sample seed entities of a given entity class C
– Compute context vector similarity with the seeds
– Similar nouns are returned and classified into C
–Take the union of the entities extracted above
Pennacchiotti and Pantel. Entity Extraction via Ensemble Semantics.
Rapid evolution of languages (cont.)
• Example-1: Entity extraction
–Pennacchiotti and Pantel, EMNLP-2009
–ML-based ranker
• Regression model
– Features:
• Feature classes:
– Frequency, Co-occurrence, Distributional, Pattern, Termness
• Features are extracted from:
– Web corpus, query logs, web tables, Wikipedia
Rapid evolution of languages (cont.)
• Example-2: Entity extraction and clustering
– Jain and Pennacchiotti, COLING-2010
–Extract entities from web search query logs
• Heuristic:
– Query term sequence with uppercase characters
• Filtering:
– Web-based representation score
» Checks if the case-sensitive representation of the candidate is the most likely representation
– Query-log-based standalone score
» Counts the occurrences of the standalone forms of the candidate in the query logs
Jain and Pennacchiotti. Open Entity Extraction from Web Search Query Logs.
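The capitalization heuristic and the standalone score can be sketched as follows. The regular expression, the function names, and the three-query log are assumptions for illustration; the web-based representation score is not covered.

```python
import re

# A run of one or more capitalized tokens (assumed pattern).
CAPITALIZED_RUN = re.compile(r"\b[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*")

def candidate_entities(query):
    """Maximal sequences of capitalized terms in a query."""
    return CAPITALIZED_RUN.findall(query)

def standalone_score(candidate, query_log):
    """Number of queries consisting of the candidate alone (case-insensitive)."""
    target = candidate.lower()
    return sum(1 for query in query_log if query.strip().lower() == target)

log = ["New York", "new york", "cheap flights to New York"]
found = candidate_entities("cheap flights to New York")
score = standalone_score("New York", log)
```

Candidates with a low standalone score would be filtered out as unlikely to be entities in their own right.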
Rapid evolution of languages (cont.)
• Example-2: Entity extraction and clustering
– Jain and Pennacchiotti, COLING-2010
–Entity clustering
• Features:
– Context feature space
» Hypothesis: similar queries should appear in similar contexts in the query logs
– Clickthrough feature space
» Hypothesis: similar queries should generate clicks on similar urls
– Hybrid feature space
» Normalized union of the two feature spaces above
Rapid evolution of languages (cont.)
• Example-3: Chinese chat text normalization
–Xia et al., ACL-2006
–Problem:
• Normalize Chinese chat text to natural language
– E.g., 介里 -> 这里 ("here"); 偶 -> 我 ("I")
–Characteristics:
• Anomalous
– Anomalous words or anomalous usage
• Dynamic
– Changes quickly from year to year
Xia et al. A Phonetic-Based Approach to Chinese Chat Text Normalization.
Rapid evolution of languages (cont.)
• Example-3: Chinese chat text normalization
–Baseline model:
• Source-channel model
– $\hat{C} = \arg\max_{C} p(C \mid T) = \arg\max_{C} \frac{p(T \mid C)\, p(C)}{p(T)}$
• Disadvantages:
– Data sparseness
– Training effectiveness is poor due to the dynamic nature
–Observation:
• Most Chinese chat terms are created via phonetic transcription
Rapid evolution of languages (cont.)
• Example-3: Chinese chat text normalization
–Extended source channel model:
• Inserting phonetic mapping model
• $\hat{C} = \arg\max_{C,M} p(C \mid M, T) = \arg\max_{C,M} \frac{p(T \mid M, C)\, p(M \mid C)\, p(C)}{p(T)}$
• Chat term normalization observation model:
– $p(T \mid M, C) = \prod_{i} p(t_i \mid m_i, c_i)$
• Phonetic mapping model:
– $p(M \mid C) = \prod_{i} p(m_i \mid c_i)$, where $p(m_i \mid c_i)$ is the phonetic mapping probability
Rapid evolution of languages (cont.)
• Example-3: Chinese chat text normalization
–Phonetic mapping model:
• $\mathrm{Prob}_{pm}(t, c) = \frac{fr_{slc}(c) \times ps(t, c)}{\sum_{i} \left( fr_{slc}(c_i) \times ps(t, c_i) \right)}$
– $fr_{slc}(c)$: frequency of character c in a standard language corpus
– $ps(t, c)$: phonetic similarity between two characters
» $ps(t, c) = sim(py(t), py(c)) = sim(initial(py(t)), initial(py(c))) \times sim(final(py(t)), final(py(c)))$
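A small worked sketch of the phonetic mapping probability. All numbers are made up for illustration: the corpus frequencies and the similarity values are assumptions, and the similarity is given as a lookup table rather than computed from pinyin initials and finals.

```python
def phonetic_mapping_prob(t, c, candidates, freq, ps):
    """Frequency-weighted phonetic similarity, normalized over all candidates."""
    denom = sum(freq[ci] * ps(t, ci) for ci in candidates)
    return freq[c] * ps(t, c) / denom if denom else 0.0

# Hypothetical frequencies of standard characters in a standard language corpus.
FREQ = {"这": 900, "借": 100}
# Hypothetical phonetic similarities between the chat character and each candidate.
SIMILARITY = {("介", "这"): 0.5, ("介", "借"): 1.0}

def ps(t, c):
    return SIMILARITY.get((t, c), 0.0)

candidates = list(FREQ)
p_zhe = phonetic_mapping_prob("介", "这", candidates, FREQ, ps)
p_jie = phonetic_mapping_prob("介", "借", candidates, FREQ, ps)
```

With these toy numbers the frequent character 这 wins despite its lower similarity, which shows how corpus frequency and phonetic similarity trade off in the formula.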
New Difficulties & Challenges
• Storage and computing capability
• Rapid evolution of languages
• Noise and errors in the web corpora
• Information credibility
Noise and Errors in the Web Corpora
• Query logs
–Queries are not well-formed sentences
– 10-15% of queries contain misspelled terms [Cucerzan and Brill, 2004]
• UGC data
–Noise and errors are common in forums, community-based QA, blogs, microblogs
• Presents challenges to NLP research
–E.g., word segmentation, POS tagging, parsing
Cucerzan and Brill. Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users.
Noise and Errors in the Web Corpora (cont.)
• Spelling correction
–Non-word spelling correction
• Word not found in pre-compiled lexicon
• Words similar to the misspelled word are candidate spelling corrections
• Statistical error models are more effective
–Real-word spelling correction
• Incorrect usage of a valid word in a given context
– A ___ of cake (peace / piece)
• Generate candidate corrections with a pre-defined confusion set
• Rank candidates according to contextual information
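Real-word correction with a confusion set can be sketched in a few lines. The confusion sets and bigram counts below are hypothetical stand-ins for a curated resource and a real language model.

```python
# Hypothetical confusion sets.
CONFUSION_SETS = [{"peace", "piece"}, {"their", "there"}]

# Hypothetical bigram counts standing in for a real language model.
BIGRAM_COUNTS = {
    ("a", "piece"): 50, ("piece", "of"): 80,
    ("a", "peace"): 2, ("peace", "of"): 5,
}

def correct_word(tokens, i):
    """Replace tokens[i] with the confusion-set member that best fits its context."""
    word = tokens[i]
    candidates = next((s for s in CONFUSION_SETS if word in s), {word})

    def context_score(candidate):
        left = BIGRAM_COUNTS.get((tokens[i - 1], candidate), 0) if i > 0 else 0
        right = BIGRAM_COUNTS.get((candidate, tokens[i + 1]), 0) if i + 1 < len(tokens) else 0
        return left + right

    return max(sorted(candidates), key=context_score)

corrected = correct_word("a peace of cake".split(), 1)
```

Words outside every confusion set are returned unchanged, so the check only fires where a known confusion is possible.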
Noise and Errors in the Web Corpora (cont.)
• Query spelling correction
–Sun et al., ACL-2010
–Collect training data from clickthrough data of query logs
• Collect cases in which a user submits a query and clicks on the spelling suggestion
– “did you mean” function
– 3 million query-correction pairs are collected in this way
–Use the collected data to train a spelling correction system
Sun et al. Learning Phrase-Based Spelling Error Models from Clickthrough Data.
Noise and Errors in the Web Corpora (cont.)
• Query spelling correction
–Sun et al., ACL-2010
–Model:
• Ranking model
– Two-layer neural net with 5 hidden nodes
• 96 ranking features
– Language model
– Error model
» Edit distance model, phonetic model
– Phrase-based error model
– ……
Noise and Errors in the Web Corpora (cont.)
• Query spelling correction
–Sun et al., ACL-2010
–Phrase-based error model
• Similar to SMT model
• “translate” a correct query C into a misspelled query Q
• Given:
– Segment C into K phrases: c1,…,cK
– T: K replacement phrases: q1,…,qK
• Model:
– $P(Q \mid C) \approx \max_{(S,T,M) \in B(C,Q,A^*)} P(T \mid C, S)$ ($A^*$: alignment)
– $P(T \mid C, S) = \prod_{k=1}^{K} P(q_k \mid c_k)$
Noise and Errors in the Web Corpora (cont.)
• Query spelling correction
–Sun et al., ACL-2010
–Features:
• Phrase transformation feature
– $P(q \mid c) = \frac{N(c, q)}{\sum_{q'} N(c, q')}$
• Lexical weight feature
– $P_w(q \mid c, A) = \prod_{i=1}^{|q|} \frac{1}{|\{ j \mid (j, i) \in A \}|} \sum_{\forall (j,i) \in A} t(q_i \mid c_j)$
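The phrase transformation feature is a relative frequency over aligned phrase pairs, and the score of one segmentation is a product over its K phrases. A minimal sketch, assuming the phrase pairs have already been extracted; the toy pairs and function names are hypothetical.

```python
from collections import Counter
from math import prod

def phrase_probs(pairs):
    """P(q|c) = N(c,q) / sum over q' of N(c,q'), from (c, q) phrase pairs."""
    pair_counts = Counter(pairs)
    totals = Counter(c for c, _ in pairs)
    return {(c, q): n / totals[c] for (c, q), n in pair_counts.items()}

def transformation_prob(c_phrases, q_phrases, probs):
    """P(T|C,S) = product over k of P(q_k|c_k) for one segmentation."""
    return prod(probs.get((c, q), 0.0) for c, q in zip(c_phrases, q_phrases))

# Toy (correct phrase, observed phrase) pairs.
pairs = [
    ("britney", "britny"), ("britney", "britney"), ("britney", "britney"),
    ("spears", "spears"), ("spears", "speers"),
]
probs = phrase_probs(pairs)
p = transformation_prob(["britney", "spears"], ["britny", "speers"], probs)
```

The full model would take the maximum of this product over all segmentations consistent with the alignment.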
New Difficulties & Challenges
• Storage and computing capability
• Rapid evolution of languages
• Noise and errors in the web corpora
• Information credibility
Information Credibility
• Credibility problem
– It becomes easy to produce and disseminate information in the Web 2.0 age
• Personal web sites
• Forums
• Blogs / microblogs
• ……
– It becomes difficult to control the authority and credibility of the information (sources)
• Deceptive, erroneous, subjective, outdated information
Information Credibility (cont.)
• Aspects of information credibility [Metzger, 2007]
–Accuracy
–Authority
• Trustworthiness, expertise
–Objectivity
–Currency
• Up-to-date
–Coverage
• Sufficient depth and breadth
Metzger. Making Sense of Credibility on the Web: Models for Evaluating Online Information and Recommendations for Future Research.
Information Credibility (cont.)
• Criteria for information credibility
– Information contents
• Detect logical consistency and contradiction
– Information sender
• Quality and quantity of information the sender has produced
–Document style and superficial characteristics
• Sentential style, page layout, links in page,…
–Social evaluation
• Mining users’ opinions and comments
Information Credibility Criteria Project: http://kc.nict.go.jp/project1/icc-project-description.html
Information Credibility (cont.)
• Example-1: Search engine
–Credibility of returned search results:
• Relatedness
• Web site authority
• Inward links
• User clicks
• Anti-cheating
• Currency
• ……
Information Credibility (cont.)
• Example-1: Search engine (cont.)
Information from credible sources is ranked higher
Information Credibility (cont.)
• Example-2: News
–Credibility of news:
• Credibility of the information sources
– E.g., Google News
• Example-3: E-commerce
–Credibility of products
• Users’ rating
• Users’ comments
Information Credibility (cont.)
• Example-3: E-commerce (cont.)
Users’ rating
Users’ comments
Outline
• Part I–Background
• Part II–New Applications & Trends
• Part III–New Data & Resources
• Part IV–New Difficulties & Challenges
• Part V–New Methodologies & Solutions
New Methodologies & Solutions
• Response to real-world demands
• Familiar with cutting-edge research
• Balance between data and algorithms
• Experimental platforms for real applications
Response to Real-world Demands
• Research rooted in real applications
• Research beyond engineering
• Feasibility & Expansibility
–Easy and robust solutions must come first
• Effectiveness vs. Efficiency
–Web-scale applications
• Leverage all available data and resources
Response to Real-world Demands (cont.)
• Example-1: Entity linking
–Application of word sense disambiguation (WSD)
–Conventional WSD research:
• Disambiguate given words in certain contexts
–Entity linking:
• Link entities appearing in documents with their referents in knowledge bases (e.g., Wikipedia)
• WSD is the key problem in entity linking
Response to Real-world Demands (cont.)
• Example-1: Entity linking (cont.)
–An example from [Han and Sun, 2011]
Han and Sun. A Generative Entity-Mention Model for Linking Entities with Knowledge Base.
Response to Real-world Demands (cont.)
• Example-2: Paraphrase acquisition
– Investigate multiple resources [Zhao et al., 2008]
• Thesaurus
• Monolingual parallel corpora
– Multiple translations of the same foreign novel
• Monolingual comparable corpora
– Comparable news articles reporting on the same event
• Bilingual parallel corpora
• Online dictionary definitions
• Query clusters
– Based on click-through information
Zhao et al. Combining Multiple Resources to Improve SMT-based Paraphrasing Model.
Response to Real-world Demands (cont.)
• Example-2: Paraphrase acquisition (cont.)
–Combine multiple resources [Zhao et al., 2008]
• Train a paraphrase table with each resource and combine them with an SMT model, which is then used in paraphrase generation
• Accuracy of the generated paraphrases is improved when combining multiple resources
Response to Real-world Demands (cont.)
• Example-3: Ensemble Semantics (ES)
–Pennacchiotti and Pantel, EMNLP-2009
–Ensemble semantics:
• A general framework for modeling information extraction algorithms that combine multiple sources of information and multiple extractors
• Advantages:
– Multiple sources of knowledge
– Multiple extractors
– Multiple sources of features
Pennacchiotti and Pantel. Entity Extraction via Ensemble Semantics.
Response to Real-world Demands (cont.)
• Example-3: Ensemble Semantics (ES)–Pennacchiotti and Pantel, EMNLP-2009
New Methodologies & Solutions
• Response to real-world demands
• Familiar with cutting-edge research
• Balance between data and algorithms
• Experimental platforms for real applications
Familiar with Cutting-edge Research
• Cutting-edge research
–New problem
• E.g., sentiment analysis, entailment
–New solution
• E.g., crowdsourcing
–New application
• E.g., microblogs
• In-depth analysis before entering a new field
Familiar with Cutting-edge Research (cont.)
• Example: Crowdsourcing
–Definition:
• The act of outsourcing tasks, traditionally performed by an employee or contractor, to an undefined, large group of people or community (a "crowd"), through an open call ----Wikipedia
• …it gathers those who are most fit to perform tasks, solve complex problems and contribute with the most relevant and fresh ideas ---- Jeff Howe
–Amazon’s Mechanical Turk:
• A crowdsourcing Internet marketplace that enables computer programmers (known as Requesters) to co-ordinate the use of human intelligence to perform tasks that computers are unable to do yet ---- Wikipedia
Familiar with Cutting-edge Research (cont.)
• Crowdsourcing (cont.)
–Crowdsourcing for NLP
• Speech recognition [Novotney and Callison-Burch, ’10]
• Machine translation [Zaidan and Callison-Burch, ’11]
• Paraphrase generation [Madnani, ’10]
• Anaphora resolution [Chamberlain et al., ’09]
• Word sense disambiguation [Akkaya et al., ’10]
• Lexicon construction [Irvine and Klementiev, ’10]
• Named entity recognition [Finin et al., ’10]
• Grammatical error detection [Madnani et al., ’11]
• ……
Familiar with Cutting-edge Research (cont.)
• Crowdsourcing (cont.)
–Advantage:
• Large and low-cost labor force
• Short turnaround time
• Access to foreign markets with native speakers
–Key problem: quality control
• Knowledge producers are non-professional
– Post-processing for getting high-quality knowledge
–Machine learning techniques for automatically selecting high-quality knowledge
Familiar with Cutting-edge Research (cont.)
• Crowdsourcing (cont.)
–Example: Collect Urdu-English translations
• Zaidan and Callison-Burch (ACL-2011)
• Pipeline: Source sentence → Collect translations → Post-edit and rank → Quality control → Best translation
Zaidan and Callison-Burch. Crowdsourcing Translation: Professional Quality from Non-Professionals.
Familiar with Cutting-edge Research (cont.)
• Crowdsourcing (cont.)
–Example: Collect Urdu-English translations
• Collect English translations for Urdu sentences
– Turkers are mainly from India and Pakistan
– Input sentences are converted into images to avoid cheating by using an automatic MT system
– Collect multiple translations for each source sentence
• Post-editing and ranking
– Turkers should be native English speakers
– Post-edit: Edit the translations to make them more fluent
– Rank: rank all translations
Familiar with Cutting-edge Research (cont.)
• Crowdsourcing (cont.)
–Example: Collect Urdu-English translations
• Automatically select the best translation from the candidate translations
• Features:
– Sentence-level features
» Language model features, sentence length features, web n-gram match percentage, web n-gram geometric average, edit rate to other translations
– Worker-level features
» Aggregate features, language ability, worker location
– Ranking features
» Average rank, is-best percentage, is-better percentage
– Worker calibration feature
Familiar with Cutting-edge Research (cont.)
• Crowdsourcing (cont.)
–Example: Collect Urdu-English translations
• Evaluation:
– Compute BLEU against professional translations
New Methodologies & Solutions
• Response to real-world demands
• Familiar with cutting-edge research
• Balance between data and algorithms
• Experimental platforms for real applications
Balance between Data and Algorithms
• Traditional research
–Limited data
– Sophisticated algorithms
• Have to mine “lean ore” for knowledge
• NLP for web applications
–Large-scale data
• Information redundancy is critical for statistical methods
–Relatively simple algorithms
• Easily acquire enough knowledge with lightweight methods
• Efficiently process large corpora
Balance between Data and Algorithms (cont.)
• Example: Open Information Extraction (OIE)
–Banko et al., IJCAI-2007
–Traditional IE systems:
• Use small and homogeneous corpora
– E.g., news corpora
• Rely on heavy linguistic technologies
– E.g., parsing, NER
• Relations of interest are specified beforehand
–Difficult to scale to the massive and heterogeneous web corpora
Banko et al. Open Information Extraction from the Web.
Balance between Data and Algorithms (cont.)
• Example: Open Information Extraction (OIE)
–Banko et al., IJCAI-2007
–The proposed method:
• Module-1: self-supervised learner
– Train a classifier with a small corpus, which labels candidate extractions as “trustworthy” or not
• Module-2: single-pass extractor
– Extract candidate extractions from a large corpus, which are then filtered with the learnt classifier
• Module-3: redundancy-based assessor
– Assign a probability to each retained extraction based on a probabilistic model of redundancy in text
Balance between Data and Algorithms (cont.)
• Example: Open Information Extraction (OIE)
–Banko et al., IJCAI-2007
–Module-1: self-supervised learner
• Parse a small corpus with a syntax parser, and extract noun phrase pairs with syntax paths
• Automatically label positive / negative examples based on constraints
– Path length, within sentence boundary, not solely pronoun
• Train a Naïve Bayes classifier with the training data
Balance between Data and Algorithms (cont.)
• Example: Open Information Extraction (OIE)
–Banko et al., IJCAI-2007
–Module-2: single-pass extractor
• Identify noun phrases with a lightweight NP chunker
• Relations are found by examining the text between noun phrases
– Non-essential phrases are filtered
• Candidate tuples are presented to the classifier; those labeled as positive are extracted and stored
Balance between Data and Algorithms (cont.)
• Example: Open Information Extraction (OIE)
–Banko et al., IJCAI-2007
–Module-3: redundancy-based assessor
• Extraction is performed over the entire corpus
• Merge tuples where both entities and relations are the same and count their occurrences in sentences
• Assign a probability based on the occurrence count
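The merge-and-count step can be sketched as below. The probability assignment here is a simple noisy-or stand-in, chosen only for illustration; it is not the probabilistic redundancy model used by Banko et al., and `per_mention_precision` is a hypothetical parameter.

```python
from collections import Counter

def assess(extractions, per_mention_precision=0.5):
    """Merge identical (entity, relation, entity) tuples, count them, and
    map each count k to 1 - (1 - p)^k (an assumed noisy-or score)."""
    counts = Counter(
        tuple(part.strip().lower() for part in triple) for triple in extractions
    )
    return {
        triple: (k, 1.0 - (1.0 - per_mention_precision) ** k)
        for triple, k in counts.items()
    }

extractions = [
    ("Edison", "invented", "the phonograph"),
    ("edison", "invented", "the phonograph"),
    ("Edison", "was born in", "Ohio"),
]
scores = assess(extractions)
```

Any scoring function that grows with the occurrence count would fit the same slot; the point is that repeated extraction across the corpus raises confidence.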
Balance between Data and Algorithms (cont.)
• Example: Open Information Extraction (OIE)
–Banko et al., IJCAI-2007
–Data scale:
• Extract facts from a 9 million web page corpus
• 60.5 million tuples were extracted
–Error rate:
System        Average error rate    Correct extractions
TEXTRUNNER    12%                   11,476
KNOWITALL     18%                   11,631
New Methodologies & Solutions
• Response to real-world demands
• Familiar with cutting-edge research
• Balance between data and algorithms
• Experimental platforms for real applications
Experimental Platforms for Real Applications
• Academia and industry should work together to establish evaluation platforms that resemble real applications
– Industry:
• Collect real-world application requirements
• Release real application data
– E.g., query logs
• Example-1: Yahoo! learning to rank challenge
–A platform for learning-to-rank research
–Dataset:
• Sampled from Yahoo! query logs
• Contains <query, url, features, relevance judgment>
– Queries, urls, and feature descriptions are not given, only feature values are
• Volume
Experimental Platforms for Real Applications (cont.)
• Example-2: Microsoft learning to rank dataset
–Similar to the Yahoo! dataset
– Sampled from Microsoft Bing query logs
• MSLR-WEB30K: more than 30,000 queries
• MSLR-WEB10K: 10,000 queries
• Form
– Query ID, url ID, feature descriptions and feature values are given
Experimental Platforms for Real Applications (cont.)
• Example-3: Query logs
–AOL query logs
• About 20M queries from about 650k users
• Users are represented as IDs
• Both queries and clicked urls are given
– Sogou query logs
• 1 month query logs
• Presented info.
– Time, user ID, clicked url (Uc), rank of Uc, sequence number of Uc
Experimental Platforms for Real Applications (cont.)
• Example-4: other released data
–Data from Yahoo! Webscope:
• http://webscope.sandbox.yahoo.com/
–Data from Sogou Labs
• http://www.sogou.com/labs/
–Google Web 1T 5-gram corpus
–……
Thanks! Q&A