Web IR/NLP Group (WING) @ NUS Min-Yen Kan School of Computing National University of Singapore http://wing.comp.nus.edu.sg/


Web IR/NLP Group (WING) @ NUS

Min-Yen Kan
School of Computing
National University of Singapore
http://wing.comp.nus.edu.sg/

MSRA Web-Scale NLP Workshop (Daedeok, Korea)

Web IR/NLP Group @ NUS

Support staff (undergraduate)

• System administrators

• System programmers

Undergraduate Projects

• 4 this year (ask me about topics)

PI: Min-Yen KAN (NLP and IR/DL)

Postdoc:
• Su Nam KIM (Multiword Expressions)

PhDs:
• Hendra SETIAWAN (Stat MT)
• Long QIU (Scenario Templates)
• Yee Fan TAN (Web Record Linkage)
• Jin ZHAO (Math IR)
• Jesse PRABAWA (UI/HCI for DLs)
• Ziheng LIN (Summarization)

One of many groups doing this type of research at NUS

Today I will go over NLP first, then DL


Information Extraction

• Keyphrase Extraction
– Idea: use section information as evidence (ICADL 07)

• Scenario Template Generation (Long Qiu)
– Aim: to generate database rows from similar news events
– Example 1: Charley landed further south on the Gulf Coast than predicted, … The hurricane … was weakened and is moving over South Carolina
– Example 2: At least 21 missing after the storm hit … But Tokage had weakened by the time it passed over Tokyo, where it had left little damage before moving out to sea.
– Model context and cluster to convergence using EM (EMNLP 06)


Using less data

• URL Classification (WWW 04)
– Example: http://www.usatoday.com/stories/080502/ent/hilton.html
– Example: http://www.cancersupportgroup.org/forum/230.html
– Classifies thousands of URLs per minute, at about two-thirds of the accuracy of full-text classification
– Useful for focused crawling and web mining applications
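To make the URL-only idea concrete, here is a minimal sketch of turning a URL into features without fetching the page. It is an illustration under assumptions, not the WWW 04 system: the function name `url_features`, the `host:`/`path:` feature tags, and the letters-versus-digits split are all hypothetical choices.

```python
import re
from urllib.parse import urlparse


def url_features(url):
    """Split a URL into lowercase tokens, tagged by the component they came from.

    Hypothetical feature scheme: runs of letters and runs of digits become
    separate tokens; punctuation is discarded.
    """
    parts = urlparse(url)
    features = []
    for component, text in (("host", parts.netloc), ("path", parts.path)):
        for token in re.findall(r"[a-z]+|[0-9]+", text.lower()):
            features.append(f"{component}:{token}")
    return features


if __name__ == "__main__":
    for example in (
        "http://www.usatoday.com/stories/080502/ent/hilton.html",
        "http://www.cancersupportgroup.org/forum/230.html",
    ):
        print(url_features(example))
```

Features like these could feed any standard text classifier. Note that a concatenated token such as `cancersupportgroup` stays in one piece here; a real system would likely need to segment it further.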


Question-Answering (Hang Cui)

• Our Approaches to QA
– Use of external resources from Web & WordNet (SIGIR 04)
– Employ dependency & SRL for answer extraction (SIGIR 05, 06)
– Soft pattern analysis of definitional patterns (WWW 05)
– Explore temporal relationships and events
– Extend techniques to precise passage retrieval
– Came 2nd (in 2003, 2004 & 2005) in TREC QA Task
– Licensed technology to company in legal search

• Current focus
– Relation-based IE & QA: continue focus on linguistic knowledge
– Ontology-based Interactive QA: leverage domain knowledge
– Searching for answers and mining terminology from the Web


Summarization (Ziheng Lin)

• Document Concept Lattice Model (IPM 07)
– Aim to find the list of sentences that results in minimal information loss
– Extract key concept terms, and build concept lattice
– Perform sentence extraction that covers max concept terms
– Participated in DUC, came in 1st (2005) and 2nd (2006)

• Pioneered iterative construction model for graph-based summarization (DUC 07)

[Figure: iterative summary construction over doc1, doc2, doc3. The summary grows one sentence per step: s1; then s1, s2; then s1, s2, s3.]
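The "sentence extraction that covers max concept terms" step above can be pictured with a simple greedy cover. This is a swapped-in simplification, not the Document Concept Lattice model itself: there is no lattice here, concept terms are assumed to be already extracted, and the name `greedy_summary` and the toy term sets are hypothetical.

```python
def greedy_summary(sentences, max_sentences):
    """Pick sentence indices that greedily cover the most uncovered concept terms.

    `sentences` is a list of sets; each set holds the concept terms found in
    that sentence (term extraction is assumed to have happened already).
    """
    uncovered = set().union(*sentences)
    chosen = []
    while uncovered and len(chosen) < max_sentences:
        best, best_gain = None, 0
        for i, terms in enumerate(sentences):
            gain = len(terms & uncovered)
            if i not in chosen and gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining sentence adds a new term
            break
        chosen.append(best)
        uncovered -= sentences[best]
    return chosen


if __name__ == "__main__":
    toy = [
        {"hurricane", "coast"},
        {"hurricane", "weakened", "damage"},
        {"coast", "tokyo"},
    ]
    print(greedy_summary(toy, max_sentences=2))
```

Like the figure, the summary is built one sentence at a time; the greedy choice is only a stand-in for whatever selection criterion the actual models use.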


Statistical Machine Translation (Hendra Setiawan)

Function Word Based Reordering (ACL 07)

Source: 表单 是 网页 上 的 数据 输 域 的 集合
Word-for-word gloss in source order: a form is a page on data entry fields of a coll.

Reordered source: 表单 是 集合 的 数据 输 域 的 上 网页
Translation after reordering: a form is a coll. of data entry fields on a page

Reordering around function words, innermost phrase first:
– 上 网页 → on a page
– 数据 输 域 的 上 网页 → data entry fields on a page
– 集合 的 数据 输 域 的 上 网页 → a coll. of data entry fields on a page
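The reordering in this example can be reproduced with a toy rule-based sketch. It is only an illustration of the swap pattern visible on the slide, not the statistical model from the ACL 07 paper: the three hand-written rules (keep order around 是, swap the two sides of the rightmost 的, move a trailing 上 to the front) and the function name `reorder` are assumptions made for this one sentence.

```python
COPULA = "是"        # order around the copula is kept
SWAP_AROUND = "的"   # "X 的 Y" becomes "Y 的 X"
POSTPOSITION = "上"  # "X 上" becomes "上 X"


def reorder(tokens):
    """Recursively reorder a tokenized Chinese phrase toward English word order."""
    if COPULA in tokens:
        i = tokens.index(COPULA)
        return reorder(tokens[:i]) + [COPULA] + reorder(tokens[i + 1:])
    if SWAP_AROUND in tokens:
        # Split at the rightmost 的 and swap the two sides.
        i = len(tokens) - 1 - tokens[::-1].index(SWAP_AROUND)
        return reorder(tokens[i + 1:]) + [SWAP_AROUND] + reorder(tokens[:i])
    if len(tokens) > 1 and tokens[-1] == POSTPOSITION:
        return [POSTPOSITION] + reorder(tokens[:-1])
    return list(tokens)


if __name__ == "__main__":
    source = "表单 是 网页 上 的 数据 输 域 的 集合".split()
    print(" ".join(reorder(source)))
```

Hard-coded rules like these break quickly on other sentences, which is where a learned model of how each function word orders its neighbours comes in.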


Commercial record linkage (Yee Fan Tan)

• Addresses
– Dongwon Lee, 110 E. Foster Ave. #410, State College, PA, 16802
– LEE Dong, 110 East Foster Avenue Apartment 410, Univ. Park, PA 16802-2343

• Products
– Honda Fit vs. Honda Jazz
– Apple iPod Nano 4GB vs. 4GB iPod nano 4GB

• Idea: use web as additional context for disambiguation and clustering (JCDL 06, WIDM 07)
• Placed 3rd in Web People Search Task (WEPS 2007)
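To see why string matching alone struggles on the address pair above, here is a small baseline sketch: token-set Jaccard similarity with and without abbreviation expansion. It is not the web-context method from the slide; the abbreviation table and the names `address_tokens` and `jaccard` are hypothetical.

```python
import re

# Hypothetical, hand-made abbreviation table for this one example.
EXPANSIONS = {"e": "east", "ave": "avenue", "apt": "apartment", "univ": "university"}


def address_tokens(address, expand=True):
    """Lowercase the address and return its set of word and number tokens."""
    tokens = re.findall(r"[a-z]+|[0-9]+", address.lower())
    if expand:
        tokens = [EXPANSIONS.get(token, token) for token in tokens]
    return set(tokens)


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    first = "Dongwon Lee, 110 E. Foster Ave. #410, State College, PA, 16802"
    second = "LEE Dong, 110 East Foster Avenue Apartment 410, Univ. Park, PA 16802-2343"
    raw = jaccard(address_tokens(first, expand=False), address_tokens(second, expand=False))
    normalized = jaccard(address_tokens(first), address_tokens(second))
    print(f"raw={raw:.2f} normalized={normalized:.2f}")
```

Even after normalization, half the tokens still disagree (name variants, "State College" vs. "Univ. Park"), which is the kind of gap that extra context from the web is meant to close.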


Multi(ple) Extensions

• Multimodal Alignment
– Lyrics with Audio (ACM MM 04)
– Slides with Paper (JCDL 07)

• Current and future work:
– Extracted Terminology with User Tagging

[Figure panels: Text in Focus, Slide in Focus]


Focusing on the User

• Understanding user searches better
– Known item search (JCDL 2005)
– Faceted classification of web queries (WebQ 2007)

• Building better user interfaces (Jesse Prabawa)
– Revisiting library catalog interfaces to better support searching (JCDL 2007)


Putting it all together

We’re building a niche academic research repository
– Similar to, e.g., MS Libra, CiteSeer, DBLP, Google Scholar

What? Another one? What’s the catch?
– User interaction and community involvement are central
– Overcome faults of imperfect machine learning
– Platform for researching how web-scale NLP can actively involve user feedback, and mechanisms for channeling it

What about Web NLP / IR?
– My group emphasizes practical outcomes and deliverables
– Find research within industry and practical problems
– Multilingual, multimedia, web-as-data angles likely to continue


Other pointers (NUS-wide)

• Text Processing Seminar (with archived slides)

http://wing.comp.nus.edu.sg/chimetext

• Machine Learning (Graphical Models) Reading Group

http://groups.google.com/group/mlnus/

• NLP Reading Group

http://wing.comp.nus.edu.sg/NLPReading/index.php/Main_Page

<AD>

Shameless plug for my group: http://wing.comp.nus.edu.sg

</AD>

Thanks for listening!