NLP Research at Internet Age


  • 7/29/2019 NLP Research at Internet Age

    1/42

    NLP Research at Internet Age: An Overview of NLP at Microsoft Research Asia

    Ming Zhou, Manager of Natural Language Group

    Microsoft Research Asia


    Trends of Internet Services

    Ecosystem to work with third-party apps: Apple Apps, Facebook, Twitter, Baidu, Sina, QQ

    Real-time content collection and search: Twitter, Facebook, Del.icio.us, NYT, YouTube

    Mobile search: contextual intent understanding; towards decision making and action taking

    Social power: social tags (like) for general search engines; search engines in SNS; social QA


    Impact and Challenge to NLP Research

    Impact

    The biggest database ever connects data; the biggest social network connects people

    Harnessing collective intelligence

    Contextual information processing: user, user's social network, location, time

    Real-time information processing: collection, indexing, and operation without delay

    Challenge

    How to leverage data, people, and contextual information to reach real-time information processing?


    Problems of Traditional NLP Approaches (NLP 1.0)

    Deep in individual component technologies, but these have reached their upper bounds

    Little consideration of scenarios, user needs, and market needs

    Serious data sparseness with human annotation

    Evaluation bottleneck

    Slow deployment

    Lack of an effective framework to involve user feedback


    New Strategy of NLP (NLP 2.0)

    Data collection from the web

    Domain-specific and open IE

    Contextual NLP

    Maximize at the system level, not at the individual component level

    Earlier deployment on the Internet

    Make best use of social factors


    Our Vision and Task

    Advanced NLP technologies: word breaker, POS tagging, chunking, syntactic parser, semantic role labeling, speller, query suggestion, summarization; Chinese, Japanese, English

    Multi-language information access: statistical machine translation; multi-language search

    Semantic computing: sentiment analysis, event extraction, ontology learning; understanding query intent and document; contextual NLP

    Understand user and document in any language, for any device and any application


    MSRA NLP Research Overview

    Component techs

    Text analysis: skeleton parser, named entity identification, POS tagging, SLM

    Machine translation: translation evaluation, translation knowledge acquisition, web mining for MT, SMT

    Information extraction: annotation tool, machine learning, term extraction

    Information retrieval: paraphrasing, vertical search, cross-language IR, NLP-enriched indexing and search, query-doc relevance, text mining

    Data

    NLP (C, J, E), MT (C, J, E), IR and IE (C, J, E): MRD, translation lexicon, bilingual corpus, bilingual tagged corpus, parsing lexicon, tagged corpus, balanced corpus

    Applications

    Chinese IME, Japanese IME, query speller, English writing wizard, news search, Twitter search, pocket translator, meta data extraction, couplet generation, resume routing, general web search, chatbot, comparison shopping


    Research Accomplishment

    Awards

    MSRA Best Research Team (2010)

    Finalist of WSJ Asian Innovation Awards (2010)

    MS ARD Best Project (Engkoo)

    MSRA Best Innovation (1998-2008): IME and Chinese couplets

    Academic impact

    Best result in NIST 2008 SMT, CWMT 2008, and CWMT 2009

    Best result in SIGHAN 2006 bake-off on Chinese word segmentation

    Best result in cross-language information retrieval in TREC-9, NTCIR-III

    40 ACL papers, 9 SIGIR papers, 17 Coling papers (2000-2010)

    PC chair, area chair of ACL

    Collaboration with universities

    HIT joint lab on NLP, Speech and Search; Tsinghua joint lab on Media and Network

    400 interns in 12 years

    Summer schools since 2001

    PhD supervisors at universities


    Summer School on Information Extraction (Harbin, June 2005)

    Cheng Niu: Information extraction

    Frank Seide: Speech information extraction and search

    Hwee Tou Ng: Advanced topics of information extraction

    Chin-Yew Lin: Information extraction for automatic summarization


    Projects Based on NLP 2.0

    Engkoo, a web-based English learning service: data mining from the web

    Chinese couplets: include users' power in system evolution

    Semantic analysis and search of micro-blogging: move to SNS and mobile


    Engkoo

    Parallel data mining from the web

    Video: http://video.sina.com.cn/v/b/37417609-1286528122.html


    Rapidly Changing Language

    Approximately 1.5 billion people speak English as a primary, secondary, or business language

    China: the largest English-speaking country, with 250 million English learners and USD 60 billion in annual expenses

    Problem: a live language keeps gaining new words and new meanings

    Key insight: with billions of translated web pages and sharable repositories of language data growing every day, the Internet holds the sum of human language knowledge


    www.engkoo.com

    Major features: Endless Lexicon with Native Definitions; State-of-the-Art Machine Translation (NIST OpenMT Winner); Real-time Interactive Alignment; Human-Like TTS & Phonetic Search

    Microsoft products: Bing, Office, MSN


    Massive Dictionary Mined from the Web


    Fresh and Diverse Examples


    Advanced Search with Sentence Analysis


    Sentence Classification


    Learn Contextual Usage with Word Alignment


    Hints for Easily Confused Words


    Knowledge Mining Pipeline

    [Diagram] Processing stages: web mining, linguistic parsing, knowledge mining, multi-level indexing

    [Diagram] Data products: mined data, parsed data, linguistic knowledge, indexed data

    [Diagram] Models: machine translation model, paraphrasing model

    Tokenizing: he could hardly afford to waste that golden time.

    Skeleton parsing: (Tsub~he~afford) (ModAdv~hardly~afford) (Tobj~waste~afford) (Tobj~time~waste) (AdjAttrib~golden~time)

    (Tsub~~) (ModAdv~~) (Tobj~~) (AdjAttrib~~)

    Alignment: he() could hardly afford to() waste() that() golden() time()

    1. Idiomatic word usage: Verb~Noun (decline~offer), Verb~Adv (greatly~improve), Adj~Noun (arduous~task), Adv~Adj (extremely~bad)

    2. Paraphrasing: turn_on~light, switch_on~light; laborious~task, hard~task; deeply~moved, deeply~touched

    3. Collocation translations: ~, make~plan; ~, book~room; ~, subscribe to~magazine

    Parallel sentence: He could hardly afford to waste that golden time.

    1. Single word: he, could, hardly, afford, etc.

    2. Single word with its POS: he_Pron, could_Verb, hardly_Adv, etc.; _Pron, _Adv, _Verb, etc.

    3. Collocation: Tsub~he~afford, Tobj~time~waste, etc.; Tsub~~, ModAdv~~, etc.
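The three index levels listed for the parallel sentence can be sketched roughly as follows. This is only an illustration: the function name `index_keys`, the token and triple tuple layouts, and the lowercasing step are assumptions, not the Engkoo implementation. The key strings follow the `word_POS` and `Relation~dependent~head` notation used on the slide.

```python
from typing import Dict, List, Tuple

# Assumed input shapes (not from the slides):
# a token is (word, POS); a triple is (relation, dependent, head).
Token = Tuple[str, str]
Triple = Tuple[str, str, str]


def index_keys(tokens: List[Token], triples: List[Triple]) -> Dict[str, List[str]]:
    """Build index keys at three levels for one parsed sentence."""
    return {
        # Level 1: single words
        "word": [word.lower() for word, _ in tokens],
        # Level 2: single words with their POS, e.g. he_Pron
        "word_pos": [f"{word.lower()}_{pos}" for word, pos in tokens],
        # Level 3: collocations, e.g. Tsub~he~afford
        "collocation": [f"{rel}~{dep}~{head}" for rel, dep, head in triples],
    }


tokens = [("He", "Pron"), ("could", "Verb"), ("hardly", "Adv"), ("afford", "Verb")]
triples = [("Tsub", "he", "afford"), ("ModAdv", "hardly", "afford")]
keys = index_keys(tokens, triples)
print(keys["word_pos"])
print(keys["collocation"])
```

Keeping the levels as separate key spaces lets a query match on a bare word, a word restricted by POS, or a full collocation without reparsing the indexed sentences.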


    Chinese Couplets

    Include users' power in system evolution


    Chinese Couplets (http://duilian.msra.cn)

    http://video.sina.com.cn/v/b/10937201-1452530713.html


    FS (first sentence) and SS (second sentence) Share the Same Style

    Repetition of pronunciations ()

    (wind) ---- (water)
    (blow) ---- (make)
    (buckwheat) ---- (ship)
    (wave) ---- (go)
    (bridge) ---- (island)
    (not) ---- (not)
    (wave) ---- (go)


    FS and SS Share the Same Style

    Decomposition of characters ()

    (have) ---- (lack)
    (son) ---- (fish)
    (have) ---- (lack)
    (daughter) ---- (mutton)
    (so) ---- (dare)
    (call) ---- (call)
    (good) ---- (fresh)


    FS and SS Share the Same Style

    Person name: Banqiao and Dongpo are famous literary figures

    Palindrome: reading from top to bottom is identical to reading from bottom to top

    (Banqiao) ---- (Dongpo)
    (produce) ---- (live)
    (bridge) ---- (mountain)
    (board) ---- (east)
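Two of the style constraints on these slides, repetition and palindrome, lend themselves to simple checks. The sketch below is an assumption about how such checks could look, not the system's actual filters. It works on any sequence of units (characters, or pinyin syllables when the repetition is in pronunciation), and the English glosses from the slide stand in for the Chinese characters.

```python
from typing import Dict, List, Sequence


def repetition_pattern(units: Sequence[str]) -> List[int]:
    """Map each unit to the index of its first occurrence, e.g. ABAC -> [0, 1, 0, 3]."""
    first_seen: Dict[str, int] = {}
    pattern = []
    for i, unit in enumerate(units):
        first_seen.setdefault(unit, i)
        pattern.append(first_seen[unit])
    return pattern


def same_repetition_style(fs: Sequence[str], ss: Sequence[str]) -> bool:
    """FS and SS match if they have equal length and repeat units at the same positions."""
    return len(fs) == len(ss) and repetition_pattern(fs) == repetition_pattern(ss)


def is_palindrome(units: Sequence[str]) -> bool:
    """True if the sequence reads the same forwards and backwards."""
    return list(units) == list(reversed(units))


# Glosses used as stand-in tokens for the original characters.
fs = ["wind", "blow", "buckwheat", "wave", "bridge", "not", "wave"]
ss = ["water", "make", "ship", "go", "island", "not", "go"]
print(same_repetition_style(fs, ss))
```

Comparing first-occurrence patterns rather than the units themselves makes the check independent of which words are repeated; only where the repetition falls matters.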


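    SS Generation Process

    [Diagram] FS input: Sea wide allow fish jump

    [Diagram] Candidate words: hill, sky; high, deep; permit, depend; insect, bird, tiger; fly, dance, tweedle

    [Diagram] Candidate phrases: sky high, hill high, bird fly, tiger roar

    [Diagram] Stages: SMT decoding, linguistic filtering, reranking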


    SS Generation Approach

    A multi-phase SMT approach

    Phase 1: a phrase-based log-linear model

    Phase 2: some linguistic filters

    Phase 3: a Ranking SVM

    [Diagram] FS input -> phrase-based log-linear model -> N-best candidates -> linguistic filters -> Ranking SVM model -> SS output
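The three phases can be pictured with a toy pipeline. Everything concrete below is an assumption made for illustration: the feature names `tm` and `lm`, the feature values and weights, the two filter rules, and the use of a plain linear scorer in phase 3 as a stand-in for a trained Ranking SVM. The candidate words are taken from the generation-process diagram.

```python
from typing import Dict, List

Features = Dict[str, float]


def linear_score(features: Features, weights: Features) -> float:
    """Weighted sum of feature values (the log-linear form, in the log domain)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def passes_filters(fs: List[str], ss: List[str]) -> bool:
    """Phase 2 stand-in: same length, and no FS word reused at the same position."""
    return len(fs) == len(ss) and all(a != b for a, b in zip(fs, ss))


def generate_ss(fs: List[str], candidates: List[dict],
                decode_weights: Features, rerank_weights: Features, n: int = 4) -> List[dict]:
    # Phase 1: keep the N best candidates under the log-linear model.
    ranked = sorted(candidates, key=lambda c: linear_score(c["features"], decode_weights),
                    reverse=True)
    nbest = ranked[:n]
    # Phase 2: drop candidates that violate the linguistic filters.
    kept = [c for c in nbest if passes_filters(fs, c["words"])]
    # Phase 3: rerank the survivors with a second linear scorer.
    return sorted(kept, key=lambda c: linear_score(c["features"], rerank_weights),
                  reverse=True)


fs = ["sea", "wide", "allow", "fish", "jump"]
candidates = [  # made-up log-probabilities, for illustration only
    {"words": ["sky", "high", "permit", "bird", "fly"], "features": {"tm": -1.0, "lm": -2.0}},
    {"words": ["hill", "high", "depend", "tiger", "roar"], "features": {"tm": -1.5, "lm": -2.5}},
    {"words": ["sky", "high", "bird", "fly"], "features": {"tm": -0.5, "lm": -1.0}},
    {"words": ["sea", "deep", "permit", "insect", "dance"], "features": {"tm": -2.0, "lm": -3.0}},
]
result = generate_ss(fs, candidates, {"tm": 1.0, "lm": 1.0}, {"tm": 0.5, "lm": 1.5})
print([" ".join(c["words"]) for c in result])
```

The third candidate scores best in phase 1 but is one word short, and the fourth reuses "sea" in the first position, so both are removed before reranking. That is the reason for placing the filters between decoding and the final ranking.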


    Great Examples

    [Four FS/SS example pairs; the last pair is annotated FS: (+=; +=), SS: (+=; +=)]


    User Log for Model Enhancement

    Motivation: training data is not adequate, while the user log is big (60k/m), increasing, and diverse

    What logs we record: user inputs; user-finalized couplets (second sentences selected from the candidates provided by our system, and user-modified second sentences)


    Users' Log Analysis

    Item                                          Count
    Number of input sentences                    12,322
    Number of unique input sentences              6,698
    Users directly select from system output      3,459
    Users manually modify system output             606
    Save as favorite couplets                       109
    Invalid user input                              618
    No second sentence generated                  2,211
    Banner generation                             2,687
    Select the generated banner as favorite         428
    No banner output                                265

    Data source: log from http://couplet.msra.cn

    Time period: Aug. 31 - Oct. 9, 2006
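To make the log figures easier to compare, the short script below expresses several counts as a share of all input sentences. Using the 12,322 input sentences as the denominator for every row is an assumption; the slide does not say how the counts relate to one another. The dictionary keys are names invented here.

```python
# Counts copied from the log analysis table; key names are hypothetical.
log_counts = {
    "input_sentences": 12322,
    "unique_input_sentences": 6698,
    "selected_from_system_output": 3459,
    "manually_modified": 606,
    "invalid_input": 618,
    "no_second_sentence": 2211,
}


def share(part: int, total: int) -> float:
    """Percentage of total, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


total = log_counts["input_sentences"]
shares = {
    name: share(count, total)
    for name, count in log_counts.items()
    if name != "input_sentences"
}
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

Read this way, inputs for which no second sentence was generated make up a sizeable share of the log, which fits the motivation for using log data to improve the model.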


    New Framework with Log Data

    [Diagram] Training data -> source-channel model (translation model, language model, mutual information)

    [Diagram] First sentence input -> source-channel model -> N-best candidates -> re-ranking -> second sentence output

    [Diagram] Log data -> re-ranking features: translation model, language model, mutual information, user operation
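One way to picture the re-ranking step is to add a feature derived from the user log to the model scores of each N-best candidate. The sketch below assumes the log is a list of (first sentence, chosen second sentence) pairs and that the user-operation feature is the fraction of logged choices for that first sentence. The feature definition, names, weights, and values are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

LogEntry = Tuple[str, str]  # (first sentence, second sentence the user chose)


def user_operation_feature(log: List[LogEntry], fs: str, ss: str) -> float:
    """Fraction of logged choices for this FS that picked this SS (0.0 if unseen)."""
    counts = Counter(log)
    total = sum(n for (logged_fs, _), n in counts.items() if logged_fs == fs)
    return counts[(fs, ss)] / total if total else 0.0


def rerank(fs: str, nbest: List[dict], log: List[LogEntry],
           weights: Dict[str, float]) -> List[dict]:
    """Sort N-best candidates by model features plus the log-derived feature."""
    def score(candidate: dict) -> float:
        features = dict(candidate["features"])
        features["user_operation"] = user_operation_feature(log, fs, candidate["ss"])
        return sum(weights[name] * value for name, value in features.items())
    return sorted(nbest, key=score, reverse=True)


weights = {"translation_model": 1.0, "language_model": 1.0,
           "mutual_information": 1.0, "user_operation": 1.0}
nbest = [  # made-up scores for two placeholder candidates
    {"ss": "SS-A", "features": {"translation_model": -1.0, "language_model": -1.0,
                                "mutual_information": 0.2}},
    {"ss": "SS-B", "features": {"translation_model": -1.1, "language_model": -1.0,
                                "mutual_information": 0.2}},
]
log = [("FS-1", "SS-B"), ("FS-1", "SS-B"), ("FS-1", "SS-A")]
print([c["ss"] for c in rerank("FS-1", nbest, log, weights)])
```

With an empty log the ranking falls back to the model scores alone, so the log feature only changes the order where users have actually expressed a preference.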


    Twitter Search

    Move to social internet and mobile


    [Diagram] Input: raw data, noise filtering, tweets

    [Diagram] Individual tweet: text normalization, classification, sentence boundary detection, NE recognition, dependency parsing, co-reference, semantic role labeling, sentiment analysis

    [Diagram] A collection of tweets: tweet clustering, statistical relationship learning, news and image link extraction, community extraction, user influence measure, hot tag and topic extraction, popular tweet extraction, top video, music, and artist extraction

    [Diagram] Output: multi-level indexing, semantic search
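A few of the boxes in this diagram (noise filtering, text normalization, hot tag extraction) can be illustrated with a small script. The specific rules below are assumptions chosen for the example, not the system's actual rules: a tweet counts as noise if it has fewer than three word tokens once URLs are removed, and normalization lowercases, strips URLs, and squeezes characters repeated three or more times.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"#(\w+)")


def is_noise(tweet: str) -> bool:
    """Hypothetical noise rule: fewer than three word tokens after removing URLs."""
    text = URL_RE.sub("", tweet)
    return len(re.findall(r"\w+", text)) < 3


def normalize(tweet: str) -> str:
    """Lowercase, strip URLs, squeeze 3+ repeated characters to two, collapse spaces."""
    text = URL_RE.sub("", tweet.lower())
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return " ".join(text.split())


def hot_tags(tweets: Iterable[str], top_n: int = 3) -> List[Tuple[str, int]]:
    """Count each hashtag once per non-noise tweet and return the most frequent."""
    counts: Counter = Counter()
    for tweet in tweets:
        if is_noise(tweet):
            continue
        counts.update(set(TAG_RE.findall(normalize(tweet))))
    return counts.most_common(top_n)


tweets = [
    "Loving the new #NLP demo!!! http://example.com/x",
    "#nlp is soooo cool #search",
    "lol http://example.com",
    "Semantic #search for tweets",
]
print(hot_tags(tweets, top_n=2))
```

Normalizing before counting lets "#NLP" and "#nlp" fall into the same bucket, which is why the per-tweet steps come before the collection-level extraction.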


    Conclusion

    Internet trends and their impact on NLP

    NLP 2.0 strategy

    Web data mining: Engkoo

    Users' power: couplets

    SNS and mobile: Twitter search