Natural Language Processing
Verbatim Text Coding and Data Mining Report Generation
Josef S.W. Leung ([email protected], ieee.org)
Ching-Long Yeh ([email protected], cse.ttit.edu.tw)
NLP: One of the Top Priority Funding Items in Computer Science Research (National Natural Science Foundation, China)
Natural Language Processing
[Diagram: natural language is turned into internal representations by analysis/understanding (listening) and turned back into natural language by generation (speaking).]
Outline of Presentation
• NLP Introduction
– Natural Language Analysis/Understanding
– Natural Language Generation
• Case 1: Verbatim Text Coding
– May need NL analysis techniques
• Case 2: Data Mining Report Generation
– May need NL generation techniques
Modules of NL Understanding
Input sentence → Pre-processing → Tokens → Parsing → Syntactic structure → Semantic Interpretation → Semantic representation → Contextual Interpretation → Knowledge representation
Parsing for Syntactic Analysis
Grammar Rules:
S → NP + VP
NP → ART + N
VP → V + NP
Lexicon:
N → dog
N → cat
V → chased
ART → the
Syntactic Structure:
S
  NP
    ART the
    N dog
  VP
    V chased
    NP
      ART the
      N cat
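The grammar and lexicon above are small enough to parse by hand, but a short sketch may make the procedure concrete. The following is a minimal recursive-descent parser, assuming each non-terminal has exactly one rule (true of this toy grammar only); the function names `parse` and `parse_sentence` are illustrative and not part of the original slides.

```python
# Toy grammar and lexicon copied from the slide.
GRAMMAR = {
    "S": ["NP", "VP"],
    "NP": ["ART", "N"],
    "VP": ["V", "NP"],
}
LEXICON = {"dog": "N", "cat": "N", "chased": "V", "the": "ART"}


def parse(symbol, tokens, pos):
    """Parse `symbol` starting at tokens[pos]; return (tree, next_pos) or None."""
    if symbol in GRAMMAR:
        # Non-terminal: expand its single rule, left to right.
        children = []
        for child_symbol in GRAMMAR[symbol]:
            result = parse(child_symbol, tokens, pos)
            if result is None:
                return None
            subtree, pos = result
            children.append(subtree)
        return (symbol, children), pos
    # Lexical category: the next word must carry this category.
    if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
        return (symbol, tokens[pos]), pos + 1
    return None


def parse_sentence(sentence):
    """Return the syntactic structure of `sentence`, or None if it does not parse."""
    tokens = sentence.lower().split()
    result = parse("S", tokens, 0)
    if result is None or result[1] != len(tokens):
        return None
    return result[0]


tree = parse_sentence("The dog chased the cat")
print(tree)
```

The nested tuples mirror the tree on the slide: an S node with an NP child ("the dog") and a VP child ("chased" plus the NP "the cat"). A word order the grammar does not license, such as "dog the chased", yields `None`.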
Structural Ambiguity
• Time flies like an arrow.
– The passage of time is as quick as an arrow.
– A species of flies called ‘time flies’ enjoys an arrow.
Structural Ambiguity
• The man saw the girl with the telescope.
– The man saw the girl who possessed the telescope.
– The man saw the girl with the aid of the telescope.
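The two readings of the telescope sentence correspond to two distinct parse trees, which a chart parser can count. The sketch below is a CKY-style parse counter. It reuses the S, NP and VP rules from the earlier grammar slide; the three prepositional-phrase rules and the extra lexicon entries are assumptions added for illustration and do not appear in the slides.

```python
from collections import defaultdict

# Binary rules as (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("NP", "ART", "N"),
    ("VP", "V", "NP"),
    ("NP", "NP", "PP"),  # assumed: the PP modifies the noun phrase
    ("VP", "VP", "PP"),  # assumed: the PP modifies the verb phrase
    ("PP", "P", "NP"),   # assumed
]
# Assumed lexicon for this sentence only; unknown words raise KeyError.
LEXICON = {"the": "ART", "man": "N", "girl": "N", "telescope": "N",
           "saw": "V", "with": "P"}


def count_parses(sentence, goal="S"):
    """Count the distinct parse trees of `sentence` rooted in `goal`."""
    tokens = sentence.lower().rstrip(".").split()
    n = len(tokens)
    chart = defaultdict(int)  # (start, end, category) -> number of trees
    for i, word in enumerate(tokens):
        chart[(i, i + 1, LEXICON[word])] = 1
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for split in range(start + 1, end):
                for parent, left, right in RULES:
                    chart[(start, end, parent)] += (
                        chart[(start, split, left)] * chart[(split, end, right)]
                    )
    return chart[(0, n, goal)]


print(count_parses("The man saw the girl with the telescope."))  # 2
```

One tree attaches "with the telescope" to "the girl" (she possesses it) and the other attaches it to the verb phrase (it is the instrument of seeing). Without the prepositional phrase the sentence has a single parse.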
Natural Language Generation
[Diagram: the user's goal enters the Strategic Component (Text Planning), whose output passes to the Tactical Component (Linguistic Realization), which produces surface sentences. Knowledge sources shown: Domain KB, Planning Operators, User Model, Discourse Model, Linguistic Rules & Lexicon.]
Unification Grammar
Example sentence: the man sees a sheep
Rules:
S [numb=X, tense=T] → NP [numb=X] VP [numb=X, tense=T]
VP [numb=N, tense=M] → V [numb=N, tense=M] NP
NP [numb=Y] → det [numb=Y] noun [numb=Y]
Lexicon:
man : noun [numb = sing]
a : det [numb = sing]
the : det
sheep : noun
sees : [tense = pres, numb = sing]
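The agreement checks encoded by the shared variables X, Y, N and M can be illustrated with a small feature-unification sketch. It makes several simplifying assumptions: features are flat dictionaries, a missing feature means "unconstrained", the rule shapes are hard-coded rather than read from a grammar, "sees" is treated as category V (the slide gives it no category), and the entry for "men" is a hypothetical addition used only to show a failure.

```python
def unify(a, b):
    """Merge two flat feature dicts; return None if any feature clashes."""
    merged = dict(a)
    for feature, value in b.items():
        if feature in merged and merged[feature] != value:
            return None
        merged[feature] = value
    return merged


LEXICON = {
    "man": ("noun", {"numb": "sing"}),
    "a": ("det", {"numb": "sing"}),
    "the": ("det", {}),
    "sheep": ("noun", {}),
    "sees": ("V", {"tense": "pres", "numb": "sing"}),  # category V is assumed
    "men": ("noun", {"numb": "plur"}),  # hypothetical entry, not on the slide
}


def noun_phrase(det, noun):
    """NP[numb=Y] -> det[numb=Y] noun[numb=Y]; returns NP features or None."""
    det_cat, det_feats = LEXICON[det]
    noun_cat, noun_feats = LEXICON[noun]
    if (det_cat, noun_cat) != ("det", "noun"):
        return None
    return unify(det_feats, noun_feats)


def sentence(det1, noun1, verb, det2, noun2):
    """S[numb=X, tense=T] -> NP[numb=X] VP[numb=X, tense=T]; VP -> V NP."""
    subject = noun_phrase(det1, noun1)
    obj = noun_phrase(det2, noun2)
    verb_cat, verb_feats = LEXICON[verb]
    if subject is None or obj is None or verb_cat != "V":
        return None
    # The object NP carries no shared variable in the VP rule, so only the
    # subject has to agree with the verb.
    return unify(subject, verb_feats)


print(sentence("the", "man", "sees", "a", "sheep"))
print(sentence("the", "men", "sees", "a", "sheep"))
```

The first call succeeds with number and tense resolved for the whole sentence; the second fails because the plural subject clashes with the singular verb. A fuller implementation would unify nested feature structures with explicit variable bindings.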
Migraine abortive treatment is used to abort migraine.
((cat clause)
 (process ((lex "use") (type material)))
 (partic ((affected ((cat proper)
                     (lex "migraine abortive treatment")))
          (agent none)))
 (circum ((purpose ((cat clause)
                    (keep-in-order no) (keep-for no)
                    (position end)
                    (process ((lex "abort")
                              (effect-type creative)
                              (type material)))
                    (partic ((created ((lex "migraine")
                                       (countable no)
                                       (cat common))))))))))
Verbatim Text Coding
• A text content classification problem.
• Group semantically similar answer items.
• Develop a code list/tree to represent the answer item groups.
• Simple NL analysis techniques may help.
• Details will be given in the first example of NLP application.
Data Mining Report Generation
• Data mining results are usually in rule or tree formats with obscure notations.
• NL generation techniques may help translate the data mining results into plain natural language.
• Details will be given in the second example of NLP application.
Codia for Verbatim Text Coding
[Figure: panels labelled Answer Items and Code Tree.]
• Small screen/window/text
• Long list of answer items
• Difficult to browse/view
• Worse than paper form
Codia for Verbatim Text Coding
[Figure labels: Key Terms; Ranking Answers by Similarity; Items with similar meaning.]
Text Similarity Measures
[Diagram: three measures (String, Semantics, Coverage) feed a single Text Similarity Score.]
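The slides name the three measures but give no formulas, so the following is only one plausible way to combine them, not Codia's actual method. Everything here is an assumption: the string measure uses Python's `difflib.SequenceMatcher`, the semantic measure is Jaccard overlap of categories from a made-up mini-thesaurus, coverage is the share of query words found in the candidate, and the three are averaged with equal weights.

```python
from difflib import SequenceMatcher

# Hypothetical mini-thesaurus: term -> semantic category.
THESAURUS = {
    "cheap": "PRICE", "inexpensive": "PRICE", "price": "PRICE",
    "tasty": "TASTE", "delicious": "TASTE", "taste": "TASTE",
}


def categories(text):
    """Semantic categories of the words in `text` that the thesaurus knows."""
    return {THESAURUS[w] for w in text.lower().split() if w in THESAURUS}


def string_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic_similarity(a, b):
    cats_a, cats_b = categories(a), categories(b)
    if not cats_a or not cats_b:
        return 0.0
    return len(cats_a & cats_b) / len(cats_a | cats_b)


def coverage(query, candidate):
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(candidate.lower().split())) / len(query_words)


def similarity_score(query, candidate, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of the string, semantic and coverage measures."""
    parts = (
        string_similarity(query, candidate),
        semantic_similarity(query, candidate),
        coverage(query, candidate),
    )
    return sum(w * p for w, p in zip(weights, parts))


def rank_answers(query, answers):
    """Order answer items from most to least similar to `query`."""
    return sorted(answers, key=lambda a: similarity_score(query, a), reverse=True)


answers = ["it is delicious", "nice packaging", "very inexpensive"]
print(rank_answers("cheap price", answers))
```

Here "very inexpensive" ranks first for the query "cheap price" even though the two share no words, because the thesaurus places them in the same category. This also shows why the approach is limited by thesaurus coverage: a term the thesaurus lacks contributes nothing to the semantic measure.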
Codia for Verbatim Text Coding
• A user interface for classifying answer items by drag-and-drop actions.
• NLP reduces time and effort in searching, browsing, and selecting multiple answer items for classification.
• There are still limitations, and the process is not fully automated.
Technical Issues of Codia
• Improve user interface.
• Use only simple NLP techniques.
• Ambiguity resolution by human.
• Limited by thesaurus.
• Still cannot handle negatives (‘Not’).
• Knowledge engineering is tedious.
Limitations and Future Improvements
Limitations:
• Thesaurus has only 60,000 terms classified into 3,900 semantic categories.
• Manual operation (ambiguity resolution relies on humans).
• Similarity measures are too mechanical.
Future improvements:
• Need to update and incorporate frequently used terms/categories.
• Towards automation by using more AI such as NLP, GA and NN.
• More adaptive by rule-based or case-based reasoning.
Data Mining and Knowledge Discovery
[Diagram: Data → Data Mining → Patterns → Interpretation → Knowledge; the overall process is labelled Knowledge Discovery.]
If q12 = 4 and q31 = 6 and q35 = 3 then q38 = 3
If h/h_income = 4 and city = 6 and car_owner = 3 then user = 3
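The two rules above appear to be the same rule before and after question codes are replaced by variable labels. A minimal sketch of that decoding step follows; the codebook is inferred from the two rules being shown in parallel, and the function name `decode_rule` is hypothetical.

```python
import re

# Codebook inferred from the two parallel rules (an assumption).
CODEBOOK = {"q12": "h/h_income", "q31": "city", "q35": "car_owner", "q38": "user"}


def decode_rule(rule_text, codebook=CODEBOOK):
    """Replace each question code (e.g. q12) with its variable label."""
    return re.sub(
        r"\bq\d+\b",
        lambda match: codebook.get(match.group(0), match.group(0)),
        rule_text,
    )


raw_rule = "If q12 = 4 and q31 = 6 and q35 = 3 then q38 = 3"
print(decode_rule(raw_rule))
```

Codes missing from the codebook are left unchanged. Decoding the numeric values (for example, which city is coded 6) would need a second lookup table that the slides do not provide.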
say(feature, [r1]).
The segment of respondents who are product X users is characterized by residence in Shanghai, consumption of brand Y cigarettes, overseas travel in the past twelve months, ownership of imported cars, and high monthly household income.
say(general, [r1]).
Basically, the respondents who are product X users have residence in Shanghai, consumption of brand Y cigarettes, overseas travel in the past twelve months, ownership of imported cars, and high monthly household income.
say(reason, [r1]).
The respondents who are product X users because they have residence in Shanghai, consumption of brand Y cigarettes, overseas travel in the past twelve months, ownership of imported cars, and high monthly household income.
say(likely, [r1]).
It is likely that the people who have residence in Shanghai, consumption of brand Y cigarettes, overseas travel in the past twelve months, ownership of imported cars, and high monthly household income are product X users.
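The four outputs for rule r1 share the same conditions and conclusion and differ only in the sentence frame, which suggests a template-style realizer. The sketch below imitates the `say(Mode, [Rule])` calls in Python. It is an assumption about how such output could be produced, not the authors' implementation (the original calls look like Prolog); the rule contents and frame wording are copied from the sample outputs.

```python
# Rule r1, with phrases taken from the sample outputs on the slides.
RULES = {
    "r1": {
        "conditions": [
            "residence in Shanghai",
            "consumption of brand Y cigarettes",
            "overseas travel in the past twelve months",
            "ownership of imported cars",
            "high monthly household income",
        ],
        "conclusion": "product X users",
    }
}

# Sentence frames, worded as in the slide outputs.
TEMPLATES = {
    "feature": "The segment of respondents who are {conclusion} "
               "is characterized by {conditions}.",
    "general": "Basically, the respondents who are {conclusion} have {conditions}.",
    "reason": "The respondents who are {conclusion} because they have {conditions}.",
    "likely": "It is likely that the people who have {conditions} are {conclusion}.",
}


def join_phrases(phrases):
    """Join phrases as 'a', 'a and b', or 'a, b, and c'."""
    if not phrases:
        return ""
    if len(phrases) <= 2:
        return " and ".join(phrases)
    return ", ".join(phrases[:-1]) + ", and " + phrases[-1]


def say(mode, rule_ids):
    """Render each rule in `rule_ids` as one sentence in the given mode."""
    sentences = []
    for rule_id in rule_ids:
        rule = RULES[rule_id]
        sentences.append(TEMPLATES[mode].format(
            conclusion=rule["conclusion"],
            conditions=join_phrases(rule["conditions"]),
        ))
    return " ".join(sentences)


print(say("likely", ["r1"]))
```

Each rule becomes exactly one sentence and the frames are fixed strings, so a realizer of this kind does no text planning across rules.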
Limitations and Future Improvements
Limitations:
• Pre-defined syntactic category of code labels.
• Single sentence for each rule.
• Lack visualization.
• Almost no text planning.
• English only.
• Lack knowledge of explanation.
Future improvements:
• Automatic recognition of the syntax.
• Describe rule relationship in multiple coherent sentences.
• Text + graphics or even multimedia generation.
• Implement text planning.
• Multilingual.
• Implement NL techniques for explanation.
Concluding Remarks
• NLP techniques are found useful in:
– Verbatim text coding and
– Data mining report generation.
• Group similar answer items.
• Write simple natural language text.
• A pricey technology because few tools are available.
Natural Language Processing
Josef Siu-Wai Leung ([email protected])
Ching-Long Yeh ([email protected])