Examples of using NLP in a small project
TrongTranBa - FramgiaVN BrSE
Tech Talk #2
JUST a small project. No supercomputer and no super data center!
News reading service (Japanese)
(My current project at Framgia)
ARTICLE RECOMMENDATION SYSTEM (ARS)
ARTICLE CLASSIFICATION SYSTEM (ACS)
HOW?
I am an engineer of Framgia company
- I - PRP: Personal pronoun
- am - VBP: Verb, non-3rd person singular present
- an - DT: Determiner
- engineer - NN: Noun, singular
- of - IN: Preposition or subordinating conjunction
- Framgia - NNP: Proper noun, singular
- company - NN: Noun, singular
I am an engineer of Framgia company
→ I/PRP am/VBP an/DT engineer/NN of/IN Framgia/NNP company/NN
POS (part of speech) tagging
Both ARS and ACS use the result of POS tagging!
(Of course, we have tools to do POS tagging)
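The word/TAG notation shown in the example is easy to turn back into (word, tag) pairs for later processing. A minimal sketch, assuming the tagger prints space-separated word/TAG tokens exactly as in the example sentence; `parse_tagged` is a hypothetical helper name, not part of any tagger's API.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) tuples."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so a word containing '/' survives.
        word, sep, tag = token.rpartition("/")
        if sep and word and tag:
            pairs.append((word, tag))
    return pairs


# The example sentence from the slide, as tagged output.
tagged = "I/PRP am/VBP an/DT engineer/NN of/IN Framgia/NNP company/NN"
pairs = parse_tagged(tagged)
for word, tag in pairs:
    print(f"{word:10s} {tag}")
```

Tokens without a slash are silently skipped here; a real pipeline might prefer to log them.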
Strategy for ARS and ACS
❖ Long-term content collection
- Wrote a lib that extracts the main content of a news article
- Collect news content day by day
★ In my opinion, the most important thing when developing anything related to NLP is data: big enough and highly reliable
~12,500,000 articles in 2 years
[Diagram: HTML pages → extracted main content → content DB]
Strategy for ARS and ACS
❖ POS tagging
- Run POS tagging on each content
- Select only nouns (which mostly describe its topic)
- Select the 20 most important words (feature words) using the TF-IDF formula
- Save them into the database
★ Designing the DB structure was a big deal!
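The noun filter and the TF-IDF ranking can be sketched as follows. This is an illustration under assumptions, not the project's actual code: it uses the Penn Treebank noun tags from the English example (a Japanese tagger would use its own noun label), and the plain TF-IDF variant tf = count / document length, idf = log(N / document frequency), since the slides do not say which variant was used. The names `keep_nouns` and `feature_words` are hypothetical.

```python
import math
from collections import Counter

# Penn Treebank noun tags (assumption: tagger output uses this tag set).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def keep_nouns(tagged_pairs):
    """Keep only the words whose POS tag is a noun tag."""
    return [word for word, tag in tagged_pairs if tag in NOUN_TAGS]


def feature_words(doc_nouns, corpus_nouns, top_n=20):
    """Return the top_n nouns of one document ranked by TF-IDF.

    corpus_nouns is a list of noun lists, one per article, and is
    expected to include the document being scored.
    """
    n_docs = len(corpus_nouns)
    doc_freq = Counter()
    for nouns in corpus_nouns:
        doc_freq.update(set(nouns))  # count each word once per article

    term_counts = Counter(doc_nouns)
    total = len(doc_nouns)
    scores = {}
    for word, count in term_counts.items():
        if doc_freq[word] == 0:
            continue  # word never seen in the corpus, no idf available
        scores[word] = (count / total) * math.log(n_docs / doc_freq[word])

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

With ~12,500,000 articles the document frequencies would normally be precomputed and stored rather than recounted per call, which may be part of why the DB design mattered.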
[Diagram: each content → POS tagging result (word/NN word/IN word/NNP word/VBP word/NNPS ...) → keep only nouns (word/NN word/NNS word/NNP word/NNPS ...) → TF-IDF → 20 feature words → DB]
ARS - ARTICLE RECOMMENDATION SYSTEM
- Uses the b-Bit Minwise Hashing method
  - A technique for quickly estimating how similar two sets are
  - Calculate the minhash (a 4096-bit binary number) of each article from its feature words, so that articles can be compared later
- Relation score between 2 articles
  - Use the XOR operator between the 2 minhashes and get the percentage of similarity
[Diagram: Article 1 and Article 2 → 20 feature words each → minhash bit strings (10001100010…..0100100111 and 11101100110…..0101000101) → XOR → similarity score]
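The minhash comparison can be sketched as below. The slides do not give the exact hashing scheme, so this assumes b = 1 (one bit kept per hash function) and 4096 salted MD5 hashes, which yields a 4096-bit signature as stated. All function names are hypothetical. Under this assumption the fraction of matching bits is roughly (1 + J) / 2 for Jaccard similarity J, so unrelated articles still match on about half of the bits.

```python
import hashlib

NUM_BITS = 4096  # signature length stated on the slide


def _hash(seed, word):
    """One member of a family of hash functions, selected by seed (assumed scheme)."""
    digest = hashlib.md5(f"{seed}:{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(words, num_bits=NUM_BITS):
    """1-bit minwise hashing: keep the lowest bit of each of num_bits min-hashes."""
    words = set(words)
    if not words:
        raise ValueError("need at least one feature word")
    signature = 0
    for seed in range(num_bits):
        lowest = min(_hash(seed, w) for w in words)
        signature = (signature << 1) | (lowest & 1)
    return signature


def similarity(sig_a, sig_b, num_bits=NUM_BITS):
    """Fraction of matching bits: XOR marks the differing positions."""
    differing = bin(sig_a ^ sig_b).count("1")
    return 1.0 - differing / num_bits


def estimated_jaccard(sig_a, sig_b, num_bits=NUM_BITS):
    """Convert the matching-bit fraction back to an estimated Jaccard similarity."""
    return max(0.0, 2.0 * similarity(sig_a, sig_b, num_bits) - 1.0)
```

Storing one integer per article means two articles can be compared with a single XOR and a bit count, without loading their feature words again.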
ACS - AUTO CLASSIFICATION SYSTEM
★ Covers only content-based classification (besides URL/title-based classification)
- Training - Making the classification model
  - 10 different categories (tech, business, gourmet, etc.)
  - For each of them, collect ~35000 articles (currently) in the database
  - Use feature words (result of POS tagging) to build statistics of word counts and word frequency in each category and in total
  - Ex: in our current model, the word “iphone” appears 17979 times in total, split by category as follows:
{"beauty"=>"9", "gourmet"=>"61", "technology"=>"16473", "sports"=>"9", "lifestyle"=>"77", "business"=>"869", "world"=>"142", "travel"=>"143", "entertainment"=>"155", "local"=>"41"}
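Per-category counts like the ones shown for “iphone” could be produced by a simple tally over the training articles. A minimal sketch, assuming each training article is available as a (category, feature words) pair; `build_model`, `word_stats`, and the sample data are hypothetical and only illustrate the shape of the statistics.

```python
from collections import Counter, defaultdict


def build_model(labeled_articles):
    """Tally how often each feature word occurs in each category."""
    per_category = defaultdict(Counter)
    for category, words in labeled_articles:
        per_category[category].update(words)
    return {category: dict(counts) for category, counts in per_category.items()}


def word_stats(model, word):
    """Return (total count, {category: count}) for one word."""
    by_category = {cat: counts.get(word, 0) for cat, counts in model.items()}
    return sum(by_category.values()), by_category


# Made-up training data, far smaller than the ~35000 articles per category.
training = [
    ("technology", ["iphone", "apple"]),
    ("technology", ["iphone", "android"]),
    ("business", ["iphone", "market"]),
    ("gourmet", ["ramen"]),
]
model = build_model(training)
total, by_category = word_stats(model, "iphone")
print(total, by_category)
```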
- Classification - Detect the most suitable category of an article
  - From its feature words and the classification model, use a Complement Naive Bayes classifier to calculate a score for each category
  - Then just select the category with the MAX score
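The scoring step can be sketched as follows. This follows a common formulation of Complement Naive Bayes: for each category, word probabilities are estimated from all the other categories, and the sign is flipped so that the best category has the highest score, in line with selecting the MAX value. The smoothing value `alpha = 1`, the omission of class priors, and the function names are assumptions; the slides do not specify them.

```python
import math
from collections import Counter


def cnb_scores(feature_words, model, alpha=1.0):
    """Score every category; model is {category: {word: count}}."""
    word_totals = Counter()
    for counts in model.values():
        word_totals.update(counts)
    vocab_size = len(word_totals)
    grand_total = sum(word_totals.values())

    # Ignore words the model has never seen.
    doc = Counter(w for w in feature_words if w in word_totals)
    scores = {}
    for category, counts in model.items():
        # Statistics of the complement: every category except this one.
        comp_total = grand_total - sum(counts.values())
        denom = comp_total + alpha * vocab_size
        score = 0.0
        for word, freq in doc.items():
            comp_count = word_totals[word] - counts.get(word, 0)
            # Words that are rare outside this category add a large positive term.
            score -= freq * math.log((comp_count + alpha) / denom)
        scores[category] = score
    return scores


def classify(feature_words, model):
    """Pick the category with the highest score."""
    scores = cnb_scores(feature_words, model)
    return max(scores, key=scores.get)


# "iphone" counts are taken from the slide; "recipe" and "goal" are made up.
model = {
    "technology": {"iphone": 16473, "recipe": 5},
    "gourmet": {"iphone": 61, "recipe": 900},
    "sports": {"iphone": 9, "goal": 700},
}
print(classify(["iphone"], model))
```

Estimating from the complement tends to be less sensitive to categories with uneven amounts of training text than standard Naive Bayes.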
References
Wikipedia
- Natural Language Processing: http://en.wikipedia.org/wiki/Natural_language_processing
- TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- Naive Bayes: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
- Minhash: https://en.wikipedia.org/wiki/MinHash
- etc. (You can search for the terms mentioned in the slides above)
Appendix
- General POS tags used in NLP: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- Examples of POS tagger and parser tools
  - Stanford tagger, parser (English)
  - MeCab tagger (Japanese)
  - CaboCha parser (Japanese)
  - Vietnamese? http://vlsp.vietlp.org (maybe dead already)