Examples of using NLP in a small project TrongTranBa - FramgiaVN BrSE Tech Talk #2

Tech Talk #2: Playing with tons of web content aka NLP in examples




JUST: no supercomputer and no super data center!

News reading service (Japanese) (my current project at Framgia)


ARTICLE RECOMMENDATION SYSTEM (ARS)


ARTICLE CLASSIFICATION SYSTEM (ACS)


HOW?


I am an engineer of Framgia company

I - PRP: Personal pronoun
am - VBP: Verb, non-3rd person singular present
an - DT: Determiner
engineer - NN: Noun, singular
of - IN: Preposition or subordinating conjunction
Framgia - NNP: Proper noun, singular
company - NN: Noun, singular


I am an engineer of Framgia company
→ I/PRP am/VBP an/DT engineer/NN of/IN Framgia/NNP company/NN

POS (part of speech) tagging

Both ARS and ACS use the result of POS tagging! (Of course, we have tools to do POS tagging.)
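To make the word/TAG format concrete, here is a minimal sketch in Python. It is not a real tagger: `TOY_LEXICON`, `toy_pos_tag`, and `format_tagged` are hypothetical names, and the lookup table covers only the example sentence. A real project would call an actual tool (such as the ones listed in the appendix) instead.

```python
# Toy lookup table: a stand-in for a real POS tagger.
# It only knows the words of the example sentence (an assumption for illustration).
TOY_LEXICON = {
    "i": "PRP",
    "am": "VBP",
    "an": "DT",
    "engineer": "NN",
    "of": "IN",
    "framgia": "NNP",
    "company": "NN",
}


def toy_pos_tag(sentence, lexicon=TOY_LEXICON, default_tag="NN"):
    """Return (word, tag) pairs; unknown words fall back to default_tag."""
    return [(word, lexicon.get(word.lower(), default_tag)) for word in sentence.split()]


def format_tagged(tagged):
    """Render (word, tag) pairs in the word/TAG notation used on the slides."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged)


tagged = toy_pos_tag("I am an engineer of Framgia company")
print(format_tagged(tagged))
```

The later steps only need this (word, tag) output, so the tagger itself can be swapped without touching the rest of the pipeline.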


Strategy for ARS and ACS

❖ Long-term content collection

- Wrote a library that extracts the main content of a news article from its HTML
- Collect news content day by day

★ In my opinion, the most important thing when developing anything NLP-related is the data: it must be big enough and highly reliable.

~12,500,000 articles in 2 years

[Diagram: HTML pages → main content extraction → Content DB]


Strategy for ARS and ACS

❖ POS tagging

- Run POS tagging on each article's content
- Select only nouns (they mostly describe the topic)
- Select the 20 most important words (feature words) using the TF-IDF formula
- Save them into the database

★ Designing the DB structure was a big deal!

[Diagram: Content → POS tagging result (word/NN word/IN word/NNP word/VBP word/NNPS ...) → keep only nouns (word/NN word/NNS word/NNP word/NNPS ...) → TF-IDF → 20 feature words → DB]
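The noun filter and the TF-IDF selection can be sketched as follows. This is a rough illustration under assumptions, not the project's code: `nouns_only` and `feature_words` are hypothetical names, the three tagged documents are invented, and the smoothed IDF used here is just one common variant of the formula.

```python
import math
from collections import Counter


def nouns_only(tagged_text):
    """Keep words whose Penn Treebank tag starts with NN (NN, NNS, NNP, NNPS)."""
    nouns = []
    for token in tagged_text.split():
        word, _, tag = token.rpartition("/")
        if tag.startswith("NN"):
            nouns.append(word.lower())
    return nouns


def feature_words(doc_nouns, corpus_nouns, top_k=20):
    """Rank one document's nouns by TF-IDF against the corpus, keep the top_k."""
    n_docs = len(corpus_nouns)
    doc_freq = Counter()
    for nouns in corpus_nouns:
        doc_freq.update(set(nouns))  # count each word once per document

    scores = {}
    for word, count in Counter(doc_nouns).items():
        tf = count / len(doc_nouns)
        # Smoothed IDF (an assumption): avoids division by zero for unseen words.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word]))
        scores[word] = tf * idf

    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


# Invented tagged documents, only for illustration.
tagged_docs = [
    "Apple/NNP releases/VBZ new/JJ iphone/NN and/CC iphone/NN case/NN",
    "Chef/NN opens/VBZ ramen/NN shop/NN in/IN Tokyo/NNP",
    "Tokyo/NNP shop/NN sells/VBZ iphone/NN",
]
corpus = [nouns_only(doc) for doc in tagged_docs]
print(feature_words(corpus[0], corpus, top_k=2))
```

In this tiny corpus "iphone" is frequent in the first document but also appears elsewhere, so rarer nouns outrank it. On a large corpus the same effect pushes down words that appear in almost every article.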


ARS - ARTICLE RECOMMENDATION SYSTEM

- Uses the b-Bit Minwise Hashing method
- A technique for quickly estimating how similar two sets are
- Calculate the minhash (a 4096-bit binary number) of each article from its feature words (for comparing them in the future)
- Relation score between 2 articles
- XOR the 2 minhashes and count the matching bits to get the similarity percentage

[Diagram: Article 1 and Article 2 → 20 feature words each → minhashes (10001100010…..0100100111 and 11101100110…..0101000101) → XOR → similarity score]
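A minimal sketch of the idea, assuming b = 1 (one bit kept per hash function, so 4096 hash functions give a 4096-bit signature). The function names are hypothetical, and using salted BLAKE2b as the family of hash functions is an assumption made for illustration; the slides do not say which hash functions the project uses.

```python
import hashlib


def one_bit_minhash(feature_words, num_bits=4096):
    """Build a num_bits-bit signature: one minwise hash per bit, keeping its lowest bit.

    Assumes feature_words is non-empty.
    """
    words = [w.encode("utf-8") for w in set(feature_words)]
    signature = 0
    for i in range(num_bits):
        salt = i.to_bytes(8, "little")  # each salt behaves like a different hash function
        min_hash = min(
            int.from_bytes(hashlib.blake2b(w, digest_size=8, salt=salt).digest(), "little")
            for w in words
        )
        signature = (signature << 1) | (min_hash & 1)
    return signature


def matching_bit_ratio(sig_a, sig_b, num_bits=4096):
    """XOR the signatures; bits that are 0 in the result are the matching bits."""
    differing = bin(sig_a ^ sig_b).count("1")
    return 1.0 - differing / num_bits


def estimated_jaccard(sig_a, sig_b, num_bits=4096):
    """With 1-bit signatures, unrelated sets still match on about half the bits,
    so a common correction rescales the ratio: J is roughly 2 * ratio - 1."""
    return max(0.0, 2.0 * matching_bit_ratio(sig_a, sig_b, num_bits) - 1.0)


article_1 = [f"word{i}" for i in range(20)]
article_2 = article_1[:18] + ["other1", "other2"]  # shares 18 of 20 feature words
sig_1, sig_2 = one_bit_minhash(article_1), one_bit_minhash(article_2)
print(matching_bit_ratio(sig_1, sig_2), estimated_jaccard(sig_1, sig_2))
```

Storing the signature as a single integer keeps the comparison cheap: one XOR and one bit count per pair of articles, regardless of article length.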


ACS - AUTO CLASSIFICATION SYSTEM

★ This covers only content-based classification (URL/title-based classification also exists)

- Training: making the classification model
- 10 different categories (tech, business, gourmet, etc.)
- For each of them, collect ~35,000 articles (currently) in the database
- Use feature words (the result of POS tagging) to compute statistics: word counts and word frequency in each category, and in total
- Ex: in our current model, the word “iphone” appears 17,979 times:

{"beauty"=>"9", "gourmet"=>"61", "technology"=>"16473", "sports"=>"9", "lifestyle"=>"77", "business"=>"869", "world"=>"142", "travel"=>"143", "entertainment"=>"155", "local"=>"41"}

- Classification: detect the most suitable category for an article
- From its feature words and the classification model, use a Complement Naive Bayes classifier to calculate a score for each category
- Then just select the MAX value
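The scoring step can be sketched like this. It follows the usual Complement Naive Bayes idea (each category is scored using word statistics from all the other categories), but it is only an illustration: the function names are hypothetical, Laplace smoothing with `alpha=1.0` is an assumption, and apart from three of the "iphone" counts quoted above every number in `WORD_COUNTS` is made up.

```python
import math

# Per-category counts of feature words. The "iphone" row reuses three counts
# from the slide; all other numbers are invented for illustration.
WORD_COUNTS = {
    "technology": {"iphone": 16473, "startup": 4000, "ramen": 30},
    "business": {"iphone": 869, "startup": 9000, "ramen": 120},
    "gourmet": {"iphone": 61, "startup": 50, "ramen": 12000},
}


def complement_nb_scores(feature_words, word_counts=WORD_COUNTS, alpha=1.0):
    """Score each category from the word counts of all OTHER categories."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    scores = {}
    for category in word_counts:
        # Pool the counts of every category except the one being scored.
        complement = {}
        for other, counts in word_counts.items():
            if other == category:
                continue
            for word, n in counts.items():
                complement[word] = complement.get(word, 0) + n
        denominator = sum(complement.values()) + alpha * len(vocabulary)

        score = 0.0
        for word in feature_words:
            if word not in vocabulary:
                continue  # words unknown to the model are ignored in this sketch
            # A word that is rare outside this category raises the score.
            score -= math.log((complement.get(word, 0) + alpha) / denominator)
        scores[category] = score
    return scores


def classify(feature_words, word_counts=WORD_COUNTS):
    """Pick the category with the MAX score."""
    scores = complement_nb_scores(feature_words, word_counts)
    return max(scores, key=scores.get)


print(classify(["iphone", "startup"]))
print(classify(["ramen"]))
```

Scoring against the complement tends to be less sensitive to categories of unequal size than plain Naive Bayes, which is the usual reason for choosing this variant.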


References

Wikipedia
- Natural Language Processing: http://en.wikipedia.org/wiki/Natural_language_processing
- TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- Naive Bayes: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
- Minhash: https://en.wikipedia.org/wiki/MinHash
- etc. (You can search using the terms mentioned in the slides above)


Appendix

- General POS tags used in NLP: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- Examples of POS tagger and parser tools
- Stanford tagger, parser (English)
- Mecab Tagger (Japanese)
- Cabocha Parser (Japanese)
- Vietnamese? http://vlsp.vietlp.org (maybe dead already)