Topic Modelling in Farsi and APIs @aliostad Ali Kheyrollahi Barcelona 2015


Page 1: Topic Modelling and APIs

Topic Modelling in Farsi

and APIs

@aliostad

Ali Kheyrollahi

Barcelona 2015

Page 2: Topic Modelling and APIs

Machine Learning

and APIs

Page 3: Topic Modelling and APIs
Page 4: Topic Modelling and APIs

Topic Modelling

Page 5: Topic Modelling and APIs

Topic Modelling

Page 6: Topic Modelling and APIs

Topic Modelling: Search

Page 7: Topic Modelling and APIs

Topic Modelling: Document-to-document similarity

Page 8: Topic Modelling and APIs

Farsi

Page 9: Topic Modelling and APIs

Farsi

23rd most popular spoken language, above Italian and Polish

Page 10: Topic Modelling and APIs

Farsi

14th most popular Internet language, above Korean and Swedish

Page 11: Topic Modelling and APIs

Acquisition and pre-

My Book

کتاب‌من

ketaab-e man

Page 12: Topic Modelling and APIs

What did you just say?

کتاب‌منکک ت اب ‌م ن

باککتکيیکیکم

Page 13: Topic Modelling and APIs

How weird can it get?

Unicode codez

Page 14: Topic Modelling and APIs

And there’s more

کتاب‌من

کتاب من

Zero-width non-joiner (0x200C)

HTML => &zwnj;

کتابمن
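The quirks on these slides (look-alike letters with different code points, and a zero-width non-joiner that may arrive either as 0x200C or as an HTML entity) suggest a normalisation pass before tokenising. Below is a minimal sketch in Python. The function names and the choice to map Arabic yeh and kaf to their Persian forms are assumptions for illustration, not the pipeline used in the talk.

```python
import html

ZWNJ = "\u200c"  # zero-width non-joiner

# Assumption: unify Arabic code points with their Persian look-alikes.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
})


def normalise(text: str) -> str:
    """Decode HTML entities and unify look-alike letters."""
    text = html.unescape(text)  # turns "&zwnj;" into U+200C
    return text.translate(ARABIC_TO_PERSIAN)


def tokenise(text: str) -> list[str]:
    """Split on whitespace only, so a ZWNJ stays inside its word."""
    return normalise(text).split()


if __name__ == "__main__":
    ketaab, man = "\u06a9\u062a\u0627\u0628", "\u0645\u0646"
    print(len(tokenise(ketaab + "&zwnj;" + man)))  # joined by ZWNJ
    print(len(tokenise(ketaab + " " + man)))       # separated by a space
```

With this sketch the ZWNJ form stays a single token while the space-separated form becomes two, so the two spellings still need to be reconciled by whatever policy the corpus calls for.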

Page 15: Topic Modelling and APIs

Topic Modelling

and LDA

Page 16: Topic Modelling and APIs
Page 17: Topic Modelling and APIs

Latent Dirichlet Allocation (LDA)

✤ Mainly a “clustering” algorithm

✤ Defines topics as latent variables within the documents

✤ Implementations are available in most programming languages

✤ Python => Gensim and Java => MALLET

Page 18: Topic Modelling and APIs

Topic Modelling concepts in LDA

✤ Document: “Bag of words” vs. “Markov chain”

✤ Word: merely an id (“library”=>123, “librarian”=>789)

✤ Dictionary: set of all words

✤ Corpus: set of all documents

✤ Topic: Distribution over words (LDA)
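The concepts above can be sketched without any library: a dictionary maps each word to an id, and a bag-of-words document is just a list of (id, count) pairs with word order discarded. This is an illustrative sketch only; the function names are hypothetical and the ids it assigns are arbitrary.

```python
from collections import Counter


def build_dictionary(documents: list[str]) -> dict[str, int]:
    """Assign each distinct word an integer id, in order of first appearance."""
    dictionary: dict[str, int] = {}
    for doc in documents:
        for word in doc.split():
            dictionary.setdefault(word, len(dictionary))
    return dictionary


def to_bag_of_words(doc: str, dictionary: dict[str, int]) -> list[tuple[int, int]]:
    """Represent a document as sorted (word id, count) pairs; order is lost."""
    counts = Counter(dictionary[w] for w in doc.split() if w in dictionary)
    return sorted(counts.items())


if __name__ == "__main__":
    corpus_texts = ["the library opens", "the librarian opens the library"]
    dictionary = build_dictionary(corpus_texts)
    corpus = [to_bag_of_words(text, dictionary) for text in corpus_texts]
    print(dictionary)
    print(corpus)
```

Note that "library" and "librarian" get unrelated ids: to the model they are as different as any other two words.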

Page 19: Topic Modelling and APIs

Using Latent Dirichlet Allocation

✤ Document as a vector of topic weights {0: 0.01, 12: 0.19, 42: 0.23}

✤ Cosine similarity for document similarity

✤ Document similarity works really well

✤ Not great in some domains [to fix => Hierarchical]

✤ Boosting
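Cosine similarity over sparse topic-weight vectors like the one above can be sketched as follows. The vectors are plain dicts from topic id to weight; the function name is an assumption for illustration.

```python
import math


def cosine_similarity(a: dict[int, float], b: dict[int, float]) -> float:
    """Cosine of the angle between two sparse topic-weight vectors."""
    dot = sum(weight * b.get(topic, 0.0) for topic, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = {0: 0.01, 12: 0.19, 42: 0.23}
    doc_b = {12: 0.30, 42: 0.10, 77: 0.05}  # hypothetical second document
    print(cosine_similarity(doc_a, doc_b))
```

Topics missing from a document count as weight zero, so two documents that share no topics score exactly 0.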

Page 20: Topic Modelling and APIs

Topic Model and

APIs

Page 21: Topic Modelling and APIs

Resources: Dictionary

Dictionary

POST languages/farsi/dictionaries HTTP/1.1
Host: example.com

200 OK
Location: languages/farsi/dictionaries/123

languages/{lang}/dictionaries/{id}

Create

Page 22: Topic Modelling and APIs

Resources: Dictionary

Dictionary

PUT languages/farsi/dictionaries/123 HTTP/1.1
Content-Type: application/json

{
  "documents": [
    {"fullText": "یک فوق‌تخصص قلب و عروق گفت"},
    …
  ]
}

languages/{lang}/dictionaries/{id}

Add document words

Page 23: Topic Modelling and APIs

Resources: Corpus

Corpus

POST languages/farsi/corpora HTTP/1.1
Host: example.com

200 OK
Location: languages/farsi/corpora/123

languages/{lang}/corpora/{id}

Create

Page 24: Topic Modelling and APIs

Resources: Corpus

Corpus

PUT languages/farsi/corpora/123 HTTP/1.1
Content-Type: application/json

{
  "documents": [
    {"fullText": "یک فوق‌تخصص قلب و عروق گفت"},
    …
  ]
}

languages/{lang}/corpora/{id}

Add documents
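The create-then-add flow shown for dictionaries and corpora can be sketched as a small client that only builds the requests. This is a sketch under assumptions: the `http://example.com/` base URL, the function names, and the use of the `languages/{lang}/corpora/{id}` template for corpora are illustrative, and nothing is sent over the network.

```python
import json
from urllib.request import Request

BASE_URL = "http://example.com/"  # assumption: placeholder scheme and host


def create_request(lang: str, collection: str) -> Request:
    """POST to a collection; the response carries a Location header."""
    return Request(f"{BASE_URL}languages/{lang}/{collection}", method="POST")


def add_documents_request(lang: str, collection: str, resource_id: int,
                          texts: list[str]) -> Request:
    """PUT a batch of documents to an existing dictionary or corpus."""
    payload = {"documents": [{"fullText": text} for text in texts]}
    return Request(
        f"{BASE_URL}languages/{lang}/{collection}/{resource_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    create = create_request("farsi", "dictionaries")
    print(create.get_method(), create.full_url)
    add = add_documents_request("farsi", "corpora", 123, ["sample text"])
    print(add.get_method(), add.full_url)
```

Passing a built request to `urllib.request.urlopen` would actually send it; that step is left out because the host is a placeholder.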

Page 25: Topic Modelling and APIs

Resources: TopicModel

TopicModel

POST languages/farsi/topicmodels?passes=6&alpha=auto HTTP/1.1
Host: example.com

{
  "dictionaryId": 123,
  "corpusId": 456
}

languages/{lang}/topicmodels

Create (request)

Page 26: Topic Modelling and APIs

Resources: TopicModel

TopicModel

202 Accepted
Location: languages/farsi/topicmodels/789

languages/{lang}/topicmodels

Create (response)
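The topic model request and its 202 Accepted response could be handled roughly as below. This is a hedged sketch: the base URL, function names and the `Content-Type` header on the request are assumptions, and the request is only built, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://example.com/"  # assumption: placeholder scheme and host


def train_request(lang: str, dictionary_id: int, corpus_id: int,
                  passes: int = 6, alpha: str = "auto") -> Request:
    """Build the POST that asks the server to train a topic model."""
    query = urlencode({"passes": passes, "alpha": alpha})
    body = json.dumps({"dictionaryId": dictionary_id, "corpusId": corpus_id})
    return Request(
        f"{BASE_URL}languages/{lang}/topicmodels?{query}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption
        method="POST",
    )


def model_location(status: int, headers: dict[str, str]) -> str:
    """Return where the model will live once the accepted job has run."""
    if status != 202:
        raise ValueError(f"expected 202 Accepted, got {status}")
    return headers["Location"]


if __name__ == "__main__":
    request = train_request("farsi", dictionary_id=123, corpus_id=456)
    print(request.full_url)
    print(model_location(202, {"Location": "languages/farsi/topicmodels/789"}))
```

A 202 means the training has been accepted rather than completed, so a client would keep the returned location and come back to it later.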

Page 27: Topic Modelling and APIs

State of current ML APIs

Page 28: Topic Modelling and APIs

State of current ML APIs

HATEOAS

Hypermedia

REST

APIs

CACHE

Markov Chain

Graph Theory

Deep Learning

Bayesian

Page 29: Topic Modelling and APIs

Server Authority

State

Page 30: Topic Modelling and APIs

Server Authority

Algorithm

Page 31: Topic Modelling and APIs

Is this really a resource?

Converts

Page 32: Topic Modelling and APIs

Mills

Page 33: Topic Modelling and APIs

Mills

✤ A single piece of work/specialty (& verb)

✤ Encapsulating an “algorithm”

✤ Do not own data (though they own their config): raw data in, processed result out

✤ All calls are safe and idempotent
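One way to picture a mill in code is as an immutable object whose only state is its configuration. The word-counting mill below is a hypothetical example chosen for brevity, not one proposed in the talk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WordCountMill:
    """Hypothetical mill: owns its config, never stores the data it is given."""

    lowercase: bool = True  # configuration, fixed at construction

    def process(self, raw_text: str) -> dict[str, int]:
        """Raw data in, processed result out; no side effects."""
        text = raw_text.lower() if self.lowercase else raw_text
        return dict(Counter(text.split()))


if __name__ == "__main__":
    mill = WordCountMill()
    print(mill.process("A a b"))
    print(mill.process("A a b") == mill.process("A a b"))  # repeatable
```

Because nothing is stored between calls, repeating a call changes nothing on the mill and yields the same result, which is the sense in which the calls are safe and idempotent.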

Page 34: Topic Modelling and APIs

Topic Model Mills: classifier

TopicModel

languages/{lang}/topicmodels/{id}/classifier

classifier (request)

POST languages/farsi/topicmodels/789/classifier HTTP/1.1
Host: example.com

{
  "fullText": "یک فوق‌تخصص قلب و عروق گفت",
  "refinement": "hierarchical"
}

Page 35: Topic Modelling and APIs

Topic Model Mills: classifier

TopicModel

languages/{lang}/topicmodels/{id}/classifier

classifier (response)

200 OK
Content-Type: application/json

{
  "15": 0.03,
  "123": 0.2,
  "390": 0.09,
  …
}
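A client receiving this kind of response might pick out the strongest topics as sketched below. The function name is hypothetical, and the sample body contains only the three weights visible on the slide; the elided entries are not guessed.

```python
import json


def top_topics(body: str, k: int = 3) -> list[tuple[int, float]]:
    """Parse a classifier response and return the k heaviest topics."""
    weights = {int(topic): float(w) for topic, w in json.loads(body).items()}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    sample = '{"15": 0.03, "123": 0.2, "390": 0.09}'
    for topic_id, weight in top_topics(sample, k=2):
        print(topic_id, weight)
```

The JSON keys arrive as strings, so they are converted back to integer topic ids before sorting.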

Page 36: Topic Modelling and APIs

Thank you!

@aliostad

aliostad [at] gmail [dot] com

Page 37: Topic Modelling and APIs

Acknowledgements

✤ Windmill picture: https://www.flickr.com/photos/capnkroaker/2473951927/in/photolist-4LBDHz-4Mba53-pnrARE-4Ktk7H

✤ Algorithm picture: https://www.flickr.com/photos/peterrosbjerg/4257452000/in/photolist-7udy9Q-834w2L-dcPgeA-dcPg7s-dcPdHZ-jiRgZs-jiQstc-8qnJKb-8qdWAF-8qh6D1-8qdWEz-b8ADMZ-b8Ausi-ansdvD-dcPgc9-8kNsd1-pNCgk1-b7G8ZT-8pQRPF-8pTvUy-eXse1A-99XXLF-eKT1Y-831n9H-jj7vuo-jiRZij-b7G84Z-b7G78P-fvrqUB-b7GajB-jiPF46-8ERYQY-jiNzQv-jiRkkW-jiNUNm-jiPgPZ-jiPbWU-jiNQQm-jiRhDF-jiPF7m-jiSdJs-jiPqDT-jiSa8u-jiPwr5-jiM5Ac-jiMKze-jiPZCZ-jiNwf6-jiNtMF-jiPHut

✤ Water Tanks picture: https://www.flickr.com/photos/psilver/2280385292/in/photolist-4tvz6N-9h1PHy-cuie7-RJ83k-696owN-85tcrs-74MqFc-pkuu5-o3BsGV-bR11F2-8jNAnA-ep2fVX-8YyHWv-ABECA-av9mMk-7LMozD-dMySvh-7Pipgo-5rXApy-Q8zgi-eFxGYc-7sDbjx-87LdLE-aELtQV-7AnXb7-dJqjNR-XYpHK-nAFVCS-95G4EU-9jxNiT-7F1RPj-68hFop-7VFYSs-nzr9W2-pb3zpe-9j5sua-9962cu-bJ1UED-dp6yqD-8UCQTj-NywAX-kBG3xr-9aTXxq-pVmJui-k8BDsX-7XXtce-7pKUVr-5Hn3CL-rvWcUu-kW6dat

✤ IT Web Jobs: http://www.itjobswatch.co.uk/jobs/uk/machine%20learning.do

✤ Question mark picture: https://www.flickr.com/photos/129627585@N07/15684220620/in/photolist-pTXJmU-4W4Xed-dKRgQ2-LLBYA-8uuSCh-8qk5Q-4y7wzQ-6feu6Z-6EsuSe-f7eVmb-9WAPNR-f75fPR-8Lzt7R-9L2t4y-apWJQR-fhdGoH-4v7kg9-65wR72-7FyjMW-epmYfa-abQKN1-6m1HuV-86Uor8-a64uYL-a61DMr-9oCDYc-dW2Xad-a64vnN-a61A6c-2Zvn7-5pjxSz-9Gd43a-oQRf6d-oQReWf-p8kRm6-p8iYSE-xtEEP-7oxXJg-a64sGS-7U52mA-2Z97S-a61D7e-a64uFd-aiWYZk-2Z9mV-4cmUWW-2Zoxb-2Zg4r-a61Fmi-a61B1T

✤ Timber picture: https://www.flickr.com/photos/simonbleasdale/2797031694/in/photolist-5gaw5o-hLB1BA-9Zafk9-bnjrwy-cSf2fU-cSf4tN-69qu5j-69qzMC-dkby9F-7wjTFp-kRxidH-53cRdq-nDr93N-kRFANh-25iCW3-cjLSKs-9R81XA-4xHk5Z-9R7MXG-5gay3S-baDJJF-bnjrdE-9R7XMf-9R8Knj-9R8Ceq-bAej8Z-bnjrhU-8WcymA-bnjrA5-9R8RgQ-9R5CtZ-9R8e6h-9R8Peh-9R5eca-9R8htw-9R5bbD-9R5yov-9R51kT-9R5hqK-dvvEFb-dvq6fP-dvq7GT-dvvFG7-dvvEyL-dvvGtA-dvq8bT-dvvGAN-dvq6U2-dvvGWQ-dvq5sD

✤ Timber: https://commons.wikimedia.org/wiki/Category:Timber#/media/File:Oregon_BLM_Forestry_10_(6871708937).jpg

Page 38: Topic Modelling and APIs

References

✤ Gensim: https://radimrehurek.com/gensim/

✤ LDA Paper: http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf

✤ Client-Server Domain Separation: http://byterot.blogspot.com.es/2012/11/client-server-domain-separation-csds-rest.html

✤ Mill proposal: Is coming!