Topic Modelling in Farsi
and APIs
@aliostad
Ali Kheyrollahi
Barcelona 2015
Machine Learning
and APIs
Topic Modelling
✤ Search
✤ Document to document similarity
Farsi
✤ 23rd most widely spoken language, above Italian and Polish
✤ 14th most used language on the Internet, above Korean and Swedish
Acquisition and pre-processing
My Book
کتابمن
ketaab-e man
What did you just say?
کتابمنکک ت اب م ن
باککتکيیکیکم
How weird can it get?
Unicode codez
And there’s more
کتابمن
کتاب من
Zero-width non-joiner (U+200C)
HTML => &zwnj;
کتابمن
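The slides above show that one Farsi phrase can reach the pipeline in several encodings: joined, separated by a space, or separated by a zero-width non-joiner. A minimal pre-processing sketch in Python follows. The function name `normalise_farsi`, the folding of Arabic yeh and kaf into their Persian code points, and the handling of the `&zwnj;` entity are assumptions made for illustration, not steps confirmed by the talk.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner, the code point named on the slide

# Assumption: look-alike Arabic letters are folded into their Persian forms.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalise_farsi(text: str, zwnj_replacement: str = ZWNJ) -> str:
    """Return text with one consistent encoding for look-alike characters."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(ARABIC_TO_PERSIAN)
    # Scraped HTML may carry the entity instead of the raw character.
    text = text.replace("&zwnj;", ZWNJ)
    # Keep the ZWNJ by default; pass "" or " " to drop it or split on it.
    text = text.replace(ZWNJ, zwnj_replacement)
    # Collapse runs of ordinary whitespace (ZWNJ is not whitespace).
    return " ".join(text.split())
```

Whether the non-joiner should be kept, dropped, or treated as a word boundary depends on the tokeniser that follows, so the sketch leaves that choice to the caller.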
Topic Modelling
and LDA
Latent Dirichlet Allocation (LDA)
✤ Mainly a “clustering” algorithm
✤ Defines topics as latent variables within the documents
✤ Implementations are available in most programming languages
✤ Python => Gensim and Java => MALLET
Topic Modelling concepts in LDA
✤ Document: “Bag of words” vs. “Markov chain”
✤ Word: merely an id (“library”=>123, “librarian”=>789)
✤ Dictionary: set of all words
✤ Corpus: set of all documents
✤ Topic: Distribution over words (LDA)
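The concepts above can be sketched in plain Python: a dictionary maps each word to an id, and a bag-of-words document keeps only (id, count) pairs, discarding word order. The helper names and the toy documents are hypothetical, and the ids are assigned in first-seen order rather than matching the ids on the slide.

```python
from collections import Counter


def build_dictionary(documents):
    """Map every distinct token to an integer id, in first-seen order."""
    token_to_id = {}
    for tokens in documents:
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bag_of_words(tokens, token_to_id):
    """Return sorted (word id, count) pairs; word order is discarded."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


# Toy corpus, already tokenised (hypothetical data).
corpus_tokens = [
    ["library", "book", "librarian"],
    ["book", "book", "library"],
]
dictionary = build_dictionary(corpus_tokens)
corpus = [to_bag_of_words(doc, dictionary) for doc in corpus_tokens]
```

Libraries such as Gensim provide their own dictionary and corpus types; this sketch only illustrates the idea behind them.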
Using Latent Dirichlet Allocation
✤ Document as a vector of topic weights {0: 0.01, 12: 0.19, 42: 0.23}
✤ Cosine similarity for document similarity
✤ Document similarity works really well
✤ Not great in some domains (possible fix: hierarchical refinement)
✤ Boosting
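Document similarity over topic vectors, as listed above, can be sketched with sparse dictionaries of topic id to weight. The first vector is the one shown on the slide; the second vector and the function name are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {topic_id: weight}."""
    dot = sum(weight * b[topic] for topic, weight in a.items() if topic in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc_a = {0: 0.01, 12: 0.19, 42: 0.23}   # topic weights from the slide
doc_b = {12: 0.30, 42: 0.10, 77: 0.05}  # hypothetical second document
similarity = cosine_similarity(doc_a, doc_b)
```

Only topics present in both documents contribute to the dot product, so documents with no shared topics score zero.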
Topic Model and
APIs
Resources: Dictionary
Dictionary: /languages/{lang}/dictionaries/{id}
Create
POST /languages/farsi/dictionaries HTTP/1.1
Host: example.com

HTTP/1.1 200 OK
Location: /languages/farsi/dictionaries/123
Resources: Dictionary
Dictionary: /languages/{lang}/dictionaries/{id}
Add document words
PUT /languages/farsi/dictionaries/123 HTTP/1.1
Content-Type: application/json

{
  "documents": [
    {"fullText": "یک فوقتخصص قلب و عروق گفت"},
    …
  ]
}
Resources: Corpus
Corpus: /languages/{lang}/corpora/{id}
Create
POST /languages/farsi/corpora HTTP/1.1
Host: example.com

HTTP/1.1 200 OK
Location: /languages/farsi/corpora/123
Resources: Corpus
Corpus: /languages/{lang}/corpora/{id}
Add documents
PUT /languages/farsi/corpora/123 HTTP/1.1
Content-Type: application/json

{
  "documents": [
    {"fullText": "یک فوقتخصص قلب و عروق گفت"},
    …
  ]
}
Resources: TopicModel
TopicModel: /languages/{lang}/topicmodels
Create (request)
POST /languages/farsi/topicmodels?passes=6&alpha=auto HTTP/1.1
Host: example.com

{
  "dictionaryId": 123,
  "corpusId": 456
}
Resources: TopicModel
TopicModel: /languages/{lang}/topicmodels
Create (response)
HTTP/1.1 202 Accepted
Location: /languages/farsi/topicmodels/789
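As a rough client-side illustration of the create call above, the sketch below builds the request with the Python standard library without sending it. The `http` scheme, the helper name, and the `Content-Type` header are assumptions; the path, query parameters, and body fields are taken from the slide. The parameter names `passes` and `alpha` resemble options of Gensim's `LdaModel`, although the slides do not state that mapping.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://example.com"  # assumption: the slides give only the host


def build_create_topic_model_request(lang, dictionary_id, corpus_id, **training_params):
    """Build (but do not send) the POST that asks the server to train a model."""
    url = f"{BASE_URL}/languages/{lang}/topicmodels"
    if training_params:
        url += "?" + urlencode(training_params)
    body = json.dumps({"dictionaryId": dictionary_id, "corpusId": corpus_id})
    return Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},  # assumed header
    )


request = build_create_topic_model_request("farsi", 123, 456, passes=6, alpha="auto")
```

Since the server answers `202 Accepted`, training presumably continues after the response, and a client would need to check the URL in the `Location` header later to find out when the model is ready.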
State of current ML APIs
HATEOAS, Hypermedia, REST, APIs, Cache
Markov Chain, Graph Theory, Deep Learning, Bayesian
Server Authority: State
Server Authority: Algorithm
Is this really a resource?
Converts
Mills
✤ A single piece of work/specialty (& verb)
✤ Encapsulating an “algorithm”
✤ Do not own data (though they own their config): raw data in, processed result out
✤ All calls are safe and idempotent
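The properties above can be illustrated with a small sketch, assuming a mill is modelled as an immutable object that holds configuration only. The class `TokenCountMill` and its counting logic are hypothetical stand-ins for a real algorithm such as a topic classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenCountMill:
    """Hypothetical mill: owns configuration, never stores the data it sees."""
    lowercase: bool = True

    def process(self, full_text: str) -> dict:
        """Raw text in, processed result out; no state changes between calls."""
        tokens = (full_text.lower() if self.lowercase else full_text).split()
        counts = {}
        for token in tokens:
            counts[token] = counts.get(token, 0) + 1
        return counts


mill = TokenCountMill()
first = mill.process("Book book library")
second = mill.process("Book book library")  # same input, same output
```

Because the object is frozen and `process` keeps nothing, repeating a call has no effect beyond returning the same result, which is the sense of safe and idempotent used here.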
Topic Model Mills: classifier
TopicModel: /languages/{lang}/topicmodels/{id}/classifier
classifier (request)
POST /languages/farsi/topicmodels/789/classifier HTTP/1.1
Host: example.com

{
  "fullText": "یک فوقتخصص قلب و عروق گفت",
  "refinement": "hierarchical"
}
Topic Model Mills: classifier
TopicModel: /languages/{lang}/topicmodels/{id}/classifier
classifier (response)
HTTP/1.1 200 OK
Content-Type: application/json

{
  "15": 0.03,
  "123": 0.2,
  "390": 0.09,
  …
}
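A client would typically want the strongest topics from such a response. The sketch below assumes the body is a JSON object of topic id to weight, as on the slide; the function name is hypothetical and the sample body contains only the three weights that the slide shows, leaving out the elided rest.

```python
import json


def top_topics(response_body, n=2):
    """Return the n highest-weighted (topic_id, weight) pairs from the body."""
    weights = json.loads(response_body)
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [(int(topic_id), weight) for topic_id, weight in ranked[:n]]


sample_body = '{"15": 0.03, "123": 0.2, "390": 0.09}'
best = top_topics(sample_body)
```

The topic ids arrive as JSON strings, so the sketch converts them to integers before returning them.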
Thank you!
@aliostad
aliostad [at] gmail [dot] com
Acknowledgements
✤ Windmill picture: https://www.flickr.com/photos/capnkroaker/2473951927/in/photolist-4LBDHz-4Mba53-pnrARE-4Ktk7H
✤ Algorithm picture: https://www.flickr.com/photos/peterrosbjerg/4257452000/in/photolist-7udy9Q-834w2L-dcPgeA-dcPg7s-dcPdHZ-jiRgZs-jiQstc-8qnJKb-8qdWAF-8qh6D1-8qdWEz-b8ADMZ-b8Ausi-ansdvD-dcPgc9-8kNsd1-pNCgk1-b7G8ZT-8pQRPF-8pTvUy-eXse1A-99XXLF-eKT1Y-831n9H-jj7vuo-jiRZij-b7G84Z-b7G78P-fvrqUB-b7GajB-jiPF46-8ERYQY-jiNzQv-jiRkkW-jiNUNm-jiPgPZ-jiPbWU-jiNQQm-jiRhDF-jiPF7m-jiSdJs-jiPqDT-jiSa8u-jiPwr5-jiM5Ac-jiMKze-jiPZCZ-jiNwf6-jiNtMF-jiPHut
✤ Water Tanks picture: https://www.flickr.com/photos/psilver/2280385292/in/photolist-4tvz6N-9h1PHy-cuie7-RJ83k-696owN-85tcrs-74MqFc-pkuu5-o3BsGV-bR11F2-8jNAnA-ep2fVX-8YyHWv-ABECA-av9mMk-7LMozD-dMySvh-7Pipgo-5rXApy-Q8zgi-eFxGYc-7sDbjx-87LdLE-aELtQV-7AnXb7-dJqjNR-XYpHK-nAFVCS-95G4EU-9jxNiT-7F1RPj-68hFop-7VFYSs-nzr9W2-pb3zpe-9j5sua-9962cu-bJ1UED-dp6yqD-8UCQTj-NywAX-kBG3xr-9aTXxq-pVmJui-k8BDsX-7XXtce-7pKUVr-5Hn3CL-rvWcUu-kW6dat
✤ IT Web Jobs: http://www.itjobswatch.co.uk/jobs/uk/machine%20learning.do
✤ Question mark picture: https://www.flickr.com/photos/129627585@N07/15684220620/in/photolist-pTXJmU-4W4Xed-dKRgQ2-LLBYA-8uuSCh-8qk5Q-4y7wzQ-6feu6Z-6EsuSe-f7eVmb-9WAPNR-f75fPR-8Lzt7R-9L2t4y-apWJQR-fhdGoH-4v7kg9-65wR72-7FyjMW-epmYfa-abQKN1-6m1HuV-86Uor8-a64uYL-a61DMr-9oCDYc-dW2Xad-a64vnN-a61A6c-2Zvn7-5pjxSz-9Gd43a-oQRf6d-oQReWf-p8kRm6-p8iYSE-xtEEP-7oxXJg-a64sGS-7U52mA-2Z97S-a61D7e-a64uFd-aiWYZk-2Z9mV-4cmUWW-2Zoxb-2Zg4r-a61Fmi-a61B1T
✤ Timber picture: https://www.flickr.com/photos/simonbleasdale/2797031694/in/photolist-5gaw5o-hLB1BA-9Zafk9-bnjrwy-cSf2fU-cSf4tN-69qu5j-69qzMC-dkby9F-7wjTFp-kRxidH-53cRdq-nDr93N-kRFANh-25iCW3-cjLSKs-9R81XA-4xHk5Z-9R7MXG-5gay3S-baDJJF-bnjrdE-9R7XMf-9R8Knj-9R8Ceq-bAej8Z-bnjrhU-8WcymA-bnjrA5-9R8RgQ-9R5CtZ-9R8e6h-9R8Peh-9R5eca-9R8htw-9R5bbD-9R5yov-9R51kT-9R5hqK-dvvEFb-dvq6fP-dvq7GT-dvvFG7-dvvEyL-dvvGtA-dvq8bT-dvvGAN-dvq6U2-dvvGWQ-dvq5sD
✤ Timber: https://commons.wikimedia.org/wiki/Category:Timber#/media/File:Oregon_BLM_Forestry_10_(6871708937).jpg
References
✤ Gensim: https://radimrehurek.com/gensim/
✤ LDA Paper: http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf
✤ Client-Server Domain Separation: http://byterot.blogspot.com.es/2012/11/client-server-domain-separation-csds-rest.html
✤ Mill proposal: coming soon