LDA Topic Modeling for the webis-csp15 Corpus Bauhaus Weimar University Faculty of Media Computer Science and Media Department Supervised by : Michael Voelske Tim Gollub 2014/2015 WS By : Erdan Genc Yamen Ajjour


LDA Topic Modeling for the webis-csp15 Corpus

Bauhaus Weimar University
Faculty of Media
Computer Science and Media Department

Supervised by: Michael Voelske, Tim Gollub

2014/2015 WS

By: Erdan Genc, Yamen Ajjour


Overview

● Terminologies
● Task
● Extracting the subjects from CFP Mails
● LDA
● Gensim
● Results and Conclusion
● (Similarity calculation using MapReduce)


Terminologies

● CFP - Call for Papers; an email inviting subscribers to submit papers on various topics
● Subject - Phrase that describes a CFP topic
● BOW - Bag of words
● LDA - Latent Dirichlet Allocation, a generative topic model
● webis-csp15 Corpus - Collection of all Webis Computer Science Papers
● Projection - Mapping from one space to another, e.g. from BOW to LDA
● Paper(s) - Webis Computer Science Paper(s)


Task

● Mine CFP emails
● Extract the (CFP) subjects from the emails
● Compute the LDA model of the webis-csp15 Corpus
● Project the plain-text papers into LDA space
● Project the subjects into LDA space
● Compute similarity between subjects and papers


Extracting the CFP Subjects (1)

● First: Mine mails
  ○ Download from the different mailing lists
  ○ DBWorld provided their archive
● Java program to extract topics
  ○ Convert files to plain text, mails line by line
  ○ Regex to extract the subject list in the mail
  ○ Blacklist to filter subjects
  ○ Regex to filter subjects


Extracting the CFP Subjects (2)

● Difficulties:
  ○ Extracting the right list
  ○ Distinguishing it from organizer and job lists
● Accuracy: 40-50%
● Output: 220k subjects, one per line

Regex:
    (\s*[*|-]\s*[A-Z][\s|\w|,|/|-]*.?[<BREAK>|\n]*\s*)+\s*[*|-][\s|\w|,|/|-]*.?

Example CFP Mail:
    ...to its fit with the workshop theme and established research agenda
    published in "Space, Time, and Visual Analytics"
    <A HREF="http://dx.doi.org/10.1080/13658816.2010.508043">http://dx.doi.org/10.1080/13658816.2010.508043</A>

    Example topics include, but are not limited to, the visualization and interactive
    analysis of large data sets representing:

    - individual and group movement behaviours, either in physical or virtual spaces dynamics of geo-localised sensor data
    - spatio-temporal events
    - remotely sensed data, multi-scale and multi-temporal
    - large high-dimensional data sets in space and time
    - streams of spatio-temporal data
    - volunteered geographic information
    ...
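The list-matching idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Java program described above: the pattern `BULLET_ITEM`, the `BLACKLIST` entries and the helper `extract_subjects` are all hypothetical, and only single-line items introduced by `-` or `*` are handled.

```python
import re

# Hypothetical, simplified pattern: one item per line, introduced by "-" or "*".
BULLET_ITEM = re.compile(r"^[ \t]*[-*][ \t]+(.+?)[ \t]*$", re.MULTILINE)

# Hypothetical blacklist of words that hint at organizer or job lists.
BLACKLIST = ("chair", "committee", "deadline")


def extract_subjects(mail_text, min_items=2):
    """Return bullet items from a plain-text mail, skipping blacklisted ones."""
    items = [m.group(1) for m in BULLET_ITEM.finditer(mail_text)]
    if len(items) < min_items:  # a lone bullet is unlikely to be a topic list
        return []
    return [s for s in items if not any(b in s.lower() for b in BLACKLIST)]


mail = """Example topics include, but are not limited to:

- spatio-temporal events
- streams of spatio-temporal data
- volunteered geographic information
- program committee chairs
"""
subjects = extract_subjects(mail)
print(subjects)
```

A blacklist of this kind only catches obvious organizer lines, which is one plausible reason the reported accuracy stays in the 40-50% range.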


LDA - Latent Dirichlet allocation (1)

● Idea:
  ○ A document is a BOW, no semantics
  ○ Each word belongs to a topic with a certain probability
  ○ By analyzing the words, we can infer the topics of a document


LDA - Latent Dirichlet allocation (2)


LDA - Document Generative Model

1. Mary likes green and blue things.
2. Peter likes red and pink things.
3. Jane likes rectangles and triangles.
4. Joe likes circles.
5. Paul likes green triangles.

Distribution over Sentences
Sentence 1 and 2: 100% Topic Colors
Sentence 3 and 4: 100% Topic Shapes
Sentence 5: 50% Topic Colors, 50% Topic Shapes

Topic “Colors”: { 20% green, 20% blue, 20% red, 20% pink, .. }
Topic “Shapes”: { 30% circles, 30% rectangles, 30% triangles, .. }

How to compute this Distribution?
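To make the generative view concrete, the short sketch below combines the two kinds of distribution from the example: the probability of a word in a document is the sum over topics of p(t | d) * p(w | t). The numbers are the ones from the example; treating unlisted words as having probability 0 is an assumption made only for this illustration, and the helper name `word_probability` is hypothetical.

```python
# Topic-word probabilities from the example; words that are not listed are
# assumed to have probability 0 here (an illustrative simplification).
topics = {
    "Colors": {"green": 0.2, "blue": 0.2, "red": 0.2, "pink": 0.2},
    "Shapes": {"circles": 0.3, "rectangles": 0.3, "triangles": 0.3},
}

# Topic mixture of sentence 5 ("Paul likes green triangles.").
doc_topics = {"Colors": 0.5, "Shapes": 0.5}


def word_probability(word, doc_topics, topics):
    """p(w | d) = sum over topics t of p(t | d) * p(w | t)."""
    return sum(p_t * topics[t].get(word, 0.0) for t, p_t in doc_topics.items())


p_green = word_probability("green", doc_topics, topics)
p_triangles = word_probability("triangles", doc_topics, topics)
print(p_green, p_triangles)
```

LDA works in the opposite direction: it observes only the words and has to infer both distributions.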


LDA - Latent Dirichlet allocation (3)

Gibbs sampling:
● Fix an assumed number of topics T
● Randomly assign a topic to each word in every document

    For each document d
      For each word w in d
        For each topic t
          Calculate p(t | d)  // the proportion of words in document d that are assigned to topic t
          Calculate p(w | t)  // the proportion of assignments to topic t, over all documents, that come from word w
        Assign w a new topic t with probability proportional to p(t | d) * p(w | t)
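The steps above can be sketched as a small, self-contained Python function. It is a minimal illustration rather than the implementation used in this project (Gensim, which the project uses, is built on online variational Bayes rather than Gibbs sampling). The function name `gibbs_lda`, the smoothing constants `alpha` and `beta`, and the iteration count are assumptions that the steps above do not specify.

```python
import random
from collections import Counter


def gibbs_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Simplified collapsed Gibbs sampler; returns one topic id per word.

    alpha and beta are assumed smoothing constants that keep every
    probability non-zero.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Randomly assign a topic to each word in every document.
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(zs) for zs in assign]           # counts n(d, t)
    topic_word = [Counter() for _ in range(num_topics)]  # counts n(t, w)
    topic_total = [0] * num_topics                       # counts n(t)
    for doc, zs in zip(docs, assign):
        for w, t in zip(doc, zs):
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assign[d][i]
                # Remove the current assignment before re-sampling it.
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                # Unnormalised p(t | d) * p(w | t) for every topic t.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
    return assign


docs = [
    ["green", "blue"], ["red", "pink"],
    ["rectangles", "triangles"], ["circles"], ["green", "triangles"],
]
assignments = gibbs_lda(docs, num_topics=2)
print(assignments)
```

Removing a word's current assignment before computing the weights is the usual collapsed-Gibbs detail that the short description leaves implicit.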


Gensim

● Topic modeling toolkit, implemented in Python
● Provides parallel implementations of LDA (multicore, distributed)
● Uses Pyro (Python Remote Objects) for the distributed mode
● Start the name server
    python -m Pyro4.naming -n 0.0.0.0
● Start a worker on each node
    python -m gensim.models.lda_worker
● Start a dispatcher on one node
    python -m gensim.models.lda_dispatcher
● Start learning the LDA model
    from gensim.models import LdaModel
    # c is the bag-of-words corpus
    lda = LdaModel(corpus=c, num_topics=20, chunksize=2, passes=100, update_every=1, distributed=True)


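Bird's-eye view of the process

[Figure: process diagram. Offline training: the LDA algorithm (num of topics = 100) is run on webis-csp15 to produce the LDA model; subject extraction turns the calls for papers into subjects. Online usage: the LDA model projects the webis-csp15 papers and the subjects to vectors (x1, x2, ..., x100), and the similarity calculation on these vectors yields the matched indices.]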


Results - Topics

Topic Example 1
20.5*query + 10.6*queries + 3.1*index + 1.3*database + 1.0*answer + 0.9*indexing + 0.7*databases + 0.6*answers + 0.6*indexes + 0.6*querying

Topic Example 2
13.0*server + 8.5*client + 5.3*request + 4.1*requests + 3.5*clients + 2.0*session + 1.7*proxy + 1.5*remote + 1.3*response + 1.0*service

Topic Example 3
19.5*pp + 10.3*fig + 8.7*proc + 5.2*vol + 3.5*ii + 2.3*conf + 1.9*iii + 1.8*he + 1.6*unit + 1.5*int


Results - Paper Projections

Paper Abstract ACM.ID 1626467
We present a comparison of two approaches for Arabic-Chinese machine translation using English as a pivot language: sentence pivoting and phrase-table pivoting. Our results show that using English as a pivot in either approach outperforms direct translation from Arabic to Chinese. Our best result is the phrase-pivot system which scores higher than direct translation by 1.1 BLEU points. An error analysis of our best system shows that we successfully handle many complex Arabic-Chinese syntactic variations.

Topic 77 - 60.61%
0.042*translation + 0.035*word + 0.023*english + 0.015*alignment + 0.013*phrase + 0.013*chinese + 0.011*sentence + 0.009*mt + 0.009*corpus + 0.009*target

Topic 9 - 20.05%
0.020*word + 0.013*corpus + 0.013*sentence + 0.010*lexical + 0.010*verb + 0.009*noun + 0.009*semantic + 0.009*sentences + 0.009*syntactic + 0.008*linguistics

Important: These are only the two highest topics


Results - Subject Projections

Subject Example 1 - Electronic Commerce

Topic 27 - 67%
0.068*items + 0.058*item + 0.026*trust + 0.024*recommendation + 0.021*ratings + 0.019*rating + 0.018*profile + 0.015*recommendations + 0.014*recommender + 0.014*collaborative

Subject Example 2 - Cloud Federation and Intercloud

Topic 28 - 34.00%
0.028*job + 0.026*cloud + 0.021*grid + 0.021*jobs + 0.016*parallel + 0.013*tasks + 0.012*gpu + 0.010*mapreduce + 0.010*map + 0.010*mpi

Topic 89 - 33.32%
0.064*service + 0.040*services + 0.016*layer + 0.010*adaptation + 0.008*middleware + 0.007*platform + 0.006*infrastructure + 0.006*self + 0.006*adaptive + 0.006*resources

Important: These are all the topics the subject spans


Results - Paper Subject Pairs (1)

Paper Abstract ACM.ID 1626467
We present a comparison of two approaches for Arabic-Chinese machine translation […] complex Arabic-Chinese syntactic variations.

Topic 77 - 60.61%
0.042*translation + 0.035*word + 0.023*english + 0.015*alignment + 0.013*phrase + 0.013*chinese + 0.011*sentence + 0.009*mt + 0.009*corpus + 0.009*target

Subject: HLT* for information system design
*Human Language Technology

Topic 77 - 50.50%

cosine distance = 0.056


Results - Paper Subject Pairs (2)

Paper Abstract ACM.ID 1900447
[...] This paper proposes a novel technique for storing and recovering the server-side components of a website from the [...]

Subject: Knowledge Capture from Social Environments and Contexts

cosine distance = 0.0043

Topic 4 - 69.4% and 68.19% in the two projections
0.011*people + 0.006*collaboration + 0.005*shared + 0.005*collaborative + 0.005*interaction + 0.005*activities + 0.005*he + 0.005*social + 0.004*activity + 0.004*awareness

Topic 47 - 12.13% and 12.21% in the two projections
0.072*social + 0.045*game + 0.021*community + 0.020*games + 0.018*players + 0.018*online + 0.016*player + 0.012*communities + 0.011*friends + 0.010*networks
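For reference, the cosine distance between two topic projections can be computed as in the sketch below. It assumes the common definition, 1 minus cosine similarity, and represents each projection as a sparse `{topic_id: weight}` mapping; the helper name `cosine_distance` is hypothetical. Only the weights of Topic 4 and Topic 47 shown above are used, so the result is not expected to reproduce the reported distance, which depends on all 100 topics.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors given as {topic_id: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention chosen here for empty projections
    return 1.0 - dot / (norm_u * norm_v)


# Only the two topic weights shown above; all other topics are unknown.
projection_a = {4: 0.694, 47: 0.1213}
projection_b = {4: 0.6819, 47: 0.1221}
distance = cosine_distance(projection_a, projection_b)
print(distance)
```

Because the cosine only compares directions, two projections dominated by the same one or two topics end up very close regardless of how unrelated the underlying texts are.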


Conclusion

● No convincing similarities between subjects and papers
● Conceivable reasons:
  ○ Subjects are too short to get meaningful projections
  ○ Not enough topics (100 topics)
  ○ Not enough iterations in the LDA (20 passes)
  ○ Cosine distance can be misleading
  ○ Dictionary is not mature enough, contains noisy words


Thank you for your attention!

If we still have time, we can take a look at the MapReduce job


Similarity calculation using MapReduce (1)

● 220k * 70k * 100 = 1.54T calculations
● Using Hadoop Streaming and Python
● Mapper:
  ○ Two inputs: SubjectProjections and PaperProjections
  ○ Output: {documentKey, subjectKey, distance}
  ○ Distance: cosine or Euclidean distance
● Reducer:
  ○ Output: Document-subject pair with smallest distance
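The mapper and reducer roles described above can be sketched in plain Python. This is an in-process illustration only: a real Hadoop Streaming job would read records from standard input and write tab-separated lines to standard output. The function names, the keys and the toy 3-topic vectors are made up for the example, and Euclidean distance is used as one of the two options mentioned.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two dense topic vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mapper(papers, subjects):
    """Yield (documentKey, subjectKey, distance) for every paper-subject pair."""
    for doc_key, doc_vec in papers.items():
        for subj_key, subj_vec in subjects.items():
            yield doc_key, subj_key, euclidean(doc_vec, subj_vec)


def reducer(records):
    """Keep, for each document, the subject with the smallest distance."""
    best = {}
    for doc_key, subj_key, dist in records:
        if doc_key not in best or dist < best[doc_key][1]:
            best[doc_key] = (subj_key, dist)
    return best


# Toy 3-topic projections with made-up keys and values.
papers = {"paper-1": [0.7, 0.2, 0.1], "paper-2": [0.1, 0.1, 0.8]}
subjects = {"subject-a": [0.6, 0.3, 0.1], "subject-b": [0.0, 0.2, 0.8]}
matches = reducer(mapper(papers, subjects))
print(matches)
```

Keying the mapper output by document lets each reducer see all candidate subjects for one paper, so the minimum can be taken in a single pass.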


Similarity calculation using MapReduce (2)

● How to handle two inputs?
  ○ Distribute documents over the nodes manually, using the same path


MapReduce Overview

[Figure: data-flow diagram. The paper projections (70k) and the subject projections (220k), each a topic vector { t1, t2, … tn }, are fed to the mappers, which emit {documentKey, subjectKey, distance}; the reducers emit {documentKey, subjectKey, min. distance}.]