
    Topic Models from Twitter Hashtags

    December 9, 2013

    Method

Social network data reflects homogenization. It has been shown that users act much like information transmitters when hot news arises that they consider of interest to their audience. Events that capture public attention are more likely to be spread, which increases their popularity and scope. A side effect of this phenomenon is the emergence of trending topics, which represent entities (news, people, events) that are heavily discussed within a certain time window.

When this happens, the mass of information related to a topic becomes concentrated in time periods that range from hours to days, depending on the impact such events have on the audience; besides retransmitting the hot information that originates the discussion, people share opinions and comments about it. These discussions rarely involve just one point of view, and it is common for the thematic content to evolve over time, giving rise to new topics, merging existing ones, or simply vanishing.

We are concerned with capturing topic models from Twitter data, so our approach consists in using relations between words as features that persist for longer, in order to avoid or reduce model decay while allowing us to identify relevant new messages accurately over longer periods, even when thematic variations appear in the stream of incoming data generated after the model was trained.

Our proposal is to use Latent Associations, which represent high-order non-linear relations between the words that occur in our short documents. Using Latent Associations as features, our models can capture patterns observed in training data as pattern environments, that is, families of variations of the identified pattern with added noise or subject to corruption (variation).

Two hypotheses sustain our procedure: that such relations can be obtained, and that they allow us to better represent the topic. To demonstrate the former, we use an implementation based on a connectionist model, the Restricted Boltzmann Machine, which is trained and used to transform count vectors into a new feature space. We then train one-class classifiers and evaluate them in a proposed hashtag stream filtering environment. For the second hypothesis, we analyse and compare the decay of our classifiers and relate their performance to a proposed measure that captures the thematic (lexical) degree of variation in a stream. Details of each procedure are described in the next subsections.

Latent Associations

Traditionally, a topic, as defined in techniques like latent semantic indexing or its generalization latent Dirichlet allocation, consists of a probability distribution over terms. The words in that distribution are grouped together because they appear in similar contexts a statistically significant number of times. When describing which terms form a topic, words with probabilities above a threshold are listed. It can be said that these models consider only a few linear co-occurrence relations between words, which clearly does not cover other kinds of relations that may exist between topically related words.

Latent associations represent non-linear relations between all the words in the vocabulary, defined recursively as functions of their appearance/absence in each of the possible contexts. In that sense, latent associations are more like pattern environments, that is, families of presence/absence patterns in document vectors over the vocabulary, which tolerate variations due to noise and deformation and include versions of the pattern with additional unseen words as well as incomplete patterns. We therefore argue that this kind of flexible relation can help construct a new representation for documents that allows us to better identify documents within a topic.

Calculating an optimal set of such relations appears to be an intractable problem, due to the large number of parameters that must be optimized, so in our approach an approximate solution is obtained by stochastic training of a connectionist model, which has been shown to be capable of capturing this kind of relation in different problems. We use Restricted Boltzmann Machines trained with Contrastive Divergence to capture the latent associations in the hidden nodes. The next sections specify their principles and training.


    Restricted Boltzmann Machines

Restricted Boltzmann Machines (RBMs) have received a lot of attention for their use as basic modular processing blocks in deep architectures, thanks to a set of algorithms discovered in recent years that allow them to be trained efficiently. They are stochastic connectionist models with two layers of processing units: a visible layer that acts as input, and a hidden one, the output of the net, which structurally captures the relations between inputs. The figure depicts the classical Restricted Boltzmann Machine architecture.
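For concreteness, assuming the standard binary formulation with logistic units (a detail this draft does not state explicitly), the conditional distributions of the two layers factorize over units as

    p(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i w_{ij} v_i\Big), \qquad
    p(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big),

where \sigma(x) = 1/(1 + e^{-x}) is the logistic function, a_i and b_j are the visible and hidden biases, and w_{ij} is the weight of the bidirectional link between visible unit i and hidden unit j.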

Restricted Boltzmann Machines are capable of learning the underlying constraints that characterize a domain simply by being shown examples from it. Their training modifies the strength of their connections so as to construct an internal generative model that produces examples with the same probability distribution as the training examples.

RBMs are composed of computing elements called units that are connected from one layer to the other by bidirectional links. A unit is always in one of two states, on or off, and it adopts these states as a probabilistic function of the states of the units in the layer it is connected to and the weights on its respective links. Weights can take real values of either sign. A unit being on or off is taken to mean that the system currently accepts or rejects some elemental hypothesis about the domain. The weight of a link represents a weak pairwise constraint between two hypotheses. A positive weight indicates that the two tend to support each other: if one is currently accepted, accepting the other should be more likely. A negative weight suggests that the two hypotheses should not both be accepted.

Each global state of the net can be assigned a single number called the energy of that state. With the right configuration, the effects of individual states can be made to minimize the global energy. If some units are externally forced into particular states to represent a particular input, the system will find the minimum-energy configuration compatible with that input. The energy of a configuration can be interpreted as the extent to which the combination of hypotheses violates the constraints implicit in the problem domain [Hinton].
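Under the standard binary formulation, the energy of a joint configuration (v, h) of visible and hidden states is

    E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j,

and the network assigns probability p(v, h) \propto e^{-E(v, h)}, so the low-energy configurations, those that violate the fewest constraints, are the most probable.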

If we feed the visible input units with vectors encoding the presence/absence of words in documents and interpret the activation of the hidden units as active relations between all those words, then after proper training the hidden units of an RBM implicitly express the Latent Associations present in the training examples. When trained with documents that are topically or semantically related, these associations manifest the implicit high-order relations between terms. Each vector is projected onto the new representational space in order to obtain its latent-association representation.
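To make the training and projection concrete, the following is a minimal NumPy sketch of a binary RBM trained with CD-1; the class, names, and hyperparameters are ours for illustration and are not taken from the authors' experiments.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class RBM:
        """Minimal binary RBM trained with CD-1 (illustrative sketch)."""

        def __init__(self, n_visible, n_hidden, lr=0.05, seed=0):
            rng = np.random.default_rng(seed)
            self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
            self.a = np.zeros(n_visible)   # visible biases
            self.b = np.zeros(n_hidden)    # hidden biases
            self.lr = lr
            self.rng = rng

        def hidden_probs(self, v):
            # p(h_j = 1 | v) for each document in the batch
            return sigmoid(v @ self.W + self.b)

        def visible_probs(self, h):
            # p(v_i = 1 | h)
            return sigmoid(h @ self.W.T + self.a)

        def cd1_step(self, v0):
            # Positive phase: hidden activations driven by the data.
            ph0 = self.hidden_probs(v0)
            h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
            # Negative phase: one Gibbs step (reconstruction).
            pv1 = self.visible_probs(h0)
            ph1 = self.hidden_probs(pv1)
            # Contrastive Divergence approximation of the gradient.
            n = v0.shape[0]
            self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / n
            self.a += self.lr * (v0 - pv1).mean(axis=0)
            self.b += self.lr * (ph0 - ph1).mean(axis=0)

        def transform(self, v):
            # Project presence/absence vectors into the latent-association space.
            return self.hidden_probs(v)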

Our approach consists in training RBMs that capture the relations between terms in a topic collection, transforming the documents into a Latent Associations space, and using them to perform tasks.
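A hypothetical end-to-end use of the sketch above, where docs stands for a binary presence/absence matrix built from the tweets of one hashtag collection (here filled with synthetic data so the snippet runs):

    import numpy as np

    # Synthetic stand-in for a tweets-by-vocabulary presence/absence matrix.
    docs = (np.random.default_rng(1).random((500, 2000)) < 0.01).astype(float)

    rbm = RBM(n_visible=docs.shape[1], n_hidden=128)
    for epoch in range(20):
        for batch in np.array_split(docs, max(1, len(docs) // 64)):
            rbm.cd1_step(batch)

    latent = rbm.transform(docs)  # one latent-association vector per document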

    Hashtag Stream Filtering Environment

Evaluating topic models is difficult because clearly determining whether words are relevant or not depends, in most cases, on contextual and subjective criteria. Evaluating models that rely on implicit high-order associations between terms, which cannot be visualized or enumerated, is also hard, so we propose an evaluation scheme that indirectly measures the power of our models through a one-class classifier that in turn filters relevant messages for a topic model trained from socially labelled messages at a certain time.
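One way such a filter could be realized, as a sketch under our own assumptions rather than the authors' exact setup, is with scikit-learn's one-class SVM over the latent-association features; latent_train and latent_stream below are assumed to hold the features of the socially labelled training messages and of later incoming messages, e.g. latent_train = rbm.transform(train_docs).

    from sklearn.svm import OneClassSVM

    # Fit the one-class boundary around messages known to belong to the topic.
    clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
    clf.fit(latent_train)

    # predict() returns +1 for messages accepted as relevant to the topic
    # and -1 for messages filtered out.
    decisions = clf.predict(latent_stream)
    relevant = latent_stream[decisions == 1]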

As mentioned before, some of the features that make it difficult to capture and use topic models in socially generated streams of data are related to the fast life cycle that topics have in such environments. Authors like [Leskov] have identified different patterns. In most of them a certain common behaviour is recognizable, such as the rise-sustain-decay components that some topics have clearly defined. While topics related to events/news have short life cycles that span from hours to days, topics related to more stable entities like people, places, and organizations exhibit longer durations, with the discourse around them constantly changing, being updated with more recent information while abandoning lapsed words.

One of the hypotheses about Latent Associations concerns that ephemeral nature of the context of a topic at a certain time. We ask whether relations between words can capture evidence about the changing direction a topic follows over time. If that is the case, our model should be able to identify relevant messages with acceptable accuracy over longer time spans; if it is not, the proposed model will perform similarly to models that do not consider changing information. To test these ideas, the filtering environment was set up.

    Let M...

    Stream Broadness Index
