uno o dos problemas

  • 8/13/2019 uno o dos problemas

    Topic Models from Twitter Hashtags

    December 2, 2013

    1 Problem Definition

    Topic modelling is about capturing knowledge from a set of semantically related documents in a mathematical framework that represents an abstraction of what the topic means. A typical topic modelling problem involves a document collection and a procedure to discover which topics occur in it. Its output is statistical information about which topics appear in the documents, normally as a probability distribution (or a set of distributions).

    Several machine learning tasks take advantage of topic models, using them as alternative or complementary representations for items. Topic modelling also makes it possible to discover groups of documents or groups of features that contain related information. Groups of terms represent extensive descriptions of what the topics are, while collections act much like implicit descriptions to learn from. Some methods try to discover how many topics there are in a collection, while for others the number of topics the algorithm must consider has to be set manually.

    Topic modelling techniques are usually applied to large collections of long documents, such as scientific literature or news, where the topic stays relatively unchanged so the effect of time can be ignored. However, the popularization of user-generated real-time streams of textual information raises new problems that known procedures do not solve completely.

    On the other hand, this kind of sparse and unstructured data is full of information that could be exploited if messages that potentially refer to the same subjects can be gathered together. The most widely used applications range from training message classifiers to user preference prediction, item and news recommendation, and personalized search. The high rate at which messages are generated and their availability in real time also motivate the exploitation of such a source of news, interests and opinions.

    Some of these documents also contain semi-structured information, such as hashtags or social tags, which is usually integrated into models as evidence of relevance to a given topic; but because many people use these tags indiscriminately, they also represent a potential source of noise.

    The big picture of the scenario we are trying to mine information from can be summarized in the following assertions:

    Socially generated messages, such as tweets or SMS, are short, normally with a fixed maximum length.

    They are noisy both grammatically and semantically.

    When they contain social labels or hashtags, these are also susceptible to noise, because some users attach tags that do not correspond to the semantic content of the message.

    The number of messages about a given topic or event is a function of time and of the social impact of the topic.

    The content of the messages about a certain topic changes over time according to public interest or opinion.

    The last two points relate to the dynamic aspect of user-generated content and mean that information in social environments is not static, as it is in closed document collections. Information in social environments grows and is updated according to the development of the events it refers to. In our model of the problem this is called the model decay, and it is clearly a function of time.
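
One way to make the idea of model decay concrete is to measure how a model's accuracy changes with the time elapsed since training. The following is only a minimal sketch under stated assumptions: the function name `accuracy_by_age`, the `(timestamp, is_correct)` record layout, the 24-hour bucket size and the toy data are all hypothetical and are not taken from the method described in this document.

```python
from collections import defaultdict


def accuracy_by_age(records, train_end, bucket_size):
    """Return {bucket_index: accuracy} for messages grouped by their age.

    records     -- iterable of (timestamp, is_correct) pairs (assumed layout)
    train_end   -- time at which training data collection stopped
    bucket_size -- width of each age bucket, in the same unit as timestamps
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for timestamp, is_correct in records:
        age = timestamp - train_end
        if age < 0:
            continue  # ignore messages produced before training ended
        bucket = int(age // bucket_size)
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Hypothetical toy data: timestamps are hours after training ended at t = 0.
records = [(1, True), (5, True), (30, True), (40, False), (50, False), (60, True)]
decay_curve = accuracy_by_age(records, train_end=0, bucket_size=24)
```

Plotting such a curve per topic would show how quickly each topic's model loses accuracy; the bucket width is a free choice that trades temporal resolution against the number of messages per bucket.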

    Having stated the typical scenario for topic modelling and the characteristics of the document domain we are trying to capture, our work is concerned with developing a solution that captures models that better describe the information in such dynamic social environments. The general and specific questions that guide our work can be stated as follows:

    Can higher-order relations between words be used as features to represent items?

    Do they capture more time-persistent information, allowing us to identify relevant documents generated longer after training than traditional methods can, with acceptable accuracy?


    How can we evaluate the performance of a model in dynamic environments such as real-time user-generated streams of messages?

    How do these new attributes relate to the intrinsic attributes of the document corpora?

    To answer these questions, we develop an experimental framework that allows us to test our hypothesis about latent associations by implementing an algorithm that performs a feature space transformation. Using this transformation, we follow the evolution of topics over time and analyse the model decay for various topics. The next sections give a detailed explanation of the proposed methods, the working hypothesis and the expected results.
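
The text does not specify how higher-order relations between words are computed. As an illustration only, the sketch below assumes one common reading: two words are second-order related when they never appear in the same message but share at least one co-occurring word. The function names `first_order` and `second_order` and the toy messages are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def first_order(messages):
    """Map each word to the set of words it co-occurs with in some message."""
    neighbours = defaultdict(set)
    for tokens in messages:
        for a, b in combinations(set(tokens), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def second_order(neighbours, word):
    """Words reachable through a shared neighbour but never seen with `word`."""
    direct = neighbours.get(word, set())
    reached = set()
    for neighbour in direct:
        reached |= neighbours[neighbour]
    return reached - direct - {word}


# Hypothetical tokenized messages.
messages = [["storm", "flood"], ["flood", "rescue"], ["rescue", "helicopter"]]
graph = first_order(messages)
links = second_order(graph, "storm")
```

Here "storm" and "rescue" never share a message, yet both co-occur with "flood", so a representation built on such links could relate messages that have no words in common.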
