Twitter users categorization - Data Mining Project

  • 8/12/2019 Twitter users categorization - Data Mining Project

    2014. The copyright of this document resides with its authors.

    It may be distributed unchanged freely in print or electronic forms.

    DR. M. TAHIR HASSAN: CATEGORIZATION OF USERS ON TWITTER

    Abstract

    This report addresses the task of user categorization in social media, with
    an application to Twitter. We automatically infer the values of user category,
    such as company, individual, professional, home user, sportsman, student or
    teacher. We employ a machine learning approach which relies on a
    comprehensive set of features derived from users' tweets. Our results showed
    nearly 60% accuracy across two classifiers, the Naive Bayes classifier and
    Sequential Minimal Optimization (SMO).

    Introduction

    Successful microblogging services such as Twitter have become an integral part of the
    daily life of millions of users. In addition to communicating with friends, family or
    acquaintances, microblogging services are used as recommendation services, real-time
    news sources and content sharing venues.

    A user's experience with a microblogging service could be significantly improved if
    information about the demographic attributes or personal interests of the particular user,
    as well as of the other users of the service, is available. Such information could allow for
    personalized recommendations of users to follow or user posts to read; events and topics of
    interest to particular communities could be highlighted; additionally, targeted
    advertisements could be displayed.

    Categorization of users on Twitter

    Dr. Malik Tahir Hassan

    [email protected]

    Muhammad Usman [email protected]

    Daud [email protected]

    Muhammad in Ul [email protected]

    "id #[email protected]

    Muzamil [email protected]

    School of Science and Technology

    University of Management & Technology, Lahore, Pakistan

    Literature review

    Detecting user attributes based on user communication streams. Previous work has
    explored the impact of people's profiles on the style, patterns and content of their
    communication streams. Researchers investigated the detection of gender from
    well-written, traditional text (Herring and Paolillo 2010; Singh 2001), blogs (Burger and
    Henderson 2010), reviews (Otterbacher 2010), e-mail (Garera and Yarowsky 2008), user
    search queries (Jones et al. 2008; Weber and Castillo 2010) and Twitter (Rao et al.
    2010). Other previously explored attributes include the user's location (Jones et al. 2008;
    Fink et al. 200[…]) […] just starting to be explored for user classification. Additionally,
    previous work uses a mixture of sociolinguistic features and n-gram models, while we
    focused on the content of the tweets to achieve our task of user classification.

    Topic models for Twitter. Ramage (2010) uses large-scale topic models to
    represent Twitter feeds and users, showing improved performance on tasks such as post
    and user recommendation. We confirm the value of large-scale topic models for a different
    set of tasks (user classification) and analyse their impact as part of a rich feature set.

    Methodology

    Our methodology for the categorization of users consists of two steps: (i) Preprocessing,
    (ii) Classification.

    Preprocessing is further divided into four steps: (i) Conversion to ARFF format, (ii)
    Conversion of tweets (strings) into words, (iii) Removal of stop words, (iv) Manual
    removal of unnecessary words.

    Figure 1 (a) Raw Data

    Figure 1 (b) Data Organization of raw data

    Figure 1: (a) shows raw data of the training and test set. This data is processed to
    remove unnecessary attributes and words. (b) shows data organization of the raw data.

    Pre-processing

    Preprocessing is performed on both training and testing data sets. The given data sets
    were in .txt format. To load these datasets into Weka, they are first converted into ARFF
    format. An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes
    a list of instances sharing a set of attributes. ARFF files were developed by the Machine
    Learning Project at the Department of Computer Science of The University of Waikato for
    use with the Weka machine learning software.
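
    To make the format concrete, the following is a minimal sketch of how tweet records
    might be written out as ARFF text. The relation name, the attribute names "tweet" and
    "category", the class labels and the sample rows are all hypothetical; the report does
    not list the actual attributes of its data sets.

```python
def quote(value):
    """Quote a string value for ARFF, escaping backslashes and single quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def to_arff(relation, rows, classes):
    """Build ARFF text from (tweet, category) pairs.

    The attribute names 'tweet' and 'category' are illustrative assumptions.
    """
    lines = [
        "@relation " + relation,
        "",
        "@attribute tweet string",
        "@attribute category {" + ",".join(classes) + "}",
        "",
        "@data",
    ]
    for tweet, category in rows:
        lines.append(quote(tweet) + "," + category)
    return "\n".join(lines) + "\n"


# Hypothetical sample records, not taken from the report's data set.
sample = [
    ("Our new store opens on Monday", "company"),
    ("Can't wait for the exam results", "student"),
]
arff_text = to_arff("twitter_users", sample, ["company", "student"])
print(arff_text)
```

    The header declares one string attribute and one nominal class attribute, and each
    line after @data is one instance.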

    The next step was to remove unnecessary attributes from the raw ARFF file. This was
    achieved by using the "Edit" function of Weka.

    Figure 2: (a) Attributes to be removed during preprocessing are highlighted

    Figure 2: (b) Data Organization after removal of unnecessary attributes from raw data

    Once the attributes are removed, the tweets, which were primarily in string format, are
    converted into words using the Weka filter StringToWordVector. The complete filter applied
    to convert strings into words is as below:

    weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000
    -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1
    -tokenizer weka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"
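
    The effect of this conversion can be approximated outside Weka. The sketch below
    splits tweets on the same delimiter characters and builds 0/1 word-presence vectors.
    It is an illustration only, not Weka's implementation: the function names and sample
    tweets are hypothetical, and options such as -W and -M are not modelled.

```python
import re

# Delimiter characters taken from the filter's -delimiters option.
DELIMITERS = " \r\n\t.,;:'\"()?!"
SPLIT_PATTERN = re.compile("[" + re.escape(DELIMITERS) + "]+")


def tokenize(text):
    """Split a tweet on the delimiter characters, dropping empty tokens."""
    return [token for token in SPLIT_PATTERN.split(text) if token]


def build_vocabulary(tweets):
    """Collect the sorted set of distinct words across all tweets."""
    return sorted({word for tweet in tweets for word in tokenize(tweet)})


def to_word_vector(tweet, vocabulary):
    """Return a 0/1 presence vector for the tweet over the vocabulary."""
    present = set(tokenize(tweet))
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical sample tweets, not taken from the report's data set.
tweets = ["Match day! Go team", "Exam day, wish me luck"]
vocabulary = build_vocabulary(tweets)
vectors = [to_word_vector(tweet, vocabulary) for tweet in tweets]
```

    Each distinct word becomes one attribute, so every tweet is represented by a vector
    of the same length.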

    Figure 3: (a) Data Organization after conversion of strings (tweets) into words.

    After conversion of tweets into words, the Weka stop words filter is applied to the data
    set to remove words with no value to our results (is, are, of). Stop words are words which
    are filtered out after processing of natural language data (text). Later, a few words
    (mostly outliers) were removed manually by inspecting the filtered dataset.
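
    The two removal steps can be sketched as follows. The stop list here is a short
    illustrative one (the list shipped with Weka is considerably longer), and the manually
    removed word "rt" is a hypothetical example rather than a word named in the report.

```python
# Illustrative stop list; the list shipped with Weka is considerably longer.
STOP_WORDS = frozenset({"is", "are", "of", "the", "a", "an", "to", "and"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form appears in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


def remove_manual(tokens, unwanted):
    """Drop tokens that were flagged by hand as uninformative."""
    unwanted = {word.lower() for word in unwanted}
    return [token for token in tokens if token.lower() not in unwanted]


tokens = ["The", "lecture", "is", "part", "of", "the", "course", "rt"]
filtered = remove_stop_words(tokens)
cleaned = remove_manual(filtered, ["rt"])  # "rt" is a hypothetical hand-picked word
```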

    Classification

    After preprocessing of the training data, the testing data is also preprocessed
    through batch filtering so that it can be loaded into the classifier. Batch filtering
    is used if a second dataset, normally the test set, needs to be processed with the
    same statistics as the first dataset, normally the training set.
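
    The idea behind batch filtering can be illustrated with a small sketch: the
    vocabulary is learned from the training texts only and then applied unchanged to the
    test texts, so both sets end up with identical attributes. The class name, whitespace
    tokenization and sample texts are assumptions made for this illustration and do not
    reproduce Weka's filter.

```python
class WordPresenceFilter:
    """Learn a vocabulary from one data set and apply it unchanged to others."""

    def __init__(self):
        self.vocabulary = []

    def fit(self, texts):
        """Record the distinct lowercase words of the training texts."""
        words = {word for text in texts for word in text.lower().split()}
        self.vocabulary = sorted(words)
        return self

    def transform(self, texts):
        """Map each text to 0/1 indicators over the training vocabulary."""
        vectors = []
        for text in texts:
            present = set(text.lower().split())
            vectors.append([1 if word in present else 0 for word in self.vocabulary])
        return vectors


# Hypothetical sample texts, not taken from the report's data set.
train_texts = ["big sale today", "exam week begins"]
test_texts = ["sale ends today", "holiday week"]

batch_filter = WordPresenceFilter().fit(train_texts)
train_vectors = batch_filter.transform(train_texts)
test_vectors = batch_filter.transform(test_texts)
```

    Words that occur only in the test set ("ends", "holiday") are ignored, which keeps
    the test vectors compatible with a classifier trained on the training vectors.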

    Figure 4: Testing data being supplied to the classifier

    Results and Discussion

    We applied several classifiers to the training and test data sets; however, we
    achieved better results with NaïveBayes and SMO. Naive Bayes classifiers are a family of
    simple classifiers based on applying Bayes' theorem with strong (naive) independence
    assumptions between the features, while Sequential Minimal Optimization (SMO) is an
    algorithm for efficiently solving the optimization problem which arises during the training
    of support vector machines.
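
    As an illustration of the naive independence assumption, the sketch below scores each
    class as the log prior plus the sum of per-word log likelihoods. It is a small
    multinomial variant with add-one smoothing written for this example; it is not Weka's
    NaïveBayes implementation, and the toy documents and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(tokens, model):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        # log P(class)
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # log P(word | class) with add-one (Laplace) smoothing
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, not taken from the report's data set.
train_docs = [
    ["sale", "store", "offer"],
    ["store", "discount", "offer"],
    ["exam", "lecture", "homework"],
    ["lecture", "exam", "campus"],
]
train_labels = ["company", "company", "student", "student"]
model = train_naive_bayes(train_docs, train_labels)
prediction = predict(["exam", "homework"], model)
```

    Because the per-word terms are simply added in log space, each word contributes to
    the score independently of the others, which is the "naive" assumption.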

    Figure 5: (a) Results by NaïveBayes Classifier

    Figure 5: (b) Confusion matrix by NaïveBayes

    Figure 5: (a) shows the results given by the NaïveBayes classifier using Weka. (b) shows
    the confusion matrix on the given training and test data sets by the NaïveBayes classifier.

    Figure 6: (a) Results by SMO Classifier

    Figure 6: (b) Confusion matrix by SMO

    SMO is a simple algorithm with high classification accuracy for our dataset. It shows high
    performance when the training data supplied as input has a balanced class distribution.
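
    Accuracy figures and confusion matrices of the kind reported for both classifiers can
    be computed from the actual and predicted labels, as in the following sketch. The labels
    below are hypothetical and do not reproduce the report's results.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def accuracy(actual, predicted):
    """Fraction of instances whose predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical labels, not taken from the report's experiments.
actual = ["company", "student", "student", "teacher", "company"]
predicted = ["company", "student", "teacher", "teacher", "student"]
classes = ["company", "student", "teacher"]

matrix = confusion_matrix(actual, predicted, classes)
score = accuracy(actual, predicted)
```

    The diagonal of the matrix holds the correctly classified instances, so accuracy
    equals the sum of the diagonal divided by the total number of instances.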

    References

    [1] Blei, D., Ng, A. and Jordan, M. 2002. Latent Dirichlet allocation. JMLR (3):