CMPE 239: Web and Data Mining
Professor: Chandrasekar Vuppalapati
Project: SentiXchange (Sentiment Analysis of Twitter Feeds)
Data Insights Inc.
Akshay Bapat (008020571)
Akshay Wattal (008941816)
Mishal Shah (00873194)
Shashank Garg (009310418)
04/29/2014

Data Insights - sentiXchange



Table of Contents

1 Project Description
2 Requirements
3 UI Design Rules
4 Data Mining Principles & Algorithms
5 High level Architecture for Twitter API
6 High Level architecture for Twitter Streaming API
7 Front End, Middle Tier, Data Store & Cloud Interaction
   7.1 Low level architecture for Twitter Search API
   7.2 Low level architecture for Twitter Streaming API
8 Dataflow diagram
9 KDD Principles
10 Data Tools
11 Wire Frame UI
12 Client Side Design
13 Load Testing (Stress Test)
14 Design Patterns Used
15 Datasets and Data patterns


1 Project Description

There is an abundance of data in our world, existing in various forms (textual, visual, and so on), that is instrumental in driving our daily needs and requirements. Thanks to the proliferation of social media and the rise of mobile and cloud technologies, data generation and accumulation on the World Wide Web is rising exponentially. We can derive several useful insights and gain strategic information from the information disseminated in this form.

Take for instance the case of movie reviews. We can predict the box-office performance of a movie based upon the number of reviews and discussions found on various social media, bulletin boards and discussion forums. This can also be generalized to a specific product or service. Using these online sources of information, we can evaluate and assess the quality and performance of several products and services. Think of social media as a live pulse into the minds of the masses, quite similar to the Cerebro used by Charles Xavier to read the minds of all the mutants in the famous X-Men series.

[Figure: The Cerebro: a mind-reading device as portrayed in the X-Men series]


This brings us to our chosen topic, Sentiment Exchange (SentiXchange) of Twitter Feeds. We chose sentiment analysis as our project because it is an emerging field of technology that is used in several domains including media, politics, finance, etc. In fact, based on the original research titled "Linking Text Sentiment to Public Opinion Time Series" by Brendan O'Connor, Bryan R. Routledge, Ramnath Balasubramanyan, and Noah A. Smith, we came to the conclusion that the sentiment about a given stock gained through textual sources is actually a very accurate indicator of the corresponding performance of the same stock in the market.

[Figure: Twitter Feed Sentiment vs Gallup Poll of Consumer Confidence]

We were so inspired by the idea of using sentiment data to predict the behavior of existing complex finance frameworks that we decided to come up with a "Sentiment Index", where we define the commonly addressed key entities or hashtags as a 'stock' whose relative value (or sentiment) can be tracked over a given period of time. We propose a real-time sentiment index system, where the user can view the commonly occurring keywords or hashtags on social media and analyze their sentiment in the form of positive or negative polarity in a graph that varies continuously in time, much like the stock market.

The business value proposition of this system addresses all the relevant stakeholders in the fields of advertising, marketing, public relations, finance, manufacturing and sales, where entrepreneurs can get real-time feedback on the performance of a product or a service. Take, for example, the upcoming Galaxy brand of smartphones. This tool can be used as an indicator of the popularity of the product before and after its predicted launch date. In the case of political campaigns, people will get real-time feedback on the general perception and outlook of the public towards certain political parties. Investors and market players can use this as a tool for guidance in their financial strategies to strengthen their decisions.


2 Requirements

1. User Registration:
   The system should allow a user to log in to the SentiXchange portal, where they can build a portfolio of monitored keywords and tweets.

2. User Authentication and Confidentiality:
   The system should keep a user's browsing sessions and search history private and confidential, because exposing them is akin to revealing somebody's strategic and analytical plans and may have unforeseen ramifications for the perception of the monitored entity.

3. Hashtag Lookup Function (Twitter Search API):
   The system should have a provision for a user to search a particular keyword and thus view a real-time plot of its sentiment versus time.

4. Live Hashtag Sentiment Streaming (Twitter Stream API):
   The system should have a mechanism to stream live tweets based on the most popular tweeted keywords, so that tweets can be analyzed dynamically, in real time, on the fly.

5. Twitter Dynamic Feeds:
   The system should have a live feeds page which continuously updates with the latest feeds, populating the fields based on the user's preferences, profile and search history.

6. Historical Analysis of Twitter Feeds:
   The system should allow the user to create reports on the performance of the chosen keywords on a daily, weekly, monthly, quarterly or yearly scale. It should have a provision to combine and aggregate multiple keywords into a composite index, in order to customize and personalize each individual's sentiment tracking based on their portfolio.
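The composite index in requirement 6 is not specified further in this report. One plausible reading, offered only as a sketch, is a weighted average of per-keyword sentiment scores. The class name `CompositeIndexSketch`, the method `compositeIndex`, and the choice of a weighted average are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of the project code): combines
// per-keyword sentiment scores into a single portfolio index.
class CompositeIndexSketch {

    // Weighted average of sentiment scores.
    // Only keywords present in both maps contribute to the result.
    static double compositeIndex(Map<String, Double> sentiment, Map<String, Double> weights) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            Double score = sentiment.get(e.getKey());
            if (score == null) {
                continue; // no sentiment recorded for this keyword yet
            }
            weightedSum += e.getValue() * score;
            totalWeight += e.getValue();
        }
        if (totalWeight == 0.0) {
            throw new IllegalArgumentException("no overlapping keywords");
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        Map<String, Double> sentiment = new LinkedHashMap<>();
        sentiment.put("#galaxy", 3.0);
        sentiment.put("#iphone", 1.0);
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("#galaxy", 1.0);
        weights.put("#iphone", 3.0);
        // (1.0 * 3.0 + 3.0 * 1.0) / (1.0 + 3.0) = 1.5
        System.out.println(compositeIndex(sentiment, weights));
    }
}
```

Equal weights reduce this to a plain average; a user-specific weighting is one way the index could be personalized per portfolio.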


3 UI Design Rules

• The Structure Principle:
  The UI shall be designed in a meaningful way, based on clear and consistent models that users can easily recognize, by relating things that are similar and keeping different things separate. This principle has been adopted in our project for the UI layer.

• The Simplicity Principle:
  The design should aim to be as simple as possible, making the most common tasks very easy to carry out without requiring much prior information or instruction. The language shall be clear and concise and must match that of the target audience to achieve maximum impact.

• The Visibility Principle:
  The design must make all the options and means required to execute a given task available to the user, without causing unnecessary distractions through redundant or extraneous information. The key is to avoid overwhelming the user with excessive options and selections.

• The Feedback Principle:
  The design should provide an efficient means to keep the user updated on the actions performed, current interpretations, changes of state or condition, and any errors or exceptions that are relevant to the user.

• The Reuse Principle:
  The design must permit the reuse of both the internal and the external components and behaviors of the system, and maintain consistency of purpose rather than arbitrary consistency, so that users have less to think about and less to remember.


4 Data Mining Principles & Algorithms

A. Naïve Bayesian Classifier

These classifiers are based on the Bayes rule, a formula for the conditional probability of occurrence of an event X given the occurrence of an event Y. We represent this as P(X | Y). The Bayes rule states that in order to determine this probability, all we need is the probability of the reverse condition, P(Y | X), together with the individual probabilities of the two events, P(X) and P(Y).

        Thus the rule states that P(X | Y) = P(X) P(Y | X) / P(Y)

This is helpful when the conditional probability we want is hard to estimate directly but the reverse one is easy. In our case, if we wish to identify whether a given tweet is positive or negative given its contents, we can use Bayes' theorem to express this in terms of the probability of occurrence of the tweet given that it is already known to be positive or negative. This is very convenient for our calculation, as we already have examples of positive and negative tweets in our existing data set of tweets. To estimate that probability we make a very broad (naïve) assumption: the probability of occurrence of a tweet is equal to the product of the probabilities of occurrence of all the individual words within it.
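The word-independence assumption above can be sketched in a few lines. This is a minimal illustration, not the project's classifier: the class `NaiveBayesSketch`, whitespace tokenization, and add-one (Laplace) smoothing are assumptions introduced here. Scores are computed in log space so that the product of many small word probabilities does not underflow.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical bag-of-words Naive Bayes classifier for short texts.
class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // label -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // label -> token count
    private final Map<String, Integer> docCounts = new HashMap<>();               // label -> tweet count
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Adds one labelled example to the counts.
    void train(String label, String tweet) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : tokenize(tweet)) {
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    // Returns the label maximizing log P(label) + sum of log P(word | label).
    String classify(String tweet) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String w : tokenize(tweet)) {
                int c = counts.getOrDefault(w, 0);
                // Add-one smoothing keeps unseen words from zeroing the product.
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }
}
```

Training it on a handful of labelled tweets and calling `classify` on a new one returns whichever label gives the larger smoothed product of word probabilities times the class prior.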

B. Maximum Entropy

This classifier, also known as MaxEnt or ME, serves as an alternative to the Naïve Bayesian technique and has proved to be a very effective model for several applications involving natural language processing. Berger et al. [22] have shown that this classifier even outperforms Naïve Bayes at times. Unlike Naïve Bayes, this classifier makes no assumption about the independence of the occurrence of words. It is best suited for applications where not much prior information is known about the data in question. It is also used in cases where we should consider the dependence of the occurrence of certain words on others. The Max Entropy modeling technique provides a probability distribution that should be as close as possible to the input vector. The concept is that we should choose the model that makes the fewest assumptions about the given data while still satisfying the underlying constraints. We use several iterations of this algorithm in order to arrive at the classification of sentiments.
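For reference, the usual textbook form of a MaxEnt classifier is the exponential model below. This is the standard formulation rather than something stated in this report: $f_i(c, d)$ are feature functions over a class $c$ and a document (tweet) $d$, and $\lambda_i$ are the weights fitted by the iterative training mentioned above.

```latex
P_\lambda(c \mid d) =
  \frac{\exp\left(\sum_i \lambda_i f_i(c, d)\right)}
       {\sum_{c'} \exp\left(\sum_i \lambda_i f_i(c', d)\right)}
```

Because the weights are fitted jointly rather than one word at a time, overlapping or correlated features do not have to be independent, which is the contrast with Naïve Bayes drawn in the text.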

C. Support Vector Machines

SVMs have proved to be highly effective at categorizing traditional text and are thus a suitable candidate among our algorithms for sentiment analysis of Twitter data. This algorithm is also known to have outperformed the Naïve Bayes algorithm on several occasions, as it is a large-margin classifier rather than a probabilistic one like Naïve Bayes and Max Entropy. The basic concept of the SVM is to find a hyperplane, represented by a vector, that separates the feature vectors into the positive class and the negative class. Not only that, it also strives to create a margin with as much separation between the different categories as possible. While the original SVM problem is stated for a finite-dimensional space, it is quite possible that the sets to be separated are not linearly separable in that space. For cases like these we need to map the original finite-dimensional space to a much higher-dimensional space.
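The hyperplane and margin described above can be made concrete with a small sketch of the linear case. It assumes the weight vector `w` and bias `b` have already been learned; training and the higher-dimensional mapping are not shown, and the class `LinearSvmSketch` is a name introduced here for illustration only.

```java
// Hypothetical helper showing how a trained linear SVM classifies a point.
class LinearSvmSketch {

    // Signed score w.x + b; its sign selects the class.
    static double decision(double[] w, double b, double[] x) {
        double s = b;
        for (int i = 0; i < w.length; i++) {
            s += w[i] * x[i];
        }
        return s;
    }

    // +1 for the positive class, -1 for the negative class.
    static int predict(double[] w, double b, double[] x) {
        return decision(w, b, x) >= 0 ? 1 : -1;
    }

    // Distance from x to the separating hyperplane: |w.x + b| / ||w||.
    static double distanceToHyperplane(double[] w, double b, double[] x) {
        double norm = 0.0;
        for (double wi : w) {
            norm += wi * wi;
        }
        return Math.abs(decision(w, b, x)) / Math.sqrt(norm);
    }
}
```

The margin the SVM maximizes is the smallest such distance over the training points, so a larger margin means every training tweet sits further from the decision boundary.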

D. Recursive Neural Tensor Network (RNTN)

The Recursive Neural Model is used to parse an n-gram into a binary tree in a bottom-up fashion. Each word is a leaf node represented by a vector, and each vector subsequently becomes part of a composite vector one level up in the hierarchy, known as the parent vector. The binary tree is constructed by considering the parts of speech of the tweet. At each node, whether parent or child, the vector is classified with a 5-class classifier, which indicates the composite classification of the node itself and its children. To elucidate, let's take a tri-gram with words a, b and c.

Each word is represented as an n-dimensional vector. The word vectors are first initialized by sampling randomly from a uniform distribution, and are then stacked as the columns of a matrix of size n X V, where V is the number of words in the vocabulary. A softmax classifier is then applied to the vectors. The sentiment classification matrix WS is multiplied by the word vector A, so WS has dimensions 5 X n and A has n X 1. The final label is then computed as:

        YA = softmax (WS A)

The process is repeated for b and c individually. A composite vector is then computed for b and c combined, and the composition continues upward, level by level, until it reaches the root. The root sentiment is the overall sentiment for the sentence.
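The per-node classification step YA = softmax(WS A) can be sketched as follows. Only this step is shown; the tensor-based composition of child vectors into parent vectors is not. The class `NodeSoftmaxSketch` and the example weights are assumptions for illustration, not values from the project.

```java
// Hypothetical sketch of the 5-class softmax applied at each tree node.
class NodeSoftmaxSketch {

    // Numerically stable softmax: subtract the max before exponentiating.
    static double[] softmax(double[] z) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : z) {
            max = Math.max(max, v);
        }
        double sum = 0.0;
        double[] p = new double[z.length];
        for (int i = 0; i < z.length; i++) {
            p[i] = Math.exp(z[i] - max);
            sum += p[i];
        }
        for (int i = 0; i < z.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    // ws is the (5 x n) sentiment matrix, a is the (n x 1) node vector.
    static double[] classify(double[][] ws, double[] a) {
        double[] z = new double[ws.length];
        for (int r = 0; r < ws.length; r++) {
            for (int c = 0; c < a.length; c++) {
                z[r] += ws[r][c] * a[c];
            }
        }
        return softmax(z);
    }

    // Index of the most probable sentiment class.
    static int argmax(double[] p) {
        int best = 0;
        for (int i = 1; i < p.length; i++) {
            if (p[i] > p[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

The returned array holds one probability per sentiment class and sums to 1; `argmax` picks the label assigned to the node.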


5 High level Architecture for Twitter API

[Figure: High Level Architecture for Twitter Search]

Description:

1. The user searches for a specific keyword.
2. The REST web service checks in the configuration whether the keyword exists. If it does, the service gets the tweets and their sentiments and displays them using D3.js charts.
3. If it does not exist, the service spawns a new Java process with the same keyword.
4. The process gets tweets using the Twitter Search API through Twitter4J.
5. That process analyses the sentiments of those tweets and stores them in HBase.
6. The REST web service collects the sentiments and timestamps and displays them using D3.js.


6 High Level architecture for Twitter Streaming API

[Figure: High Level Architecture for Twitter Stream]

Description:

1. The user searches for a specific keyword.
2. The REST web service checks the configuration file of streaming keywords. If the keyword exists, the service gets the tweets and their sentiments and displays them using D3.js charts.
3. On the backend, Apache Flume collects tweets for specific keywords using the Twitter Streaming APIs.
4. Apache Flume sinks to the Hadoop file system (HDFS).
5. Offline computation is done on the files stored in HDFS, and the tweets are stored in HBase.
6. The REST web service finds the keyword in HBase and displays the graph using D3.js.


7 Front End, Middle Tier, Data Store & Cloud Interaction

7.1 Low level architecture for Twitter Search API

[Figure: Low Level Architecture for Twitter Search]

Description:

1. The user signs in using their login credentials.
2. Through REST web services, those credentials are validated against the MongoDB database.
3. The MongoDB database stores user-specific details, segregating user data from the tweets. A completely separate database is thus used for fast retrieval of user records by the REST web services.
4. The user queries a keyword using the search bar.
5. The REST web service looks for the keyword in its configuration file.
6. If the keyword is not in the configuration file, the web service looks for it in HBase.
7. If it is there, the service finds the tweets and their sentiments and displays them using D3.js charts.
8. If the keyword does not exist in the database either, the service checks whether a Java process is already running that searches tweets for that keyword. If so, it waits a few seconds and then starts retrieving the tweets and their sentiments from HBase.
9. If not, the REST web service spawns a new Java process that contains Twitter4J.
10. The new Java process uses Twitter4J to collect tweets filtered by the specific keyword.
11. Recursive Neural Tensor Network sentiment analysis is performed on each tweet.
12. The Stanford NLP jar evaluates the sentiment, and the tweets are stored with their sentiments and other details in the HBase cluster.
13. This HBase cluster is coordinated by Zookeeper.
14. The HBase cluster has one master node; the other nodes are Region Servers acting as slaves.
15. Zookeeper coordinates the master and slave nodes at runtime.
16. The REST web service talks to HBase through the HBase master node.
17. It finds the keyword after a few seconds of sleep and retrieves the tweets with their sentiment analysis details.
18. The REST API displays the sentiments using D3.js charts.
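The branching in steps 5 to 9 above can be summarized as a small decision function. This is one interpretation of the described flow, using in-memory sets as stand-ins for the configuration file, the HBase lookup, and the list of running search processes. The class `KeywordLookupSketch`, the `Action` names, and the parameters are all hypothetical.

```java
import java.util.Set;

// Hypothetical summary of how the REST service decides what to do with a keyword.
class KeywordLookupSketch {

    enum Action {
        SERVE_STREAMING_RESULTS,  // keyword is in the streaming configuration
        SERVE_STORED_RESULTS,     // keyword already has rows in HBase
        WAIT_FOR_RUNNING_SEARCH,  // a search process for it is already running
        SPAWN_SEARCH_PROCESS      // nothing exists yet: start a new search
    }

    static Action decide(String keyword,
                         Set<String> configuredKeywords,
                         Set<String> keywordsInStore,
                         Set<String> runningSearches) {
        if (configuredKeywords.contains(keyword)) {
            return Action.SERVE_STREAMING_RESULTS;
        }
        if (keywordsInStore.contains(keyword)) {
            return Action.SERVE_STORED_RESULTS;
        }
        if (runningSearches.contains(keyword)) {
            return Action.WAIT_FOR_RUNNING_SEARCH;
        }
        return Action.SPAWN_SEARCH_PROCESS;
    }
}
```

Checking for an already running search before spawning a new one is what keeps repeated queries for the same keyword from starting duplicate Java processes.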

7.2 Low level architecture for Twitter Streaming API

[Figure: Low Level Architecture for Twitter Stream]

The low-level architecture of the Twitter Streaming API is divided into two parts:

1. ONLINE COMPUTE
2. OFFLINE COMPUTE


ONLINE COMPUTE

1. The user signs in using their login credentials.
2. Through REST web services, those credentials are validated against the MongoDB database.
3. The MongoDB database stores user-specific details, segregating user data from the tweets. A completely separate database is thus used for fast retrieval of user records by the REST web services.
4. The user queries a keyword using the search bar.
5. The REST web service looks for the keyword in its configuration file.
6. The keyword must be one of the keywords in the configuration file, because Flume collects tweets only for those predefined keywords.
7. If the service finds the keyword in the HBase database, it retrieves all the tweets with their sentiments and displays them using D3.js charts.

OFFLINE COMPUTATION

1. In offline mode, Apache Flume runs with a set of specific keywords that it filters from the Twitter Streaming APIs.
2. The Twitter Streaming APIs are used for continuous polling of new tweets.
3. Flume collects tweets filtered by the specific keywords, then filters them again down to the required fields.
4. It establishes a sink to HDFS to store those tweets in the HDFS distributed file system.
5. HDFS is a cluster that stores all the details of the tweets filtered by Apache Flume.
6. There can be thousands or millions of tweets, so it is advisable to store the data on a distributed system.
7. In offline mode, another daemon process runs that polls the HDFS file system every 5 minutes.
8. It takes those tweets with their detail fields, performs Recursive Neural Tensor Network sentiment analysis on each tweet and evaluates its sentiment.
9. The tweets, with their sentiment analysis results, are stored in the HBase cluster.
10. This HBase cluster is coordinated by Zookeeper.
11. The HBase cluster has one master node; the other nodes are Region Servers acting as slaves.
12. Zookeeper coordinates the master and slave nodes at runtime and monitors them for failure.
13. The tweets are stored in the HBase region servers.
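Steps 7 to 9 above describe a periodic fetch, score, store loop. The following is a minimal sketch of such a loop using the standard `ScheduledExecutorService`. The HDFS read, the RNTN classifier and the HBase write are represented by plain functional interfaces, and the names `OfflinePollerSketch`, `scoreBatch` and `start` are assumptions introduced for illustration, not the project's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.function.ToIntFunction;

// Hypothetical polling loop: fetch new tweets, score them, store the results.
class OfflinePollerSketch {

    // One tweet paired with its sentiment class.
    static final class ScoredTweet {
        final String text;
        final int sentiment;

        ScoredTweet(String text, int sentiment) {
            this.text = text;
            this.sentiment = sentiment;
        }
    }

    // Scores every tweet in a batch with the supplied classifier.
    static List<ScoredTweet> scoreBatch(List<String> tweets, ToIntFunction<String> classifier) {
        List<ScoredTweet> scored = new ArrayList<>();
        for (String tweet : tweets) {
            scored.add(new ScoredTweet(tweet, classifier.applyAsInt(tweet)));
        }
        return scored;
    }

    // Runs fetch -> score -> store immediately and then every periodMinutes.
    // Note: scheduleAtFixedRate stops rescheduling if a run throws,
    // so a production loop would catch and log exceptions inside the task.
    static ScheduledExecutorService start(Supplier<List<String>> fetchNewTweets,
                                          ToIntFunction<String> classifier,
                                          Consumer<List<ScoredTweet>> store,
                                          long periodMinutes) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> store.accept(scoreBatch(fetchNewTweets.get(), classifier)),
                0, periodMinutes, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

Calling `start(fetch, classifier, store, 5)` would match the 5 minute interval in step 7; the returned scheduler should be shut down with `shutdown()` when the daemon stops.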


8 Dataflow diagram:

Data Flow Diagram Level 0:

Level 0 is a very high-level diagram of the data flow between entities, processes and tables/databases.

• The user communicates with a process, the web application.
• The web application communicates with the BigDataUser collection and with the searchAPISentiments and streamingSentimentAnalysis HTables.
• The HTable searchAPISentiments communicates with the "Tweets and sentiments" process.
• The HTable streamingSentimentAnalysis communicates with the "Streaming Tweets sentiment analysis" process.
• The "Tweets and sentiments analysis" process communicates with the Twitter Search API, which is an outside entity.
• The "Streaming Tweets sentiment analysis" process communicates with the Twitter Streaming API, which is an outside entity.


Data Flow Diagram Level 1:

Level 1 is a lower-level diagram of the data flow between user entities, processes and tables/databases.

• The user gives login information to the User Authentication process.
• The User Authentication process communicates with BigdataUserCollection and transfers the user information data.
• The user gives keywords to the Keyword Query API process, which checks each keyword against the available list.
• It sends the keywords to the searchAPISentiments table, the streamingSentimentAnalysis table and the getSentiment process.
• Both tables, searchAPISentiments and streamingSentimentAnalysis, give the sentiments of the tweets back to the getSentiment process.
• The getSentiment process displays the list of sentiments to the user.


9 KDD Principles

KDD, or Knowledge Discovery in Databases, is a multidisciplinary branch of science that deals with data storage, data access, highly scalable algorithms, massive data sets and the interpretation of results. The processes usually included in data warehousing, such as cleansing and access, also aid the KDD process. Besides this, we also use principles from Artificial Intelligence by discovering pragmatic laws from observations and experimentation. The patterns recognized through this process must also be valid on new data with a certain degree of certainty. These patterns lead to new knowledge about the domain.

Steps involved in the KDD process:

1. Identify the customer's goal and the objectives of the KDD process.
2. Understand the domains involved and the application domain knowledge required.
3. Select the target data set, or a subset of the data under consideration, on which discovery needs to be performed.
4. Cleanse and pre-process the data by deploying strategies for handling missing fields, and change the data as per the specific requirements.
5. Simplify the data by eliminating unused variables. Thereafter, analyze the features that best represent the data in question, subject to the final goal.
6. Match the available data mining methods with the goals to suggest the possible hidden patterns that may emerge from the KDD process.
7. Choose data mining algorithms to discover hidden patterns. This involves choosing the data model and parameters best suited for the KDD process.
8. Search for interesting patterns in a specific representational form, which involves classification rules, regression, decision trees and clustering.
9. Interpret the essential information from the mined patterns.
10. Use the obtained knowledge, possibly as part of another system for further processing.
11. Document the observations and prepare reports to be presented to the relevant stakeholders.


10 Data Tools

Cloudera Hadoop Manager
Cloudera Hadoop Cluster
Apache Flume
Apache HBase
Amazon EC2
MongoDB


11 Wire Frame UI

12 Client Side Design

1. Home Page
2. User Registration and Login
3. Twitter Search and Stream

13 Load Testing (Stress Test)

Load testing was conducted on our web application using the LoadImpact tool. The memory usage of all the applications was captured at different stages of the load test. The load test results follow, along with detailed screenshots.

The screenshot below illustrates the load test scenario being initialized:

We captured the memory usage in the initial state, when the system was under normal memory utilization; this is shown in the screenshot below. We see 78.8% CPU utilization.


Following are the statistics and screenshot for our completed load test:

No. of requests sent to our website on completion = 4154
While testing, the number of requests per second ranged from 5 to 35 requests/second.
Data received as response from our website = 465.12 MiB
Total testing time = 4 minutes (approximately)


The graph below shows the number of active users in green and the time to load the website in blue. Initially we see spikes for the first 30 seconds; however, as the system scales (due to deployment on Amazon EC2), the average time to load reduces to 1.41 s.

The chart below depicts the different URLs that were loaded as part of our load testing. With our deployment architecture and design, we see zero failed requests. There are 30 pages of such URLs (not shown, to keep the document brief).


The chart below illustrates the content distribution during the load test:

Finally, the system utilization during the load test was captured. The system CPU utilization increased from the initial 78.8% to only 84.4%. Hence, under this load, the designed and deployed architecture scaled well.


14 Design Patterns Used

1. The Proactor Pattern:

Since our project deals with asynchronous event processing, such as the arrival of tweets, we have implemented the Proactor pattern to help handle asynchronous events in our system. The implementation is as follows:

• Data comes from the Twitter Stream in the form of an event.
• There is a dedicated listener for every possible keyword, which responds to an event (the arrival of a message for a particular keyword) by passing it on to the HDFS file system. This has to be done in a coordinated fashion amongst the listeners, keeping in mind the asynchronous behavior of the messages.
• We have implemented this functionality using Apache Flume, which creates a pipeline (Source → Channel → Sink) from the Twitter Stream to the HDFS file system in our project.

Code Sample for Proactor Pattern:

    import java.util.List;
    import twitter4j.StallWarning;
    import twitter4j.Status;
    import twitter4j.StatusDeletionNotice;
    import twitter4j.StatusListener;
    import twitter4j.User;

    // BigDataServiceConfiguration and RNTN are project classes.
    BigDataServiceConfiguration configuration = new BigDataServiceConfiguration();
    final List<String> twitterStreamingKewordsList = configuration.getStompQueueName();

    StatusListener listener = new StatusListener() {
        @Override
        public void onException(Exception arg0) {}

        @Override
        public void onDeletionNotice(StatusDeletionNotice arg0) {}

        @Override
        public void onScrubGeo(long arg0, long arg1) {}

        @Override
        public void onStallWarning(StallWarning stallWarning) {}

        // Required by the StatusListener interface; absent from the original listing.
        @Override
        public void onTrackLimitationNotice(int arg0) {}

        // Called asynchronously for every tweet that arrives on the stream.
        @Override
        public void onStatus(Status status) {
            RNTN sentiment = new RNTN();
            User user = status.getUser();
            String content = status.getText();
            System.out.println("Tweet:" + content + "\n");

            // Find which configured keyword (if any) the tweet mentions.
            String keyword = "";
            for (String key : twitterStreamingKewordsList)
                if (content.contains(key)) keyword = key;

            if (!keyword.equals("")) {
                // Get tweet details & sentiment
                String username = status.getUser().getScreenName();
                System.out.println("Username:" + username);
                String profileLocation = user.getLocation();
                System.out.println("Profile Location:" + profileLocation);
                long tweetId = status.getId();
                System.out.println("Tweet ID:" + tweetId);
                int sentimentOut = sentiment.findSentiment(content);
                System.out.println("Sentiment:" + sentimentOut);
                // (the remainder of the listing is not included in the report)
            }
        }
    };

2. The Factory Pattern:

In object-oriented programming, a factory defines a template for object creation; that is, it is an object which can instantiate other objects based on a certain method call. In our case, we want to access the Twitter API in both ways: Search and Streaming. We wish to instantiate the right object, i.e. TwitterSearch or TwitterStream, depending upon the nature of the method call. This is a very fundamental concept in OOP and forms the basis for a number of related software design patterns, in our project as well.

    TwitterFactory tf = new TwitterFactory();
    Twitter twitter = tf.getInstance();

    TwitterStreamFactory ts = new TwitterStreamFactory();
    TwitterStream tsi = ts.getInstance();

    BigDataServiceConfiguration configuration = new BigDataServiceConfiguration();
    final List<String> twitterStreamingKewordsList = configuration.getStompQueueName();
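The listing above uses Twitter4J's two factories directly. The idea of choosing between Search and Streaming behind a single method call can be sketched as below. This is an illustration of the pattern only: `TweetSourceFactorySketch`, `TweetSource`, `SearchSource` and `StreamSource` are hypothetical names that stand in for the project's TwitterSearch and TwitterStream objects and do not call Twitter4J.

```java
// Hypothetical factory that hides which concrete tweet source is created.
class TweetSourceFactorySketch {

    enum Mode { SEARCH, STREAM }

    // Common interface the caller programs against.
    interface TweetSource {
        String describe();
    }

    static final class SearchSource implements TweetSource {
        @Override
        public String describe() {
            return "on-demand keyword search";
        }
    }

    static final class StreamSource implements TweetSource {
        @Override
        public String describe() {
            return "continuous keyword stream";
        }
    }

    // The factory method: the caller states what it needs, not how to build it.
    static TweetSource create(Mode mode) {
        switch (mode) {
            case SEARCH:
                return new SearchSource();
            case STREAM:
                return new StreamSource();
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }
}
```

A caller would write `TweetSourceFactorySketch.create(Mode.STREAM)` and use the result through the `TweetSource` interface, so adding a third access mode only touches the factory.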

3. The Singleton Pattern:

The Singleton pattern is a software construct that ensures that, throughout the lifecycle of a program, only a single instance of a given object exists. In our project we have used Apache HBase, which is accessed from Java through a client API. A strict requirement of the client API is that interaction with the database should occur via the HTable object, which shall be defined only once throughout its lifetime.


    @POST
    @Path("/sentiment")
    @Timed(name = "get-sentiment")
    public String getSentiment(@QueryParam("keyword") String keyword)
            throws InterruptedException, IOException {
        int flag = 0;
        BigDataServiceConfiguration configuration = new BigDataServiceConfiguration();
        List<String> twitterStreamingKewordsList = configuration.getStompQueueName();

        // Streaming keywords are already being collected, so no search is needed for them.
        for (String key : twitterStreamingKewordsList)
            if (keyword.equals(key)) {
                flag = 1;
                break;
            }

        if (flag == 0) {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "searchAPISentimentAnalysis");
            Get get = new Get(Bytes.toBytes(keyword));
            Result result = table.get(get);
            // Start a new Twitter search only if nothing is stored for this keyword.
            if (result.size() == 0) {
                Tweets twitterSearch = new Tweets();
                twitterSearch.search(keyword);
            }
        }
        return "Search Initiated...";
    }
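The listing above constructs an HTable inside the request handler. One common way to get the single shared instance that the Singleton description calls for is the initialization-on-demand holder idiom, sketched here. `TableHolderSketch` and `TableClient` are hypothetical stand-ins (no HBase classes are used), and whether one shared handle is safe under concurrent requests depends on the client library, which this sketch does not address.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical lazy, thread-safe singleton using the holder idiom.
class TableHolderSketch {

    // Counts constructions so the single-instance property can be checked.
    static final AtomicInteger CREATED = new AtomicInteger();

    // Stand-in for an expensive client object such as a table handle.
    static final class TableClient {
        final String tableName;

        private TableClient(String tableName) {
            this.tableName = tableName;
            CREATED.incrementAndGet();
        }
    }

    // The JVM initializes Holder only on first access, and does so exactly once.
    private static final class Holder {
        static final TableClient INSTANCE = new TableClient("searchAPISentimentAnalysis");
    }

    static TableClient getInstance() {
        return Holder.INSTANCE;
    }
}
```

Every call to `getInstance()` returns the same object, and the private constructor prevents other code from creating a second one.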


15 Datasets and Data patterns

HBase:

HBase is a column-oriented database that can hold millions of columns and billions of rows. One of its most useful features is the column family, so in our data model we have taken keywords as column families. The main reason for using HBase is that it provides real-time random read/write access to the data.

1. SearchAPISentiment:
   The SearchAPISentiment HTable stores rows keyed by tweet ID. Column families are keywords. The columns in each column family are created_at, name, location, text and sentiments.

2. StreamingSentimentAnalysis:
   The StreamingAPISentiment HTable stores rows keyed by tweet ID. Column families are keywords. The columns in each column family are created_at, name, location, text and sentiments.


Examples of Datasets and Data patterns:

1. SearchAPISentiment:

2. StreamingSentimentAnalysis:


MongoDB collection

A separate database stores all the user information, and the REST web service interacts with this database for user information. It is completely segregated from HBase, which stores the big data, so that user interactions do not touch the big data store.

BigdataUsercollection:
The collection has three keys: username, password, email