CMPE 239: Web and Data Mining
Professor: Chandrasekar Vuppalapati

Project: SentiXchange
(Sentiment Analysis of Twitter Feeds)

Data Insights Inc.
Akshay Bapat (008020571)
Akshay Wattal (008941816)
Mishal Shah (00873194)
Shashank Garg (009310418)

04/29/2014

Table of Contents
1 Project Description
2 Requirements
3 UI Design Rules
4 Data Mining Principles & Algorithms
5 High Level Architecture for Twitter API
6 High Level Architecture for Twitter Streaming API
7 Front End, Middle Tier, Data Store & Cloud Interaction
7.1 Low Level Architecture for Twitter Search API
7.2 Low Level Architecture for Twitter Streaming API
8 Dataflow Diagram
9 KDD Principles
10 Data Tools
11 Wire Frame UI
12 Client Side Design
13 Load Testing (Stress Test)
14 Design Patterns Used
15 Datasets and Data Patterns

1 Project Description
There is an abundance of data in our world, existing in various forms (textual, visual, and so on), that is instrumental in driving our daily needs and requirements. Thanks to the proliferation of social media and the rise of mobile and cloud technologies, data generation and accumulation on the World Wide Web have grown exponentially. We can derive several useful insights and gain strategic information from the information disseminated in this form.

Take for instance the case of movie reviews. We can predict the box-office performance of a movie based upon the number of reviews and discussions found on various social media, bulletin boards and discussion forums. This can also be generalized to a specific product or service. Using these online sources of information, we can evaluate and assess the quality and performance of several products and services. Think of social media as a live pulse into the minds of the masses, quite similar to the Cerebro used by Charles Xavier to read the minds of all the mutants as depicted in the famous X-Men series.

Figure: The Cerebro, a mind-reading device as portrayed in the X-Men series

This brings us to our chosen topic, Sentiment Exchange (SentiXchange) of Twitter Feeds. We chose sentiment analysis as our project because it is an emerging field of technology that is used in several domains including media, politics and finance. In fact, based on the original research titled "Linking Text Sentiment to Public Opinion Time Series" by Brendan O'Connor, Bryan R. Routledge, Ramnath Balasubramanyan, and Noah A. Smith, we came to the conclusion that the sentiment data gained through textual sources about a given stock is actually a very accurate indicator of the corresponding performance of the same stock in the market.

Figure: Twitter Feed Sentiment vs Gallup Poll of Consumer Confidence

We were so inspired by the idea of using sentiment data to predict the behavior of existing complex financial systems that we decided to come up with a "Sentiment Index", in which we define the commonly addressed key entities or hashtags as a "stock" whose relative value (or sentiment) can be tracked over a given period of time. We propose a real-time sentiment index system, where the user can view the commonly occurring keywords or hashtags on social media and analyze their sentiment, in the form of positive or negative polarity, in a graph that varies continuously in time, much like the stock market.

The business value proposition of this system is aimed at all the relevant stakeholders in the fields of advertising, marketing, public relations, finance, manufacturing and sales, where entrepreneurs can get real-time feedback on the performance of a product or a service. Take for example the upcoming Galaxy brand of smartphones. This tool can be used as an indicator of the popularity of the product before and after its predicted launch date. In the case of political campaigns, people will get real-time feedback on the general perception and outlook of the public towards certain political parties. Investors and market players can use this as a tool for guidance in their financial strategies to strengthen their decisions.

2 Requirements
1. User Registration:
The system should allow a user to log in to the SentiXchange portal, where the user can build a portfolio of monitored keywords and tweets.

2. User Authentication and Confidentiality:
The system should keep a user's browsing sessions and search history private and confidential, as exposing them is akin to giving insight into somebody's strategic and analytical plans and may lead to unforeseen ramifications for the perception of the monitored entity.

3. Hashtag Lookup Function (Twitter Search API):
The system should have a provision for a user to search for a particular keyword and thus view a real-time plot of its sentiment versus time.

4. Live Hashtag Sentiment Streaming (Twitter Streaming API):
The system should have a mechanism to stream live tweets based on the most popular tweeted keywords, to get a dynamic, real-time analysis of tweets on the fly.

5. Twitter Dynamic Feeds:
The system should have a live feeds page that continuously updates with the latest feeds, populating the fields based on the user's preferences, profile and search history.

6. Historical Analysis of Twitter Feeds:
The system should allow the user to create reports on the performance of the chosen keywords on a daily, weekly, monthly, quarterly or yearly scale. It should have a provision to combine and aggregate multiple keywords to create a composite index, in order to customize and personalize each individual's sentiment tracking based on his or her portfolio.

3 UI Design Rules
• The Structure Principle:
The UI shall be designed in a meaningful way, based on clear and consistent models that users can easily recognize, by relating things that are similar and keeping different things separate. This principle has been adopted in our project for the UI layer.

• The Simplicity Principle:
The design should aim to be as simple as possible, making the most common tasks very easy to carry out without requiring much prior information or instruction from the user. The language shall be clear and concise and must match that of the target audience to achieve maximum impact.

• The Visibility Principle:
The design presented to the user must make all the options and means required to execute a given task available to the user, without causing unnecessary distractions by providing redundant or extraneous information. The key is to avoid overwhelming the user with excessive options and selections.

• The Feedback Principle:
The design should provide an efficient means to keep the user updated on the actions performed, the current interpretations, changes of state or condition, and any errors or exceptions that could be relevant to the user.

• The Reuse Principle:
The design must permit the reuse of both the internal and the external components and behaviors of the system, maintaining consistency with purpose rather than arbitrary consistency, so that users need to think less and remember less.

4 Data Mining Principles & Algorithms
A. Naïve Bayesian Classifier
These classifiers are based on Bayes' rule, a formula for the conditional probability of an event X given the occurrence of an event Y, written P(X | Y). Bayes' rule states that to determine this probability, all we need is the probability of the reverse condition, P(Y | X), together with the individual probabilities of the two events, P(X) and P(Y):

P(X | Y) = P(X) P(Y | X) / P(Y)

This is helpful when the probability we want is hard to estimate directly but the reverse conditional probability is available. In our case, to identify whether a given tweet is positive or negative given its contents, we use Bayes' theorem to express that probability in terms of the probability of occurrence of the tweet given that it is already known to be positive or negative. This is very convenient for our calculation, because we already have examples of positive and negative tweets in our existing data set. To estimate it, we make the very broad ("naïve") assumption that the probability of occurrence of a tweet is equal to the product of the probabilities of occurrence of all the individual words within it.
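
To make the independence assumption concrete, the classification rule can be written out as below. This is the standard textbook form of a Naïve Bayes classifier rather than a formula taken from the project; the symbols $c$ (a sentiment class), $t$ (a tweet) and $w_1, \dots, w_n$ (the words of the tweet) are notation introduced here for illustration.

```latex
\begin{align}
P(c \mid t) &= \frac{P(c)\,P(t \mid c)}{P(t)}
  && \text{(Bayes' rule with } X = c,\ Y = t\text{)} \\
P(t \mid c) &\approx \prod_{i=1}^{n} P(w_i \mid c)
  && \text{(naïve independence of words)} \\
\hat{c} &= \arg\max_{c \,\in\, \{\text{positive},\ \text{negative}\}}
  P(c) \prod_{i=1}^{n} P(w_i \mid c)
  && \text{(} P(t) \text{ is the same for every class)}
\end{align}
```

Because $P(t)$ does not depend on the class, it can be dropped when comparing the positive and negative scores.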

B. Maximum Entropy
This classifier, also known as MaxEnt or ME, serves as an alternative to the Naïve Bayesian technique and has proved to be a very effective model for several natural language processing applications. Berger et al. [22] showed that this classifier even outperforms Naïve Bayes at times. Unlike Naïve Bayes, this classifier makes no assumption about the independence of the occurrence of words. It is best suited for applications where not much prior information is known about the data in question, and for cases where we should consider the dependence of the occurrence of certain words on others. The maximum entropy modeling technique chooses the probability distribution that is as uniform as possible while remaining consistent with the constraints derived from the input data. In other words, we choose the model that makes the fewest assumptions about the given data while still satisfying the underlying constraints. We use several iterations of this algorithm in order to arrive at the classification of sentiments.
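
For reference, a maximum entropy classifier is usually written in the following log-linear form. This is the common textbook formulation, not an equation from the project; the feature functions $f_i$, the weights $\lambda_i$, the class $c$ and the tweet $t$ are notation assumed here.

```latex
P(c \mid t) =
  \frac{\exp\!\Big(\sum_{i} \lambda_i\, f_i(c, t)\Big)}
       {\sum_{c'} \exp\!\Big(\sum_{i} \lambda_i\, f_i(c', t)\Big)}
```

The weights $\lambda_i$ are what an iterative training procedure adjusts, which is consistent with the several iterations mentioned above.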

C. Support Vector Machines
SVMs have proved to be highly effective at categorizing traditional text and are thus a suitable candidate among our algorithms for sentiment analysis of Twitter data. This algorithm is also known to have outperformed Naïve Bayes on several occasions; it is a large-margin classifier rather than a probabilistic one, as Naïve Bayes and Max Entropy are. The basic concept of the SVM is to find a hyperplane, represented by a vector, that separates the feature vectors into the positive class and the negative class. Beyond that, it strives to make the margin between the categories as wide as possible. While the original SVM problem is stated in a finite-dimensional space, it is quite possible that the sets to be separated are not linearly separable in that space. For cases like these we need to map the original finite-dimensional space to a much higher-dimensional space.
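
The large-margin idea can be stated as an optimization problem. The block below is the standard hard-margin formulation, given as a sketch; the symbols $w$, $b$, $x_i$, $y_i$ and the mapping $\phi$ are notation assumed here and do not come from the project.

```latex
\begin{align}
\min_{w,\,b}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to}\quad & y_i\,\big(w^{\top} x_i + b\big) \ge 1,
  \qquad i = 1, \dots, m, \quad y_i \in \{-1, +1\}
\end{align}
% The margin between the two classes has width 2 / \lVert w \rVert,
% so minimizing \lVert w \rVert maximizes the separation.
% For data that is not linearly separable, x_i is replaced by \phi(x_i),
% a mapping into a higher-dimensional space.
```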

D. Recursive Neural Tensor Network (RNTN)
The recursive neural model parses an n-gram into a binary tree in a bottom-up fashion. Each word is a leaf node represented by a vector, and that vector in turn becomes part of the composite vector one level up in the hierarchy, known as the parent vector. The binary tree is constructed by considering the parts of speech of the tweet. At each node, whether parent or child, the vector is classified with a 5-class classifier, which gives the composite classification of the node and its children. To elucidate, let us take a tri-gram with words a, b and c.

Each word is represented as an n-dimensional vector. The word vectors are first initialized by sampling randomly from a uniform distribution and are then stacked into a matrix of size n × V, where V is the number of words in the vocabulary and each word vector forms one column of the matrix. The word vectors are then fed to a softmax classifier: the sentiment classification matrix W_s is multiplied by the word vector a, where W_s has dimensions 5 × n and a has dimensions n × 1. The final label is then computed as:

y_a = softmax(W_s a)

The process is repeated for b and c individually, and then a resultant vector is calculated for b and c combined. The computation continues compositionally up the tree until it reaches the root. The root sentiment is the overall sentiment for the sentence.
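
The per-node classification step above can be sketched in plain Java. This is a minimal illustration and not the Stanford NLP implementation the project relies on: the class name SoftmaxSketch, its method names and any weights passed to it are assumptions made for the example, and the recursive composition of child vectors is left out.

```java
/** Sketch of the per-node step y = softmax(W_s a); all names are hypothetical. */
final class SoftmaxSketch {

    /** Numerically stable softmax: subtract the maximum before exponentiating. */
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double sum = 0.0;
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    /** Multiplies the (classes x n) matrix ws by the n-vector a, then applies softmax. */
    static double[] classify(double[][] ws, double[] a) {
        double[] scores = new double[ws.length];
        for (int r = 0; r < ws.length; r++) {
            for (int c = 0; c < a.length; c++) {
                scores[r] += ws[r][c] * a[c];
            }
        }
        return softmax(scores);
    }

    /** Index of the most probable class. */
    static int argmax(double[] probabilities) {
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

With a 5 × n weight matrix, classify returns five probabilities that sum to one, and argmax picks the predicted sentiment class for that node.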

5 High Level Architecture for Twitter API
Figure: High Level Architecture for Twitter Search

Description:
1. The user searches for a specific keyword.
2. The REST web service checks its configuration to see whether the keyword exists. If it does, the service gets the tweets and their sentiments and displays them using D3.js charts.
3. If it does not exist, the service spawns a new Java process with the same keyword.
4. The process gets tweets using the Twitter Search API through Twitter4J.
5. The process analyses the sentiments of those tweets and stores them in HBase.
6. The REST web service collects the sentiments and timestamps and displays them using D3.js.

6 High Level Architecture for Twitter Streaming API
Figure: High Level Architecture for Twitter Stream

Description:
1. The user searches for a specific keyword.
2. The REST web service checks the configuration file of streaming keywords. If the keyword exists, the service gets the tweets and their sentiments and displays them using D3.js charts.
3. On the backend, Apache Flume collects the tweets for specific keywords using the Twitter Streaming APIs.
4. Apache Flume sinks the tweets to the Hadoop file system.
5. Offline computation is done on the files stored in HDFS, and the tweets are stored in HBase.
6. The REST web service finds the keyword in HBase and displays the graph using D3.js.

7 Front End, Middle Tier, Data Store & Cloud Interaction
7.1 Low Level Architecture for Twitter Search API
Figure: Low Level Architecture for Twitter Search

Description:
1. The user signs in using his or her login credentials.
2. Those credentials are validated against the MongoDB database through the REST web services.
3. The MongoDB database stores user-specific details, keeping user data separate from the tweets. A completely separate database is thus used so that the REST web services can retrieve user records quickly.
4. The user queries a keyword using the search bar.
5. The REST web service looks up the keyword in its configuration file.
6. If the keyword is not found there, the web service looks it up in HBase.
7. If it is there, the service retrieves the tweets and their sentiments and displays them using D3.js charts.
8. If the keyword does not exist in the database either, the service checks whether a Java process is already running that searches for tweets using that keyword. If so, it waits for a few seconds and then starts retrieving the tweets and their sentiments from HBase.
9. If not, the REST web service spawns a new Java process that contains Twitter4J.
10. The new Java process uses Twitter4J to collect tweets filtered by the specific keyword.
11. Recursive Neural Tensor Network sentiment analysis is performed on each tweet.
12. The Stanford NLP jar evaluates the sentiment and stores the tweets, with their sentiments and other details, in the HBase cluster.
13. This HBase cluster is coordinated by ZooKeeper.
14. The HBase cluster has one master node and a number of Region Servers that act as slaves.
15. ZooKeeper coordinates the master and slave nodes at runtime.
16. The REST web service talks to HBase through the HBase master node.
17. It finds the keyword after a few seconds of sleep and retrieves the tweets with their sentiment analysis details.
18. The REST API displays the sentiments using D3.js charts.
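
The decision made in steps 5 to 9 above (serve stored results, wait for a job that is already running, or start a new one) can be sketched as a small in-memory model. Everything here is an assumption for illustration: KeywordLookupSketch, its Outcome values and completeFetch are hypothetical names, a HashMap stands in for HBase, and a set of keywords stands in for the running Java processes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** In-memory sketch of the lookup-or-spawn decision; not the project's code. */
final class KeywordLookupSketch {

    enum Outcome { SERVED_FROM_STORE, FETCH_ALREADY_RUNNING, FETCH_STARTED }

    // Stand-in for the HBase table: keyword -> scored tweets.
    private final Map<String, List<String>> store = new HashMap<>();
    // Stand-in for the set of keyword fetch processes currently running.
    private final Set<String> inProgress = new HashSet<>();

    /** Decides how a query for the given keyword should be handled. */
    synchronized Outcome handle(String keyword) {
        if (store.containsKey(keyword)) {
            return Outcome.SERVED_FROM_STORE;      // results already stored
        }
        if (!inProgress.add(keyword)) {
            return Outcome.FETCH_ALREADY_RUNNING;  // a job exists: wait, then re-read
        }
        return Outcome.FETCH_STARTED;              // caller starts a new fetch job
    }

    /** Called when a fetch job finishes and its results are stored. */
    synchronized void completeFetch(String keyword, List<String> scoredTweets) {
        store.put(keyword, new ArrayList<>(scoredTweets));
        inProgress.remove(keyword);
    }
}
```

Tracking in-flight keywords in one place keeps two simultaneous queries for the same keyword from starting two fetch jobs.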

7.2 Low Level Architecture for Twitter Streaming API
Figure: Low Level Architecture for Twitter Stream

The low level architecture of the Twitter Streaming API is divided into two parts:
1. ONLINE COMPUTE and 2. OFFLINE COMPUTE

ONLINE COMPUTE
1. The user signs in using his or her login credentials.
2. Those credentials are validated against the MongoDB database through the REST web services.
3. The MongoDB database stores user-specific details, keeping user data separate from the tweets. A completely separate database is thus used so that the REST web services can retrieve user records quickly.
4. The user queries a keyword using the search bar.
5. The REST web service looks up the keyword in its configuration file.
6. The keyword must be one of the specific keywords in the configuration file, because Flume has a predefined list of keywords for which it collects tweets.
7. If the service finds the keyword in the HBase database, it retrieves all the tweets with their sentiments and displays them using D3.js charts.

OFFLINE COMPUTE
1. In offline mode, Apache Flume runs with a set of specific keywords that it filters from the Twitter Streaming APIs.
2. The Twitter Streaming APIs are used to receive new tweets continuously.
3. Flume collects tweets filtered by the specific keywords, then filters each tweet down to the required fields.
4. It establishes a sink to HDFS to store those tweets in the HDFS distributed file system.
5. HDFS is a cluster that stores all the details of the tweets filtered by Apache Flume.
6. There can be thousands or millions of tweets, so it is advisable to store the data on a distributed system.
7. In offline mode, another daemon process runs that polls the HDFS file system every 5 minutes.
8. It takes those tweets with their detail fields and performs Recursive Neural Tensor Network sentiment analysis on each tweet to evaluate its sentiment.
9. The tweets, together with their sentiment analysis results, are stored in the HBase cluster.
10. This HBase cluster is coordinated by ZooKeeper.
11. The HBase cluster has one master node and a number of Region Servers that act as slaves.
12. ZooKeeper coordinates the master and slave nodes at runtime and monitors them for failures.
13. The tweets are stored in the HBase region servers.

8 Dataflow Diagram
Data Flow Diagram Level 0:
Level 0 is a very high level diagram of the data flow between entities, processes and tables/databases.

• The user communicates with a process, which is the web application.
• The web application communicates with the BigDataUser collection and with the searchAPISentiments and streamingSentimentAnalysis HTables.
• The HTable searchAPISentiments communicates with the "Tweets and sentiments analysis" process.
• The HTable streamingSentimentAnalysis communicates with the "Streaming Tweets sentiment analysis" process.
• The process "Tweets and sentiments analysis" communicates with the Twitter Search API, which is an outside entity.
• The process "Streaming Tweets sentiment analysis" communicates with the Twitter Streaming API, which is an outside entity.

Data Flow Diagram Level 1:
Level 1 is a lower level, more detailed diagram of the data flow between user entities, processes and tables/databases.

• The user gives login information to the User Authentication process.
• The User Authentication process communicates with BigdataUserCollection and transfers the user information data.
• The user gives keywords to the Keyword Query API process, which checks each keyword against the available list.
• It sends the keywords to the searchAPISentiments and streamingSentimentAnalysis tables and to the getSentiment process.
• Both tables, searchAPISentiments and streamingSentimentAnalysis, give the sentiments of the tweets back to the getSentiment process.
• The getSentiment process displays the list of sentiments to the user.

9 KDD Principles
KDD, or Knowledge Discovery in Databases, is a multidisciplinary branch of science that deals with data storage, data access, highly scalable algorithms, massive data sets and the interpretation of results. The processes usually included in data warehousing, such as cleansing and access, also aid the KDD process. Besides this, we also use principles from Artificial Intelligence by discovering pragmatic laws from observations and experimentation. The patterns recognized through this process must also be valid on new data with a certain degree of certainty. These patterns lead to new knowledge about the domain.

Steps involved in the KDD process:
1. Identify the customer's goal and the objectives of the KDD process.
2. Understand the domains involved and the application domain knowledge required.
3. Select the target data set, or the subset of the data under consideration on which we need to perform discovery.
4. Cleanse and pre-process the data by deploying strategies for handling missing fields, and change the data as per the specific requirements.
5. Simplify the data by eliminating unused variables. Thereafter, analyze which features best represent the data in question, subject to the final goal.
6. Match the available data mining methods with the goals to suggest the possible hidden patterns that may emerge from the KDD process.
7. Choose data mining algorithms to discover hidden patterns. This involves choosing the data model and parameters best suited for the KDD process.
8. Search for interesting patterns in a specific representational form, which may involve classification rules, regression, decision trees and clustering.
9. Interpret the essential information from the mined patterns.
10. Use the obtained knowledge directly, or as part of another system for further processing.
11. Document the observations and prepare reports to be presented to the relevant stakeholders.

10 Data Tools
Cloudera Hadoop Manager
Cloudera Hadoop Cluster
Apache Flume
Apache HBase
Amazon EC2
MongoDB

11 Wire Frame UI

12 Client Side Design
1. Home Page
2. User Registration and Login
3. Twitter Search and Stream

13 Load Testing (Stress Test)
Load testing was conducted on our web application using the LoadImpact tool. The memory usage of all the applications was recorded at different stages of the load test. The load test results follow, along with the detailed screenshots.

The screenshot below illustrates the load test scenario being initialized:

We captured the resource usage in the initial state, when the system was under normal memory utilization; this is shown in the screenshot below. We see 78.8% CPU utilization.

The following are the statistics and screenshot for our completed load test:
Number of requests sent to our website on completion = 4154
During testing, the request rate ranged from 5 to 35 requests/second.
Data received as response from our website = 465.12 MiB
Total testing time = 4 minutes (approximately)
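
As a quick consistency check on these figures, the averages implied by the totals can be worked out as below. The calculation assumes the test ran for exactly 4 minutes (240 s), whereas the report gives the duration only approximately, so the results are rough.

```latex
\begin{align}
\text{average request rate} &\approx \frac{4154\ \text{requests}}{240\ \text{s}}
  \approx 17.3\ \text{requests/s} \\
\text{average response size} &\approx \frac{465.12\ \text{MiB}}{4154\ \text{requests}}
  \approx 0.112\ \text{MiB} \approx 114.7\ \text{KiB per request} \\
\text{average throughput} &\approx \frac{465.12\ \text{MiB}}{240\ \text{s}}
  \approx 1.94\ \text{MiB/s}
\end{align}
```

The average of about 17.3 requests/second falls inside the observed range of 5 to 35 requests/second.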

The graph below shows the number of active users in green and the time to load the website in blue. Initially we see spikes for the first 30 seconds; however, as the system scales (due to deployment on Amazon EC2), the average time to load reduces to 1.41 s.

The chart below lists the different URLs that were loaded as part of our load testing. Because of our deployment architecture and design, we see zero failed requests. There are 30 pages of such URLs (not shown, to keep the document brief).

The chart below illustrates the distribution of content during the load test:

Finally, the system utilization during the load test was captured. The system CPU utilization increased from the initial 78.8% to only 84.4%. Hence, the designed and deployed architecture is highly scalable.

14 Design Patterns Used
1. The Proactor Pattern:
Since our project deals with asynchronous event processing, such as the arrival of tweets, we have implemented the Proactor pattern to help handle asynchronous events in our system. The implementation is as follows:
• Data comes from the Twitter stream in the form of an event.
• There is a dedicated listener for every possible keyword, which responds to an event (the arrival of a message for a particular keyword) by passing it on to the HDFS file system. This has to be done in a coordinated fashion among the listeners, keeping in mind the asynchronous behavior of the messages.
• We have implemented this functionality using Apache Flume, which creates a pipeline (Source → Channel → Sink) from the Twitter stream to the HDFS file system in our project.

Code Sample for Proactor Pattern:

    // Uses Twitter4J (StatusListener, Status, StatusDeletionNotice, StallWarning, User)
    // and java.util.List; BigDataServiceConfiguration and RNTN are project classes.
    BigDataServiceConfiguration configuration = new BigDataServiceConfiguration();
    final List<String> twitterStreamingKewordsList = configuration.getStompQueueName();

    StatusListener listener = new StatusListener() {
        @Override
        public void onException(Exception arg0) {}

        @Override
        public void onDeletionNotice(StatusDeletionNotice arg0) {}

        @Override
        public void onScrubGeo(long arg0, long arg1) {}

        @Override
        public void onStallWarning(StallWarning stallWarning) {}

        // Not in the original listing, but required by the StatusListener interface.
        @Override
        public void onTrackLimitationNotice(int numberOfLimitedStatuses) {}

        @Override
        public void onStatus(Status status) {
            RNTN sentiment = new RNTN();
            User user = status.getUser();
            String content = status.getText();
            System.out.println("Tweet:" + content + "\n");

            // Find the configured keyword (if any) that this tweet mentions.
            String keyword = "";
            for (String key : twitterStreamingKewordsList)
                if (content.contains(key)) keyword = key;

            if (!keyword.equals("")) {
                // Get tweet details and sentiment.
                String username = status.getUser().getScreenName();
                System.out.println("Username:" + username);
                String profileLocation = user.getLocation();
                System.out.println("Profile Location:" + profileLocation);
                long tweetId = status.getId();
                System.out.println("Tweet ID:" + tweetId);
                int sentimentOut = sentiment.findSentiment(content);
                System.out.println("Sentiment:" + sentimentOut);
                // The listing in the report ends here.
            }
        }
    };

2. The Factory Pattern:
In object-oriented programming, a factory defines a template for object creation; that is, it is an object that can instantiate other objects based on a certain method call. In our case, we want to access the Twitter API in both ways: Search and Streaming. We wish to instantiate the right object, i.e. TwitterSearch or TwitterStream, depending upon the nature of the method call. This is a very fundamental concept in OOP and forms the basis for a number of related software design patterns, in our project as well.

    TwitterFactory tf = new TwitterFactory();
    Twitter twitter = tf.getInstance();
    TwitterStreamFactory ts = new TwitterStreamFactory();
    TwitterStream tsi = ts.getInstance();

    BigDataServiceConfiguration configuration = new BigDataServiceConfiguration();
    final List<String> twitterStreamingKewordsList = configuration.getStompQueueName();

3. The Singleton Pattern:
The Singleton pattern is a software construct that ensures that, throughout the lifecycle of a program, there is only a single instance of a given object. In our project, we have used Apache HBase, which uses a Java client to interface with the database. A strict requirement of the client API is that the interaction with the database should occur via the HTable object, which shall be defined only once throughout its lifetime.

    @POST
    @Path("/sentiment")
    @Timed(name = "get-sentiment")
    public String getSentiment(@QueryParam("keyword") String keyword) throws InterruptedException, IOException {
        int flag = 0;
        BigDataServiceConfiguration configuration = new BigDataServiceConfiguration();
        List<String> twitterStreamingKewordsList = configuration.getStompQueueName();

        // flag becomes 1 when the keyword is one of the configured streaming keywords.
        for (String key : twitterStreamingKewordsList)
            if (keyword.equals(key)) {
                flag = 1;
                break;
            }

        if (flag == 0) {
            // Not a streaming keyword: look it up in the search-API table.
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "searchAPISentimentAnalysis");
            Get get = new Get(Bytes.toBytes(keyword));
            Result result = table.get(get);
            if (result.size() == 0) {
                // Nothing stored yet: start a new Twitter search for the keyword.
                Tweets twitterSearch = new Tweets();
                twitterSearch.search(keyword);
            }
        }
        return "Search Initiated...";
    }
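
One common way to guarantee a single shared instance in Java is the initialization-on-demand holder idiom, sketched below. This is an illustration under stated assumptions rather than the project's code: SentimentTableHolder is a hypothetical stand-in that only records a table name and never touches HBase; a real version would wrap whatever table handle the HBase client provides.

```java
/** Hypothetical stand-in for a shared table handle; not an HBase class. */
final class SentimentTableHolder {

    private final String tableName;

    // Private constructor: no code outside this class can create instances.
    private SentimentTableHolder(String tableName) {
        this.tableName = tableName;
    }

    // The JVM initializes this nested class lazily and exactly once, on the
    // first call to getInstance(), which also makes creation thread-safe.
    private static final class Holder {
        static final SentimentTableHolder INSTANCE =
                new SentimentTableHolder("searchAPISentimentAnalysis");
    }

    /** Returns the one shared instance. */
    static SentimentTableHolder getInstance() {
        return Holder.INSTANCE;
    }

    String tableName() {
        return tableName;
    }
}
```

Every caller of getInstance() receives the same object, so the instance is defined only once for the lifetime of the program.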

15 Datasets and Data Patterns
HBase:
HBase is a columnar database that can hold millions of columns and billions of rows. One of its most useful features is the column family, so in our data model we have taken keywords as column families. The main reason for using HBase is that it provides real-time random read/write access to the stored data.

1. SearchAPISentiment:
The SearchAPISentiment HTable stores rows keyed by tweet ID. The column families are keywords. The columns in each column family are created_at, name, location, text and sentiments.

2. StreamingSentimentAnalysis:
The StreamingSentimentAnalysis HTable stores rows keyed by tweet ID. The column families are keywords. The columns in each column family are created_at, name, location, text and sentiments.
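
The row layout described above (tweet ID, then keyword as column family, then column) can be pictured as a nested map. The sketch below is only a mental model using in-memory maps: SentimentRowSketch and its methods are hypothetical names, and no HBase client calls are involved.

```java
import java.util.Map;
import java.util.TreeMap;

/** In-memory picture of the table layout; not an HBase client. */
final class SentimentRowSketch {

    // row key (tweet ID) -> column family (keyword) -> column -> value
    private final Map<String, Map<String, Map<String, String>>> rows = new TreeMap<>();

    /** Stores one cell, creating the row and column family maps on demand. */
    void put(String tweetId, String keyword, String column, String value) {
        rows.computeIfAbsent(tweetId, k -> new TreeMap<>())
            .computeIfAbsent(keyword, k -> new TreeMap<>())
            .put(column, value);
    }

    /** Returns the cell value, or null when the row, family or column is absent. */
    String get(String tweetId, String keyword, String column) {
        Map<String, Map<String, String>> families = rows.get(tweetId);
        if (families == null) {
            return null;
        }
        Map<String, String> columns = families.get(keyword);
        return columns == null ? null : columns.get(column);
    }
}
```

For example, put("123", "galaxy", "sentiments", "3") records the sentiment of tweet 123 under the keyword galaxy, and get("123", "galaxy", "sentiments") reads it back.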

Examples of Datasets and Data Patterns:
1. SearchAPISentiment:

2. StreamingSentimentAnalysis:

MongoDB collection
A separate database stores all the user information, and the REST web service interacts with this database for user information. It is completely segregated from HBase, which stores the big data, so that user interactions do not touch the big data store.

BigdataUsercollection:
The collection has three keys: username, password, email