HANDS-ON SESSION: DEVELOPING ANALYTIC APPLICATIONS USING APACHE SPARK™ AND PYTHON
Part 2: Analyzing car Twitter data with Spark and DashDb
PyCon 2016, Portland

David Taieb, STSM, IBM Cloud Data Services Developer Advocate
david_taieb@us.ibm.com

© 2016 IBM Corporation

Agenda

•  Provision the application services on Bluemix: Spark, DashDb, IBM Insight for Twitter

•  Load car-related tweets from IBM Insight for Twitter into the DashDb warehouse

•  Run analytics in a Python Notebook and discover new insights


Sign up for Bluemix

•  Access the IBM Bluemix website at https://console.ng.bluemix.net
•  Click on Get Started for Free
•  Complete the form and click Create account
•  Look for the confirmation email and click on the confirm your account link

Create new Space


Create a new space on Bluemix

In preparation for running the project, we create a new space on Bluemix.

Create a Spark Instance

Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse.


Create a Spark Instance

Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse.


Create New Spark Instance

Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse.


Acquiring the data

•  In the next section, we show how to acquire the Twitter data and store it in DashDb.

•  We use the Twitter loading connector, available as a menu in the DashDb console.

Create a DashDb instance


Create an instance of IBM Dash DB on Bluemix

Create an IBM Insight for Twitter instance


Create an instance of IBM Insight for Twitter on Bluemix


Agenda

•  Provision the application services on Bluemix: Spark, DashDb, IBM Insight for Twitter

•  Load car-related tweets from IBM Insight for Twitter into the DashDb warehouse

•  Run analytics in a Python Notebook and discover new insights


Launch DashDb Console

Click on the DashDb Service tile to open this dashboard, then click on the Launch button.

Load Twitter Data


Load Twitter Data

The DashDb Console offers multiple data connectors, including a Twitter connector that automatically connects to IBM Insight for Twitter.

Connect to Twitter


Connect to Twitter

Reusing the Twitter service instance created in the previous step


Select the data to be loaded

Twitter query being used: posted:2015-01-01,2015-12-31 followers_count:2000 listed_count:1000 (volkswagen OR vw OR toyota OR daimler OR mercedes OR bmw OR gm OR "general motors" OR tesla)

Specify Twitter query

Provide preview count of output data
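
The query above can also be assembled programmatically, which helps when experimenting with other date ranges or brands. The sketch below is illustrative only: `build_car_query` is a hypothetical helper, and the spacing between query clauses is an assumption, since the slide text lost its whitespace.

```python
# Hypothetical helper: assembles a search string in the format shown on the slide.
def build_car_query(start, end, min_followers, min_listed, terms):
    """Return a query string with date range, audience filters, and OR-ed terms."""
    # Quote multi-word terms so they are treated as a single phrase.
    quoted = ['"{}"'.format(t) if " " in t else t for t in terms]
    return "posted:{},{} followers_count:{} listed_count:{} ({})".format(
        start, end, min_followers, min_listed, " OR ".join(quoted)
    )

# Search terms copied from the slide.
CAR_TERMS = ["volkswagen", "vw", "toyota", "daimler", "mercedes",
             "bmw", "gm", "general motors", "tesla"]

query = build_car_query("2015-01-01", "2015-12-31", 2000, 1000, CAR_TERMS)
print(query)
```

Paste the printed string into the query field of the Twitter connector wizard.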


Select the DashDb Table

Name of the schema under which the tables will be created

Prefix (namespace) for the created tables

List of tables that will be created


Loading data monitoring page

Warning: loading time may vary based on bandwidth. It may take between 15 minutes and 1 hour.


Complete the load: Statistics


Complete the load: explore the data


Get connection information

Copy the user ID, password, and JDBC URL; you'll need this information later.


Agenda

•  Provision the application services on Bluemix: Spark, DashDb, IBM Insight for Twitter

•  Load car-related tweets from IBM Insight for Twitter into the DashDb warehouse

•  Run analytics in a Python Notebook and discover new insights


Create new Notebook from URL

Import required Python packages

•  Create notebook from URL
•  Use https://github.com/ibm-cds-labs/spark.samples/raw/master/notebook/DashDB%20Twitter%20Car%202015%20Python%20Notebook.ipynb


Step 1: Import Python Packages

•  Install the nltk package (Natural Language Toolkit)
•  We will use it to filter stop words later in the tutorial


Import Python modules and set up the SQLContext


Step 2: Define global Variables

Set up various data structures we'll need throughout the Notebook

This is the SCHEMA and PREFIX you used in Step 3 of the Twitter connector wizard
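
As a rough illustration of how the schema and prefix combine into table names, here is a minimal sketch. The values `DASH1234` and `CARS2015` are placeholders, and the `TWEETS` and `SENTIMENTS` suffixes are assumptions based on the tables named later in this deck; check the list of tables shown by the connector wizard for the real names.

```python
# Placeholder values: replace with the schema and prefix you entered in the
# Twitter connector wizard.
SCHEMA = "DASH1234"
PREFIX = "CARS2015"

def qualified_table(schema, prefix, suffix):
    """Build a fully qualified table name of the form SCHEMA.PREFIX_SUFFIX."""
    return "{}.{}_{}".format(schema, prefix, suffix)

# Assumed suffixes; the wizard lists the tables it actually creates.
TWEETS_TABLE = qualified_table(SCHEMA, PREFIX, "TWEETS")
SENTIMENTS_TABLE = qualified_table(SCHEMA, PREFIX, "SENTIMENTS")
print(TWEETS_TABLE, SENTIMENTS_TABLE)
```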


Set up some global helper functions

JavaScript Google map visualization

Misc helper that fills in missing dates
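
A helper that fills in missing dates might look like the sketch below. It is not the notebook's code: `fill_missing_dates` is a hypothetical name and the sample counts are made up. The idea is that days with no tweets still need a zero entry so the timeline charts have a continuous x-axis.

```python
from datetime import date, timedelta

def fill_missing_dates(counts, start, end):
    """Return (day, count) pairs for every day from start to end inclusive.

    `counts` maps a date to a tweet count; days without an entry get 0.
    """
    filled = []
    day = start
    while day <= end:
        filled.append((day, counts.get(day, 0)))
        day += timedelta(days=1)
    return filled

# Made-up sample: two days with tweets and a two-day gap between them.
sparse = {date(2015, 9, 18): 120, date(2015, 9, 21): 340}
series = fill_missing_dates(sparse, date(2015, 9, 18), date(2015, 9, 21))
for day, count in series:
    print(day.isoformat(), count)
```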


Step 3: Acquire the data from DashDB

User ID and password from Connection page
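
One way to read a DashDb table into Spark is through Spark's generic JDBC data source. The sketch below is an assumption about how that could look, not a copy of the notebook: `jdbc_options` and `load_table` are hypothetical helpers, the credentials and table name are placeholders, and the DB2 driver class is assumed to be available on the Spark classpath.

```python
def jdbc_options(jdbc_url, user, password, table):
    """Collect the options Spark's JDBC data source expects."""
    return {
        "url": jdbc_url,
        "user": user,
        "password": password,
        "dbtable": table,
        # Assumption: DashDb is reached through the IBM DB2 JDBC driver.
        "driver": "com.ibm.db2.jcc.DB2Driver",
    }

def load_table(sql_context, options):
    """Read one table into a Spark DataFrame (needs a live SQLContext)."""
    return sql_context.read.format("jdbc").options(**options).load()

# Placeholders: use the user ID, password, and JDBC URL from the Connection page.
opts = jdbc_options(
    "jdbc:db2://<host>:<port>/<database>", "<userid>", "<password>",
    "MYSCHEMA.MYPREFIX_TWEETS",
)
```

In a notebook where `sqlContext` already exists, `load_table(sqlContext, opts)` would return the DataFrame.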


Join the Tweets and Sentiment Table

In this step, we want to add a sentiment score for each tweet record:
•  Join the Tweets and Sentiments tables
•  Encode the sentiment into a number, e.g. POSITIVE=+1, NEGATIVE=-1, AMBIVALENT=0
•  Create an average for each sentiment associated with a tweet
•  %time instruments the code to provide profile execution stats.
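
The scoring logic in this step can be sketched in plain Python, independent of Spark. The encoding comes from the slide; everything else (the `average_sentiment` name, the row layout, the sample rows) is a made-up stand-in for the Sentiments table, and the notebook itself works on Spark DataFrames.

```python
from collections import defaultdict

# Encoding taken from the slide; labels outside this map are skipped.
SENTIMENT_SCORE = {"POSITIVE": 1, "NEGATIVE": -1, "AMBIVALENT": 0}

def average_sentiment(sentiment_rows):
    """Average the encoded sentiment per tweet.

    `sentiment_rows` is an iterable of (tweet_id, sentiment_label) pairs,
    standing in for the rows of the Sentiments table.
    """
    scores = defaultdict(list)
    for tweet_id, label in sentiment_rows:
        if label in SENTIMENT_SCORE:
            scores[tweet_id].append(SENTIMENT_SCORE[label])
    return {tid: sum(vals) / float(len(vals)) for tid, vals in scores.items()}

# Made-up sample rows: tweet t1 carries three sentiment entries.
rows = [("t1", "POSITIVE"), ("t1", "NEGATIVE"), ("t1", "POSITIVE"),
        ("t2", "NEGATIVE"), ("t3", "AMBIVALENT")]
averages = average_sentiment(rows)
print(averages)
```

The resulting dictionary plays the role of the per-tweet score that the join attaches to each tweet record.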


Step 4: Transform the data

Create a clean working DataFrame that will be easier to use in our analytics


Step 5: Geographic distribution of tweets

Group by countries and aggregate the tweets count

Convert Spark SQL DataFrame to Pandas data structure for visualization


Bar chart visualization of Tweet distribution by Geo


Google map visualization of tweet distribution by Geos

Call GeoChart helper that sets up the JavaScript code


Clean up memory before next analytics

Resources, including memory on the Spark driver machine, are not infinite. It is good practice to clean up when data is no longer needed.


Step 6: Analyzing tweets sentiment

Group by sentiments and aggregate the tweets count

Convert Spark SQL DataFrame to Pandas data structure for visualization


Sentiment visualization

Use a Matplotlib pie chart


Step 7: Analyze Tweet timeline

Convert Spark SQL DataFrame to Pandas data structure for visualization

Group by posting time and sentiment tuples; aggregate the tweet counts

Group by posting time and sentiment tuples; aggregate the sum of the tweet counts


Prepare the timeline data structures


Time series visualization for all tweets


Deep dive into car manufacturers

Create a new DataFrame that enriches tweets with extra metadata:
- Boolean for each car manufacturer
- Boolean for electric car
- Boolean for self-driving car
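
The enrichment can be pictured as a function that maps tweet text to a set of Boolean flags. The sketch below is plain Python and purely illustrative: the keyword lists, flag names, and the `enrich` helper are assumptions, not the notebook's actual definitions.

```python
import re

# Keyword lists are illustrative assumptions, not the notebook's actual lists.
MANUFACTURERS = {
    "VW": ["volkswagen", "vw"],
    "TOYOTA": ["toyota"],
    "DAIMLER": ["daimler", "mercedes"],
    "BMW": ["bmw"],
    "GM": ["gm", "general motors"],
    "TESLA": ["tesla"],
}
ELECTRIC_TERMS = ["electric car", "electric vehicle"]
SELF_DRIVING_TERMS = ["self driving", "self-driving", "autonomous"]

def contains_any(text, terms):
    """True if any term appears in text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in terms)

def enrich(tweet_text):
    """Return one Boolean flag per manufacturer plus two topic flags."""
    text = tweet_text.lower()
    flags = {name: contains_any(text, terms)
             for name, terms in MANUFACTURERS.items()}
    flags["ELECTRIC"] = contains_any(text, ELECTRIC_TERMS)
    flags["SELF_DRIVING"] = contains_any(text, SELF_DRIVING_TERMS)
    return flags

# Made-up sample tweet.
flags = enrich("Tesla shows off its self-driving electric car")
print(flags)
```

Whole-word matching matters for short names: a plain substring test for "gm" would also flag unrelated words that merely contain those letters.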


Re-analyze tweet timeline for each car manufacturer

Create a new DataFrame for each car manufacturer
Aggregate the tweet counts, order by posting time


Timeline series visualization

Notice the peak of tweets for VW between September and October 2015


Explain the peak of tweets for VW between September and October 2015

Filter for all VW tweets between September and October 2015

Pie chart visualization of the top 10 words being used in these tweets

Create a map count of all non-stop-words used in the tweets

Use the NLTK stopwords module to filter out stop words
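
The word-count step can be sketched as follows. To keep the example runnable without downloading NLTK data, it uses a tiny hand-written stop-word set as a stand-in; with NLTK installed and its stopwords corpus downloaded, `nltk.corpus.stopwords.words("english")` would supply the list instead. The `top_words` helper and the sample tweets are made up for illustration.

```python
import re
from collections import Counter

# Tiny stand-in stop-word list so the sketch runs without NLTK data;
# the deck uses NLTK's stopwords module for this step.
STOP_WORDS = {"the", "a", "of", "to", "in", "and", "is", "for", "on", "over"}

def top_words(tweets, n=10):
    """Count non-stop-words across tweets and return the n most common."""
    counts = Counter()
    for text in tweets:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)

# Made-up sample tweets.
sample = ["VW admits to cheating on emissions tests",
          "The emissions scandal is growing for VW"]
top = top_words(sample, n=2)
print(top)
```

The list of (word, count) pairs is the kind of input a pie chart of the top 10 words needs.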


Peak explained

We can clearly see from the list of most-used words that the peak corresponds to the VW scandal around fraudulent emissions testing.


Follow the notebook for many more interesting analytics


Resources

•  https://developer.ibm.com/clouddataservices/
•  https://github.com/ibm-cds-labs/simple-data-pipe
•  https://github.com/ibm-cds-labs/pipes-connector-flightstats
•  http://spark.apache.org/docs/latest/mllib-guide.html
•  https://console.ng.bluemix.net/data/analytics/


Thank You