Spark Tutorial, PyCon 2016, Part 2

David Taieb
STSM - IBM Cloud Data Services Developer Advocate
david_taieb@us.ibm.com

HANDS-ON SESSION: DEVELOPING ANALYTIC APPLICATIONS USING APACHE SPARK™ AND PYTHON
Part 2: Analyzing car Twitter data with Spark and DashDb
PyCon 2016, Portland

© 2016 IBM Corporation

Agenda

•  Provision the application services on Bluemix: Spark, DashDb, IBM Insight for Twitter
•  Load car-related tweets from IBM Insight for Twitter into the DashDb warehouse
•  Run analytics in a Python Notebook and discover new insights

Sign up for Bluemix
•  Access the IBM Bluemix website at https://console.ng.bluemix.net
•  Click on Get Started for Free
•  Complete the form and click Create account
•  Look for the confirmation email and click on the confirm your account link

Create new Space

Create a new space on Bluemix
In preparation for running the project, we create a new space on Bluemix.

Create a Spark Instance

Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse.

Create a Spark Instance

Create New Spark Instance

Acquiring the data

•  In the next section, we show how to acquire the Twitter data and store it in DashDb.
•  We use the Twitter loading connector, available as a menu in the DashDb console.

Create a DashDb instance

Create an instance of IBM Dash DB on Bluemix

Create an IBM Insight for Twitter instance

Create an instance of IBM Insight for Twitter on Bluemix

Agenda

•  Provision the application services on Bluemix: Spark, DashDb, IBM Insight for Twitter
•  Load car-related tweets from IBM Insight for Twitter into the DashDb warehouse
•  Run analytics in a Python Notebook and discover new insights

Launch DashDb Console
Click on the DashDb service tile to open this dashboard, then click on the Launch button.

Load Twitter Data

Load Twitter Data

The DashDb console offers multiple data connectors, including a Twitter connector that automatically connects to IBM Insight for Twitter.

Connect to Twitter

Connect to Twitter

Reusing the Twitter service instance created in the previous step

Select the data to be loaded
Twitter query being used:
posted:2015-01-01,2015-12-31 followers_count:2000 listed_count:1000 (volkswagen OR vw OR toyota OR daimler OR mercedes OR bmw OR gm OR "general motors" OR tesla)

Specify Twitter query

Provide preview count of output data
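
The query on this slide is a date range, two numeric filters, and an OR-list of keywords. As a rough illustration only, the sketch below assembles the same string from its parts in plain Python. The helper name `build_query` and its parameters are hypothetical and are not part of the connector; the connector wizard simply takes the finished string.

```python
# Hypothetical helper (not part of the DashDb connector): assemble the
# query string shown on the slide from its individual parts.
def build_query(start, end, followers, listed, keywords):
    # Quote multi-word keywords so they are treated as a phrase.
    terms = ['"{}"'.format(k) if " " in k else k for k in keywords]
    return "posted:{},{} followers_count:{} listed_count:{} ({})".format(
        start, end, followers, listed, " OR ".join(terms))


KEYWORDS = ["volkswagen", "vw", "toyota", "daimler", "mercedes",
            "bmw", "gm", "general motors", "tesla"]

query = build_query("2015-01-01", "2015-12-31", 2000, 1000, KEYWORDS)
print(query)
```

Keeping the keyword list separate makes it easier to reuse the same manufacturer names later in the notebook.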

Select the DashDb Table

Name of the schema under which the tables will be created

Prefix (namespace) for the created tables

List of tables that will be created

Loading data monitoring page

Warning: loading time may vary based on bandwidth. It may take between 15 minutes and 1 hour.

Complete the load: Statistics

Complete the load: explore the data

Get connection information
Copy the user ID, password, and JDBC URL; you'll need this information later.

Agenda

•  Provision the application services on Bluemix: Spark, DashDb, IBM Insight for Twitter
•  Load car-related tweets from IBM Insight for Twitter into the DashDb warehouse
•  Run analytics in a Python Notebook and discover new insights

Create new Notebook from URL

•  Create notebook from URL
•  Use https://github.com/ibm-cds-labs/spark.samples/raw/master/notebook/DashDB%20Twitter%20Car%202015%20Python%20Notebook.ipynb

Import required Python packages

Step 1: Import Python Packages
•  Install the nltk package (Natural Language Toolkit)
•  We will use it to filter stop words later in the tutorial

Import Python modules and set up the SQLContext

Step 2: Define global Variables

Set up various data structures we'll need throughout the Notebook.

This is the SCHEMA and PREFIX you used in Step 3 of the Twitter connector wizard.

Set up some global helper functions

JavaScript Google map visualization

Misc helper that fills in missing dates
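
The slide does not show the helper's code, so the following is only a minimal plain-Python sketch of what a "fill in missing dates" helper might do: given tweet counts for some days, return a dense series with a zero for every day that has no data. The function name `fill_missing_dates` and the sample counts are invented for illustration.

```python
from datetime import date, timedelta


def fill_missing_dates(counts, start, end):
    """Return [(day, count)] for every day in [start, end].

    Days absent from `counts` get a count of 0, so a timeline plot has
    no gaps on its x axis.
    """
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    return [(day, counts.get(day, 0)) for day in days]


# Invented sample data: two days with tweets, two days without.
sparse = {date(2015, 9, 18): 120, date(2015, 9, 21): 340}
dense = fill_missing_dates(sparse, date(2015, 9, 18), date(2015, 9, 21))
for day, count in dense:
    print(day.isoformat(), count)
```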

Step 3: Acquire the data from DashDB

User ID and password from the Connection page
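
As a hedged sketch of how the connection details might be packaged before reading a table, the code below builds the arguments for a JDBC read. Everything in angle brackets is a placeholder to be replaced with the values copied from the Connection page. The helper name `jdbc_read_args`, the sample schema and prefix, and the `SCHEMA.PREFIX_TABLE` naming pattern are assumptions, not taken from the notebook.

```python
# Hypothetical helper: bundle the values copied from the Connection page.
def jdbc_read_args(jdbc_url, user, password, schema, prefix, table):
    """Return (url, qualified_table, properties) for a JDBC read."""
    # Assumption: tables are named SCHEMA.PREFIX_TABLE; check the list of
    # tables shown by the Twitter connector wizard for the real names.
    qualified_table = "{}.{}_{}".format(schema, prefix, table)
    properties = {"user": user, "password": password}
    return jdbc_url, qualified_table, properties


url, table, props = jdbc_read_args(
    "<jdbc url>", "<user id>", "<password>",
    schema="MYSCHEMA", prefix="MYPREFIX", table="TWEETS")
print(table)

# With a live SQLContext, the values could then be passed to Spark's
# JDBC reader, for example:
#   tweets_df = sqlContext.read.jdbc(url, table, properties=props)
```

Keeping the credentials in one place means they only need to be edited once when the notebook is shared or rerun against another instance.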

Join the Tweets and Sentiment Table
In this step, we want to add a sentiment score to each tweet record:
•  Join the Tweets and Sentiments tables
•  Encode the sentiment into a number, e.g. POSITIVE=+1, NEGATIVE=-1, AMBIVALENT=0
•  Average the sentiment scores associated with each tweet
•  %time instruments the code to provide execution profiling stats.
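
The encoding and averaging steps can be illustrated without Spark. The sketch below is plain Python over an in-memory list, not the notebook's DataFrame code: it maps each sentiment label to the numbers given on the slide and averages them per tweet. The tweet IDs and rows are invented sample data, and treating an unknown label as 0 is an assumption.

```python
from collections import defaultdict

# Encoding taken from the slide.
SENTIMENT_SCORE = {"POSITIVE": 1, "NEGATIVE": -1, "AMBIVALENT": 0}


def average_sentiment(sentiment_rows):
    """Average the encoded sentiment per tweet.

    sentiment_rows: iterable of (tweet_id, sentiment_label) pairs, i.e.
    the shape a join of the tweets and sentiments tables would produce.
    """
    scores = defaultdict(list)
    for tweet_id, label in sentiment_rows:
        # Assumption: labels outside the mapping count as neutral (0).
        scores[tweet_id].append(SENTIMENT_SCORE.get(label, 0))
    return {tid: sum(vals) / float(len(vals)) for tid, vals in scores.items()}


# Invented sample rows: tweet t1 has three sentiment records.
rows = [("t1", "POSITIVE"), ("t1", "NEGATIVE"), ("t1", "POSITIVE"),
        ("t2", "NEGATIVE"), ("t3", "AMBIVALENT")]
averages = average_sentiment(rows)
print(averages)
```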

Step 4: Transform the data

Create a clean working DataFrame that will be easier to use in our analytics

Step 5: Geographic distribution of tweets

Group by countries and aggregate the tweet count

Convert the Spark SQL DataFrame to a Pandas data structure for visualization
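
To make the group-by-and-count step concrete, here is a small plain-Python stand-in that uses `collections.Counter` in place of a Spark aggregation. The field name `country` and the sample records are assumptions made for this sketch.

```python
from collections import Counter

# Invented sample records; the real data comes from the DashDb tables.
tweets = [{"country": "US"}, {"country": "US"}, {"country": "DE"},
          {"country": "US"}, {"country": "JP"}, {"country": "DE"}]

# Equivalent in spirit to a group-by on country followed by a count.
by_country = Counter(t["country"] for t in tweets)

# Sorted from most to least tweets, ready to feed a bar chart.
ranked = by_country.most_common()
print(ranked)  # [('US', 3), ('DE', 2), ('JP', 1)]
```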

Bar chart visualization of Tweet distribution by Geo

Google map visualization of tweet distribution by Geos

Call the GeoChart helper that sets up the JavaScript code

Clean up memory before next analytics

Resources, including memory, on the Spark driver machine are not infinite. It is good practice to clean up when data is no longer needed.

Step 6: Analyzing tweets sentiment

Group by sentiments and aggregate the tweet count

Convert the Spark SQL DataFrame to a Pandas data structure for visualization

Sentiment visualization

Use a Matplotlib pie chart

Step 7: Analyze Tweet timeline

Convert the Spark SQL DataFrame to a Pandas data structure for visualization

Group by posting time and sentiment tuples; aggregate the tweet counts

Group by posting time and sentiment tuples; aggregate the sum of the tweet counts
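
The two aggregations on this slide can be sketched in plain Python as a two-level count: first per (posting day, sentiment) pair, then summed per day. This is an illustration of the idea rather than the notebook's Spark code, and the rows are invented sample data.

```python
from collections import Counter

# Invented sample rows of (posting day, sentiment label).
rows = [("2015-09-18", "NEGATIVE"), ("2015-09-18", "NEGATIVE"),
        ("2015-09-18", "POSITIVE"), ("2015-09-19", "AMBIVALENT")]

# First level: count tweets per (day, sentiment) tuple.
by_day_and_sentiment = Counter(rows)

# Second level: sum those counts per day, across all sentiments.
by_day = Counter()
for (day, _sentiment), n in by_day_and_sentiment.items():
    by_day[day] += n

print(sorted(by_day.items()))
```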

Prepare the timeline data structures

Time series visualization for all tweets

Deep dive into car manufacturers

Create a new DataFrame that enriches tweets with extra metadata:
- Boolean for each car manufacturer
- Boolean for electric car
- Boolean for self-driving car
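
One way such Boolean flags could be derived is by keyword matching on the tweet text. The sketch below is plain Python rather than the notebook's DataFrame code. The keyword lists, the grouping of "mercedes" under Daimler, and the function names are all assumptions made for illustration.

```python
import re

# Assumed keyword lists, loosely based on the Twitter query used earlier.
MANUFACTURERS = {
    "volkswagen": ["volkswagen", "vw"],
    "toyota": ["toyota"],
    "daimler": ["daimler", "mercedes"],
    "bmw": ["bmw"],
    "gm": ["gm", "general motors"],
    "tesla": ["tesla"],
}
TOPICS = {
    "electric_car": ["electric"],
    "self_driving_car": ["self driving", "self-driving", "autonomous"],
}


def mentions(text, phrases):
    # Word boundaries keep short names like "gm" from matching inside words.
    return any(re.search(r"\b{}\b".format(re.escape(p)), text)
               for p in phrases)


def enrich(tweet_text):
    """Return one Boolean flag per manufacturer and per topic."""
    text = tweet_text.lower()
    flags = {name: mentions(text, kws) for name, kws in MANUFACTURERS.items()}
    flags.update({name: mentions(text, kws) for name, kws in TOPICS.items()})
    return flags


flags = enrich("VW and Tesla race toward the self-driving electric car")
print(flags)
```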

Re-analyze the Twitter timeline for each car manufacturer

Create a new DataFrame for each car manufacturer; aggregate the tweet counts, ordered by posting time

Timeline series visualization

Notice the peak of tweets for VW between September and October 2015

Explain the peak of tweets for VW between September and October 2015

Filter for all VW tweets between Sept and Oct 2015

Pie chart visualization of the top 10 words being used in these tweets

Create a map count of all non-stop words used in the tweets

Use the NLTK stopwords module to filter out stop words
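
The filter, stop-word removal, and word count can be sketched in plain Python as below. To stay self-contained, the sketch uses a tiny hard-coded stop-word set as a stand-in for NLTK's list, which the notebook would load with `nltk.corpus.stopwords.words('english')`. The sample tweets, the regular expression used for tokenizing, and the function name `top_words` are assumptions.

```python
import re
from collections import Counter
from datetime import date

# Stand-in for NLTK's English stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "a", "of", "in", "to", "and", "is", "for", "on", "over"}

PEAK_START, PEAK_END = date(2015, 9, 1), date(2015, 10, 31)


def top_words(texts, n=10):
    """Count every non-stop word and return the n most frequent."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)


# Invented sample tweets as (posting day, text) pairs.
tweets = [
    (date(2015, 8, 30), "New VW Golf review"),
    (date(2015, 9, 21), "VW admits cheating on emissions tests"),
    (date(2015, 9, 23), "Emissions scandal: VW CEO resigns"),
    (date(2015, 10, 15), "VW recalls cars over emissions"),
]

# Keep only tweets posted inside the September-October window.
peak_texts = [text for day, text in tweets if PEAK_START <= day <= PEAK_END]
top = top_words(peak_texts, n=2)
print(top)
```

The resulting list of (word, count) pairs is the kind of input a pie chart of the top words would need.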

Peak explained

We can clearly see from the list of most used words that the peak corresponds to the VW scandal around fraudulent emissions testing.

Follow the notebook for many more interesting analytics

Resources

•  https://developer.ibm.com/clouddataservices/
•  https://github.com/ibm-cds-labs/simple-data-pipe
•  https://github.com/ibm-cds-labs/pipes-connector-flightstats
•  http://spark.apache.org/docs/latest/mllib-guide.html
•  https://console.ng.bluemix.net/data/analytics/

Thank You
