David Taieb, STSM - IBM Cloud Data Services Developer Advocate, david_taieb@us.ibm.com
HANDS-ON SESSION: DEVELOPING ANALYTIC APPLICATIONS USING APACHE SPARK™ AND PYTHON
Part 2: Analyzing car Twitter data with Spark and DashDB
PyCon 2016, Portland
© 2016 IBM Corporation
Agenda
• Provision the application services on Bluemix: Spark, DashDB, IBM Insight for Twitter
• Load car-related tweets from IBM Insight for Twitter into the DashDB warehouse
• Run analytics in a Python Notebook and discover new insights
Sign up for Bluemix
• Access the IBM Bluemix website at https://console.ng.bluemix.net
• Click on Get Started for Free
• Complete the form and click Create account
• Look for the confirmation email and click on the "confirm your account" link
Create a new space on Bluemix
In preparation for running the project, we create a new space on Bluemix.
Create a Spark Instance
Optional: You can skip this step if you already have a space with a Spark instance that you would like to reuse.
Create New Spark Instance
Acquiring the data
• In the next section, we show how to acquire the Twitter data and store it in DashDB.
• We use the Twitter loading connector, available as a menu in the DashDB console.
Create an instance of IBM DashDB on Bluemix
Create an instance of IBM Insight for Twitter on Bluemix
Agenda
• Provision the application services on Bluemix: Spark, DashDB, IBM Insight for Twitter
• Load car-related tweets from IBM Insight for Twitter into the DashDB warehouse
• Run analytics in a Python Notebook and discover new insights
Launch DashDB Console
Click on the DashDB service tile to open this dashboard, then click on the Launch button.
Load Twitter Data
The DashDB console offers multiple data connectors, including a Twitter connector that automatically connects to IBM Insight for Twitter.
Connect to Twitter
Reusing the Twitter service instance created in the previous step
Select the data to be loaded
Twitter query being used: posted:2015-01-01,2015-12-31 followers_count:2000 listed_count:1000 (volkswagen OR vw OR toyota OR daimler OR mercedes OR bmw OR gm OR "general motors" OR tesla)
Specify the Twitter query
Provide a preview count of the output data
Select the DashDB Table
Name of the schema under which the tables will be created
Prefix (namespace) for the created tables
List of tables that will be created
Loading data monitoring page
Warning: loading time may vary based on bandwidth. It may take between 15 minutes and 1 hour.
Complete the load: Statistics
Complete the load: explore the data
Get connection information
Copy the user ID, password and JDBC URL; you’ll need this information later.
Agenda
• Provision the application services on Bluemix: Spark, DashDB, IBM Insight for Twitter
• Load car-related tweets from IBM Insight for Twitter into the DashDB warehouse
• Run analytics in a Python Notebook and discover new insights
Create new Notebook from URL
• Create notebook from URL
• Use https://github.com/ibm-cds-labs/spark.samples/raw/master/notebook/DashDB%20Twitter%20Car%202015%20Python%20Notebook.ipynb
Step 1: Import Python Packages
• Install the nltk package (Natural Language Toolkit)
• We will use it to filter stop words later in the tutorial
Import Python modules and set up the SQLContext
Step 2: Define global Variables
Set up various data structures we’ll need throughout the Notebook
This is the SCHEMA and PREFIX you used in Step 3 of the Twitter connector wizard
Set up some global helper functions
JavaScript Google map visualization
Misc helper that fills in missing dates
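The notebook's own date helper is not reproduced on the slide. As a minimal sketch of what such a helper might do, the plain Python below fills gaps in a daily count series with zeros so that a time series plot has one point per day. The function name fill_missing_dates, the input shape and the sample counts are all assumptions made for illustration.

```python
from datetime import date, timedelta

def fill_missing_dates(counts, start, end):
    """Return (day, count) pairs for every day from start to end inclusive.

    counts: dict mapping datetime.date -> int (hypothetical input shape).
    Days that are absent from counts are filled with 0.
    """
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    return [(day, counts.get(day, 0)) for day in days]

# Made-up sample values: two days with tweets, two days without any.
sparse = {date(2015, 9, 18): 120, date(2015, 9, 21): 340}
filled = fill_missing_dates(sparse, date(2015, 9, 18), date(2015, 9, 21))
```

Filling the gaps before plotting keeps the x-axis evenly spaced; without it, days with no tweets would simply be skipped.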
Step 3: Acquire the data from DashDB
User ID and password from Connection page
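One way to pass these credentials to Spark is through its generic JDBC data source. The sketch below only builds the option dictionary; the helper name dashdb_jdbc_options, the placeholder URL and the table name are hypothetical, and it assumes the DB2 JDBC driver is already available to Spark. The notebook itself may load the tables differently.

```python
def dashdb_jdbc_options(jdbc_url, user, password, table):
    """Build the options for Spark's generic JDBC data source."""
    return {
        "url": jdbc_url,
        "user": user,
        "password": password,
        "dbtable": table,
        # DB2 JDBC driver class (assumption: the driver jar is on the
        # Spark classpath of the notebook environment).
        "driver": "com.ibm.db2.jcc.DB2Driver",
    }

# Placeholder values; substitute the ones copied from the Connection page
# and the SCHEMA and PREFIX chosen in the Twitter connector wizard.
opts = dashdb_jdbc_options(
    jdbc_url="jdbc:db2://HOST:PORT/DATABASE",
    user="MY_USER",
    password="MY_PASSWORD",
    table="MY_SCHEMA.MY_PREFIX_TWEETS",
)

# With a SQLContext already available in the notebook (not created here):
# tweets_df = sqlContext.read.format("jdbc").options(**opts).load()
```

Keeping the credentials in one dictionary makes it easy to reuse them for each table that needs to be loaded, changing only the dbtable entry.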
Join the Tweets and Sentiment Table
In this step, we want to add a sentiment score for each tweet record:
• Join the Tweets and Sentiments tables
• Encode the sentiment into a number, e.g. POSITIVE=+1, NEGATIVE=-1, AMBIVALENT=0
• Compute the average of the sentiment scores associated with each tweet
• %time instruments the code to provide execution profiling stats
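The join, encode and average steps above can be sketched in plain Python. This is an illustration only, not the notebook's Spark code: the function name, the row shapes and the sample tweets are assumptions, and labels outside the three listed above are scored 0 here by choice.

```python
from collections import defaultdict

# Encoding taken from the slide: POSITIVE=+1, NEGATIVE=-1, AMBIVALENT=0.
SENTIMENT_SCORE = {"POSITIVE": 1, "NEGATIVE": -1, "AMBIVALENT": 0}

def average_sentiment(tweets, sentiments):
    """Join tweets with sentiment rows on the tweet id and average the scores.

    tweets:     list of (tweet_id, text)
    sentiments: list of (tweet_id, label); a tweet may have several rows
    Returns a list of (tweet_id, text, average_score); tweets without any
    sentiment row get None.
    """
    scores = defaultdict(list)
    for tweet_id, label in sentiments:
        # Unknown labels are scored 0 (an assumption of this sketch).
        scores[tweet_id].append(SENTIMENT_SCORE.get(label, 0))
    result = []
    for tweet_id, text in tweets:
        vals = scores.get(tweet_id)
        avg = sum(vals) / float(len(vals)) if vals else None
        result.append((tweet_id, text, avg))
    return result

# Made-up sample rows.
tweets = [(1, "love my new car"), (2, "engine trouble again"), (3, "traffic report")]
sentiments = [(1, "POSITIVE"), (2, "NEGATIVE"), (2, "AMBIVALENT")]
scored = average_sentiment(tweets, sentiments)
```

In a notebook cell, prefixing the call with the IPython %time magic reports how long the statement took to run.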
Step 4: Transform the data
Create a clean working DataFrame that will be easier to use in our analytics
Step 5: Geographic distribution of tweets
Group by countries and aggregate the tweet counts
Convert the Spark SQL DataFrame to a Pandas data structure for visualization
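The group-and-count step can be illustrated without Spark. The sketch below uses collections.Counter on plain dictionaries; the "country" field name and the sample rows are assumptions. In Spark, the equivalent would typically be a groupBy on the country column followed by count(), with toPandas() used to bring the small aggregated result to the driver for plotting.

```python
from collections import Counter

def tweets_per_country(tweets, top_n=None):
    """Count tweets per country, most frequent first.

    tweets: iterable of dicts with an optional "country" key
            (hypothetical field name). Rows without a country are skipped.
    top_n:  keep only the top_n countries, or all of them when None.
    """
    counts = Counter(t["country"] for t in tweets if t.get("country"))
    return counts.most_common(top_n)

# Made-up sample rows.
sample = [
    {"country": "US"}, {"country": "DE"}, {"country": "US"},
    {"country": "JP"}, {"country": "DE"}, {"country": "US"},
    {"country": None},
]
by_country = tweets_per_country(sample)
```

Only the aggregated counts need to reach the plotting library, which is why the conversion to Pandas happens after the aggregation rather than before it.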
Bar chart visualization of Tweet distribution by Geo
Google map visualization of tweet distribution by Geos
Call the GeoChart helper that sets up the JavaScript code
Clean up memory before next analytics
Resources, including memory on the Spark driver machine, are not infinite. It is good practice to clean up when data is no longer needed.
Step 6: Analyzing tweets sentiment
Group by sentiments and aggregate the tweet counts
Convert the Spark SQL DataFrame to a Pandas data structure for visualization
Sentiment visualization
Use a Matplotlib pie chart
Step 7: Analyze Tweet timeline
Convert the Spark SQL DataFrame to a Pandas data structure for visualization
Group by posting time and sentiment tuples; aggregate the tweet counts
Group by posting time and sentiment tuples; aggregate the sum of the tweet counts
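To show how grouping by (posting time, sentiment) tuples leads to plottable series, here is a small plain-Python sketch. The function name, the use of ISO date strings as posting days and the sample rows are assumptions; the notebook performs the grouping on Spark DataFrames.

```python
from collections import Counter

def timeline_by_sentiment(rows):
    """Count tweets per (posting_day, sentiment) and pivot into series.

    rows: iterable of (posting_day, sentiment_label) tuples, where
          posting_day is an ISO date string such as "2015-09-18".
    Returns (days, series): days is the sorted list of days, and series
    maps each label to a list of counts aligned with days.
    """
    counts = Counter(rows)
    days = sorted({day for day, _ in counts})
    labels = sorted({label for _, label in counts})
    # Counter returns 0 for (day, label) pairs that never occurred.
    series = {
        label: [counts[(day, label)] for day in days]
        for label in labels
    }
    return days, series

# Made-up sample rows.
rows = [
    ("2015-09-18", "NEGATIVE"), ("2015-09-18", "NEGATIVE"),
    ("2015-09-18", "POSITIVE"), ("2015-09-19", "NEGATIVE"),
]
days, series = timeline_by_sentiment(rows)
```

Each list in series can then be drawn as one line of the timeline chart, with days as the shared x-axis.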
Prepare the timeline data structures
Time series visualization for all tweets
Deep dive into car manufacturers
Create a new DataFrame that enriches tweets with extra metadata:
- Boolean for each car manufacturer
- Boolean for electric car
- Boolean for self-driving car
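The enrichment above amounts to deriving boolean flags from the tweet text. A minimal sketch follows; the manufacturer keywords are taken from the Twitter query used earlier, but the grouping of keywords, the electric and self-driving keyword lists and the function names are assumptions, not the notebook's actual rules.

```python
import re

# Keywords from the Twitter query; the grouping is an assumption.
MANUFACTURER_KEYWORDS = {
    "VW": ["volkswagen", "vw"],
    "TOYOTA": ["toyota"],
    "DAIMLER": ["daimler", "mercedes"],
    "BMW": ["bmw"],
    "GM": ["gm", "general motors"],
    "TESLA": ["tesla"],
}
# Hypothetical keyword lists for the two extra flags.
ELECTRIC_KEYWORDS = ["electric car", "electric vehicle"]
SELF_DRIVING_KEYWORDS = ["self driving", "self-driving", "autonomous"]

def mentions(text, keywords):
    """True if any keyword occurs in text as a whole word or phrase."""
    return any(
        re.search(r"\b" + re.escape(kw) + r"\b", text, re.IGNORECASE)
        for kw in keywords
    )

def enrich(text):
    """Return a dict of boolean flags describing one tweet text."""
    flags = {name: mentions(text, kws)
             for name, kws in MANUFACTURER_KEYWORDS.items()}
    flags["ELECTRIC"] = mentions(text, ELECTRIC_KEYWORDS)
    flags["SELF_DRIVING"] = mentions(text, SELF_DRIVING_KEYWORDS)
    return flags

# Made-up sample tweet.
flags = enrich("Tesla unveils new self-driving electric car")
```

Matching on word boundaries avoids false positives such as "gm" inside a longer word. In Spark, a function like enrich could be applied per row to add the flag columns.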
Re-analyze the tweet timeline for each car manufacturer
Create a new DataFrame for each car manufacturer
Aggregate the tweet counts, ordered by posting time
Timeline series visualization
Notice the peak of tweets for VW between September and October 2015
Explain the peak of tweets for VW between September and October 2015
Filter for all VW tweets between September and October 2015
Pie chart visualization of the top 10 words used in these tweets
Create a count map of all non-stop words used in the tweets
Use the NLTK stopwords module to filter out stop words
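The word-count step can be sketched as follows. To stay self-contained, this example uses a tiny hand-written stop word set rather than NLTK's; with NLTK installed and its stopwords corpus downloaded, set(nltk.corpus.stopwords.words("english")) could be passed instead. The function name, the tokenization rule and the sample texts are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop word set; the notebook uses NLTK's list instead.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "for", "on"}

def top_words(texts, stop_words=STOP_WORDS, n=10):
    """Return the n most frequent non-stop words across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as words.
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts.most_common(n)

# Made-up sample texts.
sample_texts = [
    "The VW emissions scandal is growing",
    "VW admits to emissions cheating",
    "Emissions tests in the US",
]
top_two = top_words(sample_texts, n=2)
```

The resulting (word, count) pairs map directly onto the labels and slice sizes of a pie chart.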
Peak explained
We can clearly see from the list of most used words that the peak corresponds to the VW scandal around fraudulent emissions testing.
Follow the notebook for many more interesting analytics
Resource
• https://developer.ibm.com/clouddataservices/
• https://github.com/ibm-cds-labs/simple-data-pipe
• https://github.com/ibm-cds-labs/pipes-connector-flightstats
• http://spark.apache.org/docs/latest/mllib-guide.html
• https://console.ng.bluemix.net/data/analytics/
Thank You