Integration of Facebook Data to MongoDB and R-Studio
Ricardo Barros Lourenço
MSc. Candidate in Predictive Analytics
CAPES Foundation – Ministry of Education of Brazil - BSMP Scholarship # 88888.075449/2013-00
Summary
• Objectives
• DataSift API
• rmongodb: R-Studio integration with MongoDB
• References
• Questions & Answers
Big Data and NoSQL - Prof. Marco Chou and Prof. Gint Butenas
Objectives
• Ingest a public Facebook data stream into MongoDB using the DataSift infrastructure
• Use the flexibility of MongoDB to handle schema-less messages, such as those generated by social networks, without having to define data structures up front or worry about ingestion performance
• The data consists of messages mentioning “Obama” and “Obamacare”, which are currently popular topics
• Integrate R-Studio with MongoDB (via rmongodb) once the data has been loaded
DataSift
• DataSift is a startup based in San Francisco, with offices in New York and London
• It is a PaaS specialized in social media data, covering data sources, filtering, and destinations
• It has a firehose connection to Twitter and a connection to public Facebook data
• Their website: http://datasift.com
DataSift: Facebook API
• It is an API connected to a stream of public Facebook data (more info at: https://developers.facebook.com/docs/public_feed/)
• It generates JSON containing anonymized or public data
• It is useful for getting a broader, in-depth view of trending topics on Facebook
DataSift: MongoDB API
• It is an API that connects your stream source (in my case, a Facebook source) to a MongoDB instance
• It inserts every JSON message generated by the Facebook API as a document in a specified database, with an optional collection setting
• It preserves all data structures that come from the Facebook API source
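As a rough illustration of the behavior described above, the following Python sketch parses JSON messages and stores each one unchanged as a document. It is not the DataSift connector: the message fields are hypothetical, and a plain list stands in for a MongoDB collection so that the example runs without a database.

```python
import json

# Hypothetical messages; the field names are illustrative only and are
# not taken from the actual DataSift or Facebook payload format.
raw_messages = [
    '{"id": "1", "message": "Obamacare debate", "author": {"type": "user"}}',
    '{"id": "2", "message": "Obama speech", "tags": ["politics", "news"]}',
]


def ingest(raw, collection):
    """Parse one JSON message and store it as a document, structure intact."""
    document = json.loads(raw)
    collection.append(document)
    return document


collection = []  # in-memory stand-in for a MongoDB collection
for raw in raw_messages:
    ingest(raw, collection)

# The two documents have different shapes, and both are kept as they arrived.
print(len(collection))
print(collection[0]["author"]["type"])
print(collection[1]["tags"])
```

Note that the two documents do not share the same keys; a schema-less store accepts both without any prior table definition.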
DataSift: Task
• Once the data source and data destination are defined, you must start a task
• A starter account receives $10 in test credit, which is more than enough for testing, since 1,000 messages cost almost $0.10
• Latency is around 200 ms
• The system works asynchronously, delivering data as PUSH messages
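To put the pricing figures above in perspective, here is a small worked calculation in Python. It assumes the cost is exactly $0.10 per 1,000 messages (the slide says "almost"), so the result is a lower bound on what the test credit covers. Amounts are kept in integer cents to avoid floating-point rounding.

```python
CREDIT_CENTS = 1000         # $10 starter-account test credit
COST_CENTS_PER_1000 = 10    # assumed: $0.10 per 1,000 messages


def affordable_messages(credit_cents, cost_cents_per_1000):
    """Return how many messages the credit covers at the given rate."""
    return credit_cents * 1000 // cost_cents_per_1000


total = affordable_messages(CREDIT_CENTS, COST_CENTS_PER_1000)
print(total)  # 100000
```

Under this assumption, the test credit covers about 100,000 messages.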
MongoDB: Setup
• Used a local MongoDB instance (on a notebook computer)
• Needed to open firewall ports for mongod
• Needed to set up access control for the facebookObama database, defining a user and password for external connections
• Needed to insert a sample record, just to guarantee that the facebookObama database was created
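Once a user and password exist for the database, an external client typically reaches it through a MongoDB connection URI. The Python sketch below only builds such a URI. The user name and password are hypothetical, the host is assumed to be local, and 27017 is the default mongod port; credentials are percent-encoded because characters such as "@" or "/" would otherwise break the URI.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, port, database):
    """Build a mongodb:// URI with percent-encoded credentials."""
    return "mongodb://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )


# Hypothetical credentials, for illustration only.
uri = build_mongo_uri("datasift_user", "p@ss/word", "localhost", 27017, "facebookObama")
print(uri)
```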
MongoDB after ingestion
MongoDB: Message sample
rmongodb: Connecting to MongoDB and displaying a single message
rmongodb: Loading all messages
rmongodb: Loading all messages (error on filtering by a key)
rmongodb: Possibilities
• Once you can connect your MongoDB instance to R-Studio, a wide range of data analysis options becomes available
• The difficulty lies in the data structures: because MongoDB is schema-less, you need to know every structure a document might contain (including multiple levels of embedding)
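One way to get an overview of the structures present in a schema-less collection is to walk each document and count the key paths that occur. The sketch below does this in Python (not R) on hypothetical documents, purely for illustration; the field names are assumptions and do not reflect the actual Facebook message layout.

```python
def key_paths(value, prefix=""):
    """Yield a dotted key path for every leaf value in a nested document."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from key_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        # List elements share the path of the list itself.
        for child in value:
            yield from key_paths(child, prefix)
    else:
        yield prefix


def schema_summary(documents):
    """Count in how many documents each key path appears."""
    counts = {}
    for doc in documents:
        for path in set(key_paths(doc)):
            counts[path] = counts.get(path, 0) + 1
    return counts


# Hypothetical documents with different shapes and embedding depths.
docs = [
    {"facebook": {"message": "Obama speech", "author": {"type": "user"}}},
    {"facebook": {"message": "Obamacare", "likes": {"count": 3}}, "tags": ["a", "b"]},
]

summary = schema_summary(docs)
for path in sorted(summary):
    print(path, summary[path])
```

Paths that appear in only some documents (here, everything except `facebook.message`) are the ones an analysis has to treat as optional. Empty lists and empty sub-documents produce no path in this sketch.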
rmongodb: Possibilities
• ETL activities would consume most of the user's effort, even when a sample message “schema” is known
• R has a text mining package (called tm), but the ETL must be done very carefully to avoid introducing excessive bias when mining
• With text mining and proper visualization, the user should be able to recognize patterns in the data
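As a minimal sketch of the kind of text mining step mentioned above, the following Python code counts term frequencies across a few messages after lowercasing and removing stopwords. It is a stand-in for illustration, not the tm package or its API; the messages and the tiny stopword list are made up.

```python
import re
from collections import Counter

# Deliberately tiny stopword list, assumed for this example only.
STOPWORDS = {"the", "a", "is", "on", "and", "of", "to"}


def term_frequencies(messages):
    """Count lowercase word tokens across all messages, skipping stopwords."""
    counts = Counter()
    for text in messages:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


# Hypothetical message texts.
messages = [
    "Obama speaks on Obamacare",
    "The Obamacare debate",
    "Obama and the debate",
]

freq = term_frequencies(messages)
for term in sorted(freq):
    print(term, freq[term])
```

The choices made here (tokenization rule, stopword list) are exactly the kind of ETL decisions that can bias the results, which is why they deserve care before any mining is done.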
References
DataSift
• http://dev.datasift.com/docs/push/connectors/mongodb
• http://dev.datasift.com/docs/push/steps
MongoDB
• http://docs.mongodb.org/manual/reference/program/mongod/
• http://docs.mongodb.org/manual/tutorial/add-user-administrator/
• http://docs.mongodb.org/manual/tutorial/add-user-to-database/
R-Studio
• http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_introduction.html
• http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_cheat_sheet.pdf
• http://dugontario.files.wordpress.com/2013/12/qualitative-analysis-in-r.pdf
Questions & Answers