

Data Scientist Enablement

DSE 400 - Fast Track to Data Science

Week 6 Roadmap

Advanced Center of Excellence

Modern Renaissance Corporation

In Collaboration with SONO team and others

Content of this document is under Creative Commons Licence CC BY 4.0

Agenda

You can always find the latest version of this document at http://bit.ly/1buuXVO

Week 6 Overview
Discussions
Learning Path
Activities
Assignment
Submission
References
Citation

“What can’t be measured can’t be managed.” – Peter F. Drucker

Social Discourse: How to use Analytics discussion on SONO, LinkedIn and Facebook. Optional Q&A.

Learning plan: Read about Hadoop and Hive. Watch related videos.

Activities: Install Hortonworks Data Platform (HDP 2.0). Import datasets into HDP and query them.

Assignment 6: Perform queries on the NYSE Stocks dataset.

DSE 400 - Week 6 at a glance

Discussion: Read the Forbes.com article Five Steps To Master Big Data and Predictive Analytics in 2014 and discuss how you plan to implement these insights.

In line with our Open Innovation model, we are expanding our Social Discourse mode to LinkedIn and Facebook. Discussions on SONO will continue as planned on the DSE 400 Jump Pad. This gives participants more choice, and we hope it will increase social engagement.

We have also recently created the Language R - Community of Practice to help you learn and master R (the lingua franca of the DSE program), accelerate your competence in R, and apply it to your own and your organization’s needs. Reach out to Olivia Ramirez, Ellen Brock or Manju Rupani if you want to contribute to this community.

Social Engagement - Week 6

LinkedIn, Facebook, SONO

Activities

<Practice> Check out Visualization of the Day at Data Science Central. As the name suggests, it is going to be different every day. Explore alternative ways of representing the data. Could you have presented it in a better way?

<Required> Install HDP 2.0 following the instructions. The Hortonworks Data Platform sandbox is the environment we will use in the DSE program to learn and explore the Hadoop ecosystem and to build applications that process large datasets on inexpensive commodity hardware (such as your laptop or desktop). You must first install Oracle VirtualBox or an equivalent virtualization platform such as VMware Player. Make sure you download the correct HDP sandbox, as it is built differently for each virtualization platform. It is safer to go with the vendor’s recommendations.

<Recommended> Import the HDP 2.0 sandbox and start it. This will take several minutes, depending on your machine’s configuration. When it finishes, you will see a screen that looks like the image on the next slide.

Activities - contd ...

Log in to your virtual machine and then log in to your sandbox using the credentials provided on the screen. (Typically, user = root and password = hadoop)

When you see a prompt that looks like [root@sandbox ~]#, type echo DSE 400. This command displays (i.e. echoes) DSE 400 on your screen. At this stage you are ready to begin exploring the Hadoop ecosystem and the features of HDP 2.0.

<Optional> <Recommended> Open a new tab in your browser and enter http://127.0.0.1:8888 in the address bar. You will see a screen similar to the image on the next slide. Click Start Tutorials. Following the directions provided, try the Hello World example. Import NYSE data into HCatalog and try Hive commands using the Beeswax tool.


Assignment 6 - Submission Required

a) Hadoop/Hive Option

In case you have not already done so, download the NYSE-Stocks dataset from Amazon and import it into HCatalog (call it nyse_stocks) in your HDP 2.0 environment. You can also use equivalent Hadoop platforms such as Cloudera, MapR or IBM BigInsights.

Using Beeswax,
a) describe your nyse_stocks table
b) find how many rows in this table have WMT for stock_symbol
c) display the rows for WMT.

You may use the following command templates:

To describe a table use

describe <table name>

To get the row count use

select count(*) from <table name> where <column_name> = "<value>"

To display specific rows in a table use

select * from <table name> where <column_name> = "<value>"

You may reach out to Rachel <[email protected]> if you have any difficulties with the assignments.


Assignment 6 - Submission Required

b) R-sqldf Option

In your RStudio environment, install the sqldf package and load it. Using nyse_sample from the DSE Datasets folder, solve the following:
a) Find how many rows in this table have WMT for stock_symbol.
b) Select all the stocks where volume is less than 10000.
c) Find all stocks that have exceeded 75 (i.e. stock_price_high > 75). Display rows with date, symbol, stock_price_high and volume.

Once you install and load sqldf in your RStudio environment, you can execute a SQL query by passing it as a string argument to the sqldf function. For example, to find all rows with IBM as the stock symbol, you would execute the following query:

> sqldf("select * from nyse_sample where stock_symbol = 'IBM'")

You are required to submit your solution to either option a) or b), but you are encouraged to work on both options and submit both versions of your solutions.
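As a further illustration of the sqldf usage described above, here is a minimal sketch that runs two queries against a small, made-up data frame. The data frame toy_stocks, its rows, and the column names trade_date and stock_volume are hypothetical stand-ins and not the actual layout of nyse_sample, so check your own column names with names(nyse_sample) before adapting it. By default sqldf runs each query in a temporary SQLite database and returns the result as a data frame.

```r
# install.packages("sqldf")   # one-time install; uncomment if sqldf is missing
library(sqldf)

# Hypothetical stand-in for nyse_sample: all rows and values are invented,
# and the column names are assumptions made for this illustration only.
toy_stocks <- data.frame(
  trade_date       = c("2010-01-04", "2010-01-05", "2010-01-04", "2010-01-05"),
  stock_symbol     = c("IBM", "IBM", "GE", "GE"),
  stock_price_high = c(132.97, 131.85, 15.75, 15.90),
  stock_volume     = c(6155300, 6841400, 9500, 67892300),
  stringsAsFactors = FALSE
)

# Count the rows for one symbol; the result is a one-row data frame
# with a single column named n.
ibm_count <- sqldf("select count(*) as n
                    from toy_stocks
                    where stock_symbol = 'IBM'")

# Select only some columns, keeping rows that pass a numeric filter.
high_rows <- sqldf("select trade_date, stock_symbol, stock_price_high
                    from toy_stocks
                    where stock_price_high > 100")

print(ibm_count)
print(high_rows)
```

The same pattern (a count(*) query, then a column list with a where clause) should carry over to nyse_sample once the symbol, the threshold and the column names are replaced with the real ones.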


Submission in PDF format is required

Recommended Deadline: Saturday, 11:59 PM your local time. If you can’t submit your assignment on time, please complete it and turn it in ASAP. While there is no penalty for late submission, turning in assignments on time will help you focus on next week’s lessons.

Mail Assignment 6 to <[email protected]>. Submit a single PDF document showing your queries and results. Include screenshots as necessary. Name your document DSE 400 - Assignment 6 - Your Full Name; this naming convention is required. Do not send document links. Only PDF format is accepted; if you send any other format, you will have to resubmit in PDF. Please add DSE 400 > Assignment 6 in the subject line.

References, Resources and Additional Reading

Modern Data Architecture for Non-Stop Hadoop
Hadoop: The Definitive Guide. 3rd Edition. Tom White. O’Reilly Publications. 2012
Programming Hive. Capriolo et al. O’Reilly Publications. 2012
[MIT OCW] How to Process, Analyze and Visualize Data. Marcus & Wu. 2012
Language R - Community of Practice

Citation

The NYSE_Stocks dataset used in this week’s activities and assignment comes from Amazon.

Content that appears as is, on this document only, is under Creative Commons License CC BY 4.0. This license may not necessarily apply to other material referenced in this document.

Content from Hortonworks, Amazon, Tableau Software, O’Reilly Media and Forbes.com etc. is excluded from the above Creative Commons License.

For More Information

Week 6 discussions take place during this week on the DSE 400 forums on LinkedIn, Facebook and SONO. There is also an active Q&A session for everyone's benefit. Also check out the Language R - Community of Practice if you would like to advance your competence in R or contribute to this community.

<Mentoring On Demand> You may reach out to Rachel <[email protected]> if you have any difficulties with the assignments or are looking for more challenging activities. If you need a mentor or someone to help you accelerate along the DSE program, you may reach out to Vishal <[email protected]> or Ligia Buzan <[email protected]>.

We welcome questions, thoughts and suggestions. Post these on SONO in the right forum/discussion or write to us at <[email protected]>

You can always find the latest version of this document and other roadmaps at http://bitly.com/bundles/o_4ldaljhta4/1

Thank You