PYTHON FOR DATA SCIENCE Gabriel Moreira Machine Learning Engineer @gspmoreira PythonBrasil 2015

Python for Data Science - Python Brasil 11 (2015)



Why so much buzz?

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

WHAT IS DATA SCIENCE

http://drewconway.com

TYPES OF ANALYTICS

Investigative Analytics (Consumers: Humans)
Operational Analytics (Consumers: Machines)

http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
https://hbr.org/2014/08/the-question-to-ask-before-hiring-a-data-scientist/

DATA SCIENCE IS IOSEMN

Inquire
Obtain
Scrub
Explore
Model
iNterpret

[Hilary Mason, Data Scientist]

PYTHON IS IOSEMN

Inquire, Obtain, Scrub, Explore, Model, iNterpret

jsOutsider

ANALYTICS CASE: CORPORATE SOCIAL NETWORKS

Full data analysis demo available as an IPython Notebook: bit.ly/python4ds_nb

Investigative Analytics
Consumers: Humans

INQUIRE

1. Which communities are more popular?

2. Is user engagement increasing?

3. What is the distribution of user interactions?

4. Is there a relationship between publishing hour and number of interactions?

OBTAIN

• Download data from another location (e.g., a web page or server)
• Query data from a database (e.g., MySQL or Oracle)
• Extract data from an API (e.g., Twitter, Facebook)
• Extract data from another file (e.g., an HTML file or spreadsheet)
• Generate data yourself (e.g., reading sensors or taking surveys)

READING INTERACTIONS FROM CSV
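
The code shown on this slide is not preserved in the transcript. As a minimal sketch, assuming the interactions file has columns named `post_id`, `user_id`, and `type` (hypothetical names, with invented sample rows), the standard library's `csv` module can load it into a list of dicts:

```python
import csv
import io

# Hypothetical sample standing in for the interactions CSV file.
SAMPLE_CSV = """post_id,user_id,type
1,alice,like
1,bob,comment
2,alice,share
"""

def read_interactions(file_obj):
    """Return the CSV rows as a list of dicts keyed by the header names."""
    return list(csv.DictReader(file_obj))

interactions = read_interactions(io.StringIO(SAMPLE_CSV))
print(len(interactions), interactions[0]["type"])
```

With a real file, pass `open(path, newline="")` instead of the `StringIO` object. All values are read as strings, so numeric columns still need converting.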

READING POSTS FROM JSON LINES
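
JSON Lines stores one JSON object per line. The slide's code is not preserved here, so the following is only a sketch using the standard library, with hypothetical field names (`id`, `community`, `text`) and invented sample records:

```python
import io
import json

# Hypothetical sample standing in for the posts file (one JSON object per line).
SAMPLE_JSONL = """{"id": 1, "community": "python", "text": "Hello"}
{"id": 2, "community": "data", "text": "World"}

"""

def read_json_lines(file_obj):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in file_obj if line.strip()]

posts = read_json_lines(io.StringIO(SAMPLE_JSONL))
print(posts[0]["community"])
```

Skipping blank lines makes the reader tolerant of a trailing newline at the end of the file.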

SCRUB

Dealing with nulls
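
Two common ways to deal with nulls are filling them with a default and dropping incomplete records. The sketch below illustrates both on invented records; the helper names `fill_nulls` and `drop_nulls` and the field names are hypothetical, not taken from the talk:

```python
def fill_nulls(rows, defaults):
    """Return copies of rows with None or empty-string values replaced by defaults."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy, so the input rows are left untouched
        for field, default in defaults.items():
            if fixed.get(field) in (None, ""):
                fixed[field] = default
        cleaned.append(fixed)
    return cleaned

def drop_nulls(rows, required):
    """Keep only rows where every required field is present and non-empty."""
    return [r for r in rows if all(r.get(f) not in (None, "") for f in required)]

# Invented sample records with missing values.
posts = [
    {"id": 1, "community": "python", "text": "Hello"},
    {"id": 2, "community": None, "text": "World"},
    {"id": 3, "community": "data", "text": ""},
]
filled = fill_nulls(posts, {"community": "unknown"})
complete = drop_nulls(posts, ["community", "text"])
```

Which strategy fits depends on the question being asked: filling keeps the record count stable, while dropping avoids analysing placeholder values.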

1 - WHICH COMMUNITIES ARE MORE POPULAR?
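
One way to approach this question is to count interactions per community and rank the totals. A minimal sketch, assuming each interaction record carries a `community` field (a hypothetical name; the sample data is invented):

```python
from collections import Counter

# Hypothetical interaction records: one row per interaction with a post.
interactions = [
    {"community": "python", "type": "like"},
    {"community": "data", "type": "comment"},
    {"community": "python", "type": "share"},
    {"community": "devops", "type": "like"},
    {"community": "python", "type": "comment"},
    {"community": "data", "type": "like"},
]

def top_communities(rows, n=3):
    """Rank communities by their number of interactions, most popular first."""
    counts = Counter(row["community"] for row in rows)
    return counts.most_common(n)

ranking = top_communities(interactions, n=2)
print(ranking)
```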

2 - IS USER ENGAGEMENT INCREASING?
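
A simple way to look for a trend is to count interactions per calendar month and compare the totals over time. The sketch below assumes timestamps in `YYYY-MM-DD HH:MM:SS` format; both the format and the sample values are assumptions made for illustration:

```python
from collections import Counter
from datetime import datetime

# Invented interaction timestamps.
timestamps = [
    "2015-01-15 09:30:00",
    "2015-01-20 14:00:00",
    "2015-02-03 10:15:00",
    "2015-02-10 11:00:00",
    "2015-02-25 16:45:00",
    "2015-03-01 08:00:00",
]

def interactions_per_month(stamps):
    """Count interactions per 'YYYY-MM' bucket, in chronological order."""
    counts = Counter(
        datetime.strptime(s, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m") for s in stamps
    )
    return sorted(counts.items())

monthly = interactions_per_month(timestamps)
print(monthly)
```

Raw monthly counts ignore growth in the number of users, so dividing by active users per month may give a fairer picture of engagement.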

3 - WHAT IS THE DISTRIBUTION OF USER INTERACTIONS?
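
The distribution can be summarised by counting interactions per user and then describing those counts. A minimal standard-library sketch on invented user IDs (the function name `interaction_distribution` is hypothetical):

```python
from collections import Counter
from statistics import mean, median

# Invented data: the user who made each interaction.
user_ids = ["alice", "bob", "alice", "carol", "alice", "bob", "dave"]

def interaction_distribution(users):
    """Return summary statistics and a histogram of interactions per user."""
    per_user = Counter(users)                # interactions made by each user
    values = sorted(per_user.values())
    summary = {
        "users": len(values),
        "mean": mean(values),
        "median": median(values),
        "max": max(values),
    }
    histogram = Counter(values)              # how many users have k interactions
    return summary, histogram

summary, histogram = interaction_distribution(user_ids)
print(summary, dict(histogram))
```

Comparing the mean with the median is a quick check for skew: a mean well above the median suggests a few very active users dominate.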

4 - RELATIONSHIP BETWEEN PUBLISHING TIME AND NUMBER OF INTERACTIONS?
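
A first look at this relationship is to group posts by publishing hour and average their interaction counts. The sketch below assumes each post has a `published` timestamp and an `interactions` total; both field names and all values are invented for illustration:

```python
from collections import defaultdict
from datetime import datetime

# Invented posts with a publishing time and a total interaction count.
posts = [
    {"published": "2015-03-02 09:10:00", "interactions": 12},
    {"published": "2015-03-02 09:45:00", "interactions": 8},
    {"published": "2015-03-03 14:05:00", "interactions": 3},
    {"published": "2015-03-04 20:30:00", "interactions": 5},
    {"published": "2015-03-05 14:50:00", "interactions": 1},
]

def mean_interactions_by_hour(rows):
    """Average interactions per post for each publishing hour (0-23)."""
    totals = defaultdict(lambda: [0, 0])     # hour -> [sum of interactions, post count]
    for row in rows:
        hour = datetime.strptime(row["published"], "%Y-%m-%d %H:%M:%S").hour
        totals[hour][0] += row["interactions"]
        totals[hour][1] += 1
    return {hour: total / count for hour, (total, count) in sorted(totals.items())}

by_hour = mean_interactions_by_hour(posts)
print(by_hour)
```

An average per hour only hints at a relationship; hours with very few posts deserve caution before drawing conclusions.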

http://viverdeblog.com/melhoresahorarios-para-postar-nas-redes-sociais/

Operational Analytics
Consumers: Machines

Operational Analytics task example: Find Related Posts

1. Discover the most relevant words in the posts

2. Find related posts with similar content

1 - RELEVANT WORDS IN A POST

TF-IDF: the most “relevant” terms in a document are those that are frequent in that document and rare in other documents
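
To make the idea concrete, here is a minimal sketch of one common TF-IDF variant: term frequency (count divided by document length) multiplied by the log of the inverse document frequency. It uses whitespace tokenization and invented sample documents, and it is not necessarily the weighting used in the talk's notebook. Library implementations often differ; scikit-learn's `TfidfVectorizer`, for example, applies smoothing and normalization by default.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Invented sample documents.
docs = [
    "python makes data analysis simple",
    "data science uses python",
    "jmeter tests jmeter performance",
]
scores = tf_idf(docs)
top_term = max(scores[2], key=scores[2].get)  # highest-scoring term of the third document
print(top_term)
```

In the third document, "jmeter" scores highest because it is repeated there and absent elsewhere, while "python" scores lower than "simple" in the first document because it also appears in the second.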

BONUS - GLOBAL RELEVANT TERMS [ALL POSTS]

2 - SIMILAR POSTS

Cosine similarity: a measure of similarity between two vectors, defined as the cosine of the angle between them.
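
For two vectors a and b, the cosine of the angle between them is their dot product divided by the product of their lengths. The sketch below applies this to sparse term-weight vectors stored as dicts and picks the closest candidate; the helper names and sample vectors are hypothetical, and returning 0.0 for an empty vector is a convention chosen here:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

def most_similar(query, candidates):
    """Return (index, similarity) of the candidate closest to the query vector."""
    sims = [cosine_similarity(query, c) for c in candidates]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

# Invented term-weight vectors, e.g. TF-IDF scores per post.
query = {"jmeter": 1.0, "ruby": 1.0}
others = [
    {"python": 1.0, "pandas": 1.0},
    {"jmeter": 1.0, "performance": 1.0},
]
index, score = most_similar(query, others)
print(index, round(score, 2))
```

With non-negative weights such as TF-IDF scores, the result lies between 0 (no shared terms) and 1 (same direction), regardless of document length.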

Original post:
Did you ever wonder how great it would be if you could write your jmeter tests in ruby ? This projects aims to do so. If you use it on your project just let me now. On the Architecture Academy you can read how jmeter can be used to validate your Architecture. modulo 13 arch definition architecture validation | academia de arquitetura

Most similar post (cosine similarity = 0.30):
Foram disponibilizados no site Enterprise Architecture, na parte de Knowledge Base de performance, alguns how-tos relacionados a testes de performance. Entre eles, como definir os requisitos (throughput, cálculo de threads para o JMeter etc.), utilização do JMeter, geração de massa de dados e monitoramento. planning and executing performance testing | enterprise architecture - how to identify performance acceptance criteria | enterprise architecture - how to geracao de massa de dados | enterprise architecture - how to jmeter | enterprise architecture - how to monitoramento | enterprise architecture

[English translation of the Portuguese text: Several how-tos related to performance testing have been made available on the Enterprise Architecture site, in the performance section of the Knowledge Base. Among them: how to define requirements (throughput, calculating threads for JMeter, etc.), using JMeter, generating test data, and monitoring.]

SIMILAR PEOPLE!

INTERPRET

• Drawing conclusions from your data

• Evaluating what your results mean

• Communicating your results

DATA PRODUCTS

“If information has context and the context is interactive, insights are not predictable.”

[Agile Data Science, O’Reilly, 2014]

SENTIMENT ANALYSIS

bit.ly/eleicoes2014debatesbt

Analytical Dashboard

NETWORK ANALYSIS

https://linkedjazz.org/network/js

What about Python for Big Data?

PYTHON FOR BIG DATA

Streaming

HADOOPY

Pig UDFs in Jython
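
Assuming "Streaming" on this slide refers to Hadoop Streaming (an assumption, since the slide's logos are not preserved), the usual pattern is a mapper and a reducer that exchange tab-separated key/value lines, with the framework sorting records by key in between. The sketch below is a word count that simulates that flow locally on invented input; in a real job each function would read `sys.stdin` and print its records from its own script.

```python
from itertools import groupby

def mapper(lines):
    """Emit a tab-separated (word, 1) record for every word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_records):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (record.split("\t") for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> sort (shuffle) -> reduce on invented input.
lines = ["Python for data science", "data science with Python"]
result = list(reducer(sorted(mapper(lines))))
print(result)
```

Because the reducer relies on `groupby`, it only works when equal keys arrive next to each other, which is why the sort step between the two phases matters.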

DATA SCIENCE COURSES

• Introduction to Data Science (Univ. of Washington)

• Data Science specialization (Johns Hopkins)

• Intro to Hadoop and MapReduce (Cloudera)

• Machine Learning (Stanford)

• Statistical Learning (Stanford)

• Mining Massive Datasets (Stanford)

• Scalable Machine Learning (Berkeley)

http://workingsweng.com.br/2014/04/cursos-mooc-e-especializacoes-em-data-science/

BOOKS

Happy data geeking!

Thank you!

PYTHON FOR DATA SCIENCE

Gabriel Moreira
@gspmoreira
http://about.me/gspmoreira

Slides: http://bit.ly/python4ds_pybr11

PythonBrasil 2015