PYTHON FOR DATA SCIENCE
Gabriel Moreira
Machine Learning Engineer
@gspmoreira
PythonBrasil 2015
Why so much buzz?
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
TYPES OF ANALYTICS
Investigative Analytics (Consumers: Humans)
Operational Analytics (Consumers: Machines)
http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
https://hbr.org/2014/08/the-question-to-ask-before-hiring-a-data-scientist/
[Hilary Mason, Data Scientist]
Inquire
Obtain
Scrub
Explore
Model
iNterpret
DATA SCIENCE IS IOSEMN
PYTHON IS IOSEMN
jsOutsider
ANALYTICS CASE CORPORATE SOCIAL NETWORKS
Full Data Analysis demo available in IPython Notebook: bit.ly/python4ds_nb
Investigative Analytics (Consumers: Humans)
INQUIRE
1. Which communities are more popular?
2. Is the user engagement increasing?
3. What is the distribution of user interactions?
4. Is there a relationship between publishing hour and number of interactions?
OBTAIN
• Download data from another location (e.g., a web page or server)
• Query data from a database (e.g., MySQL or Oracle)
• Extract data from an API (e.g., Twitter, Facebook)
• Extract data from another file (e.g., an HTML file or spreadsheet)
• Generate data yourself (e.g., reading sensors or taking surveys)
READING INTERACTIONS FROM CSV
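A minimal sketch of this step using only the standard library. The sample data and the column names (user_id, post_id, interaction_type, timestamp) are assumptions for illustration; the demo notebook may use different names and tools.

```python
import csv
import io

# Hypothetical sample standing in for the interactions file.
# The column names are assumptions, not taken from the original dataset.
SAMPLE_CSV = """user_id,post_id,interaction_type,timestamp
u1,p1,like,2015-01-05T09:30:00
u2,p1,comment,2015-01-05T10:15:00
u1,p2,like,2015-02-11T14:00:00
"""


def read_interactions(file_obj):
    """Return one dict per CSV row, keyed by the header names."""
    return list(csv.DictReader(file_obj))


# With a real file, pass open(path, newline="") instead of StringIO.
interactions = read_interactions(io.StringIO(SAMPLE_CSV))
print(len(interactions), interactions[0]["interaction_type"])
```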
READING POSTS FROM JSON LINES
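In the JSON Lines format each line holds one complete JSON object. A minimal sketch, assuming hypothetical fields post_id, community and text:

```python
import json

# Hypothetical sample; the field names are assumptions for illustration.
SAMPLE_JSONL = """{"post_id": "p1", "community": "architecture", "text": "JMeter tests in Ruby"}
{"post_id": "p2", "community": "python", "text": "Data science with Python"}

{"post_id": "p3", "community": "python", "text": null}
"""


def read_json_lines(lines):
    """Parse one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]


# With a real file, iterate over the open file object instead.
posts = read_json_lines(SAMPLE_JSONL.splitlines())
print(len(posts), posts[0]["community"])
```

Note that a JSON null becomes None in Python, which is one source of the missing values handled in the Scrub step.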
SCRUB
Dealing with nulls
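One possible policy for nulls, sketched under assumptions: drop records that lack a required field and fill the remaining gaps with defaults. The helper name scrub, the field names and the default values are all hypothetical.

```python
def scrub(records, required, defaults):
    """Drop records missing a required field; fill other gaps with defaults."""
    cleaned = []
    for record in records:
        if any(record.get(field) in (None, "") for field in required):
            continue  # a row without its key fields cannot be analysed
        filled = dict(record)  # copy, so the raw data stays untouched
        for field, default in defaults.items():
            if filled.get(field) in (None, ""):
                filled[field] = default
        cleaned.append(filled)
    return cleaned


# Hypothetical raw posts with different kinds of missing values.
raw_posts = [
    {"post_id": "p1", "community": "python", "text": "Hello"},
    {"post_id": "p2", "community": None, "text": "No community"},
    {"post_id": None, "community": "python", "text": "No id"},
    {"post_id": "p3", "community": "", "text": None},
]
clean_posts = scrub(
    raw_posts,
    required=["post_id"],
    defaults={"community": "unknown", "text": ""},
)
print(clean_posts)
```

Whether to drop or fill depends on the question being asked; dropping is safer for identifiers, filling is usually acceptable for descriptive fields.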
1 - WHICH COMMUNITIES ARE MORE POPULAR?
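A minimal sketch of how this question could be answered, assuming popularity is measured as the number of interactions per community. The sample rows and the community field are hypothetical.

```python
from collections import Counter

# Hypothetical interactions, each already joined with its post's community.
interactions = [
    {"community": "python", "user_id": "u1"},
    {"community": "architecture", "user_id": "u2"},
    {"community": "python", "user_id": "u3"},
    {"community": "agile", "user_id": "u1"},
    {"community": "python", "user_id": "u2"},
    {"community": "architecture", "user_id": "u3"},
]


def community_popularity(rows):
    """Count interactions per community."""
    return Counter(row["community"] for row in rows)


popularity = community_popularity(interactions)
top = popularity.most_common(2)  # the two most popular communities
print(top)
```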
2 - IS USER ENGAGEMENT INCREASING?
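One way to approach this question is to count interactions per calendar month and look at the trend. A minimal sketch, assuming ISO 8601 timestamps; the sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime


def interactions_per_month(timestamps):
    """Return (YYYY-MM, count) pairs in chronological order."""
    months = Counter(
        datetime.fromisoformat(ts).strftime("%Y-%m") for ts in timestamps
    )
    return sorted(months.items())


# Hypothetical interaction timestamps.
timestamps = [
    "2015-01-05T09:30:00",
    "2015-01-20T10:00:00",
    "2015-02-11T14:00:00",
    "2015-02-12T08:00:00",
    "2015-02-28T17:45:00",
    "2015-03-01T12:00:00",
]
monthly = interactions_per_month(timestamps)
print(monthly)
```

The last month in a dataset is often incomplete, so it is worth excluding it before reading a drop as falling engagement.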
3 - WHAT IS THE DISTRIBUTION OF USER INTERACTIONS?
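A minimal sketch that summarises how interactions are spread across users. The user ids are hypothetical, and the summary statistics chosen here are only one possible view of the distribution.

```python
from collections import Counter
from statistics import mean, median


def interactions_per_user(user_ids):
    """Count how many interactions each user made."""
    return Counter(user_ids)


# Hypothetical data: one very active user and several occasional ones.
user_ids = ["u1"] * 10 + ["u2"] * 3 + ["u3"] * 2 + ["u4"] + ["u5"]
counts = sorted(interactions_per_user(user_ids).values())
summary = {
    "users": len(counts),
    "mean": mean(counts),
    "median": median(counts),
    "max": max(counts),
}
print(counts, summary)
```

A mean well above the median, as in this sample, suggests a skewed distribution where a few users account for most of the activity.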
4 - RELATIONSHIP BETWEEN PUBLISHING TIME AND NUMBER OF INTERACTIONS?
http://viverdeblog.com/melhoresahorarios-para-postar-nas-redes-sociais/
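A minimal sketch of one way to explore this relationship: group posts by publishing hour and compare the mean number of interactions. The (timestamp, count) pairs are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def mean_interactions_by_hour(posts):
    """posts: iterable of (published_at ISO string, interaction count)."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [interaction sum, post count]
    for published_at, n_interactions in posts:
        hour = datetime.fromisoformat(published_at).hour
        totals[hour][0] += n_interactions
        totals[hour][1] += 1
    return {hour: total / count for hour, (total, count) in sorted(totals.items())}


# Hypothetical posts with their interaction counts.
posts = [
    ("2015-01-05T09:30:00", 4),
    ("2015-01-06T09:05:00", 6),
    ("2015-01-06T14:00:00", 1),
    ("2015-01-07T14:45:00", 3),
    ("2015-01-08T20:10:00", 2),
]
by_hour = mean_interactions_by_hour(posts)
print(by_hour)
```

A difference between hours in such a table is only a hint; with few posts per hour it can easily be noise.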
Operational Analytics (Consumers: Machines)
1. Discover the most relevant words in the posts
2. Find related posts, with similar content
Operational Analytics Tasks example
Find Related Posts
1 - RELEVANT WORDS IN A POST
TF-IDF: the most “relevant” terms in a document are those that are frequent in that document and rare in other documents.
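The idea can be sketched in a few lines. This version uses tf = count / document length and idf = log(N / df), which is one common variant among several; libraries typically add smoothing and normalisation. The sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document.

    tf = term count / document length, idf = log(N / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical posts, already lower-cased and split into words.
docs = [
    "jmeter tests in ruby".split(),
    "performance tests with jmeter".split(),
    "python for data science".split(),
]
weights = tf_idf(docs)
print(sorted(weights[0], key=weights[0].get, reverse=True))
```

In the first document, "ruby" outranks "jmeter" because "jmeter" also appears in the second document, so it is less distinctive.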
BONUS - GLOBAL RELEVANT TERMS [ALL POSTS]
2 - SIMILAR POSTS
Cosine Similarity: a measure of similarity between two vectors, given by the cosine of the angle between them.
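A minimal sketch of the measure for sparse term vectors stored as {term: weight} dicts, computed as dot(a, b) / (|a| * |b|). The vectors below are hypothetical, and returning 0.0 for an empty vector is a convention chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# Hypothetical term weights for three posts.
post_a = {"jmeter": 1.0, "ruby": 2.0}
post_b = {"jmeter": 1.0, "performance": 2.0}
post_c = {"python": 3.0}

sim_ab = cosine_similarity(post_a, post_b)  # one shared term
sim_ac = cosine_similarity(post_a, post_c)  # no shared terms
print(sim_ab, sim_ac)
```

With non-negative weights such as TF-IDF, the result ranges from 0 (no terms in common) to 1 (same direction), independent of post length.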
Original Post
Did you ever wonder how great it would be if you could write your jmeter tests in ruby ? This projects aims to do so. If you use it on your project just let me now. On the Architecture Academy you can read how jmeter can be used to validate your Architecture.
modulo 13 arch definition architecture validation | academia de arquitetura
Most similar post (cosine similarity = 0.30)
Some how-tos related to performance testing have been made available on the Enterprise Architecture site, in the performance section of the Knowledge Base. Among them: how to define the requirements (throughput, calculating threads for JMeter, etc.), using JMeter, generating test data, and monitoring.
planning and executing performance testing | enterprise architecture - how to identify performance acceptance criteria | enterprise architecture - how to geracao de massa de dados | enterprise architecture - how to jmeter | enterprise architecture - how to monitoramento | enterprise architecture
SIMILAR PEOPLE!
INTERPRET
• Drawing conclusions from your data
• Evaluating what your results mean
• Communicating your results
DATA PRODUCTS
“If information has context and the context is interactive, insights are not predictable.”
[Agile Data Science, O’Reilly, 2014]
SENTIMENT ANALYSIS
bit.ly/eleicoes2014debatesbt
Analytical Dashboard
What about Python for Big Data?
PYTHON FOR BIG DATA
Streaming
HADOOPY
Pig UDFs in Jython
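As an illustration of the Streaming option: Hadoop Streaming runs any executable as mapper or reducer, passing records through stdin and stdout as tab-separated key/value lines. The sketch below is a hypothetical word count whose map, sort and reduce phases are simulated locally; in a real job each function would read sys.stdin and print its output.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word, as a Streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the framework delivers mapper output sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of the map -> sort -> reduce pipeline.
mapped = sorted(mapper(["Python for data", "data science with Python"]))
result = list(reducer(mapped))
print(result)
```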
DATA SCIENCE COURSES
• Introduction to Data Science (Univ. of Washington)
• Data Science specialization (Johns Hopkins)
• Intro to Hadoop and MapReduce (Cloudera)
• Machine Learning (Stanford)
• Statistical Learning (Stanford)
• Mining Massive Datasets (Stanford)
• Scalable Machine Learning (Berkeley)
http://workingsweng.com.br/2014/04/cursos-mooc-e-especializacoes-em-data-science/
BOOKS
Happy data geeking!
Gabriel Moreira
@gspmoreira
http://about.me/gspmoreira
Thank you!
PYTHON FOR DATA SCIENCE
Slides: http://bit.ly/python4ds_pybr11
PythonBrasil 2015