Sarah Guido (@sarah_guido), Reonomy, OSCON 2014: Analyzing Data With Python

Analyzing Data With Python



Given at OSCON 2014 http://www.oscon.com/oscon2014/public/schedule/detail/34255


Page 1: Analyzing Data With Python

Sarah Guido
@sarah_guido
Reonomy
OSCON 2014

ANALYZING DATA WITH PYTHON

Page 2: Analyzing Data With Python

Data scientist at Reonomy
University of Michigan graduate
NYC Python organizer
PyGotham organizer

ABOUT ME

Page 3: Analyzing Data With Python

Bird’s-eye overview: not a comprehensive explanation of these tools!

Take data from start to finish
    Preprocessing: Pandas
    Analysis: scikit-learn
    Analysis: NLTK
    Data pipeline: MRJob
    Visualization: matplotlib

What next?

ABOUT THIS TALK

Page 4: Analyzing Data With Python

So many tools
    Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability
Community support
“Easy” language to learn
Both a scripting and a production-ready language

WHY PYTHON?

Page 5: Analyzing Data With Python

How to find the best tool(s)?
The 90/10 rule
Simple is better than complex

FROM POINT A TO POINT…X?

Page 6: Analyzing Data With Python

Available resources
    Documentation, tutorials, books, videos
Ease of use (with a grain of salt)
Community support and continuous development
Widely used

WHY I CHOSE THESE TOOLS

Page 7: Analyzing Data With Python

The importance of data preprocessing
    AKA wrangling, munging, manipulating, and so on
Preprocessing is also getting to know your data
    Missing values? Categorical/continuous? Distribution?

PREPROCESSING

Page 8: Analyzing Data With Python

Data analysis and modeling
Similar to R and Excel
Easy-to-use data structures
    DataFrame
Data wrangling tools
    Merging, pivoting, etc.

PANDAS

Page 9: Analyzing Data With Python

Keep everything in Python
Community support/resources
Use for preprocessing
    File I/O, cleaning, manipulation, etc.
Combinable with other modules
    NumPy, SciPy, statsmodels, matplotlib

PANDAS

Page 10: Analyzing Data With Python

File I/O
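
A minimal sketch of reading and writing CSV data with Pandas. The column names and values are hypothetical, and an in-memory buffer stands in for a file on disk; with a real file, a path such as "data.csv" (also hypothetical) would be passed instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,age,city\nAnn,34,NYC\nBob,,Detroit\n"

# read_csv accepts a path or any file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# to_csv writes the DataFrame back out; index=False drops the row labels.
out = io.StringIO()
df.to_csv(out, index=False)
print(out.getvalue())
```

Empty fields, like Bob's age here, are read in as missing values (NaN).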

PANDAS

Page 11: Analyzing Data With Python

Finding missing values
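
One way to locate missing values in a DataFrame, sketched on a small hypothetical table. `isnull()` marks each missing cell, and summing the result gives a count per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [34, np.nan, 29],
    "city": ["NYC", "Detroit", None],
})

# Boolean mask: True wherever a cell is missing.
missing_mask = df.isnull()

# Number of missing values in each column.
missing_per_column = missing_mask.sum()
print(missing_per_column)

# Only the rows that contain at least one missing value.
rows_with_missing = df[missing_mask.any(axis=1)]
print(rows_with_missing)
```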

PANDAS

Page 12: Analyzing Data With Python

Removing missing values
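
A short sketch of two common ways to handle missing values, again on hypothetical data: dropping incomplete rows with `dropna()`, or filling the gaps with `fillna()`. Which one is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [34, np.nan, 29],
    "city": ["NYC", "Detroit", None],
})

# Drop every row that has any missing value.
complete_rows = df.dropna()

# Drop rows only when a specific column is missing.
age_known = df.dropna(subset=["age"])

# Fill instead of drop: column mean for age, a placeholder for city.
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})
print(filled)
```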

PANDAS

Page 13: Analyzing Data With Python

Pivoting
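
Pivoting reshapes a long table into a wide one. The sketch below uses a hypothetical sales table; `pivot` reshapes without aggregating, while `pivot_table` aggregates when several rows share the same index value.

```python
import pandas as pd

# Hypothetical long-format data: one row per (city, year) pair.
df = pd.DataFrame({
    "city": ["NYC", "NYC", "Detroit", "Detroit"],
    "year": [2013, 2014, 2013, 2014],
    "sales": [10, 12, 7, 9],
})

# One row per city, one column per year.
wide = df.pivot(index="city", columns="year", values="sales")
print(wide)

# pivot_table aggregates: total sales per city across all years.
totals = df.pivot_table(index="city", values="sales", aggfunc="sum")
print(totals)
```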

PANDAS

Page 14: Analyzing Data With Python

Other things
    Statistical methods
    Merge/join like SQL
    Time series
    Has some visualization functionality
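
As an illustration of the SQL-like merge/join, here is a sketch with two hypothetical tables joined on a shared `id` column. The `how` argument selects the join type, much like INNER JOIN and LEFT JOIN in SQL.

```python
import pandas as pd

# Two hypothetical tables sharing an "id" key.
people = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
cities = pd.DataFrame({"id": [2, 3, 4], "city": ["Detroit", "NYC", "Boston"]})

# Inner join: only ids present in both tables.
inner = people.merge(cities, on="id", how="inner")

# Left join: every row of `people`, with NaN where no city matches.
left_join = people.merge(cities, on="id", how="left")
print(left_join)
```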

PANDAS

Page 15: Analyzing Data With Python

Application of algorithms that learn from examples

Representation and generalizationUseful in everyday lifeEspecially useful in data analysis

MACHINE LEARNING

Page 16: Analyzing Data With Python

Supervised learningClassification and regression

Unsupervised learningClustering and dimensionality reduction

MACHINE LEARNING

Page 17: Analyzing Data With Python

Machine learning moduleOpen-sourceBuilt-in datasetsGood resources for learning

SCIKIT-LEARN

Page 18: Analyzing Data With Python

Scikit-learn: your data has to be continuous

Here’s what one observation/label looks like:

SCIKIT-LEARN

Page 19: Analyzing Data With Python

Transform categorical values/labels
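
One way to turn categorical labels into numbers is scikit-learn's `LabelEncoder`, sketched here on hypothetical labels. It assigns an integer to each distinct label (in sorted order) and can map the integers back afterwards.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical labels.
labels = ["spam", "ham", "ham", "spam"]

encoder = LabelEncoder()

# fit_transform learns the distinct classes and returns integer codes.
encoded = encoder.fit_transform(labels)
print(list(encoder.classes_), list(encoded))

# inverse_transform recovers the original labels from the codes.
decoded = encoder.inverse_transform(encoded)
```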

SCIKIT-LEARN

Page 20: Analyzing Data With Python

Classification

SCIKIT-LEARN

Page 21: Analyzing Data With Python

Classification
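
A minimal classification sketch, assuming the built-in iris dataset and a k-nearest-neighbors classifier; both are illustrative choices, not necessarily the data or model used in the talk. The pattern of `fit`, `predict`, and `score` is the same across scikit-learn classifiers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in dataset: numeric features (X) and integer class labels (y).
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Train on one part of the data.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Predict on the held-out part and measure accuracy there.
predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Holding out a test set gives an accuracy estimate on data the model has not seen during training.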

SCIKIT-LEARN

Page 22: Analyzing Data With Python

Other things
    Very comprehensive collection of machine learning algorithms
    Preprocessing tools
    Methods for testing the accuracy of your model

SCIKIT-LEARN

Page 23: Analyzing Data With Python

Concerned with interactions between computers and human languages
Derive meaning from text
Many NLP algorithms are based on machine learning

NATURAL LANGUAGE PROCESSING

Page 24: Analyzing Data With Python

Natural Language ToolKit
Access to over 50 corpora
    Corpus: body of text
NLP tools
    Stemming, tokenizing, etc.
Resources for learning

NLTK

Page 25: Analyzing Data With Python

Stopword removal

NLTK

Page 26: Analyzing Data With Python

Stopword removal
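
The idea behind stopword removal, sketched in plain Python. The stopword set below is a small hand-picked stand-in and the sentence is hypothetical; NLTK provides a fuller English list through `nltk.corpus.stopwords.words("english")` after a one-time `nltk.download("stopwords")`.

```python
# Small hand-picked stopword set, a stand-in for NLTK's English list.
STOPWORDS = {"a", "an", "the", "is", "was", "of", "to", "and", "in", "her"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stopwords]


# Hypothetical input sentence.
sentence = "Miss Taylor was the first to leave"
filtered = remove_stopwords(sentence)
print(filtered)
```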

NLTK

Page 27: Analyzing Data With Python

Stemming
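
A short stemming sketch using NLTK's `PorterStemmer`, which reduces words to a common stem by stripping suffixes. The word list is hypothetical, and the Porter algorithm is one of several stemmers NLTK offers.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical words to stem.
words = ["running", "missed", "wedding", "fathers"]

# stem() returns the stem, which is not always a dictionary word.
stems = [stemmer.stem(word) for word in words]
print(list(zip(words, stems)))
```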

NLTK

Page 28: Analyzing Data With Python

Other things
    Lemmatizing, tokenization, tagging, parse trees
    Classification
    Chunking
    Sentence structure

NLTK

Page 29: Analyzing Data With Python

Data that takes too long to process on your machine
    Not “big data” but larger data
Solution: MapReduce!
    Processing large datasets with a parallel, distributed algorithm
    Map step
    Reduce step

PROCESSING LARGE DATA

Page 30: Analyzing Data With Python

Map step
    Takes series of key/value pairs
    Ex. Word counts: break line into words, return word and count within line
Reduce step
    Once for each unique key: iterates through values associated with that key
    Ex. Word counts: returns word and sum of all counts

PROCESSING LARGE DATA

Page 31: Analyzing Data With Python

Write MapReduce jobs in Python
Test code locally without installing Hadoop
Lots of thorough documentation
A few things to know
    Keep everything in one class
    MRJob program in a separate file
    Output to new file if doing something like word counts

MRJOB

Page 32: Analyzing Data With Python

Stemmed file
    Line 1: (‘miss’, 2), (‘taylor’, 1)
    Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1)
    And so on…

MRJOB

Page 33: Analyzing Data With Python

Map
    Line 1: (‘miss’, 2), (‘taylor’, 1)
    Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1)
    Line 3: (‘first’, 1), (‘wed’, 1)
    Line 4: (‘father’, 1)
    Line 5: (‘father’, 1)

Reduce
    (‘miss’, 2)
    (‘taylor’, 2)
    (‘first’, 2)
    (‘wed’, 2)
    (‘father’, 2)
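
The map and reduce steps can be sketched in plain Python, without MRJob or Hadoop. The input lines are hypothetical, chosen so that the per-line counts match the example on this slide; `map_step` and `reduce_step` are illustrative names, not MRJob API.

```python
from collections import Counter, defaultdict

# Hypothetical stemmed lines that yield the per-line counts shown on the slide.
stemmed_lines = [
    "miss taylor miss",
    "taylor first wed",
    "first wed",
    "father",
    "father",
]


def map_step(line):
    """Emit (word, count within this line) pairs for a single line."""
    return list(Counter(line.split()).items())


def reduce_step(word, counts):
    """Sum all the counts that were emitted for one unique word."""
    return word, sum(counts)


# Map: each line is processed independently of the others.
mapped = [pair for line in stemmed_lines for pair in map_step(line)]

# Group the emitted values by key before reducing.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: called once per unique key.
reduced = dict(reduce_step(word, counts) for word, counts in grouped.items())
print(reduced)
```

Because each line is mapped independently, the map work can be split across machines; only the grouping step needs to bring values for the same key together.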

MRJOB

Page 34: Analyzing Data With Python

Let’s count all words in the Gutenberg file

Map step

MRJOB

Page 35: Analyzing Data With Python

Reduce (and run) step

MRJOB

Page 36: Analyzing Data With Python

Results
    Mapped counts reduced
    Key/val pairs

MRJOB

Page 37: Analyzing Data With Python

Other things
    Run on Hadoop clusters
    Can write highly complex jobs
    Works with Elasticsearch

MRJOB

Page 38: Analyzing Data With Python

The “final step”
Conveying your results in a meaningful way
Literally see what’s going on

DATA VISUALIZATION

Page 39: Analyzing Data With Python

2D visualization library
Very VERY widely used
Wide variety of plots
Easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc.)

MATPLOTLIB

Page 40: Analyzing Data With Python

Remember this?

MATPLOTLIB

Page 41: Analyzing Data With Python

Bar chart of distribution
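
A minimal sketch of a bar chart showing how often each category occurs. The category values are hypothetical, and the non-interactive "Agg" backend is used so the script runs without a display; the figure is written to an in-memory buffer rather than a named file.

```python
import io
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Hypothetical categorical column.
values = ["NYC", "Detroit", "NYC", "Boston", "NYC", "Detroit"]

# Count occurrences of each category.
counts = Counter(values)
categories = sorted(counts)
heights = [counts[category] for category in categories]

# One bar per category, labeled on the x-axis.
positions = list(range(len(categories)))
fig, ax = plt.subplots()
bars = ax.bar(positions, heights)
ax.set_xticks(positions)
ax.set_xticklabels(categories)
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.set_title("Distribution of categories")

# Save as PNG; a file path could be passed instead of a buffer.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```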

MATPLOTLIB

Page 42: Analyzing Data With Python

Let’s graph our word count frequencies
(Hint: It’s a power law distribution!)

MATPLOTLIB

Page 43: Analyzing Data With Python

High frequency of low numbers, low frequency of high numbers
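
A sketch of how such a plot could be produced, assuming word counts are already available as a dictionary. The counts below are hypothetical, not the results from the Gutenberg file. The x-axis is a word count and the y-axis is how many words have that count; log-log axes are a common choice for power-law-shaped data.

```python
import io
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Hypothetical word counts (word -> number of occurrences).
word_counts = {
    "the": 50, "of": 20, "and": 10, "emma": 5,
    "taylor": 2, "wed": 2, "miss": 2,
    "father": 1, "first": 1, "hartfield": 1, "governess": 1, "sister": 1,
}

# How many words occur exactly once, twice, and so on.
count_frequency = Counter(word_counts.values())
xs = sorted(count_frequency)
ys = [count_frequency[x] for x in xs]

fig, ax = plt.subplots()
ax.loglog(xs, ys, marker="o", linestyle="none")
ax.set_xlabel("Word count")
ax.set_ylabel("Number of words with that count")
ax.set_title("Word count frequencies")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```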

MATPLOTLIB

Page 44: Analyzing Data With Python

Other things
    Many different kinds of graphs
    Customizable
    Time series

MATPLOTLIB

Page 45: Analyzing Data With Python

Phew!
Which tool to choose depends on your needs
Workflow:
    Preprocess
    Analyze
    Visualize

WHAT NEXT?

Page 46: Analyzing Data With Python

Pandas: http://pandas.pydata.org/

scikit-learn: http://scikit-learn.org/

NLTK: http://www.nltk.org/

MRJob: http://mrjob.readthedocs.org/

matplotlib: http://matplotlib.org/

RESOURCES

Page 47: Analyzing Data With Python

Twitter: @sarah_guido

LinkedIn: https://www.linkedin.com/in/sarahguido

NYC Python: http://www.meetup.com/nycpython/

CONTACT ME!

Page 48: Analyzing Data With Python

AND FINALLY…

Page 49: Analyzing Data With Python

Questions?

THE END!