52
Python for applied practitioners F. Curci, F. Masera, T. Rodríguez Martínez February 24th, 2017 The two Fedes and Tomas Python Applied reading group 1 / 40

Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

  • Upload
    buihanh

  • View
    233

  • Download
    5

Embed Size (px)

Citation preview

Page 1: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Python for applied practitioners

F Curci F Masera T Rodriacuteguez Martiacutenez

February 24th 2017

The two Fedes and Tomas Python Applied reading group 1 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 2 40

What Python is

Python is a widely used programming languare forgeneral-purpose programming It has a syntax which allowsprogrammers to express concepts in fewer lines of code thanpossible in languages such as C++ or JavaFreeUsers contribution librariesMany platforms use Python as communicating way (becominguniversal language)Very powerful in dealing with strings

The two Fedes and Tomas Python Applied reading group 3 40

Good way to learn Python

Coursera course Python for everybody specialization Universityof MichiganBook Python for informatics

The two Fedes and Tomas Python Applied reading group 4 40

Use

How to obtain it and languageDownloadLanguages currently used Python 2 and Python 3Anaconda

Two methodsNotepad+command prompt

Need to save file as pyType filepy in command prompt

Compiler Pycharm Atom Spyder

The two Fedes and Tomas Python Applied reading group 5 40

Web browsing

Python allows you to write a web browserLibrary to open resources by URL urllibExample we want to Python to go to web to obtain FIPS code ofTexas counties

Counties

The two Fedes and Tomas Python Applied reading group 6 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Ways to extract information from the web

API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands

ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy

The two Fedes and Tomas Python Applied reading group 8 40

API

Talk to web services URLs that are designed explicitly to hand databack cleaned for your application

Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications

XMLJSON

LibrariesxmletreeElementTreejson

Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits

2500 requests per day50 requests per second

The two Fedes and Tomas Python Applied reading group 9 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 2: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 2 40

What Python is

Python is a widely used programming languare forgeneral-purpose programming It has a syntax which allowsprogrammers to express concepts in fewer lines of code thanpossible in languages such as C++ or JavaFreeUsers contribution librariesMany platforms use Python as communicating way (becominguniversal language)Very powerful in dealing with strings

The two Fedes and Tomas Python Applied reading group 3 40

Good way to learn Python

Coursera course Python for everybody specialization Universityof MichiganBook Python for informatics

The two Fedes and Tomas Python Applied reading group 4 40

Use

How to obtain it and languageDownloadLanguages currently used Python 2 and Python 3Anaconda

Two methodsNotepad+command prompt

Need to save file as pyType filepy in command prompt

Compiler Pycharm Atom Spyder

The two Fedes and Tomas Python Applied reading group 5 40

Web browsing

Python allows you to write a web browserLibrary to open resources by URL urllibExample we want to Python to go to web to obtain FIPS code ofTexas counties

Counties

The two Fedes and Tomas Python Applied reading group 6 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Ways to extract information from the web

API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands

ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy

The two Fedes and Tomas Python Applied reading group 8 40

API

Talk to web services URLs that are designed explicitly to hand databack cleaned for your application

Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications

XMLJSON

LibrariesxmletreeElementTreejson

Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits

2500 requests per day50 requests per second

The two Fedes and Tomas Python Applied reading group 9 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 3: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

What Python is

Python is a widely used programming languare forgeneral-purpose programming It has a syntax which allowsprogrammers to express concepts in fewer lines of code thanpossible in languages such as C++ or JavaFreeUsers contribution librariesMany platforms use Python as communicating way (becominguniversal language)Very powerful in dealing with strings

The two Fedes and Tomas Python Applied reading group 3 40

Good way to learn Python

Coursera course Python for everybody specialization Universityof MichiganBook Python for informatics

The two Fedes and Tomas Python Applied reading group 4 40

Use

How to obtain it and languageDownloadLanguages currently used Python 2 and Python 3Anaconda

Two methodsNotepad+command prompt

Need to save file as pyType filepy in command prompt

Compiler Pycharm Atom Spyder

The two Fedes and Tomas Python Applied reading group 5 40

Web browsing

Python allows you to write a web browserLibrary to open resources by URL urllibExample we want to Python to go to web to obtain FIPS code ofTexas counties

Counties

The two Fedes and Tomas Python Applied reading group 6 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Ways to extract information from the web

API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands

ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy

The two Fedes and Tomas Python Applied reading group 8 40

API

Talk to web services URLs that are designed explicitly to hand databack cleaned for your application

Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications

XMLJSON

LibrariesxmletreeElementTreejson

Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits

2500 requests per day50 requests per second

The two Fedes and Tomas Python Applied reading group 9 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 4: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Good way to learn Python

Coursera course Python for everybody specialization Universityof MichiganBook Python for informatics

The two Fedes and Tomas Python Applied reading group 4 40

Use

How to obtain it and languageDownloadLanguages currently used Python 2 and Python 3Anaconda

Two methodsNotepad+command prompt

Need to save file as pyType filepy in command prompt

Compiler Pycharm Atom Spyder

The two Fedes and Tomas Python Applied reading group 5 40

Web browsing

Python allows you to write a web browserLibrary to open resources by URL urllibExample we want to Python to go to web to obtain FIPS code ofTexas counties

Counties

The two Fedes and Tomas Python Applied reading group 6 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Ways to extract information from the web

API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands

ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy

The two Fedes and Tomas Python Applied reading group 8 40

API

Talk to web services URLs that are designed explicitly to hand databack cleaned for your application

Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications

XMLJSON

LibrariesxmletreeElementTreejson

Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits

2500 requests per day50 requests per second

The two Fedes and Tomas Python Applied reading group 9 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 5: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Use

How to obtain it and languageDownloadLanguages currently used Python 2 and Python 3Anaconda

Two methodsNotepad+command prompt

Need to save file as pyType filepy in command prompt

Compiler Pycharm Atom Spyder

The two Fedes and Tomas Python Applied reading group 5 40

Web browsing

Python allows you to write a web browserLibrary to open resources by URL urllibExample we want to Python to go to web to obtain FIPS code ofTexas counties

Counties

The two Fedes and Tomas Python Applied reading group 6 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Ways to extract information from the web

API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands

ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy

The two Fedes and Tomas Python Applied reading group 8 40

API

Talk to web services URLs that are designed explicitly to hand databack cleaned for your application

Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications

XMLJSON

LibrariesxmletreeElementTreejson

Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits

2500 requests per day50 requests per second

The two Fedes and Tomas Python Applied reading group 9 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 6: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Web browsing

Python allows you to write a web browserLibrary to open resources by URL urllibExample we want to Python to go to web to obtain FIPS code ofTexas counties

Counties

The two Fedes and Tomas Python Applied reading group 6 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Ways to extract information from the web

API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands

ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy

The two Fedes and Tomas Python Applied reading group 8 40

API

Talk to web services URLs that are designed explicitly to hand databack cleaned for your application

Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications

XMLJSON

LibrariesxmletreeElementTreejson

Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits

2500 requests per day50 requests per second

The two Fedes and Tomas Python Applied reading group 9 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 7: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Ways to extract information from the web

API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands

ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy

The two Fedes and Tomas Python Applied reading group 8 40

API

Talk to web services URLs that are designed explicitly to hand databack cleaned for your application

Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications

XMLJSON

LibrariesxmletreeElementTreejson

Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits

2500 requests per day50 requests per second

The two Fedes and Tomas Python Applied reading group 9 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 8: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Ways to extract information from the web

API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands

ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy

The two Fedes and Tomas Python Applied reading group 8 40

API

Talk to web services URLs that are designed explicitly to hand databack cleaned for your application

Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications

XMLJSON

LibrariesxmletreeElementTreejson

Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits

2500 requests per day50 requests per second

The two Fedes and Tomas Python Applied reading group 9 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 9: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Web browsing

The two Fedes and Tomas Python Applied reading group 7 40

Ways to extract information from the web

API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands

ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy

The two Fedes and Tomas Python Applied reading group 8 40

API

Talk to web services URLs that are designed explicitly to hand databack cleaned for your application

Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications

XMLJSON

LibrariesxmletreeElementTreejson

Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits

2500 requests per day50 requests per second

The two Fedes and Tomas Python Applied reading group 9 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 10: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Ways to extract information from the web

API (Application programming interface)Structured way to access data from websitesUsually limited number of queriesNot all website provide itTwitter Google maps etcSome Stata commands

ScrapingTransformation of unstructured data (HTML format) on the webinto structured data (database)Stata command copy

The two Fedes and Tomas Python Applied reading group 8 40

API

Talk to web services URLs that are designed explicitly to hand databack cleaned for your application

Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications

XMLJSON

LibrariesxmletreeElementTreejson

Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits

2500 requests per day50 requests per second

The two Fedes and Tomas Python Applied reading group 9 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 11: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

API

Talk to web services URLs that are designed explicitly to hand databack cleaned for your application

Need to agree on wire protocol for python and javaTwo wire formats to exchange data between applications

XMLJSON

LibrariesxmletreeElementTreejson

Google map APIPossible to retrieve address latitude and longitude of placesDocumentationLimits

2500 requests per day50 requests per second

The two Fedes and Tomas Python Applied reading group 9 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 12: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 13: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Google map API

The two Fedes and Tomas Python Applied reading group 10 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 14: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt

lttitlegtThis is the titlelttitlegtltheadgtltbodygt

lta href =wwwwikipediaes gt Wikipedia linkltagtltbodygt

lthtmlgt

The two Fedes and Tomas Python Applied reading group 11 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 15: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 16: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Web scraping

HTMLStandard markup language for creating web pages and webapplications Messy

lthtmlgtltheadgt︸ ︷︷ ︸Start tag

lttitlegtThis is the titlelttitlegt︸ ︷︷ ︸Element

ltheadgt︸ ︷︷ ︸End tag

ltbodygtlta href︸︷︷︸

Attribute name

=wwwwikipediaes︸ ︷︷ ︸Attribute value

gtWikipedia link︸ ︷︷ ︸Content

ltagt

ltbodygtlthtmlgt

Possible to retrieve all html with pythonLibraries BeautifulSoup HTMLParser

The two Fedes and Tomas Python Applied reading group 12 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 17: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 18: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 19: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 20: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Looking for house prices

The two Fedes and Tomas Python Applied reading group 13 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 21: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 22: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Looking for house prices

The two Fedes and Tomas Python Applied reading group 14 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 23: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 24: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 25: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Looking for house prices

The two Fedes and Tomas Python Applied reading group 15 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 26: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 27: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Looking for house prices

The two Fedes and Tomas Python Applied reading group 16 40

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 28: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

PDF Scraping

Easier when PDF is not an imagePossible to convert entire PDF to txt filePossible to dump the internal contents of a PDF file inpseudo-XML formatLibrary PDFMiner

The two Fedes and Tomas Python Applied reading group 17 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 29: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 30: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Scraping the little prince

The two Fedes and Tomas Python Applied reading group 18 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 31: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 19 40

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 32: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Introduction

Python can be used both in QGIS and ArcGIS as a language tointeract with these programs

These will be useful toAutomated processes that you can already do by clicking in yourpreferred GIS program (Very important for using data from STATAor creating data for STATA use)STATA can automate many of this operations already (not all ofthem) but GIS programs much more efficientPerform some new function only available if using Python

The two Fedes and Tomas Python Applied reading group 20 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 33: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 21 40

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 34: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Where to Find Python (QGIS)

Plugins rArr Python Console

The two Fedes and Tomas Python Applied reading group 22 40

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 35: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Learn to Write GIS Python Code

1 Follow the course found in qgistutorialscom2 Search in stackexchangecom3 Processing rArr ldquoGraphical modelerrdquo (a drag and drop tool that

generates a Python Code)4 Write your own py file

The two Fedes and Tomas Python Applied reading group 23 40

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 36: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Example 1 - Automation of Presentation of Results

You have shapefiles of all the crimes in counties in the US 1960-2014For exploring the data you want to create a map for each year Changethe color of places with a high violent crime rate

You can write your own py code1 For loop for each file (for yy in year_available)2 open shp file - addMapLayer( )3 select high crime places - getFeatures( )4 Change color - setSelectionColor( )5 save map

If you have multiple maps you can even created videos of maps

The two Fedes and Tomas Python Applied reading group 24 40

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 37: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Example 2 - Automation Getting Inputs from STATA

Example Data on number of police officers for all police departmentsand want to perform some operation with QGIS using this dataFor QGIS to read data from STATA needs a csv file with an identifier(ID of police department) and the variable of interest (number of policeofficers)

Python can easily automate all this in case you make some change toany variable you do not need to manually redo everything

1 Open shp file and csv files - addMapLayer( )2 Join the 2 files using the common identifier - addJoin3 Perform whatever operation you wish

The two Fedes and Tomas Python Applied reading group 25 40

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 38: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Example 3 - Summary Statistics by Distance

Calculate number of police officers in a radius of 20km around a policedepartmentThis could be done in STATA by calculating the distance of all policedept from all police dept and summing up the number of policeofficers if the distance is less than 20km This is computationallyunfeasible (15000 police dept)

QGIS is very efficient when dealing with GIS data and distances Wecan write a code on Python

1 Open the file we created in example 2 - addMapLayer( )2 Create a buffer of 20km around all police departments - buffer()3 Select only police departments that intersect with the buffer -

intersects()4 Sum up police officers if it intersects and save in a new variable

The two Fedes and Tomas Python Applied reading group 26 40

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 39: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Example 3 - Summary Statistics by Distance

Why is it useful1 Much faster than STATA (sometimes is just unfeasible on STATA)2 Once you have the Python code is easy to change buffers or

variables3 You can then save the data and use this newly created variable

back in STATA - writeAsVectorFormat()

The two Fedes and Tomas Python Applied reading group 27 40

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 40: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Example 4 - Neighbors of Polygons

All the previous examples the GIS data was points (policedepartment) Many times data is in polygons (Counties States )

Letrsquos say that we have population level data at the county level Wewant to calculate the total population of the neighbors of all counties(where neighbors means to have one border in common) Python cando that

Is a long code but the main tricky thing is how to determine neighborsYou do with the following function - intersects(geomboundingBox())

Again this new variable created in QGIS can be then exported inSTATA

The two Fedes and Tomas Python Applied reading group 28 40

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 41: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

A Taste of More Complex Things you Can Do inPython

Excellent Guide ldquoNight Lights and ArcGIS A Brief Guiderdquo by MattLowe

Letrsquos say you want to create a 20kmX20Km pixel-year panel ofluminosity

You can download a raster NASA NOOA satellite data of luminosity atnight of the earth (data for many years and many types ofmeasurements)You can do it in Python with CreateFishnet_management()

Given that the data is already on Python you can add a lot ofinformation to this pixels dataset (Country Distance to the coast )and then save it in a readable file for STATA

The two Fedes and Tomas Python Applied reading group 29 40

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 42: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Sketch presentation

How Python can be useful to applied practitionersFede C

What Python is and scrapingFede M

GIS (geography and maps) applicationTomaacutes

Machine learning tools

The two Fedes and Tomas Python Applied reading group 30 40

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 43: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Machine Learning with Scikit-Learn

Idea How to use a couple of methods of ImbensAthey NBERsummer lectures with the library scikit-learn

Machine Learningis more about fitprediction

cares a lot about scalability of the algorithm (big data manycovariates)

is neither about causality or inference (Susan Athey disagrees)

nor the formal asymptotic properties

Focus on Out of sample performance cross-validationregularization

Why Python and Scikit-learn

The two Fedes and Tomas Python Applied reading group 31 40

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 44: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

How to learn

Machine LearningCoursera Stanford course with Andrew NG (matlab) WashingtonUniversity specialization (python)edX Columbia course from the MicroMasters ProgramBook The Elements of Statistical Learning Data mining Inferenceand Prediction Trevor Hastie Robert Tibshirani and JeromeFriedman

Basic packages to handle data on PythonPanda (Most popular library for data manipulation) ResourcesNumpy (Efficient python matrix handler) Tutorial lecture notes

Scikit-LearnLinks Official Tutorial video lectures other Tutorial Book Learning scikit-learn Machine Learning in Python

The two Fedes and Tomas Python Applied reading group 32 40

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 45: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Ex1 Classification + Regularization

Built in scikit learn Lasso Ridge regression Least angleregression Lars Elastic Nets

Example Reviews from an Amazon product Which word hasthe highest predictive power of a positive review

Method Regularized logistic regression

Scikit implementation First get your matrix of words using theCountVectorizer then use LogisticRegressionCV with a Lasso likepenalty lasso

LogisticRegressionCV has a built in cross validation to find theldquooptimalrdquo penalty parameter α

The two Fedes and Tomas Python Applied reading group 33 40

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 46: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Ex1 Classification + Regularization

166752 data points with 141224 covariates ldquoOnlyrdquo around 3300are nonzero

Positive Words Negative Words1 oustanding 212 dissapointed -2582 saves 203 worst -2473 lifesaver 190 worthless -2404 skeptical 180 theory -2385 adores 179 unusable -2336 south 169 disappointing -2267 con 167 ripoff -2258 penny 165 poorly -2229 awesome 163 ineffective -22010 pleasantly 163 useless -202

The two Fedes and Tomas Python Applied reading group 34 40

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 47: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Ex2 Regression Trees

Built in scikit learn Random Forests Bagging Boosting(Adaboost)

Example When to give a loan Data on goodbad loans +covariates of the individual (employment debt to income ratio)

Method Random Forests Fit different regression trees in randomsubsamples with replacement and apply a weighted averagebetween them

Scikit implementation RandomForestClassifier

forest =RandomForestClassifier(max_depth=4)fit(XY)

forestpredict_proba(X[1])

print forestfeature_importances_

The two Fedes and Tomas Python Applied reading group 35 40

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 48: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Ex2 Regression Trees

The two Fedes and Tomas Python Applied reading group 36 40

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 49: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Ex3 Clustering

Built in scikit learn K-means Gaussian Mixtures

Example Clustering Samuel L Jackson movies by quality andrevenue box office

Method K-Means

Scikit implementation KMeans

kmeans = KMeans(n_clusters=3random_state=10)fit(data)

centroids = kmeanscluster_centers_

labels = kmeanslabels_

The two Fedes and Tomas Python Applied reading group 37 40

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 50: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Ex3 Clustering

The two Fedes and Tomas Python Applied reading group 38 40

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 51: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Code Appendix

Classifier + Regularization

Word Matrix

vectorizer = CountVectorizer(token_pattern=rbw+b)

words_matrix = vectorizerfit_transform(products[review_clean])

name = vectorizerget_feature_names()

Model

sentiment_model = linear_modelLogisticRegressionCV(penalty=l1

solver=liblinear)

sentiment_modelfit(words_matrix dependent_var)

coefficients = sentiment_modelcoef_

The two Fedes and Tomas Python Applied reading group 39 40

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix
Page 52: Python for applied practitioners - Jan Stuhler | Universidad … · Python for applied practitioners F. Curci, ... Machine learning tools ... Lasso, Ridge regression, Least angle

Appendix 1

Lasso

minβ

Nsumi

(Yi minusXiβ)2 +α||β||1 (1)

The two Fedes and Tomas Python Applied reading group 40 40

  • Appendix