Session 7 Wharton Summer Tech Camp Scrapy Big Data in Empirical Business Research


Page 1: Session 7 Wharton Summer Tech Camp

Session 7
Wharton Summer Tech Camp

Scrapy
Big Data in Empirical Business Research

Page 2: Session 7 Wharton Summer Tech Camp

What’s Scrapy? And Why?

• Application framework for crawling websites and scraping & extracting data using APIs
  – Basically a set of pre-defined classes and instructions for efficiently writing scraping code
• It’s in Python
• Simple once you know the framework
• Fast, extensible, many built-in functions, good-sized online support community
• Some companies use this commercially. It’s that powerful.

Page 3: Session 7 Wharton Summer Tech Camp

Scrapy Architecture

Page 4: Session 7 Wharton Summer Tech Camp

Scrapy Components

• Engine
  – Main engine that passes items and requests around the framework
• Scheduler
  – Gets requests from the engine and enqueues them, handing them back when the engine asks for the next request
• Downloader
  – Downloads the raw HTTP responses and feeds them to the spiders
• Spiders
  – Receive the downloaded HTTP responses and extract information
• Item Pipeline
  – Collects extracted items from the spiders and post-processes them
  – Built-in modules & typical uses
    • Clean HTML
    • Validate data & check for duplicates
    • Store the data in a database
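
The data flow between these components can be sketched in a few lines of plain Python. This is a toy model only: it does not use Scrapy, and every name in it (FAKE_WEB, downloader, spider, DedupePipeline, engine) is a hypothetical stand-in chosen for illustration, not part of Scrapy's API.

```python
import re
from collections import deque

# Hypothetical in-memory "website" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": "<a href='http://example.com/a'>A</a> price=10",
    "http://example.com/a": "<a href='http://example.com/'>home</a> price=25 price=25",
}


def downloader(url):
    """Downloader: return the raw response body for a request."""
    return FAKE_WEB.get(url, "")


def spider(body):
    """Spider: extract items and follow-up requests from a response."""
    items = [{"price": int(p)} for p in re.findall(r"price=(\d+)", body)]
    requests = re.findall(r"href='([^']+)'", body)
    return items, requests


class DedupePipeline:
    """Item pipeline: drop duplicate items and store the rest."""

    def __init__(self):
        self.seen = set()
        self.stored = []

    def process_item(self, item):
        if item["price"] not in self.seen:
            self.seen.add(item["price"])
            self.stored.append(item)


def engine(start_url):
    """Engine: move requests and items between the other components."""
    scheduler = deque([start_url])  # Scheduler: queue of pending requests
    visited = set()
    pipeline = DedupePipeline()
    while scheduler:
        url = scheduler.popleft()
        if url in visited:
            continue
        visited.add(url)
        items, requests = spider(downloader(url))
        for item in items:
            pipeline.process_item(item)
        scheduler.extend(requests)
    return pipeline.stored


stored = engine("http://example.com/")
print(stored)
```

In real Scrapy the engine, scheduler, and downloader are supplied by the framework; you normally write only the items, the spider, and optionally the pipeline.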

Page 5: Session 7 Wharton Summer Tech Camp

Scrapy Framework

1. Use the command line to create the project folder
2. Define items.py
3. Define the spider for crawling the website
4. (Optional) Write the item pipeline

Page 6: Session 7 Wharton Summer Tech Camp

Scrapy commands

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Page 7: Session 7 Wharton Summer Tech Camp

Example & Tutorial

• scrapy startproject “foldername”
• scrapy startproject tutorial

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

Page 8: Session 7 Wharton Summer Tech Camp

Fully working example

• git clone https://github.com/scrapy/dirbot.git
• Or download the zip version and extract it into your working directory
• https://github.com/scrapy/dirbot

Page 9: Session 7 Wharton Summer Tech Camp

1-min HTML

• Hyper Text Markup Language
• Describes a webpage, mostly with tags

<!DOCTYPE html>
<html>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>

Page 10: Session 7 Wharton Summer Tech Camp

HTML Parts

• Each element can have attributes
• <a href="http://www.w3schools.com">This is a link</a>
• href is an attribute
• You can have
  – Class
  – ID
  – Style
  – etc.
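
Attributes are what a scraper usually targets. As a minimal sketch using only Python's standard-library html.parser (not Scrapy), the snippet collects the href attribute of every link; the class name LinkCollector and the sample markup around the slide's link are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            attr_map = dict(attrs)
            if "href" in attr_map:
                self.links.append(attr_map["href"])


# Hypothetical sample page wrapping the link from the slide.
html = (
    '<p class="intro" id="first">'
    '<a href="http://www.w3schools.com">This is a link</a>'
    "</p>"
)

collector = LinkCollector()
collector.feed(html)
links = collector.links
print(links)
```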

Page 11: Session 7 Wharton Summer Tech Camp

XPath

• Language for finding information in an XML(HTML) document

Page 12: Session 7 Wharton Summer Tech Camp

XPath

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
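
To make the idea concrete, here is a small sketch that runs a few XPath-style queries against the bookstore document using Python's standard-library xml.etree.ElementTree. ElementTree supports only a subset of XPath, so the numeric comparison is done in Python; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

BOOKSTORE = b"""<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
"""

root = ET.fromstring(BOOKSTORE)

# All book titles: path from the root element down to each <title>.
titles = [t.text for t in root.findall("./book/title")]

# Titles carrying a given attribute value.
english = [t.text for t in root.findall(".//title[@lang='en']")]

# Title of the book whose <price> child has exactly this text.
exact = [t.text for t in root.findall("./book[price='39.95']/title")]

# Numeric predicates are outside ElementTree's XPath subset, so filter in Python.
cheap = [
    b.findtext("title")
    for b in root.findall("./book")
    if float(b.findtext("price")) < 35
]

print(titles, english, exact, cheap)
```

Scrapy's own selectors accept full XPath expressions, so a filter such as price below 35 can be written inside the expression there.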

Page 13: Session 7 Wharton Summer Tech Camp

XPath

Page 14: Session 7 Wharton Summer Tech Camp

Snipplr

• http://snipplr.com/all/tags/scrapy/

Page 15: Session 7 Wharton Summer Tech Camp

Big Data

Page 16: Session 7 Wharton Summer Tech Camp
Page 18: Session 7 Wharton Summer Tech Camp

“Data is the new Oil. Data is just like crude. It’s valuable, but if unrefined it cannot really be used.” – Clive Humby, DunnHumby

“The goal is to turn data into information, and information into insight.” – Carly Fiorina, former chief executive of Hewlett-Packard Company or HP.

“You can have data without information, but you cannot have information without data.” – Daniel Keys Moran

“We live in the caveman era of Big Data” – Rick Smolan

“From the beginning of recorded time until 2003, we created 5 exabytes of data” (5 billion gigabytes) – Eric Schmidt

In 2011, the same amount was generated in 2 days.
In 2013, the same amount was expected to be generated in 10 minutes.

Fun fact: In 2014, the World Cup final match was estimated to generate 4.3 exabytes of internet traffic.

Page 19: Session 7 Wharton Summer Tech Camp

BIG DATA EMBODIES NEW DATA CHARACTERISTICS CREATED BY TODAY’S DIGITIZED MARKETPLACE

[Figure: characteristics of big data. Source: IBM methodology]

Page 20: Session 7 Wharton Summer Tech Camp

BIG DATA EMBODIES NEW DATA CHARACTERISTICS CREATED BY TODAY’S DIGITIZED MARKETPLACE

[Figure: characteristics of big data, with overlaid labels: Computer Scientists; Statisticians; Big Companies (IBM, Intel, etc.); US!!! Source: IBM methodology]

Page 21: Session 7 Wharton Summer Tech Camp

What’s Scrapy? And Why?

Page 22: Session 7 Wharton Summer Tech Camp

BIG DATA: THIS IS JUST THE BEGINNING

[Figure: volume in exabytes (axis 0 to 9000) and percentage of uncertain data (axis 0 to 100) for the years 2010 to 2015, with series for Sensors & Devices, VoIP, Enterprise Data, and Social Media, annotated with Volume, Variety, and Veracity]

Source: IBM Global Technology Outlook 2012. IBM source data is based on analysis done by the IBM Market Intelligence Department. IBM Market Intelligence data is provided for illustrative purposes and is not intended to be a guarantee of future growth rates or market opportunity.

Page 23: Session 7 Wharton Summer Tech Camp

Current Stage of Big Data

• “This is the caveman era of big data”
• What’s cool is cool because we are looking at these for the first time, and even correlation is cool sometimes! Mash-ups of different big data make things scary sometimes (CMU Face app)
• The scientific process always begins with correlation, then moves on to causality when mature

Page 24: Session 7 Wharton Summer Tech Camp

Big Data: Predictive Analytics VS Causal Inference

Agenda
1) What’s the deal here?
2) Why should you be aware?
3) What kind of development is going on right now?
4) “Big Data and You”

Page 25: Session 7 Wharton Summer Tech Camp

Predictive vs Causal

[Diagram: Statistics branches into Causal Inference (Econometrics) and Predictive Analytics (Machine Learning)]

Page 26: Session 7 Wharton Summer Tech Camp

The Rise of Predictive Models

• Statistics & Computer Science (Logical AI -> Statistical AI)
• Overflowing data + computational power
  – Better prediction
  – Model free – no theory backing
  – Blackbox algorithms
  – Statistical algorithms
• Goal: Predict well (with big enough data, it works)
• Techniques: MANY
  – Take CIS 520: Machine Learning for a basic intro. At least audit! It will open your eyes
  – Stat 9XX – Statistical Learning Theory, if offered! Also great – will be a lot of probability/stat theory (Sasha Rakhlin)
  – Online courses: Andrew Ng’s course, Johns Hopkins Data Science course, etc.

Page 27: Session 7 Wharton Summer Tech Camp

Good Old Causal Inference

• Statistics & Econometrics
• Explore -> Develop Theory -> Test with statistical inference models (Linear Models / Graphical Models / etc.)
• Requirements for X causes Y
  – X must temporally come before Y (NOT in predictive model)
  – X must have a significant statistical relation to Y
  – Association between X and Y must not be due to another omitted variable (NOT in predictive model)
• Theory is from economics/sociology/psychology etc.
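
The omitted-variable requirement can be illustrated with a small simulation. This is a sketch on made-up data, not an analysis from the slides: a hypothetical confounder z drives both x and y, and x has no causal effect on y at all. The naive slope of y on x still comes out near 1, and it drops to about 0 once z is partialled out of both variables.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs (single regressor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


def residuals(ys, zs):
    """What is left of ys after removing its linear dependence on zs."""
    b = slope(zs, ys)
    n = len(ys)
    my, mz = sum(ys) / n, sum(zs) / n
    return [(y - my) - b * (z - mz) for y, z in zip(ys, zs)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]      # omitted variable
x = [zi + rng.gauss(0, 1) for zi in z]       # x depends on z
y = [2 * zi + rng.gauss(0, 1) for zi in z]   # y depends on z only, not on x

naive = slope(x, y)                                    # ignores z
adjusted = slope(residuals(x, z), residuals(y, z))     # controls for z

print(f"naive slope    = {naive:.2f}")
print(f"adjusted slope = {adjusted:.2f}")
```

A predictive model would happily use x to forecast y here, and would forecast well; only the causal reading of the naive slope is wrong.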

Page 28: Session 7 Wharton Summer Tech Camp

Predictive Analytics VS Causal Inference

• Predictive analytics (Machine Learning, Algorithms)
  – Art of prediction
  – RMSE/Error functions
• Causal Inference (Rubin Causal Model, Structural)
  – Theory building
  – Testing theory with statistical tools and robust design of experiments, or techniques to deal with observational data
• Statistics/Comp Sci (Algorithms and Data Mining, Machine Learning)
• Statistics/Econometrics (Causality – different schools of thought even within causal inference groups. For a brief fun intro, see http://leedokyun.com/obs.pdf)
• Paradigm-building in the Kuhnian sense & falsifying existing beliefs in the Popperian sense
  – Causal inference can do both. Predictive models cannot
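
For reference, the RMSE named above is the square root of the mean squared prediction error. A minimal sketch, with made-up numbers and two hypothetical models:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only.
actual = [3.0, 5.0, 7.0]
model_a = [2.0, 5.0, 8.0]  # errors of 1, 0, 1
model_b = [3.0, 3.0, 7.0]  # errors of 0, 2, 0

print(rmse(actual, model_a), rmse(actual, model_b))
```

Under this criterion model_a is preferred because its RMSE is lower; nothing in the score says why the predictions work, which is the contrast the slide draws with causal inference.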

Page 29: Session 7 Wharton Summer Tech Camp

Resources for Causal Inference

• Andrew Gelman: Bayesian statistician at Columbia U
  – http://andrewgelman.com/
  – The great fight of 2009 between the Pearlians and the Rubinians!
    • “Boy, these academic disputes are fun! Such vitriol! Such personal animosity! It's better than reality TV. Did Rubin slap Pearl's mom, or perhaps vice versa?”
    • “With all due respect, I think you are wrong that Judea does not understand the Rubin approach.” – Larry Wasserman
• Judea Pearl “Causality”
• Observational studies books by Paul Rosenbaum
• Miguel Hernan and Jamie Robins “Causal Inference”, free now
  – http://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Page 30: Session 7 Wharton Summer Tech Camp

Arguments About the Big data Movement in Industry

• A great portion of start-ups & many big data firms these days
  – Companies are trying to collect everything about everyone. Becomes an unwieldy beast!
  – http://xkcd.com/882/

Page 31: Session 7 Wharton Summer Tech Camp

THE BIG DATA APPROACH TO ANALYTICS IS DIFFERENT (INDUSTRY)

Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis, test against selected data
• Diagram elements: Hypothesis, Question, Data, Answer, Analyzed Information
• Analyze after landing…

Big Data Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way: explore all data, identify correlations
• Diagram elements: Data, Exploration, All Information, Correlation, Actionable Insight
• Analyze in motion…

Page 33: Session 7 Wharton Summer Tech Camp

IN ACADEMIA, WE SHOULD STAY SOMEWHERE IN THE MIDDLE

[Diagram elements: Raw Data; All Information; Exploration & Reduction & Structure; Data Mining & Machine Learning; Pattern & Correlation; Hypotheses Formation Based on Extant Theory; Causal Data Analysis; Answer, Actionable Insights, and Theory]

“ETET” – Empirical Theory Empirical Theory

Page 34: Session 7 Wharton Summer Tech Camp

Some notable examples

[Figure: two-by-two chart placing examples by data size (Small to Large) and by approach (Not Causal to Causal). Labels shown: RevolutionR; Hal Varian (Google); Susan Athey (MS); Angrist, Krueger; UCLA Stat book; Targeted learning; Economists; Machine Learning; Netflix; Google; Data Mining; Structural Modeling; Information Systems Management; Lab experiment; Fraud detection; Hans Rosling; Marketing; Association rule; Finance Research; “What are you doing here?”]

Page 35: Session 7 Wharton Summer Tech Camp

When dealing with unstructured/big data: causal inference without data mining is myopic, and data mining without theory-driven causal inference is blind.

Page 36: Session 7 Wharton Summer Tech Camp

Quick Overview of Predictive Analytics (Machine Learning) and Applications

Page 37: Session 7 Wharton Summer Tech Camp

Machine Learning - Types

• Supervised Learning
  – Use labeled training data and identify classes or attributes of new data. Calculate predictive models.
  – USES: Predictive models
  – Regression, neural networks, support vector machines, etc.
• Unsupervised Learning
  – Find structure in unlabeled data
  – USES: Exploratory analysis, organization, visualization
  – Clustering, feature extraction, self-organizing maps
• Semi-Supervised Learning
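
The difference between the first two types can be shown on toy one-dimensional data. This is a sketch with made-up numbers: a 1-nearest-neighbor rule stands in for supervised learning and a two-cluster k-means stands in for unsupervised learning. Both function names are illustrative, and the code is pure Python rather than a real ML library.

```python
def nearest_neighbor_predict(train, x):
    """Supervised: return the label of the closest labeled training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def two_means(points, iterations=10):
    """Unsupervised: split unlabeled 1-D points into two clusters (k-means, k=2).

    Assumes the points are not all identical, so neither cluster is empty.
    """
    c0, c1 = min(points), max(points)  # initial cluster centers
    for _ in range(iterations):
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return sorted(g0), sorted(g1)


# Labeled data: the labels are given, and we predict the label of a new point.
labeled = [(1.0, "low"), (1.5, "low"), (8.0, "high"), (9.0, "high")]
pred_small = nearest_neighbor_predict(labeled, 2.0)
pred_large = nearest_neighbor_predict(labeled, 7.0)

# Unlabeled data: no labels at all, and the grouping is discovered.
unlabeled = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
clusters = two_means(unlabeled)

print(pred_small, pred_large, clusters)
```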

Page 38: Session 7 Wharton Summer Tech Camp

Machine Learning Broad Applications

– Face Detection
– Spam Detection
– Song Recognition
– Signature/Zipcode Recognition
– Micro Array
– Astrophysics
– Medical
– Consumer segmentation/targeting
– Recommendation Algorithms

Page 39: Session 7 Wharton Summer Tech Camp

Machine Learning Business (research) Applications

• Unstructured data -> Structured data
  – Natural Language Processing
  – Spoken Language Processing
  – Computer Vision
• Exploratory Analysis
  – Clustering
  – Anomaly detection
• Visualization
  – Dimensionality reduction
  – Multi-Dimensional Scaling
• Some people have started to incorporate machine learning techniques into causal inference
  – Machine learning in matching (PSM)
  – Targeted Learning, 2012 Springer Series (http://www.targetedlearningbook.com/)

Page 40: Session 7 Wharton Summer Tech Camp

Intro to Practical Natural Language Processing

Agenda
1) Brief light-hearted intro to NLP (What is it and why should I care?)
2) Basic ideas in NLP
3) Usage in business research

Page 41: Session 7 Wharton Summer Tech Camp

Quick Overview

• What is Natural (Spoken) Language Processing (NLP)?

• Examples
• How this technology may affect:
  – Industry
  – Academics

Page 42: Session 7 Wharton Summer Tech Camp

Natural Language Processing

• Natural Language Processing is an interdisciplinary field composed of techniques and ideas from computer science, statistics and linguistics that are concerned with making computers able to parse, understand (knowledge representation), store (knowledge database), and ultimately interact (convey information) in natural language (human language such as English)

• Methods: machine learning, Bayesian statistics, algorithms, higher-order logic, linguistics.

Page 43: Session 7 Wharton Summer Tech Camp

Subcategories of NLP

• Information Retrieval: Google. Optimizing text database search.

• Information Extraction: The crude basic form is web crawling + regex. A really sophisticated form, which you’ll see later: Thomson Reuters
• Machine Translation
• Sentiment Analysis and more

Page 44: Session 7 Wharton Summer Tech Camp

Cool Applications

• NSA – uses NLP to detect anomalous activity in internet traffic and phone calls for terrorist activities (and us…)
• Lie detection via spoken language processing
• Automatic plagiarism detector
• ETS Testing – “e-rater” automatic essay scoring on GMAT, GRE, TOEFL since 1999
• Shazam – song discovery (application of spoken language processing)
• News aggregators based on topic
• Entertainment – Cleverbot (Turing test: judged human 59.3% of the time vs 63.3% for real humans). Really evolved from dumb predecessors such as ELIZA, SmarterChild, etc.

Page 45: Session 7 Wharton Summer Tech Camp

Business Applications

• Marketing – sentiment analysis and demand analysis of products from reviews and blogs, e.g. movies, consumer products

• Marketing – Opinion Mining/Subjectivity analysis/Emotion Detection/Opinion Spam Detection etc

• Finance - Quantitative Qualitative high frequency trading ( Thomson Reuters, Bloomberg)

• Management – Resume filtering and firm-employee matching

• Legal Studies – legal document search engines

• E-Commerce – help chat bots

Page 46: Session 7 Wharton Summer Tech Camp

Mainstream Applications

• Siri (dumb) - preprogrammed. No learning

• IBM Watson/ Wolfram Alpha (smart):

– semantic representation of concepts

– acquisition of knowledge

– logical inference machine

• As of 2011, Watson had knowledge equivalent of a second year medical student (which isn’t saying much but still cool due to the speed Watson learns)

Page 48: Session 7 Wharton Summer Tech Camp

Watson gets an attitude

IBM Watson learned urban dictionary in 2013…

“Watson couldn't distinguish between polite language and profanity -- which the Urban Dictionary is full of. Watson picked up some bad habits from reading Wikipedia as well. In tests it even used the word "bullshit" in an answer to a researcher's query.

Ultimately, Brown's 35-person team developed a filter to keep Watson from swearing and scraped the Urban Dictionary from its memory.”

Well no $@!# Sherlock! You mea@#$%s can bite my shiny metal !@$

Page 49: Session 7 Wharton Summer Tech Camp

Some fun facts

• 15,000
  – Average number of words spoken by an average person per day (various sociology and linguistics studies). Approximately 15 words per minute, assuming 8 hours of sleep.
• 100 million to 300 million
  – Average number of words spoken by an average person in a lifetime.
• 100 TRILLION
  – Approximate number of words on the internet in 2007, according to Peter Norvig (AI scientist who leads Google research).
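
The per-minute figure follows directly from the daily one. A quick arithmetic check, using only the numbers on the slide (the years-of-speaking values are simply what the lifetime range implies at that daily rate, not a claim from the source):

```python
WORDS_PER_DAY = 15_000
WAKING_HOURS = 24 - 8  # assuming 8 hours of sleep

words_per_minute = WORDS_PER_DAY / (WAKING_HOURS * 60)

# Years of speaking implied by the lifetime range at 15,000 words per day.
words_per_year = WORDS_PER_DAY * 365
years_low = 100_000_000 / words_per_year
years_high = 300_000_000 / words_per_year

print(f"{words_per_minute:.1f} words per waking minute")
print(f"lifetime range implies {years_low:.0f} to {years_high:.0f} years of speaking")
```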

Page 50: Session 7 Wharton Summer Tech Camp

Reasons why you should at least acknowledge NLP and keep it in mind for the rest of your life

1. It will definitely be a disruptive technology and a large part of everyday life, affecting most types of business (it has already disrupted finance, marketing, management, etc.)

2. Text Data: Explosion of web, Company performance report, news, security filings etc

3. Even in business research outside of Information Systems Management and Marketing, more and more researchers are utilizing NLP

Page 51: Session 7 Wharton Summer Tech Camp

Example Focus (Finance)

• Thomson Reuters (Automation Team) and Bloomberg

• Business Wire: 60 stories per second

– “Apple also announced that Scott Forstall will be leaving Apple next year and will serve as an advisor to CEO Tim Cook in the interim”

– Lake Shore Bancorp, Inc. (the “Company”) (NASDAQ Global Market: LSBK), the holding company for Lake Shore Savings Bank (the “Bank”), announced third quarter 2012 net income of $863,000, or $0.15 per diluted share, compared to net income of $1.2 million, or $0.20 per diluted share, for third quarter 2011. The Company had net income of $2.8 million, or $0.48 per diluted share, for the nine months ended September 30, 2012, compared to net income of $3.1 million, or $0.54 per diluted share for the same period in 2011.

• Extract relevant information -> computer readable format such as XML/JSON

• KEY: the format of where information is, and how to extract is not preprogrammed. The NLP engine learns as new information comes in. Initially, it learns how to extract and what is important by humans tagging many articles. (semi-supervised learning)
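
For contrast with the learned engine described above, here is the crude end of the spectrum: a hand-written regular expression that pulls the net income and per-share figures out of an abridged copy of the Lake Shore Bancorp text and emits JSON. The pattern, field names, and abridgement are assumptions made for this sketch; this is not how the Thomson Reuters or Bloomberg systems work.

```python
import json
import re

# Abridged from the press-release text on the slide.
TEXT = (
    "Lake Shore Bancorp, Inc. announced third quarter 2012 net income of "
    "$863,000, or $0.15 per diluted share, compared to net income of "
    "$1.2 million, or $0.20 per diluted share, for third quarter 2011. "
    "The Company had net income of $2.8 million, or $0.48 per diluted share, "
    "for the nine months ended September 30, 2012, compared to net income of "
    "$3.1 million, or $0.54 per diluted share for the same period in 2011."
)

# Hard-coded pattern: breaks as soon as the wording changes.
PATTERN = re.compile(
    r"net income of \$(?P<amount>\d[\d,.]*?)(?P<unit> million)?, "
    r"or \$(?P<eps>\d+\.\d+) per diluted share"
)


def extract_net_income(text):
    """Return one record per 'net income of $X, or $Y per diluted share' phrase."""
    records = []
    for match in PATTERN.finditer(text):
        multiplier = 1_000_000 if match.group("unit") else 1
        dollars = round(float(match.group("amount").replace(",", "")) * multiplier)
        records.append({"net_income_usd": dollars, "eps_usd": float(match.group("eps"))})
    return records


records = extract_net_income(TEXT)
print(json.dumps(records, indent=2))
```

The regex recovers the four figures but knows nothing about which quarter or company each belongs to. Closing that gap without hand-coding every layout is what the learning-based approach is for.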

Page 52: Session 7 Wharton Summer Tech Camp

Example Focus (Finance)

Lake Shore Bancorp, Inc. (the “Company”) (NASDAQ Global Market: LSBK), the holding company for Lake Shore Savings Bank (the “Bank”), announced third quarter 2012 net income of $863,000, or $0.15 per diluted share, compared to net income of $1.2 million, or $0.20 per diluted share, for third quarter 2011. The Company had net income of $2.8 million, or $0.48 per diluted share, for the nine months ended September 30, 2012, compared to net income of $3.1 million, or $0.54 per diluted share for the same period in 2011. [...]

• Named Entity Recognition: has to realize that “Lake Shore Bancorp, Inc.” is the name of a company
• Coreference resolution: “the Company” is Lake Shore Bancorp, Inc.
• Morphological segmentation: breaking words into basic parts and the meaning-bearing “lexeme”, e.g. “announced” is the past tense of the lexeme “announce” with inflection rule -ed
• Part of Speech Tagging and Grammar Parsing
• Chunking and Breaking: e.g. “A and B of X and Y” is (A, X) and is (B, Y)

Page 53: Session 7 Wharton Summer Tech Camp

Example Focus (Finance)

Lake Shore Bancorp, Inc. (the “Company”) (NASDAQ Global Market: LSBK), the holding company for Lake Shore Savings Bank (the “Bank”), announced third quarter 2012 net income of $863,000, or $0.15 per diluted share, compared to net income of $1.2 million, or $0.20 per diluted share, for third quarter 2011. The Company had net income of $2.8 million, or $0.48 per diluted share, for the nine months ended September 30, 2012, compared to net income of $3.1 million, or $0.54 per diluted share for the same period in 2011. [...]

<company name="Lake Shore Bancorp">
  <Alias>The Company</Alias>
  <Holds>Lake Shore Savings Bank</Holds>
  <Holds>The Bank</Holds>
  <Q year="2012" period="third">863000</Q>
  <Q year="2011" period="third">1.2 Million</Q>
  <Net year="2012" month="9">2.8 Million</Net>
  <Net year="2011" month="9">3.1 Million</Net>
  .........
</company>
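
Once the facts are extracted, writing the XML record shown on this slide is the easy part. A minimal sketch with Python's standard-library xml.etree.ElementTree, using the element and attribute names from the slide; the values are typed in by hand here, standing in for the output of an extraction step.

```python
import xml.etree.ElementTree as ET

# Element and attribute names mirror the XML on the slide.
company = ET.Element("company", name="Lake Shore Bancorp")
ET.SubElement(company, "Alias").text = "The Company"
ET.SubElement(company, "Holds").text = "Lake Shore Savings Bank"
ET.SubElement(company, "Holds").text = "The Bank"

# Quarterly net income figures.
ET.SubElement(company, "Q", year="2012", period="third").text = "863000"
ET.SubElement(company, "Q", year="2011", period="third").text = "1.2 Million"

# Nine-month net income figures (period ending in month 9).
ET.SubElement(company, "Net", year="2012", month="9").text = "2.8 Million"
ET.SubElement(company, "Net", year="2011", month="9").text = "3.1 Million"

xml_text = ET.tostring(company, encoding="unicode")
print(xml_text)
```

A production record would more likely store every amount as a plain number (for example 1200000 rather than "1.2 Million") so that downstream code does not have to parse units again.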

Page 54: Session 7 Wharton Summer Tech Camp

Real example XML from Thomson Reuters

Page 55: Session 7 Wharton Summer Tech Camp

Bottom Line

• NLP can do lots of cool stuff
• Unstructured text data is huge and is growing faster than ever. And it will continue to grow as the online population increases
• NLP is an important tool for anyone to be aware of
• Jurafsky & Martin “Speech and Language Processing” for deep theory
• Bing Liu’s two books: http://www.cs.uic.edu/~liub/
• Practical NLTK books: an NLTK cookbook by Jacob Perkins and “Natural Language Processing with Python” by Steven Bird et al.

Page 56: Session 7 Wharton Summer Tech Camp

Next Sessions

• Overview of Machine Learning and Data Mining
• You’ll see NLP in action (specific tasks)
• Actual code using NLTK (install this!)
• Example research – uses NLP and machine learning techniques to content-code large-scale social media data.

• You can read appendix of the paper “The Effect of Social Media Marketing Content on Consumer Engagement: Evidence from Facebook”

• ssrn.com/abstract=2290802