Session 7: Wharton Summer Tech Camp
Scrapy & Big Data in Empirical Business Research
What’s Scrapy? And Why?
• Application framework for crawling websites and scraping & extracting data using APIs
  – Basically a set of pre-defined classes and instructions for efficiently writing scraping code
• It’s in Python
• Simple once you know the framework
• Fast, extensible, many built-in functions, good-sized online support community
• Some companies use this commercially. It’s that powerful.
Scrapy Architecture
Scrapy Components
• Engine
  – Main engine that passes items and requests around the framework
• Scheduler
  – Gets requests from the engine and enqueues them for later dispatch
• Downloader
  – Downloads the raw HTTP responses and feeds them to the spider
• Spiders
  – Receive the downloaded raw HTTP responses and extract information
• Item Pipeline
  – Collects extracted items from the spider and post-processes them
  – Built-in modules & typical uses:
    • Clean HTML
    • Validate data & check for duplicates
    • Store the data in a database
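The pipeline stages above (clean, validate & deduplicate, store) can be sketched outside the framework; a minimal toy sketch in plain Python, with hypothetical item fields:

```python
# A minimal sketch of what an item pipeline does, outside the framework:
# clean -> validate & deduplicate -> store. Field names are hypothetical.
seen_links = set()
stored_items = []  # stands in for the database

def process_item(item):
    """Clean an item, drop duplicates by link, and 'store' the rest."""
    item = {key: value.strip() for key, value in item.items()}  # clean stage
    if item["link"] in seen_links:                              # duplicate check
        return None
    seen_links.add(item["link"])
    stored_items.append(item)                                   # store stage
    return item

process_item({"title": "  Scrapy  ", "link": "http://scrapy.org"})
process_item({"title": "Scrapy", "link": "http://scrapy.org"})  # duplicate, dropped
```

In real Scrapy these stages are separate pipeline classes enabled in settings.py, each with a `process_item` method.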
Scrapy Framework
1. Use the command line to create a project folder
2. Define items.py
3. Define a Spider for crawling the website
4. (Optional) Write the item pipeline
Usage: scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory
Scrapy commands
Example & Tutorial
• scrapy startproject <foldername>
• e.g. scrapy startproject tutorial
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
Fully working example
• git clone https://github.com/scrapy/dirbot.git
• Or download the zip version and extract it in your working directory
  – https://github.com/scrapy/dirbot
1-min HTML
• HyperText Markup Language
• Describes a webpage – mostly with tags

<!DOCTYPE html>
<html>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>
HTML Parts
• Each element can have attributes
  – <a href="http://www.w3schools.com">This is a link</a>
  – href is an attribute
• You can have
  – class
  – id
  – style
  – etc.
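Attributes like href can be pulled out programmatically; a minimal sketch using Python's standard-library html.parser (Scrapy does this far more conveniently, but the idea is the same):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag seen."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

extractor = LinkExtractor()
extractor.feed('<a href="http://www.w3schools.com">This is a link</a>')
print(extractor.links)  # → ['http://www.w3schools.com']
```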
XPath
• Language for finding information in an XML (or HTML) document
XPath

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
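A few XPath-style queries against the bookstore document, sketched with Python's standard-library ElementTree (which supports a limited XPath subset; Scrapy's selectors support full XPath):

```python
import xml.etree.ElementTree as ET

xml = """<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = ET.fromstring(xml)

# all English-language titles
titles = [t.text for t in root.findall("./book/title[@lang='en']")]

# books priced above 35 (ElementTree's XPath subset cannot compare
# numbers, so the numeric filter is done in Python)
expensive = [b.find("title").text
             for b in root.findall("book")
             if float(b.find("price").text) > 35]

print(titles)     # → ['Harry Potter', 'Learning XML']
print(expensive)  # → ['Learning XML']
```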
XPath
Snipplr
• http://snipplr.com/all/tags/scrapy/
Big Data
Videos about Big Data
• Joy of Stats – Hans Rosling!
  – http://www.youtube.com/watch?v=CiCQepmcuj8
  – http://www.ted.com/playlists/56/making_sense_of_too_much_data.html
• http://www.intel.com/content/www/us/en/big-data/big-data-101-animation.html (more relevant; stop at 2:10)
• http://motherboard.vice.com/blog/big-data-explained-brilliantly-in-one-short-video (broad; stop at 4:26)
• https://www.youtube.com/watch?v=LrNlZ7-SMPk (many interesting stats)
• http://blog.varonis.com/10-big-data-videos-watch-right-now/
“Data is the new Oil. Data is just like crude. It’s valuable, but if unrefined it cannot really be used.” – Clive Humby, DunnHumby
“The goal is to turn data into information, and information into insight.” – Carly Fiorina, former CEO of Hewlett-Packard (HP)
“You can have data without information, but you cannot have information without data.” – Daniel Keys Moran
“We live in the caveman era of Big Data” – Rick Smolan
“From the beginning of recorded time until 2003, we created 5 exabytes of data” (5 billion gigabytes) – Eric Schmidt
In 2011, the same amount was generated in 2 days. In 2013, the same amount was expected to be generated in 10 minutes.
Fun fact: In 2014, the World Cup final match was estimated to generate 4.3 exabytes of internet traffic.
BIG DATA EMBODIES NEW DATA CHARACTERISTICS CREATED BY TODAY’S DIGITIZED MARKETPLACE

[Figure: characteristics of big data. Source: IBM methodology]
[Figure: who’s involved in big data – computer scientists, statisticians, big companies (IBM, Intel, etc.), and us!]
BIG DATA: THIS IS JUST THE BEGINNING

[Figure: projected growth in data volume (in exabytes, axis up to 9,000) from 2010 to 2015, alongside the rising percentage of uncertain data by source (sensors & devices, VoIP, enterprise data, social media); the axes are labeled Volume, Variety, and Veracity]

Source: IBM Global Technology Outlook 2012. IBM source data is based on analysis done by the IBM Market Intelligence Department. IBM Market Intelligence data is provided for illustrative purposes and is not intended to be a guarantee of future growth rates or market opportunity.
Current Stage of Big Data
• “This is the caveman era of big data”
• What’s cool is cool because we are looking at these data for the first time – even correlation is cool sometimes! Mash-ups of different big data sources make things scary sometimes (the CMU face-recognition app)
• The scientific process always begins with correlation, then moves on to causality when mature
Big Data: Predictive Analytics VS Causal Inference
Agenda
1) What’s the deal here?
2) Why should you be aware?
3) What kind of development is going on right now?
4) “Big Data and You”
Predictive vs Causal
[Figure: under the umbrella of statistics, causal inference (econometrics) sits on one side and predictive analytics (machine learning) on the other]
The Rise of Predictive Models
• Statistics & computer science (logical AI -> statistical AI)
• Overflowing data + computational power
  – Better prediction
  – Model-free – no theory backing
  – Black-box algorithms
  – Statistical algorithms
• Goal: predict well (with big enough data, it works)
• Techniques: MANY
  – Take CIS 520: Machine Learning for a basic intro. At least audit it! It will open up your eyes
  – Stat 9XX – Statistical Learning Theory, if offered! Also great – will be a lot of probability/stat theory (Sasha Rakhlin)
  – Online courses: Andrew Ng’s course, the Johns Hopkins Data Science course, etc.
Good Old Causal Inference
• Statistics & econometrics
• Explore -> develop theory -> test with statistical inference models (linear models / graphical models / etc.)
• Requirements for “X causes Y”:
  – X must temporally come before Y (NOT in predictive models)
  – X must have a significant statistical relation to Y
  – The association between X and Y must not be due to an omitted variable (NOT in predictive models)
• Theory comes from economics/sociology/psychology, etc.
Predictive Analytics VS Causal Inference
• Predictive analytics (machine learning, algorithms)
  – The art of prediction – RMSE/error functions
• Causal inference (Rubin causal model, structural)
  – Theory building
  – Testing theory with statistical tools and robust design of experiments, or techniques to deal with observational data
• Statistics/computer science (algorithms and data mining, machine learning)
• Statistics/econometrics (causality – different schools of thought even within causal-inference groups; for a brief fun intro, see http://leedokyun.com/obs.pdf)
• Paradigm-building in the Kuhnian sense & falsifying existing beliefs in the Popperian sense
  – Causal inference can do both. Predictive models cannot.
Resources for Causal Inference
• Andrew Gelman: Bayesian statistician at Columbia University
  – http://andrewgelman.com/
  – The great fight of 2009 between the Pearlians and the Rubinians!
    • “Boy, these academic disputes are fun! Such vitriol! Such personal animosity! It’s better than reality TV. Did Rubin slap Pearl’s mom, or perhaps vice versa?”
    • “With all due respect, I think you are wrong that Judea does not understand the Rubin approach.” – Larry Wasserman
• Judea Pearl, “Causality”
• Books on observational studies by Paul Rosenbaum
• Miguel Hernán and Jamie Robins, “Causal Inference” (free now)
  – http://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Arguments About the Big Data Movement in Industry
• A great portion of start-ups & many big data firms these days
  – Companies are trying to collect everything about everyone. It becomes an unwieldy beast!
  – http://xkcd.com/882/
THE BIG DATA APPROACH TO ANALYTICS IS DIFFERENT (INDUSTRY)

• Traditional analytics: structured & repeatable – structure built to store data. Start with a hypothesis and test against selected data: Question -> Hypothesis -> Data -> Answer. Analyze after landing.
• Big data analytics: iterative & exploratory – data is the structure. Data leads the way: explore all data and identify correlations: All Information -> Exploration -> Correlation -> Actionable Insight. Analyze in motion.
Arguments About the Big Data Movement in Academia
Read these interesting pieces featuring the dynamic duo of marketing (Prof. Eric Bradlow and Prof. Peter Fader) and Prof. Eric Clemons of OPIM:
• http://www.sas.com/resources/asset/SAS_BigData_final.pdf
• http://knowledge.wharton.upenn.edu/article.cfm?articleid=2186
• http://www.datanami.com/datanami/2012-05-03/wharton_professor_pokes_hole_in_big_data_balloon.html
IN ACADEMIA, WE SHOULD STAY SOMEWHERE IN THE MIDDLE
[Figure: raw data (all information) -> exploration, reduction & structure via data mining & machine learning -> patterns & correlations -> hypothesis formation based on extant theory -> causal data analysis -> answers, actionable insights, and theory]

“ETET” – Empirical, Theory, Empirical, Theory
Some notable examples
[Figure: a 2×2 map of notable examples, with axes Not Causal vs. Causal and Small vs. Large data. Entries include: RevolutionR; Hal Varian (Google); Susan Athey (MS); Angrist & Krueger; the UCLA stats book; targeted learning; economists; machine learning ("What are you doing here?"); Netflix; Google; data mining; structural modeling; information systems management; lab experiments; fraud detection; Hans Rosling; marketing; association rules; finance research]
When dealing with unstructured/big data: causal inference without data mining is myopic, and data mining without theory-driven causal inference is blind.
Quick Overview of Predictive Analytics (Machine Learning) and Applications
Machine Learning - Types
• Supervised learning
  – Uses labeled training data to identify classes or attributes of new data; fits predictive models
  – USES: predictive models
  – Regression, neural networks, support vector machines, etc.
• Unsupervised learning
  – Finds structure in unlabeled data
  – USES: exploratory analysis, organization, visualization
  – Clustering, feature extraction, self-organizing maps
• Semi-supervised learning
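Supervised learning in miniature: a toy one-nearest-neighbour classifier (the data points and labels below are made up) shows the core idea of using labeled training data to classify a new point:

```python
def nearest_neighbour(train, query):
    """Return the label of the training point closest to `query`.

    train: list of ((x, y), label) pairs; query: an (x, y) point.
    """
    def squared_distance(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    closest = min(train, key=lambda pair: squared_distance(pair[0], query))
    return closest[1]

# hypothetical labeled training data: two clusters
labelled = [((0, 0), "low"), ((1, 0), "low"), ((5, 5), "high"), ((6, 5), "high")]
print(nearest_neighbour(labelled, (5, 4)))  # → high
print(nearest_neighbour(labelled, (1, 1)))  # → low
```

Real methods (SVMs, neural networks) differ in how they generalize from the training data, but the labeled-data-in, prediction-out contract is the same.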
Machine Learning Broad Applications
– Face detection
– Spam detection
– Song recognition
– Signature/zipcode recognition
– Microarrays
– Astrophysics
– Medicine
– Consumer segmentation/targeting
– Recommendation algorithms
Machine Learning Business (research) Applications
• Unstructured data -> structured data
  – Natural language processing
  – Spoken language processing
  – Computer vision
• Exploratory analysis
  – Clustering
  – Anomaly detection
• Visualization
  – Dimensionality reduction
  – Multi-dimensional scaling
• Some people have started to incorporate machine learning techniques into causal inference
  – Machine learning in matching (PSM)
  – Targeted Learning, 2012 Springer Series (http://www.targetedlearningbook.com/)
Intro to Practical Natural Language Processing
Agenda
1) Brief light-hearted intro to NLP (What is it and why should I care?)
2) Basic ideas in NLP
3) Usage in business research
Quick Overview
• What is Natural (Spoken) Language Processing (NLP)?
• Examples
• How this technology may affect:
  – Industry
  – Academics
Natural Language Processing
• Natural language processing is an interdisciplinary field that draws techniques and ideas from computer science, statistics, and linguistics, concerned with making computers able to parse, understand (knowledge representation), store (knowledge databases), and ultimately interact (convey information) in natural language (human language such as English)
• Methods: machine learning, Bayesian statistics, algorithms, higher-order logic, linguistics
Subcategories of NLP
• Information retrieval: Google; optimizing text database search
• Information extraction: the crude basic form is web crawling + regex; you’ll see a really sophisticated form later – Thomson Reuters
• Machine translation
• Sentiment analysis, and more
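The "crude basic form" of information extraction – regular expressions over fetched text – can be sketched in a few lines. The pattern below is a toy built for the Lake Shore sentence quoted later in this deck, not a production extractor:

```python
import re

text = ("announced third quarter 2012 net income of $863,000, or $0.15 per "
        "diluted share, compared to net income of $1.2 million, or $0.20 per "
        "diluted share, for third quarter 2011.")

# capture each net-income figure together with its per-diluted-share amount
pattern = re.compile(
    r"net income of \$([\d.,]+(?: million)?), or \$([\d.]+) per diluted share"
)
print(pattern.findall(text))  # → [('863,000', '0.15'), ('1.2 million', '0.20')]
```

The brittleness is obvious: any rewording of the sentence breaks the pattern, which is exactly why the sophisticated systems learn extraction rules instead of hard-coding them.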
Cool Applications
• NSA – uses NLP to detect anomalous activity in internet traffic and phone calls, looking for terrorist activity (and us…)
• Lie detection via spoken language processing
• Automatic plagiarism detectors
• ETS testing – since 1999, “e-rater” automatic essay scoring on the GMAT, GRE, and TOEFL
• Shazam – song discovery (an application of spoken language processing)
• News aggregators based on topic
• Entertainment – Cleverbot (judged human 59.3% of the time in a Turing test, vs. 63.3% for real humans); really evolved from its dumb predecessors ELIZA, SmarterChild, etc.
Business Applications
• Marketing – sentiment analysis and demand analysis of products from reviews and blogs, e.g. movies, consumer products
• Marketing – opinion mining / subjectivity analysis / emotion detection / opinion-spam detection, etc.
• Finance – quantitative-qualitative high-frequency trading (Thomson Reuters, Bloomberg)
• Management – résumé filtering and firm–employee matching
• Legal studies – legal document search engines
• E-commerce – help chat bots
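The simplest form of the sentiment analysis mentioned above is lexicon-based counting; a toy sketch (the word lists are made up and far smaller than real sentiment lexicons):

```python
# toy lexicon-based sentiment: count positive words minus negative words
POSITIVE = {"great", "love", "excellent", "good", "amazing"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}

def sentiment_score(review):
    """Return (#positive words) - (#negative words) in the review."""
    words = review.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("I love this excellent camera"))    # → 2
print(sentiment_score("terrible battery and poor lens"))  # → -2
```

Real systems handle negation ("not good"), sarcasm, and domain-specific vocabulary, which is where the machine learning comes in.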
Mainstream Applications

• Siri (dumb) – preprogrammed; no learning

• IBM Watson / Wolfram Alpha (smart):
  – Semantic representation of concepts
  – Acquisition of knowledge
  – Logical inference machine

• As of 2011, Watson had knowledge equivalent to a second-year medical student (which isn’t saying much, but still cool given the speed at which Watson learns)
Watson gets an attitude
IBM Watson learned the Urban Dictionary in 2013…
“Watson couldn't distinguish between polite language and profanity -- which the Urban Dictionary is full of. Watson picked up some bad habits from reading Wikipedia as well. In tests it even used the word "bullshit" in an answer to a researcher's query.
Ultimately, Brown's 35-person team developed a filter to keep Watson from swearing and scraped the Urban Dictionary from its memory.”
Well no $@!# Sherlock! You mea@#$%s can bite my shiny metal !@$
Some fun facts
• 15,000
  – Average number of words spoken by an average person per day (various sociology and linguistics studies); approximately 15 words per minute, assuming 8 hours of sleep
• 100 million – 300 million
  – Average number of words spoken by an average person in a lifetime
• 100 TRILLION
  – Approximate number of words on the internet in 2007, per Peter Norvig (head of Google Research and AI scientist)
Reasons why you should at least acknowledge NLP and keep it in mind for the rest of your life
1. It will definitely be a disruptive technology and a large part of everyday life, affecting most types of business (it has already disrupted finance, marketing, management, etc.)
2. Text data: the explosion of the web, company performance reports, news, security filings, etc.
3. Even in business research outside of information systems management and marketing, more and more researchers are utilizing NLP
Example Focus (Finance)
• Thomson Reuters (Automation Team) and Bloomberg
• Business Wire: 60 stories per second
  – “Apple also announced that Scott Forstall will be leaving Apple next year and will serve as an advisor to CEO Tim Cook in the interim”
– Lake Shore Bancorp, Inc. (the “Company”) (NASDAQ Global Market: LSBK), the holding company for Lake Shore Savings Bank (the “Bank”), announced third quarter 2012 net income of $863,000, or $0.15 per diluted share, compared to net income of $1.2 million, or $0.20 per diluted share, for third quarter 2011. The Company had net income of $2.8 million, or $0.48 per diluted share, for the nine months ended September 30, 2012, compared to net income of $3.1 million, or $0.54 per diluted share for the same period in 2011.
• Extract relevant information -> computer-readable format such as XML/JSON
• KEY: the format – where the information is and how to extract it – is not preprogrammed. The NLP engine learns as new information comes in. Initially, it learns how to extract and what is important from humans tagging many articles (semi-supervised learning)
Lake Shore Bancorp, Inc. (the “Company”) (NASDAQ Global Market: LSBK), the holding company for Lake Shore Savings Bank (the “Bank”), announced third quarter 2012 net income of $863,000, or $0.15 per diluted share, compared to net income of $1.2 million, or $0.20 per diluted share, for third quarter 2011. The Company had net income of $2.8 million, or $0.48 per diluted share, for the nine months ended September 30, 2012, compared to net income of $3.1 million, or $0.54 per diluted share for the same period in 2011. [...]
• Named entity recognition: has to realize that “Lake Shore Bancorp, Inc.” is the name of a company
• Coreference resolution: “the Company” is Lake Shore Bancorp, Inc.
• Morphological segmentation: breaking words into their basic parts and meanings (“lexemes”), e.g. “announced” is the past tense of the lexeme “announce” with the inflection rule -ed
• Part-of-speech tagging and grammar parsing
• Chunking and breaking: e.g. “A and B of X and Y” -> is(A, X) and is(B, Y)
Example Focus (Finance)

<company name="Lake Shore Bancorp">
  <Alias>The Company</Alias>
  <Holds>Lake Shore Savings Bank</Holds>
  <Holds>The Bank</Holds>
  <Q year="2012" period="third">863000</Q>
  <Q year="2011" period="third">1.2 Million</Q>
  <Net year="2012" month="9">2.8 Million</Net>
  <Net year="2011" month="9">3.1 Million</Net>
  .........
</company>
Real example XML from Thomson Reuters
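Once the text has been converted to XML like the record above, it is machine-readable; a sketch parsing a simplified, hypothetical version of that record (values normalized to plain numbers for illustration) with Python's standard-library ElementTree:

```python
import xml.etree.ElementTree as ET

# simplified, hypothetical version of the record above
record = (
    '<company name="Lake Shore Bancorp">'
    '<Alias>The Company</Alias>'
    '<Q year="2012" period="third">863000</Q>'
    '<Q year="2011" period="third">1200000</Q>'
    '</company>'
)

root = ET.fromstring(record)
name = root.get("name")                           # attribute lookup
q3_2012 = int(root.find("Q[@year='2012']").text)  # XPath-style predicate
print(name, q3_2012)  # → Lake Shore Bancorp 863000
```

From here the figures can go straight into a database or a trading model, which is the whole point of the extraction step.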
Bottom Line
• NLP can do lots of cool stuff
• Unstructured text data is huge and growing faster than ever, and it will continue to grow as the online population increases
• NLP is an important tool for anyone to be aware of
• Jurafsky & Martin, “Speech and Language Processing”, for deep theory
• Bing Liu’s two books: http://www.cs.uic.edu/~liub/
• Practical NLTK books: an NLTK cookbook by Jacob Perkins, and “Natural Language Processing with Python” by Steven Bird et al.
Next Sessions
• Overview of machine learning and data mining
• You’ll see NLP in action (specific tasks)
• Actual code using NLTK (install this!)
• Example research – uses NLP and machine learning techniques to content-code large-scale social media data
• You can read the appendix of the paper “The Effect of Social Media Marketing Content on Consumer Engagement: Evidence from Facebook” (ssrn.com/abstract=2290802)