Andrew Fogg, Import.io
Web data: challenges and opportunities for official statistics
we have come a long way since this was the gold standard of data recording
the web is the biggest repository of data that we have ever created
but there are problems
web data is trapped inside web pages and getting at it is difficult
you can use an api
if there is one available
you can write a scraper
but that is painful
:-(
both require a friendly developer
and a lot of time
for every single website
4 hours 4 days
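to see why a hand-written scraper is tied to one website, here is a minimal sketch using only the python standard library. the markup in `SAMPLE_HTML`, the class names `name` and `price`, and the `RowScraper` helper are all made up for illustration; a real site will use different markup, so the code has to be rewritten for it.

```python
from html.parser import HTMLParser

# hypothetical page markup; a real site will differ, which is why
# a scraper has to be written (and maintained) per website
SAMPLE_HTML = """
<table>
  <tr><td class="name">Drive A</td><td class="price">$59.99</td></tr>
  <tr><td class="name">Drive B</td><td class="price">$89.50</td></tr>
</table>
"""


class RowScraper(HTMLParser):
    """Collects the text of <td class="name"> and <td class="price"> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows: {"name": ..., "price": ...}
        self._row = None    # row currently being built
        self._field = None  # class of the <td> we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td" and self._row is not None:
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        if self._field in ("name", "price") and data.strip():
            self._row[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None


def scrape(html):
    """Turn the sample page into a list of row dictionaries."""
    parser = RowScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape(SAMPLE_HTML)
for row in rows:
    print(row["name"], row["price"])
```

the parser only works because it knows this page's tags and class names; change the layout and it silently returns nothing.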
there are technology companies that now solve this problem
import.io allows you to turn websites into rows and columns of data without writing any code
we even have a version of import.io that requires no human training
web data is available now and we need to think about how we can
leverage it for official statistics
i want to show you 3 use cases of web data from business
that lend themselves to areas of official statistics
sell hard disk drives
they need to know how to price them
they pull web data on all the hard disk drives
from all the brands
in all of the markets that they operate in
product properties
500GB SSD
analysis of this data
allows them to determine overall trends and make decisions about…
price per gigabyte
$/GB
premiums for certain features
$/GB
when to discount
$/GB
it gives them an overall picture of the pricing and competitive landscape
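as a rough sketch of the analysis described above, price per gigabyte and a feature premium can be computed from rows of product data. the rows, field names and prices below are invented for illustration and are not real market data; taking the median per drive type is one possible choice, not necessarily what the company does.

```python
from statistics import median

# made-up rows standing in for scraped product data
drives = [
    {"brand": "A", "type": "HDD", "capacity_gb": 1000, "price_usd": 50.0},
    {"brand": "B", "type": "HDD", "capacity_gb": 2000, "price_usd": 80.0},
    {"brand": "A", "type": "SSD", "capacity_gb": 500, "price_usd": 100.0},
    {"brand": "C", "type": "SSD", "capacity_gb": 1000, "price_usd": 180.0},
]


def price_per_gb(drive):
    """Price in dollars divided by capacity in gigabytes ($/GB)."""
    return drive["price_usd"] / drive["capacity_gb"]


def median_price_per_gb(rows, drive_type):
    """Median $/GB across all rows of one drive type."""
    return median(price_per_gb(d) for d in rows if d["type"] == drive_type)


hdd = median_price_per_gb(drives, "HDD")
ssd = median_price_per_gb(drives, "SSD")

# premium for the SSD feature, expressed as a multiple of the HDD $/GB
ssd_premium = ssd / hdd
print(f"HDD {hdd:.3f} $/GB, SSD {ssd:.3f} $/GB, premium {ssd_premium:.1f}x")
```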
the obvious application of this for official statistics
you don’t have to sample: you can go for the whole population
100%
having said that, not all consumption is online and not all goods are online (although the share is growing)
100%
80%
20%
and you can often get pricing for products even when purchase never happens online
100%
80%
20%
recruitment leads
one of the world’s biggest recruitment agencies
when a job appears on a careers page
they want it in salesforce
so this guy can do his job
they tried writing custom code
for 4,000 web sites
every website is different
how long would that take
4 hours 4 days
remember?
8 years 61 years
for 4,000 websites let’s call it 30 man years
with import•io...
bar chart, 0 to 30: web scrapers vs import•io
it took 5 people 2 weeks = 0.2 man years
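the effort figures above can be checked with a few lines of arithmetic. the working-year lengths (2,000 hours, 260 days) are assumptions chosen because they reproduce the 8 and 61 year figures; the slides do not state them.

```python
SITES = 4_000
HOURS_PER_YEAR = 2_000  # assumption: one working year in hours
DAYS_PER_YEAR = 260     # assumption: one working year in days
WEEKS_PER_YEAR = 52

low = SITES * 4 / HOURS_PER_YEAR  # 4 hours per website
high = SITES * 4 / DAYS_PER_YEAR  # 4 days per website
team = 5 * 2 / WEEKS_PER_YEAR     # 5 people for 2 weeks

print(f"by hand: {low:.0f} to {high:.1f} man years")
print(f"5 people for 2 weeks: {team:.2f} man years")
```

"30 man years" sits between the two hand-written estimates, and 10 person-weeks rounds to the 0.2 man years quoted.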
you may be familiar with this story but there are new details, i promise you
let’s start with a question
how much does prostitution contribute to the UK economy?
£5.314bn
that is a big number
bigger than the entire economy of Moldova
bigger than 28% of global economies
but it is only 45% of apple’s Q1 profits (apple posted the biggest quarterly profit in history)
restatement of national accounts in line with new EU accounting guidelines
“illegal transactions to which all parties consent should be included in national accounts”
this includes prostitution and sale of illegal drugs
how did the ONS calculate that number for prostitution?
there are lots of assumptions and the number of prostitutes relies on one study
the study tried to count all the prostitutes in London over 6 months in 2004
the researchers phoned every brothel and escort agency they could find advertised
estimated count of off-street prostitutes in London (plus estimate of on-street prostitutes from police)
7,000 (+115)
scaled out to the UK using population statistics
how reliable is that number? especially the scaling assumption?
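the scaling assumption is that the rest of the UK has the same number of prostitutes per head as London. a minimal sketch of that step follows; the population figures are round numbers picked for illustration and are not necessarily the ones the ONS used, and `scale_to_uk` is a hypothetical helper.

```python
def scale_to_uk(london_count, london_pop, uk_pop):
    """Assume the national rate per head equals the London rate."""
    return london_count * uk_pop / london_pop


# off-street estimate plus on-street estimate, from the study
london_count = 7_000 + 115

# assumption: round population figures, for illustration only
LONDON_POP = 8_000_000
UK_POP = 64_000_000

uk_estimate = scale_to_uk(london_count, LONDON_POP, UK_POP)
print(f"scaled UK estimate: {uk_estimate:,.0f}")
```

everything rides on the ratio of the two populations, which is why the assumption deserves checking against independent data.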
time to look at some web data
many activities associated with prostitution are illegal in the UK
but paying for sex is actually legal
as a result prostitution sits in a grey area and is widely marketed on the web
built an API to the AdultWork directory using Import.io
AdultWork is ranked the 174th most popular site in the UK by Alexa
searched for all sex workers based in the UK and downloaded the data
categorised each sex worker by location (London / not London)
counted the number of London-based prostitutes and estimated the UK total using the scaling assumption
then counted the number of UK-based prostitutes on AdultWork
it seems that the scaling assumption is reasonable
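the check described above compares two numbers from the same dataset: the UK total implied by scaling up the London count, and the UK total counted directly. a sketch of that comparison follows; the locations and the population figures are invented, so the output says nothing about the real data.

```python
# made-up locations standing in for downloaded profiles
locations = [
    "London", "Leeds", "london ", "Glasgow",
    "Cardiff", "Bristol", "Belfast", "Leeds",
]

# assumption: round population figures, for illustration only
LONDON_POP = 8_000_000
UK_POP = 64_000_000


def is_london(location):
    """Categorise a profile as London / not London."""
    return location.strip().lower() == "london"


def scaling_check(locations, london_pop, uk_pop):
    """Return (London count, scaled UK estimate, direct UK count)."""
    london = sum(1 for loc in locations if is_london(loc))
    scaled = london * uk_pop / london_pop  # estimate from London only
    actual = len(locations)                # every profile is UK based
    return london, scaled, actual


london, scaled, actual = scaling_check(locations, LONDON_POP, UK_POP)
print(f"London {london}, scaled estimate {scaled:.0f}, direct count {actual}")
```

if the scaled estimate and the direct count are close, the scaling assumption holds up for that dataset.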
so the ONS numbers are good?
what is wrong with this picture?
“no information is included regarding men working in the sex industry”
40% of the prostitutes on AdultWork are men (in London and across UK)
recalculating GDP to take into account male prostitution gives a dramatic increase
£8.856bn
new number
the new total is bigger than the entire economy of Iceland
bigger than 37% of global economies
but it is still only 75% of apple’s Q1 profits
this is only a preliminary analysis
but i just added £3.542bn to the UK economy
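one way to arrive at figures close to these is to treat the ONS number as covering only the 60% of sex workers who are women, and to assume men contribute the same amount per head. that equal-contribution assumption is mine for the sake of the sketch; the slides do not spell out the method.

```python
ONS_GBP_BN = 5.314  # ONS figure, which includes no information on men
MALE_SHARE = 0.40   # share of prostitutes on AdultWork who are men

# assumption: men contribute the same amount per head as women
new_total = ONS_GBP_BN / (1 - MALE_SHARE)
added = new_total - ONS_GBP_BN

print(f"new total: £{new_total:.3f}bn, added: £{added:.3f}bn")
```

the result differs from the quoted £8.856bn and £3.542bn only in the last digit, which looks like rounding.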
the story got widely picked up in mainstream media
it is not as simple as all that
42% of sex worker profiles on AdultWork are male
but are there gender differences between male and female sex workers?
we decided to look into the gender differences in more detail
this is what we found
sex workers on AdultWork are heterosexual
female sex workers are younger
female sex workers stop working sooner
a female sex worker’s profile is most likely to be less than a year old
female sex workers are more active
there is more demand for female sex workers (38x)
total number of reviews for female sex workers is much greater than male sex workers (5x)
female sex workers charge more
the price differences are sort of normally distributed apart from spikes at certain price points
sex workers are deploying a psychological pricing strategy in order to maximise revenues
what is the impact on GDP?
here is a simplified version of the ONS calculations
plugging our new numbers in
£12.3bn
an even bigger number
so is *this* right?
we are getting closer
but we are not there
the biggest problem number is the “25 visits a week”
this is best illustrated by some “back of the envelope” calculations
revenue: (131 × 25) × 48 = £157,200 income per year
number of visits is the rogue variable here
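the back of the envelope sum can be rerun with other visit counts to see how sensitive it is. reading 131 as the price per visit in pounds and 48 as working weeks per year is my interpretation of the formula; the alternative visit counts are arbitrary and `annual_income` is a hypothetical helper.

```python
def annual_income(price, visits_per_week, weeks_per_year=48):
    """Yearly income in pounds for one sex worker."""
    return price * visits_per_week * weeks_per_year


base = annual_income(131, 25)

# arbitrary alternative visit counts, to show the sensitivity
for visits in (5, 10, 25):
    print(f"{visits} visits a week: £{annual_income(131, visits):,}")
```

income is directly proportional to visits per week, so any error in that one input carries straight through to the GDP figure.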
Questions?
learn more about web data www.import.io