Andrew Fogg Import.io Web data. Challenges and opportunities for official statistics

Web data. Challenges and opportunities for official statistics (OECD)



we have come a long way since this was the gold standard of data recording

the web is the biggest repository of data that we have ever created

but there are problems

web data is trapped inside web pages and getting at it is difficult

you can use an api

if there is one available

you can write a scraper

but that is painful


both require a friendly developer

and a lot of time

for every single website

4 hours to 4 days
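To make the pain concrete, here is a minimal sketch of a hand-written scraper using only Python's standard library. The HTML snippet, the class names `product`, `name` and `price`, and the `ProductParser` helper are all hypothetical; a real site would use different markup, which is why code like this has to be rewritten for every single website.

```python
from html.parser import HTMLParser

# Hypothetical markup for ONE website; every other site would differ.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Drive A</span><span class="price">49.99</span></li>
  <li class="product"><span class="name">Drive B</span><span class="price">89.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects (name, price) rows from the hypothetical markup above."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None    # field that the next text node belongs to
        self._current = {}    # fields collected for the product being parsed

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self._current = {}
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        if tag == "li" and self._current:
            self.rows.append(
                (self._current["name"], float(self._current["price"]))
            )
            self._current = {}


parser = ProductParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

Even this toy version hard-codes three assumptions about the page structure; any redesign of the site breaks it.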

there are technology companies that now solve this problem

import.io allows you to turn websites into rows and columns of data without writing any code

we even have a version of import.io that requires no human training

web data is available now and we need to think about how we can leverage it for official statistics

i want to show you 3 use cases of web data from business

that lend themselves to areas of official statistics

pricing strategy

sell hard disk drives

they need to know how to price them

they pull web data on all the hard disk drives

from all the brands

in all of the markets that they operate

price data

product properties


analysis of this data

allows them to determine overall trends and make decisions about…

price per gigabyte


premiums for certain features


when to discount


it gives them an overall picture of the pricing and competitive landscape
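The kind of analysis described above can be sketched in a few lines once the web data is in rows and columns. Everything in this example is illustrative: the rows, the column names and the `has_feature` flag are hypothetical, not data from the talk.

```python
from statistics import mean

# Hypothetical rows, shaped like the output of a product-listing extraction.
drives = [
    {"brand": "A", "capacity_gb": 1000, "price": 50.0, "has_feature": False},
    {"brand": "B", "capacity_gb": 2000, "price": 80.0, "has_feature": False},
    {"brand": "C", "capacity_gb": 500, "price": 100.0, "has_feature": True},
    {"brand": "D", "capacity_gb": 1000, "price": 180.0, "has_feature": True},
]


def price_per_gb(drive):
    """Price per gigabyte for a single product row."""
    return drive["price"] / drive["capacity_gb"]


def mean_price_per_gb(rows, has_feature):
    """Average price per gigabyte over rows with or without the feature."""
    return mean(price_per_gb(d) for d in rows if d["has_feature"] == has_feature)


base_ppg = mean_price_per_gb(drives, has_feature=False)
feature_ppg = mean_price_per_gb(drives, has_feature=True)

# Premium for the feature, expressed as a multiple of the base price per GB.
feature_premium = feature_ppg / base_ppg

print(f"base: {base_ppg:.3f}/GB, with feature: {feature_ppg:.3f}/GB, "
      f"premium: {feature_premium:.1f}x")
```

The same grouping, run over every brand and market and repeated over time, would give the trend and discount signals mentioned above.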

the obvious application of this for official statistics


real time data

you don’t have to sample: you can go for the whole population


having said that: not all consumption is online and not all goods are online (although the share is growing)




and you can often get pricing for products even when purchase never happens online




recruitment leads

one of the world’s biggest recruitment agencies

when a job appears on a careers page

they want it in salesforce

so this guy can do his job

they tried writing custom code

for 4,000 websites

every website is different

how long would that take

4 hours to 4 days

8 years to 61 years

for 4,000 websites let’s call it 30 man years

with import.io...

web scrapers vs import.io

it took 5 people 2 weeks = 0.2 man years
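The effort figures above can be reproduced with simple arithmetic. This is a sketch under stated assumptions: an 8-hour working day and 261 working days per year are not given in the talk, but they yield roughly the quoted 8 and 61 years.

```python
SITES = 4_000
HOURS_PER_DAY = 8             # assumption: not stated in the talk
WORKING_DAYS_PER_YEAR = 261   # assumption: not stated in the talk
WEEKS_PER_YEAR = 52


def person_years_from_hours(hours_per_site):
    """Total effort in person-years if each site takes this many hours."""
    return SITES * hours_per_site / (HOURS_PER_DAY * WORKING_DAYS_PER_YEAR)


def person_years_from_days(days_per_site):
    """Total effort in person-years if each site takes this many days."""
    return SITES * days_per_site / WORKING_DAYS_PER_YEAR


low = person_years_from_hours(4)    # best case: 4 hours per site
high = person_years_from_days(4)    # worst case: 4 days per site
importio = 5 * 2 / WEEKS_PER_YEAR   # 5 people for 2 weeks

print(f"custom scrapers: {low:.1f} to {high:.1f} person-years")
print(f"import.io:       {importio:.2f} person-years")
```

Against the talk's round figure of 30 man years, 0.2 man years is a reduction of about 150 times.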

sex, drugs & gdp

you may be familiar with this story but there are new details, i promise you

let’s start with a question

how much does prostitution contribute to the UK economy?




0.4% GDP

£5.314bn

that is a big number

bigger than the entire economy of Moldova

bigger than 28% of global economies

but it is only 45% of apple’s Q1 profits (apple posted the biggest quarterly profit in history)

restatement of national accounts in line with new EU accounting guidelines

“illegal transactions to which all parties consent should be included in national accounts”

this includes prostitution and sale of illegal drugs

how did the ONS calculate that number for prostitution?

very simple

there are lots of assumptions and the number of prostitutes relies on one study

the study tried to count all the prostitutes in London over 6 months in 2004

phoned every brothel and escort agency they could find advertised

estimated count of off-street prostitutes in London (plus estimate of on-street prostitutes from police)

7,000 (+115)

scaled out to the UK using population statistics

how reliable is that number? especially the scaling assumption?
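The scaling step amounts to assuming the same rate per head of population everywhere in the UK. A minimal sketch follows; the population figures are round illustrative values chosen for this example, not the inputs the ONS used, so the result is not the ONS estimate.

```python
def scale_to_uk(london_count, london_population, uk_population):
    """Scale a London count to the UK, assuming the same rate per head."""
    return london_count * uk_population / london_population


# From the talk: 7,000 off-street plus 115 on-street.
london_count = 7_000 + 115

# Round, illustrative populations (assumptions, NOT the ONS inputs).
LONDON_POP = 8_000_000
UK_POP = 64_000_000

uk_estimate = scale_to_uk(london_count, LONDON_POP, UK_POP)
print(f"scaled UK estimate: {uk_estimate:,.0f}")
```

The whole UK figure rests on a single multiplier, which is why the scaling assumption deserves scrutiny.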

time to look at some web data

many activities associated with prostitution are illegal in the UK

but paying for sex is actually legal

as a result prostitution sits in a grey area and is widely marketed on the web

built an API to the AdultWork directory using Import.io

AdultWork is ranked the 174th most-visited site in the UK by Alexa

searched for all sex workers based in the UK and downloaded the data

categorised each sex worker by location (London / not London)

counted the number of London based prostitutes and estimated the UK total using the scaling assumption

then counted the number of UK based prostitutes on AdultWork

it seems that the scaling assumption is reasonable
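The check described in the steps above can be sketched as follows. The profile rows are toy data invented for this example, and the London population share is a round assumption; the sketch shows only the shape of the comparison and says nothing about the real directory.

```python
from collections import Counter

# Toy rows standing in for downloaded profiles (hypothetical data).
profiles = (
    [{"location": "London"}] * 10
    + [{"location": "Leeds"}] * 40
    + [{"location": "Glasgow"}] * 30
)

LONDON_SHARE_OF_UK_POP = 1 / 8   # assumption: round illustrative value


def categorise(profile):
    """Label each profile as London or not London."""
    return "London" if profile["location"] == "London" else "not London"


counts = Counter(categorise(p) for p in profiles)

# UK total predicted from the London count alone, via the scaling assumption.
predicted_uk = counts["London"] / LONDON_SHARE_OF_UK_POP

# UK total observed directly in the data.
actual_uk = len(profiles)

ratio = predicted_uk / actual_uk
print(f"predicted {predicted_uk:.0f}, actual {actual_uk}, ratio {ratio:.2f}")
```

A ratio close to 1 supports the scaling assumption; a ratio far from 1 would suggest London is not representative.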

so the ONS numbers are good?

not quite

what is wrong with this picture?

there are no men

“no information is included regarding men working in the sex industry”

40% of the prostitutes on AdultWork are men (in London and across UK)

recalculating GDP to take into account male prostitution gives a dramatic increase

£8.856bn

new number

0.6% GDP

the new total is bigger than the entire economy of Iceland

bigger than 37% of global economies

but it is still only 75% of apple’s Q1 profits

this is only a preliminary analysis

but i just added £3.542bn to the UK economy
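The slides do not spell out the recalculation. One reading that reproduces the quoted figures to within about £1m is to treat the ONS figure as covering only the female 60% of workers and to assume that male workers contribute the same average amount. Both the formula and the `include_men` helper are assumptions made for this sketch.

```python
ONS_ESTIMATE_BN = 5.314   # from the talk: ONS figure, no men included
MALE_SHARE = 0.40         # from the talk: share of AdultWork profiles


def include_men(female_only_bn, male_share):
    """Gross up a female-only total, assuming equal average contribution."""
    return female_only_bn / (1 - male_share)


new_total_bn = include_men(ONS_ESTIMATE_BN, MALE_SHARE)
added_bn = new_total_bn - ONS_ESTIMATE_BN

print(f"new total: £{new_total_bn:.3f}bn, added: £{added_bn:.3f}bn")
```

The equal-contribution assumption is exactly what the later slides on gender differences call into question.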

the story got widely picked up in mainstream media


it is not as simple as all that

42% of sex worker profiles on adultwork are male

but are there gender differences between male and female sex workers?

we decided to look into the gender differences in more detail

this is what we found

sex workers on adultwork are heterosexual

female sex workers are younger

female sex workers stop working sooner

a female sex worker’s profile is most likely to be less than a year old

female sex workers are more active

there is more demand for female sex workers (38x)

total number of reviews for female sex workers is much greater than male sex workers (5x)

female sex workers charge more

the price differences are sort of normally distributed apart from spikes at certain price points

sex workers are deploying a psychological pricing strategy in order to maximise revenues

what is the impact on gdp

here is a simplified version of the ons calculations

plugging our new numbers in

£12.3bn

an even bigger number

so is *this* right?

we are getting closer

but we are not there

the biggest problem number is the “25 visits a week”

this is best illustrated by some “back of the envelope” calculations

revenue (131*25)*48 = £157,200 pa income

number of visits is the rogue variable here
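The back-of-the-envelope calculation is easy to rerun with different visit counts. In this sketch 131 is read as the price in pounds per visit and 48 as the number of working weeks per year; both readings are assumptions, as are the alternative visit counts of 5 and 10.

```python
PRICE_PER_VISIT = 131   # read as £ per visit (assumption)
WORKING_WEEKS = 48      # read as working weeks per year (assumption)


def annual_income(visits_per_week):
    """Annual income in pounds for a given number of visits per week."""
    return PRICE_PER_VISIT * visits_per_week * WORKING_WEEKS


# 25 is the figure from the talk; 5 and 10 are illustrative alternatives.
for visits in (5, 10, 25):
    print(f"{visits:>2} visits/week -> £{annual_income(visits):,} pa")
```

Income is directly proportional to the number of visits, so any error in that one figure passes straight through to the GDP estimate.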


learn more about web data www.import.io