Measuring the Digital Economy using Big Data by Prash Majmudar
Measuring the digital economy using big data
Prash Majmudar – Growth Intelligence
@growthintel
@prashmaj
Overview
• Background
• Approach (Data + Python)
• Sizing the economy - Results
• Examples
Background
Project background
• Research project supported by NESTA and Google
• Worked with independent economists at the National Institute of Economic and Social Research (NIESR) – Max Nathan, Anna Rosso
• Published report in 2013
• Further phases of work underway
Research questions
• What’s the most appropriate definition of UK ‘digital companies’? Cleaner definitions, company counts
• What do the UK’s ‘digital companies’ (really) look like? Key characteristics, focus on start-ups, innovating and ‘high-growth’ companies, spatial footprint
• What drives innovation and/or high-growth status in digital companies? Performance analysis and characteristics. Sample historic data to investigate causality
Why?
• The digital economy is poorly served by conventional definitions and datasets.
• Reliance on Companies House (historic data)
• Standard definitions used for:
– Credit / risk
– Government policy (e.g. focus on Tech City)
– Economic productivity measures
– Companies that sell / market to other companies
SIC - Standard Industrial Classification
• Brought into being in 1948
– Since 1948 the classification has been revised in 1958, 1968, 1980, 1992, 1997, and 2003
• Latest version is “SIC 2007”
– adopted by UK in 2008.
– adopted by Companies House in October 2011.
• 731 SIC codes, but not without issues
– Self-classification
– Emerging sectors e.g. no codes for Nanotechnology
SIC
• 77220 Renting of video tapes and disks
• 81223 Furnace and chimney cleaning services
• 01440 Raising of camels and camelids
• 32110 Striking of coins – Royal Mint
• 38310 Dismantling of wrecks
• 01260 Growing of oleaginous fruits
• 82990 Other business support service activities n.e.c. – 10% of Businesses
• 20% not classified
Challenge
• The ‘digital economy’ is not straightforward to define
• Refers to:
– a set of sectors,
– a set of outputs (products and services),
– and a set of inputs (production and distribution tools, underpinned by information and communication technologies).
• Mapping the digital economy onto industries is necessarily imprecise.
• Government defines it as ‘information’ and ‘digital content’ industries (BIS 2012, 2013)
• Data driven methods can provide richer, more informative and more up to date analysis.
Data driven approach
[Diagram: all companies in the economy (~3M companies), described by linked datasets and algorithms. Data sources shown: online activity, news / events, technologies, classifications, financials, TMs / patents, trade activity. Other labels: unusual data, unique data, companies, user data. User groups shown: enterprise users, tech users, medium company users.]
Approach
• Classification system is multi-dimensional:
– Sector: vertical they operate in
– Product type: principal output (services / physical goods)
– Client type: business or consumer focussed
– Sales process: how they sell / route to market
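One way to picture a multi-dimensional label is as a small record with one field per dimension. The sketch below is illustrative only: the class name, field names and example values are assumptions, not the schema used in the project.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CompanyClassification:
    """One label per dimension (hypothetical field names)."""
    sector: str         # vertical the company operates in
    product_type: str   # principal output (services / physical goods)
    client_type: str    # business or consumer focussed
    sales_process: str  # how the company sells / route to market


# Illustrative values for a single company.
example = CompanyClassification(
    sector="Oil and Energy",
    product_type="Computer Software",
    client_type="Businesses",
    sales_process="Project",
)

# Each dimension is labelled independently, so a company can sit in a
# traditional sector while still producing a digital product.
print(asdict(example))
```

Keeping the four dimensions separate means each one can be predicted by its own classifier and combined afterwards.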
[Diagram labels: IT, Film, Telco, Publishing, Oil & Gas, Architecture, Software – web, Consultancy, Hardware / tools, Electronics, Media distribution]
Approach
[Pipeline diagram. Inputs: crawl / APIs (Scrapy), crowd sourced labelled data, pre-labelled data. Stages: feature extraction / pre-processing, training set, feature generation / selection, model training. Processing: Python, scikit-learn / pandas.]
Building training sets
Crowd sourcing – create classification tasks
Expert panels
Pre-labelled data
• Using crowd sourcing
– Users follow pre-defined instructions – are rewarded for successfully completing tasks
– Can put in place qualification tests etc.
– Vote to produce labels – majority of 5
• Used expert panel when large number of classes
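The voting step can be sketched as below. This is a minimal illustration that assumes "majority of 5" means five workers label each item and a label is accepted only when at least three agree; the function name and the handling of ties are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str], n_votes: int = 5) -> Optional[str]:
    """Return the label with a strict majority, or None if there is none."""
    if len(votes) != n_votes:
        raise ValueError(f"expected {n_votes} votes, got {len(votes)}")
    label, count = Counter(votes).most_common(1)[0]
    # A strict majority of 5 means at least 3 matching votes.
    return label if count > n_votes / 2 else None


print(majority_label(["business", "business", "consumer", "business", "consumer"]))
print(majority_label(["a", "a", "b", "b", "c"]))  # no majority
```

Items that return `None` could be sent back for more votes or passed to an expert panel.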
Feature engineering
– Multiple sources of features
• Free text (News / Web)
• Structured datasets (e.g. patent filings etc.)
– Cleaning data
• Malformed HTML
• Stripping out HTML, Javascript
– Tokenising and calculating TF-IDF weights
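The cleaning and weighting steps above can be sketched with the standard library alone. This is an illustration under stated assumptions: the tokeniser is a simple lowercase regex, and the weight is the plain `tf * ln(N / df)` variant of TF-IDF, which may differ from the variant used in the project.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and JavaScript; the parser tolerates malformed HTML."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenise(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n_docs / doc_freq[t])
         for t, c in Counter(doc).items()}
        for doc in docs
    ]


# Invented pages; the first has an unclosed <html>, the second a stray <p>.
pages = [
    "<html><script>var x = 1;</script><p>Fibre cabling and switches</p>",
    "<p>Bridal lace and cotton shirts<p>",
]
docs = [tokenise(html_to_text(page)) for page in pages]
weights = tfidf(docs)
print(weights[0])
```

Terms that occur in every document, such as "and" here, get a weight of zero, which is why TF-IDF pushes sector-specific vocabulary to the top.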
Modelling
• Supervised learning classification problem
• scikit-learn (fast iteration on different models). Use of linear SVMs and processing pipelines
– One-vs-many (one-vs-rest) classifier
• Pandas plays well here – can quickly build up feature sets
• Large number of features (thousands) – linear models are fast.
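A minimal sketch of such a pipeline is shown below, assuming TF-IDF text features feeding a linear SVM. The training texts and labels are invented for illustration; `LinearSVC` handles multi-class problems with a one-vs-rest scheme by default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration only.
texts = [
    "ethernet cabling cisco switches and fibre networking",
    "wireless servers hosting and network infrastructure setup",
    "bridal lace cotton shirts and womens clothing",
    "mens footwear socks hats and fashion accessories",
    "news magazine publishing books and periodicals",
    "book publisher of journals and newspapers",
]
labels = [
    "computer networking", "computer networking",
    "fashion", "fashion",
    "publishing", "publishing",
]

# The pipeline keeps vectorising and classification together, so the
# same pre-processing is applied at training and prediction time.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC(random_state=0)),
])
model.fit(texts, labels)

predicted = model.predict([
    "cisco switches and ethernet cabling installer",
    "cotton shirts and socks",
])
print(list(predicted))
```

Swapping the final step for another estimator is a one-line change, which is what makes fast iteration on different models practical.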
[Two bar charts of highest-weighted terms per sector, taken from clf.coef_; horizontal axis from 0 to 1.4]
Computer networking: cables, smes, termination, ip, networking, server, sap, consultant, ethernet, installer, fault, cloud, remote, setup, ict, servers, copper, telecom, wireless, hardware, conferencing, desk, disruption, crm, infrastructure, hosting, fibre, cisco, switches, cabling
Fashion: luxurious, quantity, footwear, collection, cotton, courier, shirts, stockists, cart, logo, satin, wholesale, hats, nylon, wear, workwear, bridal, womens, designs, socks, accessories, lace, mens, clothing, fashion, apparel
Summary
• Use multiple datasets as an input
• Build multi-class classifiers for sector, product, client, sales process
• Apply classifiers to 3M companies in the UK
Sizing the digital economy
Challenges
• Sole traders are not observed
• Registered company addresses are not always trading addresses
• Understanding company structure
• Employee coverage is limited – gaps in data due to reliance on historic filing data traditionally
Cleaning the company data
• Aim = build a benchmarking sample
• Include only observations with SIC and GI info => sample is smaller than the ‘true’ sample
– Step 1: drop non-trading, dormant, dissolved companies or those in administration
– Step 2: drop holding companies
– Step 3: identify groups of linked companies (via name, postcode), keep the unit that reports highest revenue
• Benchmarking sample = 1.868m companies
• Validate ‘true’ sample (2.254m) vs. BPS enterprise counts
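The three cleaning steps can be sketched in pandas as follows. The column names, status values and the name-normalisation rule are all assumptions made for illustration; the slides do not say how linked companies were actually matched.

```python
import pandas as pd

# Invented records; column names and status values are hypothetical.
companies = pd.DataFrame({
    "name": ["Acme Software Ltd", "Acme Software Limited",
             "Acme Holdings Ltd", "Old Mill Ltd", "Bright Media Ltd"],
    "postcode": ["EC1A 1BB", "EC1A 1BB", "EC1A 1BB", "M1 1AE", "BN1 1AA"],
    "status": ["active", "active", "active", "dormant", "active"],
    "is_holding": [False, False, True, False, False],
    "revenue": [250000.0, 900000.0, 0.0, 10000.0, 120000.0],
})

EXCLUDED_STATUSES = {"non-trading", "dormant", "dissolved", "administration"}


def build_benchmark_sample(df):
    # Step 1: drop non-trading, dormant, dissolved or in-administration firms.
    df = df[~df["status"].isin(EXCLUDED_STATUSES)]
    # Step 2: drop holding companies.
    df = df[~df["is_holding"]]
    # Step 3: link companies by normalised name plus postcode (an assumed
    # matching rule) and keep the unit reporting the highest revenue.
    name_key = (df["name"].str.lower()
                .str.replace(r"\b(ltd|limited|plc)\b", "", regex=True)
                .str.strip())
    df = df.assign(name_key=name_key).sort_values("revenue", ascending=False)
    df = df.drop_duplicates(subset=["name_key", "postcode"], keep="first")
    return df.drop(columns="name_key").sort_index()


sample = build_benchmark_sample(companies)
print(sample[["name", "revenue"]])
```

Sorting by revenue before `drop_duplicates` is what makes `keep="first"` select the highest-revenue unit in each linked group.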
Identifying ‘digital companies’
• Aim = more robust definition, compare against SIC-based
• Use ‘sector’ and ‘product’ categories
• Intuition = we want companies in ‘digital’ sectors that also do ‘digital’ things (e.g. digital publishing, media, design …)
– Step 1: Identify GI sector and product categories
– Steps 2-5: clean out ‘non-digital’ GI sector and product combinations
– Step 6: Count companies
– E.g. process designed to exclude large proportion of architecture firms, except those whose principal product type is software for CAD / technical drawing
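The sector-by-product idea can be sketched as a lookup against a set of allowed "cells". The cells and companies below are hypothetical examples chosen to mirror the architecture case; they are not the project's actual definition.

```python
# Hypothetical set of (sector, product) cells treated as digital.
DIGITAL_CELLS = {
    ("computer software", "software web application"),
    ("publishing", "digital media"),
    ("architecture", "software desktop or server"),
}


def is_digital(sector, product):
    """A company counts as digital only if its sector/product cell is listed."""
    return (sector.lower(), product.lower()) in DIGITAL_CELLS


# Invented companies: (name, sector, product).
companies = [
    ("CAD Tools Co", "Architecture", "Software desktop or server"),
    ("Smith Architects", "Architecture", "Consultancy"),
    ("News Online", "Publishing", "Digital media"),
]

digital = [name for name, sector, product in companies
           if is_digital(sector, product)]
print(len(digital), digital)
```

Because membership is decided on the pair, an architecture firm selling CAD software is kept while an architecture consultancy is excluded.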
Company counts               Observations        %
A. SIC 07
   Other                        1,681,151    89.96
   Digital Economy                187,616    10.04
B. GI sector and product
   Other                        1,599,072    85.57
   Digital Economy                269,695    14.43
Note: Panel A follows the BIS (2009) definition. Panel B defines the digital economy using GI digital sector by digital product "cells".
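The percentages in the table follow directly from the counts, and the same arithmetic gives the size of the gap between the two definitions. The short check below uses only the figures from the table; the variable names are illustrative.

```python
counts = {
    "SIC 07": {"Other": 1_681_151, "Digital Economy": 187_616},
    "GI sector and product": {"Other": 1_599_072, "Digital Economy": 269_695},
}

shares = {}
for panel, rows in counts.items():
    total = sum(rows.values())  # both panels sum to the benchmarking sample
    shares[panel] = round(100 * rows["Digital Economy"] / total, 2)
    print(f"{panel}: {rows['Digital Economy']:,} of {total:,} = {shares[panel]}%")

sic = counts["SIC 07"]["Digital Economy"]
gi = counts["GI sector and product"]["Digital Economy"]
extra = gi - sic
uplift = round(100 * (gi / sic - 1), 1)
print(f"GI definition finds {extra:,} more digital companies ({uplift}% more)")
```

Both panels sum to 1,868,767 companies, which matches the benchmarking sample of about 1.868m.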
Classifications:
Sector – Oil and Energy
Product – Computer Software
Client – Businesses
Sales process – Project
Based in Aberdeen
SIC Code: 82990 - Other business support service activities
Company counts are highest in London.
But we also find large counts in Manchester, Birmingham, Bristol and Brighton...
... as well as the wider Greater South East.
[Bar chart by area; horizontal axis from 0 to 1.8. Areas listed: Livingston & Bathgate, Crawley, Oxford, Southampton, Coventry, Middlesbrough & Stockton, Cheltenham & Evesham, Swindon, Cambridge, Andover, Brighton, Bournemouth, Wycombe & Slough, Luton & Watford, Stevenage, Guildford & Aldershot, Poole, Milton Keynes & Aylesbury, Newbury, Reading & Bracknell, Basingstoke, Guildford]
Sector (rows) by product type (columns)
Product types: consultancy; custom software development; digital media; media distribution; peer to peer communications; photography; printing services; software desktop or server; software web application; web hosting
animation 1
architecture 178
computer games 2 80
computer hardware 12 7 1
computer network security 7 1
computer networking 23 5
computer software 88 459 70
defense space 37
electrical electronic manufacturing 13 72 1
entertainment film production 6 33
financial services 820
information services 8 3
information technology 2756 6 94
internet 14 15 1 16
marketing advertising 192
photography 74 7 1
printing 12 2 63
publishing 29
semiconductors 3
telecommunications 58 9 31 1 1
Additional findings
Digital companies’ revenue growth in 2010-2012 is faster than non-digital ...
                   A. Annual Revenues         B. Annual Revenue Growth
                   mean          median       mean      median
Other              18,380,097    110,048      15.68     1.70
Digital Economy    10,547,218    123,388      20.21     4.17
Note: Sub-sample of those companies who report revenue. Companies House average revenues are averaged over the period 2010 to 2012. If for each company there is more than one observation, only the most recent is kept. Average annual revenue growth is computed on a smaller sample, as information for at least two consecutive years is needed.
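The note explains why the growth figures rest on a smaller sample: a growth rate exists only where a company reports revenue in two consecutive years. A minimal sketch follows, assuming growth is the simple year-on-year percentage change; the function names and figures are invented.

```python
from typing import Dict, Optional


def annual_growth(revenue_by_year: Dict[int, float]) -> Dict[int, float]:
    """Percentage growth for each year with a figure for the previous year."""
    growth = {}
    for year, revenue in revenue_by_year.items():
        previous = revenue_by_year.get(year - 1)
        if previous:  # skip years with a missing or zero prior figure
            growth[year] = (revenue - previous) / previous * 100
    return growth


def most_recent(values_by_year: Dict[int, float]) -> Optional[float]:
    """Keep only the latest observation, as described in the note."""
    if not values_by_year:
        return None
    return values_by_year[max(values_by_year)]


full_history = {2010: 200.0, 2011: 250.0, 2012: 375.0}
gappy_history = {2010: 100.0, 2012: 300.0}  # no consecutive years

print(annual_growth(full_history), most_recent(annual_growth(full_history)))
print(annual_growth(gappy_history))
```

A company with a gap in its filings, like the second one, contributes to the revenue averages but drops out of the growth sample.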
... and digital employers have higher average staff levels.
Employees per company        Mean     Median    % of all employment
A. Official / SIC07
   Other                     20.94    4         94.92
   Digital Economy           17.23    3          5.08
B. GI sector and product
   Other                     20.40    4         88.67
   Digital Economy           23.37    4         11.33
Note: sub-sample of firms reporting employment to Companies House. Data is averaged over 2010-2012.
Further work
• Drivers of innovation / growth
• Use of ‘tags’ to provide further descriptive analysis of digital companies
• Unsupervised approach to identify clusters
• Extension to sole traders
• Extending this approach to Europe – e.g. Belgium, France, Germany, Italy
Questions?
@growthintel
@prashmaj
SIC – ICT Sector
28230 MANUFACTURE OF OFFICE MACHINERY AND COMPUTERS
26200 MANUFACTURE OF COMPUTERS AND OTHER INFORMATION PROCESSING EQUIPMENT
27320 INSULATED WIRE AND CABLE
26110 ELECTRONIC VALVES AND TUBES AND OTHER ELECTRONIC COMPONENTS
33200 TELEVISION, RADIO TRANSMITTERS AND APPARATUS FOR TELEPHONY AND TELEGRAPHY
26400 TELEVISION AND RADIO RECEIVERS, SOUND OR VIDEO RECORDING OR PRODUCING APPARATUS AND ASSOCIATED GOODS
26511 INSTRUMENTS AND APPLIANCES FOR MEASURING, CHECKING, TESTING AND NAVIGATING AND OTHER PURPOSES
26512 INDUSTRIAL PROCESS EQUIPMENT
46439 WHOLESALE OF ELECTRICAL HOUSEHOLD APPLIANCES
46510 WHOLESALE OF COMPUTERS, COMPUTER PERIPHERAL EQUIPMENT AND SOFTWARE
46660 WHOLESALE OF OTHER OFFICE MACHINERY AND EQUIPMENT
46520 WHOLESALE OF OTHER ELECTRONIC PARTS AND EQUIPMENT
46690 WHOLESALE OF OTHER MACHINERY FOR USE IN INDUSTRY, TRADE AND NAVIGATION
61900 TELECOMMUNICATIONS SERVICES
77330 RENTING OF OFFICE MACHINERY AND EQUIPMENT INCLUDING COMPUTERS
62020 COMPUTER HARDWARE CONSULTANCY
95110 MAINTENANCE AND REPAIR OF OFFICE, ACCOUNTING AND COMPUTING MACHINERY
62090 OTHER COMPUTER RELATED ACTIVITIES
SIC – Digital content industries
58110 PUBLISHING OF BOOKS
58130 PUBLISHING OF NEWSPAPERS
58142 PUBLISHING OF JOURNALS AND PERIODICALS
59200 PUBLISHING OF SOUND RECORDINGS
58190 OTHER PUBLISHING
18110 PRINTING OF NEWSPAPERS
18129 PRINTING N.E.C
18130 PRE-PRESS ACTIVITIES
18130 ANCILLARY ACTIVITIES RELATING TO PRINTING
18201 REPRODUCTION OF SOUND RECORDING
18202 REPRODUCTION OF VIDEO RECORDING
18203 REPRODUCTION OF COMPUTER MEDIA
58290 PUBLISHING OF SOFTWARE
62020 OTHER SOFTWARE CONSULTANCY AND SUPPLY
63110 DATA PROCESSING
63110 DATABASE ACTIVITIES
73110 ADVERTISING
74209 PHOTOGRAPHIC ACTIVITIES
59111 MOTION PICTURE AND VIDEO PRODUCTION
59131 MOTION PICTURE AND VIDEO DISTRIBUTION
59140 MOTION PICTURE PROJECTION
59113 RADIO & TV (DCMS ESTIMATES)
63910 NEWS AGENCY ACTIVITIES