Towards the Geo-computation of Real-Time Geodemographics, Muhammad Adnan


Page 1: Towards the Geo-computation of Real-Time Geodemographics

Towards the Geo-computation of Real-Time Geodemographics

Muhammad Adnan

Page 2: Towards the Geo-computation of Real-Time Geodemographics

Introduction

• BSc. in Information Technology from N.E.D. University, Karachi, Pakistan

• MSc. in Software Engineering from Queen Mary, University of London

• Working as Computing and Research Assistant at SPLINT, Dept. of Geography, UCL

• Part-time PhD student

• Research interest in optimizing the performance of clustering algorithms

Page 3: Towards the Geo-computation of Real-Time Geodemographics

Recent Projects at UCL

• Worldnames Profiler (www.publicprofiler.org/worldnames)

• National Trust Names (www.nationaltrustnames.org.uk)

• Onomap Website (www.onomap.org)

• Google Maps Shortest Path (www.publicprofiler.org/travel)

• Google Street View (www.publicprofiler.org/streetview)

• A bit of work for London Profiler (www.londonprofiler.org)

Page 4: Towards the Geo-computation of Real-Time Geodemographics

Worldnames Profiler

• Online at www.publicprofiler.org/worldnames

• Extraction of data from 22 telephone directories for different countries

• Cleansing names and standardising their format

• Geo-coding individual households using different gazetteers

• Oracle database with 1 billion names across 26 different countries

• Geo-visualisation using Flash maps instead of slippy maps (Google Maps, ArcGIS Server, etc.)

Page 5: Towards the Geo-computation of Real-Time Geodemographics

Worldnames Profiler

• Increased interest in creating visualisation techniques and dealing with text mining algorithms

• PhD focus is on the creation of bespoke real-time geodemographic classifications

Concentration of surname “SINGLETON” in North America

Concentration of surname “SINGLETON” in the World

Page 6: Towards the Geo-computation of Real-Time Geodemographics

What will I be talking about today?

• Geodemographic classification

  • Introduction

  • What data goes in?

  • Standardising the data

  • Clustering the data

  • Naming the clusters

• Real time Geodemographics

  • Need for real time Geodemographics?

  • What are real time Geodemographics?

  • Computational challenges

• Clustering algorithms

  • K-means

  • PAM (Partitioning Around Medoids)

  • CLARA (Clustering Large Applications)

  • GA (Genetic Algorithm)

  • Comparison of clustering algorithms

• Creating bespoke Geodemographics…demo

• Parallel Processing in real time Geodemographics

Page 7: Towards the Geo-computation of Real-Time Geodemographics

What is a Geodemographic classification?

• “A segmentation system which groups similar neighbourhoods into categories, based on the characteristics of their residents”. (Vickers, 2006)

• Generalised classification of areas based on characteristics of population.

Page 8: Towards the Geo-computation of Real-Time Geodemographics

What data goes in?

• Census data

  • Demographic attributes (Age, Ethnicity, Country of birth etc.)

  • Household composition (Family type, Family size etc.)

  • Housing characteristics (Tenure, Type & Size etc.)

  • Socio-economic attributes (Education, Car ownership etc.)

  • Employment attributes (Economic activity, Economic class etc.)

• Lifestyle Surveys

• Credit cards data

• Commercial companies use a mix of all these data sources

• Academic research has been based on census data only

Page 9: Towards the Geo-computation of Real-Time Geodemographics

Standardising the data

• Z-Scores

  • Widely used variable normalisation technique

  • Can create outliers in the datasets

• Range Standardisation

  • Standardises values to a range of 0-1

  • Can erase interesting patterns in the data

• Principal Component Analysis

  • Reduces the dimensions of a data set

  • Focuses on the part of the dataset having maximum variance

  • Can erase interesting patterns in the data
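A minimal sketch of the first two techniques on this slide, using only the Python standard library. The variable name and its five values are made up for illustration; they are not taken from the presentation's data. The sketch shows how a single extreme area keeps a large z-score, while range standardisation squeezes the remaining areas into a narrow band near 0.

```python
import math

def z_scores(values):
    """Centre on the mean and scale by the population standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

def range_standardise(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical variable: % of households without a car in five areas.
no_car_pct = [10.0, 12.0, 11.0, 13.0, 60.0]

z = z_scores(no_car_pct)
r = range_standardise(no_car_pct)

print([round(v, 2) for v in z])
print([round(v, 2) for v in r])
```

In this toy input the first four areas end up between 0 and 0.06 after range standardisation, which is one way differences between ordinary areas can be flattened by a single extreme value.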

Page 10: Towards the Geo-computation of Real-Time Geodemographics

Clustering the data

• K-means clustering algorithm is used to cluster data into homogeneous groups

• Multiple runs of k-means due to its instability

  • 10,000 times (Singleton, 2008)

• Different classification systems produce different numbers of groups

  • MOSAIC classifies data into 12 lifestyle groups

  • ACORN classifies data into 17 lifestyle groups

Page 11: Towards the Geo-computation of Real-Time Geodemographics

Naming the clusters

• MOSAIC classifies data into 12 lifestyle groups:

1. High Income Families

2. Suburban Semis

3. Blue Collar Owners

4. Low Rise Council

5. Council Flats

6. Victorian Low Status

7. Town Houses and Flats

8. Stylish Singles

9. Independent Elders

10. Mortgaged Families

11. Country Dwellers

12. Institutional Areas

Page 12: Towards the Geo-computation of Real-Time Geodemographics

Need for real time Geodemographics

• Current classifications are created using static data sources

• Rate and scale of current population change is making large surveys (census) increasingly redundant

  • Significant hidden value in transactional data

• Data is increasingly available in near real time

  • E.g. ONS (Office for National Statistics) NESS API

• Application specific (bespoke) classifications have demonstrated utility (Longley & Singleton, 2009)

Page 13: Towards the Geo-computation of Real-Time Geodemographics

What are real time Geodemographics?

Specification Estimation Testing

Page 14: Towards the Geo-computation of Real-Time Geodemographics

Computational challenges

• Integration of large and possibly disparate databases

  • E.g. NHS data; Census data

• Data normalisation and optimization for fast transactions

• Minimizing computational time of clustering algorithms (very important)

• Common protocol

  • XML (SOAP)

• Use of non-traditional data sources (Singleton, 2008)

  • E.g. Flickr; Facebook

Page 15: Towards the Geo-computation of Real-Time Geodemographics

Important Challenge: Selection of clustering algorithm

• K-Means

• PAM (Partitioning Around Medoids)

• CLARA (Clustering Large Applications)

• GA (Genetic Algorithm)

Page 16: Towards the Geo-computation of Real-Time Geodemographics

k-means

• Attempts to find cluster centroids by minimising the within-cluster sum of squared distances.

• K-means is unstable due to its initial seed assignment.

  • Sensitive to outliers in the data set.

• Creating a Geodemographic classification requires running the algorithm multiple times.

  • 10,000 times (Singleton, 2008)

• Computationally expensive in a real time environment.
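The repeated-run procedure described on this slide can be sketched as follows. This is an illustrative pure-Python version of Lloyd's k-means with random seed points, not the implementation used in the talk. The toy data, the function names and the choice of 20 runs are assumptions made for the example. Each run returns its within-cluster sum of squares (WSS), and the run with the lowest WSS is kept.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from randomly chosen seed points.
    Returns (labels, wss)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                   # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, wss

def best_of(points, k, runs, seed=0):
    """Repeat k-means with different seeds and keep the lowest-WSS run."""
    rng = random.Random(seed)
    results = [kmeans(points, k, rng) for _ in range(runs)]
    best_labels, best_wss = min(results, key=lambda r: r[1])
    return best_labels, best_wss, [r[1] for r in results]

# Toy data (hypothetical): three groups of 30 points in two dimensions.
data_rng = random.Random(42)
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in [(0, 0), (10, 0), (5, 9)] for _ in range(30)]

best_labels, best_wss, all_wss = best_of(points, k=3, runs=20)
print(round(best_wss, 1), round(max(all_wss), 1))
```

The spread between the smallest and largest WSS across runs is the instability the slide refers to; the cost of the procedure grows linearly with the number of runs.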

Page 17: Towards the Geo-computation of Real-Time Geodemographics

k-means variants

• Hartigan’s k-means algorithm

• Lloyd’s k-means algorithm

• Forgy’s k-means algorithm

• MacQueen’s k-means algorithm

Page 18: Towards the Geo-computation of Real-Time Geodemographics

k-means variants

Clustering efficiency of k-means variants

Page 19: Towards the Geo-computation of Real-Time Geodemographics

K-means (100 runs of k-means on OAC data set for k=4)

Page 20: Towards the Geo-computation of Real-Time Geodemographics

An example of bad clustering result (K-means)

Page 21: Towards the Geo-computation of Real-Time Geodemographics

An example of bad clustering result (K-means)

Page 22: Towards the Geo-computation of Real-Time Geodemographics

An example of bad clustering result (K-means)

Page 23: Towards the Geo-computation of Real-Time Geodemographics

Alternate Clustering Algorithms

• PAM (Partitioning around medoids)

• CLARA (Clustering Large Applications)

• GA (Genetic Algorithm)

Page 24: Towards the Geo-computation of Real-Time Geodemographics

Alternate Clustering Algorithms…

• PAM (Partitioning around medoids)

• It tries to minimize the sum of dissimilarities of the data points to their cluster centers.

• Less sensitive to outliers than K-means.

• Cannot handle larger data sets.

• Produces better results than k-means for smaller data sets.
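A small sketch of the idea behind PAM, assuming a precomputed dissimilarity matrix. It uses a greedy build step followed by swaps that are accepted as soon as they reduce the total dissimilarity, which is a simplification of the original swap rule. The seven one-dimensional values are invented for the example.

```python
def total_cost(dist, medoids):
    """Sum over all points of the dissimilarity to the nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)

def pam(dist, k):
    """Simplified PAM on a full n x n dissimilarity matrix (list of lists).
    Returns (medoid indices, labels, total cost)."""
    n = len(dist)
    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates,
                           key=lambda i: total_cost(dist, medoids + [i])))
    # SWAP: exchange a medoid for a non-medoid while that reduces the cost.
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda c: row[medoids[c]]) for row in dist]
    return medoids, labels, best

# Hypothetical one-dimensional data with one extreme value.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 100.0]
dist = [[abs(a - b) for b in values] for a in values]

medoids, labels, cost = pam(dist, k=3)
print(sorted(values[m] for m in medoids), cost)
```

Because every cluster centre is an actual data point, the extreme value ends up as its own medoid instead of dragging a mean towards it. The full matrix needs n x n entries, which is one reason the approach does not scale to large data sets.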

Page 25: Towards the Geo-computation of Real-Time Geodemographics

Alternate Clustering Algorithms…

• CLARA (Clustering Large Applications)

• It draws multiple samples of the dataset, applies PAM to each sample and returns the best result.

• Can handle large data sets as it operates on samples rather than on the actual data set.

• Sometimes it gives bad clustering results due to its procedure of sample selection.
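The sampling procedure can be sketched roughly as below. The inner routine is a small swap-based k-medoids used as a stand-in for PAM, and the number of samples, the sample size and the toy data are illustrative assumptions rather than values from the talk. Each candidate set of medoids is found on a sample but judged on the full data set.

```python
import math
import random

def nearest_cost(points, medoids, dist):
    """Total dissimilarity of every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(sample, k, dist):
    """Small swap-based k-medoids (a simplified stand-in for PAM)."""
    medoids = sample[:k]
    best = nearest_cost(sample, medoids, dist)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in sample:
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = nearest_cost(sample, trial, dist)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, dist, n_samples=5, sample_size=46, seed=0):
    """Cluster several random samples and keep the medoids that give the
    lowest cost when evaluated on the FULL data set."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = k_medoids(sample, k, dist)
        cost = nearest_cost(points, medoids, dist)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Toy data (hypothetical): three groups of 200 points in two dimensions.
data_rng = random.Random(1)
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in [(0, 0), (10, 0), (5, 9)] for _ in range(200)]

medoids, cost = clara(points, k=3, dist=math.dist)
print(medoids, round(cost, 1))
```

The expensive swap search only ever sees `sample_size` points, so the cost per sample stays fixed as the data set grows. The weakness noted on the slide also shows up here: if no sample happens to contain good representatives of every group, the returned medoids will be poor.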

Page 26: Towards the Geo-computation of Real-Time Geodemographics

Alternate Clustering Algorithms…

• GA (Genetic Algorithm)

• It is inspired by models of biological evolution. It produces results through a breeding procedure.

• Creates hierarchies of generations and then merges the hierarchies into homogeneous groups having similar characteristics.

• Can be time consuming due to the creation of generation hierarchies.
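The slides do not specify how the genetic algorithm encodes a clustering, so the following is only one possible sketch of the breeding idea, under stated assumptions: each individual is a set of k candidate centroids, fitness is the within-cluster sum of squares, the fitter half of each generation survives, and children are produced by one-point crossover plus a small random mutation. All parameter values and the toy data are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def wss(points, centroids):
    """Within-cluster sum of squares when each point joins its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def ga_cluster(points, k, pop_size=20, generations=30, mutation_sd=0.5, seed=0):
    """Evolve sets of k centroids (k >= 2). Returns (best centroids, its WSS)."""
    rng = random.Random(seed)
    population = [rng.sample(points, k) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda ind: wss(points, ind))
        parents = population[: pop_size // 2]      # selection: fitter half survives
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, k)              # one-point crossover
            child = mum[:cut] + dad[cut:]
            i = rng.randrange(k)                   # mutate one centroid slightly
            child[i] = tuple(x + rng.gauss(0, mutation_sd) for x in child[i])
            children.append(child)
        population = parents + children
    best = min(population, key=lambda ind: wss(points, ind))
    return best, wss(points, best)

# Toy data (hypothetical): three groups of 30 points in two dimensions.
data_rng = random.Random(3)
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in [(0, 0), (10, 0), (5, 9)] for _ in range(30)]

start_best, start_cost = ga_cluster(points, k=3, generations=0)
final_best, final_cost = ga_cluster(points, k=3, generations=30)
print(round(start_cost, 1), round(final_cost, 1))
```

Every generation evaluates the fitness of the whole population against all points, which illustrates why this family of methods can be time consuming compared with a single k-means run.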

Page 27: Towards the Geo-computation of Real-Time Geodemographics

Comparing the computational efficiency of

• K-means

• CLARA

• GA

By using three data normalisation techniques

• Z-Scores

• Range Standardisation

• Principal Component Analysis

Page 28: Towards the Geo-computation of Real-Time Geodemographics

Data normalisation techniques

• Z-Scores

  • Widely used variable normalisation technique

  • Can create outliers in the datasets

• Range Standardisation

  • Standardises values to a range of 0-1

  • Can erase interesting patterns in the data

• Principal Component Analysis

  • Reduces the dimensions of a data set

  • Focuses on the part of the dataset having maximum variance

  • Can erase interesting patterns in the data

Page 29: Towards the Geo-computation of Real-Time Geodemographics

Comparing computational efficiency (Z-scores)

PAM, and GA on the three geographic aggregations of a dataset covering London.

OA (Output Area) level results

LSOA (Lower Super Output Area) level results

Ward level results

Page 30: Towards the Geo-computation of Real-Time Geodemographics

Comparing computational efficiency (Range Standardisation)

PAM, and GA on the three geographic aggregations of a dataset covering London.

OA (Output Area) level results

LSOA (Lower Super Output Area) level results

Ward level results

Page 31: Towards the Geo-computation of Real-Time Geodemographics

Comparing computational efficiency (PCA)

PAM, and GA on the three geographic aggregations of a dataset covering London.

OA (Output Area) level results

LSOA (Lower Super Output Area) level results

Ward level results

Page 32: Towards the Geo-computation of Real-Time Geodemographics

Algorithm Stability (w.r.t. Computational time)

Running k-means on OA (Output Area) for 120 times on each iteration

Running CLARA on OA (Output Area) for 120 times on each iteration

Running GA on OA (Output Area) for 120 times on each iteration

Page 33: Towards the Geo-computation of Real-Time Geodemographics

K-means and Principal Component Analysis

• PCA can be used to facilitate K-means clustering by reducing dimensions.

(Ding, C., He, X., 2004)

K-means result for 41 “OAC variables”

K-means result for 26 “OAC Principal Components”

K=4 (99% similar)
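The comparison on this slide can be illustrated with a toy experiment: cluster the original variables, cluster a reduced set of principal components, and measure how many areas receive the same cluster in both. The sketch below is an assumption-laden miniature (6 made-up variables reduced to 3 components, k=4, fixed starting points), not a reproduction of the OAC result. PCA is done with power iteration and deflation so that only the standard library is needed.

```python
import itertools
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_labels(points, init, max_iter=100):
    """Lloyd's k-means started from the points at the given indices."""
    centroids = [list(points[i]) for i in init]
    k, labels = len(centroids), None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def pca_project(rows, n_components, rng, iters=200):
    """Project rows onto their leading principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(x[a] * x[b] for x in X) / (n - 1) for b in range(d)] for a in range(d)]
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _step in range(iters):                 # power iteration
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
        components.append(v)
        # Deflation: remove the component just found from the covariance matrix.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return [[sum(x[j] * c[j] for j in range(d)) for c in components] for x in X]

def agreement(a, b, k):
    """Share of points with the same cluster, maximised over relabellings of b."""
    best = max(sum(x == perm[y] for x, y in zip(a, b))
               for perm in itertools.permutations(range(k)))
    return best / len(a)

# Toy data (hypothetical): four groups of 40 areas described by six variables.
rng = random.Random(7)
centres = [(0, 0, 0, 0, 0, 0), (10, 0, 0, 0, 0, 0),
           (0, 10, 0, 0, 0, 0), (0, 0, 10, 0, 0, 0)]
rows = [tuple(c + rng.gauss(0, 1) for c in centre)
        for centre in centres for _ in range(40)]
init = [0, 40, 80, 120]                            # one starting point per group

full = kmeans_labels(rows, init)
reduced = kmeans_labels(pca_project(rows, 3, rng), init)
share = agreement(full, reduced, k=4)
print(share)
```

Each distance calculation in the reduced run touches half as many values, which is where any saving in clustering time would come from; the projection itself has a cost that has to be paid once per data set.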

Page 34: Towards the Geo-computation of Real-Time Geodemographics

K-means and Principal Component Analysis

• PCA can be used to facilitate K-means clustering by reducing dimensions.

(Ding, C., He, X., 2004)

K-means result for 41 “OAC variables”

K-means result for 26 “OAC Principal Components”

Chart: computation time (s) against number of clusters (1 to 49) for Kmeans and PCA_Kmeans

Page 35: Towards the Geo-computation of Real-Time Geodemographics

Intermediate results

• CLARA is a plausible alternative to k-means in a real time Geodemographic classification system.

• K-means might be combined with PCA to improve computational performance.

• In an online environment k-means is better for small data sets.

• Exploration of non-traditional data sources.

Page 36: Towards the Geo-computation of Real-Time Geodemographics

Bespoke Geodemographics….demo

• Web based clustering.

• Users interact with Java Servlets to submit the requests.

• Java Servlets interact with R, which in turn does the clustering.

• Java Servlets interact with R by using JRI (Java/R Interface).

Page 37: Towards the Geo-computation of Real-Time Geodemographics

Bespoke Geodemographics….demo

Page 38: Towards the Geo-computation of Real-Time Geodemographics

Bespoke Geodemographics….demo

Page 39: Towards the Geo-computation of Real-Time Geodemographics

Parallel Processing in Geodemographics

• Distribution of computation on multiple computers or processors.

• Effectively using the idle time of processors.

• Reduces computational time.
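Because repeated k-means runs are independent of one another, they can be handed to separate workers. The sketch below shows that structure with a thread pool from the Python standard library; the function names, toy data and worker count are assumptions for illustration. Threads are used here only to keep the example simple: in CPython, pure-Python arithmetic does not speed up under threads, so a real system would swap in separate processes, machines or a GPU as the workers.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_wss(points, k, seed, max_iter=20):
    """One k-means run with its own random seed; returns the final WSS."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def run_parallel(points, k, seeds, workers=4):
    """Distribute independent runs over a pool of workers.
    Results come back in the same order as the seeds."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: kmeans_wss(points, k, s), seeds))

# Toy data (hypothetical): three groups of 30 points in two dimensions.
data_rng = random.Random(5)
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in [(0, 0), (10, 0), (5, 9)] for _ in range(30)]

seeds = list(range(8))
serial = [kmeans_wss(points, 3, s) for s in seeds]
parallel = run_parallel(points, 3, seeds)
print(min(parallel) == min(serial))
```

Giving every run its own seed makes the parallel results identical to the serial ones, so the distribution of work changes only how long the runs take, not which classification is selected.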

Page 40: Towards the Geo-computation of Real-Time Geodemographics

Parallel Processing in Geodemographics

• Graphics cards can be used for parallel processing

• The latest graphics cards come with multiple GPUs (Graphics Processing Units)

• CUDA

  • Parallel computing architecture

  • Uses the graphics processing units of NVIDIA graphics cards

• Graphics cards can be used for running clustering algorithms in parallel

• All the latest NVIDIA graphics cards are CUDA enabled

  • E.g. GeForce GTX 295, Tesla S1070, Quadro FX 5800, etc.

Page 41: Towards the Geo-computation of Real-Time Geodemographics

Parallel Processing in Geodemographics

• Running k-means on OA dataset for London (10 times for each value of k).

• Results show an improvement in computational performance of approx. 30% while maintaining the clustering efficiency.

• The improvement can be approx. 70% if k-means is run 10,000 times.

Page 42: Towards the Geo-computation of Real-Time Geodemographics

Future work

• Investigation of clustering algorithms in more detail

• Investigation of CUDA in more detail for running clustering algorithms in less time.

  • More testing for other clustering algorithms

• Visualisation of the results produced.

  • Merging Google Maps with the existing visualisation techniques of the WorldNames (www.publicprofiler.org/worldnames) and National Trust Names (www.nationaltrustnames.org.uk) websites.

• Testing the usability for bespoke classifications.

Page 43: Towards the Geo-computation of Real-Time Geodemographics

Future work

Page 44: Towards the Geo-computation of Real-Time Geodemographics

Thank you for listening

Any Questions?