Download ppt - Real Time Geodemographics

Real time Geodemographics: Requirements and Challenges

Muhammad Adnan, Paul Longley

Current Geodemographic classifications

• Census data• E.g. OA (Output Area) dataset has 41 census variables.

• Variables are weighted according to their importance in classification.

• K-means clustering algorithm is used to cluster data into homogeneous groups.• Multiple runs of K-means due to its un-stability• 10,000 times (Singleton, 2008)

Need for real time Geodemographics

• Current classifications are created using static data sources.

• Rate and scale of current population change is making large surveys (census) increasingly redundant.• Significant hidden value in transactional data

• Data is increasingly available in near real time

e.g. ONS NESS API• Application specific (bespoke) classifications have

demonstrated utility (Longley & Singleton, 2009).

What are real time Geodemographics ?

Specification Estimation Testing

Computational challenges

• Integration of large and possibly disparate databases.• E.g. NHS data; Census data

• Data normalisation and optimization for fast transactions.

• Minimizing computational time of clustering algorithms (Very Important)!

• Common protocol• XML (SOAP)

• Use of non traditional data sources. (Singleton, 2008) • E.g. Flickr; Facebook

Important Challenge: Selection of clustering algorithm

• K-Means• PAM (Partitioning Around Medoids)• CLARA (Clustering Large Applications)• GA (Genetic Algorithm)

K-means

• Attempts to find out cluster centroids by minimising within sum of squares distance.

• K-means is unstable due to its initial seeds assignment.• Sensitive to outliers.

• Creating a Geodemographic classification requires running algorithm multiple times.• 10,000 times (Singleton, 2008)• Computationally expensive in a real time environment.

K-means (100 runs of k-means on OAC data set for k=4)

An example of bad clustering result (K-means)



Alternate Clustering Algorithms

• PAM (Partitioning around medoids) tries to minimize the sum of distances of the objects to their cluster centers.• Less sensitive to outliers than K-means.• Cannot handle larger data sets.

• CLARA (Clustering Large Applications) draws multiple samples of the dataset, applies PAM to each sample and returns the best result.

• GA (Genetic Algorithm) is inspired by models of biological evolution. It produces results through a breeding procedure.

This paper compares

• K-means• Clara• GA

By using three data normalisation techniques• Z-Scores• Range Standardisation• Principle Component Analysis.

• Algorithm stability of K-means, Clara, and GA

Data normalisation techniques used

• Z-Scores• Widely used variable normalisation technique• Can create outliers in the datasets

• Range Standardisation• Standardise values between a range of 0-1• Can erase interesting patterns in the data

• Principle Component Analysis.• Reduces the dimensions of a data set• Can erase interesting patterns in the data

Comparing computational efficiency (Z-scores)

PAM, and GA on the three geographic aggregations of a dataset covering London.

Figure 1: OA (Output Area) level results

Figure 2 : LSOA (Lower Super Output Area) level results Figure 3: Ward level results

Comparing computational efficiency (Range Standardisation)




Comparing computational efficiency (PCA)




Algorithm Stability (w.r.t. Computational time)Figure 10: Running k-means on OA (Output Area) for 120 times on each iteration

Figure 11: Running CLARA on OA (Output Area) for 120 times on each iteration Figure 12: Running GA on OA (Output Area) for 120 times on each iteration

K-means and Principle Component Analysis

• PCA can be used to facilitate K-means clustering by reducing dimensions.

(Ding, C., He, X., 2004)

Figure 13: K-means result for 41 “OAC variables”Figure 14: K-means result for 26 “OAC Principle Components”

K=4 (99% similar)

K-means and Principle Component Analysis

• PCA can be used to facilitate K-means clustering by reducing dimensions.

(Ding, C., He, X., 2004)

Figure 13: K-means result for 4 1 “OAC variables” Figure 14: K-means result for 26 “OAC Principle Components”

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49

No. of clusters

Tim

e (s

)

Kmeans

PCA_Kmeans

Conclusion

• Clara is plausible alternative to k-means in a real time Geodemographic classification system.

• K-means might be combined with PCA for enhanced computation power.

• In an online environment k-means is better for small data sets.

• Exploration of non traditional data sources.