Real time Geodemographics: Requirements and Challenges
Muhammad Adnan, Paul Longley
Current Geodemographic classifications
• Census data• E.g. OA (Output Area) dataset has 41 census variables.
• Variables are weighted according to their importance in classification.
• K-means clustering algorithm is used to cluster data into homogeneous groups.• Multiple runs of K-means due to its un-stability• 10,000 times (Singleton, 2008)
Need for real time Geodemographics
• Current classifications are created using static data sources.
• Rate and scale of current population change is making large surveys (census) increasingly redundant.• Significant hidden value in transactional data
• Data is increasingly available in near real time
e.g. ONS NESS API• Application specific (bespoke) classifications have
demonstrated utility (Longley & Singleton, 2009).
What are real time Geodemographics ?
Specification Estimation Testing
Computational challenges
• Integration of large and possibly disparate databases.• E.g. NHS data; Census data
• Data normalisation and optimization for fast transactions.
• Minimizing computational time of clustering algorithms (Very Important)!
• Common protocol• XML (SOAP)
• Use of non traditional data sources. (Singleton, 2008) • E.g. Flickr; Facebook
Important Challenge: Selection of clustering algorithm
• K-Means• PAM (Partitioning Around Medoids)• CLARA (Clustering Large Applications)• GA (Genetic Algorithm)
K-means
• Attempts to find out cluster centroids by minimising within sum of squares distance.
• K-means is unstable due to its initial seeds assignment.• Sensitive to outliers.
• Creating a Geodemographic classification requires running algorithm multiple times.• 10,000 times (Singleton, 2008)• Computationally expensive in a real time environment.
K-means (100 runs of k-means on OAC data set for k=4)
An example of bad clustering result (K-means)
An example of bad clustering result (K-means)
An example of bad clustering result (K-means)
Alternate Clustering Algorithms
• PAM (Partitioning around medoids) tries to minimize the sum of distances of the objects to their cluster centers.• Less sensitive to outliers than K-means.• Cannot handle larger data sets.
• CLARA (Clustering Large Applications) draws multiple samples of the dataset, applies PAM to each sample and returns the best result.
• GA (Genetic Algorithm) is inspired by models of biological evolution. It produces results through a breeding procedure.
This paper compares
• K-means• Clara• GA
By using three data normalisation techniques• Z-Scores• Range Standardisation• Principle Component Analysis.
• Algorithm stability of K-means, Clara, and GA
Data normalisation techniques used
• Z-Scores• Widely used variable normalisation technique• Can create outliers in the datasets
• Range Standardisation• Standardise values between a range of 0-1• Can erase interesting patterns in the data
• Principle Component Analysis.• Reduces the dimensions of a data set• Can erase interesting patterns in the data
Comparing computational efficiency (Z-scores)
PAM, and GA on the three geographic aggregations of a dataset covering London.
Figure 1: OA (Output Area) level results
Figure 2 : LSOA (Lower Super Output Area) level results Figure 3: Ward level results
Comparing computational efficiency (Range Standardisation)
PAM, and GA on the three geographic aggregations of a dataset covering London.
Figure 4: OA (Output Area) level results
Figure 5 : LSOA (Lower Super Output Area) level results Figure 6: Ward level results
Comparing computational efficiency (PCA)
PAM, and GA on the three geographic aggregations of a dataset covering London.
Figure 7: OA (Output Area) level results
Figure 8 : LSOA (Lower Super Output Area) level results Figure 9: Ward level results
Algorithm Stability (w.r.t. Computational time)Figure 10: Running k-means on OA (Output Area) for 120 times on each iteration
Figure 11: Running CLARA on OA (Output Area) for 120 times on each iteration Figure 12: Running GA on OA (Output Area) for 120 times on each iteration
K-means and Principle Component Analysis
• PCA can be used to facilitate K-means clustering by reducing dimensions.
(Ding, C., He, X., 2004)
Figure 13: K-means result for 41 “OAC variables”Figure 14: K-means result for 26 “OAC Principle Components”
K=4 (99% similar)
K-means and Principle Component Analysis
• PCA can be used to facilitate K-means clustering by reducing dimensions.
(Ding, C., He, X., 2004)
Figure 13: K-means result for 4 1 “OAC variables” Figure 14: K-means result for 26 “OAC Principle Components”
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49
No. of clusters
Tim
e (s
)
Kmeans
PCA_Kmeans
Conclusion
• Clara is plausible alternative to k-means in a real time Geodemographic classification system.
• K-means might be combined with PCA for enhanced computation power.
• In an online environment k-means is better for small data sets.
• Exploration of non traditional data sources.