BY HEMA CHANDRA REDDY ETHAPU
Big Data Domain Research
Agenda:
Introduction to domain
Problems that need to be solved for this domain
Big data used
Analysis techniques used
Examples of big data discoveries in the domain
Challenges faced
References
Introduction to domain
The domain I chose to analyze for my project is Urban Planning, which is a subdomain of the Government domain.
This domain mainly deals with land use statistics in towns and cities.
Land use is classified into nine categories: domestic, non-domestic, roads, paths, rails, gardens, green space, water, and others.
This classification provides a basis for analyzing the availability of green spaces.
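The availability of green space can be expressed as each category's share of the total area. The following is a minimal sketch; the function name `category_shares` and the figures are illustrative assumptions, not values from the dataset.

```python
def category_shares(areas):
    """Return each land use category's fraction of the total area."""
    total = sum(areas.values())
    if total <= 0:
        raise ValueError("total area must be positive")
    return {name: area / total for name, area in areas.items()}

# Made-up areas in square metres for the nine land use categories.
areas = {
    "domestic": 400.0,
    "non-domestic": 100.0,
    "roads": 120.0,
    "paths": 20.0,
    "rails": 10.0,
    "gardens": 150.0,
    "green space": 150.0,
    "water": 30.0,
    "others": 20.0,
}

shares = category_shares(areas)
print(f"Green space share: {shares['green space']:.1%}")
```

Working with shares rather than raw areas makes regions of different sizes comparable.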
Introduction to domain contd..
These statistics improve the local evidence base for neighborhood renewal and support the creation of sustainable communities.
This analysis helps to raise the quality of life for people in urban areas and other communities.
The data also supports analysis of housing density in a particular geographic area, and so informs decisions about building new housing in that area.
Problems that need to be solved
The main problem is to give a clear picture of how much land is used for residential purposes and how much is left for green spaces.
A second problem I identified in this domain is how the weather in a particular geographic region is affected by land utilization in that area.
Collecting weather data along with land use statistics provides a good framework for data analysis and for finding better solutions to these problems.
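Combining the two sources could look like the sketch below, which joins land use records with weather records on a shared area code. The field names (`area_code`, `green_share`, `mean_temp_c`) and the values are hypothetical; the slides do not describe the actual weather dataset.

```python
def join_by_area(land_use, weather):
    """Inner-join two lists of records on their 'area_code' field."""
    weather_by_area = {row["area_code"]: row for row in weather}
    joined = []
    for row in land_use:
        match = weather_by_area.get(row["area_code"])
        if match is not None:
            # Merge the two records; weather fields are added to the land use row.
            joined.append({**row, **match})
    return joined

# Hypothetical records for illustration only.
land_use = [
    {"area_code": "A1", "green_share": 0.15},
    {"area_code": "A2", "green_share": 0.02},
]
weather = [
    {"area_code": "A1", "mean_temp_c": 10.5},
]

joined = join_by_area(land_use, weather)
print(joined)
```

Areas without a matching weather record are dropped, so only regions with both kinds of data enter the analysis.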
Big data used
For my analysis in this domain, I used statistics created by a computerized process that identifies the different land parcels and buildings on an Ordnance Survey digital map product and records their ‘type’ and area.
These objects are shown as polygons, each of which can represent, for example, a building or a land parcel.
Each object is recorded in MasterMap and is labeled with a unique code called a Topographic Identifier (TOID).
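To illustrate how an area can be derived from polygon objects, here is a minimal sketch that applies the shoelace formula and totals the area per land use type. The record layout (`toid`, `type`, `vertices`) and the coordinates are assumptions for illustration; this is not the MasterMap data format.

```python
from collections import defaultdict

def polygon_area(vertices):
    """Area of a simple polygon from (x, y) vertex pairs, via the shoelace formula."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0

def area_by_type(objects):
    """Sum polygon areas per land use type."""
    totals = defaultdict(float)
    for obj in objects:
        totals[obj["type"]] += polygon_area(obj["vertices"])
    return dict(totals)

# Hypothetical objects; coordinates are in metres.
objects = [
    {"toid": "example-1", "type": "domestic",
     "vertices": [(0, 0), (10, 0), (10, 5), (0, 5)]},
    {"toid": "example-2", "type": "green space",
     "vertices": [(0, 0), (4, 0), (0, 3)]},
]

totals = area_by_type(objects)
print(totals)
```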
Big data used contd..
Figure: Ordnance Survey map showing the available green spaces in London.
Big data used contd..
The sample data used for the analysis mainly comprises Lower Layer Super Output Areas (LSOAs), Middle Layer Super Output Areas (MSOAs), Local Authorities (LAs), and Government Office Regions (GORs) in England in 2001.
Analysis techniques used
For the data collected for this project, I used the Most Frequent Pattern (MFP) algorithm to discover patterns in the data.
I used a hybrid clustering approach to find interpretable patterns in the data.
After analyzing the frequent patterns, I could conclude which type of land use dominates in which geographic region of the UK.
The next slide shows the algorithm used and the mined data.
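The slides do not say which clustering method was combined with MFP. Since k-means appears in the references, the sketch below assumes a plain one-dimensional k-means (Lloyd's algorithm) that groups areas by green space share. The function name, the initial centroids, and the values are all illustrative assumptions.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Cluster numbers around the given starting centroids (Lloyd's algorithm)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters

# Made-up green space shares for six areas.
green_shares = [0.01, 0.02, 0.03, 0.20, 0.22, 0.25]
centroids, clusters = kmeans_1d(green_shares, centroids=[0.0, 0.3])
print(centroids)
print(clusters)
```

Each resulting cluster can then be passed to a pattern-mining step separately, which is one way a clustering stage and MFP could be combined.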
Analysis techniques used contd..
Input: Datasets (DS)
Output: Matrix Most Frequent Pattern (MFP)
MFP (DS)
Begin
{
  For each item Xi in DS
  {
    a. for each attribute
       i. count occurrences for Xi: C = Count(Xi)
       ii. Find attribute name of C having maximum count: Mi = Attribute(Ci)
       Next [End of inner loop]
    b. Find Most Frequent Pattern
       i. MFP = Combine(Mi)
    Next [End of outer loop]
  }
}
MFP Algorithm
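One way to read the pseudocode is: for every attribute, count how often each value occurs, keep the value with the highest count, and combine those values into a single pattern. The sketch below follows that reading; it is my interpretation under that assumption, and the attribute names and records are hypothetical.

```python
from collections import Counter

def most_frequent_pattern(records, attributes):
    """Combine the most frequent value of each attribute into one pattern."""
    if not records:
        raise ValueError("records must not be empty")
    pattern = {}
    for attr in attributes:
        # Count(Xi): occurrences of each value of this attribute.
        counts = Counter(rec[attr] for rec in records)
        # Attribute(Ci): the value with the maximum count.
        value, _count = counts.most_common(1)[0]
        pattern[attr] = value
    # Combine(Mi): the per-attribute winners together form the pattern.
    return pattern

# Hypothetical categorical records for illustration.
records = [
    {"region": "North East", "dominant_use": "industrial", "green_space": "high"},
    {"region": "North East", "dominant_use": "industrial", "green_space": "low"},
    {"region": "North West", "dominant_use": "domestic", "green_space": "low"},
]

pattern = most_frequent_pattern(records, ["region", "dominant_use", "green_space"])
print(pattern)
```

Numeric columns such as areas would normally be binned into categories (for example "low" and "high") before counting, since exact values rarely repeat.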
Analysis techniques used contd..
GOR_Name      | MSOA_Name    | Total_Area (m²) | Area of buildings (m²) | Area of industrial buildings (m²) | Area of green spaces (m²)
North East    | Chesterfield | 8075.52         | 2300.53                | 5000.01                           | 775
North East    | Derbyshire   | 13274.87        | 250.07                 | 13000                             | 24.80
North West    | Erewash      | 3859.31         | 170.53                 | 3400.22                           | 389
North West    | High Peak    | 1775.43         | 194.79                 | 1400                              | 180
East Midlands | Blaby        | 5820.06         | 217.93                 | 5000.07                           | 600.87
East Midlands | Charnwood    | 3573.27         | 245.53                 | 3000.67                           | 200.98
Table: areas with high occupancy of industrial buildings and large green spaces.
Analysis techniques used contd..
GOR_Name      | MSOA_Name  | Total_Area (m²) | Area of buildings (m²) | Area of industrial buildings (m²) | Area of green spaces (m²)
North East    | Hampshire  | 5132.76         | 4900.45                | 130.31                            | 100
North East    | Devon      | 1131.43         | 1000                   | 50.65                             | 65.30
North West    | Treddyfrin | 3686.41         | 3200                   | 250.43                            | 200.32
North West    | Chester    | 1431.08         | 900.65                 | 289.67                            | 76.23
East Midlands | Gateshead  | 1213.91         | 945.82                 | 342.72                            | 63.41
East Midlands | New Castle | 1041.91         | 783.29                 | 323.48                            | 24.10
Table: areas with high occupancy of domestic buildings and small green spaces.
Analysis techniques used contd..
GOR_Name      | MSOA_Name     | Total_Area (m²) | Area of buildings (m²) | Area of industrial buildings (m²) | Area of green spaces (m²)
Yorkshire     | Uttlesford    | 3843.96         | 2546.36                | 1325.78                           | 12.76
Yorkshire     | Broxbourne    | 418.69          | 327.45                 | 125.67                            | 10.65
Humber        | Dacorum       | 1425.65         | 967.39                 | 765.43                            | 9.1
Humber        | Hertsmere     | 652.9           | 432.76                 | 289.37                            | 3.4
West Midlands | St Albans     | 816.77          | 730.34                 | 129.60                            | 2.90
West Midlands | Hertfordshire | 614.03          | 467.65                 | 132.65                            | 13.67
Table: areas with high occupancy of domestic and industrial buildings and very small green spaces.
Analysis techniques used contd..
MFP Matrix Table
Most frequent patterns mined out of given data
1. North East – Chesterfield – 5000.01 (Industrial Building) – 775 (Green Space)
2. East Midlands – Blaby – 5000.07 (Industrial Building) – 600.87 (Green Space)
3. North East – Devon – 50.65 (Industrial Building) – 65.30 (Green Space)
4. West Midlands – Hertfordshire – 132.65 (Industrial Building) – 13.67 (Green Space)
Analysis techniques used contd..
From the above MFP matrix, I concluded that within a Middle Layer Super Output Area (MSOA) of a particular Government Office Region (GOR), most of the green space has been taken up by domestic buildings.
I observed that an increase in domestic land utilization leads to a decrease in green space area.
Now that we have human-interpretable patterns in our data, the data is easier to analyze.
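The observed relationship between domestic land use and green space can be checked numerically with a correlation coefficient. This is a minimal sketch using made-up figures, not the values from the tables; a coefficient close to -1 would indicate that green space shrinks as domestic area grows.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return covariance / math.sqrt(spread_x * spread_y)

# Illustrative areas in square metres for four hypothetical MSOAs.
domestic_area = [100.0, 200.0, 300.0, 400.0]
green_area = [90.0, 70.0, 40.0, 20.0]

r = pearson_correlation(domestic_area, green_area)
print(f"Correlation between domestic area and green space: {r:.3f}")
```

A correlation only shows that the two quantities move together; it does not by itself show that domestic building causes the loss of green space.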
Examples of big data discoveries in the domain
One example is collecting and analyzing data in a city for a metro rail project before the project is initiated.
This analysis covers which localities are feasible for metro rail, which route is optimal for the project, and how well the line will be used by the public after completion.
Another example is the analysis of an online railway reservation system.
This analysis covers where requests to the server come from, how many authorized requests are received during the peak booking hour, and the response time for each request.
Challenges Faced
The main challenge I faced in this domain research was finding datasets for data analysis.
The other challenge was recognizing what state the data was in.
I overcame these challenges by studying some machine learning algorithms and data analysis concepts.
References
http://ieeexplore.ieee.org.ezp1.villanova.edu/stamp/stamp.jsp?tp=&arnumber=5077118
http://www.neighbourhood.statistics.gov.uk/dissemination/instanceSelection.do?JSAllowed=true&Function=&%24ph=60_61&CurrentPageId=61&step=2&datasetFamilyId=1201&instanceSelection=119561&Next.x=28&Next.y=16
https://en.wikipedia.org/wiki/K-means_clustering
http://docs.scipy.org/doc/scipy/reference/tutorial/spatial.html
Queries?