Domain research presentation Midterm

BY HEMA CHANDRA REDDY ETHAPU

Big Data Domain Research

Agenda:

Introduction to domain

Problems to be solved for this domain

Big data used

Analysis techniques used

Examples of big data discoveries in the domain

Challenges faced

References

Introduction to domain

The domain I chose for my project is Urban Planning, a sub-domain of the Government domain.

This domain mainly deals with land use statistics in towns and cities.

Land use is classified into nine categories: domestic, non-domestic, roads, paths, rails, gardens, green space, water, and others.

This classification provides a basis for analyzing the availability of green spaces.

Introduction to domain contd..

These statistics improve the local area evidence base on neighborhood renewal, and support the creation of sustainable communities.

This analysis helps to raise the quality of life for people in urban areas, and other communities.

This data also facilitates analysis of housing density in a particular geographic area, and thus informs decisions about building new housing in that area.

Problems to be solved

The main problem is to give a clear picture of how much land is used for residential purposes and how much is left for green spaces.

A second problem I identified in this domain is how land utilization in a particular geographic region affects the weather in that region.

Collecting weather data along with land use statistics will provide a good framework for data analysis and for finding better solutions to these problems.
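The first problem above, how much land goes to buildings versus green space, comes down to computing each category's share of the total. The following is a minimal sketch, not the method used in the project. The function name and dictionary keys are illustrative, the figures are the Chesterfield row of the sample table shown later in this presentation, and reading "Area of Buildings" as domestic buildings is an assumption.

```python
def land_use_shares(areas):
    """Return each category's share of the summed area as a fraction."""
    total = sum(areas.values())
    if total <= 0:
        raise ValueError("total area must be positive")
    return {category: area / total for category, area in areas.items()}


# Areas in m2 for Chesterfield, copied from the sample table in this deck.
# The key names are illustrative; "buildings" is assumed to mean domestic buildings.
chesterfield = {"buildings": 2300.53, "industrial": 5000.01, "green_space": 775.0}

shares = land_use_shares(chesterfield)
for category, share in shares.items():
    print(f"{category}: {share:.1%}")
```

Working with shares rather than raw areas makes regions of very different sizes comparable.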

Big data used

For my analysis in this domain, I used statistics created by a computerized process that identifies different land parcels and buildings on an Ordnance Survey digital map product and records their ‘type’ and area.

These objects are shown as polygons which can represent, for example, a building or land parcel.

Each object is recorded as an attribute in MasterMap and is labeled with a unique code called a Topographic Identifier (TOID).
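To make the idea of polygon objects with a type and an area concrete, here is a small sketch of how one such record could be held in memory and its area derived from its vertices with the shoelace formula. The class name, field names, coordinates, and the identifier value are all hypothetical; this is not the MasterMap data model.

```python
from dataclasses import dataclass


@dataclass
class MapObject:
    toid: str        # Topographic Identifier (the value used below is a placeholder)
    land_type: str   # e.g. "domestic" or "green space"
    vertices: list   # [(x, y), ...] in metres, listed in order around the boundary

    def area(self):
        """Polygon area in m2 via the shoelace formula."""
        twice_area = 0.0
        n = len(self.vertices)
        for i in range(n):
            x1, y1 = self.vertices[i]
            x2, y2 = self.vertices[(i + 1) % n]  # wrap around to close the polygon
            twice_area += x1 * y2 - x2 * y1
        return abs(twice_area) / 2.0


# A hypothetical 40 m x 25 m rectangular parcel.
parcel = MapObject("toid-example-0001", "green space", [(0, 0), (40, 0), (40, 25), (0, 25)])
print(parcel.toid, parcel.land_type, parcel.area())
```

Summing `area()` over all objects of one type in a region would give the per-category totals that the land use statistics report.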

Big data used contd..

This figure represents an Ordnance Survey Map which shows the available Green Spaces in London.

Big data used contd..

Below is the sample data used for data analysis.

This data mainly comprises Lower Layer Super Output Areas (LSOAs), Middle Layer Super Output Areas (MSOAs), Local Authorities (LAs), and Government Office Regions (GORs) in England in 2001.

Analysis techniques used

For the data collected for this project, I used the Mining of Frequent Patterns (MFP) algorithm to discover patterns in the data.

I used a hybrid clustering approach to find interpretable patterns in the data.
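The slides do not say which clustering methods the hybrid approach combines. As a rough illustration of the clustering step only, the sketch below runs a plain one-dimensional k-means (one of the techniques listed in the references) on green-space share. The function name is hypothetical, k-means is a stand-in rather than the project's method, and the rows are a subset of the sample tables in this deck.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Plain k-means on a list of numbers (k >= 2); returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels


# (MSOA name, green space in m2, total area in m2) from the sample tables.
rows = [
    ("Chesterfield", 775.0, 8075.52),
    ("Blaby", 600.87, 5820.06),
    ("Erewash", 389.0, 3859.31),
    ("Hampshire", 100.0, 5132.76),
    ("St Albans", 2.90, 816.77),
    ("Hertfordshire", 13.67, 614.03),
]
green_share = [green / total for _, green, total in rows]
centroids, labels = kmeans_1d(green_share, k=2)
for (name, _, _), label in zip(rows, labels):
    print(name, "cluster", label)
```

With two clusters, the areas separate into a lower and a higher green-space group, which is the kind of interpretable grouping the slide describes.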

After analyzing the frequent patterns, I identified which type of land use is most common in each geographic region of the UK.

The next slide shows the algorithm used and the mined data.

Analysis techniques used contd..

Input: Datasets (DS)
Output: Matrix Most Frequent Pattern (MFP)

MFP (DS)
Begin
{
    For each item Xi in DS
    {
        a. For each attribute
            i. Count occurrences for Xi
               C = Count (Xi)
            ii. Find attribute name of C having maximum count
               Mi = Attribute (Ci)
           Next [End of inner loop]
        b. Find Most Frequent Pattern
            i. MFP = Combine (Mi)
        Next [End of outer loop]
    }
}

MFP Algorithm
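One possible reading of the pseudocode above is: for each attribute, count how often each value occurs, keep the most common value, and combine those values into a single pattern. The sketch below implements that reading in Python. The function name, the attribute names, and the "high"/"low" labels are illustrative assumptions, not the project's actual discretization of the land use data.

```python
from collections import Counter


def most_frequent_pattern(records):
    """For each attribute, keep its most common value and combine them into one pattern."""
    if not records:
        return {}
    pattern = {}
    for attribute in records[0]:
        # Count how often each value of this attribute occurs across all records.
        counts = Counter(record[attribute] for record in records)
        value, _count = counts.most_common(1)[0]
        pattern[attribute] = value
    return pattern


# Illustrative records only: the categorical labels are made up for this sketch.
records = [
    {"gor": "North East", "industrial": "high", "green_space": "high"},
    {"gor": "North East", "industrial": "high", "green_space": "low"},
    {"gor": "North West", "industrial": "high", "green_space": "high"},
    {"gor": "East Midlands", "industrial": "low", "green_space": "high"},
]
print(most_frequent_pattern(records))
```

Numeric columns such as areas would need to be binned into categories first, since counting exact floating-point values rarely produces repeats.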


| GOR_Name | MSOA_Name | Total_Area (m2) | Area of buildings (m2) | Area of industrial buildings (m2) | Area of green spaces (m2) |
|---|---|---|---|---|---|
| North East | Chesterfield | 8075.52 | 2300.53 | 5000.01 | 775 |
| North East | Derbyshire | 13274.87 | 250.07 | 13000 | 24.80 |
| North West | Erewash | 3859.31 | 170.53 | 3400.22 | 389 |
| North West | High Peak | 1775.43 | 194.79 | 1400 | 180 |
| East Midlands | Blaby | 5820.06 | 217.93 | 5000.07 | 600.87 |
| East Midlands | Charnwood | 3573.27 | 245.53 | 3000.67 | 200.98 |

Table with data containing high occupancy of industrial buildings and high green spaces.


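| GOR_Name | MSOA_Name | Total_Area (m2) | Area of buildings (m2) | Area of industrial buildings (m2) | Area of green spaces (m2) |
|---|---|---|---|---|---|
| North East | Hampshire | 5132.76 | 4900.45 | 130.31 | 100 |
| North East | Devon | 1131.43 | 1000 | 50.65 | 65.30 |
| North West | Treddyfrin | 3686.41 | 3200 | 250.43 | 200.32 |
| North West | Chester | 1431.08 | 900.65 | 289.67 | 76.23 |
| East Midlands | Gateshead | 1213.91 | 945.82 | 342.72 | 63.41 |
| East Midlands | New Castle | 1041.91 | 783.29 | 323.48 | 24.10 |

Table with data containing high occupancy of domestic buildings and low green spaces.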


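| GOR_Name | MSOA_Name | Total_Area (m2) | Area of buildings (m2) | Area of industrial buildings (m2) | Area of green spaces (m2) |
|---|---|---|---|---|---|
| Yorkshire | Uttlesford | 3843.96 | 2546.36 | 1325.78 | 12.76 |
| Yorkshire | Broxbourne | 418.69 | 327.45 | 125.67 | 10.65 |
| Humber | Dacorum | 1425.65 | 967.39 | 765.43 | 9.1 |
| Humber | Hertsmere | 652.9 | 432.76 | 289.37 | 3.4 |
| West Midlands | St Albans | 816.77 | 730.34 | 129.60 | 2.90 |
| West Midlands | Hertfordshire | 614.03 | 467.65 | 132.65 | 13.67 |

Table with data containing high occupancy of domestic and industrial buildings with very low green spaces.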


MFP Matrix Table

Most frequent patterns mined out of given data

1. North East – Chesterfield – 5000.01 (Industrial Building) – 775 (Green Space)

2. East Midlands – Blaby – 5000.07 (Industrial Building) – 600.87 (Green Space)

3. North East – Devon – 50.65 (Industrial Building) – 65.30 (Green Space)

4. West Midlands – Hertfordshire – 132.65 (Industrial Building) – 13.67 (Green Space)


From the above MFP matrix, I concluded that within a given Middle Layer Super Output Area (MSOA) of a Government Office Region (GOR), most of the green space is used up by domestic buildings.

I observed that an increase in domestic land utilization leads to a decrease in green space area.

Now that we have human-interpretable patterns in our data, the data is easier to analyze.
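The observation above, that more domestic land use goes with less green space, can be checked numerically with a correlation coefficient. The sketch below computes the Pearson correlation between building share and green-space share for a subset of rows from the sample tables. The function name is illustrative, reading "Area of buildings" as domestic buildings is an assumption, and the result depends on which rows are included.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (MSOA name, total area, area of buildings, area of green spaces), all in m2,
# copied from the sample tables; "buildings" is assumed to mean domestic buildings.
rows = [
    ("Chesterfield", 8075.52, 2300.53, 775.0),
    ("Blaby", 5820.06, 217.93, 600.87),
    ("Hampshire", 5132.76, 4900.45, 100.0),
    ("Devon", 1131.43, 1000.0, 65.30),
    ("St Albans", 816.77, 730.34, 2.90),
    ("Hertfordshire", 614.03, 467.65, 13.67),
]
building_share = [buildings / total for _, total, buildings, _ in rows]
green_share = [green / total for _, total, _, green in rows]

r = pearson(building_share, green_share)
print(f"correlation between building share and green-space share: {r:.2f}")
```

A negative value is consistent with the observation, although a correlation over a handful of rows does not by itself show that one causes the other.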

Examples of big data discoveries in the domain

One example is collecting and analyzing data in a city for a metro rail project before the project is initiated.

This analysis covers which localities are feasible for metro rail, which route is optimal for the project, and how well the public will use it after completion.

Another example is the analysis of an online railway reservation system.

This analysis covers where requests to the server come from, how many authorized requests are received in the peak booking hour, and the response time for each request.

Challenges Faced

The main challenge I faced in this domain research was finding datasets for data analysis.

The other challenge was recognizing what state the data was in.

I overcame these challenges by studying machine learning algorithms and data analysis concepts.

References

http://ieeexplore.ieee.org.ezp1.villanova.edu/stamp/stamp.jsp?tp=&arnumber=5077118

http://www.neighbourhood.statistics.gov.uk/dissemination/instanceSelection.do?JSAllowed=true&Function=&%24ph=60_61&CurrentPageId=61&step=2&datasetFamilyId=1201&instanceSelection=119561&Next.x=28&Next.y=16

https://en.wikipedia.org/wiki/K-means_clustering

http://docs.scipy.org/doc/scipy/reference/tutorial/spatial.html




Queries?
