
Automating Data Exploration SciPy 2016


AUTOMATING DATA EXPLORATION

A structured approach to analysing data

A TOOL AGNOSTIC APPROACH


METADATA

UNIVARIATE ANALYSIS

BIVARIATE ANALYSIS


LET’S TAKE A DATASET

Each row has details about an employee who has left the organization.

Just “reading” the dataset is quite informative.


DESCRIBE THE DATA IN A STRUCTURED WAY



CATEGORICAL COLUMNS YIELD VERY LITTLE DATA

There’s not much information in one column.

The values are not quantitative, so a distribution is not meaningful.

The values are not even ordered.

In fact, the only thing we have is the list of values and their count.

... or is there more to this?

Region            Count
India             10780
Headstrong         1554
China              1130
Philippines        1030
US                  792
Romania             788
Mexico              324
Guatemala           233
Poland              124
Brazil               45
Hungary              41
Colombia             38
Netherlands          33
South Africa         30
UK                   18
UAE                  15
GMS India            15
Japan                11
CZECH Republic       10
Kenya                 9


... BUT RANK FREQUENCY IS STILL POSSIBLE

The rank of the row provides additional information.

With this, we can explore the distribution of the rank against the count.

These distributions are called rank-frequency distributions.

Rank  Region            Count
   1  India             10780
   2  Headstrong         1554
   3  China              1130
   4  Philippines        1030
   5  US                  792
   6  Romania             788
   7  Mexico              324
   8  Guatemala           233
   9  Poland              124
  10  Brazil               45
  11  Hungary              41
  12  Colombia             38
  13  Netherlands          33
  14  South Africa         30
  15  UK                   18
  16  UAE                  15
  17  GMS India            15
  18  Japan                11
  19  CZECH Republic       10
  20  Kenya                 9
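A rank-frequency table like this one can be built from any categorical column. The sketch below is an illustration, not the talk's own code: the function name `rank_frequency` and the toy `regions` list are assumptions, and only the Python standard library is used.

```python
from collections import Counter


def rank_frequency(values):
    """Return (rank, value, count) tuples, most frequent value first."""
    counts = Counter(values)
    # most_common() sorts by descending count.
    return [
        (rank, value, count)
        for rank, (value, count) in enumerate(counts.most_common(), start=1)
    ]


# Hypothetical toy column, standing in for the Region column.
regions = ["India"] * 5 + ["China"] * 3 + ["US"] * 2 + ["Kenya"]

for rank, region, count in rank_frequency(regions):
    print(rank, region, count)
```

Because the function only needs a list of values, the same call works unchanged on Cost Code, LE, or any other categorical column.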


REGION SHOWS A POWER LAW DISTRIBUTION

[Figure: Region counts plotted as a rank-frequency chart. X-axis: rank on a log scale. Y-axis: frequency on a log scale.]
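One rough way to check for a power law without a chart is to fit a straight line to log(count) against log(rank). The sketch below uses the Region counts from the slide; the function name `loglog_slope` is an assumption, and an ordinary least-squares fit on the log-log points is only a quick screening step, not a rigorous test for a power law.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank)."""
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Region counts as listed on the slide.
region_counts = [10780, 1554, 1130, 1030, 792, 788, 324, 233, 124, 45,
                 41, 38, 33, 30, 18, 15, 15, 11, 10, 9]

slope = loglog_slope(region_counts)
print(round(slope, 2))
```

A roughly constant negative slope across the whole range is what a power law would look like; a curve that bends sharply suggests some other distribution.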


COST CODE SHOWS A POWER LAW DISTRIBUTION

Cost Code  Count
105         9542
121         1757
125          875
122          796
3001         654
3310         635
124          435
131          415
115          336
nan          207
101          205
127          173
109          148
116           91
126           66
...


LE SHOWS A POWER LAW DISTRIBUTION

LE    Count
D84   11487
GPL     853
RM1     789
LC2     565
GMR     323
D95     247
GUT     233
ML1     223
CTK     184
AXE     127
A38      98
A21      79
EMP      61
BRL      45
A66      43
...


WHAT CAUSES POWER LAW DISTRIBUTIONS?

PREFERENTIAL ATTACHMENT

EXPONENTIAL GROWTH
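Preferential attachment can be illustrated with a small simulation. The sketch below is one common textbook-style model, not something taken from the talk: the function name, the `new_item_prob` parameter and its default of 0.1 are all assumptions. Each new member either starts a new item or joins an existing item with probability proportional to that item's current size.

```python
import random


def preferential_attachment(n_steps, new_item_prob=0.1, seed=0):
    """Simulate item sizes when popular items attract more new members."""
    rng = random.Random(seed)
    members = [0]  # one entry per member, holding that member's item id
    n_items = 1
    for _ in range(n_steps):
        if rng.random() < new_item_prob:
            # Occasionally a brand-new item appears with a single member.
            members.append(n_items)
            n_items += 1
        else:
            # Copy the item of a random existing member, so an item is
            # picked with probability proportional to its current size.
            members.append(rng.choice(members))
    counts = [0] * n_items
    for item in members:
        counts[item] += 1
    return sorted(counts, reverse=True)


counts = preferential_attachment(10_000)
print(counts[:5], "...", counts[-5:])
```

Plotting these counts against their rank on log scales gives a chart that can be compared with the Region, Cost Code and LE charts.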


NO. OF FOLLOWERS ON GITHUB

Username       Count
slidenerd       1700
astaxie         1320
MugunthKumar    1081
honcheng         870
arunoda          827
csjaba           670
cheeaun          658
timoxley         600
karlseguin       600
hemanth          514
arvindr21        400
yuvipanda        335
mbrochh          330
anandology       330
sayanee          314
zz85             314
sanand0          309
captn3m0         300
sameersbn        300
...


NO. OF MOVIES ACTED IN BY BOLLYWOOD PEOPLE

Person            Count
Lata Mangeshkar     824
Asha Bhosle         810
Shakti Kapoor       589
Kishore Kumar       585
Mohammed Rafi       527
Sunidhi Chauhan     515
Alka Yagnik         451
Udit Narayan        435
Kader Khan          430
Sonu Nigam          405
Sameer              398
Asrani              397
Helen               395
Shaan               377
Aruna Irani         375
Anupam Kher         367
Shreya Ghoshal      357
Gulshan Grover      341
...


PARTIES IN PARLIAMENT ELECTIONS

Name     Count
IND      44704
INC       7213
BJP       3354
BSP       2628
SP        1311
CPI       1102
JD         943
CPM        914
DDP        716
JNP        676
BJS        657
JP         563
NOTA       543
PSP        538
INC(I)     492
SHS        467
AAP        432
SWA        410
...


CANDIDATE NAMES IN ASSEMBLY ELECTIONS

Name               Count
NONE OF THE ABOVE    629
OM PRAKASH           478
ASHOK KUMAR          411
RAM SINGH            362
RAJ KUMAR            294
ANIL KUMAR           271
AMAR SINGH           248
MOHAN LAL            235
RAM KUMAR            224
BABU LAL             218
RAM PRASAD           213
JAGDISH              210
VIJAY KUMAR          207
RAJENDRA SINGH       196
VINOD KUMAR          195
SHYAM LAL            193
RAJESH KUMAR         186
SITA RAM             186
RAM LAL              171
...


STUDENT NAMES IN SSA SURVEY

Name            Count
M.MANIKANDAN       99
S.PAVITHRA         84
S.MANIKANDAN       84
R.RAMYA            82
S.SANGEETHA        70
R.MANIKANDAN       69
S.DIVYA            68
M.PAVITHRA         68
S.SANTHIYA         67
S.VIGNESH          67
M.PRIYA            67
M.MAHALAKSHMI      64
S.SARANYA          63
S.SURYA            60
K.MANIKANDAN       60
P.PAVITHRA         56
S.GAYATHRI         56
P.MANIKANDAN       55
...


[Figure: chart of names, with these labels: Jain, Harini, Shweta, Sneha Pooja, Ashwin, Shah, Deepti, Sanjana, Varshini, Ezhumalai, Venkatesan, Silambarasan, Pandiyan, Kumaresan, Manikandan, Thirupathi, Agarwal, Kumar, Priya]


NOT EVERYTHING IS POWER-LAW, THOUGH

We need to understand what drives these distributions from the way they behave.


ORDERED CATEGORICALS HAVE MORE INFORMATION


CORPORATE BAND

LE          Count
5           12247
4            4449
3             205
2              63
Not Mapped     24
1              22
SVP            10


LOCAL BAND

LE          Count
5A           7483
5B           4764
4A           1683
4B           1612
4C            747
4D            407
3             205
2              63
Not Mapped     24
1              22
SVP            10


QUANTITIES HAVE EVEN MORE INFORMATION


AGE DISTRIBUTION IS LOG-NORMAL


DETECTING FRAUD

“We know meter readings are incorrect, for various reasons.

We don’t, however, have the concrete proof we need to start the process of meter reading automation.

Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.”

ENERGY UTILITY


This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries.

This clearly shows collusion of some form with the customers.

This happens with specific customers, not randomly. Here are such customers’ meter readings.

Apr-10  May-10  Jun-10  Jul-10  Aug-10  Sep-10  Oct-10  Nov-10  Dec-10  Jan-11  Feb-11  Mar-11
   217     219     200     200     200     200     200     200     200     350     200     200
   250     200     200     200     201     200     200     200     250     200     200     150
   250     150     150     200     200     200     200     200     200     200     200     150
   150     200     200     200     200     200     200     200     200     200     200      50
   200     200     200     150     180     150      50     100      50      70     100     100
   100     100     100     100     100     100     100     100     100     100     110     100
   100     150     123     123      50     100      50     100     100     100     100     100
     0     111     100     100     100     100     100     100     100     100      50      50
     0     100      27     100      50     100     100     100     100     100      70     100
     1       1       1     100      99      50     100     100     100     100     100     100
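A quick screen for this pattern is to measure how many of a customer's readings sit exactly on a round boundary. The sketch below uses the ten customers' readings from the slide. The actual tariff slab boundaries are not given in the talk, so treating every positive multiple of 50 units as a boundary is an assumption, as are the names `boundary_share` and `step`.

```python
# Monthly readings (Apr-10 to Mar-11) for the ten customers on the slide.
readings = [
    [217, 219, 200, 200, 200, 200, 200, 200, 200, 350, 200, 200],
    [250, 200, 200, 200, 201, 200, 200, 200, 250, 200, 200, 150],
    [250, 150, 150, 200, 200, 200, 200, 200, 200, 200, 200, 150],
    [150, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 50],
    [200, 200, 200, 150, 180, 150, 50, 100, 50, 70, 100, 100],
    [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 110, 100],
    [100, 150, 123, 123, 50, 100, 50, 100, 100, 100, 100, 100],
    [0, 111, 100, 100, 100, 100, 100, 100, 100, 100, 50, 50],
    [0, 100, 27, 100, 50, 100, 100, 100, 100, 100, 70, 100],
    [1, 1, 1, 100, 99, 50, 100, 100, 100, 100, 100, 100],
]


def boundary_share(values, step=50):
    """Fraction of readings that are positive exact multiples of `step`.

    `step=50` is an assumed stand-in for the real tariff slab boundaries.
    """
    on_boundary = [v for v in values if v > 0 and v % step == 0]
    return len(on_boundary) / len(values)


for customer, row in enumerate(readings, start=1):
    print(customer, round(boundary_share(row), 2))

all_readings = [value for row in readings for value in row]
overall_share = boundary_share(all_readings)
print("overall", round(overall_share, 2))
```

Honest readings would be expected to land on a multiple of 50 only occasionally, so customers with a share close to 1 are the ones worth inspecting first.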

Section    Apr-10  May-10  Jun-10  Jul-10  Aug-10  Sep-10  Oct-10  Nov-10  Dec-10  Jan-11  Feb-11  Mar-11
Section 1     70%     97%    136%     65%    110%    116%    121%    107%    114%     88%     74%    109%
Section 2     66%     92%     66%     87%     70%     64%     63%     50%     58%     38%     41%     54%
Section 3     90%     46%     47%     43%     28%     31%     50%     32%     19%     38%      8%     34%
Section 4     44%     24%     36%     39%     21%     18%     24%     49%     56%     44%     31%     14%
Section 5      4%     63%    -27%     20%     41%     82%     26%     34%     43%      2%     37%     15%
Section 6     18%     23%     30%     21%     28%     33%     39%     41%     39%     18%      0%     33%
Section 7     36%     51%     33%     33%     27%     35%     10%     39%     12%      5%     15%     14%
Section 8     22%     21%     28%     12%     24%     27%     10%     31%     13%     11%     22%     17%
Section 9     19%     35%     14%      9%     16%     32%     37%     12%      9%      5%     -3%     11%

If we define the “extent of fraud” as the percentage excess of the 100-unit meter reading, the value varies considerably across sections and over time, with some explainable anomalies.

Chart annotation: New section manager arrives … and is transferred out.

Chart annotation: Why would these happen?


PREDICTING MARKS

“What determines a child’s marks?

Do girls score better than boys?

Does the choice of subject matter?

Does the medium of instruction matter?

Does community or religion matter?

Does their birthday matter?

Does the first letter of their name matter?”

EDUCATION


TN CLASS X: ENGLISH

[Figure: histogram. X-axis ticks run from 0 to 99 in steps of 3; y-axis ticks run from 0 to 40,000.]


TN CLASS X: SOCIAL SCIENCE

[Figure: histogram. X-axis ticks run from 0 to 99 in steps of 3; y-axis ticks run from 0 to 40,000.]


TN CLASS X: LANGUAGE

[Figure: histogram. X-axis ticks run from 0 to 99 in steps of 3; y-axis ticks run from 0 to 40,000.]


TN CLASS X: SCIENCE

[Figure: histogram. X-axis ticks run from 0 to 99 in steps of 3; y-axis ticks run from 0 to 40,000.]


TN CLASS X: MATHEMATICS

[Figure: histogram. X-axis ticks run from 0 to 99 in steps of 3; y-axis ticks run from 0 to 40,000.]


ICSE 2013 CLASS XII: TOTAL MARKS


CBSE 2013 CLASS XII: ENGLISH MARKS


CBSE 2013 CLASS XII: PHYSICS MARKS



LET’S TAKE ONE DAY CRICKET DATA

Country       Player                 Runs  ScoreRate  MatchDate   Ground                                Versus
Australia     Michael J Clarke       99*   93.39      30-06-2010  The Oval                              England
Australia     Dean M Jones           99*   128.57     28-01-1985  Adelaide Oval                         Sri Lanka
Australia     Bradley J Hodge        99*   115.11     04-02-2007  Melbourne Cricket Ground              New Zealand
India         Virender Sehwag        99*   99         16-08-2010  Rangiri Dambulla International Stad.  Sri Lanka
New Zealand   Bruce A Edgar          99*   72.79      14-02-1981  Eden Park                             India
Pakistan      Mohammad Yousuf        99*   95.19      15-11-2007  Captain Roop Singh Stadium            India
West Indies   Richard B Richardson   99*   70.21      15-11-1985  Sharjah CA Stadium                    Pakistan
West Indies   Ramnaresh R Sarwan     99*   95.19      15-11-2002  Sardar Patel Stadium                  India
Zimbabwe      Andrew Flower          99*   89.18      24-10-1999  Harare Sports Club                    Australia
Zimbabwe      Alistair D R Campbell  99*   79.83      01-10-2000  Queens Sports Club                    New Zealand
Zimbabwe      Malcolm N Waller       99*   133.78     25-10-2011  Queens Sports Club                    New Zealand
Australia     David C Boon           98*   82.35      08-12-1994  Bellerive Oval                        Zimbabwe
Australia     Graeme M Wood          98*   63.22      11-01-1981  Melbourne Cricket Ground              India
England       Ian J L Trott          98*   84.48      20-10-2011  Punjab Cricket Association Stadium    India
India         Yuvraj Singh           98*   89.09      01-08-2001  Sinhalese Sports Club Ground          Sri Lanka
Ireland       Kevin J O'Brien        98*   94.23      10-07-2010  VRA Ground                            Scotland
Kenya         Collins O Obuya        98*   75.96      13-03-2011  M.Chinnaswamy Stadium                 Australia
Netherlands   Ryan N ten Doeschate   98*   73.68      01-09-2009  VRA Ground                            Afghanistan
New Zealand   James E C Franklin     98*   142.02     07-12-2010  M.Chinnaswamy Stadium                 India
Pakistan      Ijaz Ahmed             98*   112.64     28-10-1994  Iqbal Stadium                         South Africa
South Africa  Jacques H Kallis       98*   74.24      06-02-2000  St George's Park                      Zimbabwe


Against which countries are higher averages scored?

Which countries’ players score more per match?


Which player scores the most per ball?

The player with the highest strike rate is an obscure South African whose name most of us have never heard of.

In fact, this list is filled with players we have never heard of.


Most analysis answers the question “Which are the top 10 X?”

Which are my top products?

Which are my top branches?

Who are my best sales people?

Which vendors have the highest cost per unit?

Which divisions are spending the most money?

In which hours does the under 12 segment watch TV most?

Which customer segment has the highest revenue per user?


THIS QUESTION CAN BE ANSWERED SYSTEMATICALLY


Take every column in the data

Find the top value by that column

Country    South Africa has the highest strike rate of 76%
Player     Johann Louw has the highest strike rate of 329%
Runs       164 runs has the highest strike rate of 156%
MatchDate  12-03-2006 has the highest strike rate of 136%
Ground     AC-VDCA Stadium has the highest strike rate of 98%
Versus     United States has the highest strike rate of 104%
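The two steps above (take every column, find the top value by that column) can be sketched as a single loop. This is an illustration under assumptions, not the talk's implementation: the function name `top_by_column` is made up, the six rows are a small sample from the table shown earlier, and the metric is a simple mean of per-innings ScoreRate rather than an aggregate strike rate, so the winners differ from the full-dataset results on the slide.

```python
from collections import defaultdict
from statistics import mean


def top_by_column(rows, metric, columns):
    """For each column, return (value, mean metric) for the best value."""
    result = {}
    for column in columns:
        groups = defaultdict(list)
        for row in rows:
            groups[row[column]].append(row[metric])
        best = max(groups, key=lambda value: mean(groups[value]))
        result[column] = (best, mean(groups[best]))
    return result


# A small sample of rows from the one-day cricket table.
rows = [
    {"Country": "Australia", "Player": "Michael J Clarke", "ScoreRate": 93.39,
     "Ground": "The Oval", "Versus": "England"},
    {"Country": "Australia", "Player": "Dean M Jones", "ScoreRate": 128.57,
     "Ground": "Adelaide Oval", "Versus": "Sri Lanka"},
    {"Country": "India", "Player": "Virender Sehwag", "ScoreRate": 99.0,
     "Ground": "Rangiri Dambulla International Stad.", "Versus": "Sri Lanka"},
    {"Country": "Zimbabwe", "Player": "Malcolm N Waller", "ScoreRate": 133.78,
     "Ground": "Queens Sports Club", "Versus": "New Zealand"},
    {"Country": "Zimbabwe", "Player": "Alistair D R Campbell", "ScoreRate": 79.83,
     "Ground": "Queens Sports Club", "Versus": "New Zealand"},
    {"Country": "New Zealand", "Player": "James E C Franklin", "ScoreRate": 142.02,
     "Ground": "M.Chinnaswamy Stadium", "Versus": "India"},
]

top = top_by_column(rows, "ScoreRate", ["Country", "Player", "Ground", "Versus"])
for column, (value, score) in top.items():
    print(f"{column}: {value} has the highest mean ScoreRate of {score:.2f}")
```

Groups with very few rows can dominate a ranking like this, which is one likely reason the full-dataset list is filled with unfamiliar players; a minimum group size would be a natural refinement.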


What do children in schools know, and what can they do, at different stages of elementary education?

Have the inputs made into the elementary education system had a beneficial effect or not?


HAVING BOOKS IMPROVES READING ABILITY

Having more books at home improves the performance of children when it comes to reading. (But children typically have only 1-10 books at home.)

Number of students sampled

What is the impact? How many more marks can having more books fetch?

Circle size indicates number of students with this response. Few students have no books.

Is this response (“25+ books”) good or bad? Small red bars indicate low marks. Large green bars indicate high marks. Students having 25+ books tend to score high marks.

The most common response is marked in blue. This is also the largest circle.

The graphic is summarized in words

Indicates whether the best response is the most popular. Blue means that it is not. Green means that it is. Red means that the worst level is the most popular response.
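The colour rule above can be written as a small function. This is a sketch under assumptions: the function name `popularity_colour` and the input layout are made up, and the numbers in `books` are hypothetical values chosen for illustration, not survey results.

```python
def popularity_colour(responses):
    """Map each response to (number_of_students, mean_marks) and pick a colour.

    Green: the best-scoring response is also the most popular.
    Red:   the worst-scoring response is the most popular.
    Blue:  anything else.
    """
    most_popular = max(responses, key=lambda r: responses[r][0])
    best = max(responses, key=lambda r: responses[r][1])
    worst = min(responses, key=lambda r: responses[r][1])
    if most_popular == best:
        return "green"
    if most_popular == worst:
        return "red"
    return "blue"


# Hypothetical numbers, for illustration only.
books = {
    "No books": (50, 230.0),
    "1-10 books": (900, 245.0),
    "11-25 books": (300, 252.0),
    "25+ books": (150, 260.0),
}
print(popularity_colour(books))
```

With these made-up numbers the most popular response is neither the best nor the worst, so the indicator comes out blue.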


CHILDREN LIKE GAMES, AND THEY’RE GOOD

… but playing daily hurts reading ability


WATCHING TV OCCASIONALLY IS GOOD

Children who watch TV every day don’t do as well as children who watch TV only once a week.

But children who never watch TV fare the worst.

Watching TV every day helps improve children’s reading ability a little bit more…

… but mathematical abilities fall dramatically at that point


WE HAVE A WEBSITE THAT YOU CAN EXPLORE

GRAMENER.COM/NAS
