Symbolic networks
Big data and aggregation
Kaggle
Data sets
Analyses
Conclusions
References
Symbolic networks and big data: bike sharing data
CRoNoS COST Action Working Groups Meeting
Tinbergen Institute, Amsterdam, Netherlands
1-2 September 2017
Outline
1 Big data and aggregation
2 Kaggle
3 Data sets
4 Analyses
5 Conclusions
6 References
Vladimir Batagelj: [email protected]
Last version of slides (September 2, 2017, 02:02): SymNet.pdf
Divide and conquer
A "standard" approach to dealing with large structures is the divide and conquer strategy, which (recursively) breaks down a large structure into smaller, manageable sub-structures of the same or a related type.
Let $U$ be a set of units, $u \in U$ a unit, and $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$, $\emptyset \subset C_i \subseteq U$, a partition of $U$; that is, $\bigcup_i C_i = U$ and $i \neq j \Rightarrow C_i \cap C_j = \emptyset$.
In classical data analysis the units are usually described as lists of (measured) values of selected variables (properties, attributes)
$$X(u) = [x_1(u), x_2(u), \ldots, x_m(u)]$$
collected into a data frame $X$.
Aggregation
The aggregation of a cluster $C$ is again a list of values
$$Y(C) = [y_1(C), y_2(C), \ldots, y_m(C)]$$
where $y_i(C)$ is the aggregated value of the set of values $\{x_i(u) : u \in C\}$; these lists form an aggregated data frame $Y$.
For example, for $x_i$ measured on a numerical scale, the average is usually used:
$$y_i(C) = \frac{1}{|C|} \sum_{u \in C} x_i(u)$$
Different aggregation functions are available; see Beliakov, Pradera, Calvo: Aggregation Functions, 2007.
Notes
1 This kind of aggregation can cause a large loss of information.
2 It is not compatible with the recursive decomposition, but keeping the pair $(|C|, y_i(C))$ is: for $C_1 \cap C_2 = \emptyset$ we have
$$y_i(C_1 \cup C_2) = \frac{|C_1| \, y_i(C_1) + |C_2| \, y_i(C_2)}{|C_1| + |C_2|}$$
3 The aggregated descriptions need not be from the same "space" as the descriptions of units.
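Note 2 can be illustrated with a minimal R sketch. The function names `aggregate_mean` and `merge_mean` and the sample values are made up for illustration; they are not part of the original slides.

```r
# Aggregate a cluster into the pair (size, mean); 'x' holds the values x_i(u), u in C.
aggregate_mean <- function(x) list(n = length(x), y = mean(x))

# Merge the summaries of two disjoint clusters without revisiting the units.
merge_mean <- function(a, b) {
  n <- a$n + b$n
  list(n = n, y = (a$n * a$y + b$n * b$y) / n)
}

x1 <- c(2, 4, 6)     # made-up values in C1
x2 <- c(10, 20)      # made-up values in C2
m  <- merge_mean(aggregate_mean(x1), aggregate_mean(x2))

# The merged mean agrees with the mean computed directly on C1 and C2 together.
print(all.equal(m$y, mean(c(x1, x2))))
```

Keeping only the mean, without the cluster size, would not allow this merge, which is the incompatibility the note refers to.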
Symbolic data analysis
In the late eighties Edwin Diday proposed an approach, named symbolic data analysis (SDA), in which the (aggregated) descriptions can be symbolic objects (SO) such as an interval, a list, a histogram, a distribution, etc. We get symbolic data frames. See Billard, Diday: Symbolic Data Analysis, 2012.
Using this approach we can reduce a big data frame to a small, manageable symbolic data frame and preserve much more information. To analyze symbolic data frames, new methods have to be developed: SDA.
Discrete distributions
We found the representation by discrete distributions very interesting.
The range $V$ of a variable $x_i$ is partitioned into $k_i$ subsets $\{V_j\}$. Then $y_i(C) = [y_{i1}(C), y_{i2}(C), \ldots, y_{ik_i}(C)]$ where $y_{ij}(C) = |\{u \in C : x_i(u) \in V_j\}|$.
The description based on a discrete distribution enables us to consider variables that are measured on different types of measurement scales and are based on different numbers of original (individual) units. It is also compatible with recursive decomposition.
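A minimal R sketch of this description, assuming a numerical variable and hypothetical bin boundaries `breaks`. The helper name `distribution` and the sample values are made up. It also shows the compatibility with recursive decomposition: the distribution of a union of disjoint clusters is the sum of their distributions.

```r
# Hypothetical bins V_j for a numerical variable (e.g. trip duration in minutes).
breaks <- c(0, 10, 20, 30, Inf)

# y_i(C): number of units of cluster C whose value falls into each bin V_j.
distribution <- function(x, breaks) {
  as.vector(table(cut(x, breaks = breaks, right = FALSE)))
}

c1 <- c(3, 12, 15, 45)   # made-up values for cluster C1
c2 <- c(7, 8, 25)        # made-up values for cluster C2
y1 <- distribution(c1, breaks)
y2 <- distribution(c2, breaks)

# Distribution of C1 and C2 together, obtained without revisiting the units.
y12 <- y1 + y2
print(y12)
```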
Kaggle
Some time ago I found on Kaggle
https://www.kaggle.com/benhamner/sf-bay-area-bike-share
a contest dealing with the analysis of data on the bike sharing system in the San Francisco Bay Area. After some searching it turned out that similar data sets are available for several cities around the world (mainly in the US).
Some open data sets on bike sharing systems on my disk
Bike sharing    City               Data available     # of trips
Capital         Washington, D.C.   2010/10-2016/09      14691090
Hubway          Boston             2011/07-2016/06       3930659
Divvy           Chicago            2013/01-2016/06       7867601
Citi Bike       New York           2013/07-2016/09      33319019
BABS            San Francisco      2013/08-2016/08        983648
Healthy Ride    Pittsburgh         2015/07-2016/09        118422
Indego          Philadelphia       2015/04-2016/09        673703
NiceRide        Minnesota          2010/06-2015/12       1808452
Santander C.    London             2015/01-2016/11      19212558
Data about stations
The Stations file is a snapshot of station locations and capacities during the reporting time interval:
• Station ID
• Station name
• Lat/Long coordinates
• Number of individual docking points at each station
In some cases data about station elevations are also available.
North American Bike Share Association's open data standard: gbfs (General Bikeshare Feed Specification); Systems using gbfs.
Most of the systems provide a feed service returning a JSON file with the current status of stations.
Divvy, Indego, CitiBike stations: info, status
Reading station status in R
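# Poll the Indego station status feed once a minute and save each snapshot.
wdir <- "C:/Users/batagelj/data/bikes/philly"
setwd(wdir)
stat <- "https://gbfs.bcycle.com/bcycle_indego/station_status.json"
num <- 0
# setInternet2(use = TRUE)  # Windows-only and obsolete in current R; not needed
p1 <- proc.time()
while (num < 5) {
  num <- num + 1
  fsave <- paste('status_', as.character(num), '.json', sep = '')
  # tryCatch keeps the loop running if a single download fails
  test <- tryCatch(download.file(stat, fsave, method = "auto"),
                   error = function(e) e)
  Sys.sleep(60)
  p2 <- proc.time()
  cat(p2 - p1, '\n'); flush.console()   # time used by this iteration
  p1 <- p2
}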
Data about trips
Each trip is anonymized and includes:
• Bike number
• Trip start day and time
• Trip end day and time
• Trip start station
• Trip end station
• Rider type
In some cases additional data are available: Gender, Year of birth.
Additional data sources
Weather
For cities in the US we can get weather data at NOAA, Quality Controlled Local Climatological Data: precipitation, wind, temperature, humidity, pressure.
Maps
The ESRI shape file descriptions of maps can be found using Google.
Boston, Bay Area Cities, New York, Pittsburgh
Large temporal and spatial network data.
There have been some contests on the analysis of bike sharing data, and some interesting observations were presented. Some blogs and papers have also been written on this topic.
In December 2016 there were 100 hits in WoS for the query "bike sharing system*".
Analyses
Different overall distributions: Pitts; Bay; Boston; NYC BSS
Impact of weather: temperature (day/night, winter), precipitation.
Cycles: year (temperature), week (working days/weekend), day (hours, parts of the day): week; days in a week
Other factors: subscriber/customer, trip duration, gender, rider's age, speed, elevation: age
The moves of bikes among stations made by the system can be recognized as those cases where a bike's next trip started at a station different from the one where its previous trip ended.
Arrivals/departures; Boston; Changes
Prediction: SF Bay Area: count prediction
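The rule for recognizing moves made by the system can be sketched in R as follows. The data frame `trips` and its column names (`bike`, `start`, `from`, `to`) are assumptions made for illustration; the real data sets use different names and formats.

```r
# Made-up trips; the column names are assumptions, real data sets differ.
trips <- data.frame(
  bike  = c(1, 1, 1, 2, 2),
  start = as.POSIXct(c("2016-05-01 08:00", "2016-05-01 09:00", "2016-05-01 12:00",
                       "2016-05-01 08:30", "2016-05-01 10:00"), tz = "UTC"),
  from  = c("A", "B", "D", "A", "C"),
  to    = c("B", "C", "A", "C", "B"),
  stringsAsFactors = FALSE
)

# Order each bike's trips in time, then compare the drop-off station of a trip
# with the start station of the same bike's next trip.
trips <- trips[order(trips$bike, trips$start), ]
n <- nrow(trips)
same_bike <- trips$bike[-1] == trips$bike[-n]
moved     <- same_bike & trips$from[-1] != trips$to[-n]

moves <- data.frame(bike = trips$bike[-1][moved],
                    from = trips$to[-n][moved],     # where the bike was left
                    to   = trips$from[-1][moved])   # where it reappeared
print(moves)
```

In this toy example only bike 1 was moved, from station C to station D.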
Analyses
We find two blog posts especially interesting:
Todd W. Schneider: A Tale of Twenty-Two Million Citi Bike Rides: Analyzing the NYC Bike Share System
and
Jackson Whitmore: What's happening with Healthy Ride?, April 2016.
In the following slides we present some of their results.
Year / Winter
by Todd W. Schneider
Working days / Weekend
by Todd W. Schneider
Subscribers / Customers
by Jackson Whitmore
Bike sharing data and networks
The bike sharing data can be viewed as a spatial and temporal network:
Nodes – stations: name, location, capacity, (state)
Links – trips: from, to, start time, finish time, bike's id, rider type, gender, age
From this basic network we can construct several derived (aggregated) networks.
In most systems the data about nodes are static, fixed for a longer period of time. These data could be collected using feeds.
Selecting an appropriate granulation (5 min, 15 min, 1 hour, part of a day, day, week, month, quarter, year) and some restrictions (rider type, gender, age, . . . ) we get the corresponding frequency distributions in nodes and on links.
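As an illustration of the last step, the following R sketch computes frequency distributions on links for a half-hour granulation of the day. The data frame `trips` and its column names are assumptions; a restriction (for example by rider type) would simply filter the rows first.

```r
# Made-up trips; the column names are assumptions.
trips <- data.frame(
  from  = c("A", "A", "A", "B"),
  to    = c("B", "B", "C", "A"),
  start = as.POSIXct(c("2016-05-01 08:05", "2016-05-02 08:20",
                       "2016-05-01 17:40", "2016-05-01 08:45"), tz = "UTC")
)

# Index k = 1..48 of the half hour of the day in which the trip started.
lt <- as.POSIXlt(trips$start)
k  <- lt$hour * 2 + lt$min %/% 30 + 1

# One row per link (from, to), one column per half hour: the daily
# departure distribution of each link, accumulated over all days.
link <- paste(trips$from, trips$to, sep = "->")
dist <- unclass(table(link, factor(k, levels = 1:48)))
print(dim(dist))
```

Using a factor with all 48 levels keeps empty half hours as zero columns, so all links get distributions of the same length.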
Symbolic networks
Assigning distributions to nodes and links we get a symbolic network.
There are different distributions on links:
departures: (# of trips starting in selected time interval),
activity: (# of trips active in selected time interval),
duration: (# of trips with duration in selected time interval), etc.
and in nodes, for example:
departures: the sum of link distributions for incident links,
imbalance, etc.
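A small R sketch of deriving node distributions from link distributions. The link list and the counts are made up, and only 4 time intervals are used instead of a full day. For departures, the "incident links" of a node are taken to be its outgoing links, which is an interpretation rather than something stated on the slide.

```r
# Made-up departure distributions on three links; rows are links,
# columns are time intervals (4 here for brevity).
links <- data.frame(from = c("A", "A", "B"), to = c("B", "C", "A"))
dist  <- rbind(c(2, 0, 1, 0),
               c(0, 3, 0, 0),
               c(1, 1, 0, 2))

# Departures distribution of a node: sum of the distributions of its outgoing links.
out_dist <- rowsum(dist, group = links$from)

# Summing over incoming links instead gives trips ending in a node,
# counted by the interval in which they started (a simplification).
in_dist <- rowsum(dist, group = links$to)
print(out_dist)
```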
Our analysis
NY Citi Bike data for one year, from October 2015 to September 2016: 13 266 296 trips, 678 stations.
The Citi Bike system had an expansion in August 2015.
We constructed a departures network with daily distributions with half-hour granulation.
First we looked for extreme elements (links or nodes).
In a selected time interval:
flow(u, v) = # of trips starting in a node u and finishing in a node v
out(v) = # of trips starting in a node v
in(v) = # of trips finishing in a node v
flow(u, v; k) = # of trips starting in a node u in the k-th half hour and finishing in a node v
. . .
The most active stations / Top 3
activity(v) = out(v) + in(v)
 n   station                      trips
 1   W 41 St & 8 Ave             281996
 2   Nassau Ave & Russell St     203855
 3   W 20 St & 8 Ave             200629
 4   W 16 St & The High Line     196414
 5   W 22 St & 8 Ave             188394
 6   W 45 St & 8 Ave             170593
 7   W 38 St & 8 Ave             164378
 8   E 14 St & Avenue B          163962
 9   E 53 St & Madison Ave       162828
10   W 53 St & 10 Ave            161931
Imbalance
In a selected time interval:
$$\mathrm{diff}(v) = \mathrm{out}(v) - \mathrm{in}(v)$$
$$\mathrm{fDist}(v) = \sum_{k=1}^{48} |\mathrm{out}(v;k) - \mathrm{in}(v;k)|$$
 n   station                      out      in     diff   station                     fDist
 1   5 Ave & E 73 St            60524   34559    25965   5 Ave & E 73 St             84703
 2   Van Vorst Park             29962   14920    15042   Fulton St & William St      66453
 3   8 Ave & W 33 St            57127   67592   -10465   E 75 St & 3 Ave             51297
 4   W Broadway & Spring St     15217   23544    -8327   W 22 St & 8 Ave             50530
 5   E 51 St & 1 Ave            72651   80783    -8132   E 33 St & 2 Ave             47893
 6   E 75 St & 3 Ave            56302   48891     7411   Water - Whitehall Plaza     45554
 7   Catherine St & Monroe St   36858   29455     7403   E 51 St & 1 Ave             34086
 8   E 45 St & 3 Ave            48116   41601     6515   W 37 St & 10 Ave            33865
 9   Water - Whitehall Plaza    71364   65638     5726   Cambridge Pl & Gates Ave    32562
10   6 Ave & Canal St           23473   28451    -4978   E 16 St & Irving Pl         30293
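The two imbalance measures can be computed as in the following R sketch. The matrices `out_k` and `in_k` hold made-up values of out(v; k) and in(v; k) for two hypothetical stations, with 4 intervals instead of 48 for brevity.

```r
# Made-up out(v; k) and in(v; k); rows are stations, columns are time intervals.
out_k <- rbind(A = c(5, 1, 0, 2),
               B = c(1, 4, 3, 0))
in_k  <- rbind(A = c(1, 1, 4, 2),
               B = c(1, 2, 3, 2))

out_v   <- rowSums(out_k)                # out(v)
in_v    <- rowSums(in_k)                 # in(v)
diff_v  <- out_v - in_v                  # diff(v)
fdist_v <- rowSums(abs(out_k - in_k))    # fDist(v)
print(rbind(diff = diff_v, fDist = fdist_v))
```

Both toy stations have diff(v) = 0, yet their fDist values differ (8 and 4). This shows why fDist can rank stations differently from diff: it also captures imbalances that cancel out over the day.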
Imbalance / diff
Top 4
Imbalance / fDist
Top 4
The largest flows / Top 6
Clamix – Clustering modal valued symbolic data
Two clustering methods for symbolic objects are implemented: the adapted leaders method and the adapted agglomerative hierarchical clustering Ward's method.
Clamix: R-forge, doc
Paper: V. Batagelj, N. Kejzar, and S. Korenjak-Cerne. Clustering of Modal Valued Symbolic Data. ArXiv e-prints, 1507.06683, July 2015.
We clustered the set of 589 links with flow at least 1250. This gives us typical flow distribution shapes.
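The idea of clustering links by the shape of their flow distribution can be sketched with base R alone. This is not the Clamix implementation and not the adapted Ward's method: it uses the standard `hclust` with Euclidean distances on relative frequencies, applied to simulated data. The helper `peak` and all counts are made up for illustration.

```r
set.seed(1)
# Simulated flow distributions for 6 links over 48 half hours:
# three links with a morning peak, three with an evening peak.
peak <- function(center) dpois(1:48, lambda = center)
counts <- rbind(
  t(replicate(3, rmultinom(1, 2000, peak(17))[, 1])),
  t(replicate(3, rmultinom(1, 2000, peak(36))[, 1]))
)

# Normalize each row to relative frequencies so that only the shape matters.
shapes <- counts / rowSums(counts)

# Standard Ward clustering on Euclidean distances (not the adapted method).
hc <- hclust(dist(shapes), method = "ward.D2")
groups <- cutree(hc, k = 2)
print(groups)
```

With real data, `plot(hc)` would draw the dendrogram and the cluster means of `shapes` would give the typical distribution shapes.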
Clustering of flows
[Figure: Cluster Dendrogram of the clustered links (leaf labels 1 to 589); vertical axis: Height, with ticks 0e+00, 2e-04, 4e-04, 6e-04]
Clustering of flows / 7 clusters
Conclusions
• bike sharing data are an interesting type of data,
• prepare some extended data sets; get or collect the dynamic stations data,
• additional analyses:
  • other symbolic objects: nodes (in and out distribution), links (subscriber, customer distribution), . . .
  • stability of distribution shape through time
  • . . .
• compare bike sharing systems
• Taxi (Yellow and Green) and Uber data are available for New York.
References I
1 V. Batagelj, N. Kejzar, and S. Korenjak-Cerne: Clustering of Modal Valued Symbolic Data. ArXiv e-prints, 1507.06683, July 2015.
2 Gleb Beliakov, Ana Pradera, Tomasa Calvo: Aggregation Functions: A Guide for Practitioners (Studies in Fuzziness and Soft Computing). Springer, Berlin, Heidelberg, 2007.
3 Lynne Billard, Edwin Diday: Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons, 14 May 2012.
4 Bay Area Bike Share: San Francisco Bay Area - Kaggle challenge, Open data, challenge
5 Todd W. Schneider: A Tale of Twenty-Two Million Citi Bike Rides: Analyzing the NYC Bike Share System.
6 Jackson Whitmore: What's happening with Healthy Ride?, April 2016.
Acknowledgments
This work was supported in part by the Slovenian Research Agency (research program P1-0294 and research projects J7-8279 and J1-6720).
The author's attendance at the meeting was supported by the COST Action IC1408 – CRoNoS.