Symbolic networks and big data - bike sharing data
vlado.fmf.uni-lj.si/Pub/slides/SymNets17.pdf

Symbolic networks

Big data and aggregation

Kaggle

Data sets

Analyses

Conclusions

References

Symbolic networks and big data
bike sharing data

CRoNoS COST Action Working Groups Meeting
Tinbergen Institute, Amsterdam, Netherlands

1-2 September 2017

Symbolic networks


Outline

1 Big data and aggregation
2 Kaggle
3 Data sets
4 Analyses
5 Conclusions
6 References

Vladimir Batagelj: [email protected]

Last version of slides (September 2, 2017, 02:02): SymNet.pdf


http://vladowiki.fmf.uni-lj.si/lib/exe/fetch.php?media=pub:pdf:symnet.pdf


Divide and conquer

A “standard” approach to deal with large structures is the divide and conquer strategy that (recursively) breaks down a large structure into smaller, manageable sub-structures of the same or related type.

Let $U$ be a set of units, $u \in U$ a unit, and $\mathbf{C} = \{C_1, C_2, \ldots, C_k\}$, $\emptyset \subset C_i \subseteq U$, a partition of $U$, that is,

$$\bigcup_i C_i = U \quad \text{and} \quad i \neq j \Rightarrow C_i \cap C_j = \emptyset.$$

In classic data analysis the units are usually described as lists of (measured) values of selected variables (properties, attributes)

$$X(u) = [x_1(u), x_2(u), \ldots, x_m(u)]$$

collected into a data frame $X$.


Aggregation

The aggregation of a cluster $C$ is again a list of values

$$Y(C) = [y_1(C), y_2(C), \ldots, y_m(C)]$$

where $y_i(C)$ is the aggregated value of the set of values $\{x_i(u) : u \in C\}$. Together these lists form an aggregated data frame $Y$.
For example, for $x_i$ measured on a numerical scale the average is usually used:

$$y_i(C) = \frac{1}{|C|} \sum_{u \in C} x_i(u)$$

Different aggregation functions are available; see Beliakov, Pradera, Calvo: Aggregation Functions, 2007.
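As a small illustration of the average as an aggregation function, the following R sketch aggregates a toy data frame by cluster. The data frame `X`, its column names and the cluster labels are invented for the example and do not come from the bike sharing data.

```r
# Toy data frame X: one row per unit u, two numerical variables.
# The values and the cluster labels are invented for illustration.
X <- data.frame(
  cluster = c("C1", "C1", "C1", "C2", "C2"),
  x1 = c(2, 4, 6, 10, 20),
  x2 = c(1, 1, 4, 3, 5)
)

# Y(C): the average of each variable over the units of a cluster.
Y <- aggregate(cbind(x1, x2) ~ cluster, data = X, FUN = mean)
print(Y)
```

Each row of `Y` is the aggregated description of one cluster; any other aggregation function could be passed as `FUN`.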


Notes

1 This kind of aggregation can produce a big loss of information.

2 It is not compatible with the recursive decomposition, but keeping the pair $(|C|, y_i(C))$ is: for $C_1 \cap C_2 = \emptyset$ we have

$$y_i(C_1 \cup C_2) = \frac{|C_1|\, y_i(C_1) + |C_2|\, y_i(C_2)}{|C_1| + |C_2|}$$

3 The aggregated descriptions need not be from the same “space” as the descriptions of units.
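Note 2 can be checked with a few lines of R. This is only a sketch: the helper names `summarize` and `merge_summaries` and the values are made up for the illustration.

```r
# A cluster summary keeps the size n = |C| together with the mean y.
summarize <- function(x) list(n = length(x), y = mean(x))

# Merge the summaries of two disjoint clusters without the raw data.
merge_summaries <- function(a, b) {
  n <- a$n + b$n
  list(n = n, y = (a$n * a$y + b$n * b$y) / n)
}

x_C1 <- c(2, 4, 6)   # invented values of x_i on C1
x_C2 <- c(10, 20)    # invented values of x_i on C2

merged <- merge_summaries(summarize(x_C1), summarize(x_C2))
direct <- summarize(c(x_C1, x_C2))

# The plain mean of the two means ignores the cluster sizes.
naive <- mean(c(mean(x_C1), mean(x_C2)))
```

Here `merged$y` agrees with `direct$y`, while `naive` does not, which is why the size has to be kept next to the average.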


Symbolic data analysis

In the late eighties Edwin Diday proposed an approach, named symbolic data analysis (SDA), in which the (aggregated) descriptions can be symbolic objects (SO) such as an interval, a list, a histogram, a distribution, etc. We get symbolic data frames. See Billard, Diday: Symbolic Data Analysis, 2012.

Using this approach we can reduce a big data frame to a small, manageable symbolic data frame and preserve much more information. To analyze symbolic data frames new methods have to be developed: this is the task of SDA.


Discrete distributions

We find the representation by discrete distributions especially interesting.

The range $V$ of a variable $x_i$ is partitioned into $k_i$ subsets $\{V_j\}$. Then $y_i(C) = [y_{i1}(C), y_{i2}(C), \ldots, y_{ik_i}(C)]$ where $y_{ij}(C) = |\{u \in C : x_i(u) \in V_j\}|$.

A description based on discrete distributions enables us to consider variables that are measured on different types of measurement scales and that are based on different numbers of original (individual) units. It is also compatible with recursive decomposition.
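A minimal R sketch of such a description, assuming a numerical variable (invented trip durations in minutes) and an arbitrary partition of its range into three subsets; the helper name `distribution` is hypothetical.

```r
# Invented trip durations (minutes) for the units of two disjoint clusters.
x_C1 <- c(3, 7, 12, 25, 8)
x_C2 <- c(45, 9, 14)

# Partition of the range into k = 3 subsets V_1, V_2, V_3.
breaks <- c(0, 10, 30, Inf)
labels <- c("[0,10)", "[10,30)", "[30,Inf)")

# y(C): number of units of C whose value falls into each V_j.
distribution <- function(x) {
  table(cut(x, breaks = breaks, labels = labels, right = FALSE))
}

y_C1 <- distribution(x_C1)
y_C2 <- distribution(x_C2)

# For disjoint clusters the counts simply add up.
y_union <- distribution(c(x_C1, x_C2))
```

The vector `y_C1 + y_C2` equals `y_union`, which is the compatibility with recursive decomposition mentioned above.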


Kaggle

Some time ago I found on Kaggle

https://www.kaggle.com/benhamner/sf-bay-area-bike-share

a contest dealing with an analysis of data on the bike sharing system in the San Francisco Bay Area. After some searching it turned out that similar data sets are available for several cities around the world (mainly in the US).


Some Open data sets on Bike Sharing Systems
on my disk

Bike sharing    City               data available     # of trips
Capital         Washington, D.C.   2010/10-2016/09      14691090
Hubway          Boston             2011/07-2016/06       3930659
Divvy           Chicago            2013/01-2016/06       7867601
Citi Bike       New York           2013/07-2016/09      33319019
BABS            San Francisco      2013/08-2016/08        983648
Healthy Ride    Pittsburgh         2015/07-2016/09        118422
Indego          Philadelphia       2015/04-2016/09        673703
NiceRide        Minnesota          2010/06-2015/12       1808452
Santander C.    London             2015/01-2016/11      19212558


https://www.capitalbikeshare.com/trip-history-data

https://s3.amazonaws.com/hubway-data/index.html

https://www.divvybikes.com/system-data

https://www.citibikenyc.com/system-data

http://www.bayareabikeshare.com/open-data

https://healthyridepgh.com/data/

https://www.rideindego.com/about/data/

https://www.niceridemn.org/data/

http://cycling.data.tfl.gov.uk/


Data about stations

The Stations file is a snapshot of station locations and capacities during the reporting time interval:

• Station ID

• Station name

• Lat/Long coordinates

• Number of individual docking points at each station

In some cases data about station elevations are also available.

The North American Bike Share Association’s open data standard is gbfs, the General Bikeshare Feed Specification; see also the list of systems using gbfs.

Most of the systems provide a feed service returning a JSON file with the current status of stations.
Divvy, Indego, CitiBike stations: info, status


https://github.com/NABSA/gbfs

https://github.com/NABSA/gbfs/blob/master/systems.csv

https://feeds.divvybikes.com/stations/stations.json

http://www.rideindego.com/stations/json/

https://gbfs.citibikenyc.com/gbfs/en/station_information.json

https://gbfs.citibikenyc.com/gbfs/en/station_status.json


Reading station status in R

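# Save the Indego station status feed five times, one minute apart.
wdir <- "C:/Users/batagelj/data/bikes/philly"
setwd(wdir)
stat <- "https://gbfs.bcycle.com/bcycle_indego/station_status.json"
num <- 0
# setInternet2(use = TRUE) was needed only on old Windows builds of R;
# it is obsolete in current versions of R, so the call is left out here.
p1 <- proc.time()
while (num < 5) {
  num <- num + 1
  fsave <- paste0("status_", num, ".json")
  # Keep going if a single download fails; report the error instead.
  test <- tryCatch(download.file(stat, fsave, method = "auto"),
                   error = function(e) e)
  if (inherits(test, "error")) cat("download failed:", conditionMessage(test), "\n")
  Sys.sleep(60)
  # Print the time spent in this round (user, system, elapsed, ...).
  p2 <- proc.time()
  cat(p2 - p1, "\n"); flush.console()
  p1 <- p2
}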


Data about trips

Each trip is anonymized and includes:

• Bike number

• Trip start day and time

• Trip end day and time

• Trip start station

• Trip end station

• Rider type

In some cases additional data are available: Gender, Year of birth.


Additional data sources

Weather
For cities in the US we can get weather data from NOAA, Quality Controlled Local Climatological Data: precipitation, wind, temperature, humidity, pressure.

Maps
ESRI shapefile descriptions of maps can be found using Google: Boston, Bay Area Cities, New York, Pittsburgh.

Large temporal and spatial network data.
There were some contests on the analysis of bike sharing data, and some interesting observations were presented. Some blogs and papers were also written on this topic.
In December 2016 there were 100 hits in WoS for the query "bike sharing system*".


https://www.ncdc.noaa.gov/qclcd/QCLCD

https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/quality-controlled-local-climatological-data-qclcd


https://data.cityofboston.gov/City-Services/Boston-Neighborhood-Shapefiles/af56-j7tb/data

https://data.sfgov.org/Government/Bay-Area-Cities-Zipped-Shapefile-Format-/nghj-u9xk/data

http://www1.nyc.gov/site/planning/data-maps/open-data/districts-download-metadata.page

http://pittsburghpa.gov/dcp/gis/


Analyses

Different overall distributions: Pitts; Bay; Boston; NYC BSS

Impact of weather: temperature (day/night, winter), precipitation.

Cycles: year (temperature), week (working days/weekend), day (hours, parts of the day): week; days in a week

Other factors: subscriber/customer, trip duration, gender, rider’s age, speed, elevation: age

Moves of bikes among stations made by the system can be recognized as those cases where a bike’s next trip started at a station different from the one where its previous trip ended.
Arrivals/departures; Boston; Changes

Prediction: SF Bay Area: count prediction
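The rule for recognizing moves of bikes made by the system can be sketched in R as follows. The data frame `trips` and its column names are invented for the illustration; real trip files use different column names in each system.

```r
# Invented trips of two bikes, in no particular order.
trips <- data.frame(
  bike = c(1, 1, 1, 2, 2),
  start_time = as.POSIXct(c("2016-05-01 08:00", "2016-05-01 09:00",
                            "2016-05-01 12:00", "2016-05-01 08:30",
                            "2016-05-01 10:00"), tz = "UTC"),
  start_station = c("A", "B", "D", "C", "A"),
  end_station   = c("B", "C", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# Order the trips of each bike by time.
trips <- trips[order(trips$bike, trips$start_time), ]

# Compare consecutive trips of the same bike: the bike was moved by the
# system if the next trip starts where the previous one did not end.
n <- nrow(trips)
same_bike <- trips$bike[-1] == trips$bike[-n]
moved <- same_bike & trips$start_station[-1] != trips$end_station[-n]

moves <- data.frame(
  bike = trips$bike[-1][moved],
  from = trips$end_station[-n][moved],
  to   = trips$start_station[-1][moved]
)
print(moves)
```

In this toy example bike 1 ends a trip at station C and starts its next trip at station D, so one move from C to D is reported.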


http://suds-cmu.org/2016/04/21/whats-happening-with-healthy-ride/

http://thfield.github.io/babs/

http://hubwaydatachallenge.org/submission/22/

http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/

http://ariofsevit.com/hubway/

http://lejnine.com/bike_analysis.html



http://zsobhani.github.io/hubway-team-viz/


https://www.kaggle.com/jlinsche/d/benhamner/sf-bay-area-bike-share/count-prediction/discussion


Analyses

We find two blog posts especially interesting:

Todd W. Schneider: A Tale of Twenty-Two Million Citi Bike Rides: Analyzing the NYC Bike Share System

and

Jackson Whitmore: What’s happening with Healthy Ride?, April 2016.

In the following slides we present some results from them.


Year / Winter
by Todd W. Schneider


Working days / Weekend
by Todd W. Schneider


Subscribers / Customers
by Jackson Whitmore


Bike sharing data and networks

The bike sharing data can be viewed as a spatial and temporal network:
Nodes (stations): name, location, capacity, (state)
Links (trips): from, to, start time, finish time, bike’s id, rider type, gender, age

From this basic network we can construct several derived (aggregated) networks.

In most systems the data about nodes are static, fixed for a longer period of time. These data could be collected using feeds.

Selecting an appropriate granularity (5 min, 15 min, 1 hour, part of a day, day, week, month, quarter, year) and some restrictions (rider type, gender, age, . . . ) we get the corresponding frequency distributions in nodes and on links.
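How a granularity and a restriction lead to a frequency distribution can be sketched in R. Everything here is an assumption made for the illustration: the toy data frame `trips`, its column names, and the helper `slot_index`, which maps a time of day to the index of its interval for a granularity given in minutes.

```r
# Invented trips: start station, rider type and start time.
trips <- data.frame(
  from = c("A", "A", "A", "B"),
  rider = c("subscriber", "customer", "subscriber", "subscriber"),
  start_time = as.POSIXct(c("2016-05-01 08:10", "2016-05-01 08:40",
                            "2016-05-02 08:50", "2016-05-01 17:05"),
                          tz = "UTC"),
  stringsAsFactors = FALSE
)

# Index (1, 2, ...) of the interval of the day that contains the given time.
slot_index <- function(time, minutes) {
  h <- as.integer(format(time, "%H", tz = "UTC"))
  m <- as.integer(format(time, "%M", tz = "UTC"))
  (h * 60 + m) %/% minutes + 1
}

# Restriction: subscribers only; granularity: 1 hour (24 slots per day).
sub <- trips[trips$rider == "subscriber", ]
slot <- factor(slot_index(sub$start_time, 60), levels = 1:24)

# Daily distribution of departures in each node (station).
departures <- table(station = sub$from, slot = slot)
```

Changing the second argument of `slot_index` (and the number of factor levels) switches to another granularity, for example 30 for half hours with 48 slots.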


Symbolic networks

Assigning distributions to nodes and links we get a symbolic network.

There are different distributions on links:
departures: (# of trips starting in selected time interval),
activity: (# of trips active in selected time interval),
duration: (# of trips with duration in selected time interval), etc.

and in nodes, for example:

departures: the sum of link distributions for incident links,
imbalance, etc.
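A minimal R sketch of deriving a node distribution from link distributions. It assumes that, for departures, the "incident links" of a node are the links starting in it; the link list and the counts are invented, and only four time intervals are used to keep the example small.

```r
# Invented links (u, v) of a small network.
links <- data.frame(
  u = c("A", "A", "B"),
  v = c("B", "C", "A"),
  stringsAsFactors = FALSE
)

# Departures distribution on each link: one row per link,
# one column per time interval.
link_dist <- rbind(c(5, 0, 2, 1),
                   c(1, 3, 0, 0),
                   c(0, 4, 4, 2))

# Departures distribution in a node: the sum of the distributions on
# the links that start in it (assumption, see the text above).
node_departures <- rowsum(link_dist, group = links$u)
print(node_departures)
```

Node C has no outgoing link in this toy network, so it does not appear in `node_departures`.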


Our analysis

NY Citi Bike data for one year, from October 2015 to September 2016: 13 266 296 trips, 678 stations.

The Citi Bike system had an expansion in August 2015.

We constructed a departures network with daily distributions with half-hour granularity.

First we looked for extreme elements (links or nodes).

In a selected time interval:
flow(u, v) = # of trips starting in a node u and finishing in a node v
out(v) = # of trips starting in a node v
in(v) = # of trips finishing in a node v
flow(u, v; k) = # of trips starting in a node u in the k-th half hour and finishing in a node v
. . .
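These counts can be computed with base R tables. The sketch below uses an invented data frame `trips` in which the half hour `k` of the start is already given; `in` is a reserved word in R, so the vectors are named `out_v` and `in_v`.

```r
# Invented trips: start node u, end node v, half hour k of the start (1..48).
trips <- data.frame(
  u = c("A", "A", "A", "B", "C"),
  v = c("B", "B", "C", "A", "A"),
  k = c(17, 18, 17, 35, 36),
  stringsAsFactors = FALSE
)

# Use the same set of nodes for rows and columns.
nodes <- sort(unique(c(trips$u, trips$v)))
u <- factor(trips$u, levels = nodes)
v <- factor(trips$v, levels = nodes)

flow  <- table(u, v)      # flow(u, v)
out_v <- rowSums(flow)    # out(v): trips starting in a node
in_v  <- colSums(flow)    # in(v): trips finishing in a node

# flow(u, v; k): a three-way table, the third index is the half hour.
flow_k <- table(u, v, k = factor(trips$k, levels = 1:48))
```

Summing `flow_k` over its third index gives back `flow`, so the half-hour distributions refine the total flows.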


The most active stations / Top 3
activity(v) = out(v) + in(v)

 n  station                     trips
 1  W 41 St & 8 Ave            281996
 2  Nassau Ave & Russell St    203855
 3  W 20 St & 8 Ave            200629
 4  W 16 St & The High Line    196414
 5  W 22 St & 8 Ave            188394
 6  W 45 St & 8 Ave            170593
 7  W 38 St & 8 Ave            164378
 8  E 14 St & Avenue B         163962
 9  E 53 St & Madison Ave      162828
10  W 53 St & 10 Ave           161931


Imbalance

In a selected time interval:

$$\mathrm{diff}(v) = \mathrm{out}(v) - \mathrm{in}(v)$$

$$\mathrm{fDist}(v) = \sum_{k=1}^{48} |\mathrm{out}(v; k) - \mathrm{in}(v; k)|$$

Stations with the largest |diff|:

 n  station                       out      in     diff
 1  5 Ave & E 73 St             60524   34559    25965
 2  Van Vorst Park              29962   14920    15042
 3  8 Ave & W 33 St             57127   67592   -10465
 4  W Broadway & Spring St      15217   23544    -8327
 5  E 51 St & 1 Ave             72651   80783    -8132
 6  E 75 St & 3 Ave             56302   48891     7411
 7  Catherine St & Monroe St    36858   29455     7403
 8  E 45 St & 3 Ave             48116   41601     6515
 9  Water - Whitehall Plaza     71364   65638     5726
10  6 Ave & Canal St            23473   28451    -4978

Stations with the largest fDist:

 n  station                     fDist
 1  5 Ave & E 73 St             84703
 2  Fulton St & William St      66453
 3  E 75 St & 3 Ave             51297
 4  W 22 St & 8 Ave             50530
 5  E 33 St & 2 Ave             47893
 6  Water - Whitehall Plaza     45554
 7  E 51 St & 1 Ave             34086
 8  W 37 St & 10 Ave            33865
 9  Cambridge Pl & Gates Ave    32562
10  E 16 St & Irving Pl         30293
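The difference between the two imbalance measures can be seen on a small R example. The half-hour counts below are invented: a station with departures in the morning and the same number of arrivals in the evening.

```r
# Invented half-hour counts for one station v: out(v; k) and in(v; k).
out_k <- rep(0, 48)
in_k  <- rep(0, 48)
out_k[17:18] <- c(30, 20)   # departures between 8:00 and 9:00
in_k[35:36]  <- c(25, 25)   # arrivals between 17:00 and 18:00

# diff(v) = out(v) - in(v), computed from the daily totals.
diff_v <- sum(out_k) - sum(in_k)

# fDist(v): sum over the 48 half hours of |out(v; k) - in(v; k)|.
fDist_v <- sum(abs(out_k - in_k))
```

For this station `diff_v` is 0 while `fDist_v` is 100: the station is balanced over the whole day but strongly imbalanced within the day, which only fDist detects.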


Imbalance / diff
Top 4


Imbalance / fDist
Top 4


The largest flows / Top 6


Clamix – Clustering modal valued symbolic data

Two clustering methods for symbolic objects are implemented: the adapted leaders method and the adapted agglomerative hierarchical clustering Ward’s method.
Clamix: R-forge, doc
Paper: V. Batagelj, N. Kejzar, and S. Korenjak-Cerne. Clustering of Modal Valued Symbolic Data. ArXiv e-prints, 1507.06683, July 2015.

We clustered the set of 589 links with flow at least 1250. This gives us typical flow distribution shapes.
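The general idea of clustering flow distribution shapes can be sketched with base R. This is not the Clamix method: as a stand-in it uses the ordinary Ward clustering from `hclust` on Euclidean distances between normalized distributions. The flows are invented, with four time intervals instead of 48.

```r
# Invented flows: 6 links, 4 time intervals (counts of trips).
flows <- rbind(c(900, 300, 100, 100),
               c(850, 350, 120,  80),
               c(100, 100, 300, 900),
               c(120,  80, 350, 850),
               c(400, 400, 400, 400),
               c(100,  50,  50, 100))
rownames(flows) <- paste0("link", 1:6)

# Keep only the links with a total flow of at least 1250.
keep <- flows[rowSums(flows) >= 1250, ]

# Compare shapes, not sizes: normalize each row to relative frequencies.
shapes <- keep / rowSums(keep)

# Stand-in for Clamix: base R Ward clustering on Euclidean distances.
tree <- hclust(dist(shapes), method = "ward.D2")
groups <- cutree(tree, k = 3)
print(groups)
```

The three groups correspond to the "early", "late" and "flat" shapes in the toy data; `plot(tree)` would draw the corresponding dendrogram.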


https://r-forge.r-project.org/R/?group_id=864

https://rdrr.io/rforge/clamix/man/clamix-package.html

https://arxiv.org/abs/1507.06683


Clustering of flows

[Figure: Cluster Dendrogram of the 589 clustered links; vertical axis: Height, with ticks from 0e+00 to 6e-04]


Clustering of flows / 7 clusters


Conclusions

• bike sharing data are an interesting type of data,

• prepare some extended data sets; get or collect the dynamic stations data,

• additional analyses:

    • other symbolic objects: nodes (in and out distribution), links (subscriber, customer distribution), . . .

    • stability of distribution shape through time

    • . . .

• compare bike sharing systems

• Taxi (Yellow and Green) and Uber data are available for New York.


References I

1 V. Batagelj, N. Kejzar, and S. Korenjak-Cerne: Clustering of Modal Valued Symbolic Data. ArXiv e-prints, 1507.06683, July 2015.

2 Gleb Beliakov, Ana Pradera, Tomasa Calvo: Aggregation Functions: A Guide for Practitioners (Studies in Fuzziness and Soft Computing). Springer, Berlin, Heidelberg, 2007.

3 Lynne Billard, Edwin Diday: Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons, 14 May 2012.

4 Bay Area Bike Share: San Francisco Bay Area - Kaggle challenge, Open data, challenge.

5 Todd W. Schneider: A Tale of Twenty-Two Million Citi Bike Rides: Analyzing the NYC Bike Share System.

6 Jackson Whitmore: What’s happening with Healthy Ride?, April 2016.


https://arxiv.org/abs/1507.06683


http://www.bayareabikeshare.com/open-data

https://www.kaggle.com/c/bike-sharing-demand





Acknowledgments

This work was supported in part by the Slovenian Research Agency (research program P1-0294 and research projects J7-8279 and J1-6720).

The author’s attendance at the meeting was supported by the COST Action IC1408 (CRoNoS).
