CSE6242 / CX4242: Data & Visual Analytics
http://poloclub.gatech.edu/cse6242
Data Collection
Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
Data you can just download
Yahoo WebScope datasets
NYC Taxi data: Trip (11GB), Fare (7.7GB)
StackOverflow (xml)
Atlanta crime data (csv)
Soccer statistics
…
More on course website: http://poloclub.gatech.edu/cse6242/2016fall/#datasets
Data you can just download
If you have leads on other datasets, let us know on Piazza!
http://yahoolabs.tumblr.com/post/137281912191/yahoo-releases-the-largest-ever-machine-learning?soc_src=mail&soc_trk=ma
Collect Data via APIs
Twitter (small subset): https://dev.twitter.com/streaming/overview
Last.fm (Pandora has unofficial API)
Flickr
Facebook (your friends only)
CrunchBase (database about companies)
Rotten Tomatoes not free anymore :-(
iTunes
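Collecting data through an API usually comes down to two steps: build a request URL from query parameters, then pull the fields you need out of the JSON response. The sketch below illustrates both with the iTunes Search API. The endpoint, the parameter names (`term`, `entity`, `limit`), and the `results` / `trackName` response fields are assumptions based on Apple's public documentation, and `sample_response` is a hypothetical payload so the example runs offline.

```python
import json
from urllib.parse import urlencode

# Assumption: public iTunes Search API endpoint and parameter names.
ITUNES_SEARCH_URL = "https://itunes.apple.com/search"


def build_search_url(term, entity="software", limit=5):
    """Return the full request URL for a keyword search."""
    query = urlencode({"term": term, "entity": entity, "limit": limit})
    return f"{ITUNES_SEARCH_URL}?{query}"


def extract_names(response_text):
    """Return the 'trackName' of every result in a JSON response body."""
    payload = json.loads(response_text)
    return [item["trackName"]
            for item in payload.get("results", [])
            if "trackName" in item]


# Hypothetical response body (not real API output), used to avoid a network call.
sample_response = (
    '{"resultCount": 2, "results": '
    '[{"trackName": "Shazam"}, {"trackName": "Spotify"}]}'
)

url = build_search_url("music")
names = extract_names(sample_response)
print(url)
print(names)
```

To query the live service, the body could be fetched with `urllib.request.urlopen(url).read().decode("utf-8")` and passed to `extract_names`; check the provider's terms and rate limits first.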
How to Scrape? Google Play example
Goal: build network of similar apps
https://play.google.com/store/apps/details?id=com.shazam.android&hl=en
https://play.google.com/store/apps/details?id=com.spotify.music&hl=en
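Each app page is identified by the `id` query parameter in its URL, so one way to build the similar-app network is to collect the `id` values of all app links found on a page and record an edge from the current app to each of them. The following is a minimal sketch of that idea, not the course's actual scraper: `sample_html` is a hypothetical snippet (the real page markup is not shown in the slides and may differ), and `AppLinkParser` and `similar_app_edges` are illustrative names.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class AppLinkParser(HTMLParser):
    """Collect app ids from links of the form .../store/apps/details?id=..."""

    def __init__(self):
        super().__init__()
        self.app_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        if parsed.path.endswith("/store/apps/details"):
            for app_id in parse_qs(parsed.query).get("id", []):
                if app_id not in self.app_ids:  # keep first occurrence only
                    self.app_ids.append(app_id)


def similar_app_edges(source_id, page_html):
    """Return (source, neighbor) edges for every other app linked on the page."""
    parser = AppLinkParser()
    parser.feed(page_html)
    return [(source_id, other) for other in parser.app_ids if other != source_id]


# Hypothetical page fragment standing in for a downloaded app page.
sample_html = """
<div>
  <a href="/store/apps/details?id=com.spotify.music&amp;hl=en">Spotify</a>
  <a href="/store/apps/details?id=com.shazam.android">Shazam</a>
  <a href="/store/apps/details?id=com.example.radio">Example Radio</a>
  <a href="/store/apps/details?id=com.spotify.music">Spotify again</a>
  <a href="/about">About</a>
</div>
"""

edges = similar_app_edges("com.shazam.android", sample_html)
print(edges)
```

Repeating this for each newly discovered app id (a breadth-first crawl) would grow the network; a real crawler should also pause between requests and respect the site's terms of use.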
SQLite
Most popular embedded database in the world: iPhone (iOS), Android, Chrome (browsers), Mac, etc.
Self-contained: one file contains data + schema
Serverless: database right on your computer
Zero-configuration: no need to set up!
http://www.sqlite.org
http://www.sqlite.org/different.html
SQL Refresher: create table
>sqlite3 database.db
sqlite> create table student(ssn integer, name text);
sqlite> .schema
CREATE TABLE student(ssn integer, name text);
Table student (empty):
ssn  name
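The same step can be run from Python's built-in `sqlite3` module, which also makes the "one file, no server" property easy to check. This is a sketch under the assumption that a throwaway temporary directory is acceptable; the file name `database.db` mirrors the shell command above, and reading `sqlite_master` is the programmatic counterpart of `.schema`.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    db_path = os.path.join(workdir, "database.db")

    # Connecting creates the database file if it does not exist yet.
    conn = sqlite3.connect(db_path)
    conn.execute("create table student(ssn integer, name text)")
    conn.commit()

    # sqlite_master stores the CREATE statement of every table.
    schema = [row[0] for row in conn.execute(
        "select sql from sqlite_master where type = 'table'")]
    conn.close()

    # After closing, the whole database lives in a single file.
    files = os.listdir(workdir)

print(schema)
print(files)
```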
SQL Refresher: insert rows
insert into student values(111, 'Smith');
insert into student values(222, 'Johnson');
insert into student values(333, 'Obama');
select * from student;
ssn  name
111  Smith
222  Johnson
333  Obama
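When rows come from collected data rather than hand-typed statements, parameterized inserts are the usual approach. The sketch below loads the same three students with `executemany` and `?` placeholders; the in-memory database (`:memory:`) and the `order by ssn` clause are choices made here to keep the example self-contained and its output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database, nothing written to disk
conn.execute("create table student(ssn integer, name text)")

# Placeholders let sqlite3 handle quoting, which matters for scraped text.
students = [(111, "Smith"), (222, "Johnson"), (333, "Obama")]
conn.executemany("insert into student values(?, ?)", students)
conn.commit()

rows = conn.execute("select * from student order by ssn").fetchall()
conn.close()

for ssn, name in rows:
    print(ssn, name)
```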
SQL Refresher: create another table
create table takes (ssn integer, course_id integer, grade integer);
sqlite> .schema
CREATE TABLE student(ssn integer, name text);
CREATE TABLE takes (ssn integer, course_id integer, grade integer);
Table takes (empty):
ssn  course_id  grade
SQL Refresher: joining 2 tables
More than one table: joins
E.g., create roster for this course (6242)
Table takes:
ssn  course_id  grade
111  6242       100
222  6242       90
222  4000       80
Table student:
ssn  name
111  Smith
222  Johnson
333  Obama
SQL Refresher: joining 2 tables + filtering
select name from student, takes where student.ssn = takes.ssn and takes.course_id = 6242;
Table takes:
ssn  course_id  grade
111  6242       100
222  6242       90
222  4000       80
Table student:
ssn  name
111  Smith
222  Johnson
333  Obama
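The roster query can be checked end to end with a small script. The sketch below rebuilds both tables from the slide data in an in-memory database, runs the comma-style join from the slide, and runs an equivalent explicit `join ... on` form for comparison. The `order by name` clause and the `?` placeholder for the course id are additions made here for deterministic output and safe parameter passing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table student(ssn integer, name text);
    create table takes(ssn integer, course_id integer, grade integer);
    insert into student values (111, 'Smith'), (222, 'Johnson'), (333, 'Obama');
    insert into takes values (111, 6242, 100), (222, 6242, 90), (222, 4000, 80);
""")

# Comma-style join with the join condition in the where clause (as on the slide).
implicit = conn.execute(
    "select name from student, takes "
    "where student.ssn = takes.ssn and takes.course_id = ? "
    "order by name", (6242,)).fetchall()

# Equivalent explicit join: join condition in 'on', filter in 'where'.
explicit = conn.execute(
    "select name from student join takes on student.ssn = takes.ssn "
    "where takes.course_id = ? "
    "order by name", (6242,)).fetchall()
conn.close()

roster = [name for (name,) in implicit]
print(roster)
```

Obama does not appear in the roster because no row in `takes` matches ssn 333.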
SQL General Form
select a1, a2, ... an
from t1, t2, ... tm
where predicate
[group by ...]
[having ...]
[order by ...]
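Find ssn and GPA for each student
select ssn, avg(grade) from takes group by ssn;
Table takes:
ssn  course_id  grade
111  6242       100
222  6242       90
222  4000       80
Result:
ssn  avg(grade)
111  100.0
222  85.0
Adding a having clause keeps only the groups that pass the filter:
select ssn, avg(grade) from takes group by ssn having avg(grade) > 90;
Result:
ssn  avg(grade)
111  100.0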
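The difference between grouping alone and grouping with `having` is easiest to see by running both. This sketch loads the `takes` rows from the slide into an in-memory database and runs the aggregate query with and without the filter; `order by ssn` is added here only to fix the output order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table takes(ssn integer, course_id integer, grade integer)")
conn.executemany(
    "insert into takes values(?, ?, ?)",
    [(111, 6242, 100), (222, 6242, 90), (222, 4000, 80)])

# One output row per ssn; avg() is computed over that student's rows.
gpa_all = conn.execute(
    "select ssn, avg(grade) from takes group by ssn order by ssn").fetchall()

# 'having' filters whole groups after aggregation ('where' filters rows before).
gpa_high = conn.execute(
    "select ssn, avg(grade) from takes group by ssn "
    "having avg(grade) > 90 order by ssn").fetchall()
conn.close()

print(gpa_all)
print(gpa_high)
```

Student 222 averages (90 + 80) / 2 = 85, so that group is dropped once the `having avg(grade) > 90` condition is applied.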