The role of big data in education for the social sciences.
BACK TO BASICS: BIG DATA AND EDUCATION IN THE SOCIAL SCIENCES
Matthew S. Weber
Rutgers University
AEJMC 2014
Montreal, Canada
Breaking down the walls of big data?
http://archivehub.rutgers.edu
EXAMPLE: Undergraduates
Learning About Your Network
• By being aware of your connections, you can take an active role in managing them
– Be aware of the connections that you have, and what they contribute to your “network”
– Seek out networking opportunities
– Forge connections with people you admire and respect
LinkedIn Network Maps
Assignment Prompt
Prompt: Use www.touchgraph.com/facebook to generate a map of your Facebook network. Spend some time exploring your different connections, and then respond to the following:
• What different types of clusters do you see? Be specific in identifying at least 2 – 3 different clusters.
• Is there someone in your network you forgot about? Who? Why?
• Identify 2 people who you feel are the most useful connections in your network based on where they are positioned. Who are they and why are they useful?
EXAMPLE: PhD
-- Load selected fields from an Internet Archive WAT file.
SET DEFAULT_PARALLEL 30;
titles = LOAD '/home/hai/Projects/HistoryCrawl/Data/IA/2_26_2014/nsf1.wat.gz'
    USING org.archive.hadoop.ArchiveJSONViewLoader(
        'Envelope.Payload-Metadata.HTTP-Response-Metadata.HTML-Metadata',
        'Envelope.WARC-Header-Metadata.WARC-Target-URI',
        'Envelope.WARC-Header-Metadata.WARC-Date',
        'Envelope.WARC-Header-Metadata.Content-Type',
        'Envelope.WARC-Header-Metadata.Content-Length')
    AS (links:chararray, target:chararray, date:chararray, contenttype:chararray, contentlength:chararray);

-- Drop records that carry no link metadata.
nonnulls = FILTER titles BY links IS NOT NULL;

-- Run the project's parser UDF, then flatten its url bag to one row per entry.
paths = FOREACH nonnulls GENERATE org.sci.historycrawl.parser($0,$1,$2), $2, $3, $4;
i6 = FOREACH paths GENERATE bagwati.url, $1, $2, $3;
i7 = FOREACH i6 GENERATE FLATTEN($0) AS words, org.sci.historycrawl.formatdate(SUBSTRING($1,0,10)), $2, $3;

-- Split each entry into source URL, destination URL, and link text; cast content length to long.
i8 = FOREACH i7 GENERATE org.sci.historycrawl.getsourceURL($0), org.sci.historycrawl.getdstURL($0), org.sci.historycrawl.getText($0), $1, $2, (long)$3;

-- Group by (source, destination, date); count the rows and sum the content length per group.
i9 = GROUP i8 BY ($0,$1,$3);
i10 = FOREACH i9 GENERATE FLATTEN(group), FLATTEN(TOP(1,0,i8.$2)), COUNT(i8), FLATTEN(TOP(1,0,i8.$4)), SUM(i8.$5);

-- Drop rows with a null source or destination, then write the result.
i11 = FILTER i10 BY $0 IS NOT NULL;
i12 = FILTER i11 BY $1 IS NOT NULL;
STORE i12 INTO '/home/hai/Projects/HistoryCrawl/Data/IA/2_26_2014/HC_Output' USING PigStorage();
Source | Destination | Date | Frequency | Content Type | Bytes | Descriptive Text
Link Data:
http://gawker.com/5953665/mitt-romneys-staff-played-the-media-covering-them-in-a-friendly-game-of-flag-football
Mitt Romney's Staff Played the Media Covering Them in a Friendly Game of Flag
http://gawker.com
2012-10-22
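For readers who do not work in Pig, the grouping step behind these link records can be sketched in Python. This is only an illustrative analogue, not the project's pipeline: the function name `aggregate_links`, the input tuple layout, and the `example` URLs are all assumptions made for this sketch. It also keeps the first-seen text and content type for each group, which simplifies how the Pig script selects a representative value with TOP.

```python
from collections import defaultdict


def aggregate_links(records):
    """Collapse raw link records into one row per (source, destination, date).

    Each input record is a hypothetical tuple:
        (source, destination, text, date, content_type, size_bytes)
    Each output row follows the column order on the slide:
        (source, destination, date, frequency, content_type, total_bytes, text)
    """
    groups = defaultdict(list)
    for src, dst, text, date, ctype, size in records:
        # Mirror the null filters: skip links missing a source or destination.
        if src is None or dst is None:
            continue
        groups[(src, dst, date)].append((text, ctype, size))

    rows = []
    for (src, dst, date), items in sorted(groups.items()):
        frequency = len(items)                       # number of times the link was seen
        total_bytes = sum(size for _, _, size in items)
        text, ctype, _ = items[0]                    # simplification: first-seen values
        rows.append((src, dst, date, frequency, ctype, total_bytes, text))
    return rows


# Made-up sample records, for illustration only.
sample = [
    ("http://a.example", "http://b.example/story", "Story", "2012-10-22", "text/html", 1200),
    ("http://a.example", "http://b.example/story", "Story", "2012-10-22", "text/html", 800),
    ("http://a.example", "http://c.example", "Other", "2012-10-23", "text/html", 500),
    (None, "http://c.example", "Dropped", "2012-10-23", "text/html", 10),
]

rows = aggregate_links(sample)
for row in rows:
    print(" | ".join(str(field) for field in row))
```

Grouping on the date as well as the URL pair keeps each day's links separate, so the frequency column can be read as a link weight for that day.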
Dataset | Research Potential | Dates | Captures | Unique URLs
Hurricane Katrina | Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination | 2003 – 2012 | 1,694,236 | 663,740
Superstorm Sandy | | 2003 – 2012 | 41,703,112 | 20,013,455
US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th – 112th Congresses | 26,965,770 | 8,674,397
US House | | | 51,840,777 | 12,410,014
Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010 – 2012 | 247,928,272 | 113,259,655
US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns | 2008 – 2012 | 1,315,132,555 | 539,184,823
• Email me! [email protected]
• ArchiveHub: http://archivehub.rutgers.edu
• The Team
– Kris Carpenter, Vinay Goel, Internet Archive
– David Lazer, Katherine Ognyanova, Northeastern University
– Allie Kosterich, Hai Nguyen, Luan Nguyen, Marya Doerfel, Rutgers University
– Peter Monge, Ayushman Datta, Kristen Guth, USC
Research supported by NSF Award #1244727 and the NetSCI Lab @ Rutgers