Archiving the Mobile Web Frank McCown, Monica Yarbrough, & Keith Enlow Computer Science Dept Harding University WADL 2013 Indianapolis, IN July 25, 2013

Archiving the Mobile Web


DESCRIPTION

Presented at WADL 2013 in Indianapolis, Indiana.


Page 1: Archiving the Mobile Web

Archiving the Mobile Web

Frank McCown, Monica Yarbrough, & Keith Enlow

Computer Science Dept
Harding University

WADL 2013
Indianapolis, IN

July 25, 2013

Page 2: Archiving the Mobile Web

Mobile vs. Stationary Web

Page 3: Archiving the Mobile Web

Mobile Web-Related Markup Languages

http://en.wikipedia.org/wiki/File:Mobile_Web_Standards_Evolution_Vector.svg

Smartphone era

Page 4: Archiving the Mobile Web

Two Types of Mobile Web

• Feature Phone Web: cHTML (iMode), WML, WAP, etc.

• Smartphone Web: XHTML, HTML5, etc.

Page 5: Archiving the Mobile Web
Page 6: Archiving the Mobile Web

Serving Up Mobile Sites

1. Responsive web design

• Same HTML content to desktop and mobile

• CSS media queries alter appearance

<!-- CSS media query on a link element -->
<link rel="stylesheet" media="(max-width: 800px)" href="example.css" />

<!-- CSS media query within a style sheet -->
<style>
@media (max-width: 600px) {
  .sidebar { display: none; }
}
</style>

Page 7: Archiving the Mobile Web

Example of Responsive Web Design

Page 8: Archiving the Mobile Web

Serving Up Mobile Sites

1. Responsive web design

• Same HTML content to desktop and mobile

• CSS media queries alter appearance

2. Redirect mobile user agent to mobile site

• Client-side redirection

• Server-side redirection

Page 9: Archiving the Mobile Web

Client-Side Redirection

• JavaScript detects mobile user agent

// From www.harding.edu
// queryString is assumed to be defined elsewhere on the page (not shown on the slide)
var ua = navigator.userAgent.toLowerCase();
if (queryString.match('version=mobile') ||
    ua.match(/IEMobile|Windows CE|NetFront|PlayStation|like Mac OS X|MIDP|UP\.Browser|Symbian|Nintendo|BlackBerry|mobile/i)) {
  // iPads are left on the desktop site
  if (!ua.match('ipad')) {
    if (window.location.pathname.match('.html'))
      // page.html becomes page.m.html
      window.location = window.location.pathname.replace('.html', '.m.html');
    else
      // directory paths go to their mobile index page
      window.location = window.location.pathname + 'index.m.html';
  }
}

Page 10: Archiving the Mobile Web

Client-Side Redirection

Page 11: Archiving the Mobile Web

Server-Side Redirection

• Server routes mobile user agent to different page

Apache Example:

RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} (android|bb\d+|meego).+mobile|avantgo|bada\/|blackberry|blazer|etc…|zte\-) [NC]
RewriteRule ^$ http://detectmobilebrowser.com/mobile [R,L]

https://developers.google.com/webmasters/smartphone-sites/details
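The rewrite condition above amounts to a case-insensitive regular-expression test on the User-Agent header, followed by a redirect of the site root. A minimal Python sketch of the same idea follows. The pattern is a short illustrative subset rather than the full detectmobilebrowser.com expression, and `is_mobile_user_agent`, `redirect_target`, and the `m.example.com` address are hypothetical names chosen for this example.

```python
import re

# Illustrative subset of mobile markers (assumption); the real expression is much longer.
MOBILE_UA = re.compile(
    r"(android|bb\d+|meego).+mobile|avantgo|bada/|blackberry|blazer|ip(hone|od)",
    re.IGNORECASE,
)


def is_mobile_user_agent(user_agent: str) -> bool:
    """Return True if the User-Agent string matches one of the mobile markers."""
    return MOBILE_UA.search(user_agent) is not None


def redirect_target(path: str, user_agent: str,
                    mobile_url: str = "http://m.example.com/"):
    """Mimic the rule: only the site root is redirected, and only for mobile agents."""
    if path == "/" and is_mobile_user_agent(user_agent):
        return mobile_url
    return None
```

A crawler that wants to trigger this kind of redirect has to send a User-Agent string that the server's pattern recognizes, which is why the choice of mobile user agent matters.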

Page 12: Archiving the Mobile Web

Server-Side Redirection

Page 13: Archiving the Mobile Web

Serving Up Mobile Sites

1. Responsive web design

• Same HTML content to desktop and mobile

• CSS media queries alter appearance

2. Redirect mobile user agent to mobile site

• Client-side redirection

• Server-side redirection

3. User-agent content negotiation

• Dynamically serving different HTML for the same URL

Page 14: Archiving the Mobile Web

User-Agent Content Negotiation

• Server serves up different content for same URL

• Use Vary: User-Agent header in response

• Best method for serving content quickly
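For an archiving crawler, the `Vary` header is the signal that one URL may need to be stored once per user agent. The sketch below checks a response's headers for it; `varies_on_user_agent` is a hypothetical helper, and it assumes the headers arrive as a plain name-to-value dictionary.

```python
def varies_on_user_agent(headers: dict) -> bool:
    """Return True if the Vary header lists User-Agent (or the wildcard '*')."""
    vary = ""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "vary":
            vary = value
    fields = [field.strip().lower() for field in vary.split(",") if field.strip()]
    return "user-agent" in fields or "*" in fields
```

Sites that negotiate on user agent without sending the header will not be caught by this check, so it can only confirm negotiation, not rule it out.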

Page 15: Archiving the Mobile Web

Archiving Mobile Sites

1. Responsive web design

• Easy: Crawl like normal

• Use client tools to view page formatted for mobile

2. Redirect mobile user agent to mobile site

• Need to crawl with mobile user agent

• Need JavaScript-enabled crawler to handle client-side redirection

3. User-agent content negotiation

• Need to crawl with mobile user agent

• Need to distinguish mobile vs. desktop for same URL

Page 16: Archiving the Mobile Web

How are we doing archiving mobile sites so far?

Page 17: Archiving the Mobile Web
Page 18: Archiving the Mobile Web

Earliest archived page

Page 19: Archiving the Mobile Web

Earliest 2007 archived page: WML

Page 20: Archiving the Mobile Web

Finally some news!

Page 21: Archiving the Mobile Web

Really???

Page 22: Archiving the Mobile Web

Great…

Page 23: Archiving the Mobile Web

Only desktop version is archived!

Page 24: Archiving the Mobile Web

Mobile Finder

By Monica Yarbrough

Page 25: Archiving the Mobile Web

Google’s Suggestions for SEO

• Vary HTTP Header

• Annotations within the HTML:

• On desktop page:

• <link rel="alternate" media="only screen and (max-width: 640px)" href="http://m.example.com/page-1" >

• On mobile page:

• <link rel="canonical" href="http://www.example.com/page-1" >

• Media queries

https://developers.google.com/webmasters/smartphone-sites/

Page 26: Archiving the Mobile Web

How Mobile Finder Works

• Use both desktop and mobile user agents

• Look for:

• Redirect

• Different content

• Different stylesheets

• Media queries
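The four checks can be pictured as a comparison of two fetches of the same URL, one per user agent. The following Python sketch is not Mobile Finder's code; `classify`, the 0.5 similarity threshold, and the use of `difflib` for "different content" are all assumptions made for illustration. It takes the final URL and HTML of each fetch as arguments so that it runs without network access.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _StylesheetCollector(HTMLParser):
    """Collect stylesheet hrefs and note width-based media attributes."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()
        self.has_media_query = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and "stylesheet" in (a.get("rel") or "").lower():
            if a.get("href"):
                self.hrefs.add(a["href"])
            if "width" in (a.get("media") or "").lower():
                self.has_media_query = True


def _inspect(html):
    collector = _StylesheetCollector()
    collector.feed(html)
    # Also catch @media rules written inline in <style> blocks.
    if re.search(r"@media[^{]*\(\s*(max|min)-(device-)?width", html, re.I):
        collector.has_media_query = True
    return collector


def classify(desktop_url, desktop_html, mobile_url, mobile_html, threshold=0.5):
    """Return the first reason (in the slide's order) that suggests a mobile version."""
    if mobile_url != desktop_url:
        return "redirect"
    if SequenceMatcher(None, desktop_html, mobile_html).ratio() < threshold:
        return "differing content"
    desktop, mobile = _inspect(desktop_html), _inspect(mobile_html)
    if desktop.hrefs != mobile.hrefs:
        return "different stylesheets"
    if mobile.has_media_query:
        return "media queries"
    return "no mobile version found"
```

The order of the tests matters: a redirect is the strongest signal, while media queries are only consulted when the two responses are otherwise alike.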

Page 27: Archiving the Mobile Web

How Mobile Finder Works

• Change the URL to fit common mobile URL patterns

ex: www.t-mobile.com → m.t-mobile.com
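A trial-and-error step like this one can be sketched as a function that rewrites the host name into a few likely mobile forms, which a crawler would then try in turn. The prefix list below (`m.` and `mobile.`) is an assumption drawn from host names that appear elsewhere in these slides, and `candidate_mobile_urls` is a hypothetical name; the patterns Mobile Finder actually tries are not listed on the slide.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed common mobile host prefixes; not an exhaustive or authoritative list.
MOBILE_PREFIXES = ("m", "mobile")


def candidate_mobile_urls(url):
    """Return guesses for the mobile counterpart of a desktop URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Drop a leading "www." before adding the mobile prefix.
    bare = host[4:] if host.startswith("www.") else host
    return [
        urlunsplit((parts.scheme, f"{prefix}.{bare}", parts.path, parts.query, ""))
        for prefix in MOBILE_PREFIXES
    ]
```

Each candidate still has to be fetched and checked, since a guessed host may not exist or may simply redirect back to the desktop site.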

Page 28: Archiving the Mobile Web

PhantomJS

• Headless WebKit (browser)

• Well-known and widely used

• Used to get the content of a page

• Takes snapshots of the sites it visits

• Scriptable with CoffeeScript or JavaScript

Page 29: Archiving the Mobile Web

Web Service

• Query string with 2 parameters

• url (required)

• useragent (optional)

• http://cs.harding.edu/mobilefinder/service.php?url=URL&useragent=USER_AGENT

• Default useragent = Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; mediaqueries/1.0; +http://cs.harding.edu)
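Building a request for the service is a matter of URL-encoding the two parameters. The sketch below only constructs the request URL and does not contact the server; `build_request_url` is a hypothetical helper, and whether the endpoint is still reachable is not something the slides can confirm.

```python
from urllib.parse import urlencode

SERVICE = "http://cs.harding.edu/mobilefinder/service.php"


def build_request_url(url, useragent=None):
    """Return the service URL for a target page, with an optional user agent."""
    params = {"url": url}
    if useragent:
        # When omitted, the service falls back to its default iPhone user agent.
        params["useragent"] = useragent
    return SERVICE + "?" + urlencode(params)
```

Encoding matters here because both the target URL and most user-agent strings contain characters (slashes, spaces, semicolons) that are not safe in a query string.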

Page 30: Archiving the Mobile Web

Results

<MobileFinder>

<url>http://www.cnn.com/</url>

<mobileUrl>http://www.cnn.com/</mobileUrl>

<reason>

<code>400</code>

<message>differing content</message>

</reason>

<useragent> Mozilla/5.0 (Android; Linux armv7l; rv:9.0) Gecko/20111216 Firefox/9.0 Fennec/9.0</useragent>

<timeAccessed>2013-07-20 15:23:42</timeAccessed>

<error/>

</MobileFinder>
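A client can read a response of this shape with a standard XML parser. The sketch below uses a well-formed copy of the sample result with Python's `xml.etree.ElementTree`; `parse_result` and the keys of the returned dictionary are hypothetical, and the element names are taken from the sample rather than from a published schema.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<MobileFinder>
  <url>http://www.cnn.com/</url>
  <mobileUrl>http://www.cnn.com/</mobileUrl>
  <reason><code>400</code><message>differing content</message></reason>
  <useragent>Mozilla/5.0 (Android; Linux armv7l; rv:9.0) Gecko/20111216 Firefox/9.0 Fennec/9.0</useragent>
  <timeAccessed>2013-07-20 15:23:42</timeAccessed>
  <error/>
</MobileFinder>"""


def parse_result(xml_text):
    """Extract the fields a crawler needs from a Mobile Finder style response."""
    root = ET.fromstring(xml_text)
    return {
        "url": root.findtext("url"),
        "mobile_url": root.findtext("mobileUrl"),
        "code": int(root.findtext("reason/code")),
        "message": root.findtext("reason/message"),
        # An empty <error/> element is treated as "no error".
        "error": (root.findtext("error") or "").strip() or None,
    }
```

In this sample the desktop and mobile URLs are identical and the reason is "differing content", which corresponds to user-agent content negotiation on a single URL.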

Page 31: Archiving the Mobile Web

Limitations

• Crashing

• Inconsistent results

• Problems executing JavaScript redirection

• Reports failure even when it actually retrieves the content

• Fails to get the URL of the page accessed

• Slow

Page 32: Archiving the Mobile Web

Limitations

• Client-side Redirects

www.golferen.no/wip4/ (right)

www.ng.kz/ (below)

Page 33: Archiving the Mobile Web

Analysis Results

• Accuracy (of 100 random hand-checked results)

• 96% accurate overall

• 1% incorrectly reported no mobile version when one in fact exists

• 3% incorrectly reported a mobile version when none exists

Page 34: Archiving the Mobile Web

NYTimes desktop vs. mobile

Page 35: Archiving the Mobile Web

Rakuten.co.jp desktop vs. mobile

Page 36: Archiving the Mobile Web

Are Google’s Suggestions Used?

• 28 % found a mobile version following Google’s suggestions

• 85 % found as having some sort of mobile version

Page 37: Archiving the Mobile Web

Are Google’s Suggestions Used?

• 28 % found following Google’s suggestions

• Of the 82% that were found as not following the rules:

• 93% missing vary HTTP header

• 89% missing alternate and canonical links

Page 38: Archiving the Mobile Web

Are Google’s Suggestions Used?

• 28 % found following Google’s suggestions

• 85% found as having some sort of mobile version

• Redirect: 35%

• “Significantly” different content: 28%

• Stylesheets alone: 9%

• Stylesheets and media queries: 11%

• Media queries alone: 6%

• Differing urls (trial and error): 11%

Page 39: Archiving the Mobile Web

End Result

• As a whole, mobile web pages do not adhere to Google’s standards

• There are no truly consistent ways for finding a mobile version of a site

Page 40: Archiving the Mobile Web

Keith Enlow

Heritrix Mobile

Page 41: Archiving the Mobile Web

Introduction

• Heritrix 3.1

• Mobile Finder Web Service

• 2 Options

• Crawl desktop web pages (default)

• Crawl mobile web pages using Mobile Finder and exclude mobile web pages that use media queries.

Page 42: Archiving the Mobile Web

Experiment

• Decision Making Heritrix

• Web Service (Mobile Finder) Heritrix

• Modified Heritrix 3.1 to include two options for crawling

• Option 0: Crawl with desktop user agent

• Option 1: Crawl with mobile user agent using Mobile Finder

• Added built-in mobile user agent adapted from Googlebot

• Crawled a small set of URLs

• Used Mobile Finder to find if the given URL has mobile version

• Wrote a small script to discover differences between the mobile and desktop versions
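The slides do not show the comparison script. One plausible shape, assuming the crawled URLs are classified by file extension (the authors may instead have used MIME types from the crawl log), is a per-type count that can be computed once for the desktop crawl and once for the mobile crawl. `count_by_type` and the extension table are hypothetical.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit

# Assumed mapping from file extension to resource type.
TYPES = {
    ".html": "html", ".htm": "html",
    ".js": "javascript",
    ".css": "css",
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".gif": "image",
}


def count_by_type(urls):
    """Count crawled URLs per resource type, based on the path's extension."""
    counts = Counter()
    for url in urls:
        extension = splitext(urlsplit(url).path)[1].lower()
        counts[TYPES.get(extension, "other")] += 1
    return counts
```

Running it on both crawls of the same site gives two counters whose differences per type can be compared directly.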

Page 43: Archiving the Mobile Web

<property name="userAgentTemplate"
    value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>

<property name="userAgentTemplateMobile"
    value="Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>

<!-- Option # = Description
     0 [Default] Crawl using desktop user agent
     1 Crawl using mobile user agent + Mobile Finder Web Service -->
<property name="CrawlOption" value="0" />
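The effect of the option can be illustrated by expanding the chosen template into the User-Agent string the crawler would send. This is an illustrative Python sketch, not Heritrix's Java code; `build_user_agent` is a hypothetical name, and the version and contact URL in the usage are placeholders.

```python
DESKTOP_TEMPLATE = "Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"
MOBILE_TEMPLATE = (
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) "
    "AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 "
    "Safari/6531.22.7 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"
)

# Option 0 is the desktop crawl, option 1 the mobile crawl.
TEMPLATES = {0: DESKTOP_TEMPLATE, 1: MOBILE_TEMPLATE}


def build_user_agent(crawl_option, version, contact_url):
    """Fill in the placeholders of the template selected by the crawl option."""
    template = TEMPLATES[crawl_option]
    return (template
            .replace("@VERSION@", version)
            .replace("@OPERATOR_CONTACT_URL@", contact_url))
```

For example, `build_user_agent(1, "3.1.0", "http://example.org/bot")` yields an iPhone-style string that still identifies the crawler and its operator in the trailing parenthesis.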

Page 44: Archiving the Mobile Web

URLs Crawled

Desktop URL → Mobile URL

• www.huffingtonpost.com → m.huffpost.com
• www.foxnews.com → foxnews.mobi
• www.nbcnews.com → www.nbcnews.com
• www.whitehouse.gov → m.whitehouse.gov
• www.nasa.gov → mobile.nasa.gov
• www.ssa.gov → www.ssa.gov/mobile
• www.cornell.edu → m.cornell.edu/#home
• www.stanford.edu → m.stanford.edu
• www.mit.edu → m.mit.edu / mobile.mit.edu

Page 45: Archiving the Mobile Web
Page 46: Archiving the Mobile Web
Page 47: Archiving the Mobile Web
Page 48: Archiving the Mobile Web
Page 49: Archiving the Mobile Web
Page 50: Archiving the Mobile Web
Page 51: Archiving the Mobile Web
Page 52: Archiving the Mobile Web
Page 53: Archiving the Mobile Web
Page 54: Archiving the Mobile Web

Redirection/Delivery

• 200 Response (server side redirect)

• 302 “Temporary” relocation

• 301 “Permanent” relocation

• JavaScript Redirection (client side redirect)

• Media Queries

• Style Sheets

Page 55: Archiving the Mobile Web

Tiny Limits

• No JavaScript Engine

• Heritrix is unable to execute JavaScript code

• Unable to catch client-side redirection; it will instead continue to crawl the desktop version of the web page.

Note: The Mobile Finder Web Service will find the mobile page and therefore Heritrix will continue the crawl.

• www.nasa.gov

• www.ssa.gov

• www.cornell.edu

Page 56: Archiving the Mobile Web

Total Link Count

Site          Series 1   Series 2
Huffington    56774      2134
Fox News      12703      110
NBC News      8894       3545
NASA          4960       63
SSA           2380       53
White House   8121       570
Stanford      2351       116
Cornell       2901       94
MIT           120        124

Page 57: Archiving the Mobile Web

HTML Distribution

Site          Series 1   Series 2
Huffington    11550      493
Fox News      2681       35
NBC News      2302       488
NASA          851        18
SSA           20         0
White House   3251       76
Stanford      385        16
Cornell       596        31
MIT           12         26

Page 58: Archiving the Mobile Web

JavaScript Distribution

Site          Series 1   Series 2
Huffington    245        33
Fox News      107        4
NBC News      46         14
NASA          589        8
SSA           12         0
White House   83         13
Stanford      104        4
Cornell       525        8
MIT           2          0

Page 59: Archiving the Mobile Web

CSS Distribution

Site          Series 1   Series 2
Huffington    587        36
Fox News      301        3
NBC News      72         17
NASA          304        1
SSA           1          0
White House   154        19
Stanford      214        8
Cornell       86         4
MIT           3          3

Page 60: Archiving the Mobile Web

Image Distribution

Site          Series 1   Series 2
Huffington    38671      1227
Fox News      8893       59
NBC           5852       2769
NASA          2908       28
SSA           17         0
White House   4187       436
Stanford      1460       74
Cornell       1484       4
MIT           87         89

Page 61: Archiving the Mobile Web

Acknowledgements

• Internet Archive aided in Mobile Finder work

• Funded by NSF grant 1008492