
WIRED - Web Analytics Week


Page 1: WIRED - Web Analytics Week

WIRED - Web Analytics Week
• WIRED System Evaluations due now
• Web Logs overview
• Web Analytics
  - Understanding Queries
  - Tracking Users
• Web Log Reliability
• Web Log Data Mining & KDD

Page 2: WIRED - Web Analytics Week

Web Analytics
• Evaluation of Web Information Retrieval (& Web Information Seeking)
• What can we learn?
  - IR systems use
  - Web server administration
• Who are the users?
  - Types of users
  - User situations
• How does it affect or help IR?

Page 3: WIRED - Web Analytics Week

Web Server Overview
• Any application that can serve files using the HTTP protocol
  - Text, HTML, XHTML, XML…
  - Graphics
  - CGI, applets, servlets
  - Other media & MIME types
• Apache or MS IIS, which serve primarily Web pages
• Servers create ASCII text log files showing:
  - Date, time, bytes transferred, (cache status)
  - Status/error codes, user IP address, (domain name)
  - Server method, URI, misc comments

Page 4: WIRED - Web Analytics Week

Web Log Overview
• Access Log
  - Logs information such as page served or time served
• Referer Log
  - Logs name of the server and page that links to current served page
  - Not always
  - Can be from any Web site
• Agent Log
  - Logs browser type and operating system
    • Mozilla
    • Windows

Page 5: WIRED - Web Analytics Week

What can we learn from Web logs?
• Every time a Web browser requests a file, it gets logged
  - Where the user came from
  - What kind of browser was used to access the server
  - Referring URL
• Every time a page gets served, it gets logged
  - Request time, serve time, bytes transferred, URI, status code

Page 6: WIRED - Web Analytics Week

Web Log Analysis in Action
• UT Web log reports
  (Figures in parentheses refer to the 7 days to 28-Mar-2004 03:00.)
  - Successful requests: 39,826,634 (39,596,364)
  - Average successful requests per day: 5,690,083 (5,656,623)
  - Successful requests for pages: 4,189,081 (4,154,717)
  - Average successful requests for pages per day: 598,499 (593,530)
  - Failed requests: 442,129 (439,467)
  - Redirected requests: 1,101,849 (1,093,606)
  - Distinct files requested: 479,022 (473,341)
  - Corrupt logfile lines: 427
  - Data transferred: 278.504 Gbytes (276.650 Gbytes)
  - Average data transferred per day: 39.790 Gbytes (39.521 Gbytes)

Page 7: WIRED - Web Analytics Week

Problems with Web Servers
• Actual user or intent not known
• Paths difficult to determine
• Infrequent access challenging to uncover
• No State Information
• Server Hits not Representative
  - Counters inaccurate
• DOS, Floods, Bandwidth can Stop “intended” usage
• Robots, etc.
• ISP Proxy servers
• “5.3 Unsound inferences from data that is logged” (Haigh & Megarity, 1998)
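One way to make raw hit counts more representative is to drop robot traffic and embedded assets before counting. The following is only a minimal sketch: the marker strings, the file suffixes, and the function name `is_page_view` are illustrative assumptions, not part of any standard, and real robot detection needs far more than substring matching.

```python
# Hypothetical markers and suffixes, chosen for illustration only.
ROBOT_MARKERS = ("bot", "crawler", "spider", "slurp")
ASSET_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def is_page_view(uri, agent, status):
    """Return True if a logged request looks like a human page view."""
    # Ignore the query string when checking the file type.
    path = uri.split("?", 1)[0].lower()
    if path.endswith(ASSET_SUFFIXES):
        return False  # embedded image, stylesheet, or script
    if any(marker in agent.lower() for marker in ROBOT_MARKERS):
        return False  # user agent identifies itself as a robot
    return 200 <= status < 300  # keep successful requests only
```

Requests that arrive through an ISP proxy still share one IP address after this filter, so the result is a cleaner hit count, not a user count.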

Page 8: WIRED - Web Analytics Week

Web Server Configuration
• Unique file & directory names = “at a glance analysis”
• Hierarchical directory structure
• Redirect CGI to find referrer
• Use a database
  - Store web content
  - Record usage data with context of content logged
• Create state information with programming
  - Servlets, ActiveX, Javascript
  - Custom server or log format
• Log rollover, report frequency, special case testing

Page 9: WIRED - Web Analytics Week

Log File Format
• Extended Log File Format - W3C Working Draft WD-logfile-960323
  192.117.240.3 - - [24/Jul/1998:00:00:04 -0400] "GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 "http://www.amicus.nlc-bnc.ca/wbin/resanet/itemdisp/l=0/d=1/r=1/e=0/h=10/i=11683503" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"
• Every server generates slightly different logs
  - Versions & operating system issues
  - Admin tweaks to log formats
• Extended Log Format most common
  - WWW Consortium Standards (= apache)
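A log line like the sample on this slide can be split into fields with a regular expression. The sketch below assumes the field order of that sample (host, identity, user, timestamp, request, status, bytes, referer, user agent), which is the layout the Apache documentation calls the combined log format. The names `LOG_PATTERN` and `parse_line` are illustrative, and servers with tweaked formats will need a different pattern.

```python
import re
from datetime import datetime

# Assumed field order: host ident user [time] "request" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\]\s*'
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?:\s*"(?P<referer>[^"]*)"\s*"(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # corrupt or differently formatted line
    record = match.groupdict()
    # Timestamps look like 24/Jul/1998:00:00:04 -0400
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the size field means no body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = (
    '192.117.240.3 - - [24/Jul/1998:00:00:04 -0400] '
    '"GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 '
    '"http://www.amicus.nlc-bnc.ca/wbin/resanet/itemdisp/l=0/d=1/r=1/e=0/h=10/i=11683503" '
    '"Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"'
)
record = parse_line(sample)
```

Lines that fail to match are returned as `None` rather than raising, which mirrors how report tools count "corrupt logfile lines" separately instead of stopping.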

Page 10: WIRED - Web Analytics Week

Let’s Look at Some Logs
• http://www.ischool.utexas.edu/analog-monthly.html
• http://www.ischool.utexas.edu/analog-weekly.html

Page 11: WIRED - Web Analytics Week

Log Analysis Tools
• Analog
• Webalizer
• Sawmill
• WebTrends
• AWStats
• WWWStat
• GetStats
• Perl Scripts
• Data Mining & Business Intelligence tools

Page 12: WIRED - Web Analytics Week

WebTrends
• A whole industry of analytics
• Most popular commercial application

Page 13: WIRED - Web Analytics Week

Measuring Web Site Usage
• Now that the Web is a primary source, understanding its use is critical
• Few external cues that the Web site is being used
• What - pages and their content/subject
• How - browsers
• Who - userid or IP
• When - trends, daily, weekly, yearly
• Where - the user is and what page they came from

Page 14: WIRED - Web Analytics Week

What can’t you measure?
• Who the user is
  - Always
  - If the user’s needs have changed
• If they’re using the information
  - Browsing vs. Reading vs. Acting on the information
• Changes to site and how they affect each user
• Pages not used at all - and why

Page 15: WIRED - Web Analytics Week

Analysis of a Very Large Search Log
• What kinds of patterns can we find?
• Request = query and results page
• 280 GB – Six Weeks of Web Queries
  - Almost 1 Billion Search Requests, 850K valid, 575K queries
  - 285 Million User Sessions (cookie issues)
  - Large volume, less trendy
  - Why are unique queries important?
• Web Users:
  - Use Short Queries in short sessions - 63.7% one request
  - Mostly Look at the First Ten Results only
  - Seldom Modify Queries
• Traditional IR Isn’t Accurately Describing Web Search
• Phrase Searching Could Be Augmented
• Silverstein, Henzinger, Marais, Moricz (1998)

Page 16: WIRED - Web Analytics Week

Analysis of a Very Large Search Log
• 2.35 Average Terms Per Query
  - 0 = 20.6% (?)
  - 1 = 25.8%
  - 2 = 26.0% (72.4% combined)
• Operators Per Query
  - 0 = 79.6%
• Terms Predictable
• First Set of Results Viewed Only = 85%
• Some (Single Term Phrase) Query Correlation
  - Augmentation
  - Taxonomy Input
  - Robots vs. Humans
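Statistics such as average terms per query and the share of queries with 0, 1, or 2 terms can be computed directly from a list of query strings. This is a rough sketch that assumes terms are separated by whitespace, which is not necessarily how the cited study tokenized queries; the function names are illustrative.

```python
from collections import Counter


def term_count_distribution(queries):
    """Map number of terms -> fraction of queries with that many terms."""
    counts = Counter(len(query.split()) for query in queries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {n: counts[n] / total for n in sorted(counts)}


def mean_terms_per_query(queries):
    """Average number of whitespace-separated terms per query."""
    if not queries:
        return 0.0
    return sum(len(query.split()) for query in queries) / len(queries)


# Five made-up queries, including one empty request.
example_queries = ["", "web logs", "analytics", "web log mining", "kdd"]
distribution = term_count_distribution(example_queries)
average = mean_terms_per_query(example_queries)
```

Empty queries count as zero terms here, which is one plausible reading of the "0 = 20.6% (?)" line on the slide.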

Page 17: WIRED - Web Analytics Week

Web Analytics and IR?
• Knowing access patterns of users
• Lists of search terms
  - Numbers of words
  - Words, concepts to add (synonyms)
  - Types of queries
• Success of searching a site
  - Was a result link clicked on?
  - How many pp/user after a search?
• Is a new or better search interface needed?
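The two search-success questions above can be approximated from a single session's ordered list of requested URIs. The sketch below assumes that search requests share a URI prefix (the hypothetical default `/search`) and treats any non-search page viewed after a search as a result click, which is a crude proxy, since the log alone cannot confirm that the page came from the results list.

```python
def search_outcomes(session_uris, search_prefix="/search"):
    """For each search in a session, count the pages viewed before the next search."""
    outcomes = []
    for i, uri in enumerate(session_uris):
        if not uri.startswith(search_prefix):
            continue
        pages_after = 0
        for later in session_uris[i + 1:]:
            if later.startswith(search_prefix):
                break  # the user searched again
            pages_after += 1
        outcomes.append(
            {"query_uri": uri, "pages_after": pages_after, "clicked": pages_after > 0}
        )
    return outcomes


# Made-up session: one search with two page views, then two reformulations.
example_session = [
    "/",
    "/search?q=logs",
    "/docs/logs.html",
    "/docs/kdd.html",
    "/search?q=kdd",
    "/search?q=kdd+process",
]
results = search_outcomes(example_session)
```

A run of searches with `clicked` set to False, as at the end of the example session, is the kind of pattern that might suggest a better search interface is needed.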

Page 18: WIRED - Web Analytics Week

Real Life Information Retrieval
• 51K Queries from Excite (1997)
• Search Terms = 2.21
• Number of Terms
  - 1 = 31%, 2 = 31%, 3 = 18% (80% Combined)
• Logic & Modifiers (by User)
  - Infrequent
  - AND, “+”, “-”
• Logic & Modifiers (by Query)
  - 6% of Users
  - Less Than 10% of Queries
  - Lots of Mistakes
• Uniqueness of Queries
  - 35% successive
  - 22% modified
  - 43% identical

Page 19: WIRED - Web Analytics Week

Real Life Information Retrieval
• Queries per user 2.8
• Sessions
  - Flawed Analysis (User ID)
  - Some Revisits to Query (Result Page Revisits)
• Page Views
  - Accurate, but not by User
• Use of Relevance Feedback (more like this)
  - Not Used Much (~11%)
• Terms Used Typical & frequent
• Mistakes
  - Typos
  - Misspellings
  - Bad (Advanced) Query Formulation
• Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998)

Page 20: WIRED - Web Analytics Week

KDD for Extracting Knowledge
• Knowledge extraction, information discovery, information extraction, data archeology, data pattern processing, OLAP, HV statistical analysis
• Sounds as if “knowledge” is there to be found.
• User and usage context help find the knowledge
• Hypothesis before analysis
• Why KDD, why now?
  - Data storage, analysis costs
  - Visualization

Page 21: WIRED - Web Analytics Week

KDD Process
• Database for structured data and queries
  - How structured, algorithms for queries
  - How results can be understood and visualized
  - Iterative & Interactive, hypothesis driven & hypothesis generating

Page 22: WIRED - Web Analytics Week

KDD Efforts
• Data Cleaning
• Formulating the Questions
• “Finding useful features to represent the data” p30
• Models:
  - Classification to fit data into pre-defined classes
  - Regressions to fit predictions & values
  - Clustering to class sets found in data
  - Summarization to briefly describe data
  - Dependency discovery of variable relationships
  - Sequence analysis for time or interaction patterns

Page 23: WIRED - Web Analytics Week

Data Prep for Mining the WWW
• Processing the data before mining
• WEBMINER system - site topology
  - Cleaning
  - User identification
  - Session identification (episodes)
  - Path completion
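User and session identification can be sketched as follows. The heuristics here are assumptions for illustration, not the WEBMINER implementation: a "user" is taken to be a distinct (host, user agent) pair, and a new session starts after 30 minutes of inactivity. Each record is assumed to be a dict with `host`, `agent`, and `time` (a `datetime`) keys.

```python
from datetime import datetime, timedelta


def sessionize(records, timeout=timedelta(minutes=30)):
    """Group log records into per-user sessions split by an inactivity timeout."""
    sessions = []
    open_sessions = {}  # user key -> index of that user's latest session
    for record in sorted(records, key=lambda r: r["time"]):
        # Assumed user identification: same host and same user agent.
        key = (record["host"], record["agent"])
        index = open_sessions.get(key)
        if index is None or record["time"] - sessions[index][-1]["time"] > timeout:
            sessions.append([record])  # first request, or gap too long
            open_sessions[key] = len(sessions) - 1
        else:
            sessions[index].append(record)
    return sessions
```

Proxies and shared machines break the (host, user agent) assumption, which is one reason session counts from server logs should be read as estimates.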


Page 25: WIRED - Web Analytics Week

Web Usage Mining
• VL Verification
• Data Mining to Discover Patterns of Use
  - Pre-Processing
  - Pattern Discovery
  - Pattern Analysis
• Site Analysis, Not User Analysis
• Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.N. (2000)

Page 26: WIRED - Web Analytics Week

Web Usage Discovery
- Content
  • Text
  • Graphics
  • Features
- Structure
  • Content Organization
  • Templates and Tags
- Usage
  • Patterns
  • Page References
  • Dates and Times
- User Profile
  • Demographics
  • Customer Information

Page 27: WIRED - Web Analytics Week

Web Usage Collection
• Types of Data
  - Web Servers
  - Proxies
  - Web Clients
• Data Abstractions
  - Sessions
  - Episodes
  - Clickstreams
  - Page Views
• The Tools for Web Use Verification

Page 28: WIRED - Web Analytics Week

Web Usage Preprocessing
• Usage Preprocessing
  - Understanding the Web Use Activities of the Site
  - Extract from Logs
• Content Preprocessing
  - Converting Content Into Formats for Processing
  - Understanding Content (Working with Dev Team)
• Structure Preprocessing
  - Mining Links and Navigation from Site
  - Understanding Page Content and Link Structures

Page 29: WIRED - Web Analytics Week

Web Usage Pattern Discovery
• Clustering for Similarities
  - Pages
  - Users
  - Links
• Classification
  - Mapping Data to Pre-defined Classes
  - Rule Discovery
  - Rule Rules
  - Computation Intensive
  - Many Paths to the Similar Answers
• Pattern Detection
  - Ordering By Time
  - Predicting Use With Time
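As a very small illustration of time-ordered pattern detection, the sketch below counts how often one page is followed directly by another within a session. It assumes each session is already a time-ordered list of page URIs; the function name is illustrative, and real sequence mining uses longer patterns and support thresholds.

```python
from collections import Counter


def transition_counts(sessions):
    """Count direct page-to-page transitions across all sessions."""
    counts = Counter()
    for pages in sessions:
        # Pair each page with the page requested immediately after it.
        counts.update(zip(pages, pages[1:]))
    return counts


# Made-up sessions for illustration.
example_sessions = [["/", "/a", "/b"], ["/", "/a"], ["/a", "/b"]]
transitions = transition_counts(example_sessions)
```

Frequent transitions give a simple basis for predicting the next page, while transitions that never occur between linked pages may point to navigation that users do not find.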

Page 30: WIRED - Web Analytics Week

Web Usage Mining as Evaluation?
• Mining Goals
  - Improved Design
  - Improved Delivery
  - Improved Content
• Personalization (XMod Data)
• System Improvement (Tech Data)
• Site Modification (IA Data)
• Business Intelligence (Market Data)
• Usage Characterization (User Behavior Data)

Page 31: WIRED - Web Analytics Week

Web Analytics Wrap-up
• What can we learn about users?
• What can we learn about services?
• How can we help users improve their use?
• How can IR models benefit from this analysis?
• What kind of improvements in Web IR systems and their interfaces can be taken from this?