Web Research - Large-Scale Web Data Analysis

1

Web Research - Large-Scale Web Data Analysis

Amanda Spink, Queensland University of Technology

Jim Jansen, The Pennsylvania State University

2

Web Data Analysis 1997-2007

Track Web search trends and characteristics

Web query transaction logs collected in 1997, 1999, 2001, 2003, 2004, 2005 and 2006

Combined dataset of 20 million+ Web searches

3

Web Search Studies

Web search engines:
- Alta Vista
- Ask Jeeves
- Excite
- AlltheWeb
- Vivisimo
- Dogpile

Transaction log analysis studies

Focus on user search analysis for competitive advantage

4

Web Data Sample

| Query | UID | Cookie | Date/Time | Browser | Location | Vertical | Organic_Clicks | Sponsored_Clicks |
|---|---|---|---|---|---|---|---|---|
| jamie pressly | 66.215.238.179 | 29NIJYMA4TB385Y | 2006-05-15 00:00:00 | msie6.0 | usa | Images | 0 | 1 |
| maytag parts | 206.192.197.53 | 2UGJ23KA4T2TCMV | 2006-05-15 00:00:00 | msie6.0 | usa | Web | 1 | 0 |
| free gay porno videos | 65.23.175.149 | KSKNPKA4TB22ER | 2006-05-15 00:00:00 | msie6.0 | usa | Web | 0 | 1 |
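The nine fields in the sample above could be read into typed records along the following lines. This is only a sketch: the slides do not say how the log is stored, so the tab-delimited layout, the `LogRecord` class, and the `parse_log` helper are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogRecord:
    """One row of the sample log (field names follow the slide's columns)."""
    query: str
    uid: str              # an IP address in the sample data
    cookie: str
    timestamp: datetime
    browser: str
    location: str
    vertical: str
    organic_clicks: int
    sponsored_clicks: int


def parse_log(text: str) -> list[LogRecord]:
    """Parse tab-delimited log text (an assumed layout) into LogRecord objects."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if len(row) != 9:
            continue  # skip rows that do not have the expected nine fields
        query, uid, cookie, ts, browser, location, vertical, organic, sponsored = row
        records.append(
            LogRecord(
                query=query,
                uid=uid,
                cookie=cookie,
                timestamp=datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
                browser=browser,
                location=location,
                vertical=vertical,
                organic_clicks=int(organic),
                sponsored_clicks=int(sponsored),
            )
        )
    return records


SAMPLE = (
    "jamie pressly\t66.215.238.179\t29NIJYMA4TB385Y\t2006-05-15 00:00:00\tmsie6.0\tusa\tImages\t0\t1\n"
    "maytag parts\t206.192.197.53\t2UGJ23KA4T2TCMV\t2006-05-15 00:00:00\tmsie6.0\tusa\tWeb\t1\t0\n"
)
records = parse_log(SAMPLE)
```

Typed fields make the later term, query, and session analyses simpler, since timestamps and click counts no longer need to be converted at each step.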

5

Data Collection Methods

Various combinations of methods and approaches:
- Transaction log analysis
- Videotaping and audio-taping
- Think-aloud protocols
- Usability and HCI techniques
- Focus groups
- Interviews
- Surveys
- Experiments
- Diaries

6

Data Analysis Methods

Quantitative and statistical analysis

Qualitative analysis – grounded theory

Combination of both methods

7

Key Issues – Search Studies

What is the goal of the project?
– Insights, understanding and theory development
– User modeling
– Trends analysis
– Interface/systems design
– User training

8

Key Issues – Search Studies

What variables to measure?

How much data is enough?

Methods used – single or multiple?

HCI approach – test interface/system features

9

Transaction Log Analysis (TLA)

File or log of communications between user and system

File recorded on the server (server-side recording)

Log or file formats vary, but some fields are common to most (e.g., IP address, cookie, time stamp, query, vertical, click-through)

10

Why Collect and Analyze Log Data?

Gain understanding of user interaction with system and interface

The goal is to improve system and interface design and user training

Transaction log analysis is extensively used in academia and industry

11

TLA Process

Goals and objectives

Data collection

Log preparation

Data analysis

Making sense

12

Data Collected

Process of collecting the interaction data for a given period in a transaction log

Collect data on the search episode:
- User identification
- Date
- Time
- Search session content
- Resources accessed (e.g., URLs)

13

Logging Software

Custom and commercial applications (the Wrapper - http://ist.psu.edu/faculty_pages/jjansen/academic/wrapper.htm )

WinWhatWhere spy software

Morae 1.1 software

Camtasia Studio

14

Data Preparation

Process of cleaning and preparing the log data for analysis

- Load the log data into a relational database
- Cleaning the log – corrupted data
- Parsing the log (e.g., removing Web sessions identified as agents)
- Normalizing the log
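The preparation steps above can be sketched with Python's built-in `sqlite3` module. The table layout, the `prepare_log` helper, and the `AGENT_THRESHOLD` cutoff are assumptions for illustration; the slides do not specify a schema or a rule for identifying agents.

```python
import sqlite3

# Assumed cutoff: a (uid, cookie) pair that submits more queries than this
# is treated as an automated agent rather than a human searcher.
AGENT_THRESHOLD = 100


def prepare_log(rows, agent_threshold=AGENT_THRESHOLD):
    """Clean, normalize, and load (uid, cookie, timestamp, query) rows into SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (uid TEXT, cookie TEXT, ts TEXT, query TEXT)")

    # Cleaning: drop corrupted rows (wrong field count or empty fields).
    clean = [
        r for r in rows
        if len(r) == 4 and all(isinstance(f, str) and f.strip() for f in r)
    ]

    # Normalizing: trim fields, lowercase queries, collapse repeated whitespace.
    clean = [
        (uid.strip(), cookie.strip(), ts.strip(), " ".join(query.lower().split()))
        for uid, cookie, ts, query in clean
    ]
    conn.executemany("INSERT INTO log VALUES (?, ?, ?, ?)", clean)

    # Parsing: remove sessions whose query volume suggests an agent.
    agents = conn.execute(
        "SELECT uid, cookie FROM log GROUP BY uid, cookie HAVING COUNT(*) > ?",
        (agent_threshold,),
    ).fetchall()
    conn.executemany("DELETE FROM log WHERE uid = ? AND cookie = ?", agents)
    conn.commit()
    return conn
```

Keeping the cutoff as a parameter makes it easy to check how sensitive later results are to the agent definition.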

15

Log Analysis – Three Levels

Term

Query

Session

16

Term Level Analysis

- Term occurrence
- Total terms
- High and low usage terms
- Term distribution
- Co-occurring terms
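As a rough illustration of term-level measures, the sketch below counts term occurrences and co-occurring term pairs. Splitting on whitespace and lowercasing are simplifying assumptions, and `term_stats` is a hypothetical helper rather than a method taken from the slides.

```python
from collections import Counter
from itertools import combinations


def term_stats(queries):
    """Return (term_counts, pair_counts) for an iterable of query strings."""
    term_counts = Counter()
    pair_counts = Counter()
    for query in queries:
        terms = query.lower().split()
        term_counts.update(terms)
        # Count each unordered pair of distinct terms once per query.
        pair_counts.update(combinations(sorted(set(terms)), 2))
    return term_counts, pair_counts


queries = ["maytag parts", "maytag dryer parts", "jamie pressly"]
term_counts, pair_counts = term_stats(queries)
total_terms = sum(term_counts.values())   # total term occurrences
unique_terms = len(term_counts)           # size of the vocabulary
```

High and low usage terms can then be read from `term_counts.most_common()`, and the term distribution from the sorted counts.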

17

Query Level Analysis

- Initial query
- Subsequent queries
- Modified queries and query reformulation
- Identical queries
- Query complexity
- Boolean use
- Spelling
- Types of queries
- Query topics
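A few of the query-level measures can be approximated with simple rules, as in the sketch below. The labels and the rules behind them are assumptions for illustration: a query sharing at least one term with the previous query is called "modified", and only uppercase AND, OR, and NOT count as Boolean operators. Real studies may use different definitions.

```python
import re

# Assumption: only uppercase operators are treated as Boolean use.
BOOLEAN_OPS = re.compile(r"\b(AND|OR|NOT)\b")


def classify_queries(session_queries):
    """Label each query in one session as initial, identical, modified, or new."""
    labels = []
    seen = set()
    prev_terms = set()
    for i, query in enumerate(session_queries):
        norm = " ".join(query.lower().split())
        terms = set(norm.split())
        if i == 0:
            label = "initial"
        elif norm in seen:
            label = "identical"   # repeats a query already issued in the session
        elif terms & prev_terms:
            label = "modified"    # shares at least one term with the previous query
        else:
            label = "new"         # no term overlap with the previous query
        seen.add(norm)
        prev_terms = terms
        labels.append(label)
    return labels


def uses_boolean(query):
    """True if the query contains an uppercase Boolean operator."""
    return bool(BOOLEAN_OPS.search(query))


def query_length(query):
    """Number of whitespace-separated terms, a simple proxy for complexity."""
    return len(query.split())
```

For example, `classify_queries(["maytag parts", "maytag dryer parts"])` labels the second query as modified because it keeps the terms of the first.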

18

Query Subjects – Alta Vista 2002 & Vivisimo 2004

Alta Vista 2002
1. People/Places 49.2%
2. Commerce, etc. 12.5%
3. Computers, etc. 12.4%
4. Health/Sciences 7.4%
5. Education/Humanities 5%
6. Entertainment, etc. 4.5%
7. Sex/Pornography 3.2%
8. Society/Culture, etc. 3.1%
9. Government 1.5%
10. Performing/Fine Arts 0.6%

Vivisimo 2004
1. Commerce, etc. 21%
2. Indiscernible 19%
3. People/Places, etc. 15%
4. Computers/Internet 13%
5. Social/Culture 9%
6. Health/Sciences 6%
7. Education/Humanities 5%
8. Sex/Pornography 4%
9. Performing/Fine Arts 3%
10. Government 3%
11. Entertainment, etc. 2%

19

Web Search Session Level Analysis

Search duration

Search patterns

Successive and multitasking sessions

Page or resource viewing

20

Web Session Duration (Minutes)

56% of sessions less than 1 minute

72% of sessions less than 5 minutes

81% of sessions less than 15 minutes

Mean: approx. 58 minutes and 2 seconds

(see Jansen, B. J., Spink, A., and Koshman, S. 2007. Web searcher interactions with the Dogpile.com meta-search engine. Journal of the American Society for Information Science and Technology. 58(5), 744-755.)
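Duration figures of this kind could be derived from a log roughly as follows. The sketch assumes a session is every record sharing the same UID and cookie, and that duration is the time between the first and last record; both are illustrative assumptions, not the definition used in the cited study. `session_durations` and `share_under` are hypothetical helpers.

```python
from collections import defaultdict
from datetime import datetime


def session_durations(records):
    """Map each (uid, cookie) session to its duration in seconds.

    records: iterable of (uid, cookie, timestamp) tuples, with timestamps
    formatted as 'YYYY-MM-DD HH:MM:SS'.
    """
    times = defaultdict(list)
    for uid, cookie, ts in records:
        times[(uid, cookie)].append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
    # A session with a single record has a duration of zero seconds.
    return {key: (max(v) - min(v)).total_seconds() for key, v in times.items()}


def share_under(durations, minutes):
    """Fraction of sessions shorter than the given number of minutes."""
    limit = minutes * 60
    values = list(durations.values())
    return sum(d < limit for d in values) / len(values)


records = [
    ("a", "1", "2006-05-15 00:00:00"),
    ("a", "1", "2006-05-15 00:00:30"),
    ("b", "2", "2006-05-15 00:00:00"),
    ("b", "2", "2006-05-15 00:10:00"),
    ("c", "3", "2006-05-15 00:00:00"),
]
durations = session_durations(records)
mean_seconds = sum(durations.values()) / len(durations)
```

Calling `share_under(durations, 1)`, `share_under(durations, 5)`, and `share_under(durations, 15)` gives the same three cut-offs reported on the slide.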

21

Transaction Log Analysis (TLA) Methods

Quantitative and statistical analysis – requires software and expertise

Qualitative analysis – requires training

Creativity factor

Combination of quantitative and qualitative methods

22

TLA Strengths

Data from a large user base

Reasonable and non-intrusive

Less time than other methods

Can be relatively inexpensive

23

TLA Limitations

Transaction logs do not include user demographic and other data

Lacks data on search reasons and motivations

Incomplete data due to corrupted logging

24

Conclusions

Search analysis is a complex process with many choices

TLA is a powerful tool

Requires planning, training and expertise

Can be combined with other data collection and analysis techniques

25

Further Reading

Spink, A., & Jansen, B. J. (2004). Web Search: Public Searching of the Web. Springer.

Jansen, B. J. (2006). Search log analysis: What is it; what's been done; how to do it. Library and Information Science Research, 28(3), 407-432.

Jansen, B. J., Spink, A., & Taksa, I. (forthcoming). Handbook of Web Log Analysis. Idea Group Publishing.

26

QUESTIONS?

Thank You