1
Web Research - Large-Scale Web Data Analysis
Amanda Spink, Queensland University of Technology
Jim Jansen, The Pennsylvania State University
2
Web Data Analysis 1997-2007
Track Web search trends and characteristics
Web query transaction logs collected in 1997, 1999, 2001, 2003, 2004, 2005 and 2006
Combined dataset of 20 million+ Web searches
3
Web Search Studies
Web search engines:
- Alta Vista
- Ask Jeeves
- Excite
- AlltheWeb
- Vivisimo
- Dogpile
Transaction log analysis studies
Focus on user search analysis for competitive advantage
4
Web Data Sample
Query | UID | Cookie | Date/Time | Browser | Location | Vertical | Organic_Clicks | Sponsored_Clicks
jamie pressly | 66.215.238.179 | 29NIJYMA4TB385Y | 2006-05-15 00:00:00 | msie6.0 | usa | Images | 0 | 1
maytag parts | 206.192.197.53 | 2UGJ23KA4T2TCMV | 2006-05-15 00:00:00 | msie6.0 | usa | Web | 1 | 0
free gay porno videos | 65.23.175.149 | KSKNPKA4TB22ER | 2006-05-15 00:00:00 | msie6.0 | usa | Web | 0 | 1
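As a minimal sketch, a row like those in the sample could be read into a typed record before analysis. The tab delimiter, the `LogRecord` class and the `parse_record` helper are assumptions made for illustration; the actual log format is not specified on the slide.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogRecord:
    """One row of the sample log; field names follow the slide's column headers."""
    query: str
    uid: str
    cookie: str
    timestamp: datetime
    browser: str
    location: str
    vertical: str
    organic_clicks: int
    sponsored_clicks: int


def parse_record(line: str, sep: str = "\t") -> LogRecord:
    """Parse one delimited log line (the delimiter is an assumption)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    query, uid, cookie, ts, browser, location, vertical, organic, sponsored = fields
    return LogRecord(
        query=query,
        uid=uid,
        cookie=cookie,
        timestamp=datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
        browser=browser,
        location=location,
        vertical=vertical,
        organic_clicks=int(organic),
        sponsored_clicks=int(sponsored),
    )


# Second sample row from the slide, joined with the assumed tab delimiter.
record = parse_record(
    "maytag parts\t206.192.197.53\t2UGJ23KA4T2TCMV\t"
    "2006-05-15 00:00:00\tmsie6.0\tusa\tWeb\t1\t0"
)
```

Rejecting rows with the wrong number of fields is one simple way to surface corrupted records early rather than during analysis.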
5
Data Collection Methods
Various combinations of methods and approaches:
- Transaction log analysis
- Videotaping and audio-taping
- Think-aloud protocols
- Usability – HCI techniques
- Focus groups
- Interviews
- Survey
- Experiments
- Diaries
6
Data Analysis Methods
Quantitative and statistical analysis
Qualitative analysis – grounded theory
Combination of both methods
7
Key Issues – Search Studies
What is the goal of the project?
– Insights, understanding and theory development
– User modeling
– Trend analysis
– Interface/systems design
– User training
8
Key Issues – Search Studies
What variables to measure?
How much data is enough?
Methods used – single or multiple?
HCI approach – test interface/system features
9
Transaction Log Analysis (TLA)
File or log of communications between user and system
File recorded on the server (server-side recording)
Log or file formats vary, but some fields are common to most (e.g., IP address, cookie, time stamp, query, vertical, click-through)
10
Why Collect and Analyze Log Data?
Gain understanding of user interaction with system and interface
The goal is to improve system and interface design and user training
Transaction log analysis is extensively used in academia and industry
12
Data Collected
Process of collecting the interaction data for a given period in a transaction log
Collect data on the search episode:
- User identification
- Date
- Time
- Search session content
- Resources accessed (e.g., URLs)
13
Logging Software
Custom and commercial applications (the Wrapper - http://ist.psu.edu/faculty_pages/jjansen/academic/wrapper.htm )
WinWhatWhere spy software
Morae 1.1 software
Camtasia Studio
14
Data Preparation
Process of cleaning and preparing the log data for analysis
- Importing the log data into a relational database
- Cleaning the log – corrupted data
- Parsing the log (e.g., removing Web sessions identified as agents)
- Normalizing the log
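The preparation steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure used in the studies: records are plain dictionaries with hypothetical `uid` and `query` keys, a user is treated as an agent when it submits more queries than a hypothetical cutoff, and normalization means lowercasing and collapsing whitespace.

```python
from collections import Counter

# Hypothetical cutoff: users with more queries than this are treated as agents.
AGENT_THRESHOLD = 100


def normalize_query(query):
    """Lowercase the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def prepare(records, agent_threshold=AGENT_THRESHOLD):
    """Clean, filter and normalize a list of {'uid': ..., 'query': ...} dicts."""
    # Cleaning: drop records with a missing user id or an empty query.
    cleaned = [
        r for r in records
        if r.get("uid") and (r.get("query") or "").strip()
    ]
    # Parsing: remove every record from users that look like automated agents.
    per_user = Counter(r["uid"] for r in cleaned)
    human = [r for r in cleaned if per_user[r["uid"]] <= agent_threshold]
    # Normalizing: store each query in a canonical form.
    return [{**r, "query": normalize_query(r["query"])} for r in human]


raw = [
    {"uid": "u1", "query": "  Maytag   Parts "},
    {"uid": "", "query": "orphan record"},
    {"uid": "bot", "query": "a"},
    {"uid": "bot", "query": "b"},
    {"uid": "bot", "query": "c"},
]
prepared = prepare(raw, agent_threshold=2)
```

With the small cutoff used in the example, only the `u1` record survives, with its query stored as `maytag parts`.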
16
Term Level Analysis
- Term occurrence
- Total terms
- High and low usage terms
- Term distribution
- Co-occurring terms
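A small sketch of how these term-level measures might be computed from a list of query strings. It assumes terms are whitespace-separated and case-insensitive; `term_statistics` and the example queries are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def term_statistics(queries):
    """Return (term counts, co-occurring term pair counts) for the queries."""
    term_counts = Counter()
    pair_counts = Counter()
    for query in queries:
        terms = query.lower().split()
        term_counts.update(terms)
        # Count each unordered pair of distinct terms once per query.
        pair_counts.update(combinations(sorted(set(terms)), 2))
    return term_counts, pair_counts


queries = ["maytag parts", "maytag dryer parts", "free music"]
term_counts, pair_counts = term_statistics(queries)

total_terms = sum(term_counts.values())      # total term occurrences
high_usage = term_counts.most_common(2)      # most frequent terms
low_usage = [t for t, n in term_counts.items() if n == 1]  # terms used once
```

Sorting the terms before pairing keeps `("maytag", "parts")` and `("parts", "maytag")` from being counted as different pairs.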
17
Query Level Analysis
- Initial query
- Subsequent queries
- Modified queries and query reformulation
- Identical queries
- Query complexity
- Boolean use
- Spelling
- Types of queries
- Query topics
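One possible way to label the queries within a single session is sketched below. The rules are simplifying assumptions, not the coding scheme used in the studies: a query is "identical" if the same normalized string appeared earlier in the session, "modified" if it shares at least one term with the preceding query, and "new" otherwise. Boolean use is detected only for uppercase AND, OR and NOT.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def uses_boolean(query):
    """True if the query contains an uppercase Boolean operator as a token."""
    return any(token in BOOLEAN_OPERATORS for token in query.split())


def classify_session(queries):
    """Label each query in one session, in submission order."""
    labels = []
    seen = set()          # normalized queries already submitted in the session
    previous_terms = set()
    for position, query in enumerate(queries):
        terms = query.lower().split()
        key = " ".join(terms)
        if position == 0:
            labels.append("initial")
        elif key in seen:
            labels.append("identical")
        elif previous_terms & set(terms):
            labels.append("modified")
        else:
            labels.append("new")
        seen.add(key)
        previous_terms = set(terms)
    return labels


session = ["maytag parts", "maytag dryer parts", "maytag parts", "jaguar AND cars"]
labels = classify_session(session)
query_lengths = [len(q.split()) for q in session]  # terms per query
```

For the example session the labels come out as initial, modified, identical and new, and only the last query counts as Boolean.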
18
Query Subjects – Alta Vista 2002 & Vivisimo 2004
Alta Vista 2002:
1. People/Places 49.2%
2. Commerce, etc. 12.5%
3. Computers, etc. 12.4%
4. Health/sciences 7.4%
5. Education/Humanities 5%
6. Entertainment, etc. 4.5%
7. Sex/Pornography 3.2%
8. Society/Culture, etc. 3.1%
9. Government 1.5%
10. Performing/Fine Arts 0.6%
Vivisimo 2004:
1. Commerce, etc. 21%
2. Indiscernible 19%
3. People/Places, etc. 15%
4. Computers/Internet 13%
5. Social/Culture 9%
6. Health/Sciences 6%
7. Education/Humanities 5%
8. Sex/Pornography 4%
9. Performing/Fine Arts 3%
10. Government 3%
11. Entertainment, etc. 2%
19
Web Search Session Level Analysis
Search duration
Search patterns
Successive and multitasking sessions
Page or resource viewing
20
Web Session Duration (Minutes)
56% of sessions less than 1 minute
72% of sessions less than 5 minutes
81% of sessions less than 15 minutes
Mean: approx. 58 minutes and 2 seconds
(see Jansen, B. J., Spink, A., and Koshman, S. 2007. Web searcher interactions with the Dogpile.com meta-search engine. Journal of the American Society for Information Science and Technology. 58(5), 744-755.)
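Figures of this kind could be derived roughly as in the sketch below, which assumes a session's duration is the time between its first and last logged event. The session identifiers and timestamps are invented for illustration and are not from the Dogpile data. Under this definition a single-query session has a duration of zero.

```python
from datetime import datetime, timedelta
from statistics import mean


def session_durations(events):
    """Map session id -> duration in minutes (last event minus first event)."""
    first, last = {}, {}
    for session_id, timestamp in events:
        if session_id not in first or timestamp < first[session_id]:
            first[session_id] = timestamp
        if session_id not in last or timestamp > last[session_id]:
            last[session_id] = timestamp
    return {
        sid: (last[sid] - first[sid]).total_seconds() / 60
        for sid in first
    }


def share_under(durations, minutes):
    """Percentage of sessions shorter than the given number of minutes."""
    values = list(durations.values())
    return 100 * sum(d < minutes for d in values) / len(values)


start = datetime(2006, 5, 15)
events = [
    ("s1", start), ("s1", start + timedelta(seconds=30)),
    ("s2", start), ("s2", start + timedelta(minutes=4)),
    ("s3", start), ("s3", start + timedelta(minutes=10)),
    ("s4", start), ("s4", start + timedelta(hours=2)),
]
durations = session_durations(events)
under_1 = share_under(durations, 1)
under_5 = share_under(durations, 5)
under_15 = share_under(durations, 15)
mean_minutes = mean(list(durations.values()))
```

In the invented data, three of the four sessions last under 15 minutes, yet the single two-hour session pulls the mean above 30 minutes. A few very long sessions can push the mean well above the typical session length in the same way.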
21
Transaction Log Analysis (TLA) Methods
Quantitative and statistical analysis – requires software and expertise
Qualitative analysis – requires training
Creativity factor
Combination of quantitative and qualitative methods
22
TLA Strengths
Data from a large user base
Reasonable and non-intrusive
Less time than other methods
Can be relatively inexpensive
23
TLA Limitations
Transaction logs do not include user demographic and other data
Lacks data on search reasons and motivations
Incomplete data due to corrupted logging
24
Conclusions
Search analysis is a complex process with many choices
TLA is a powerful tool
Requires planning, training and expertise
Can be combined with other data collection and analysis techniques
25
Further Reading
Spink, A., & Jansen, B. J. (2004). Web Search: Public Searching of the Web. Springer.
Jansen, B. J. (2006). Search log analysis: What is it; what's been done; how to do it. Library and Information Science Research, 28(3), 407-432.
Jansen, B. J., Spink, A., & Taksa, I. (forthcoming). Handbook of Web Log Analysis. Idea Group Publishing.