Recruiting Online Volunteers for Linguistic Knowledge Acquisition
Ed Kenschaft
job talk, May 13, 2008, 45 minutes
www.kenschaft.org/papers/linguistathome.html
Outline
The internet is essentially unregulated, immense, and growing exponentially.
Terrorist groups use the internet for recruiting and training.
Computational linguistics subdisciplines such as opinion detection can be used to identify terrorist websites.
Most such systems require training data which is not readily available.
Other research projects have had success recruiting internet volunteers for comparably difficult tasks.
I propose to do the same with opinion labeling, and then extend to other related areas.
Challenge: Internet
1.36 billion users (Q1 2008)
  across entire populated world and diverse language groups
  20.7% annual growth (December 2006 to December 2007)
103,160,364 active domains (May 03, 2008)
  332,840,730 deleted domains
  648,853 new domains in past 24 hours (May 03, 2008)
619,939 (May 12, 2008)
Source (user info): www.internetworldstats.com/ Copyright © 2008, Miniwatts Marketing Group
Source (domain info): www.domaintools.com/internet-statistics/
Internet Users
most users in Asia, Europe, and North America
fastest growth in Middle East, Africa, and Latin America
Primary Languages of Internet Users
largely European, East Asian, and Arabic
206 million others
fastest growth in Arabic
Internet Summary
internet is vast, with tremendously fast growth
most content and users are from developed nations and well-studied languages
fastest growth is in developing nations and less-studied languages
analysts and technology need to keep up
Challenge: Use of Internet for Global Terrorism
several thousand terrorist websites, growing exponentially
purposes:
  propaganda -- worldwide, anonymous
    "The Global Islamic Call to Resistance", 1600 pages, call for self-starting terrorist cells
    "Questions and Uncertainties Concerning the Mujahideen and their Operations", doctrinal justifications
    news bulletins
    videos of American soldiers being blown up
    video statements on recent events
    video game, "Night of Bush Capturing"
...
Challenge: Use of Internet for Global Terrorism
purposes [continued]:
  training manuals, e.g. assassination, manufacturing poisons/explosives
    "Encyclopedia of Preparation", huge & growing online manual
  coordinate attacks between individuals or groups
    internet jihadist Irhabi007 helped plan attacks by two men from Atlanta, GA, on Washington, DC, targets
"... networks within networks, connections within connections and links between individuals that cross local, national and international boundaries."
  Peter Clarke, head of the counter-terrorism branch of London's Metropolitan Police
Internet Terrorism Summary
"The radicalisation process is occurring more quickly, more widely and more anonymously in the internet age, raising the likelihood of surprise attacks by unknown groups whose members and supporters may be difficult to pinpoint."
  National Intelligence Estimate, USA, 2006
"We have to find a way to stanch the flow. The internet creates a constant reservoir of radicalised people which terrorist groups and networks can draw upon."
  Professor Bruce Hoffman, terrorism expert, Georgetown University
How can we identify terrorist websites?
Digression: Humans and Computers are Different
computers can do many things that humans can't do (well)
humans can do many things that computers can't do (well)
Examples of Differences
computers only:
  find new prime numbers
  scan the entire web for "Osama bin Laden"
humans only:
  recognize emotions from facial expressions
  captcha
both:
  play chess
Crossover
humans can impersonate computers:
  long division
  find new prime numbers
computers can impersonate humans:
  Eliza – requires clever rules, limited domain
  machine learning – requires lots of data
Opinion Detection
identify opinions and attitudes in texts (more generally, modalities)
humans are very good at it, computers are not
Opinion Detection (Examples)
"America is a mistake, admittedly a gigantic mistake, but a mistake nevertheless."
(Sigmund Freud)
SPEAKER DISLIKES America
"The United States of America is a threat to world peace." (Nelson Mandela)
SPEAKER DISLIKES United States of America
Opinion Detection (continued)
"Mr. McGee, don't make me angry. You wouldn't like me when I'm angry."
(David Banner)
Mr. McGee SHOULDN'T make me angry
Mr. McGee DISLIKES me when I'm angry
"All I want for Christmas is my two front teeth." (personal communication)
SPEAKER WANTS my two front teeth
Opinion Detection Resources
humans can do this task well, but not fast enough
computers are moderately successful in limited domains
  TREC 2006 & 2007
accuracy of computers depends on availability of training data
TREC 2006 Blog (Opinion Retrieval) Track
given a blog entry and a topic, identify whether:
  the entry is relevant to that topic
  the entry expresses an opinion on the topic
  the opinion is positive, negative, or mixed
no training data provided
  CMU used ~10,000 training examples from movie and product reviews (Yang et al 2006)
TREC 2006 Examples
Opinionated:
Skype 2.0 eats its young
The elaborate press release and WSJ review while impressive don’t help mask the fact that, Skype is short on new ground breaking ideas. Personalization via avatars and ring-tones... big new idea? Not really. Phil Wolff over on Skype Journal puts it nicely when he writes, “If you’ve been using Skype, the Beta version of Skype 2.0 for Windows won’t give you a new Wow! experience.” ...
Non-Opinionated:
Skype Launches Skype 2.0 Features Skype Video
Skype released the beta version of Skype 2.0, the newest version of its software that allows anyone with an Internet connection to make free Internet calls. The software is designed for greater ease of use, integrated video calling, and ...
TREC 2006 Results (MAP)
                  Best     Median
Topic relevance   42.29%   16.99%
Opinion finding   30.04%   10.59%
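For readers unfamiliar with the metric, MAP (mean average precision) can be sketched as follows. This is a minimal illustration of the standard definition, not the official TREC scoring tool; the function names and the toy rankings are assumptions made for this example.

```python
from typing import Sequence, Tuple


def average_precision(ranked_relevance: Sequence[bool], total_relevant: int) -> float:
    """Average precision for one topic.

    ranked_relevance[i] is True if the document at rank i+1 is relevant.
    total_relevant is the number of relevant documents for the topic,
    including any the system failed to retrieve.
    """
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / total_relevant


def mean_average_precision(runs: Sequence[Tuple[Sequence[bool], int]]) -> float:
    """Mean of per-topic average precision."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)


# Hypothetical toy data: two topics, each a ranked list plus its relevant count.
toy_runs = [([True, False, True], 2), ([False, True], 1)]
toy_map = mean_average_precision(toy_runs)
```

Because every missed or low-ranked relevant document lowers the score, MAP values well below 50% are common even for the best systems.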
Where can we get training data?
Volunteer Projects
Enlist online volunteers
Provide minimal training
Optionally, frame as a competitive game
"The easiest part is getting the public involved. Most volunteer-computing projects can draw on tens of thousands of people with practically no advertising, relying on word of mouth. The problem is usually keeping these eager amateurs busy."
("Spreading the load", The Economist)
Non-computational Projects
amateur bird-watchers track bird migrations
amateur astronomers spot new comets
Galaxy Zoo
roughly a million galaxies from Sloan Digital Sky Survey
classify:
  elliptical
  clockwise spiral
  anticlockwise spiral
  unclear
identify interactions between galaxies, real or illusory
Galaxy Zoo Volunteers
100,000+ volunteers within a few months
30 volunteers classify each galaxy
peak load 70,000 per hour
final datasets:
  34,617,406 analyses
  82,931 users
filter unreliable volunteers using known test cases
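One way to combine redundant classifications with a known-test-case filter is sketched below. This is an illustrative assumption about how such filtering could work, not Galaxy Zoo's actual pipeline; the 0.8 accuracy threshold, the function names, and the sample data are all hypothetical.

```python
from collections import Counter


def reliable_volunteers(gold_answers, gold_truth, min_accuracy=0.8):
    """Return volunteers whose accuracy on known test cases meets the threshold.

    gold_answers: {volunteer: {item_id: label}} for test items only.
    gold_truth:   {item_id: correct_label}.
    min_accuracy is a hypothetical cutoff chosen for illustration.
    """
    keep = set()
    for volunteer, answers in gold_answers.items():
        scored = [item for item in answers if item in gold_truth]
        if not scored:
            continue  # no evidence either way, so do not trust yet
        correct = sum(answers[item] == gold_truth[item] for item in scored)
        if correct / len(scored) >= min_accuracy:
            keep.add(volunteer)
    return keep


def majority_label(votes, trusted):
    """Most common label among trusted volunteers, or None if nobody trusted voted.

    votes: list of (volunteer, label) pairs for a single galaxy.
    """
    counts = Counter(label for volunteer, label in votes if volunteer in trusted)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical sample data.
gold_truth = {"g1": "elliptical", "g2": "clockwise spiral"}
gold_answers = {
    "ann": {"g1": "elliptical", "g2": "clockwise spiral"},
    "bob": {"g1": "unclear", "g2": "anticlockwise spiral"},
    "cy": {"g1": "elliptical", "g2": "clockwise spiral"},
}
trusted = reliable_volunteers(gold_answers, gold_truth)
votes = [("ann", "anticlockwise spiral"), ("bob", "unclear"), ("cy", "anticlockwise spiral")]
label = majority_label(votes, trusted)
```

With around 30 classifications per galaxy, a simple majority among trusted volunteers leaves plenty of room to discard unreliable votes.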
Galaxy Zoo Results
unexpected source of error
  users are biased toward anticlockwise spirals
2 papers submitted for publication
currently over 20 projects underway using resulting data
future work:
  phase two: more detailed questions
  phase three: more image sources
www.galaxyzoo.org/
Stardust@home
Problem:
  aerogel sent seven years and 3 billion km through space
  identify tracks of microparticles in gel
Volunteers:
  24,000 participants
  40 million searches in under a year
Results:
  50 candidate dust particles, each identified by hundreds of participants
  featured in seven conference papers
stardustathome.ssl.berkeley.edu/
Herbaria@home
thousands of 19th-century plant specimens with handwritten notes
read notes and enter information into database
Herbaria@home Volunteers
162 volunteers, Zipfian distribution
68 volunteers transcribed 10 or more entries
24 volunteers transcribed 100 or more entries
7 volunteers transcribed 1000 or more entries
Herbaria@home Results
22,702 specimens documented (May 5, 2008)
no redundancy
herbariaunited.org/atHome/
Open Mind Word Expert
word sense disambiguation
  He boarded the plane from gate 53.
  The ball is not in play until it crosses the plane.
systems need training data
Open Mind Word Expert Results
90,000 sense taggings over four months
240 words, 87 examples each on average
inter-annotator agreement: 66.56%
66.23% precision, vs. 63.32% baseline
best precision for words with most training examples
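A simple way to compute an agreement figure of this kind is raw pairwise agreement, sketched below. Whether Open Mind Word Expert used exactly this measure is an assumption; the function name and the sample taggings are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(taggings):
    """Fraction of annotator pairs that chose the same sense.

    taggings: {example_id: [sense label from each annotator]}.
    Examples with fewer than two labels contribute no pairs.
    """
    agree = 0
    total = 0
    for labels in taggings.values():
        for a, b in combinations(labels, 2):
            total += 1
            agree += (a == b)
    return agree / total if total else 0.0


# Hypothetical taggings of "plane" in three sentences.
sample = {
    "s1": ["aircraft", "aircraft"],
    "s2": ["surface", "aircraft"],
    "s3": ["surface", "surface", "aircraft"],
}
agreement = pairwise_agreement(sample)
```

Raw agreement does not correct for chance, so it tends to look better for words with few senses.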
Volunteer Projects Summary
projects get anywhere between 100+ and 100,000+ volunteers
Zipfian distribution of contributions by volunteers
What makes for a successful project?
Games
"In every job that must be done, there is an element of fun. You find the fun, and – snap! – the job's a game."
(Mary Poppins)
9 billion human-hours of solitaire were played in 2003
7 million human-hours to build the Empire State Building, or 6.8 hours out of 2003
20 million human-hours to build the Panama Canal, or one day out of 2003
(Luis von Ahn, "Human Computation")
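The conversion behind these figures can be checked with a few lines of arithmetic. The sketch assumes solitaire play was spread evenly over the 8,760 hours of 2003; the function and constant names are made up for this example.

```python
HOURS_IN_2003 = 365 * 24          # 2003 was not a leap year
SOLITAIRE_HUMAN_HOURS = 9e9       # figure quoted on the slide


def clock_hours_of_solitaire(human_hours: float) -> float:
    """Clock hours of worldwide solitaire play equal to the given effort.

    Assumes play is spread evenly across the year (an illustrative simplification).
    """
    human_hours_per_clock_hour = SOLITAIRE_HUMAN_HOURS / HOURS_IN_2003
    return human_hours / human_hours_per_clock_hour


empire_state = clock_hours_of_solitaire(7e6)    # about 6.8 clock hours
panama_canal = clock_hours_of_solitaire(20e6)   # about 19.5 clock hours, under one day
```

Under this assumption roughly a million human-hours of solitaire were played per clock hour, which is where the 6.8-hour figure comes from.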
ESP Game
Problem: label images with words/captions
Purposes:
  index images for search
  provide captions for visually impaired
ESP Game Setup
two people, strangers
type whatever the other player is typing
get points whenever you agree
timed
only store solutions when n pairs are recorded
taboo words from previous solutions
random test images to catch cheaters
symmetric verification game
  both players get same input and give same output
  each player verifies the other
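The core of a symmetric verification round can be sketched as below: two players type guesses, taboo words are ignored, the first word both have typed counts as an agreement, and a label is stored only once n pairs have agreed on it. This is a simplified illustration under assumed names and data, not the ESP Game's actual implementation; in particular, interleaving guesses by position stands in for real timing.

```python
from collections import Counter
from itertools import zip_longest


def first_agreement(guesses_a, guesses_b, taboo=frozenset()):
    """Return the first non-taboo word both players have typed, else None.

    Guesses are processed in lockstep (a stand-in for timestamps).
    """
    seen_a, seen_b = set(), set()
    for word_a, word_b in zip_longest(guesses_a, guesses_b):
        if word_a is not None and word_a not in taboo:
            seen_a.add(word_a)
        if word_b is not None and word_b not in taboo:
            seen_b.add(word_b)
        common = seen_a & seen_b
        if common:
            return min(common)  # deterministic if two words match in one step
    return None


def accepted_labels(agreements, n=2):
    """Labels that at least n independent pairs agreed on for one image."""
    counts = Counter(agreements)
    return {label for label, count in counts.items() if count >= n}


# Hypothetical round: "dog" is taboo because earlier pairs already produced it.
match = first_agreement(["dog", "grass", "ball"], ["puppy", "ball", "dog"], taboo={"dog"})
stored = accepted_labels(["ball", "ball", "grass"], n=2)
```

Taboo words push later pairs toward less obvious labels, which is how the label set for an image becomes more complete over time.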
ESP Game Results
75,000 players (after one year)
  many people play over 20 hours per week
15 million agreements
  highly accurate
  highly complete
large part of appeal is relation with anonymous partner
www.espgame.org/
Peekaboom
Problem:
  images with object labels, e.g. output of ESP Game
  need to locate objects in images
  used for training computer vision
Peekaboom Setup
player A sees image
player B has to guess object in image
player A clicks on image, revealing small area to player B
asymmetric verification game
  player A gets input, which player B has to guess
  player B verifies player A's analysis
Peekaboom Results
27,000 players in first four months
2,100,000 object locations
many people averaged over 12 hours per day for first 10 days
www.peekaboom.org/
Verbosity (proposed)
Problem:
  input common sense facts, e.g. "cereal is eaten with milk"
Game:
  player A sees word
  player B has to guess word
  player A gets to fill in various templates, e.g. "object is typically near ____"
  asymmetric verification game
Toolkits
Amazon Mechanical Turk
  paid service
  requester posts task online, along with instructions and pay rate
  worker views available tasks and selects those of interest
  Examples:
    examine an image and click on specified objects, $0.05 per object
    evaluate relevance of search results, $0.02 per evaluation
www.mturk.com/
Toolkits (continued)
Bossa
  open source, Linux
  developer provides task-specific PHP scripts
  system rates volunteer skill, evaluates agreement among volunteers
  pointer to Bolt, open source tutorial builder
  boinc.berkeley.edu/trac/wiki/BossaIntro/
Facebook
  install customized apps
  take advantage of social networks
  www.facebook.com/
Linguist@home (a.k.a. That's Your Opinion)
annotate sentences with opinions
make it fun
resources:
  customer data
  server
  expert consultant(s)
tools:
  Bossa (PHP) or Java
  Facebook and standalone
Linguist@home 1-player game
display sentence
display list of templates
  determined by expert consultant
highlight eligible participants
  entities and events
allow multiple answers
  10 points for first, 20 for second, 30 for third, etc.
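The escalating point scheme can be written down directly. The sketch below assumes the pattern continues linearly (40 for the fourth answer, and so on), which the slide only implies with "etc."; the function names are hypothetical.

```python
def answer_points(position: int, step: int = 10) -> int:
    """Points for the answer at the given 1-based position: 10, 20, 30, ..."""
    if position < 1:
        raise ValueError("position is 1-based")
    return step * position


def sentence_score(num_answers: int, step: int = 10) -> int:
    """Total points for giving num_answers answers to one sentence."""
    return sum(answer_points(k, step) for k in range(1, num_answers + 1))


three_answers = sentence_score(3)  # 10 + 20 + 30
```

Because the total grows quadratically, step * n * (n + 1) / 2, players are rewarded for finding every opinion in a sentence rather than only the most obvious one.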
How can we ensure that answers are valid?
Linguist@home 2-player game
symmetric verification
2 players each play same game
points for matched answers
Future Work
extend to other linguistic subdisciplines
  e.g. topic classification
extend to other widely used & studied languages
  e.g. German, Chinese
extend to fastest growing languages
  e.g. Arabic
  sociopolitical factors
References
------. Playing or processing. The Economist. Dec 6, 2007.
------. Spreading the load. The Economist. Dec 6, 2007.
------. A world wide web of terror. The Economist. July 12, 2007.
Amir Alexander. Aerogel: The "Frozen Smoke" that Made Stardust Possible. The Planetary Society. November 8, 2006.
Nathaniel Ayewah, Rada Mihalcea, and Vivi Nastase. Building Multilingual Semantic Networks with Non-Expert Contributions over the Web. Proceedings of the KCAP 2003 Workshop on Distributed and Collaborative Knowledge Capture. Sanibel Island, Florida, November 2003.
Timothy Chklovski. 2005. Designing interfaces for guided collection of knowledge about everyday objects from volunteers. In Proceedings of the 10th international Conference on intelligent User interfaces (San Diego, California, USA, January 10 - 13, 2005). IUI '05. ACM, New York, NY, 311-313.
Timothy Chklovski, Using Analogy to Acquire Commonsense Knowledge from Human Contributors, MIT Artificial Intelligence Laboratory technical report AITR-2003-002, February 2003.
Timothy Chklovski and Rada Mihalcea. Exploiting Agreement and Disagreement of Human Annotators for Word Sense Disambiguation. Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP 2003). Borovetz, Bulgaria, September 2003.
Kate Land, Anze Slosar, Chris Lintott, Dan Andreescu, Steven Bamford, Phil Murray, Robert Nichol, M.Jordan Raddick, Kevin Schawinski, Alex Szalay, Daniel Thomas, Jan Van den Berg. Galaxy Zoo: The large-scale spin statistics of spiral galaxies in the Sloan Digital Sky Survey. Submitted March 22, 2008.
Chris J. Lintott, Kevin Schawinski, Anze Slosar, Kate Land, Steven Bamford, Daniel Thomas, M. Jordan Raddick, Robert C. Nichol, Alex Szalay, Dan Andreescu, Phil Murray, Jan van den Berg. Galaxy Zoo : Morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Submitted to MNRAS, April 29, 2008.
References (continued)
Rada Mihalcea and Timothy Chklovski. Open Mind Word Expert: Creating Large Annotated Data Collections with Web Users' Help. Proceedings of the EACL 2003 Workshop on Linguistically Annotated Corpora (LINC 2003). Budapest, April 2003.
Iadh Ounis, Maarten de Rijke, Craig Macdonald, Gilad Mishne, Ian Soboroff. Overview of the TREC-2006 Blog Track. TREC 2006.
Luis von Ahn. Games With a Purpose. IEEE Computer Magazine, vol. 39, no. 6, pp. 92-94, June 2006.
Luis von Ahn. Human Computation. Google Tech Talks. July 26, 2006.
Luis von Ahn, Ruoran Liu and Manuel Blum. Peekaboom: A Game for Locating Objects in Images. ACM CHI 2006.
Luis von Ahn, S. Ginosar, M. Kedia, R. Liu and M. Blum. Improving Accessibility of the Web with a Computer Game. ACM CHI 2006.
Luis von Ahn, Mihir Kedia and Manuel Blum. Verbosity: A Game for Collecting Common-Sense Facts. ACM CHI 2006.
A. J. Westphal, C. C. Allen, R. Bastien, J. Borg, F. Brenker, J. C. Bridges, D. E. Brownlee, A. L. Butterworth, C. Floss, G. J. Flynn, D. Frank, Z. Gainsforth, E. Gruen, P. Hoppe, A. T. Kearsley, H. Leroux, L. R. Nittler, S. A. Sandford, A. Simionovici, F. J. Stadermann, R. M. Stroud, P. Tsou, T. Tyliszczak, J. Warren, M. E. Zolensky. Preliminary Examination of the Interstellar Collector of Stardust. 39th Lunar and Planetary Science Conference (2008), Abstract #1855.
Nicholos Wethington. Galaxy Zoo Gets a Makeover. Universe Today. April 23, 2008.
Nicholos Wethington. Galaxy Zoo Results Show that the Universe Isn't 'Lopsided'. Universe Today. March 28, 2008.
Hui Yang, Luo Si, Jamie Callan. Knowledge Transfer and Opinion Detection in the TREC2006 Blog Track. TREC 2006.