Recruiting Online Volunteers for Linguistic Knowledge Acquisition Ed Kenschaft job talk May 13, 2008 45 minutes


http://www.kenschaft.org/papers/linguistathome.html

Outline

The internet is essentially unregulated, immense, and growing exponentially.

Terrorist groups use the internet for recruiting and training.

Computational linguistics subdisciplines such as opinion detection can be used to identify terrorist websites.

Most such systems require training data, which is not readily available.

Other research projects have had success recruiting internet volunteers for comparably difficult tasks.

I propose to do the same with opinion labeling, and then extend to other related areas.

Challenge: Internet

1.36 billion users (Q1 2008) across entire populated world and diverse language groups

20.7% annual growth (December 2006 to December 2007)

103,160,364 active domains (May 03, 2008)

332,840,730 deleted domains

648,853 new domains in past 24 hours (May 03, 2008)

619,939 (May 12, 2008)

Source (user info): www.internetworldstats.com/ Copyright © 2008, Miniwatts Marketing Group

Source (domain info): www.domaintools.com/internet-statistics/


Internet Users

most users in Asia, Europe, and North America

fastest growth in Middle East, Africa, and Latin America

Primary Languages of Internet Users

largely European, East Asian, and Arabic

206 million others

fastest growth in Arabic

Internet Summary

internet is vast, with tremendously fast growth

most content and users are from developed nations and well-studied languages

fastest growth is in developing nations and less-studied languages

analysts and technology need to keep up

Challenge: Use of Internet for Global Terrorism

several thousand terrorist websites, growing exponentially

purposes

propaganda: worldwide, anonymous

"The Global Islamic Call to Resistance", 1600 pages, call for self-starting terrorist cells

"Questions and Uncertainties Concerning the Mujahideen and their Operations", doctrinal justifications

news bulletins

videos of American soldiers being blown up

video statements on recent events

video game, "Night of Bush Capturing"

...

Challenge: Use of Internet for Global Terrorism

purposes [continued]

training manuals, e.g. assassination, manufacturing poisons/explosives

"Encyclopedia of Preparation", huge & growing online manual

coordinate attacks between individuals or groups

internet jihadist Irhabi007 helped plan attacks by two men from Atlanta, GA, on Washington, DC, targets

"... networks within networks, connections within connections and links between individuals that cross local, national and international boundaries."

Peter Clarke, head of the counter-terrorism branch of London's Metropolitan Police

Internet Terrorism Summary

"The radicalisation process is occurring more quickly, more widely and more anonymously in the internet age, raising the likelihood of surprise attacks by unknown groups whose members and supporters may be difficult to pinpoint."

National Intelligence Estimate, USA, 2006

"We have to find a way to stanch the flow. The internet creates a constant reservoir of radicalised people which terrorist groups and networks can draw upon."

Professor Bruce Hoffman, terrorism expert, Georgetown University

How can we identify terrorist websites?

Digression: Humans and Computers are Different

computers can do many things that humans can't do (well)

humans can do many things that computers can't do (well)

Examples of Differences

computers only: find new prime numbers; scan the entire web for "Osama bin Laden"

humans only: recognize emotions from facial expressions; captcha

both: play chess

Crossover

humans can impersonate computers: long division; find new prime numbers

computers can impersonate humans: Eliza (requires clever rules, limited domain); machine learning (requires lots of data)

Opinion Detection

identify opinions and attitudes in texts (more generally, modalities)

humans are very good at it, computers are not

Opinion Detection (Examples)

"America is a mistake, admittedly a gigantic mistake, but a mistake nevertheless." (Sigmund Freud)

SPEAKER DISLIKES America

"The United States of America is a threat to world peace." (Nelson Mandela)

SPEAKER DISLIKES United States of America

Opinion Detection (continued)

"Mr. McGee, don't make me angry. You wouldn't like me when I'm angry." (David Banner)

Mr. McGee SHOULDN'T make me angry

Mr. McGee DISLIKES me when I'm angry

"All I want for Christmas is my two front teeth." (personal communication)

SPEAKER WANTS my two front teeth

Opinion Detection Resources

humans can do this task well, but not fast enough

computers are moderately successful in limited domains

TREC 2006 & 2007

accuracy of computers depends on availability of training data

TREC 2006 Blog (Opinion Retrieval) Track

given a blog entry and a topic, identify whether:

the entry is relevant to that topic

the entry expresses an opinion on the topic

the opinion is positive, negative, or mixed

no training data provided

CMU used ~10,000 training examples from movie and product reviews (Yang et al 2006)

TREC 2006 Examples

Opinionated: Skype 2.0 eats its young

The elaborate press release and WSJ review while impressive don’t help mask the fact that, Skype is short on new ground breaking ideas. Personalization via avatars and ring-tones... big new idea? Not really. Phil Wolff over on Skype Journal puts it nicely when he writes, “If you’ve been using Skype, the Beta version of Skype 2.0 for Windows won’t give you a new Wow! experience.” ...

Non-Opinionated: Skype Launches Skype 2.0 Features Skype Video

Skype released the beta version of Skype 2.0, the newest version of its software that allows anyone with an Internet connection to make free Internet calls. The software is designed for greater ease of use, integrated video calling, and ...

TREC 2006 Results (MAP)

Topic relevance: best 42.29%, median 16.99%

Opinion finding: best 30.04%, median 10.59%
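For readers unfamiliar with the metric, MAP (mean average precision) averages, over all topics, the precision measured at each rank where a relevant entry appears. The sketch below follows the standard textbook definition; it is not the TREC evaluation code, and the blog-entry IDs and topic names are made up for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one topic.

    Sums precision@k at every rank k that holds a relevant entry,
    then divides by the total number of relevant entries.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, entry in enumerate(ranked, start=1):
        if entry in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], judgments: Dict[str, Set[str]]
) -> float:
    """Mean of the per-topic average precisions."""
    scores = [
        average_precision(ranking, judgments.get(topic, set()))
        for topic, ranking in runs.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical rankings returned by a system for two topics.
runs = {
    "skype": ["b1", "b2", "b3", "b4"],
    "ipod": ["b5", "b6"],
}
# Hypothetical human relevance judgments for the same topics.
judgments = {
    "skype": {"b1", "b3"},
    "ipod": {"b6"},
}

map_score = mean_average_precision(runs, judgments)
print(f"MAP = {map_score:.4f}")
```

A relevant entry that is ranked low, or never retrieved, pulls the score down sharply, which is part of why even the best systems score well under 50%.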

Where can we get training data?

Volunteer Projects

Enlist online volunteers

Provide minimal training

Optionally, frame as a competitive game

"The easiest part is getting the public involved. Most volunteer-computing projects can draw on tens of thousands of people with practically no advertising, relying on word of mouth. The problem is usually keeping these eager amateurs busy." ("Spreading the load", The Economist)

Non-computational Projects

amateur bird-watchers track bird migrations

amateur astronomers spot new comets

Galaxy Zoo

roughly a million galaxies from Sloan Digital Sky Survey

classify: elliptical, clockwise spiral, anticlockwise spiral, unclear

identify interactions between galaxies, real or illusory

Galaxy Zoo Volunteers

100,000+ volunteers within a few months

30 volunteers classify each galaxy

peak load 70,000 per hour

final datasets: 34,617,406 analyses, 82,931 users

filter unreliable volunteers using known test cases
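One way the known-test-case filter and the many-votes-per-galaxy redundancy could fit together is sketched below. The 0.7 accuracy threshold, the plain majority vote, and all names are assumptions for illustration; the slides do not say how Galaxy Zoo actually scores volunteers or combines their votes.

```python
from collections import Counter, defaultdict


def volunteer_accuracy(votes, gold):
    """Fraction of known test cases each volunteer labeled correctly.

    votes: iterable of (volunteer, item, label) tuples.
    gold:  {item: correct_label} for the known test cases.
    Volunteers who never saw a test case get no score.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for volunteer, item, label in votes:
        if item in gold:
            seen[volunteer] += 1
            correct[volunteer] += int(label == gold[item])
    return {v: correct[v] / seen[v] for v in seen}


def consensus_labels(votes, gold, min_accuracy=0.7):
    """Majority label per item, counting only trusted volunteers.

    min_accuracy is an assumed cutoff; unscored volunteers are ignored.
    """
    accuracy = volunteer_accuracy(votes, gold)
    trusted = {v for v, acc in accuracy.items() if acc >= min_accuracy}
    tallies = defaultdict(Counter)
    for volunteer, item, label in votes:
        if volunteer in trusted and item not in gold:
            tallies[item][label] += 1
    return {item: c.most_common(1)[0][0] for item, c in tallies.items()}


# Hypothetical votes: "test1" is a known test case, "g1" and "g2" are not.
votes = [
    ("ann", "test1", "elliptical"),
    ("bob", "test1", "elliptical"),
    ("cat", "test1", "unclear"),
    ("ann", "g1", "clockwise spiral"),
    ("bob", "g1", "clockwise spiral"),
    ("cat", "g1", "anticlockwise spiral"),
    ("ann", "g2", "elliptical"),
    ("cat", "g2", "unclear"),
]
gold = {"test1": "elliptical"}

labels = consensus_labels(votes, gold)
print(labels)
```

Here "cat" fails the test case, so her votes are dropped before the per-galaxy tally.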

Galaxy Zoo Results

unexpected source of error: users are biased toward anticlockwise spirals

2 papers submitted for publication

currently over 20 projects underway using resulting data

future work

phase two: more detailed questions

phase three: more image sources

www.galaxyzoo.org/


Stardust@home

Problem

aerogel sent seven years and 3 billion km through space

identify tracks of microparticles in gel

Volunteers

24,000 participants

40 million searches in under a year

Results

50 candidate dust particles, each identified by hundreds of participants

featured in seven conference papers

stardustathome.ssl.berkeley.edu/

Herbaria@home

thousands of 19th-century plant specimens with handwritten notes

read notes and enter information into database

Herbaria@home Volunteers

162 volunteers, Zipfian distribution

68 volunteers transcribed 10 or more entries

24 volunteers transcribed 100 or more entries

7 volunteers transcribed 1000 or more entries

Herbaria@home Results

22,702 specimens documented (May 5, 2008)

no redundancy

herbariaunited.org/atHome/

Open Mind Word Expert

word sense disambiguation

He boarded the plane from gate 53.

The ball is not in play until it crosses the plane.

systems need training data

Open Mind Word Expert Results

90,000 sense taggings over four months

240 words, 87 examples each on average

inter-annotator agreement: 66.56%

66.23% precision, vs. 63.32% baseline

best precision for words with most training examples

Volunteer Projects Summary

projects get anywhere between 100+ and 100,000+ volunteers

Zipfian distribution of contributions by volunteers

What makes for a successful project?

Games

"In every job that must be done, there is an element of fun. You find the fun, and – snap! – the job's a game." (Mary Poppins)

9 billion human-hours of solitaire were played in 2003

7 million human-hours to build the Empire State Building, or 6.8 hours of 2003's solitaire play

20 million human-hours to build the Panama Canal, or about one day of 2003's solitaire play

(Luis von Ahn, "Human Computation")
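The comparison above can be checked with a little arithmetic: if solitaire players collectively logged 9 billion hours in a year, then each clock hour of that year accounts for 9 billion / 8,760 human-hours of play. The short sketch below redoes the calculation; the function name is made up, and the assumption that solitaire play was spread evenly over the year is mine.

```python
HOURS_IN_2003 = 365 * 24  # 8,760 clock hours; 2003 was not a leap year
SOLITAIRE_HUMAN_HOURS = 9e9  # figure quoted in the talk


def clock_hours_of_solitaire(project_human_hours: float) -> float:
    """Clock hours of worldwide 2003 solitaire play that equal a project's effort.

    Assumes solitaire play was spread evenly across the year.
    """
    return project_human_hours / SOLITAIRE_HUMAN_HOURS * HOURS_IN_2003


empire_state = clock_hours_of_solitaire(7e6)
panama_canal = clock_hours_of_solitaire(20e6)

print(f"Empire State Building: {empire_state:.1f} hours")
print(f"Panama Canal: {panama_canal:.1f} hours")
```

The Empire State Building figure comes out at about 6.8 hours, matching the slide; the Panama Canal figure is about 19.5 hours, which the slide rounds to one day.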

ESP Game

Problem: label images with words/captions

Purposes

index images for search

provide captions for visually impaired

ESP Game Setup

two people, strangers

type whatever the other player is typing

get points whenever you agree

timed

only store solutions when n pairs are recorded

taboo words from previous solutions

random test images to catch cheaters

symmetric verification game: both players get same input and give same output; each player verifies the other
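The agreement, n-pairs, and taboo-word rules above can be sketched as a small class. This is a hypothetical illustration, not the ESP Game's actual code: the class and method names are invented, the timer and the cheater-catching test images are left out, and treating a label as both "stored" and "taboo" once n pairs have agreed on it is an assumption about how the two rules interact.

```python
from collections import Counter


class ImageLabels:
    """Agreement bookkeeping for one image in an ESP-style game (hypothetical)."""

    def __init__(self, pairs_needed: int = 2):
        self.pairs_needed = pairs_needed  # the "n" in "n pairs are recorded"
        self.agreements = Counter()       # word -> number of pairs who agreed
        self.taboo = set()                # stored labels, off-limits from now on

    def play_round(self, guesses_a, guesses_b):
        """Return the first non-taboo word both players typed, or None.

        Guesses are compared case-insensitively, in player A's typing order.
        """
        allowed_b = {g.lower() for g in guesses_b} - self.taboo
        for guess in guesses_a:
            word = guess.lower()
            if word in allowed_b:
                self.agreements[word] += 1
                if self.agreements[word] >= self.pairs_needed:
                    self.taboo.add(word)  # stored as a label, now taboo
                return word
        return None


image = ImageLabels(pairs_needed=2)
round1 = image.play_round(["dog", "grass"], ["puppy", "dog"])
round2 = image.play_round(["dog"], ["dog", "ball"])
round3 = image.play_round(["dog", "ball"], ["ball", "dog"])
stored = sorted(image.taboo)
print(round1, round2, round3, stored)
```

Once "dog" has been agreed on by two pairs it becomes taboo, so the third pair is pushed toward a new label ("ball"), which is how the game keeps collecting fresh descriptions of the same image.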

ESP Game Results

75,000 players (after one year)

many people play over 20 hours per week

15 million agreements: highly accurate, highly complete

large part of appeal is relation with anonymous partner

www.espgame.org/


Peekaboom

Problem

images with object labels, e.g. output of ESP Game

need to locate objects in images

used for training computer vision

Peekaboom Setup

player A sees image

player B has to guess object in image

player A clicks on image, revealing small area to player B

asymmetric verification game: player A gets input, which player B has to guess; player B verifies player A's analysis

Peekaboom Results

27,000 players in first four months

2,100,000 object locations

many people averaged over 12 hours per day for first 10 days

www.peekaboom.org/


Verbosity (proposed)

Problem: input common sense facts, e.g. "cereal is eaten with milk"

Game

player A sees word

player B has to guess word

player A gets to fill in various templates, e.g. "object is typically near ____"

asymmetric verification game

Toolkits

Amazon Mechanical Turk

paid service

requester posts task online, along with instructions and pay rate

worker views available tasks and selects those of interest

Examples

examine an image and click on specified objects, $0.05 per object

evaluate relevance of search results, $0.02 per evaluation

www.mturk.com/


Toolkits (continued)

Bossa

open source, Linux

developer provides task-specific PHP scripts

system rates volunteer skill, evaluates agreement among volunteers

pointer to Bolt, open source tutorial builder

boinc.berkeley.edu/trac/wiki/BossaIntro/

Facebook

install customized apps

take advantage of social networks

www.facebook.com/

Linguist@home (a.k.a. That's Your Opinion)

annotate sentences with opinions

make it fun

resources: customer data; server; expert consultant(s)

tools: Bossa (PHP) or Java; Facebook and standalone

Linguist@home 1-player game

display sentence

display list of templates, determined by expert consultant

highlight eligible participants: entities and events

allow multiple answers: 10 points for first, 20 for second, 30 for third, etc.

How can we assure that answers are valid?

Linguist@home 2-player game

symmetric verification

2 players each play same game

points for matched answers
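A rough sketch of how the two scoring ideas might combine: answers are (template, participant) pairs, only answers given by both players count, and the counted answers earn 10, 20, 30, ... points. Applying the escalating schedule to matched answers is an assumption, since the slides give the schedule only for the 1-player game, and the function names and sample answers are invented for illustration.

```python
def escalating_points(n_answers: int, step: int = 10) -> int:
    """10 points for the first answer, 20 for the second, 30 for the third, etc."""
    return sum(step * i for i in range(1, n_answers + 1))


def matched_answers(player_a, player_b):
    """Answers both players gave, kept in player A's order.

    Each answer is a (template, participant) pair, e.g.
    ("SPEAKER DISLIKES", "United States of America").
    """
    given_by_b = set(player_b)
    return [answer for answer in player_a if answer in given_by_b]


def round_score(player_a, player_b) -> int:
    """Score for one sentence under symmetric verification (assumed rule)."""
    return escalating_points(len(matched_answers(player_a, player_b)))


# Hypothetical answers for the Mandela sentence shown earlier in the talk.
answers_a = [
    ("SPEAKER DISLIKES", "United States of America"),
    ("SPEAKER WANTS", "world peace"),
]
answers_b = [
    ("SPEAKER DISLIKES", "United States of America"),
]

score = round_score(answers_a, answers_b)
print(score)
```

Because unmatched answers earn nothing, a player gains no points by guessing wildly, which is one possible response to the validity question raised on the 1-player slide.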

Future Work

extend to other linguistic subdisciplines, e.g. topic classification

extend to other widely used & studied languages, e.g. German, Chinese

extend to fastest growing languages, e.g. Arabic

sociopolitical factors

References

------. Playing or processing. The Economist. Dec 6, 2007.

------. Spreading the load. The Economist. Dec 6, 2007.

------. A world wide web of terror. The Economist. July 12, 2007.

Amir Alexander. Aerogel: The "Frozen Smoke" that Made Stardust Possible. The Planetary Society. November 8, 2006.

Nathaniel Ayewah, Rada Mihalcea, and Vivi Nastase. Building Multilingual Semantic Networks with Non-Expert Contributions over the Web. Proceedings of the KCAP 2003 Workshop on Distributed and Collaborative Knowledge Capture. Sanibel Island, Florida, November 2003.

Timothy Chklovski. 2005. Designing interfaces for guided collection of knowledge about everyday objects from volunteers. In Proceedings of the 10th international Conference on intelligent User interfaces (San Diego, California, USA, January 10 - 13, 2005). IUI '05. ACM, New York, NY, 311-313.

Timothy Chklovski, Using Analogy to Acquire Commonsense Knowledge from Human Contributors, MIT Artificial Intelligence Laboratory technical report AITR-2003-002, February 2003.

Timothy Chklovski and Rada Mihalcea. Exploiting Agreement and Disagreement of Human Annotators for Word Sense Disambiguation. Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP 2003). Borovetz, Bulgaria, September 2003.

Kate Land, Anze Slosar, Chris Lintott, Dan Andreescu, Steven Bamford, Phil Murray, Robert Nichol, M.Jordan Raddick, Kevin Schawinski, Alex Szalay, Daniel Thomas, Jan Van den Berg. Galaxy Zoo: The large-scale spin statistics of spiral galaxies in the Sloan Digital Sky Survey. Submitted March 22, 2008.

Chris J. Lintott, Kevin Schawinski, Anze Slosar, Kate Land, Steven Bamford, Daniel Thomas, M. Jordan Raddick, Robert C. Nichol, Alex Szalay, Dan Andreescu, Phil Murray, Jan van den Berg. Galaxy Zoo : Morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Submitted to MNRAS, April 29, 2008.

Rada Mihalcea and Timothy Chklovski. Open Mind Word Expert: Creating Large Annotated Data Collections with Web Users' Help. Proceedings of the EACL 2003 Workshop on Linguistically Annotated Corpora (LINC 2003). Budapest, April 2003.

Iadh Ounis, Maarten de Rijke, Craig Macdonald, Gilad Mishne, Ian Soboroff. Overview of the TREC-2006 Blog Track. TREC 2006.

Luis von Ahn. Games With a Purpose. IEEE Computer Magazine, vol. 39, no. 6, pp. 92-94, June 2006.

Luis von Ahn. Human Computation. Google Tech Talks. July 26, 2006.

Luis von Ahn, Ruoran Liu and Manuel Blum. Peekaboom: A Game for Locating Objects in Images. ACM CHI 2006.

Luis von Ahn, S. Ginosar, M. Kedia, R. Liu and M. Blum. Improving Accessibility of the Web with a Computer Game. ACM CHI 2006.

Luis von Ahn, Mihir Kedia and Manuel Blum. Verbosity: A Game for Collecting Common-Sense Facts. ACM CHI 2006.

A. J. Westphal, C. C. Allen, R. Bastien, J. Borg, F. Brenker, J. C. Bridges, D. E. Brownlee, A. L. Butterworth, C. Floss, G. J. Flynn, D. Frank, Z. Gainsforth, E. Gruen, P. Hoppe, A. T. Kearsley, H. Leroux, L. R. Nittler, S. A. Sandford, A. Simionovici, F. J. Stadermann, R. M. Stroud, P. Tsou, T. Tyliszczak, J. Warren, M. E. Zolensky. Preliminary Examination of the Interstellar Collector of Stardust. 39th Lunar and Planetary Science Conference (2008), Abstract #1855.

Nicholos Wethington. Galaxy Zoo Gets a Makeover. Universe Today. April 23, 2008.

Nicholos Wethington. Galaxy Zoo Results Show that the Universe Isn't 'Lopsided'. Universe Today. March 28, 2008.

Hui Yang, Luo Si, Jamie Callan. Knowledge Transfer and Opinion Detection in the TREC2006 Blog Track. TREC 2006.
