What motivates library crowdsourcing volunteers? Experiences at the California Digital Newspaper Collection and the Cambridge Public Library Newspapers Collection Brian Geiger Director, Center for Bibliographical Studies and Research California Digital Newspaper Collection Frederick Zarndt Chair, IFLA Newspapers Section

20130630 What motivates library crowdsourcing volunteers? [ALA LITA]


Page 1: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

What motivates library crowdsourcing volunteers?

Experiences at the California Digital Newspaper Collection and the Cambridge Public Library Newspapers Collection

Brian Geiger, Director, Center for Bibliographical Studies and Research, California Digital Newspaper Collection

Frederick Zarndt, Chair, IFLA Newspapers Section

Page 2: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.

1. crowds, properties of
2. crowdsourcing, short history of
3. crowdsourcing and libraries, especially digital newspapers collections
4. crowdsourcing, motivations for
5. crowdsourcing, interesting fact
6. crowdsourcing, website traffic
7. crowdsourcing, accuracy of
8. crowdsourcing, economic benefits of
9. crowdsourcing, other benefits of

Page 3: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Crowd properties

Page 4: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

In 2004 James Surowiecki published The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations.

In it he says ...

Page 5: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

... a crowd of persons that are diverse ...

Page 6: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

... independent ...

Page 7: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

... and decentralized ...

Page 8: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

... usually make better judgements or decisions than single persons

Page 9: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

“Country Fair” by Grandma Moses. Original painting 1950.

Page 10: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Wisdom of crowds in summary

James Surowiecki, The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations, Anchor Books, New York, 2005.

Diversity: Each person should have private information even if it's just an eccentric interpretation of the known facts.

Independence: People's opinions aren't determined by the opinions of those around them.

Decentralization: People are able to specialize and draw on local knowledge.

Aggregation: Some mechanism exists for turning private judgments into a collective decision.

Page 11: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Crowdsourcing: Short history

Page 12: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

“crowdsourcing” was coined by Jeff Howe in “The rise of crowdsourcing” published in Wired magazine June 2006.

Page 13: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

[Chart: web trends for “crowdsourcing”, Jan-2006 to Jan-2013]

Page 14: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

• On the date of publication of Jeff Howe’s Wired magazine article, 1-Jun-2006, Wikipedia did not have an entry (list) of crowdsourcing projects*.

• On 25-Jan-2010 Wikipedia’s list of crowdsourcing projects had 35 entries*.

• On 17-Mar-2013 Wikipedia’s list of crowdsourcing projects had 158 entries+.

* From the Internet Archive’s Wayback Machine.

+ Wikipedia contributors, "List of crowdsourcing projects," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/List_of_crowdsourcing_projects (accessed March 17, 2013).

Page 15: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers. ... [It] is different from ordinary outsourcing since it is a task or problem that is outsourced to an undefined public rather than a specific, named group.

Wikipedia contributors, "Crowdsourcing," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/wiki/Crowdsourcing (accessed March 17, 2013)

Page 16: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Amazon Mechanical Turk was launched Nov 2005

Alexa global / USA rank of Amazon Mechanical Turk (Jun 2013): 8,242 / 3,060

Alexa reputation (Jun 2013): 1.344

crowdsourcing

Page 17: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

CAPTCHA is a Completely Automated Public Turing test to tell Computers and Humans Apart

Started as a computer science project at Carnegie Mellon

Each day 200,000,000 reCAPTCHAs are solved by humans around the world

Page 18: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Open Street Map was 1st launched in 2007

Alexa global / Belarus / USA traffic rank of Open Street Map (Jun 2013): 8,853 / 1,373 / 20,114

Alexa reputation (Jun 2013): 28,368

Page 19: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Zooniverse was 1st launched July 2007

Alexa global / USA traffic rank of Zooniverse (Jun 2013): 275,960 / 153,853

Alexa reputation (Jun 2013): 502

Registered users (Jun 2013): 848,140

citizen science

Page 20: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Kickstarter was launched in 2009

Alexa global / USA traffic rank of Kickstarter (Jun 2013): 798 / 361

Alexa reputation (Jun 2013): 85,591

43,000+ projects successfully funded with more than USD $669,000,000

crowdfunding

Page 21: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

crowd collaboration

Page 22: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

• Began 2001

• Now in 285 languages

• 40 Wikipedia languages with more than 100,000 articles

• 112 Wikipedia languages with more than 10,000 articles

• 400,000,000 unique visitors per month

• 3,900,000+ articles in English, 1,400,000+ in German, 1,250,000+ in French, 1,050,000 in Dutch

• 85,000 active contributors / more than 300,000 have edited Wikipedia more than 10 times

• Alexa global / USA traffic rank (Jun 2013): 7 / 8

• Alexa reputation (Jun 2013): 2,102,569

Page 24: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Crowdsourcing and Libraries

Page 25: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]
Page 26: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]
Page 27: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Family Search Indexing was 1st launched (beta) 2004

Alexa global / USA traffic rank of FamilySearch (Jun 2013): 4,664 / 1,373

Alexa reputation (Jun 2013): 8,140

Page 28: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

• Started (beta) 2004

• More than 780,000 worldwide registered volunteers from ~25 countries index records relevant to family history

• Approximately 100,000 active volunteers each month

• UI in Chinese, English, German, French, Italian, Japanese, Korean, Portuguese, and Russian

• Blind double-key entry with arbitration / reconciliation

• More than 1,500,088,741 records indexed (July 2012)

• Accuracy typically > 99.95%

Page 29: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Project Gutenberg was 1st launched Dec 1971

Alexa global traffic rank of Project Gutenberg (Jun 2013): 6,709

Alexa reputation (Jun 2013): 31,190

Page 30: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

• Started Dec 1971

• Worldwide volunteers transcribe or proofread OCR’d public domain books through Distributed Proofreaders

• 40,000 books completed (July 2012)

• Partner / affiliated projects for Australia, Canada, Europe, Germany, Luxembourg, Philippines, Runeberg (Nordic literature), Russia, Taiwan

Page 31: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Alexa global / Australia traffic rank of National Library of Australia (June 2013): 15,435 / 411

Trove gets ~77% of all National Library web traffic.

Alexa reputation (Jun 2013): 8,788

Page 32: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

National Library of Australia

• Online since 2008

• More than 9,830,000 / 96,000,000 newspaper pages / articles (May 2013)

• Top text corrector 1,911,000+ lines (May 2013)

• 2,661,140 lines corrected each month (average for 1st 5 months 2013)

• 96,304,337 lines corrected as of May 2013, up from 66,527,535 lines corrected May 2012

• 95,299 / 8,519 registered / active users (May 2013)

Page 33: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Alexa global / country traffic rank of National Library of Finland: 2,535,854 (31-Oct-2012) / 199 (2-Apr-2012)

Page 34: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

National Library of Finland

• Digitalkoot is a project to improve OCR text in digitized newspapers -- by playing games!

• Digitalkoot is a collaboration between the National Library and Microtask

• Players correct OCR text by playing Myyräsillassa (Mole Bridge) or Myyräjahdissa (Mole Hunt)

• National Library has 4,000,000+ digitized pages

• 109,321 registered players (October 2012)

• Since February 2011 8,024,530 micro-tasks have been completed

Page 35: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Alexa global / USA traffic rank of UC Riverside (June 2013): 12,584 / 4,382

CDNC gets ~1.6% of all UC Riverside web traffic.

Page 36: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

California Digital Newspaper Collection

• CDNC began digitizing newspapers in 2005 as part of NDNP

• Newspapers digitized to article-level as well as to page-level as required by NDNP

• Hosted on Veridian beginning 2009

• Collection size 61,351 issues, 544,474 pages, 6,327,491 articles (May 2013)

Page 37: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

OCR text correction

• OCR text correction added Aug 2011

• Corrections are done line by line

• 1340 registered / 599 active users (Jun 2013)

• ~1,160,465+ lines of text corrected (Jun 2013)

• ~1.1% of the collection corrected, 98.9% to go!

• Top corrector 379,573 lines > 2x 2nd corrector

Page 38: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]
Page 39: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Cambridge Public Library Historic Newspaper Collection

• Cambridge Historic Newspapers online since Jan 2012.

• Cambridge Massachusetts Public Library digitized local newspapers (http://cambridge.dlconsulting.com/)

• Newspapers digitized to article-level

• Collection size 6,346 issues, 59,070 pages, 669,406 articles (May 2013)

• Collection includes 13,099 obituary cards

Page 40: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Crowdsourcing: Motivation

Page 41: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Cognitive surplus

... people are learning to use their free time for creative activities rather than consumptive ones [such as watching TV] ...

... the total human cognitive effort in creating all of Wikipedia in every language is about one hundred million hours ...

... Americans alone watch two hundred billion hours of TV every year, or enough time, if it would be devoted to projects similar to Wikipedia, to create about 2000 of them ...

Clay Shirky. Cognitive surplus: Creativity and generosity in a connected age. Penguin Press. New York. 2010.

Page 42: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

User   Lines corrected
1      379,578
2      136,637
3      56,996
4      52,093
5      46,102
6      39,617
7      35,878
8      33,204
9      31,319
10     30,408

Lines corrected   User
1,456,906         1
1,385,369         2
1,010,360         3
960,230           4
847,340           5
786,147           6
657,187           7
600,513           8
582,276           9
565,384           10

Page 43: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]
Page 44: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Motivation: genealogists and family historians

• National Library of Australia’s 2012 Trove status report showed that ~50% of Trove users are family historians

• National Library of New Zealand survey found that ~50% of PapersPast users are genealogists

• A 2012 Utah Digital Newspapers survey showed that 72% of its users are genealogists*

PAPERSPAST

*John Herbert and Randy Olsen. “Small town papers: Still delivering the news”. Paper given at 2012 World Library and Information Congress. Helsinki. August 2012.

Page 45: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

User survey

• CDNC and Cambridge Public Library published a user survey in Mar 2013 (it’s still online)

• 604 / 32 responses

• surveys are (mostly) identical except for organization name

Survey is on home pages at http://cdnc.ucr.edu and http://cambridge.dlconsulting.com/

Page 46: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

User demographic: genealogists and family historians

Page 47: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

User demographic: no spring chickens

Page 48: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

User demographic: reasons for use

Page 49: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

User demographic: types of information

* John Herbert and Randy Olsen. “Small town papers: Still delivering the news”. Paper given at 2012 World Library and Information Congress. Helsinki. August 2012.

Page 50: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

“I enjoy the correction - it’s a great way to learn more about past history and things of interest whilst doing a ‘service to the community’ by correcting text for the benefit of others.”

“I have recently retired from IT and thought that I could be of some assistance to the project. It benefits me and other people. It helps with family research.”

From Rose Holley in “Many Hands Make Light Work.” National Library of Australia March 2009.

Motivation: Trove users’ report

Page 51: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

“I am interested in all kinds of history. I have pursued genealogy as a hobby for many years. I correct text at CDNC because I see it as a constructive way to contribute to a worthwhile project. Because I am interested in history, I enjoy it.”

Wesley, California

Personal communications with CDNC text correctors.

Motivation: CDNC users’ report

Page 52: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

“I only correct the text on articles of local interest - nothing at state, national or international level, no advertisements, etc. The objective is to be able to help researchers to locate local people, places, organizations and events using the on-line search at CDNC. I correct local news & gossip, personal items, real estate transactions, superior court proceedings, county and local board of supervisors meetings, obituaries, birth notices, marriages, yachting news, etc.”

Ann, California

Personal communications with CDNC text correctors.

Motivation: CDNC users’ report

Page 53: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

“I am correcting text for the Coronado Tent City Program for 1903. It is important to correct any problems with personal names and other information so that researchers will be able to search by keyword and be assured of retrieving desired results. ... type fonts cause a great deal of difficulty in digitizing the text and can cause problems for searchers. Also, many of the guests' names at Tent City and Hotel Del Coronado were taken from the registration books and reported in the Program. This led to many problems in spelling of last names and the editors were not careful to be consistent in the spellings. This Program is an important resource since it provides an excellent picture of daily life in Tent City and captures much of the history of Coronado itself.”

Gene, California

Personal communications with CDNC text correctors.

Motivation: CDNC users’ report

Page 54: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

“I have always been interested in history, especially the development of the American West, and nothing brings it alive better than newspapers of the time. I believe them to be an invaluable source of knowledge for us and future generations.”

David, United Kingdom

Personal communications with CDNC text correctors.

Motivation: CDNC users’ report

Page 55: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

“CDNC is an excellent source of information matching my personal interest in such topics as sea history, development of shipbuilding, clippers and other ships etc. ... Unfortunately, the quality of text ... is rather poor I’m afraid. This is why I started to do all corrections necessary for myself ... and to leave the corrected text for use of others. ... I am not doing this very regularly as this is just my hobby and pleasure.”

Jerzey, Poland

Personal communications with CDNC text correctors.

Motivation: CDNC users’ report

Page 56: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

“As an amateur historical researcher my time for research is very limited. Making time to travel to archives, libraries, and historical societies does not happen as often as I would like. The Cambridge Public Library’s online newspaper collection has been an invaluable resource and it is fun. I am very grateful for all the help I have received over the years from so many research organizations. Correcting text has several benefits. It makes it much more likely that I will find a story if I decide to search for it in the future. It is a way of saying ‘thank you’ to the Cambridge Library for having such a great resource available and maybe I can make the next person’s research a little easier. It is my own little historical preservation project.”

Daniel, Massachusetts

Personal communications with Cambridge text correctors.

Motivation: Cambridge users’ report

Page 57: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Graphic from Kaufmann et al. “More than fun and money. Worker Motivation in Crowdsourcing – A Study on Mechanical Turk.”

Motivation

Page 58: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Crowdsourcing: Interesting fact

Page 59: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

The long tail* of crowdsourced OCR text correction

• A probability distribution has a long tail if a larger share of the population rests within its tail than would be the case if the population were normally distributed.

• The most productive users represent a small fraction of the total user population and complete ~50% of total production.

• Said a different way, less productive users represent the largest fraction of the user population and also complete ~50% of the total production.

* The phrase “long tail” was popularized by Chris Anderson in the October 2004 Wired magazine article “The Long Tail” and by Clay Shirky’s February 2003 essay “Power laws, web logs, and inequality”.
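The 50% split can be checked for any set of per-user counts with a short sketch like the one below. It is illustrative only: the function name `users_for_share` is made up here, the counts are the first top-10 list of lines corrected shown earlier in this deck, and pairing that list with the ~1,160,465 total CDNC lines corrected (Jun 2013) is an assumption, since the deck does not date the list.

```python
def users_for_share(lines_per_user, total_lines, share=0.5):
    """Return how many of the most productive users are needed to reach
    `share` of `total_lines`, or None if the listed users fall short."""
    running = 0
    ranked = sorted(lines_per_user, reverse=True)
    for rank, lines in enumerate(ranked, start=1):
        running += lines
        if running >= share * total_lines:
            return rank
    return None


# First top-10 list of lines corrected from the deck.
top10 = [379_578, 136_637, 56_996, 52_093, 46_102,
         39_617, 35_878, 33_204, 31_319, 30_408]

# Assumption: total lines corrected at CDNC (~1,160,465, Jun 2013).
total = 1_160_465

print(users_for_share(top10, total))
```

Under these assumed inputs the sketch reports that the four most productive correctors account for half of all corrected lines, out of 1,340 registered users.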

Page 60: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

OCR text correction long tails

[Two charts: CDNC lines corrected by text corrector (top corrector 242,965) and NLA lines corrected by text corrector (top corrector 1,456,906). Each chart is marked with two 50% shares of the lines corrected.]

Statistics from Oct 2012

Page 61: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Website traffic

Page 62: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Website traffic

After a crowdsourcing transcription project of diaries from the American War Between the States, Nicole Saylor, Head of Digital Library Services at the University of Iowa Libraries, reported

“On June 9, 2011, we went from about 1000 daily hits to our digital library on a really good day to more than 70,000.”

Nicole Saylor interviewed by Trevor Owens. “Crowdsourcing the Civil War: Insights Interview with Nicole Saylor” blog post at http://blogs.loc.gov/digitalpreservation/2011/12/crowdsourcing-the-civil-war-insights-interview-with-nicole-saylor/. Dec 6, 2011.

Page 63: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Website traffic

Website traffic at CDNC before / after implementing crowdsourcing

                   before crowdsourcing        after crowdsourcing         change
                   11-Jun-2011 / 12-Jul-2011   11-Jun-2012 / 12-Jul-2012
visits             17,485                      21,488                      +22.9%
unique visitors    11,381                      13,376                      +17.5%
visit duration     9m 24s                      11m 7s                      +18.3%
bounce rate        51.3%                       44.5%                       -6.8 percentage points
pages per visit    14.9                        11.7                        -21.5%
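The change column can be reproduced with a few lines of Python. This is a minimal sketch: the helper names `pct_change` and `point_change` are invented for illustration, and it assumes the bounce-rate change is an absolute difference between two percentages while the other rows are relative changes.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def point_change(before_pct, after_pct):
    """Absolute change between two percentages, in percentage points."""
    return after_pct - before_pct


# Figures from the traffic table; visit duration is converted to seconds.
metrics = {
    "visits": (17_485, 21_488),
    "unique visitors": (11_381, 13_376),
    "visit duration (s)": (9 * 60 + 24, 11 * 60 + 7),
    "pages per visit": (14.9, 11.7),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")

print(f"bounce rate: {point_change(51.3, 44.5):+.1f} percentage points")
```

Expressed as a relative change, the same bounce-rate drop would be about 13%, so it matters which convention a report uses.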

Page 64: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Website traffic

Page 65: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Accuracy

“His Accuracy Depends on Ours!”

Office for Emergency Management. Office of War Information. Domestic Operations Branch. Bureau of Special Services. [Photo held at US National Archives and Records Administration]

Page 66: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Why correct OCR text?

Here’s why ...

Page 67: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Deaths. lln»rieff, Esq. of <c .. Qn. Sunday, the till. greatly Drandrellt, of Orms4\irJi.- ~ ; ;✓ ' • * On ijfr r inn l j j j i l F i i j ' 1 1 f H a v o d i v y d , Carnarvonshire, S ; **" *- ' « ' March Oxford, F. Tfovmeud, Uerald. » • V . •On Tncsdav last , Mr. Charles. IWilinson, this 8 ; had vf thesis#,, a week ago, which tcrminate<i'iu his death. . / ' ■ O'i Sunday, dJst nit. at. A s b t C n v H a l l , m a r L a n c a s t e r , Mr.,Geo. Worn ick , many years house'steward hit late Once The Hamilton and Brandon. He locked himself h»oWn'r«wte<: soon. twelve o'clock" that dny, and fii»-d a loaded pistol "through Ins bead, 1 which instantaneously killed him. Coronet's Verdict, shot himself in a temporary fit of Friday week,

raw OCR text

Excerpt from The British Newspaper Archive, Chester Courant, Tuesday 6-Apr-1819, page 3.

newspaper image

Page 68: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Accuracy

• Edwin Klijn (Koninklijke Bibliotheek, the Netherlands) reports raw OCR character accuracies of 68% for early 20th century newspapers

• Rose Holley (National Library of Australia) reports raw OCR character accuracy varied from 71% to 98% on a sample of Trove digitized newspapers

Rose Holley. “How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs.” D-Lib Magazine. March/April 2009.

Edwin Klijn. “The current state-of-art in newspaper digitization.” D-Lib Magazine. January/February 2008.

Public domain graphic courtesy of Wikimedia Commons.

Page 69: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Accuracy

MAPPING TEXTS* assesses digitization quality of digital newspapers by comparing the number of words recognized to the total number of words scanned

* Mapping texts is a collaboration between the University of North Texas and Stanford University aimed at experimenting with new methods for finding and analyzing meaningful patterns embedded in massive collections of digital newspapers.

Page 70: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Uncorrected OCR accuracy by newspaper title

Title                                                   Years         OCR character accuracy   ~OCR word accuracy*
PRP  Pacific Rural Press                                1871 - 1922   92.6%                    68.1%
SFC  San Francisco Call                                 1890 - 1913   92.6%                    68.1%
LAH  Los Angeles Herald                                 1873 - 1910   88.7%                    54.9%
LH   Livermore Herald                                   1877 - 1899   88.6%                    54.6%
DAC  Daily Alta California                              1841 - 1891   88.2%                    53.4%
CFJ  California Farmer and Journal of Useful Sciences   1855 - 1880   86.5%                    48.4%
SN   Sausalito News                                     1885 - 1922   70.4%                    17.3%

*Word accuracy assumes average word length is 5 characters

Page 71: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

OCR accuracy by newspaper title

Title                                                   Years         OCR character accuracy   Corrected accuracy
PRP  Pacific Rural Press                                1871 - 1922   92.6%                    99.3%
SFC  San Francisco Call                                 1890 - 1913   92.6%                    99.6%
LAH  Los Angeles Herald                                 1873 - 1910   88.7%                    99.1%
LH   Livermore Herald                                   1877 - 1899   88.6%                    99.9%
DAC  Daily Alta California                              1841 - 1891   88.2%                    99.9%
CFJ  California Farmer and Journal of Useful Sciences   1855 - 1880   86.5%                    98.3%
SN   Sausalito News                                     1885 - 1922   70.4%                    100.0%

Page 72: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Corrected accuracy by newspaper title

Title   Years         OCR character accuracy   ~OCR word accuracy*   Corrected accuracy   ~Corrected word accuracy*
PRP     1871 - 1922   92.6%                    68.1%                 99.3%                96.5%
SFC     1890 - 1913   92.6%                    68.1%                 99.6%                98.0%
LAH     1873 - 1910   88.7%                    54.9%                 99.1%                95.6%
LH      1877 - 1899   88.6%                    54.6%                 99.9%                99.5%
DAC     1841 - 1891   88.2%                    53.4%                 99.9%                99.5%
CFJ     1855 - 1880   86.5%                    48.4%                 98.3%                91.8%
SN      1885 - 1922   70.4%                    17.3%                 100.0%               100.0%

*Word accuracy assumes average word length is 5 characters

Page 73: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Correction accuracy by user

User   Average OCR accuracy   Correction accuracy
A      70.4%                  100.0%
B      87.1%                  99.5%
C      95.4%                  99.5%
D      86.5%                  98.3%
E      95.3%                  100.0%
F      91.0%                  100.0%
G      91.0%                  99.8%
H      90.5%                  99.0%
I      96.6%                  99.8%
J      94.8%                  100.0%
K      86.8%                  99.3%

Page 74: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

How does low text accuracy affect search recall?

The Facts

• Average uncorrected OCR character accuracy of the CDNC sample data is ~89%

• Average length of an English word is 5 characters

• Average word accuracy is 89% x 89% x 89% x 89% x 89% = 55.8% - round up to 60% or 6 out of 10 words correct
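The multiplication above treats a word as correct only when every one of its characters is correct. A minimal sketch of that rule follows; the function name `word_accuracy` is invented here, and the independence of character errors is a simplifying assumption of the slide's arithmetic, not a measured property of OCR output.

```python
def word_accuracy(char_accuracy, word_length=5):
    """Estimated probability that a whole word is recognized correctly,
    assuming each character is right independently with `char_accuracy`."""
    return char_accuracy ** word_length


# Uncorrected CDNC sample average (~89%) and corrected average (~99.4%).
print(f"{word_accuracy(0.89):.1%}")
print(f"{word_accuracy(0.994):.1%}")

# Per-title character accuracies from the deck's uncorrected-accuracy table.
for title, char_acc in [("PRP", 0.926), ("LAH", 0.887), ("SN", 0.704)]:
    print(title, f"{word_accuracy(char_acc):.1%}")
```

Passing a different `word_length` gives the estimate for longer search terms such as surnames.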

Accuracy

Page 75: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Search recall, no text correction

[Figure: a field of “ARNDT” instances, some intact and some broken apart. Legend: instances of “ARNDT” found / instances of “ARNDT” not found]

Page 76: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Accuracy

The Facts

• Average corrected character accuracy of the CDNC sample data is ~99.4%

• Average word accuracy of CDNC corrected text is 99.4% x 99.4% x 99.4% x 99.4% x 99.4% = 97.0%

Page 77: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Search recall with text correction

[Figure: a field of “ARNDT” instances, nearly all intact. Legend: instances of “ARNDT” found / instances of “ARNDT” not found]

Page 78: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

A search for “Arndt” at Chronicling America gives 10,267 results*

• If Chronicling America text accuracy is 55.8% (same as uncorrected CDNC sample), then 8,133 instances of “Arndt” were not found

• If text accuracy is 97.0%, then 317 instances of “Arndt” were not found

Accuracy

* Search performed 31 Oct 2012
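The missed-instance figures follow from a one-line relation. In the worked equation below, the symbols are labels introduced here for illustration: T is the true number of occurrences, F the number the search found, a the word accuracy, and M the number missed. It assumes every misrecognized occurrence is missed and that there are no false hits.

```latex
\begin{align*}
F &= a\,T \\
M &= T - F = \frac{F}{a} - F = F\,\frac{1 - a}{a} \\
a = 0.558:\quad M &= \frac{10{,}267}{0.558} - 10{,}267 \approx 8{,}133 \\
a = 0.970:\quad M &= \frac{10{,}267}{0.970} - 10{,}267 \approx 317.5
\end{align*}
```

The second value agrees with the slide's 317 to within rounding.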

Page 79: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Accuracy

Suppose the word/name is longer than 5 characters?

The Facts

• Assume that average uncorrected / corrected OCR character accuracy is ~89% / ~99%, the same as CDNC.

Name         Name length   Raw text accuracy   Corrected text accuracy
Eklund       6             49.7%               94.2%
Kennedy      7             44.2%               93.25%
Espinosa     8             39.4%               92.3%
Bonaparte    9             35.0%               91.4%
Chatterjee   10            31.2%               90.4%

Page 80: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Accuracy

Name         Number of search results   Missing results with raw text accuracy   Missing results with corrected text accuracy
Eklund       2,951                      2,987                                    182
Kennedy      360,723                    455,392                                  26,111
Espinosa     1,918                      2,950                                    160
Bonaparte    44,664                     82,947                                   4,203
Chatterjee   19                         42                                       2

Chronicling America searches done 19-Mar-2013 (6,025,474 pages from 1836 to 1922).
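The missing-results columns can be recomputed from the search counts and the per-name accuracies. The sketch below is illustrative: `missing_results` and `missing_table` are names made up here, the accuracies are the rounded percentages from the preceding name-length slide (with Kennedy's corrected value read as 93.25%, an assumption), and it supposes that a search finds exactly the correctly recognized fraction of true occurrences.

```python
# (name, search results, raw text accuracy, corrected text accuracy)
ROWS = [
    ("Eklund", 2_951, 0.497, 0.942),
    ("Kennedy", 360_723, 0.442, 0.9325),
    ("Espinosa", 1_918, 0.394, 0.923),
    ("Bonaparte", 44_664, 0.350, 0.914),
    ("Chatterjee", 19, 0.312, 0.904),
]


def missing_results(found, accuracy):
    """Estimated occurrences a search missed: found / accuracy is the
    implied true total, and the found results are subtracted from it."""
    return round(found / accuracy - found)


def missing_table(rows):
    """Return (name, missing at raw accuracy, missing at corrected accuracy)."""
    return [
        (name, missing_results(found, raw), missing_results(found, corrected))
        for name, found, raw, corrected in rows
    ]


for name, raw_missing, corrected_missing in missing_table(ROWS):
    print(f"{name:<11}{raw_missing:>9,}{corrected_missing:>9,}")
```

Because the implied total grows as accuracy falls, the estimate of missed results is very sensitive to the accuracy assumed for long names.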

Page 81: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Crowdsourcing economic benefits

Public domain photo courtesy of US Navy

Page 82: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Economic benefits

Financial value of outsourced OCR text correction for newspapers?

The Assumptions

• 25 to 50 characters per line in a newspaper column: Assume 40 characters per line (CDNC sample average)

• Outsourced text transcription or correction costs USD $0.35 to $1.20 per 1000 characters: Assume $0.50 per 1000 characters

Page 83: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Economics

CDNC: 578,000 lines x 40 characters per line x 1/1000 x $0.50 = $11,560

National Library of Australia: 68,908,757 lines x 40 characters per line x 1/1000 x $0.50 = $1,378,175
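The outsourcing estimate is a straight product of the stated assumptions, which makes it easy to rerun with other rates. A minimal sketch, with the function name `outsourced_cost` invented for illustration and the defaults taken from the assumptions slide:

```python
def outsourced_cost(lines, chars_per_line=40, usd_per_1000_chars=0.50):
    """Estimated cost in USD of paying a vendor to correct `lines` lines."""
    return lines * chars_per_line / 1000 * usd_per_1000_chars


print(f"${outsourced_cost(578_000):,.0f}")
print(f"${outsourced_cost(68_908_757):,.0f}")

# Sensitivity to the quoted vendor price range for the 578,000-line case.
for rate in (0.35, 1.20):
    print(rate, f"${outsourced_cost(578_000, usd_per_1000_chars=rate):,.0f}")
```

At the low and high ends of the quoted range ($0.35 and $1.20 per 1000 characters), the 578,000-line case comes to $8,092 and $27,744.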

Page 84: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Economics

Financial value of in-house OCR text correction?

The Assumptions

• Correction takes 15 seconds per line

• Cost is hourly wage plus benefits of lowest level employee, $10 for CDNC, $41.88* for Australia

* AUD $40.38 = USD $41.88 is the actual labor value assumed by the National Library of Australia to calculate avoided costs due to crowdsourced OCR text correction in its 2012 Trove Status Report.

Page 85: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Economics

CDNC: 578,000 lines x 15 seconds per line x 1/3600 hrs per second x $10.00 per hr = $24,083

National Library of Australia: 68,908,757 lines x 15 seconds per line x 1/3600 hrs per second x $41.88 per hr = $12,024,578
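The in-house estimate converts lines to staff hours and multiplies by an hourly cost. The sketch below uses an invented function name, `in_house_cost`, with defaults taken from the assumptions slide (15 seconds per line, $10 per hour for CDNC).

```python
def in_house_cost(lines, seconds_per_line=15, hourly_cost_usd=10.00):
    """Estimated staff cost in USD of correcting `lines` lines in-house."""
    hours = lines * seconds_per_line / 3600
    return hours * hourly_cost_usd


# CDNC at $10.00 per hour; National Library of Australia at USD $41.88 per hour.
print(f"${in_house_cost(578_000):,.0f}")
print(f"${in_house_cost(68_908_757, hourly_cost_usd=41.88):,.0f}")
```

Under these assumptions the 578,000 CDNC lines correspond to roughly 2,400 staff hours.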

Page 86: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Crowdsourcing: other benefits

Public domain photo “A useful instruction for young sailors from the Royal Hospital School, Greenwich” from the National Maritime Museum.

Page 87: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

“when someone transcribes a document, they are actually better fulfilling the mission of a cultural heritage organization than someone who simply stops by to flip through the pages”

Other benefit

Paraphrased from Trevor Owens’s blog http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/ (accessed June 2013).

Page 88: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

“in addition to increasing search accuracy or lowering the costs of document transcription, crowdsourcing is the single greatest advancement in getting people using and interacting with library collections”

Other benefit

Paraphrased from Trevor Owens’s blog http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/ (accessed June 2013).

Page 89: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Crowdsourcing considerations

• How to market / advertise crowdsourcing?

• How to motivate crowdsourcers?

• Is authentication / identity of crowdsourcers an issue?

• How to administer crowdsourced data?

Photo of Aleister Crowley [Public domain] from Wikimedia Commons

Page 90: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Conclusions

Conclusion of the Sonata for piano #32, opus 111 by Ludwig van Beethoven

• Lots of crowdsourcing in cultural heritage organizations and elsewhere

• Benefits are multi-faceted: Economic, data accuracy, patron engagement, increased web traffic

Page 91: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Resources

Public domain photo “A useful instruction for young sailors from the Royal Hospital School, Greenwich” from the National Maritime Museum.

Page 92: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Comprehensive worldwide list of online newspaper archives

Wikipedia contributors, "List of online newspaper archives," Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/wiki/Wikipedia:List_of_online_newspaper_archives (accessed March 17, 2013).

Page 93: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Search many digital newspaper collections at once!

As of June 2013 elephind (http://www.elephind.com) has indexed 1,032 newspapers from 13 historical digital collections comprising 1,097,928 issues and 45,243,443 pages/articles.

Page 94: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Correct California newspapers at http://cdnc.ucr.edu

Correct Cambridge MA newspapers http://bit.ly/cambridgepublic

Correct Australian newspapers http://trove.nla.gov.au

Correct Tennessee newspapers http://tndp.lib.utk.edu

Correct Virginia newspapers http://virginiachronicle.com

Try crowdsourcing!

Page 95: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

Correct Vietnamese newspapers http://bit.ly/nationallibraryofvietnam

Hãy thử crowdsourcing! (Vietnamese: “Try crowdsourcing!”)

Or try Russian language periodicals http://bit.ly/russianperiodicals

Попробуйте краудсорсинга! (Russian: “Try crowdsourcing!”)

Page 97: 20130630 What motivates library crowdsourcing volunteers? [ALA LITA]

?

Brian Geiger

[email protected]

Frederick Zarndt

[email protected]

Photo held by John Oxley Library, State Library of Queensland. Original from Courier-mail, Brisbane, Queensland, Australia.