DESCRIPTION
Presentation of a pilot testing human computation gaming to improve OCR correction of non-born-digital content in Singapore. This presentation was given at BarCamp Singapore 4, Saturday 21 November 2009.
Crowdsourcing OCR Correction Through Game Playing
By:
Lin Tingji Jovian, National University of Singapore (NUS)
Olivier Amprimo, Digital Resources & Services, National Library Board (NLB)
http://apps.facebook.com/typeattack/ © Lin Tingji Jovian, Olivier Amprimo
Overview
1. Problems with Digital Archiving
2. Solutions
3. TypeAttack
4. How Does TypeAttack Work
5. Where We Are Now
Problems
Problems in Digitizing Archives
• National Library Board (NLB) digitizes The Straits Times (SPH) articles.
• However, when the articles are older, digitization of the content is prone to many inaccuracies.
• For example:
Source: ReCaptcha
Some of NLB's OCR Result
NLB's OCR Translation
IS NOT GOOD ENOUGH!
• In fact, the NLB needs to employ people to double-check and rectify errors.
• This leads to extra cost and inefficiency.
Solutions
Solution to Improve Digitization Process
1. Many tasks challenge even sophisticated computer programs.
2. Yet they are trivial for humans.
Options to Improve Digitization Process
1. Employ a large number of people, dedicated full time.
+ + NLB has experience in doing this.
- - Resource Allocation and Co-ordination Cost
2. Enroll and Support Volunteers
+ + National Library of Australia > http://www.nla.gov.au/ndp/get_involved/
- - Singapore specifics:
    • Copyrights
    • Computer literacy of elders
    • Partial retirement
3. Make it a mandatory part of popular processes
+ + A Turing test to separate bots from humans online > ReCaptcha
- - ReCaptcha turned out to be very proprietary
    No plans for a change, even with Google
    Word, not sentence > decontextualization = poor meaning
4. Make it part of something attractive
+ + Tap into the popularity of games: Human Computational Games
    More than 200 million hours are spent each day playing computer games in the U.S. alone.
    The World Financial Crisis makes it bigger!
>> TypeAttack
TypeAttack
• TypeAttack is a Human Computational Game on Facebook that helps digitize archives for the National Library Board.
• Being built on Facebook, TypeAttack can:
  - Harness Facebook's 200 million active users worldwide.
  - Utilize Facebook's viral techniques:
    • Friend invites to the game.
    • Publishing typing scores onto a user's Facebook wall.
    • Utilizing Facebook's newsfeeds to expose TypeAttack to more Facebook users.
  - Most importantly, perform Human Computation efficiently.
Step back: Motivation behind Human Computational Games
Luis von Ahn, Carnegie Mellon University
Since people spend so much time on computer games…
Let us make use of them to perform tasks that computers have difficulty performing.
BIRTH OF Human Computational Games
1. These are games played by humans that produce useful computation as a side effect.
2. People play not because they want to solve computational problems, but because they want to be entertained.
3. They combine human brainpower with computers to solve problems that neither could solve alone.
How Does TypeAttack Work
How TypeAttack works
• Flow:
Entire collection of The Straits Times for the year 1938 (10 GB of data)
XML files representing data per (newspaper) page
image
XML
1. TypeAttack "cuts" the different articles in a page.
2. Within an article, it cuts out short snippets of text.
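The second step can be illustrated with a minimal sketch. The slides do not say how snippet boundaries are chosen, so the fixed word count used here, and the names `cut_into_snippets` and `max_words`, are assumptions for illustration only, not TypeAttack's actual method.

```python
def cut_into_snippets(article_text: str, max_words: int = 8) -> list[str]:
    """Split an article's text into consecutive snippets of at most max_words words.

    Assumption: snippets are defined by a fixed word count; the real game
    may use line or region boundaries taken from the page XML instead.
    """
    words = article_text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Hypothetical article text, used only to show the call.
article = "The quick brown fox jumps over the lazy dog near the river bank today"
snippets = cut_into_snippets(article, max_words=5)
```

Short snippets keep each typing task small enough to fit into a quick game round.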
How TypeAttack works
• Flow:
Entire 1938 Straits Times + respective XML files
> TypeAttack + Facebook users + Computational Algorithms
> Digitized versions with over 99.0% transcription accuracy at the word level.
How does TypeAttack digitize content?
TypeAttack uses the following kinds of information:
1. Output from Facebook users.
2. NLB's OCR Translation Results
   - accuracy rate at the word level
3. Bi-gram (text prediction)
   - Looks at the probability of a two-word sequence
   - E.g. given Word A, what is the probability of Word B?
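The bi-gram idea in item 3 can be sketched as a conditional probability estimated from word counts. This is a generic maximum-likelihood estimate, assumed here for illustration; the slides do not describe which corpus or smoothing TypeAttack uses, and the function name `bigram_probability` is hypothetical.

```python
from collections import Counter


def bigram_probability(corpus_words: list[str], word_a: str, word_b: str) -> float:
    """Estimate P(word_b | word_a) as count(word_a word_b) / count(word_a)."""
    # Count every adjacent pair of words in the corpus.
    pair_counts = Counter(zip(corpus_words, corpus_words[1:]))
    # Count each word that has a successor (the last word has none).
    first_counts = Counter(corpus_words[:-1])
    if first_counts[word_a] == 0:
        return 0.0  # word_a never seen with a successor
    return pair_counts[(word_a, word_b)] / first_counts[word_a]


# Toy corpus, invented for the example.
corpus = "today is nice and today is warm but today was cold".split()
p_is_given_today = bigram_probability(corpus, "today", "is")
```

In this toy corpus "today" is followed by "is" in two of its three occurrences, so the estimate is 2/3.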
How does TypeAttack digitize content? (1) Output from Facebook users
Today is nice.
Today iz nice.
is nize.
Today is nice.
How does TypeAttack digitize content? (1) Output from Facebook users
• Say, for example, the paragraph to be typed is "Today is nice.".
• Based on all the players' output for this particular paragraph, TypeAttack computes a probability for each word:
  "Today" = typed by 96.4% of users.
  "is" = typed by 95.1% of users.
  "nice." = typed by 97.3% of users.
• Since each word in this paragraph is at least 95% probable, we determine that "Today is nice." is the correct output.
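The per-word vote described above can be sketched as follows. The 95% threshold comes from the slides; the alignment rule (compare words by position and ignore outputs whose word count differs from the most common one) and the name `word_consensus` are assumptions, since the slides do not say how outputs of different lengths are handled.

```python
from collections import Counter


def word_consensus(outputs: list[str], threshold: float = 0.95):
    """Return ([(word, share), ...], accepted) for a list of player outputs."""
    tokenized = [text.split() for text in outputs]
    if not tokenized:
        return [], False
    # Assumption: only outputs with the most common word count are comparable.
    common_len = Counter(len(t) for t in tokenized).most_common(1)[0][0]
    if common_len == 0:
        return [], False
    aligned = [t for t in tokenized if len(t) == common_len]
    result = []
    for position in range(common_len):
        votes = Counter(t[position] for t in aligned)
        word, count = votes.most_common(1)[0]
        result.append((word, count / len(aligned)))
    # The paragraph is accepted only if every word reaches the threshold.
    accepted = all(share >= threshold for _, share in result)
    return result, accepted


# 39 players agree, one player mistypes "is" as "iz".
many_players = ["Today is nice."] * 39 + ["Today iz nice."]
words, accepted = word_consensus(many_players)

# With only three players and two different typos, no consensus is reached.
few_players = ["Today is nice.", "Today iz nice.", "Today is nize."]
_, accepted_few = word_consensus(few_players)
```

With few players a single typo drops a word below the threshold, which is why a paragraph stays in the game until enough outputs have been collected.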
How does TypeAttack digitize content? (2) NLB's OCR Translation Result
• The previous method simply compares the output between players.
• To speed things up, we will also compare the players' output with the OCR Translation Result.
• Once every word in the paragraph is >95% probable, the paragraph's status is set to 'complete' and it will no longer be displayed in the game.
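A minimal sketch of the status rule above. The 'complete' status and the >95% threshold come from the slides; how the OCR result is weighted is not described, so counting it as one extra vote alongside the players is an assumption, as are the names `Paragraph`, `update_status` and `paragraphs_to_display`.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    ocr_words: list[str]
    player_outputs: list[list[str]] = field(default_factory=list)
    status: str = "pending"


def word_confidences(paragraph: Paragraph) -> list[float]:
    """Share of votes for the top word at each position.

    Assumption: the OCR result counts as one vote, and only outputs with
    the same word count as the OCR result are compared.
    """
    n = len(paragraph.ocr_words)
    voters = [o for o in paragraph.player_outputs if len(o) == n]
    if not voters:
        return []  # no comparable player output yet
    voters = voters + [paragraph.ocr_words]
    confidences = []
    for i in range(n):
        top_count = Counter(o[i] for o in voters).most_common(1)[0][1]
        confidences.append(top_count / len(voters))
    return confidences


def update_status(paragraph: Paragraph, threshold: float = 0.95) -> str:
    """Mark the paragraph 'complete' once every word exceeds the threshold."""
    confidences = word_confidences(paragraph)
    if confidences and all(c > threshold for c in confidences):
        paragraph.status = "complete"
    return paragraph.status


def paragraphs_to_display(paragraphs: list[Paragraph]) -> list[Paragraph]:
    """Completed paragraphs are no longer shown in the game."""
    return [p for p in paragraphs if p.status != "complete"]


typed = ["Today", "is", "nice."]
agrees = Paragraph(ocr_words=["Today", "is", "nice."], player_outputs=[typed, typed])
differs = Paragraph(ocr_words=["Tcday", "is", "nice."], player_outputs=[typed, typed])
status_agrees = update_status(agrees)
status_differs = update_status(differs)
```

When the OCR result agrees with the players, the paragraph completes with fewer player outputs; when it disagrees, the paragraph stays in the game until more players have typed it.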
Innovativeness and Uniqueness
1. Uses Game Elements and Social Networking to channel human brainpower through computer games.
2. Uses probabilities determined from different sources (Facebook users and OCR) to extract the correct text content with minimum user output.
3. Working System
   Seamlessly integrates Game Elements, Computational Algorithms and Social Networking aspects to solve problems that neither humans nor computers can solve individually.
Where We Are Now
Where We Are Now
• Evaluating the project: Sustainability
  • User growth
  • Volume of contribution
  • Average user contribution trend
  • Automating the process further (snippet selection)
  • Expansion beyond Facebook
• Impacts
  • Product Design
  • Marketing and Communication
  • NLB Digitization Process
  • NLB Culture
Where We Are Now
• Evaluating the project: Economic Rationale
  • Performance (words per minute): TypeAttack (60) vs Standard (33)
  • Cost of TypeAttack (operations + development) vs part-timers
• Impacts
  • Finance
  • IT and Digital Services
Where We Are Now
• Evaluating the project: Scope of Activity
  • What about content with low word confidence (< 95%)?
  • Only The Straits Times issues older than 70 years?
  • Other languages? Mandarin, Bahasa Melayu, Tamil
• Impacts
  • NLB Digitization Process
  • Singapore Copyright Law: 70 years > 20 years
  • NLB / SPH Partnership (derogation agreement on copyright)
What You Can Do
What YOU can do
• Be part of the crowd: join and play! http://apps.facebook.com/typeattack/
• Spread the word!
• Feel free to contribute!
• After that, NLB will decide whether to use the system to digitize more content that has poor OCR accuracy.
Thank You