OCR Workshop 2011 University of Illinois OCR Workshop Loretta Auvil UIUC October 18, 2011

Page 1: OCR Workshop Loretta Auvil UIUC October 18, 2011

OCR Workshop 2011

University of Illinois

OCR Workshop

Loretta Auvil, UIUC

October 18, 2011

Page 2: OCR Workshop Loretta Auvil UIUC October 18, 2011

Correlation-Ngram Viewer

• Pearson Correlation Algorithm

Page 3: OCR Workshop Loretta Auvil UIUC October 18, 2011

Correlation-Ngram Viewer

• New version of the Google Ngram Viewer (for 1-grams) that addresses:
  • case sensitivity
  • period spellings
  • past-tense syncope ('d)
  • f/s substitution and other OCR issues

• Searches previously stored Pearson correlation results for the top 10K ngrams

• Computes Pearson correlation for a given word against the top 1K ngrams
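
A minimal sketch of the correlation step, assuming each ngram is represented by a list of yearly frequencies of equal length. The function names `pearson` and `rank_by_correlation` and the dictionary layout are hypothetical; the slides do not show the viewer's actual implementation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series has no defined correlation; treat it as 0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(target_series, series_by_word, top_n=10):
    """Score every stored ngram against the target and return the best matches."""
    scored = [(word, pearson(target_series, series))
              for word, series in series_by_word.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Precomputing these scores for the top 10K ngrams and storing them would support the "search stored results" mode; computing them on request against a smaller set would match the "given word against top 1K" mode.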

Page 4: OCR Workshop Loretta Auvil UIUC October 18, 2011

OCR Correction

• HTRC example: one of the worst pages of text, ranked by corrections per word (rate = 0.1994)

Page 5: OCR Workshop Loretta Auvil UIUC October 18, 2011

Worst Page

Page 6: OCR Workshop Loretta Auvil UIUC October 18, 2011

Corrected Page

Page 7: OCR Workshop Loretta Auvil UIUC October 18, 2011

Some Stats

Columns, in order: Google Ngram, HTRC 250K Books, Laura’s

• Total number of ngrams: 359,511,583,097; 20,173,974,251
• Total number of ngrams (ignoring punctuation chars): 306,780,490,555
• Total number of ngrams (ignoring numbers only & repeating characters, other noise that I could easily identify): 293,760,570,946; 19,282,108,416; 593,055
• Total number of corrections that we have made: 1,660,948,155; 131,571,046; 4,294
• Percent of cleaning: 0.57%; 0.68%; 0.72%
• Unique ngrams before cleaning: 7,380,256; 24,545
• Unique ngrams after cleaning: 4,977,548; 22,354
• Number of unique misspelled words: 17,906
• Number of unique misspelled words with no suggested replacement: 11,143
• Number of generated rules: 154,227; 6,763
• Number of valid rules: 99,455; 99,455; 3,751
• Number of rules that are shorter than 5 chars and ignored: 7,076; 1,674
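
The cleaning percentages appear to be the number of corrections divided by the noise-filtered ngram total. That reading is an assumption, not something the slide states; the sketch below recomputes it from the three-value rows so it can be compared with the reported 0.57%, 0.68% and 0.72%.

```python
# (noise-filtered ngram total, corrections made), copied from the stats slide.
STATS = {
    "Google Ngram": (293_760_570_946, 1_660_948_155),
    "HTRC 250K Books": (19_282_108_416, 131_571_046),
    "Laura's": (593_055, 4_294),
}


def percent_cleaned(total_ngrams, corrections):
    """Corrections as a percentage of the noise-filtered ngram total."""
    return 100.0 * corrections / total_ngrams


for name, (total, corrections) in STATS.items():
    print(f"{name}: {percent_cleaned(total, corrections):.2f}%")
```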

Page 8: OCR Workshop Loretta Auvil UIUC October 18, 2011

Spellcheck Component

• Wrapped existing spellchecker from com.swabunga.spell
• Input
  • Dictionary to define the correct words
  • Transformations is a set of rules that should be tried on misspelled words before taking the spell checker's suggestions
  • Token counts is a set of counts that can be used to choose word when spell checker suggests multiple ones
• Output
  • Replacement Rules are the transformation rules for misspelled words
  • Replacements are suggestions for misspelled words
  • Corrected Text is the original text with corrections applied
  • Uncorrected Misspellings is the list of words for which a correction/replacement could not be found
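
The inputs and outputs listed above can be sketched roughly as follows. This is an illustration in Python, not the component itself (which wraps the Java com.swabunga.spell checker): the function names, the `(observed, correct)` rule representation, and the returned dictionary keys are hypothetical, and the fallback to the spell checker's own suggestions is left out.

```python
def rule_candidates(word, rules):
    """Map each candidate spelling to the (observed, correct) rule that produced it."""
    found = {}
    for observed, correct in rules:
        start = word.find(observed)
        while start != -1:
            candidate = word[:start] + correct + word[start + len(observed):]
            found.setdefault(candidate, (observed, correct))
            start = word.find(observed, start + 1)
    return found


def spellcheck(tokens, dictionary, rules, token_counts):
    """Apply transformation rules to tokens that are missing from the dictionary."""
    corrected, replacements, replacement_rules, uncorrected = [], {}, set(), []
    for token in tokens:
        if token in dictionary:
            corrected.append(token)
            continue
        # Keep only rule outputs that are real dictionary words.
        valid = {c: r for c, r in rule_candidates(token, rules).items()
                 if c in dictionary}
        if not valid:
            corrected.append(token)
            if token not in uncorrected:
                uncorrected.append(token)
            continue
        # Token counts break ties when several candidates are valid.
        best = max(sorted(valid), key=lambda c: token_counts.get(c, 0))
        corrected.append(best)
        replacements[token] = best
        replacement_rules.add(valid[best])
    return {
        "corrected_text": corrected,
        "replacements": replacements,
        "replacement_rules": replacement_rules,
        "uncorrected_misspellings": uncorrected,
    }
```

Note that a misread word that happens to be a valid word (such as "fame" for "same") passes through unchanged, since only tokens missing from the dictionary are examined.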

Page 9: OCR Workshop Loretta Auvil UIUC October 18, 2011

Adding Levenshtein

• Use the Levenshtein algorithm to filter the list of suggestions considered
• The Levenshtein distance is a metric for measuring the amount of difference between two sequences. The value of this property is expressed as a percentage that will depend on the length of the misspelled word

• Example:
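
An illustrative sketch of length-relative filtering. It assumes the percentage is applied as a maximum allowed edit distance equal to a fraction of the misspelled word's length; the parameter name `max_fraction` and its default of 0.25 are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def filter_suggestions(misspelled, suggestions, max_fraction=0.25):
    """Keep suggestions whose distance is within a fraction of the word length."""
    limit = max_fraction * len(misspelled)
    return [s for s in suggestions if levenshtein(misspelled, s) <= limit]
```

With these assumptions, `filter_suggestions("houfe", ["house", "horse", "hoof"])` keeps only "house": the limit is 1.25 edits, and the other two suggestions are two edits away.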

Page 10: OCR Workshop Loretta Auvil UIUC October 18, 2011

Transformation Rules

Complete List

• o=0; i=1; l=1; z=2; o=3; e=3; s=3; d=3; t=4; e=4; l=4; s=o; s=5; c=6; e=6; fi=6; o=6; l=7; z=7; y=7; j=8; g=8; s=8; a=9; c=9; g=9; o=9; ti=9; b={h,o}; c={e,o,q}; cl={ct,d}; ct={cl,d,dl,dt,ft}; d={cl,ct}; dl=ct; dt=ct; e=c; fl={ss,st}; ft=ct; h={li,b,ii,ll}; i=l; j=y; l=i; li=h; m={rn,lll}; n={ll,il,h}; oe=ce; r=ll; rn=m; s=f; sh={fli,ih,jb,jh,m,sb}; ss=fl; st=fl; tb=th; th=tb; v=y; u={ll,n,ti}; y={j,v};
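
A small sketch of how a rule list in this notation might be parsed and applied. It reads each rule as correct=observed (so `m={rn,lll}` means an intended "m" may have been scanned as "rn" or "lll"); that direction is an assumption inferred from rules such as `o=0` and `s=f`, and the function names are hypothetical.

```python
def parse_rules(text):
    """Turn 'm={rn,lll}; s=f;' into (observed, correct) substitution pairs."""
    pairs = []
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        correct, _, observed = item.partition("=")
        observed = observed.strip()
        if observed.startswith("{") and observed.endswith("}"):
            alternatives = observed[1:-1].split(",")
        else:
            alternatives = [observed]
        for alternative in alternatives:
            pairs.append((alternative.strip(), correct.strip()))
    return pairs


def expand(word, pairs):
    """All spellings reachable by applying one rule at one position."""
    candidates = set()
    for observed, correct in pairs:
        start = word.find(observed)
        while start != -1:
            candidates.add(word[:start] + correct + word[start + len(observed):])
            start = word.find(observed, start + 1)
    return candidates
```

For example, with the rules `"o=0; m={rn,lll}; s=f;"`, the OCR token "rnodern" expands to "modern" and "rnodem"; a dictionary lookup would then keep only "modern".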

Page 11: OCR Workshop Loretta Auvil UIUC October 18, 2011

Mashup Framework

[Diagram labels: Components; Virtualization Infrastructure; Meandre Infrastructure; Visualization; Component Repository; Component Discovery; Meandre Data-Intensive Flows; Apps; Services; Plugins; Web Apps; Analytics; Data; Developer Tools; Repositories; Data; Analysis; Components; Flows; User Interfaces; Computational Resources; Visualizations; Meandre Workbench]

Page 12: OCR Workshop Loretta Auvil UIUC October 18, 2011

Meandre for Mashups

• Major Capabilities
  • Dataflow execution
  • Semantic technology (using RDF for storing meta info)
  • Web-Oriented
  • Supports publishing services for data, analytics and visualization
  • Modular components
  • Encapsulation and execution mechanism
  • Promotes reuse, sharing, and collaboration
  • Cloud-friendly infrastructure
  • Implements MapReduce for parallelization
  • Open source

• Note: Trading off some performance for reuse, flexibility and modular components… with option to parallelize components to improve performance
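
To make "dataflow execution" with modular components concrete, here is a generic sketch in which each component runs when data arrives and pushes its result downstream. It does not use or imitate the Meandre API; the `Component` class, its methods, and the `demo` function are hypothetical.

```python
class Component:
    """Hypothetical component: wraps a function and forwards each result downstream."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.downstream = []

    def connect(self, other):
        """Wire this component's output to another component's input."""
        self.downstream.append(other)
        return other  # returning the target allows chained connect() calls

    def push(self, item):
        """Run on arriving data; a None result means there is nothing to forward."""
        result = self.func(item)
        if result is None:
            return
        for component in self.downstream:
            component.push(result)


def demo():
    """Build a three-component flow: split text -> count tokens -> collect."""
    collected = []
    split = Component("split", str.split)
    count = Component("count", len)
    sink = Component("sink", collected.append)
    split.connect(count).connect(sink)
    split.push("OCR errors need cleaning")
    return collected
```

Because each component only sees its own input and output, a component can be swapped or reused in another flow without changes to its neighbors, which is the reuse-for-performance trade-off the note describes.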

Page 13: OCR Workshop Loretta Auvil UIUC October 18, 2011

Meandre Workbench

[Screenshot labels: Locations; Components; Flows]

• Web-based UI (GWT)
• Components and flows are retrieved from server
• Additional locations of components and flows can be added to server
• Create flow using a graphical drag and drop interface
• Change property values
• Execute the flow

Page 14: OCR Workshop Loretta Auvil UIUC October 18, 2011

Spellcheck Flow

Page 15: OCR Workshop Loretta Auvil UIUC October 18, 2011

Knowledge Discovery Infrastructure Benefits

• Provides access to data management tools
  • Selecting/Loading data from databases, flat files or repositories
• Integrates data mining algorithms
  • Supports an extensible interface for creating one’s own algorithms
• Provides means for building and applying models
• Provides integrated visualization components
• Provides capability to build custom applications
• Provides access for local or distributed computation
• Provides the ability to share components and applications

Page 16: OCR Workshop Loretta Auvil UIUC October 18, 2011

From Silos to Mashups

• Definition: A mashup is a web page or application that uses and combines data, presentation or functionality from two or more sources to create new services
• Why do we want this?
  • Enable our services in many applications and on a variety of devices (laptop, high-res display wall, iPad, iPhone or others)
  • Sharing and reuse are good things
  • Reach communities with our tools and their data!
• What can we do to change this?
  • We can think and create data-driven solutions so that they can be mashed up with other tools.
  • We can build web services that can be deployed or accessed.
  • We can create APIs to be used.

Page 17: OCR Workshop Loretta Auvil UIUC October 18, 2011

Components

Analytics
• Unsupervised Learning
  • Clustering
  • Frequent Pattern Analysis
  • Topic Modeling (Mallet)
  • Concept Mapping
• Supervised Learning
  • Naïve Bayesian
  • Support Vector Machines (Weka)
  • Decision Trees (C4.5)
• Optimization Approaches
  • Genetic Algorithm
• Text Analysis (NLP, Entity Extraction)
  • OpenNLP
  • Stanford NER
• Spellcheck
• OpenMary (NLP, Text-to-Speech)

Visualization
• Geographic (Google Maps)
• Temporal (Simile)
• Network Graphs – Link Nodes and Arcs (Protovis)
• Line Charts (D3)
• Parallel Coordinates (Protovis)
• Stacked Area Chart (Flare)
• Tag Cloud Maker
• Decision Tree (Applet D2K)
• Naïve Bayes (Applet D2K)
• Rule Association (Applet)
• Dendrogram (GWT)

Page 18: OCR Workshop Loretta Auvil UIUC October 18, 2011

Topic Modeling

• Uses Mallet Topic Modeling to cluster nouns from over 4000 documents from the 19th century, with 10 segments per document

• Top 10 topics, showing at most 200 keywords per topic
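
The slide does not say how documents were cut into 10 segments. As one possible reading, the sketch below splits a document's token list into 10 nearly equal consecutive parts, each of which could then be passed to the topic model as a separate instance; the function name and the token-based splitting are assumptions.

```python
def split_into_segments(tokens, n_segments=10):
    """Split a token list into n consecutive segments of nearly equal size."""
    base, extra = divmod(len(tokens), n_segments)
    segments, start = [], 0
    for index in range(n_segments):
        # The first `extra` segments take one additional token each.
        size = base + (1 if index < extra else 0)
        segments.append(tokens[start:start + size])
        start += size
    return segments
```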

Page 19: OCR Workshop Loretta Auvil UIUC October 18, 2011

Concept Mapping

• Sentiment Analysis
  • six core emotions (Love, Joy, Surprise, Anger, Sadness, Fear)

Page 20: OCR Workshop Loretta Auvil UIUC October 18, 2011

Thanks

• Xavier Llora, lead developer, now at Google
• Boris Capitanu, developer of Workbench, and now lead developer
• Other team members

Page 21: OCR Workshop Loretta Auvil UIUC October 18, 2011

Links

• www.seasr.org
• www.seasr.org/meandre