

The SCRIBO Module of the OLENA Platform: a Free Software Framework for Document Image Analysis

Guillaume Lazzara, Roland Levillain, Thierry Géraud, Yann Jacquelet, Julien Marquegnies, Arthur Crépin-Leblond

EPITA Research and Development Laboratory (LRDE), France

[email protected]

At a Glance

The Issue: A single Document Image Analysis (DIA) processing chain cannot handle all types of documents.

The Point: It is necessary to provide specific treatments for each kind of document.

Our Contribution: A framework to design DIA software, preserving flexibility and efficiency.

The Outcome: The implementation of our proposal, the SCRIBO module, illustrates the benefits of this approach.

Desired Properties of a Modern DIA Framework

Flexibility: Reusable building blocks to adapt processing chains to specific documents.

Efficiency: Handle large amounts of documents.

Multiple interfaces: Command line and graphical interfaces available.

Easy to integrate: High-level Application Programming Interface (API) and support for various platforms.
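
The flexibility property can be illustrated with a minimal C++ sketch of a processing chain assembled from reusable building blocks. This is not the SCRIBO or Milena API: `GrayImage`, `Step`, `Chain`, `invert`, `binarize`, and the two example chains are hypothetical names introduced only for illustration, and the fixed threshold of 128 is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical grayscale image (not a Milena type):
// row-major pixels, 0 = black, 255 = white.
struct GrayImage {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;
};

// A building block is any function mapping an image to an image.
using Step = std::function<GrayImage(const GrayImage&)>;

// A processing chain applies its steps in the order they were added.
class Chain {
public:
    Chain& then(Step step) {
        steps_.push_back(std::move(step));
        return *this;
    }

    GrayImage run(GrayImage image) const {
        // Precondition: the pixel buffer matches the declared dimensions.
        assert(image.pixels.size() ==
               static_cast<std::size_t>(image.width) *
                   static_cast<std::size_t>(image.height));
        for (const Step& step : steps_) {
            image = step(image);
        }
        return image;
    }

private:
    std::vector<Step> steps_;
};

// Reusable block: invert intensities (useful for light text on a dark background).
inline GrayImage invert(const GrayImage& in) {
    GrayImage out = in;
    for (std::uint8_t& p : out.pixels) {
        p = static_cast<std::uint8_t>(255 - p);
    }
    return out;
}

// Reusable block factory: global-threshold binarization.
inline Step binarize(std::uint8_t threshold) {
    return [threshold](const GrayImage& in) {
        GrayImage out = in;
        for (std::uint8_t& p : out.pixels) {
            p = static_cast<std::uint8_t>(p >= threshold ? 255 : 0);
        }
        return out;
    };
}

// Two chains built from the same blocks, each adapted to one kind of document.
inline Chain dark_text_chain() {
    Chain c;
    c.then(binarize(128));
    return c;
}

inline Chain light_text_chain() {
    Chain c;
    c.then(invert).then(binarize(128));
    return c;
}
```

Calling `dark_text_chain().run(image)` or `light_text_chain().run(image)` selects the treatment suited to the document at hand. Both chains share the same `binarize` block, which is the kind of reuse the flexibility property asks for.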

Motivations

Figure: positioning of DIA tools (OCRopus, Scribo, Gamera, Qgar, Leptonica). Labels: Research, Application, Non-initiated users, Domain experts.

Implement a framework with all our desired properties.

Provide easy-to-use applications for DIA.

Make research progress in DIA accessible to end-user applications.

Use our Image Processing (IP) library in concrete use cases.

The Olena platform

Figure: the Olena platform. Labels: DIA Applications (GUI, CLI, Web services); DIA framework (SCRIBO), ...; Image Processing C++ Library (Milena).

More information

Online demos: http://olena.lrde.epita.fr/Demos

Website: http://olena.lrde.epita.fr/

Contact: [email protected]

The SCRIBO Project [1]

Project conducted in the context of the “System@tic Paris-Région” Cluster (France).

9 partners: AFP, CEA-List, EPITA, INRIA-Alpage, Mandriva, Nuxeo, Proxem, Tagmatica, XWiki.

3 years of development.

Budget of 3.5 M€.

Applications and Use Cases

Figure: original document image (a page from Fortune magazine, November 26, 2007).

Figure: document reconstruction in PDF.

Figure: GUI for DIA and reconstruction.

The SCRIBO module: a DIA Framework

Provides

Basic routines
Basic DIA toolchains

Text in document
Document layout analysis
Text in picture

High-level data structures
Novel algorithms and techniques
Standard I/O
GUI and Command Line Interface (CLI)

Facts

3 years of development
40K lines of C++
Open Source GPL v2
Used in Nepomuk/KDE

Assets

End-to-end tools → From digital document to HTML and PDF reconstruction.
Based on a well-established IP library.

Milena: a Generic Image Processing Library [2]

Provides

Data structures
Safe data types
More than 70 algorithms
Memory management

Facts

10 years of development
Version 1.0 released in July 2009
120K lines of C++
Open Source GPL v2
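
To give a rough idea of what "generic" can mean for an image processing library, here is a small C++ sketch in which an algorithm is written once and reused for images of different value types. It does not reproduce Milena's actual API: `Image2D` and `max_value` are hypothetical names chosen for this illustration, and the interface (`rows()`, `cols()`, `at()`, `value_type`) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal 2D image container (NOT Milena's image type).
template <typename T>
class Image2D {
public:
    using value_type = T;

    Image2D(std::size_t rows, std::size_t cols, T init = T())
        : rows_(rows), cols_(cols), data_(rows * cols, init) {}

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Bounds-checked (in debug builds) access to the pixel at (r, c).
    T& at(std::size_t r, std::size_t c) {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }

    const T& at(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<T> data_;
};

// Generic algorithm: written once, usable with any image type that
// exposes rows(), cols(), at(r, c) and a nested value_type.
template <typename Image>
typename Image::value_type max_value(const Image& ima) {
    assert(ima.rows() > 0 && ima.cols() > 0);
    typename Image::value_type best = ima.at(0, 0);
    for (std::size_t r = 0; r < ima.rows(); ++r) {
        for (std::size_t c = 0; c < ima.cols(); ++c) {
            if (best < ima.at(r, c)) {
                best = ima.at(r, c);
            }
        }
    }
    return best;
}
```

The same `max_value` template works unchanged on an `Image2D<int>` and an `Image2D<double>`. The compiler instantiates a specialized version for each type, so the generality does not require run-time dispatch.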

References

[1] SCRIBO, Semi-automatic and Collaborative Retrieval of Information Based on Ontologies. http://www.scribo.ws.

[2] Roland Levillain, Thierry Géraud, and Laurent Najman. Why and how to design a generic and efficient image processing framework: The case of the Milena library. In Proc. of the IEEE Intl. Conference on Image Processing (ICIP), 2010.

ICDAR 2011, Beijing, September 18-21, 2011