
User Guide

Design Station

www.artsyltech.com

docAlpha 5.0


Design Station User Guide

Contents

1. docAlpha Design Station Overview
2. Working with the Design Station
2.1. Starting the Design Station
2.2. Working with the Design Station User Interface
2.2.1. Design Station User Interface Overview
2.2.1.1. Properties Window
2.2.1.2. Batch Window
2.2.1.2.1. Definition Pop-Up Menu
2.2.1.2.2. Document Pop-Up Menu
2.2.1.2.3. Page Pop-Up Menu
2.2.1.3. Definition Window
2.2.1.4. Current Page Image Window
2.2.1.5. Hypothesis Tree and Error Messages Window
2.2.1.6. docAlpha Design Station Menu
2.2.1.6.1. File Menu
2.2.1.6.2. Batch Menu
2.2.1.6.3. Tools Menu
2.2.1.6.4. Help Menu
2.2.1.7. Main Toolbar
2.2.1.8. Status Panel
3. Working with the Semi-Structured Document Definitions
3.1. Semi-Structured Definition Properties
3.2. Semi-Structured Elements Pop-Up Menu
3.3. Semi-Structured Elements Toolbar
3.3.1. Working with the Anchor Tool
3.4. Search Elements
3.4.1. Search Element Types
3.4.2. Configuring Search Elements
3.4.2.1. Properties Common for All Elements
3.4.2.2. Properties Unique to Specific Elements
3.4.2.2.1. Static Text Element
3.4.2.2.2. Character String Element
3.4.2.2.3. POSIX Basic Regular Expressions Syntax
3.4.2.2.4. POSIX Extended Regular Expressions
3.4.2.2.5. POSIX Character Classes
3.4.2.2.6. Text Element
3.4.2.2.7. Form ID Element
3.4.2.2.8. Separator Element
3.4.2.2.9. White Gap Element
3.4.2.2.10. Checkmark Element
3.4.2.2.11. Image Element
3.4.2.2.12. Barcode Element
3.4.2.2.13. Generic Objects Element
3.4.2.2.14. Alternative Element
3.4.2.2.15. Group Element
3.4.2.2.16. List Element
4. Working with the Fixed-Forms Document Definitions
5. Data Extraction Type Libraries and Library Mode
6. How to work and extract different types of barcodes
6.1. Scenarios


1. docAlpha Design Station Overview

The production environment of docAlpha includes the Auto-Registration, Scanning, Recognition, Verification and Export Stations. The Monitoring Station is not part of the production workflow cycle, but it can be used to monitor and influence the processing done on the production environment stations.

The Design Station is not part of the production environment. Instead, it is the workstation on which new fully automated classification and data capture definitions are created. The Design Station supports both fixed-form technology and semi-structured document extraction, or IDR (Intelligent Documents Recognition) technology.

Artsyl IDR technology is engine-independent, meaning that different OCR, ICR, OBR and OMR engines can be used with the Design Station. More than one engine can be used at a time with any capture definition. You can even use multiple recognition engines within one capture field, up to the maximum number of engines that are purchased and configured in the docAlpha license.

The Design Station provides convenient tools for creating, testing and fine-tuning classification and data capture definitions. It can run on a machine dedicated to the definitions designer, or it can be installed on the same machine as the Administration Station, which makes it convenient to open the definitions of a specific workflow directly from the workflow tree in the Admin Station.

The Design Station is a Windows application. It runs on all docAlpha-supported operating systems (MS Windows XP, Vista, Windows 7, Server 2003 and Server 2008; each operating system listed is supported on both 32-bit and 64-bit platforms). Refer to the Installation Guide for details of the required operating system versions and service packs.

All docAlpha stations use the Concurrent License model. If you log off the Design Station on one machine, you can use the same license to start it on another machine.

2. Working with the Design Station

The main purpose of the docAlpha Design Station is to provide an interactive environment for developing, testing and fine-tuning fully automatic classification and data capture definitions.

2.1. Starting the Design Station

To start the Design Station, select Start > All Programs > Artsyl docAlpha > Artsyl docAlpha Stations > docAlpha Design Station from the Start menu, or use the desktop shortcut icon. As with any other station, the user is required to enter a user name and password for authentication:


Alternatively, start the Design Station from the Single Sign-On Utility by clicking the “Design Station” button. The utility lets the operator sign in just once and then launch any station for which the operator has credentials, without re-entering the user name and password.

Note: The Single Sign-On Utility window (the docAlpha Startup Panel) lists only the stations to which the operator has the credentials to log in. If the station that the operator needs to start does not appear on the panel, the operator lacks the credentials to start it and needs to contact the docAlpha Administrator to obtain them.

The startup process includes a step-by-step audit that provides detailed information on any connectivity, licensing or other issues that may prevent the operator from logging in and starting the station. If the login is unsuccessful, a step-by-step tracking table is displayed, showing which steps were performed successfully and where the problem occurred:

In the screenshot above, the login process went through the four steps successfully and failed on step 5 – checking the user login credentials.

Below is the detailed explanation of each login step, listing the step name, explanation of what is done on that step, and what to do (troubleshooting steps) in case the login process fails at that step:

| Login Step Name | Details of the Step | Troubleshooting |
|---|---|---|
| GetNetwork Data | This step is responsible for getting the network identity and detecting the gateway used to connect to the server. | 1. The computer is disconnected from the network: check the network connections. 2. There are two or more gateways on the computer (for example, when a VPN is in use): choose the correct gateway option on the login screen. |
| ServerConnection | This step checks that the main docAlpha application server is running at the specified address and replies to client requests. | Check that the docAlpha Server Service is started and that the firewall settings allow connections via the specified ports (8008 and 5000). |
| ServerLicense Check | This step checks that the docAlpha server itself has a valid license. | Check the license options on the docAlpha server. If licensing was installed locally, check whether the license server service is in the online state. |
| SQL Server Check | This step checks that the docAlpha Application Server is properly connected to the SQL Server. | Check that the SQL Server services are running and that the docAlpha SQL connectivity options are correct. |
| UserCredentials Check | This step checks the user credentials. | Check the user name and password. Check that the user has sufficient rights to start the station. |
| StationVersions Check | This step checks that the station has the appropriate file version. | Install the appropriate stations update or turn on the Auto-Updater option. |
| Get Stations Identity | This step checks the station ID in the database, or creates a new ID on the first station start. | Errors at this step indicate a critical inconsistency in the internal database. Contact Artsyl support to troubleshoot if this happens. |
| AcquireLicense | This step checks for an available concurrent seat license. | Check the online stations of the Design type. The total number of simultaneously running stations cannot exceed the number of Design Stations in the license. If you need to increase the number of stations, contact your Artsyl authorized reseller. |
| Set Station Online | This step sets the station status to ‘online’. | Only critical connectivity errors can cause this step to fail. Contact Artsyl support to troubleshoot your installation network configuration. |
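When the login fails at the ServerConnection step, it can help to confirm from the client machine that the two ports named in the table are reachable before looking at the server itself. The following is a minimal sketch of such a check using only the Python standard library. It is not part of docAlpha; the function name `check_ports` and the host name in the usage comment are illustrative assumptions.

```python
import socket

# Ports named in the ServerConnection troubleshooting row.
DOCALPHA_PORTS = (8008, 5000)


def check_ports(host, ports=DOCALPHA_PORTS, timeout=2.0):
    """Return {port: True/False} for plain TCP reachability of each port.

    This only verifies that a TCP connection can be opened; it does not
    verify that the docAlpha Server Service is the process listening.
    """
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name resolution errors.
            results[port] = False
    return results


# Usage (hypothetical host name):
#   check_ports("docalpha-server.example.local")
```

A port reported as unreachable points to the firewall settings or to a stopped server service, which are the two causes listed for this step.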

After authentication succeeds, the Design Station main UI window is loaded. Since the docAlpha IDR Kernel is engine-independent, the user is asked at station startup to select the recognition engines to work with:

• Main Engine: This parameter determines which engine is used for full-page pre-recognition of all processed pages. The IDR analysis is based on the results of that full-page pre-recognition. Make sure that you select the Main Engine from among the engines that your Production Environment stations use in the docAlpha License.

Note: If the workflow reaches the Recognition Station with a Main Engine that is not licensed, or whose page allowance is used up for the current period, the job will be denied recognition and sent to the Recognition-Exceptions queue.

• Additional Engines: In this section, select the engines that you want to use, in addition to the Main Engine, in the document definitions that you will create with the Design Station. The Additional Engines are not used for document pre-recognition, but they can be used for the detailed zonal-level recognition that is performed once a field has been located from the Main Engine pre-recognition results. Make sure you select engines that you are properly licensed to use in the production environment.

Note: If the workflow reaches the Recognition Station with Additional Engines that are not licensed, or whose page allowance is used up for the current period, the job will not be denied recognition. It will still go through, but all fields marked to be recognized with a missing Additional Engine will instead be recognized with the Main Engine.
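The two notes above describe different outcomes for an unavailable Main Engine and an unavailable Additional Engine. The sketch below restates that rule in Python to make the difference explicit. It is an illustration only, not docAlpha code: the function `route_job`, the dictionary layout, the engine names and the "Recognition" queue label are assumptions; only the "Recognition-Exceptions" queue name comes from the text.

```python
def route_job(main_engine, field_engines, available):
    """Decide how a job is handled, following the two notes above.

    main_engine   -- name of the Main Engine selected for the workflow
    field_engines -- {field name: engine requested for zonal recognition}
    available     -- set of engines that are licensed and still have pages
                     left for the current period
    """
    if main_engine not in available:
        # Main Engine unavailable: the whole job is denied recognition.
        return {"queue": "Recognition-Exceptions", "fields": {}}

    # Additional Engine unavailable: the job still runs, and the affected
    # fields fall back to the Main Engine.
    resolved = {
        field: (engine if engine in available else main_engine)
        for field, engine in field_engines.items()
    }
    return {"queue": "Recognition", "fields": resolved}


# Usage with hypothetical engine names:
#   route_job("EngineA", {"Total": "EngineB"}, available={"EngineA"})
#   returns {"queue": "Recognition", "fields": {"Total": "EngineA"}}
```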

After the Main and Additional Engines for this Design session are selected, the dialogue looks like this:

2.2. Working with the Design Station User Interface

The Design Station user interface provides convenient tools to create, test and fine-tune automatic document classification and data extraction definitions.

Once such definitions have been created and tested in the Design Station to classify documents and extract data from them properly, they are saved as definition files. These files are then used at the Admin Station when creating a workflow to process this class of documents. When the workflow parameters are configured and the destination for data and documents is set, the workflow is published to make it active and ready for production.

A typical workflow performs automatic processing on the Recognition Station, where the automatic definitions are matched to the incoming documents to perform classification and data extraction. During recognition, the definitions are used as part of the batch, and many definitions may be used together within one workflow batch. During the matching process, the best candidate definition is selected for each incoming document. Definitions can be page-level or document-level. Page-level definitions can additionally be used for page-to-document analysis and conversion, an assembly process based on the expected documents and batch structure.

To support matching and classification accuracy, work in the Design Station UI is done in batch mode, which allows testing real-world scenarios in which a mixture of different document types is scanned in together. The Design Station UI shows which definition matches which test documents and why. A typical testing and fine-tuning session in the Design Station checks both the accuracy of data extraction with specific document definitions and the accuracy of selecting the correct document definition at the matching and classification step.


2.2.1. Design Station User Interface Overview

The screenshot below presents the main 10 sections of the user interface that the Design Station provides:

1. Main Menu

2. Main Toolbar

3. Semi-Structured Definitions Toolbar (displayed only when working with Semi-Structured Definitions)

4. Fixed-Form Definitions Toolbar (displayed only when working with Fixed-Form Definitions)

5. Batch Window

6. Definition Window

7. Properties Window

8. Current Page Image Window

9. Hypothesis Tree and Error Messages Window

10. Status Panel

These interface sections and their functions are detailed below.


2.2.1.1. Properties Window

The Properties Window is the most frequently used window when working with the Design Station.

When you select any object in any other window of the Design Station UI, the properties of that object are detailed in the Properties Window. Some of the properties are read-only, while other properties and settings can be modified using the Properties Window.

The Properties Window has its own toolbar. It contains the buttons described below, followed by the icon and the name of the currently selected object whose properties are displayed in the Properties Window. The Properties Toolbar buttons are:

• Sort by Category: Sorts the entries in the Properties Window so that they are grouped by their logical categories.

• Sort by Alphabet: Sorts the entries in the Properties Window alphabetically.

• Disabled/Enabled: Shows whether the object is enabled or disabled (displayed only for objects that can be disabled). When the object is enabled, its “Enabled” status is displayed. This is a toggle button: clicking it changes the status to “Disabled”, and clicking it again changes the status back to “Enabled”.

In the lower part of the Properties Window, the name of the currently selected property is displayed in bold, followed by a detailed text description of that property.

To change any property in the Properties Window, first select it with a single mouse click or with keyboard navigation. If the property can be entered as text, you can type in and modify the property value using the keyboard.

If a property has a strictly defined list of possible values, like the “Export Block Type” property on the screenshot above, a drop-down selection button is displayed that allows selecting the value of the property from a drop-down list. You can also select such drop-down values by typing the first character of the value; the control automatically scrolls the list to the value that starts with that character.

Complex properties that have their own user interface windows or dialogue boxes, such as the “Constraints” from the screenshot above, display a dialogue box launch button on the right.


2.2.1.2. Batch Window

The Batch Window displays the current contents of the batch on which the operator is working. The batch contains two main sections – Definitions and Documents:

The “Definitions” node contains all definitions that are part of this batch. The “Documents” node contains all the testing documents that have been added to the batch. Each “Document” node in turn can be opened to browse to the pages of that document.

To load a document or definition into the Design Station batch, use the File Menu or the buttons of the main toolbar, or simply drag and drop the document image file or the XML definition file into the Batch Window.

To start working on a specific definition and display its details in the Definition Window, select it in the Batch Window. In the same manner, to load a specific page into the Current Page Image Window and start testing how the definitions apply to that page, select the page in the Batch Window. If you click on the document-level node in the Batch Tree instead of selecting a specific page, the first page of that document is loaded into the Current Page Image Window.

2.2.1.2.1. Definition Pop-Up Menu

If you right-click on any definition, its pop-up menu is displayed. The definition pop-up menu allows enabling or disabling the definition and permanently deleting it.

A disabled definition is displayed with a grayed-out icon. Once disabled, it does not participate in the recognition and matching processes. This allows testing how the batch would work if that definition were not part of the batch at all.

Note: Use permanent deletion only when you are absolutely sure you will not need the definition again, since there is no Undo for that operation. For all temporary testing needs, use the “Disable” action instead.

2.2.1.2.2. Document Pop-Up Menu

If you right-click on any document, its pop-up menu is displayed. The document pop-up menu allows deleting the pre-recognition (raw OCR run) results by using the “Delete Extracted Data” command, and deleting the matched classification results and extracted data fields by using the “Delete Recognized Data” command.


2.2.1.2.3. Page Pop-Up Menu

If you right-click on any document page, its pop-up menu is displayed. The page pop-up menu allows deleting the pre-recognition (raw OCR run) results by using the “Delete Extracted Data” command, and deleting the matched classification results and extracted data fields by using the “Delete Recognized Data” command, in the same way as the corresponding commands at the document level.

In addition to those commands, the page-level pop-up menu also allows setting the expected correct definition that should match this page. The sub-menu of the “Set Correct Definition” menu shows the list of all definitions that are in the batch, plus two additional options: “Ignore Matching” (the default setting until the operator makes a specific match selection) and “None Should Match” (a special case when the page is an annex page that none of the batch definitions should match):

After each test-matching on any page, document or the whole batch, the matching statuses are updated for the re-matched pages and documents. A page can have one of three matching statuses, based on the matching expectations defined with “Set Correct Definition”:

• Ignore Matching: The matching expectations have not yet been set with the “Set Correct Definition” tool (the default “Ignore Matching” is still in effect).

• Correct Match: The document definition that actually matched the page is the same one that was expected to match it based on the “Set Correct Definition” settings.

• Incorrect Match: The definition that actually matched the page is different from the one expected based on the “Set Correct Definition” settings.
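The three statuses follow from comparing the expectation set with “Set Correct Definition” against the definition that actually matched. The sketch below expresses that comparison in Python as an illustration; it is not the Design Station implementation. The function name is hypothetical, and treating “None Should Match” as correct only when no definition matched is an assumption drawn from the option's description.

```python
IGNORE_MATCHING = "Ignore Matching"      # default expectation
NONE_SHOULD_MATCH = "None Should Match"  # annex-page case


def matching_status(expected, actual):
    """Return the page status after a test-matching run.

    expected -- value chosen with "Set Correct Definition": a definition
                name, IGNORE_MATCHING or NONE_SHOULD_MATCH
    actual   -- name of the definition that matched, or None if no
                definition matched the page
    """
    if expected == IGNORE_MATCHING:
        return "Ignore Matching"
    if expected == NONE_SHOULD_MATCH:
        # Assumption: the expectation is met only when nothing matched.
        return "Correct Match" if actual is None else "Incorrect Match"
    return "Correct Match" if actual == expected else "Incorrect Match"
```

For example, a page expected to match a definition named "Invoice" that is instead matched by "PurchaseOrder" is reported as an Incorrect Match, which is the case that the isolated test-matching commands of the Batch Menu help to investigate.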

2.2.1.3. Definition Window

The Definition Window displays either the currently selected definition or its data extraction type library based on the selection of the “Definition / Library Switch” toggle button.

Refer to Chapter 3, “Working with the Semi-Structured Document Definitions”, and Chapter 4, “Working with the Fixed-Forms Document Definitions”, for more details on how to work with the Definition Window toggled to the “Definition” view.

Refer to Chapter 5, “Data Extraction Type Libraries and Library Mode”, for more on how to work with the Definition Window toggled to the “Library” view.

2.2.1.4. Current Page Image Window

The Current Page Image Window displays the currently open page image. To select another page, click on the desired page in the Batch Window. To see the details of the page image parameters, click anywhere on the image in the Current Page Image Window and check the parameters displayed in the Properties Window.


2.2.1.5. Hypothesis Tree and Error Messages Window

As its name implies, the Hypothesis Tree and Error Messages Window displays one of two things. If the last test-matching operation was performed successfully, it displays the hypothesis tree from that operation. If critical errors in the settings prevented the hypothesis tree from being built at all, it displays the list of error messages with details of what was defined incorrectly and where. Since the errors may relate to different elements in the tree, and the element tree may contain thousands of elements, you can double-click an error message to scroll the elements tree to the faulty element and bring it into focus.

Refer to Chapter 3, “Working with the Semi-Structured Document Definitions”, for more details on working with the hypothesis tree, semi-structured elements and their properties, and on troubleshooting the errors displayed in the Error Messages Window instead of the Hypothesis Tree.

2.2.1.6. docAlpha Design Station Menu

This section details all operations available through the main menu of the Design Station.

2.2.1.6.1. File Menu

The File Menu offers access to the following actions:

• New Batch: This menu item creates a new docAlpha Designer project batch. It displays a dialogue that allows creating a new empty sub-directory to hold all the project batch files. You may also select an existing directory, but it must be empty. As a result, an empty docAlpha Designer project batch is created and loaded into the Batch Window of the Design Station UI.

• Open Batch: This menu item opens an existing batch. You can open batches of the current version as well as batches of previous docAlpha versions with this command. For batches of a previous version, however, all matching results will be invalid and re-matching will be required. The opened batch is loaded into the Batch Window of the Design Station UI.

• Close Batch: This menu item closes and unloads the current batch.

• New Definition: This menu item creates a new definition and adds it to the current batch:


The “New Definition” menu command offers to create either a Flexible (Semi-Structured) or a Fixed-Form Definition and to give the definition a name.

After selecting the definition type, you can use the “Based on” drop-down list below it to select an existing definition from the current batch on which the new definition should be based. This creates a quick copy of the existing definition, which you can then modify wherever the new version needs to differ. If “Based on” is left at its default value, “Empty”, an empty definition of the selected type is created.

• Open Definition: This menu item allows opening an existing definition and adding it to the current batch. In the dialogue box that shows up, browse to the definition file to open it. A copy of the file to which you browse is created and added to the current batch. Any work you do on it in the current batch does not impact the old original that you browsed to.

• Open Document: This menu item allows opening an image document and adding it to the current batch.

• Save Definition: This menu item saves the definition that is currently selected in the Definition Window.

• Save All Definitions: This menu item saves all definitions from the current batch.

• Export Definition: This menu item stores the definition in “full details” mode. The definition files stored inside the Designer Batch do not contain values for parameters that have not been modified by the Design Station operator, so default values are not stored. The “Export Definition” command stores the entire set of definition parameters and values, including those that still hold their default values. This eases the transition between versions and should be used if a migration to a new major version of the product is planned.

• Initiate Definition Based on Profile: This menu item creates a mock-up of an automatic capture definition based on a Verification Profile file. The dialogue box offers the list of workflows and their profiles, from which the operator can select the needed Verification Profile. The command then creates a set of mock-up elements representing each field in the Verification Profile. This ensures that the names of the capture elements are identical, so that bringing the automatic definition into the workflow does not conflict with any element names already used by the basic capture verification profiles there.

• Load Definition from Server: This menu item loads a definition from the docAlpha Server. In the dialogue box that is displayed, you can select a workflow from the list of workflows. Each workflow may contain a list of definitions. You can select either a single definition or a whole workflow. If a whole workflow is selected, ALL definitions of that workflow are added to the current Design Station batch.

• Recent Batches: This sub-menu lists the most recent batches that were opened by the Design Station and allows opening any of them by selecting from the sub-menu list.

• Exit: This menu item allows closing the docAlpha Design Station. If there are any definitions that have new changes in them that were not saved yet, a dialogue box will be shown listing the names of such definitions and offering to save changes to them or to cancel the station closing operation.
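The difference between a saved definition and an exported one, as described under Export Definition above, comes down to whether default parameter values are written out. The sketch below illustrates that idea with plain dictionaries. It is a conceptual illustration only: the parameter names and default values are invented for the example and do not reflect the actual definition file format.

```python
# Hypothetical defaults; real definition parameters are not listed here.
DEFAULTS = {"Enabled": True, "Optional": False, "MinQuality": 50}


def stored_form(full, defaults=DEFAULTS):
    """Batch-style storage: keep only values that differ from the defaults."""
    return {
        name: value
        for name, value in full.items()
        if name not in defaults or defaults[name] != value
    }


def export_definition(stored, defaults=DEFAULTS):
    """Export-style storage: write every parameter, defaults included."""
    full = dict(defaults)
    full.update(stored)
    return full
```

The sketch also shows why the exported form suits a version migration: if a later version changed a default, a file containing only the modified values would silently pick up the new default, while a file with every value written out would not.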


2.2.1.6.2. Batch Menu

The Batch Menu offers access to the following actions:

• Full-Page OCR: This menu item performs pre-recognition of the currently selected page or document with the Main Engine.

• Full-Page OCR All: This menu item performs pre-recognition of all pages of the current batch with the Main Engine.

• Document Level Definitions: This is a toggle menu command that turns the Document-Level mode for the definitions on or off. In the Document-Level mode, the definitions are matched to the entire multi-page document, ignoring page separations. This mode is useful for documents whose information is not properly separated by page and instead “flows” through the pages freely and without any rules.

• Test Definition: This command test-matches the definition currently selected in the Definition Window to the selected document or page. It ignores all other definitions entirely and matches the selected definition in full isolation. This helps clarify why the expected definition does not match the selected page when some other definition matched instead.

• Match Definition Named: This command test-matches a chosen definition to the selected document or page. It is very similar to Test Definition, except that it allows selecting which definition to match from the list of definitions. It also ignores all other definitions entirely and matches the chosen definition in full isolation. This helps clarify why the expected definition does not match the selected page when some other definition matched instead.

• Match Definition: This command test-matches all definitions to the selected document or page. The best match is selected from among them, and all fields are extracted based on the matched definition.

• Match Definition All: This command test-matches all definitions to all documents in the current batch. The best matching definition is selected for each document, and all its fields are extracted based on the matched definition.
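The matching commands above differ in which definitions take part: Test Definition and Match Definition Named evaluate one definition in isolation, while Match Definition and Match Definition All let every enabled definition compete and keep the best one. The sketch below illustrates that difference. The guide does not describe how the IDR kernel scores a match, so the keyword-based `keyword_score` function, the dictionary layout and all names here are stand-in assumptions.

```python
def keyword_score(definition, page_text):
    """Toy score: fraction of the definition's keywords found on the page."""
    keywords = definition["keywords"]
    if not keywords:
        return 0.0
    text = page_text.lower()
    return sum(1 for k in keywords if k.lower() in text) / len(keywords)


def match_in_isolation(definition, page_text, score=keyword_score):
    """Test Definition / Match Definition Named: score one definition only."""
    return score(definition, page_text)


def match_best(definitions, page_text, score=keyword_score):
    """Match Definition: best-scoring enabled definition, or None."""
    best_name, best_score = None, 0.0
    for name, definition in definitions.items():
        if not definition.get("enabled", True):
            continue  # disabled definitions do not take part in matching
        s = score(definition, page_text)
        if s > best_score:
            best_name, best_score = name, s
    return best_name


def match_all(definitions, pages, score=keyword_score):
    """Match Definition All: apply match_best to every page in the batch."""
    return {page_id: match_best(definitions, text, score)
            for page_id, text in pages.items()}
```

Scoring an expected definition in isolation and comparing the result with the score of the definition that won is one way to see why a competing definition was preferred.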


2.2.1.6.3. Tools Menu

The Tools Menu offers access to the following actions:

• Active Engine: This menu item allows selecting the main recognition engine and the additional recognition engines. This operation can be performed only when no batch is open. If a batch is already open, close the batch first, and then change the engine configuration:

The Main Engine is used during the full-page pre-recognition phase. It is also used by default for zonal-level recognition. The Additional Engines can be used for zonal-level recognition with or without the Main Engine.

The “Show this dialogue at startup” checkbox controls whether the engine configuration screen is shown at Design Station startup. If, for example, you always work with the same one or two engines, configure them once and turn off the display of this window at startup.

• Options: This menu item allows setting up the Design Station options:

• Time-Out: This parameter defines the maximum time allowed for processing (recognition and matching) of any one page in the batch. It limits the time wasted on very poor-quality pages where the recognition engine and the IDR kernel cannot find meaningful information even after a very long time.

Note: The operation time-out is only an approximate cut-off. The time is measured specifically for the recognition and IDR matching phases and does not account for file and other operations, so the real time spent may be longer than the defined cut-off.

2.2.1.6.4. Help Menu

The Help Menu offers access to the following actions:

• Contents (F1): provides access to this help file.

• Logs > Open Log Folder: opens the directory that contains detailed log files for the Monitoring Station. Note that the detailed log files are compressed, encrypted files; they only become necessary if requested by the vendor support team.

• Logs > Open User Log: provides access to the User Log Viewer utility, which shows the most important events and issues in a user-friendly text format arranged by the date and time of the event.

• Logs > Set Log Depth: allows setting the log depth (verbosity) level. Note that the administrator can override the local settings using the Server Configuration Utility and enforce a different level of log depth on all stations.

• Configurations > Open Configuration: allows viewing and editing the Design Station configuration XML file.

• About: provides detailed information about the station and all components and interfaces used. The About window contains a “Copy to Clipboard” button that copies all the details to the clipboard so that they can be pasted into a report when needed:

2.2.1.7. Main Toolbar

The Main Toolbar of the Design Station is displayed at the top of the main window:

It offers access to the following actions:

New Definition: Creates a new definition and adds it to the current batch

Open Definition: Opens an existing definition(s) and adds it to the current batch

Open Document: Opens image document(s) and loads it to the current batch

Delete: Deletes the currently selected object in the Batch Window, which can be a definition, a document or a page of a document

Save Definition: Saves the currently selected definition(s)

Save All: Saves all definitions of the current batch

Export Definition: Exports the currently selected batch in the “Detailed Mode”. Refer to the “File Menu” section for a detailed explanation of the export operation

Pre-Recognize: Performs a pre-recognition OCR run on the selected page(s)

Match: Performs matching of all definitions to select the best match for the selected page(s) or document(s)

Definition / Library Switch: Toggles the Definition Window between displaying the current definition and its Data Extraction Type Library

Show Blocks: shows or hides the data extracting fields found on the page

Show Words: shows or hides the words detected by pre-recognition OCR run

Show Text Lines: shows or hides the text lines detected by pre-recognition OCR run

Show Separators: shows or hides the horizontal and vertical lines (separators) detected by pre-recognition OCR run

Show Barcodes: shows or hides the barcodes detected by pre-recognition OCR run

Show Images: shows or hides the image zones detected by pre-recognition OCR run

Zoom In: Zooms in the image in the Current Page Image Window

Zoom Out: Zooms out the image in the Current Page Image Window

2.2.1.8. Status Panel

The Main Status Panel of the Design Station is displayed at the very bottom of the main window. It displays the current progress messages and warnings and shows the mouse cursor coordinates when the mouse hovers over the Current Page Image Window.

3. Working with the Semi-Structured Document Definitions

Semi-Structured Documents Capture is a relatively new but already widely accepted approach in the industry to processing documents and forms. It works well for cases where you know what data fields you need to capture from a document class but cannot tell the system ahead of time exactly where each field will be printed. For example, different vendors print information on their invoices in quite different locations, but you still need to capture the same set of fields (invoice number, invoice date, total amount, etc.) from all of those different forms. This is the realm of Semi-Structured documents, and this capture scenario is a good fit for processing such documents in fully automatic mode.

This scenario involves creating a semi-structured document definition for each class of semi-structured documents. You can choose to create one semi-structured form per vendor or one semi-structured form that covers hundreds of vendors. The more flexible you make your semi-structured document definition, the more document sub-types it will cover in a single design. However, the more flexible it is, the more tries it takes to test it on all those classes and make it work flawlessly for all the variations. This chapter covers all details of semi-structured definitions, their properties and working with them.

3.1. Semi-Structured Definition Properties

Each semi-structured definition has the following properties that can be viewed and/or modified using the Properties Window:

• Name: The name of the definition. Needs to be a proper identifier.

• Caption: The caption of the definition as the user sees it. May contain any characters you want.

• File Name: The name of the XML file that stores the current definition.

• Type: The definition type (either Semi-Structured, also called Flexible, or a Fixed-Form definition).

• Languages: The list of languages to be used during the pre-recognition full-page OCR run. Note that some languages may not be supported by every recognition engine, and that some language combinations (such as Cyrillic group languages at the same time as East Asian group languages) may not be supported by the engine.

• Protected: A flag that determines whether the definition file is stored as open-format text XML or as an encrypted file, with the logic of the definition hidden from the user who uses it as part of a docAlpha workflow. Provides the ability to build protected vertical solutions for the channel.

3.2. Semi-Structured Elements Pop-Up Menu

If you right-click on any semi-structured element inside the Definition Window, its pop-up menu will show up:

The commands provided by the definition element pop-up menu are:

• Delete: Deletes the currently selected semi-structured search element.

• Add: Opens a sub-menu that lists all kinds of semi-structured elements that can be added to the Definition tree. If the element on which you clicked is a Group or Alternative element, the new element is added to the end of the list of its member elements. In all other cases the new element is added immediately below the current one.

• Replace: Allows replacing one element type with another. All properties that are common to the two element types are preserved. The properties that are not common to the two are lost. The three compound element types that do not allow replacement are Group, Alternative and List.

3.3. Semi-Structured Elements Toolbar

The Semi-Structured Elements Toolbar offers the following buttons:

Show Search Zone: Shows or hides the background highlighting that fills the search zone for the currently selected search element

Anchor Tool: Allows anchoring the current element on one of the previously found elements. The element’s search zone becomes related to the other element’s position

Edit Search Zone: A toggle button that turns on and off the drag-by-mouse editing mode for the search zone and its borders

Back to Parent Level: When inside a compound element (a group, a list or an alternative), returns to the parent level of the element tree or hypothesis tree

Note: The Semi-Structured Element Toolbar is dis-played only if the current definition is Semi-Structured.

3.3.1. Working with the Anchor Tool

The Anchor Tool allows quickly “anchoring” an element to one of the previous elements. The search zone of the current element becomes dependent on the location of one of the previous elements. When the previous element moves, the search zone for the current element moves in sync with it. This provides the flexibility that typical “semi-structured” financial and other real-world documents require.

To set up the anchoring for the current element, follow these four steps:

1. Click on the “Anchor” tool.

2. Select the element for which you want to set up the anchoring.

3. Click on the element to which you want to be anchored.

4. Draw a rectangular search zone on the image in the Current Page Image Window. To do that, press and hold the left mouse button as you move from the top left to the bottom right corner of the desired search zone.

When you release the mouse button, the search zone is transformed into a set of four relations (Above, Below, Left Of and Right Of) that are saved in the Constraints of the current search element.

Note: If you hold down the “Shift” key when you release the mouse button, the zone is added as an additional search zone to the existing search zones of the current element. If you do not hold the “Shift” key, the new zone replaces all previous zones. Holding down “Shift” therefore allows creating multiple search zones for the search element with the Anchor Tool.
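The conversion of a drawn rectangle into four relations can be illustrated with a minimal Python sketch. This is not docAlpha code: the `Rect` type, the function names and the choice to store each relation as an offset from the anchor's same-side border are all assumptions made for illustration. Page coordinates start at the top left corner, so positive offsets point down or to the right.

```python
from dataclasses import dataclass

@dataclass
class Rect:
    left: float
    top: float
    right: float
    bottom: float

def rect_to_relations(zone: Rect, anchor: Rect) -> dict:
    """Express a drawn zone as four offsets from the anchor's borders (assumed scheme)."""
    return {
        "Below": zone.top - anchor.top,        # upper border of the zone
        "Above": zone.bottom - anchor.bottom,  # lower border of the zone
        "RightOf": zone.left - anchor.left,    # left border of the zone
        "LeftOf": zone.right - anchor.right,   # right border of the zone
    }

def relations_to_rect(rel: dict, anchor: Rect) -> Rect:
    """Rebuild the search zone for wherever the anchor was actually found."""
    return Rect(
        left=anchor.left + rel["RightOf"],
        top=anchor.top + rel["Below"],
        right=anchor.right + rel["LeftOf"],
        bottom=anchor.bottom + rel["Above"],
    )

def apply_new_zone(existing: list, new_zone: dict, shift_held: bool) -> list:
    """With Shift held the zone is appended; otherwise it replaces all previous zones."""
    return existing + [new_zone] if shift_held else [new_zone]

anchor = Rect(100, 200, 180, 220)   # where the anchor element was found at design time
drawn = Rect(190, 195, 400, 225)    # rectangle drawn with the mouse
relations = rect_to_relations(drawn, anchor)

moved_anchor = Rect(100, 300, 180, 320)  # same anchor found 100 units lower on another page
print(relations_to_rect(relations, moved_anchor))
```

When the anchor is found lower on another page, the rebuilt zone shifts by the same amount, which is the "moves in sync" behaviour described above.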

3.4. Search Elements

Search elements are the main building blocks of a semi-structured definition. They can be used to find the needed information directly, or to serve as the “stepping stones” leading towards finding the necessary element.

The result of matching a semi-structured definition to any image is a tree of possible variants of where the elements could have been found on that image, called a Hypothesis Tree. Each element has its search zone defined in relation either to some of the previously found elements, or to the page location, or both. The element’s search zone is defined internally with four relations (Below, Above, Left Of, Right Of) or a group of such relations. Aside from the search zone, the relations can dictate priorities and conditions for finding the elements (for example, demanding to locate the nearest element, or an element located within some variable distance).

The whole semi-structured definition itself is a search element, too (an element of type “Group”). Since there are often many ways to find an element based on its properties and search constraints, a tree of those possibilities is built and presented to the operator in visual form as the Hypothesis Tree in the Hypothesis Tree Window. Each level of the tree is marked with all variants of finding a specific element: the first element on the first level, the second on the second level, etc. Each branch of the tree gives a possible end-point variant of finding all elements on the given image, and is called a hypothesis.

Each hypothesis has a quality assigned to it, from zero to 100 percent, written as a number from 0.0 to 1.0. The quality of each hypothesis of finding each search element is determined based on the suitability of the found hypothesis to the properties of the search element, relations such as distance, as well as the confidence of the characters that form the text of the hypothesis. The whole branch (from finding the first element to the last one) is evaluated, step by step, with the penalties of each step accumulated. The final quality of the whole branch determines the winning branch that will be selected to report the positions of all elements found on that branch. A very similar operation is used to determine the winning definition among all definitions that are matched to the selected image document.
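The branch evaluation described above can be sketched in a few lines of Python. The guide does not state how the per-step penalties are accumulated, so this sketch assumes that the per-step qualities are simply multiplied; the function names and the sample numbers are hypothetical.

```python
from math import prod

def branch_quality(step_qualities):
    """Combine per-element qualities (each 0.0..1.0) along one branch.

    Assumption: qualities are multiplied, so every imperfect step lowers the total.
    """
    return prod(step_qualities)

def winning_branch(branches):
    """Return the name of the branch with the highest accumulated quality."""
    return max(branches, key=lambda name: branch_quality(branches[name]))

# Three hypothetical branches of a hypothesis tree with three elements each.
branches = {
    "A": [0.99, 0.95, 0.90],
    "B": [0.97, 0.97, 0.97],
    "C": [1.00, 0.50, 1.00],  # one poor step drags the whole branch down
}

for name, steps in branches.items():
    print(name, round(branch_quality(steps), 4))
print("winner:", winning_branch(branches))
```

Under this assumption a branch with consistently good steps beats a branch that contains one very weak step, even if its other steps are perfect.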

3.4.1. Search Element Types

docAlpha provides a number of semi-structured document definition elements, used as building blocks to create semi-structured document definitions. They help locate various types of data and formatting elements in the documents. They include:

• Static Text: Finds a string based on search string variants, with tolerance for misrecognized characters and many additional tuning parameters.

• Form ID: A special type of Static Text element that is required to match this definition. If it is not found on an image, the definition does not try matching to that image.

• Character String: Finds a string of text based on its contents (regular expression), size, placement, internal spaces and other parameters.

• Text: Finds a text fragment; used mostly for finding multi-line text fragments.

• Barcode: Locates 1-D and 2-D barcodes.

• Separator: Finds horizontal and vertical lines, including broken and patterned lines.

• Generic Object: Finds image fragments. A tool commonly used to find signatures, photos and other zones not falling into standard categories.

• White Gap: Locates a white gap among printed elements. You can control how “white” it should be (i.e. how sensitive to internal garbage) as well as how big, and which types of elements are considered and which are ignored when looking for the gap. Allows capturing fields that float freely in a wide area for which zone-type restrictions are not effective.

• Checkmark: An element to capture checkmark fields. Does not do any special searches in the zone; instead, it returns the whole search zone so that the OMR engine reads it as a checkmark field.

• Image: An element to capture image blocks. Does not search for anything inside the search zone; instead, it returns the whole search zone as a binary image.

• Alternative: A logic branching element. Allows structuring the capture logic into two or more logical lines. Allows creating child groups and elements within it to implement alternative strategies of finding fields.

• Group: Logical grouping of elements. As soon as a group member element reaches zero quality, the rest of the group is not calculated, which improves the speed greatly. Also allows logical grouping of alternatives, in conjunction with the Alternative element.

• List: Finds lists, tables and transactions such as EOB printouts. Can capture rectangular tables, non-rectangular / transaction tables and unformatted lists. You can specify what defines a member element in the list, and that element will be repeatedly searched for based on your conditions.

3.4.2. Configuring Search Elements

The elements can be configured using their properties and the constraints that limit their search zone, mutual location, distance, etc. Some of those properties are common to all elements, and some are unique to one element type. Below is a list of all common properties with a brief explanation of each, followed by a review of the properties unique to each of the element types.

3.4.2.1. Properties Common for All Elements

The following properties are common to all element types:

• Name: Internal name, important when building constraints and relating them to an element.

• Export Block Type: Determines if the block is a capture zone or just an internal stepping stone to locate other fields. “None” means it’s internal, and any other value reports it as a found capture zone. The type is determined by the selection (“Text”, “Checkmark”, “Barcode”, etc.).

• Constraints: Constraints can limit where the element is searched for and which search hypothesis wins out of several possible choices.

• Zone constraints: Above, Below, LeftOf and RightOf allow selecting the anchor element relative to which the search area is defined.

It is also possible to set up a non-zero offset. The { 0, 0 } coordinates are in the top left corner of the page, so positive offsets mean “lower” or “to the right”, and negative offsets mean “higher” or “to the left”.

If you want to refer to an element from the same group, use its Name.

If you are referring to an element from another group, use the full name or the “.GroupName.elementName” notation, starting with “.”

To refer to the page itself, use “.” followed by the definition root element name, for example, “LeftOf: .Invoice.left”.

The page has four 1-coordinate parameters (left, right, top, bottom) useful for Zone constraints and four 2-coordinate parameters (left-top, right-top, bottom-left and bottom-right) useful for Nearest constraints that require a 2-coordinate location reference point.

Each element can be referred to as the whole zone, using the phantom properties ph-left, ph-right, ph-top and ph-bottom, or as the exact border of the element, using the properties left, right, center, top and bottom.

Phantom properties are affected by the phantom override setting (they either remain when their element is not found or are dropped in case of override).

Special recommendations for defining zones to search for the next row in a table or a new transaction in a transaction-based document: each row is found relative to the previous row, not to an absolute coordinate or an element. To refer to the previous line, use references with the notation YourGroup.prev.YourRefElement. YourGroup is the name of the root group of the List element, and YourRefElement is one of the elements of that group. YourGroup can also include the sub-group names if necessary.

See the List Element description for more information about working with lists, tables and transactions.

• Nearest constraints: Select the element based on the condition of being nearest to another element, or to a point on the page. Only elements that fit in the search zone defined by Zone constraints are considered.

• Distance constraints: Allow limiting the search to be within the distance defined in the constraints relative to another element or to a point on the page.

• Exclude constraints: Allow excluding from the search zone the area that is already found to belong to another element. This allows a cascading approach to defining the search zone: draft the rough zone first and then deduct all “unwanted” sections from the search area.

• Character Confidence Quality: Defines how much character recognition confidence impacts the quality of the hypothesis. By default every element you create is highly sensitive to the confidence of recognition coming from the OCR engine. The default function used is

0.999 + 0.001 * x

where x is the average confidence calculated over every character within the field returned by the OCR engine for the zone candidate hypothesis.

To make your field totally ignore confidence, simply replace the function with the constant “1”.

If you replace the default function with your own, make sure that it maps the zero to 100 percent range of possible character confidences into the 0.0 to 1.0 quality range.

• Null Quality: The quality of the graph element node in case the element is not found. Not finding an element is always a choice, and this value is used to evaluate this choice compared to hypotheses that DO find the element and make a decision on which one to take as the winner.

Note: Setting null quality to zero makes the element a required element. A better way to create required-for-matching elements, though, is to use the Form ID elements that are geared for this task.

• Quality Threshold: The cut-off point for Null Hypotheses. If there is at least one candidate in which the field can be found, and the quality of that candidate is better than the Quality Threshold, then the Null Hypothesis (not finding the element) won’t be selected, and the real hypothesis wins.

• Number of Hypotheses: Maximum number of hy-potheses allowed to be generated by the IDR engine for searching for this element.

• Phantom: Controls phantom zone influence and phantom override. Setting it to “false” allows totally discarding the influence of a missing element on its child elements. The default value is “true”, meaning that even a totally missing element still impacts where its child elements can be located, based on the search zone of the missing element.

Setting Phantom to “false” drops any phantom-border restrictions (the restrictions that use ph-top, ph-bottom, ph-left and ph-right border references) entirely if the anchor element is not found. Otherwise, if Phantom is set to “true”, or if using exact-border restrictions (top, bottom, left, right instead of ph-top, ph-bottom, ph-left, ph-right), the search zone is still restricted for the child elements of the missing element, based on the general search area (“phantom zone”) of the missing element.
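The interplay of Character Confidence Quality, Null Quality and Quality Threshold can be sketched in Python as follows. This is only an illustration under stated assumptions: the average confidence x is taken to be in the 0.0 to 1.0 range, and the `choose` function and its tie-breaking behaviour are hypothetical, not the actual IDR engine logic.

```python
def default_confidence_quality(x):
    """Default function quoted in the guide; x is assumed to be in 0.0..1.0."""
    return 0.999 + 0.001 * x

def ignore_confidence(x):
    """Constant function: character confidence has no effect on quality."""
    return 1.0

def choose(candidates, null_quality, quality_threshold):
    """Pick a candidate label, or None for the Null Hypothesis (element not found).

    candidates: list of (label, quality) pairs, quality in 0.0..1.0.
    Assumed rule: a candidate above the threshold always beats the Null Hypothesis;
    otherwise the best candidate must still beat the null quality.
    """
    best = max(candidates, key=lambda c: c[1], default=None)
    if best is None:
        return None
    if best[1] > quality_threshold:
        return best[0]
    return best[0] if best[1] > null_quality else None

print(choose([("total", 0.80), ("subtotal", 0.60)], null_quality=0.5, quality_threshold=0.7))
print(choose([("total", 0.60)], null_quality=0.65, quality_threshold=0.7))
print(choose([("total", 0.10)], null_quality=0.0, quality_threshold=0.7))
```

In the last call the null quality is zero, so any candidate with non-zero quality is preferred over "not found", which mirrors the required-element behaviour described in the note on Null Quality.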

3.4.2.2. Properties Unique to Specific Elements

The following properties are unique to specific element types and are described in the context of the elements with which they are used.

3.4.2.2.1. Static Text Element

The Static Text element has the following unique properties:

• Variants: The most important property. That’s how the text is searched for. If necessary to specify more than one search variant, separate the variants with the vertical pipe “|” character. At least one variant is necessary for the element to function properly.

• Maximum misrecognition tolerance control:

§ Count Limit: Maximum number of incorrect characters allowed for fuzzy search to still match your variant.

§ Ratio Limit: Maximum ratio (percentage) of incorrect characters allowed for fuzzy search to still match your variant, compared to the total number of characters in the search phrase.

• Penalties impacting text search: The following penalties can be used to favor the needed hypothesis for the static text elements:

§ Case Penalty: The penalty for incorrect letter case. The specified penalty is taken away from the quality of the hypothesis each time a character in the hypothesis differs in case from the corresponding character in the expected value of the Static Text.

Setting this penalty to “0” (default) makes the search case-insensitive.

§ Errors Penalty: The penalty for incorrect characters. A function whose argument is the total quantity of incorrect characters (characters deviating from the best search variant possible). The default function is

0.0001 * x

meaning that each incorrect character takes away 0.01% of hypothesis quality. You can replace it with your own function that is more or less aggressive to misrecognized characters to either narrow down or expand the number of possible hypotheses.

§ Partial Words Penalty: The penalty for words cut partially by the zone borders. The penalty is applicable as soon as at least one word necessary for the search variant is chopped at the beginning or end of the word.
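To show how Variants, Count Limit, Ratio Limit and the Errors Penalty fit together, here is a minimal Python sketch. It is not the docAlpha matcher: it assumes that "incorrect characters" can be counted as Levenshtein edit distance, that the penalty is subtracted from a starting quality of 1.0, and that the Case Penalty is left at 0 (case-insensitive search). The function names are hypothetical.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def match_static_text(candidate, variants, count_limit, ratio_limit,
                      errors_penalty=lambda x: 0.0001 * x):
    """Return (variant, quality) for the best variant within limits, or None."""
    best = None
    for variant in variants.split("|"):  # variants are separated by the pipe character
        errors = edit_distance(candidate.lower(), variant.lower())
        if errors > count_limit or errors > ratio_limit * len(variant):
            continue  # too many misrecognized characters for this variant
        quality = 1.0 - errors_penalty(errors)
        if best is None or quality > best[1]:
            best = (variant, quality)
    return best

# OCR read a lowercase "l" instead of the capital "I".
print(match_static_text("lnvoice No", "Invoice No|Invoice Number", count_limit=2, ratio_limit=0.2))
print(match_static_text("Total", "Invoice No|Invoice Number", count_limit=2, ratio_limit=0.2))
```

A steeper penalty function, for example `lambda x: 0.05 * x`, would make each misrecognized character cost far more quality and so favour exact matches over fuzzy ones.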

3.4.2.2.2. Character String Element

The Character String element has the following unique properties:

• Space Length Limit: The maximum allowed spacing inside the hypothesis. The entire text of the recognized page is split to text blocks based on this threshold spacing size first, then each of the text blocks is analyzed to find the character string required.

• Penalties impacting character string search:

§ Character Count Penalty: The penalty that allows prioritizing the length of the answer to be found, in characters.

By default it is “0”, which means there are no priorities based on length.

The common use of this penalty is in conjunction with a regular expression that can return a variable number of characters. If you use a function here that decreases when the argument (number of characters) increases, you will be assigning smaller penalties to longer matched hypotheses, which means you will favour finding the hypothesis of the maximum length.

Some common examples of using the character count penalty:

Locating a description of up to 20 characters long, and trying to maximize the length of the captured text (to capture entire text, not its sub-string):

(20 - x) * 0.001

That will give zero penalty to finding 20 characters, and very little penalty for a shorter hypothesis, but the shorter it gets, the bigger the penalty, so the system will take the longest as the answer.

Another example: Locating a date in the format MM/DD/YY and optimizing to capture exactly 8 characters, with a penalty of 1% for each character longer than 8, as well as a 1% penalty for each character shorter than 8:

abs (x - 8) * 0.01

That will give zero penalty to finding 8 characters representing a correct date, but will penalize 1% for each extra or each missing character in the hypothesis, clearing out garbage but preserving important digits.

§ Partial Words Penalty: The penalty for words cut partially by the zone borders. The penalty is applicable as soon as at least one word necessary for the character string is chopped at the beginning or end of the word.

§ Regex Error Penalty: The penalty for any deviations from the regular expression. The argument is the number of errors, or deviation characters.

The default function is 0.0001 * x (a penalty of just 0.01% per deviation character), which can be made more aggressive to limit the search to closer matches only.

§ Total Spaces Penalty: The penalty for the accumulated inter-word spacing for the candidate hypothesis. Calculated across all the hypotheses and then translated into the measurement units used in the Space Length Limit parameter (inches, centimetres, etc.). Allows penalizing hypotheses with too much space inside them.

§ Word Count Penalty: The penalty for the quantity of words that form a hypothesis. Can be used to favour one-word, few-word or many-word choices when analyzing candidate hypotheses.

• Regular Expressions: Character strings provide very powerful search capabilities based on regular expressions, using extended POSIX notation.

Unlike most competing packages, docAlpha not only allows searching based on an exact match on a regular expression, but also provides tolerance for misrecognitions, or deviations from the regular expression.

The following parameters control the regular expression application:

§ Error Limit: Maximum number of incorrect characters allowed for the regular expression search to still match your expression.

Use caution when allowing for too many errors. Usually an allowance of 1-5 errors (depending on how long the string is) is enough to take care of misrecognitions and find the right text.

The more deviations you allow, the longer the search will be.

§ Error Ratio Limit: Maximum ratio (percentage) of incorrect characters allowed for the regular expression search to still match your expression, compared to the total number of characters in the candidate expression. Must range from 0 to 1 to be a proper percentage value.

§ Regular Expression: The most important parameter; it defines what exactly is allowed in the field.

docAlpha regular expressions are fully compliant with the POSIX-Extended format of expressions. Below is the description of allowed characters and operators to be used for regular expressions.
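The two Character Count Penalty examples given earlier in this section can be checked with a short Python sketch. The formulas are the ones quoted in the guide; the function names are hypothetical and x is the number of characters in a candidate hypothesis.

```python
def description_penalty(x):
    """(20 - x) * 0.001: zero at 20 characters, larger for shorter hypotheses."""
    return (20 - x) * 0.001

def date_penalty(x):
    """abs(x - 8) * 0.01: zero at exactly 8 characters (MM/DD/YY), 1% per extra or missing character."""
    return abs(x - 8) * 0.01

# Description field: among lengths 1..20, the longest candidate has the smallest penalty.
best_description_length = min(range(1, 21), key=description_penalty)

# Date field: among lengths 5..12, exactly 8 characters has the smallest penalty.
best_date_length = min(range(5, 13), key=date_penalty)

print(best_description_length, best_date_length)
```

Both functions return zero at the preferred length, so the preferred hypothesis loses no quality while every other length is penalized in proportion to its deviation.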

3.4.2.2.3. POSIX Basic Regular Expressions Syntax

In the POSIX regular expression syntax, most characters are treated as literals: they match only themselves (e.g., a matches “a”). The exceptions, listed below, are called meta-characters or meta-sequences.

Meta-character    Description

.    Matches any single character. Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches “abc”, etc., but [a.c] matches only “a”, “.”, or “c”.

[ ]    A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches “a”, “b”, or “c”. [a-z] specifies a range which matches any lowercase letter from “a” to “z”. These forms can be mixed: [abcx-z] matches “a”, “b”, “c”, “x”, “y”, or “z”, as does [a-cx-z]. The - character is treated as a literal character if it is the last or the first character within the brackets, or if it is escaped with a backslash: [abc-], [-abc], or [a\-bc].

[^ ]    Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than “a”, “b”, or “c”. [^a-z] matches any single character that is not a lowercase letter from “a” to “z”. As above, literal characters and ranges can be mixed.

^    Matches the starting position within the string.

( )    Defines a marked sub-expression. A marked sub-expression is also called a block or capturing group.

*    Matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc. [xyz]* matches “”, “x”, “y”, “z”, “zx”, “zyx”, “xyzzy”, and so on. (ab)* matches “”, “ab”, “abab”, “ababab”, and so on.

{m,n}    Matches the preceding element at least m and not more than n times. For example, a{3,5} matches only “aaa”, “aaaa”, and “aaaaa”.

Examples:

.at matches any three-character string ending with “at”, including “hat”, “cat”, and “bat”.

[hc]at matches “hat” and “cat”.

[^b]at matches all strings matched by .at except “bat”.

^[hc]at matches “hat” and “cat”, but only at the beginning of the string or line.
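The examples above can be tried out with Python's standard `re` module. Note that `re` is not a POSIX engine and is not what docAlpha uses internally; for these simple patterns the behaviour is the same, so the snippet is only a convenient way to experiment. The word list and helper name are made up for illustration.

```python
import re

words = ["hat", "cat", "bat", "at", "chat"]

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

print(full_matches(r".at", words))     # any three-character string ending with "at"
print(full_matches(r"[hc]at", words))  # "hat" and "cat"
print(full_matches(r"[^b]at", words))  # everything .at matches except "bat"

# ^ anchors the match to the start of the string.
print(re.search(r"^[hc]at", "hat trick") is not None)
print(re.search(r"^[hc]at", "a hat") is not None)

# {m,n} bounds the number of repetitions.
print(full_matches(r"a{3,5}", ["aa", "aaa", "aaaaa", "aaaaaa"]))
```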

3.4.2.2.4. POSIX Extended Regular Expressions

docAlpha also supports the following meta-characters added to the Extended version of POSIX standard:

Meta- character Description

? Matches the preceding element zero or one time. For example, ba? matches “b” or “ba”.

+ Matches the preceding element one or more times. For example, ba+ matches “ba”, “baa”, “baaa”, and so on.

| The choice (aka alternation or set union) operator matches either the expression before or the expression after the operator. For example, abc|def matches “abc” or “def”.

Examples:

o [hc]+at matches “hat”, “cat”, “hhat”, “chat”, “hcat”, “ccchat”, and so on, but not “at”.

o [hc]?at matches “hat”, “cat”, and “at”.

o [hc]*at matches “hat”, “cat”, “hhat”, “chat”, “hcat”, “ccchat”, “at”, and so on.

o cat|dog matches “cat” or “dog”.
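The extended operators can be checked the same way with Python's `re` module, again only as an experimentation aid and not as the docAlpha engine itself. The candidate list and helper name are hypothetical.

```python
import re

candidates = ["at", "hat", "cat", "hhat", "chat", "hcat", "ccchat", "dog"]

def full_matches(pattern, items):
    """Return the items that the pattern matches in their entirety."""
    return [s for s in items if re.fullmatch(pattern, s)]

one_or_more = full_matches(r"[hc]+at", candidates)   # + : one or more, so "at" is excluded
zero_or_one = full_matches(r"[hc]?at", candidates)   # ? : zero or one
zero_or_more = full_matches(r"[hc]*at", candidates)  # * : zero or more, so "at" is included
either = full_matches(r"cat|dog", candidates)        # | : alternation

print(one_or_more)
print(zero_or_one)
print(zero_or_more)
print(either)
```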

3.4.2.2.5. POSIX Character Classes

Since many ranges of characters depend on the chosen locale setting (e.g., in some locale settings letters are organized in the order abc...zABC...Z, while in others as aAbBcC...zZ), the POSIX standard defines classes or categories of characters that do not depend on the character sorting order of the selected locale.

docAlpha supports the POSIX character classes as shown in the following table:

POSIX Class   Perl   ASCII          Description

[:alnum:]            [A-Za-z0-9]    Alphanumeric characters

[:alpha:]            [A-Za-z]       Alphabetic characters

[:digit:]     \d     [0-9]          Digits

[:lower:]            [a-z]          Lowercase letters

[:upper:]            [A-Z]          Uppercase letters

POSIX character classes can only be used within bracket expressions.

For example, [[:upper:]ab] matches the uppercase letters and lowercase “a” and “b”.
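Engines without POSIX class support can emulate the classes with the ASCII ranges from the table above. The sketch below does this for Python's `re` module, which does not understand [:upper:] and similar names. The function name `expand_posix_classes` is hypothetical, and the plain text substitution assumes the class names appear only inside bracket expressions, as required above.

```python
import re

# ASCII equivalents taken from the table above.
POSIX_CLASSES = {
    "[:alnum:]": "A-Za-z0-9",
    "[:alpha:]": "A-Za-z",
    "[:digit:]": "0-9",
    "[:lower:]": "a-z",
    "[:upper:]": "A-Z",
}

def expand_posix_classes(pattern):
    """Replace each POSIX class name with its ASCII range."""
    for name, ascii_range in POSIX_CLASSES.items():
        pattern = pattern.replace(name, ascii_range)
    return pattern

expanded = expand_posix_classes("[[:upper:]ab]")
print(expanded)                             # [A-Zab]
print(bool(re.fullmatch(expanded, "Q")))    # True
print(bool(re.fullmatch(expanded, "a")))    # True
print(bool(re.fullmatch(expanded, "c")))    # False
```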

3.4.2.2.6. Text Element

An element designed to capture all multi-line text that fits into the search zone. Has no unique parameters.

3.4.2.2.7. Form ID Element

An element used for document classification and definition matching. A special case of a Static Text element that has parameters tweaked for optimum use in matching and classification. If a Form ID element is not found on a document, this definition will not even try to match to that document.

3.4.2.2.8. Separator Element

The separators have the following unique parameters:

• Direction: The direction of the separator (a vertical or a horizontal line).

• Fits Entirely: Allows or prohibits considering lines that start and/or end outside of the search zone as candidate hypotheses.

If set to “True”, any line that has pixels outside the search zone won’t be considered as a candidate.

• Min Length Absolute and Max Length Absolute: The minimum and maximum length of the line to be considered as a candidate.

• Min Length Relative and Max Length Relative: The minimum and maximum ratio of the line length to the search zone length for the line to be considered as a candidate. Allows any constant in the [0..1] interval.

• Max Gap: The maximum length of a gap between two line segments for them to still be considered parts of the same line.

• Gap Penalty: The function based on the TOTAL length of ALL gaps between line segments combined to form the line, in inches. For example, 0.01 * x means 1% per inch penalty.

• Corridor: The total width of the corridor within which all pieces (segments) of a line must lie. This allows processing slightly skewed or orthogonally shifted lines and line segments, a common problem for separators after faxing.
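To see how Max Gap, Gap Penalty and the relative length limits interact, consider the sketch below. It is not docAlpha's implementation. It assumes a candidate line is given as (start, end) segments measured in inches along the separator direction, and that the gap penalty is subtracted from a starting quality of 1.0. The function name `evaluate_separator` and its parameter names are hypothetical.

```python
def evaluate_separator(segments, zone_length, max_gap,
                       min_rel=0.0, max_rel=1.0,
                       gap_penalty=lambda x: 0.01 * x):
    """Score a line built from (start, end) segments, in inches.

    Returns None when the candidate is rejected, otherwise a quality
    value in [0, 1] after the gap penalty has been subtracted.
    """
    segments = sorted(segments)
    # Distances between the end of one segment and the start of the next.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(segments, segments[1:])]
    gaps = [g for g in gaps if g > 0]
    if any(g > max_gap for g in gaps):
        return None  # segments are too far apart to form one line
    length = segments[-1][1] - segments[0][0]
    rel = length / zone_length
    if not (min_rel <= rel <= max_rel):
        return None  # line is too short or too long for the search zone
    # The penalty is computed from the TOTAL length of all gaps.
    return max(0.0, 1.0 - gap_penalty(sum(gaps)))

line = [(0.0, 2.0), (2.1, 4.0), (4.3, 6.0)]
print(evaluate_separator(line, zone_length=8.0, max_gap=0.5, min_rel=0.5))
print(evaluate_separator(line, zone_length=8.0, max_gap=0.2, min_rel=0.5))  # None
```

With gaps of 0.1 and 0.3 inches the total is 0.4 inches, so the 1% per inch penalty lowers the quality by 0.004. Tightening Max Gap to 0.2 rejects the same line because the second gap is too wide.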


3.4.2.2.9. White Gap Element

The white gap elements have the following unique parameters:

• Direction: The direction of the white gap (is it vertical or horizontal).

• Min Width Height: The minimum width (or height, depending on orientation) of the considered hypothesis.

• Types: Allows you to select which objects are considered when detecting the span of the gap.

This parameter allows you to ignore some objects when calculating the gaps, for example, to ignore the filling text and locate cells based on black line borders, or vice versa, to ignore the borders, and locate the text even if it is printed carelessly crossing the black border of the pre-printed cells.

• White Gap sensitivity controls: The white gap calculation is based on a histogram, a vertical projection histogram for the horizontal gap and a horizontal projection histogram for the vertical gap.

That histogram sensitivity controls how tolerant the gap is to intervening objects. For example, you may wish to ignore occasional punctuation marks inside the gap that appear there due to scanning defects, but be sensitive enough to detect one full word.

The following White Gap element parameters give you full control of histogram-based sensitivity:

o Threshold Coefficient: The threshold coefficient multiplied by the histogram maximum for the area of the white gap element provides the dynamic histogram threshold, the “trigger point” where the gap becomes interrupted.

o Lower and Upper Threshold Limit: The threshold limits allow you to set up the “noised white color” and the “cut-off peak black color” for the histogram.

The histogram cut-off threshold is limited by those two parameters. If any peak goes over the upper limit, it is replaced with the upper limit, and if any part of the histogram goes below the minimum, it’s replaced with the minimum level.

The use of the limits allows you to process the gap with a higher dynamic range (in photography terms, it “reveals the subtle hues”) on overly-saturated or almost-blank documents to detect the gaps efficiently.
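The histogram-based logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not docAlpha's algorithm: the histogram is taken to be a list of black-pixel counts per column (or row), values are clamped to the lower and upper limits first, and a position belongs to the gap when its clamped value does not exceed the threshold coefficient times the clamped maximum. The name `find_white_gaps` and the `min_width` parameter (standing in for Min Width Height) are hypothetical.

```python
def find_white_gaps(histogram, threshold_coefficient,
                    lower_limit, upper_limit, min_width=1):
    """Return (first, last) index pairs of runs that count as white gaps."""
    # Clamp peaks to the upper limit and noise to the lower limit.
    clamped = [min(max(v, lower_limit), upper_limit) for v in histogram]
    # Dynamic threshold: the "trigger point" where a gap is interrupted.
    threshold = threshold_coefficient * max(clamped)
    gaps, start = [], None
    # The infinite sentinel closes a run that reaches the end of the list.
    for i, value in enumerate(clamped + [float("inf")]):
        if value <= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_width:
                gaps.append((start, i - 1))
            start = None
    return gaps

# Black-pixel counts per column: text, a gap with specks, text, a clean gap.
hist = [40, 38, 2, 0, 1, 0, 3, 45, 50, 0, 0, 0, 0, 42]

print(find_white_gaps(hist, 0.10, 0, 50, min_width=3))  # [(2, 6), (9, 12)]
print(find_white_gaps(hist, 0.05, 0, 50, min_width=3))  # [(2, 5), (9, 12)]
```

With a coefficient of 0.10 the threshold is 5, so the specks of 2, 1 and 3 pixels are tolerated. Lowering the coefficient to 0.05 makes the 3-pixel speck at index 6 interrupt the first gap.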

3.4.2.2.10. Checkmark Element

An element designed to return the whole search zone as a checkmark field. Has no unique parameters.

3.4.2.2.11. Image Element

An element designed to return the whole search zone as an image field. Has no unique parameters.

3.4.2.2.12. Barcode Element

The Barcode elements have the following unique properties:

• Value: Allows setting the barcode value if a search-by-value is required.

• Barcode Type: Allows selecting which types of barcodes should be considered as hypotheses.

• Barcode Orientation: Allows selecting which placement orientations should be considered as the potential hypotheses.


3.4.2.2.13. Generic Objects Element

The compound-hypothesis Generic Objects Element has the following unique properties:

• Fits Entirely: Allows or prohibits considering objects located partly outside the search zone to be hypothesis candidates. If set to “True”, any object that has at least one pixel outside the search zone won’t be considered as a candidate.

• Types: Allows selecting what type of objects should be considered within the search zone as candidate hypotheses.

• Max Height, Max Width, Min Height, and Min Width: Allow setting the minimum and maximum sizes of objects in the search zone that should be considered as candidate element hypotheses.

3.4.2.2.14. Alternative Element

This compound element is used to split the logic of finding elements into more than one choice. It has no unique parameters.

3.4.2.2.15. Group Element

Logical grouping and optimization element. It has no unique parameters.

3.4.2.2.16. List Element

A compound element geared for finding and parsing tables, lists and transactions. The main working part is a “Group” element inside it that represents one row. That “group”, or a row, is then searched over and over again, to reveal the rest of the lines. The List Element has the following unique properties:

• Rows: Limits the maximum number of rows to be found in a list, table or transaction set.

By default, the number of rows to be extracted is set to 10. To speed up matching at the beginning, you can lower it to 2 or 3, and once the first several rows are located properly, change it to the expected maximum number of rows.

• Headers: Defines the name of the Group element that is defined above this List element and that includes search elements for detection of column headers for this List element.

• Aliases: Contains semicolon-separated pairs of Name=Value that define the rules of search element name substitution. Used for referencing the previous line for the first row of the table.

For example: We are capturing a table and working on the detection of cells in the Description column.

First we define a group that locates the headers, say called grHeaders. Say, in grHeaders the header of the Description column is located with a Static Text element called stDescription.

Next we define the List element, and that List element automatically creates the member Group element, which contains the definition for capturing all elements belonging to one row (for tables, that means all cells of one row; for transactions, all members of one transaction). Say, that member Group element is called grRow, and the Description column cell element is defined as a Character String element in the grRow group, and is called “Description”.

To bind grHeaders to our List element, we type “grHeaders” in the “Headers” parameter of the List element.

Now we need to bind the columns in the member Group element to the column header location elements in the Headers Group. That’s where the aliases are used.


For our example, for the Description column, we would define the “Aliases” property in the List element:

“Description=stDescription”

After that is done, we can use constraints of the kind:

Below: grRow.prev.Description.ph-bottom

and this will work correctly for all rows of the table. For every row from the second onward, the relation refers to the element in the previous row. The first row has no previous row, so for the first row the alias steps in and the relation is translated as referring to the Header row instead of the previous row in the table.
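The alias substitution can be pictured as a small lookup. The sketch below is a conceptual model only, not docAlpha code: it parses the semicolon-separated Name=Value pairs of the “Aliases” property and shows which element a “prev” reference would resolve to for a given row. The function names and the `grRow[n]` notation for a specific row are hypothetical.

```python
def parse_aliases(text):
    """Parse "Name=Value;Name=Value" into a dictionary."""
    aliases = {}
    for pair in text.split(";"):
        if pair.strip():
            name, value = pair.split("=", 1)
            aliases[name.strip()] = value.strip()
    return aliases

def resolve_prev(element, row_index, aliases,
                 headers_group="grHeaders", row_group="grRow"):
    """Return the element that "<row_group>.prev.<element>" refers to."""
    if row_index == 0:
        # The first row has no previous row: use the aliased header element.
        return f"{headers_group}.{aliases[element]}"
    # Later rows refer to the same element in the row above.
    return f"{row_group}[{row_index - 1}].{element}"

aliases = parse_aliases("Description=stDescription")
print(resolve_prev("Description", 0, aliases))  # grHeaders.stDescription
print(resolve_prev("Description", 2, aliases))  # grRow[1].Description
```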

4. Working with the Fixed-Forms Document Definitions

Fixed-Form Capture is the old, traditional approach to capturing data. The approach involves finding some unique “matching” markers on the page, setting up “distortion compensation” elements to compensate for the defects of scanning and faxing, and then defining the capture zones based on fixed coordinates on the page.

The benefit of the approach is the simplicity of the setup. The fields are simply drawn with the mouse, their coordinates are remembered, and all data is captured based on those coordinates (after compensation for the distortions using the distortion compensation elements).

The drawback of the approach is its low tolerance for flexibility and mobility of data capture elements on the documents. Unless the documents were specifically designed to be captured in this manner, they will probably not be “fixed” enough to use the coordinate-based approach.

Still, for forms specifically created to be processed with the Fixed-Forms technology, the Fixed-Forms definitions offer a quick method to set up and process form classification and extraction jobs.

Below are step-by-step instructions on how to add a Fixed-Form definition to the batch, set up its fields for data capture from a fixed-form style document, and test and fine-tune a fixed forms task.

1. Add documents on which the testing will be done during the definitions development, by clicking on the “+” button on the toolbar:

After the documents are added, any of them can be viewed in the Current Image Document Window by browsing in the Batch Window and selecting the necessary document and page:

2. Now add a Definition to the Batch. To add a definition, click on the “New Definition” button on the toolbar:

Give the definition a name here, and select Fixed Form Definition to create a fixed form-based definition, with capture based on coordinates, versus a semi-structured document definition, with capture based on logical conditions:


Now the Definitions list will show one definition in it.

3. The next step is to run full-page OCR to see which raw recognition results are available for capturing on the page:

4. Each form needs to be reliably identified by some unique markers. A unique combination of black squares at the corners or, even better, some unique pre-printed text such as the form name works well for that purpose.

Forms get damaged in real-life scanning, so multiple unique markers are better than a single marker. It is therefore recommended to select more than one static text marker, for example, one close to the top of the page and another one far down at the bottom.

Use the following icon on the main toolbar to set these markers:

The identification zones that you created are now highlighted with a purple highlight on the main document image:

5. Use the Fixed Forms Toolbar buttons for creating fixed form elements:

Fixed Forms Toolbar buttons and their functions:

• Editing/Testing Mode toggle button:

o In form Editing Mode, pressing the button switches the Designer to Testing Mode, to test-match your form against sample images;

o In form Testing Mode, pressing the button switches the Designer to Editing Mode, to edit or add fields.

• Scroll and Draw buttons: mutually exclusive buttons that toggle each other. They control the Scroll - Draw Fixed Zone mode selection:

o Scroll: turns on Scroll mode, in which mouse actions are used to move the blocks over the page

o Draw: turns on Draw mode, in which mouse actions are used to draw new blocks on the form

• The third group of buttons contains the buttons that select the current type of the block that you create when you draw it with the mouse in Draw mode:

- Create a static identification marker (see step 4 above)

- Auto-detect corner reference blocks (black squares)

- Draw machine-printed (OCR) text blocks

- Draw hand-printed (ICR) text blocks

- Draw checkmark (OMR) blocks

- Draw a group of checkmarks (OMR blocks)

- Draw barcode (OBR) blocks

- Draw image blocks

- Correct Block Positions when replacing the background form image

6. Once all elements are added and configured for this definition, you can add more definitions and set up their elements.

7. Once all definitions are configured and tested, save the batch locally and then use it as a basis for a newly created workflow.

5. Data Extraction Type Libraries and Library Mode

Each definition may have an associated Data Extraction Type Library. Such a library contains a set of defined custom data extraction types, fine-tuned to perform well on specialized data extraction tasks, using data types, image pre-processing tricks, special zonal form-out features, filtering and formatting, and character and sub-string replacement rules to optimize the capture accuracy and confidence.

Data Extraction Type Libraries can be viewed inside the Definition Window by switching to Library Mode using the “Definition / Library Switch” toggle button. Once finished working in Library Mode, click on the “Definition / Library Switch” button again to toggle the Definition Window to show the current definition again.

The custom data type libraries allow customizing recognition parameters for the blocks. Such a set of customized parameters is given its own name and can be re-used for other blocks of this definition.

For example, if numeric-only fields are often captured from hand-printed forms, a custom element can be created that not only captures the numeric field but also performs a series of character-level replacements to auto-correct the block content, such as

I --> 1, A --> 4, B --> 8, O --> 0, etc.

This reduces the manual labor (validation and correction) done by the operator on the Verification station.
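The effect of such a rule can be previewed with a short script before it is configured in the library. The sketch below is only a model of character-level replacement and is not how docAlpha applies the rule internally; the names `REPLACEMENTS` and `autocorrect_numeric` are hypothetical.

```python
# Letters that hand-print recognition commonly confuses with digits,
# mapped as in the example above.
REPLACEMENTS = str.maketrans({"I": "1", "A": "4", "B": "8", "O": "0"})

def autocorrect_numeric(text):
    """Apply the replacements to every matching character in the field."""
    return text.translate(REPLACEMENTS)

print(autocorrect_numeric("I2O4B"))  # 12048
print(autocorrect_numeric("2024"))   # 2024 (digits are left unchanged)
```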

To implement that scenario and create a new custom data type, switch to the Library Mode, right-click on the “Block Types” root node in the Definitions Window and add a new custom block type to your library just like adding a new block to your definition.

Once you have added the block type, you can add the character replacement rule:

Note: the character replacement rule works at the character level and replaces every instance of the character that occurs in the captured form field.

You can also use the pre-defined filters, or limit the recognition alphabet to dictate to the OCR/ICR engines which characters can be used for recognition:

In the above example, only digits and punctuation marks are allowed to be recognized by the OCR/ICR engine at recognition time.

If you are using more than one recognition engine in your workflow, you can specify that the custom block type uses an alternative recognition engine. For example, in the screenshot below the OpenText RecoStar engine is selected:

The custom block types can also be used to dictate the orientation of the text recognition. Note that the list of available orientations depends on the selected recognition engine. Some engines, for example, do not have an option to read the text upside down, and only allow normal (“default”) orientation or bottom-to-top / top-to-bottom for vertical text:

After all those parameters for the custom block type are set, you can switch out of the Library Mode by clicking the “Definition / Library Switch” toggle button again.

To apply the custom block type recognition rules to any of your blocks, first click on your block name in the elements tree to open it, then select that custom block type in the “Block Type” drop-down list for your definition block:

To delete a custom block type, switch to the Library Mode, select the custom block type you want to delete, right-click on it and select the “Delete” option from the pop-up menu.


6. How to work with and extract different types of barcodes

6.1. Main scenarios

There are seven scenarios (for each engine type) for extracting a barcode using the “Barcode” or “White Gap” search types in the Design Station:

1) White Gap + Barcode type for block. To use this scenario, provide the following settings:

- Create a “White Gap” search type in the Design Station

- Set “Export block type” to Barcode:

- Set “types” to none; to do this, uncheck all checkboxes in the Enum Editor:

2) White Gap + Barcode block type + barcode library type. The settings are as follows:

- Create a “White Gap” search type in the Design Station

- Switch to Library mode using the “Definition / Library Switch” button

- Add a new “Barcode” block type in Library mode

- Provide a name for the type (the name will automatically appear in the drop-down list for “ExportBlockType” in definition creation mode)

- Set “BarcodeType” to the necessary barcode type using the drop-down list:

- Switch back from Library mode to definition creation mode using the “Definition / Library Switch” button

- Select the “White Gap” search type created in the first step and set the “ExportBlockType” option to use the created library barcode type

- Set “types” to none; to do this, uncheck all checkboxes in the Enum Editor.

3) White Gap + Barcode + barcode library type + zonal recognizer. The settings are as follows:

- Create a “White Gap” search type in the Design Station

- Switch to Library mode using the “Definition / Library Switch” button

- Add a new “Barcode” block type in Library mode

- Provide a name for the type (the name will automatically appear in the drop-down list for “ExportBlockType” in definition creation mode)

- Set “BarcodeType” to the necessary barcode type using the drop-down list:

- Set “Zonal Recognizer” to the engine (Recostar or Nuance) that matches the main engine set for the definition (the engine should be the same; it will be used to re-recognize the barcode's zone only)

- Switch back from Library mode to definition creation mode using the “Definition / Library Switch” button

- Select the “White Gap” search type created in the first step and set the “ExportBlockType” option to use the created library barcode type

- Set “types” to none; to do this, uncheck all checkboxes in the Enum Editor.

4) Barcode Type + Barcode type for block

- Create a “Barcode” search type in the Design Station

- Set “Export block type” to Barcode:

5) Barcode Type + Barcode type for block + Main BarcodeType

- Create a “Barcode” search type in the Design Station

- Set “ExportBlockType” to Barcode

- Set “BarcodeType” to the necessary barcode type using the Enum Editor node, which lists all available barcodes (uncheck all other types):

6) Barcode Type + barcode library type

- Create a “Barcode” search type in the Design Station

- Switch to Library mode using the “Definition / Library Switch” button

- Add a new “Barcode” block type in Library mode

- Provide a name for the type (the name will automatically appear in the drop-down list for “ExportBlockType” in definition creation mode)

- Set “BarcodeType” to the necessary barcode type using the drop-down list:

- Switch back from Library mode to definition creation mode using the “Definition / Library Switch” button

- Select the “Barcode” search type created in the first step and set the “ExportBlockType” option to use the created library barcode type

7) Barcode Type + barcode library type + zonal recognizer. The settings are as follows:

- Create a “Barcode” search type in the Design Station

- Switch to Library mode using the “Definition / Library Switch” button

- Add a new “Barcode” block type in Library mode

- Provide a name for the type (the name will automatically appear in the drop-down list for “ExportBlockType” in definition creation mode)

- Set “BarcodeType” to the necessary barcode type using the drop-down list:

- Set “Zonal Recognizer” to the engine (Recostar or Nuance) that matches the main engine set for the definition (the engine should be the same; it will be used to re-recognize the barcode's zone only)

- Switch back from Library mode to definition creation mode using the “Definition / Library Switch” button

- Select the “Barcode” search type created in the first step and set the “ExportBlockType” option to use the created library barcode type

6.2. Scenarios that can be applied to extract a barcode, by barcode type and engine

Barcode type          Nuance Engine        Recostar Engine

Codabar               1-7                  1-7
Code 32               None                 None
Code 39               1-7                  1-7
Code 39 NSS           None                 2, 3, 6, 7
Code 93               None                 1-7
Code 128              1-7                  1-7
EAN 8                 1, 2, 3, 4, 6, 7     1-7
EAN 13                1-7                  1-7
GS1-128               2, 3, 7              2, 3, 7
UPC-A                 2, 3, 7              None
UPC-E                 None                 None
Industrial 2 of 5     None                 2, 3, 6, 7
Interleaved 2 of 5    1-7                  1-7
Airline 2 of 5        None                 None
Matrix 2 of 5         None                 2, 3, 6, 7
Intelligent Mail      None                 None
PLANET                None                 2, 3, 6, 7
POSTNET               2, 3, 6, 7           2, 3, 6, 7
Aztec Code            ?                    ?
Data Matrix           None                 2, 3, 6, 7
PDF 417               2, 3, 6, 7           2, 3, 6, 7
QR Code               None                 2, 3, 6, 7
Patch Code Type1      None                 None
Patch Code Type2      None                 2, 3, 6, 7
Patch Code Type3      None                 None
Patch Code Type4      None                 2, 3, 6, 7
Patch Code Type5      None                 2, 3, 6, 7
Patch Code Type6      None                 2, 3, 6, 7
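When planning a definition, the table can be used as a lookup: pick the barcode type and engine, then read off the usable scenarios. The sketch below encodes only a handful of rows from the table as an illustration; the dictionary layout and the function names `usable_scenarios` and `engines_for` are hypothetical and are not part of docAlpha.

```python
ALL = (1, 2, 3, 4, 5, 6, 7)
LIBRARY_ONLY = (2, 3, 6, 7)  # the scenarios that use a barcode library type

# A subset of the table above; an empty tuple stands for "None".
SCENARIOS = {
    "Code 39": {"Nuance": ALL, "Recostar": ALL},
    "Code 93": {"Nuance": (), "Recostar": ALL},
    "EAN 8": {"Nuance": (1, 2, 3, 4, 6, 7), "Recostar": ALL},
    "UPC-A": {"Nuance": (2, 3, 7), "Recostar": ()},
    "QR Code": {"Nuance": (), "Recostar": LIBRARY_ONLY},
    "PDF 417": {"Nuance": LIBRARY_ONLY, "Recostar": LIBRARY_ONLY},
}

def usable_scenarios(barcode_type, engine):
    """Return the scenario numbers listed for this type and engine."""
    return SCENARIOS[barcode_type][engine]

def engines_for(barcode_type):
    """Return the engines that list at least one scenario for this type."""
    return [e for e, s in SCENARIOS[barcode_type].items() if s]

print(usable_scenarios("EAN 8", "Nuance"))  # (1, 2, 3, 4, 6, 7)
print(engines_for("QR Code"))               # ['Recostar']
print(engines_for("UPC-A"))                 # ['Nuance']
```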