Papermerge

May 16, 2020


Contents

1 Requirements
    1.1 Python
    1.2 Imagemagick
    1.3 Poppler
    1.4 Tesseract
    1.5 Database

2 Installation
    2.1 OS Specific Packages
        2.1.1 1. Web App + Workers Machine
            2.1.1.1 Ubuntu Bionic 18.04 (LTS)
        2.1.2 2. Web App Machine
            2.1.2.1 Ubuntu Bionic 18.04 (LTS)
        2.1.3 3. Worker Machine
    2.2 Manual Way
        2.2.1 Package Dependencies
        2.2.2 Web App
        2.2.3 Worker
        2.2.4 Recurring Commands
    2.3 Systemd
        2.3.1 Package Dependencies
        2.3.2 Web App
    2.4 Docker
    2.5 Ansible (Semiautomated)
    2.6 Jenkins + Ansible (Fully Automated Deployment)

3 Languages Support

4 REST API
    4.1 How It Works?
        4.1.1 Get a Token
        4.1.2 Use the Token
    4.2 REST API Reference

5 Page Management
    5.1 Delete Page(s)
    5.2 Reorder Pages
    5.3 Cut & Paste

6 Settings
    6.1 STORAGE_ROOT
    6.2 S3
    6.3 OCR
    6.4 DATABASES
    6.5 STATICFILES_DIRS

7 Developers Guide
    7.1 Contributing
        7.1.1 Fix a Typo
        7.1.2 Open an Issue
        7.1.3 Add Your Language Support
    7.2 Design
        7.2.1 1. Frontend
        7.2.2 2. Backend
        7.2.3 3. Workers
    7.3 Branching Model
        7.3.1 Worker, Papermerge-js Branching Model?
        7.3.2 Git Branching/Tagging Blitz Introduction
    7.4 Language Support
        7.4.1 What is Language Support?
        7.4.2 User Interface Language
        7.4.3 Document Content Language

8 Indices and tables

Papermerge

I have nothing against paper. Paper is a brilliant invention of humanity. But in the 21st century I find it more appropriate for paper-based documents to be digitized (scanned). Once scanned, appropriate software can be used to find any document in a fraction of a second, just by typing a few keywords.

Papermerge is a document management system designed to work with scanned documents. As well as OCR with full text search, it provides the look and feel of major modern file browsers, with a hierarchical structure for files and folders, so that you can organize your documents in a similar way to Dropbox (via web) or Google Drive.


CHAPTER 1

Requirements

Papermerge depends on the following software:

• Python >= 3.8.0

• Tesseract - for OCR

• Imagemagick - for image operations

• Poppler - for PDF operations

• PostgreSQL >= 11.0 - for Full Text Search

1.1 Python

Papermerge is a Python 3 application.

1.2 Imagemagick

Papermerge uses Imagemagick to convert between image formats.

1.3 Poppler

More precisely, the poppler utils are used. For example, the pdfinfo command line utility is used to find out the number of pages in a PDF document.
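As a rough illustration of how a page count can be read from pdfinfo, the sketch below runs the utility and parses the "Pages:" line of its output. The helper names parse_page_count and get_page_count are hypothetical, and this is not necessarily how Papermerge does it internally.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output: str) -> int:
    """Extract the number of pages from the text printed by pdfinfo."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def get_page_count(pdf_path: str) -> int:
    """Run pdfinfo (from poppler-utils) on a file and return its page count."""
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfinfo fails
    )
    return parse_page_count(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the parsing easy to check without having poppler-utils installed.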


1.4 Tesseract

If you have never heard of Tesseract: it is an open source Optical Character Recognition (OCR) engine whose development has long been sponsored by Google. It extracts text from images and works very well for a wide range of languages.

1.5 Database

One of Papermerge’s core philosophies is “Find Any Document”. PostgreSQL comes with Full Text Search (FTS) support out of the box. Papermerge uses the websearch_to_tsquery PostgreSQL function, which was introduced in PostgreSQL version 11.0.

With FTS you can search documents the same way people search web pages in a search engine (Google, Bing, Yandex, DuckDuckGo): you type a few words, and the search result displays only documents containing those words, sorted by relevance.
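To give an idea of what such a search looks like at the SQL level, here is a minimal sketch that builds a ranked query around websearch_to_tsquery. The table name page and the tsvector column text_fts are assumptions made for illustration, not Papermerge’s actual schema, and the %s placeholder assumes a psycopg2-style database driver.

```python
def build_search_query(user_text: str):
    """Return (sql, params) for a ranked full text search.

    The table "page" and the tsvector column "text_fts" are hypothetical;
    adapt them to the real schema.
    """
    sql = (
        "SELECT id, ts_rank(text_fts, query) AS rank "
        "FROM page, websearch_to_tsquery('english', %s) AS query "
        "WHERE text_fts @@ query "
        "ORDER BY rank DESC"
    )
    # The user's words are passed as a parameter, never pasted into the SQL.
    return sql, (user_text,)
```

The returned pair can be handed to a cursor as cursor.execute(sql, params).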


CHAPTER 2

Installation

There are different methods to install Papermerge. They differ in purpose and in the amount of effort required.

2.1 OS Specific Packages

This section explains how to install operating system specific packages. There are three cases:

1. Both web app and workers are on the same machine

2. Web app machine

3. Worker machine

2.1.1 1. Web App + Workers Machine

2.1.1.1 Ubuntu Bionic 18.04 (LTS)

Install the required Ubuntu packages:

sudo apt-get update
sudo apt-get install python3 python3-pip python3-venv \
    poppler-utils \
    imagemagick \
    build-essential \
    tesseract-ocr \
    tesseract-ocr-deu \
    tesseract-ocr-eng

Note that only the English and German (Deutsch) language packages for Tesseract are needed.

Ubuntu Bionic 18.04 ships with the postgres 10 package. Papermerge, on the other hand, requires at least version 11 of Postgres.


Install Postgres version 11:

# add the repository
sudo tee /etc/apt/sources.list.d/pgdg.list <<END
deb http://apt.postgresql.org/pub/repos/apt/ bionic-pgdg main
END

# get the signing key and import it
wget https://www.postgresql.org/media/keys/ACCC4CF8.asc
sudo apt-key add ACCC4CF8.asc

# fetch the metadata from the new repo
sudo apt-get update

# install the server itself (package name as published in the PGDG repository)
sudo apt-get install postgresql-11

2.1.2 2. Web App Machine

Tesseract does not need to run on a computer that hosts only the Web App.

2.1.2.1 Ubuntu Bionic 18.04 (LTS)

Install the required Ubuntu packages:

sudo apt-get update
sudo apt-get install python3 python3-pip python3-venv \
    poppler-utils \
    imagemagick \
    build-essential

2.1.3 3. Worker Machine

The worker is the one performing the heavy task of extracting text from images, so it must have the tesseract packages installed.

2.2 Manual Way

Papermerge has two parts:

• Web application

• Worker - which is used for OCR operation

With this installation method both parts run on the same computer. It is suitable for developers. Nothing is automated in this method, so it is a perfect choice if you want to understand the mechanics of the project.

If you follow along in this document and still have trouble, please open an issue on GitHub so I can fill in the gaps.

2.2.1 Package Dependencies

In this setup, Web App and Workers run on the same machine.

Install the OS specific packages for Web App + Workers (see section 2.1.1).


Check that Postgres version 11 is up and running:

sudo systemctl status postgresql@11-main.service

Create new role for postgres database:

sudo -u postgres createuser --interactive

When asked Shall the new role be allowed to create databases? please answer yes (when running tests, Django creates a temporary database).

Create new database owned by previously created user:

sudo -u postgres createdb -O <user-created-in-prev-step> <dbname>

Set a password for user:

sudo -u postgres psql
ALTER USER <username> WITH PASSWORD '<password>';

2.2.2 Web App

Once we have prepared the database, tesseract and the other dependencies, let’s start with Papermerge itself.

Clone main papermerge project:

git clone https://github.com/ciur/papermerge papermerge-proj

Clone papermerge-js project (this is the frontend part):

git clone https://github.com/ciur/papermerge-js

Create a Python virtual environment named .venv:

cd papermerge-proj
python3 -m venv .venv

Activate python’s virtual environment:

source .venv/bin/activate

Install the required Python packages (you are now in the papermerge-proj directory):

# while in <papermerge-proj> folder
pip install -r requirements.txt

Rename the file config/settings/development.example.py to config/settings/development.py. This file is the default for DJANGO_SETTINGS_MODULE and it is included in .gitignore.

Adjust the following settings in config/settings/development.py:

• DATABASES - name, username and password of database you created in PostgreSQL

• STATICFILES_DIRS - include path to <absolute_path_to_papermerge_js_clone>/static

• MEDIA_ROOT - absolute path to media folder

• STORAGE_ROOT - absolute path to the same media root, but with a “local:” prefix, e.g. local:/home/vagrant/papermerge-proj/run/media
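The relationship between MEDIA_ROOT and STORAGE_ROOT can be sketched as follows. The helper storage_root_for is hypothetical and only illustrates how the prefix is applied; the path is the example path used elsewhere in this documentation.

```python
def storage_root_for(media_root: str) -> str:
    """Build a local STORAGE_ROOT value from an absolute MEDIA_ROOT path."""
    if not media_root.startswith("/"):
        raise ValueError("MEDIA_ROOT must be an absolute path")
    return "local:" + media_root


# Example values for config/settings/development.py
MEDIA_ROOT = "/home/vagrant/papermerge-proj/run/media"
STORAGE_ROOT = storage_root_for(MEDIA_ROOT)
```

Deriving one value from the other keeps both settings pointing at the same folder.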


Note:

1. Make sure that data_folder_in and data_folder_out point to the same location.

2. Make sure that the folder pointed to by data_folder_in and data_folder_out exists.

Then, as in any Django based project, run migrations, create a superuser and start the built-in web server:

cd <papermerge-proj>
./manage.py migrate
./manage.py createsuperuser
./manage.py runserver

At this point, you should be able to see the (styled) login page and log in with the administrative user you created with the ./manage.py createsuperuser command.

The login screen should look like the screenshot below.

You can also upload some documents and see their previews.

But because there is no worker configured yet, documents are basically plain images. Let’s configure a worker!


2.2.3 Worker

Let’s add a worker on the same machine as the Web Application we configured above. We will use the same Python virtual environment as for the Web Application.

Note: It is the workers, not the Web App, that depend on (and use) Tesseract.

Clone the repo and install the required packages (in the same Python virtual environment as the Web App):

git clone https://github.com/ciur/papermerge-worker
cd papermerge-worker
pip install -r requirements.txt

Create a file <papermerge-worker>/config.py with the following configuration:

worker_concurrency = 1
broker_url = "filesystem://"
broker_transport_options = {
    'data_folder_in': '/home/vagrant/papermerge-proj/run/broker/data_in',
    'data_folder_out': '/home/vagrant/papermerge-proj/run/broker/data_in',
}
worker_hijack_root_logger = True
task_default_exchange = 'papermerge'
task_ignore_result = False
result_expires = 86400
result_backend = 'rpc://'
include = 'pmworker.tasks'
accept_content = ['pickle', 'json']
s3_storage = 's3:/<not_used>'
local_storage = "local:/home/vagrant/papermerge-proj/run/media/"


Important: The folder pointed to by data_folder_in and data_folder_out must exist and be the same one as in the configuration for the Web Application.
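A small check along these lines can catch a misconfigured broker folder before the worker starts. This is only a sketch: the function ensure_broker_folder is hypothetical and not part of papermerge-worker.

```python
import os


def ensure_broker_folder(options: dict) -> str:
    """Check that both broker folders are identical, then create the folder."""
    folder_in = options["data_folder_in"]
    folder_out = options["data_folder_out"]
    if folder_in != folder_out:
        raise ValueError("data_folder_in and data_folder_out must be the same")
    # exist_ok=True makes the call safe to repeat on every start
    os.makedirs(folder_in, exist_ok=True)
    return folder_in
```

It could be called with the broker_transport_options dictionary from config.py.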

Now, while in <papermerge-worker> folder, run command:

CELERY_CONFIG_MODULE=config celery worker -A pmworker.celery -Q papermerge -l info

At this stage, keep both the built-in web server (the ./manage.py runserver command above) and the worker running in the foreground and upload a couple of PDF documents. Give the worker a few minutes to OCR them, and each document becomes more than an image - you can now select text in it!

Fig. 1: Now you should be able to select text

2.2.4 Recurring Commands

At this point, if you try to search for a document, nothing will show up in the search results. This is because workers OCR a document and place the results into a .txt file, so the extracted text is not yet in the database.

A special Papermerge command, txt2db, reads the .txt files and inserts their content into the database entries of the associated documents (more precisely, of the documents’ pages).

Afterwards another command, update_fts, prepares a special database column with the correct information about the document (more precisely, the page).

Run the commands manually:


cd <papermerge-proj>
./manage.py txt2db
./manage.py update_fts

Note: In a manual setup (i.e. without any of Papermerge’s background services running), if you want a document to be available for search, you need to run the ./manage.py txt2db and ./manage.py update_fts commands every time after a document is OCRed.
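If retyping the two commands gets tedious during development, a small loop can run them periodically. The sketch below is an assumption-laden convenience script, not something shipped with Papermerge; the function names and the 60 second interval are made up for illustration.

```python
import subprocess
import time


def recurring_commands(manage_py: str = "./manage.py"):
    """Return the two commands in the order they have to run."""
    return [[manage_py, "txt2db"], [manage_py, "update_fts"]]


def run_forever(project_dir: str, interval_seconds: int = 60) -> None:
    """Run both commands from the project folder, sleep, and repeat."""
    while True:
        for command in recurring_commands():
            # cwd makes ./manage.py resolve inside <papermerge-proj>
            subprocess.run(command, cwd=project_dir, check=True)
        time.sleep(interval_seconds)
```

The order matters: txt2db has to load the extracted text before update_fts can index it.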

2.3 Systemd

In this installation method you use a special Papermerge command, startetc, to generate a bunch of configuration files in the <papermerge-proj>/run/etc folder. Then, with a single command:

systemctl --user start papermerge

you start a full-fledged staging environment with nginx, gunicorn, one worker and the recurring commands running as services on a single machine. I really love this method and I use it in my local development environment. This method relies on systemd and its --user argument.

2.3.1 Package Dependencies

You will need to install the OS specific packages for Web App + Workers first. Then make sure that PostgreSQL is up and running.

Make sure that your machine has both nginx and systemd available:

nginx -V
systemd --version

2.3.2 Web App

Clone main papermerge project:

git clone https://github.com/ciur/papermerge papermerge-proj

Clone papermerge-js project (this is the frontend part):

git clone https://github.com/ciur/papermerge-js

Create a Python virtual environment named .venv:

cd papermerge-proj
python3 -m venv .venv

Activate python’s virtual environment:

source .venv/bin/activate

Install required python packages (now you are in papermerge-proj directory):


# while in <papermerge-proj> folder
pip install -r requirements.txt

Rename the file config/settings/development.example.py to config/settings/development.py. This file is the default for DJANGO_SETTINGS_MODULE and it is included in .gitignore.

Adjust the following settings in config/settings/development.py:

• DATABASES - name, username and password of database you created in PostgreSQL

• MEDIA_ROOT - absolute path to media folder

• STORAGE_ROOT - absolute path to the same media root, but with a “local:” prefix, e.g. local:/home/vagrant/papermerge-proj/run/media

Note:

1. Make sure that data_folder_in and data_folder_out point to the same location.

2. Make sure that the folder pointed to by data_folder_in and data_folder_out exists.

Then, as in any Django based project, run migrations and create a superuser:

cd <papermerge-proj>
./manage.py migrate
./manage.py createsuperuser

Run startetc command:

./manage.py startetc

Just out of curiosity, have a look at the <papermerge-proj>/run folder generated by the startetc command. It should have the following structure:

run
    broker
        data_in
        data_out
        data_processed
    etc
        gunicorn.conf.py
        nginx.conf
        papermerge.env
        pmworker.env
        pmworker.py
        systemd
            papermerge.service
            papermerge.target
            pm_nginx.service
            pmworker.service
            txt2db.service
            txt2db.timer
            update_fts.service
            update_fts.timer
    log
    tmp

Systemd can be used to manage user services. The --user flag is used for that. User services must be referenced in the ~/.config/systemd/user folder. By the way, I made a video about the systemd --user feature.


Create ~/.config/systemd/user if you don’t have it. Then reference (create symbolic links to) the <papermerge-proj>/run/etc/systemd/ units in the ~/.config/systemd/user folder:

cd ~/.config/systemd/user
ln -s <papermerge-proj>/run/etc/systemd/* .

Important: Path <papermerge-proj>/run/etc/systemd/* must be absolute.
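The linking step can also be scripted. The following sketch resolves the source folder to an absolute path before linking, which is exactly what the note above asks for; the function link_units is a hypothetical helper, not a Papermerge command.

```python
from pathlib import Path


def link_units(units_dir: str, user_dir: str) -> list:
    """Symlink every generated unit file into the systemd user folder."""
    source = Path(units_dir).resolve()  # make the link targets absolute
    target = Path(user_dir)
    target.mkdir(parents=True, exist_ok=True)
    links = []
    for unit in sorted(source.iterdir()):
        link = target / unit.name
        # leave existing files and links (even broken ones) untouched
        if not link.exists() and not link.is_symlink():
            link.symlink_to(unit)
        links.append(link)
    return links
```

A typical call would pass <papermerge-proj>/run/etc/systemd as the first argument and the expanded ~/.config/systemd/user path as the second.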

Start papermerge:

systemctl --user start papermerge.target

2.4 Docker

With this method you will need git, docker and docker-compose installed.

1. Install Docker

2. Install docker-compose

3. Clone Papermerge Repository:

git clone https://github.com/ciur/papermerge papermerge-proj

4. Run the docker-compose command (which will pull images from DockerHub):

cd papermerge-proj/docker
docker-compose up -d

This will pull and start the necessary containers. If you wish, you can use the docker-compose -f docker-compose-dev.yml up --build -d command instead to build local images.

Check if services are up and running:

docker-compose ps

Papermerge Web Service is available at http://localhost:8000. For the initial sign in use:

URL: http://localhost:8000
username: admin
password: admin

You can check the logs of each service with:

docker-compose logs worker
docker-compose logs app
docker-compose logs db

2.5 Ansible (Semiautomated)

Coming soon. . .


2.6 Jenkins + Ansible (Fully Automated Deployment)

To be added. . .


CHAPTER 3

Languages Support

Theoretically, all languages supported by Tesseract (over 130) can be used.

But for my own needs only two were required:

• German

• English

Thus, only support for these two languages is provided. Both localization (of the user interface) and OCRing of documents in German and English are basically hardcoded into the project.


CHAPTER 4

REST API

Screencast demo

The REST API is a way to interact with Papermerge far beyond the Web Browser realm. It gives you the power to extend Papermerge in many interesting ways. For example, it allows you to write a simple bash script to automate uploading of files from a specific location on your local (or remote) computer.

Another practical scenario where the REST API can be used is to automatically import attached documents from a given email account (well, you need some sort of 3rd party script for that).

4.1 How It Works?

Instead of the usual sign in with username and password via a Web Browser, you sign in with a token (a fancy name for a sequence of numbers and letters) from practically any software which supports the HTTP protocol.

Thus, working with the REST API is a two step process:

1. get a token

2. use the token from 3rd party REST API client

4.1.1 Get a Token

1. Click User Menu (top right corner) -> API Tokens

2. Click New Token

3. Decide on the number of hours the token will be valid. The default is 4464 hours, which is roughly equivalent to 6 months. Click the Save button.

4. After you click the Save button, two information messages will be displayed. Write down your token from the Remember the token: . . . info window.
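For reference, the default lifetime from step 3 converts to days as shown in this short calculation (the constant name is made up for the example):

```python
from datetime import timedelta

DEFAULT_TOKEN_HOURS = 4464  # default offered in the New Token form

lifetime = timedelta(hours=DEFAULT_TOKEN_HOURS)
print(lifetime.days)  # 186 days, i.e. about six months
```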


Fig. 1: “API Tokens” in User Menu (step 1)

Fig. 2: “New token” button (step 2)


Important: Write down your token. For security reasons, it will be displayed only once. In the picture below, it is the one marked in red.

Important: Tokens are saved in the database encrypted. The token’s encrypted version is called the digest. In the tokens table (by the way, you can have as many tokens as you like), the first column displays the first 16 characters of the digest. It is a way to identify the token. In the picture below, the token’s digest is marked in green.

4.1.2 Use the Token

Once you have your REST API token, you can use Papermerge with any HTTP client; just remember to include the REST API token as a header using the following format:

Authorization: Token <your token here>

Let’s see some examples with curl. The simplest REST API call is:

curl -H "Authorization: Token 7502db85f8d40bc7f4f5ab0a4e4fee4a" \
    <HOST>/api/documents

If you get a 2XX response, it means your Authorization header and token are correct.

Upload a local file to the remote host specified with <HOST>:

curl -H "Authorization: Token 7502db85f8d40bc7f4f5ab0a4e4fee4a" \
    -T /home/eugen/documents/demo/2019/berlin1.pdf \
    <HOST>/api/document/upload/berlin_x1.pdf

Notice that the local file name is berlin1.pdf while it features in the URL as berlin_x1.pdf. This way the file is renamed on upload.

You can upload files without specifying their remote name; in that case the remote file will have the same name as the local file:

curl -H "Authorization: Token 7502db85f8d40bc7f4f5ab0a4e4fee4a" \
    -T /home/eugen/documents/demo/2019/berlin1.pdf \
    <HOST>/api/document/upload/

Note: Notice the trailing / character. When uploading a file with curl without specifying a file name, the URL must end with /. With a trailing /, curl appends the local file name to the URL, so the file is not renamed.

Your (REST API) uploaded files will end up in Inbox.
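The same calls can be made from Python with nothing but the standard library. The sketch below prepares the requests without sending them; the function names are hypothetical, <HOST> is assumed to include the scheme (e.g. http://localhost:8000), and the URLs are the ones used in the curl examples above.

```python
import urllib.request
from pathlib import Path
from urllib.parse import quote


def auth_headers(token: str) -> dict:
    """Authorization header in the format the REST API expects."""
    return {"Authorization": f"Token {token}"}


def list_documents_request(host: str, token: str) -> urllib.request.Request:
    """Prepare a GET request for <HOST>/api/documents."""
    return urllib.request.Request(
        f"{host}/api/documents", headers=auth_headers(token), method="GET"
    )


def upload_request(host, token, local_path, remote_name=None):
    """Prepare a PUT request that uploads local_path.

    Like curl with a trailing slash, the local file name is used
    when remote_name is not given.
    """
    path = Path(local_path)
    name = quote(remote_name or path.name)  # escape spaces and the like
    return urllib.request.Request(
        f"{host}/api/document/upload/{name}",
        data=path.read_bytes(),
        headers=auth_headers(token),
        method="PUT",
    )
```

A prepared request is sent with urllib.request.urlopen(request); a 2XX status on the response means the header and token were accepted.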

4.2 REST API Reference

REST API authorization header:

• name: Authorization

• value format: Token <your-token-here>

Example:


Fig. 3: In red color is your (example) token (step 4)


Fig. 4: Files uploaded with REST API end up in Inbox.


curl ... -H "Authorization: Token <your-token-here>"

REST API URLs:

URL                               HTTP Method   Description
/api/documents                    GET           JSON list of all documents
/api/document/<id>                GET           JSON info about the document with id=<id>
/api/document/upload/             PUT           Uploads unnamed file (random name will be assigned)
/api/document/upload/<filename>   PUT           Uploads named file


CHAPTER 5

Page Management

Screencast demo

Page management is a new set of Papermerge features for managing pages. In other words, you can delete, reorder, cut and paste pages.

Scanning documents in bulk often results in documents with blank pages; some pages may be out of order or may belong to a totally different document. Even if the user notices these flaws immediately, it is time consuming and frustrating to redo the scanning process. Thus it is a welcome feature of Papermerge to allow the user to fix such pages in the application.

5.1 Delete Page(s)

Delete those blank pages. Although my scanner has an automatic “remove blank pages” feature, it misses some blank pages. So I find it very practical to allow users to remove blank pages themselves.

5.2 Reorder Pages

Out of order pages occur very often during the scanning process. Papermerge allows users to change the order of pages within the document.

5.3 Cut & Paste

You can move pages from one document to another. Once you cut one or several pages from a document, you can either paste them inside another document, where they become part of that document, or paste them in the file browser, which creates an entirely new document from the cut pages.


CHAPTER 6

Settings

These are the configuration settings for the Papermerge Web App. Configuration settings are used in the same manner as for any Django based project.

Settings which are common to all environments (production, development, staging) are defined in the papermerge.config.settings.base module.

If you want to reuse papermerge.config.settings.base, create a Python file, for example staging.py, and import all settings from the base module:

from .base import *

DEBUG = False
STATIC_ROOT = '/www/static/'

The example above assumes that staging.py was created in the same folder as base.py. Don’t forget to point the DJANGO_SETTINGS_MODULE environment variable to your settings module.
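One common way to point DJANGO_SETTINGS_MODULE at a settings module is to set a default at process start, as Django’s generated manage.py does. In the sketch below the dotted path config.settings.staging is an assumption; use the path under which your own settings file can be imported.

```python
import os

# "config.settings.staging" is an assumed module path, shown for illustration.
# setdefault keeps any value already exported in the shell environment.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "config.settings.staging")
```

Exporting the variable in the shell before starting the process works equally well and takes precedence over this default.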

6.1 STORAGE_ROOT

• local:/<path to local folder>

• s3:/<path to bucket>

Defines either a local or a remote location where documents are stored. In the case of local, its meaning is the same as for Django’s MEDIA_ROOT. In the case of S3 storage, it indicates the path to the S3 bucket.

Examples:

STORAGE_ROOT = 'local:/home/vagrant/papermerge-proj/run/media'  # good for development env
STORAGE_ROOT = 's3:/yourbucketname/alldocuments'  # suitable for production


Note: If you choose not to use S3 storage, STORAGE_ROOT needs to be set to a local:/... path and the S3 option must be set to False. The other way around, if you want to use S3 storage, both STORAGE_ROOT and S3 need to be set accordingly (S3=True, STORAGE_ROOT='s3:/bucketname').
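The rule from this note is easy to get wrong when switching environments, so a sanity check can help. The function below is a hypothetical sketch of such a check, not something Papermerge provides.

```python
def check_storage_settings(s3: bool, storage_root: str) -> None:
    """Raise ValueError when S3 and STORAGE_ROOT contradict each other."""
    expected_prefix = "s3:/" if s3 else "local:/"
    if not storage_root.startswith(expected_prefix):
        raise ValueError(
            f"S3={s3} requires STORAGE_ROOT to start with {expected_prefix!r}, "
            f"got {storage_root!r}"
        )
```

Calling it at the end of a settings module, with the module’s own S3 and STORAGE_ROOT values, makes a mismatch fail at startup instead of at the first upload.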

6.2 S3

• True|False

Tells Papermerge whether to use S3 storage. S3=True is more suitable for production environments.

Note: In case S3=True you need to point STORAGE_ROOT to an S3 location.

6.3 OCR

• True|False

Enables or disables OCR features. With OCR=False no workers need to be configured.

6.4 DATABASES

This is a Django specific configuration setting. Papermerge uses PostgreSQL as its database, which means that the ENGINE option must be set to django.db.backends.postgresql. Example:

DATABASES = {
    'default': {
        'NAME': 'db_name',
        'ENGINE': 'django.db.backends.postgresql',
        'USER': 'db_user',
        'PASSWORD': 'db_password'
    },
}

6.5 STATICFILES_DIRS

Include the absolute path where the papermerge-js static files are.

Example:

STATICFILES_DIRS = [
    '/home/vagrant/papermerge-js/static'
]


CHAPTER 7

Developers Guide

Documentation, notes and general info for developers (myself included).

7.1 Contributing

This document describes in detail how you can contribute to the Papermerge project.

7.1.1 Fix a Typo

Contribute to the project just by fixing text typos. Like tis one. Yes, English is not my native lnguage and I do lots of typoz.

Fixing documentation typos is the easiest and fastest way to contribute to the project. Even if you correct one minor typing mistake I will add you to the list of contributors.

7.1.2 Open an Issue

Another way to contribute is to open issues. Obviously this means you need to run the application at least once and test it.

7.1.3 Add Your Language Support

Adding language support is not as trivial as fixing a typo or opening an issue, but it is not that difficult either. In any case, there is a separate page in the developer guide for it.

7.2 Design

A brief description of the architecture of Papermerge and why such design decisions were taken. The Papermerge project has 2 parts:


• Web Application

• Workers

The Web Application is further divided into Frontend and Backend. As a result there are 3 separate repositories that are part of one whole.

Fig. 1: High level design. Backend and frontend are separate.

7.2.1 1. Frontend

Papermerge-js Repository

Warning: The name papermerge-js is misleading, because it implies that only javascript is used, which is not true. This project manages all static assets: javascript, css, images, fonts.

Modern web applications tend to use a lot of javascript and css. Javascript code, as opposed to code written in Python, becomes increasingly difficult to manage. The same goes for css. To deal with codebase complexity, I decided to split the frontend off as a completely separate project. This project is a Webpack project. In practice this makes it a little bit easier to deal with growing javascript code complexity. The outcome of this project, among others, is two important files:

<papermerge-js>/static/js/papermerge.js
<papermerge-js>/static/css/papermerge.css

There are other static files as well, like images and fonts. However, images and fonts are just placed in <papermerge-js>/static and nothing really interesting happens with them.


7.2.2 2. Backend

Papermerge-proj Repository

The backend is a standard Django application. It uses static files from the frontend part. Throughout the documentation it is referred to as the backend because the term webapp is more general (webapp = backend + frontend).

7.2.3 3. Workers

Papermerge-worker Repository

Workers perform OCR on the documents. Documents are passed by reference (see note below) from the backend to the workers via a shared location. In the simplest setup, when everything runs on the same machine, the shared location is just a folder on the local machine accessible by both the worker and the backend. In production, the shared location is an S3 bucket.

Note: There are at least two distinct methods of passing documents from the backend to the workers. The first method, which is very simple but wrong, is for the backend to transfer the entire document byte by byte to the worker. Without diving deep into technical details, this method is not scalable because it depletes the backend’s memory very quickly.

Instead, the backend tells workers which documents they need to OCR by passing them the document id (it passes the user id and language name as well).
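To make the idea of passing by reference concrete, here is a minimal sketch of what such a message could look like. The class and field names are assumptions for illustration; the actual task signature in papermerge-worker may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class OcrRequest:
    """Everything the backend sends: identifiers only, no document bytes."""
    user_id: int
    document_id: int
    lang: str


def encode(request: OcrRequest) -> str:
    """Serialize the request into a small JSON message for the broker."""
    return json.dumps(asdict(request), sort_keys=True)


def decode(message: str) -> OcrRequest:
    """Rebuild the request on the worker side."""
    return OcrRequest(**json.loads(message))
```

The message stays a few dozen bytes regardless of document size; the worker uses the identifiers to locate the file in the shared location.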

Fig. 2: Backend passes documents to workers by reference.

7.3 Branching Model

All current development goes into the master branch.

Papermerge versions branch from the master branch and are tagged with a specific version. This is easier to explain with a picture.


Fig. 3: Branching model used by Papermerge project.


• Stable version branches are named stable/1.0.x, stable/1.1.x etc.

• Git tagging is used to mark specific software version e.g. v1.0.0, v1.1.0, v1.2.0 and so on.

7.3.1 Worker, Papermerge-js Branching Model?

Well, both worker and papermerge-js will follow the same model.

Note: I started the branching model described above somewhere around 14th February 2020 and I have applied it only on the main project - unfortunately at that moment I forgot about the other two parts.

As a temporary workaround I tagged both worker and papermerge-js with v1.1.0 tags to mark their compatibility point in time with the main project.

7.3.2 Git Branching/Tagging Blitz Introduction

To check out the branch stable/1.1.x, use the command:

$ git checkout stable/1.1.x

To check out a tagged commit, say a commit tagged v1.1.0, you use the same command as for checking out a branch:

$ git checkout v1.1.0

7.4 Language Support

By default, papermerge is hardcoded to work with documents in only two languages - German and English. Theoretically it can support more than 100 languages. However, I, as developer and user of this software, included in papermerge only what was useful for me (German and English).

You can contribute to this project by adding support (and testing it) for your own language. It is an extremely rewarding experience, because:

• it is fun

• you will learn a lot

• you will create something useful for you and others

7.4.1 What is Language Support?

There are two parts to consider:

• User Interface language (text like username, Log out)

• Document Content (actual content of your documents)


7.4.2 User Interface Language

User Interface language is the text the user sees and interacts with. For example, the label for username in German is Benutzername, and the text for Log out in German is Abmelden. To localize the user interface (UI) in your own language you need to be familiar with the Django way, because the main web application is a Django project.

Contributing to the project in this sense basically means creating/updating the papermerge/<langcode>/LC_MESSAGES/django.po file.

7.4.3 Document Content Language

Every document uploaded to papermerge will be OCRed by the tesseract command line utility. The tesseract command requires the -l <lang> argument to indicate the language of the document. This is the heart of document language support. Have a look at the worker’s shortcuts module, specifically the extract_hocr and extract_txt functions. Both functions build the tesseract command with the language as first argument.
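As a rough illustration of what building such a command looks like, here is a minimal sketch. It is not the code of extract_hocr or extract_txt; the function name build_tesseract_cmd and its parameters are assumptions. It relies only on the tesseract command line convention of an input image, an output base name, the -l option and a trailing config name (hocr or txt).

```python
from typing import List


def build_tesseract_cmd(
    lang: str,
    image_path: str,
    output_base: str,
    output_format: str = "hocr",
) -> List[str]:
    """Build (but do not run) a tesseract command with the language first."""
    if output_format not in ("hocr", "txt"):
        raise ValueError(f"unsupported output format: {output_format!r}")
    # The list form can be handed to subprocess.run() without a shell.
    return ["tesseract", "-l", lang, image_path, output_base, output_format]


print(build_tesseract_cmd("deu", "page.jpg", "page"))
```

Keeping the command as a list of arguments avoids shell quoting problems with file names that contain spaces.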

To check what languages you have installed for tesseract, use command:

$ tesseract --list-langs

In my case, it lists deu and eng - which are codes for German and English languages.
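If you want to read that list from a script, a small parser is enough. The sketch below assumes the output consists of a header line ending with a colon followed by one language code per line; the function name parse_list_langs is hypothetical, and the sample text is typed in by hand rather than captured from a real run.

```python
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by tesseract --list-langs."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon.
        if not line or line.endswith(":"):
            continue
        langs.append(line)
    return langs


sample = "List of available languages (2):\ndeu\neng\n"
print(parse_list_langs(sample))  # ['deu', 'eng']
```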

OCRing of the documents (tesseract -l deu path/to/doc) happens on the worker side. I explained this because it is important to know, but for adding language support you don’t need to change anything in the worker, because the worker only takes orders and blindly executes them.

The entry point for the worker part is the task module with its ocr_page function. Again, no need to change anything here; I mention this only because it is important to know.

The first thing you need to have a look into and change is dynamic_preferences_module, where the configuration for the language is defined.

You will need to add a new choice in the OcrLanguage class. The code for the new language must match a language code listed by tesseract --list-langs. This change will add a new entry in the UI and will allow the user to choose the new language for the document.
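The matching rule can be checked mechanically. The sketch below does not reproduce the OcrLanguage class; OCR_LANGUAGE_CHOICES is a stand-in tuple of (code, label) pairs in the usual Django choices shape, the spa entry is a hypothetical new language, and unsupported_choices is a made-up helper name.

```python
from typing import Iterable, List, Tuple

# Stand-in for the choices of the OcrLanguage preference (an assumption).
OCR_LANGUAGE_CHOICES: Tuple[Tuple[str, str], ...] = (
    ("deu", "Deutsch"),
    ("eng", "English"),
    ("spa", "Español"),  # hypothetical newly added language
)


def unsupported_choices(
    choices: Iterable[Tuple[str, str]],
    installed_langs: Iterable[str],
) -> List[str]:
    """Return choice codes that tesseract does not list as installed."""
    installed = set(installed_langs)
    return [code for code, _label in choices if code not in installed]


# With only deu and eng installed, the new entry is reported as missing.
print(unsupported_choices(OCR_LANGUAGE_CHOICES, ["deu", "eng"]))  # ['spa']
```

An empty result means every choice offered in the UI has a matching tesseract language installed.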

But the tricky part is making the change at the database level. The thing is, papermerge makes use of the PostgreSQL full text search feature, which means it needs to store an up-to-date version of a tsvector type column. How to create and search tsvector type columns is described in the PostgreSQL documentation.

Every time the page.text column is changed, a database level trigger is fired to update the language specific tsvector column. Triggers for this job are defined in the papermerge/core/pgsql/01_triggers.sql file.

Another important language related sql file is papermerge/core/pgsql/03_update_lang_cols.sql. This sql code is executed periodically by the papermerge/core/management/commands/update_fts.py command. It is responsible for moving page.text to page.text_deu or page.text_eng.

Both page.text_eng and page.text_deu are tsvector type columns with preset weight ‘C’.
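To give a feeling for what filling such a column involves, here is a sketch that generates one UPDATE statement per language. It is not the content of 03_update_lang_cols.sql. The table name core_page, the lang column used in the WHERE clause and the function name update_lang_column_sql are assumptions; to_tsvector, setweight and coalesce are standard PostgreSQL functions, and german and english are built-in text search configurations. Note that the tesseract code (deu) and the PostgreSQL configuration name (german) differ, so a mapping is needed.

```python
# Maps tesseract language codes to PostgreSQL text search configurations.
PG_CONFIG_BY_LANG = {
    "deu": "german",
    "eng": "english",
}


def update_lang_column_sql(lang: str, table: str = "core_page") -> str:
    """Build an UPDATE that fills text_<lang> with a weight 'C' tsvector."""
    # A KeyError here means the language has no known PostgreSQL configuration.
    config = PG_CONFIG_BY_LANG[lang]
    return (
        f"UPDATE {table} "
        f"SET text_{lang} = "
        f"setweight(to_tsvector('{config}', coalesce(text, '')), 'C') "
        f"WHERE lang = '{lang}';"  # the lang column is an assumption
    )


print(update_lang_column_sql("deu"))
```

Adding a new language would then mean adding a new text_<lang> column, a new entry in the mapping and a matching trigger.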


CHAPTER 8

Indices and tables

• genindex

• modindex

• search
