Project Preferences

The first GiDoc entry, 0: Preferences:, opens a window like the one shown in Fig. 2, in which GiDoc preferences can be set on a per-task or per-project basis. Two buttons at the top create or load a project. The main window area, in the middle, is divided into four tabs: Project, Preprocessing, Training and Recognition.

**Figure 2:** Project tab.

As shown in Fig. 2, the Project tab consists of three items:

Name. The project name is mainly used to define a directory under the global GiDoc preferences directory, "$HOME/.gidoc", where all project-specific information will be saved.
Document directory. In addition to the project-specific directory defined above, a separate document directory is defined here, where GiDoc looks for (handwritten) text images. The images are assumed to be saved in GIMP's native XCF file format.
Lock transcriptions. If enabled, a warning appears when trying to modify the page transcriptions.
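
To make the relationship between the project name and the two directories concrete, here is a minimal Python sketch. It assumes that the project directory is simply a subdirectory of "$HOME/.gidoc" named after the project, and that text images are recognised by their ".xcf" extension. The function names `project_dir` and `list_xcf_images` are hypothetical and are not part of GiDoc.

```python
from pathlib import Path
from typing import List, Optional


def project_dir(name: str, home: Optional[Path] = None) -> Path:
    """Return the assumed project directory: <home>/.gidoc/<name>."""
    base = home if home is not None else Path.home()
    return base / ".gidoc" / name


def list_xcf_images(document_dir: Path) -> List[Path]:
    """List the XCF files found directly inside the document directory."""
    return sorted(p for p in document_dir.iterdir()
                  if p.is_file() and p.suffix.lower() == ".xcf")
```

For example, `project_dir("letters")` would point at "$HOME/.gidoc/letters" under this assumption, while the document directory passed to `list_xcf_images` can live anywhere else on disk.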

As can be seen in Fig. 3, the Preprocessing tab includes preferences for both document and line preprocessing. Document preferences comprise two items:

Histogram type. Line detection is based on a rather standard projection-based method in which horizontally-averaged pixel values or black/white transitions are projected vertically. This option selects which of the two, pixel values or black/white transitions, is used to build the projection.
Number of lines. Maximum number of lines in a text block.
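
The projection idea behind these two options can be illustrated with a short Python sketch. It is not GiDoc's implementation: the image is assumed to be a list of rows of grey levels (0 for black, 255 for white), and the names `row_profile`, `detect_lines`, `ink_threshold` and `min_value` are hypothetical. In particular, the band-grouping rule and the way the maximum number of lines is enforced are assumptions made for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of grey levels: 0 = black ink, 255 = white paper


def row_profile(image: Image, histogram_type: str = "pixels",
                ink_threshold: int = 128) -> List[float]:
    """Return one value per image row (the vertical projection profile)."""
    profile = []
    for row in image:
        if histogram_type == "pixels":
            # Average darkness across the row: higher means more ink.
            profile.append(sum(255 - v for v in row) / len(row))
        elif histogram_type == "transitions":
            # Count ink/paper changes between horizontally adjacent pixels.
            ink = [v < ink_threshold for v in row]
            profile.append(float(sum(a != b for a, b in zip(ink, ink[1:]))))
        else:
            raise ValueError("unknown histogram type: " + histogram_type)
    return profile


def detect_lines(profile: List[float], min_value: float,
                 max_lines: int) -> List[Tuple[int, int]]:
    """Group consecutive rows above min_value into (top, bottom) bands."""
    bands = []
    start = None
    for y, value in enumerate(profile):
        if value > min_value and start is None:
            start = y
        elif value <= min_value and start is not None:
            bands.append((start, y - 1))
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    # Keep at most max_lines bands, preferring those with the largest sum.
    bands.sort(key=lambda b: sum(profile[b[0]:b[1] + 1]), reverse=True)
    return sorted(bands[:max_lines])
```

Rows that contain text yield high profile values with either histogram type, so the text lines show up as bands separated by low-valued gaps.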

**Figure 3:** Preprocessing tab.

Line preferences refer to preprocessing and feature extraction for HMM modelling of text line images. Each text line image is first preprocessed and then transformed into a sequence of (fixed-dimension) feature vectors in accordance with the following preferences:

Custom procedure. Preprocessing of a text line image depends heavily on image quality and user preferences. GiDoc provides a GIMP script function which implements a rather standard procedure, though a different custom procedure can be specified here.
Pixels over and under baseline. Conventional feature extraction methods work on a bounding box enclosing the whole text line image. However, GiDoc assumes that only text baselines are available, and thus here two options are included to define (bounds for) the locations of the upper and lower bounding box lines, with respect to the baseline.
Feature extraction type. This is to select the preferred feature extraction method from the two methods implemented in GiDoc: PRHLT and FKI. Please see Sect. for more details on them.
Number of features and derivatives. These options only apply to PRHLT feature extraction.
Image normalized height. This option only applies to FKI feature extraction.
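
As an illustration of the "pixels over and under baseline" options, the following Python sketch derives the rows of a line's bounding box from its baseline. It assumes, for simplicity, a horizontal baseline given by a single row index, with row 0 at the top of the image. The function names are hypothetical, and GiDoc's actual handling of baselines may differ.

```python
from typing import List, Tuple


def line_bounding_rows(baseline_y: int, pixels_over: int, pixels_under: int,
                       image_height: int) -> Tuple[int, int]:
    """Return the (top, bottom) rows of the box, clipped to the image."""
    top = max(0, baseline_y - pixels_over)
    bottom = min(image_height - 1, baseline_y + pixels_under)
    return top, bottom


def crop_line(image: List[List[int]], baseline_y: int,
              pixels_over: int, pixels_under: int) -> List[List[int]]:
    """Extract the rows of a text line from an image stored as a list of rows."""
    top, bottom = line_bounding_rows(baseline_y, pixels_over, pixels_under,
                                     len(image))
    return image[top:bottom + 1]
```

With a baseline at row 100, 40 pixels over and 15 pixels under, the extracted line spans rows 60 to 115, provided the image is tall enough.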

As its name indicates, the Training tab groups all options related to model training. As shown in Fig. 4, this tab is divided into two parts: HMMs and language model.

**Figure 4:** Training Options

Options included in the HMMs part are:

HMM trainer. HTK HMM training software.
Symbol list. HTK symbol list file.
HMMs. HTK HMMs file.
Iterations. Number of iterations performed in each training step.
Initialise. If enabled, HMMs are trained from scratch. Otherwise, GiDoc will retrain previous models.
Number of states. Number of states per HMM.
Mixture components. Number of Gaussian mixture components per state.

The language model part includes the following options:

Word net. HTK language model file.
Vocabulary. HTK dictionary file.
SRILM command. Parameters passed to SRILM's ngram-count command to calculate the language model.
Special Characters. File containing the characters that split words during language model creation.
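
The role of the last two options can be sketched in Python as follows. This is only an illustration under stated assumptions: special characters are assumed to become tokens of their own (rather than being dropped), and the user-supplied parameters are assumed to be appended to a basic ngram-count call. The `-text`, `-lm` and `-order` options belong to SRILM's ngram-count, but the function names and the way GiDoc actually assembles the command are hypothetical.

```python
import re
import shlex
from typing import List


def split_words(text: str, special_chars: str) -> List[str]:
    """Split text on whitespace, then isolate each special character."""
    if not special_chars:
        return text.split()
    # Capturing group so that the separators are kept as tokens.
    pattern = "([" + re.escape(special_chars) + "])"
    tokens = []
    for chunk in text.split():
        tokens.extend(t for t in re.split(pattern, chunk) if t)
    return tokens


def ngram_count_args(text_file: str, lm_file: str,
                     extra_params: str) -> List[str]:
    """Build an argument list for ngram-count (assumed composition)."""
    return (["ngram-count", "-text", text_file, "-lm", lm_file]
            + shlex.split(extra_params))
```

For instance, with "," and "." as special characters, the string "Hello, world." would be tokenised as "Hello", ",", "world" and "." before counting n-grams.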

The last tab, Recognition, includes options for both recognition and verification (see Fig. 5):

WIP. HTK word insertion penalty.
GSF. HTK grammar scale factor.
Maximum active states. HTK recognition pruning.
Verification. If enabled, recognition is followed by hypothesis verification.
Reliable threshold. Recognized words with confidence above this threshold are assumed to be reliable (correctly recognized).
Unreliable threshold. Recognized words with confidence below this threshold are considered unreliable (incorrectly recognized).
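
The following Python sketch illustrates how these options could interact. The first function follows the usual HTK convention in which each language model log probability is multiplied by the grammar scale factor and then offset by the word insertion penalty. The second applies the two thresholds to a word confidence. The function names and the "uncertain" label for words falling between the thresholds are hypothetical, since this manual does not say how such words are treated.

```python
from typing import List


def hypothesis_score(acoustic_logprob: float, lm_logprobs: List[float],
                     gsf: float, wip: float) -> float:
    """Combine acoustic and language model log scores for one hypothesis."""
    # Each word contributes its scaled LM score plus the insertion penalty.
    return acoustic_logprob + sum(gsf * lp + wip for lp in lm_logprobs)


def classify_word(confidence: float, reliable_threshold: float,
                  unreliable_threshold: float) -> str:
    """Label a recognized word according to the two verification thresholds."""
    if unreliable_threshold > reliable_threshold:
        raise ValueError("unreliable threshold exceeds reliable threshold")
    if confidence > reliable_threshold:
        return "reliable"
    if confidence < unreliable_threshold:
        return "unreliable"
    return "uncertain"  # assumed label; not defined in this manual
```

A more negative word insertion penalty favours hypotheses with fewer words, while a larger grammar scale factor gives the language model more weight relative to the HMMs.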

**Figure 5:** Recognition Options

giDoc Team