Available at http://www.zfdg.de
(c) Forschungsverbund MWW. Unless otherwise indicated, media rights remain with the authors.
Automatic writer identification has received considerable attention over the past decade. However, most work has been restricted to contemporary benchmark datasets, which typically contain no noise or artefacts. This article analyses whether the current state-of-the-art method for automatic writer identification performs equally well on historical handwritten data. In contrast to contemporary data, historical data often contain artefacts such as holes, rips, or water stains, which make reliable identification error-prone. We conducted experiments on two large letter collections of known authenticity and achieved promising results of 82% and 89% TOP-1 accuracy.
In recent years, Automatic Writer Identification (AWI) has received a lot of attention in the document analysis community. However, most research has been conducted on contemporary benchmark sets. These datasets typically do not contain any noise or artefacts caused by the conversion methodology. This article analyses how current state-of-the-art methods in writer identification perform on historical documents. In contrast to contemporary documents, historical data often contain artefacts such as holes, rips, or water stains which make reliable identification error-prone. Experiments were conducted on two large letter collections with known authenticity and promising results of 82% and 89% TOP-1 accuracy were achieved.
Similar to someone’s face or fingerprints, handwritten text can serve as a biometric identifier. In this sense, writer identification can be regarded as a biometric recognition task.
Within the document analysis community, Automatic Writer Identification (AWI)
has gained significant attention in the last decade. Several competitions
with the objective of identifying a given scribe were organised at prominent
conferences such as the International Conference on Document Analysis and
Recognition (ICDAR) and the International Conference on Frontiers in
Handwriting Recognition (ICFHR). Nevertheless, these competitions were
conducted on contemporary or even artificial data sets.
Only recently has AWI also been applied to historical documents. For example, the project DAmalS (Datenbank zur Authentifizierung mittelalterlicher Schreiberhände; a database for the authentication of medieval scribal hands) applied writer identification techniques to a few dozen medieval scribal hands.
Since we are dealing with historical documents, the focus of our work is similar; however, our datasets consist of hundreds rather than dozens of scribal hands. Specifically, we employ two large collections of letters: the Clusius dataset and the Schuchardt dataset (see figure 1). Furthermore, our evaluation is not limited to a document (manuscript) as a whole: we also ran our evaluation with single pages as the unit of measurement.
Before AWI methods can be applied, several preprocessing steps usually need
to be carried out. First of all, the actual text regions of a page image
must be separated from regions containing non-text elements. For example,
the scanning protocols of libraries involve the addition of a colour pattern
and a ruler so that the colour information and the real scale can be
reconstructed. However, for the purpose of document analysis, these parts of
the page image need to be removed. Additionally, artefacts in the background
of the document, such as folds or graphical illustrations, are not relevant
for writer identification since we only want to analyse the text. Thus, we
first detect the text areas in the document image, as described in more
detail in the following section (figure 2). In the second
step, the colour of the documents, or, more precisely, of the regions
containing text, is reduced to 1 bit, i.e., the text regions are binarized
(figure
3). As a result, the contour of the script is represented as a
black line (figure 4). In the third step, features of the contour are
extracted. A background model is computed from all feature descriptors of
the whole dataset (or training set), and this model is in turn further used
to compute global image descriptors for each page of the collection. Note
that this process is very similar to speaker identification in audio
signals.
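The aggregation step can be illustrated with a toy sketch. The following is not the method used in this work, which relies on a trained statistical background model, but a simplified VLAD-style pooling that conveys the principle: local features are pooled against a set of background centroids into one fixed-length global descriptor per page (the function name and the centroids are hypothetical stand-ins).

```python
import numpy as np

def aggregate_page(local_feats, centroids):
    """Pool local features against 'background' centroids (VLAD-style):
    each feature contributes its residual to the nearest centroid; the
    stacked residuals form one global, L2-normalised page descriptor."""
    # Distance of every local feature to every background centroid.
    dists = np.linalg.norm(local_feats[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    residuals = np.zeros_like(centroids, dtype=float)
    for feat, idx in zip(local_feats, nearest):
        residuals[idx] += feat - centroids[idx]
    desc = residuals.ravel()
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```

In the actual pipeline, the centroids would be replaced by the background model trained on the feature descriptors of the whole collection, but the outcome is the same: one comparable global descriptor per page.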
A bottom-up approach is used to analyze the page image in order to detect
the text regions and to separate them from any other type of region,
such as graphics or noise. This approach demonstrates more robustness
with respect to noise or poorly pre-processed images than top-down
approaches. First, the characters are grouped into words in the binary
image using Local Projection Profiles (LPPs). Issues arising from merged
ascenders and descenders between text lines are resolved using a rough
text line estimation based on a first derivative anisotropic Gaussian
filtering. Then, continuous local maxima are detected in the filtered
image in order to split text lines that are merged. After these
processing stages, the contour of each word is known. To keep the
processing fast and the subsequent algorithms simple, it is preferable
to represent words by an enclosing rectangle rather than by their
contour. We introduced profile boxes (see figure 5) that are
computed by robustly fitting lines to a word's upper and lower profile.
Having detected both lines, the profile box is defined to have the mean
angle of both lines, a height which is the mean distance between the
lines, and a width corresponding to the maximal length of both lines. A
detailed description of the text detection is presented in the work of
Diem et al.
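As a rough illustration, the profile-box construction just described can be sketched as follows. This is a simplification: an ordinary least-squares fit stands in for the robust line fitting, and the function name is invented for illustration.

```python
import numpy as np

def profile_box(upper, lower):
    """Compute a profile box from a word's upper and lower profiles.

    upper, lower: y-coordinates (one per x-column) of the word's top and
    bottom contour. A line is fitted to each profile; the box takes the
    mean angle of both lines, the mean distance between them as height,
    and the maximal line length as width.
    """
    x = np.arange(len(upper))
    su, iu = np.polyfit(x, upper, 1)   # slope, intercept of upper line
    sl, il = np.polyfit(x, lower, 1)   # slope, intercept of lower line
    angle = np.degrees(np.arctan((su + sl) / 2.0))    # mean angle
    height = np.mean((il + sl * x) - (iu + su * x))   # mean line distance
    width = float(len(x))                             # maximal line length
    return angle, height, width
```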
We used the writer identification method of Christlein et al.
The following paragraph provides a brief outline of the algorithm; for
more details please refer to the original work by Christlein et al. In
contrast to typical allograph-based writer identification methods, this approach aggregates the local features of a page against a statistical background model into a single global image descriptor, as outlined above.
In order to work with historical data, we modified this approach
regarding its binarization step. Otsu’s binarization is a global
binarization method. It finds an optimal threshold to separate
foreground from background using all pixel information of the image.
However, a global threshold is often suboptimal when dealing with noisy
data. For example, non-uniform illumination, or the large number of
zero-valued pixels from the cardboard surrounding the document, leads to
a poor threshold. Thus, we employ the local threshold-based method
of Bradley et al.,
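The local thresholding idea can be sketched as follows. This is a plain NumPy approximation of Bradley's integral-image method, not the exact implementation used in this work; the window size and the sensitivity parameter t are illustrative defaults.

```python
import numpy as np

def bradley_threshold(img, window=15, t=0.15):
    """Binarise a grayscale image with a local, integral-image threshold.

    A pixel becomes foreground (text) if it is darker than the mean of
    its surrounding window by more than the fraction t, so non-uniform
    illumination shifts the threshold locally instead of globally.
    """
    img = img.astype(np.float64)
    h, w = img.shape
    # Integral image with a zero border so window sums need 4 lookups.
    integral = np.pad(img.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))
    r = window // 2
    ys, xs = np.mgrid[0:h, 0:w]
    y0, y1 = np.clip(ys - r, 0, h), np.clip(ys + r + 1, 0, h)
    x0, x1 = np.clip(xs - r, 0, w), np.clip(xs + r + 1, 0, w)
    counts = (y1 - y0) * (x1 - x0)          # window sizes (clipped at edges)
    sums = (integral[y1, x1] - integral[y0, x1]
            - integral[y1, x0] + integral[y0, x0])
    means = sums / counts
    # Foreground (1) where the pixel is clearly darker than its local mean.
    return (img < means * (1.0 - t)).astype(np.uint8)
```

Unlike Otsu's single global threshold, a bright gradient across the page or large black borders around the document only affects the threshold in their own neighbourhood.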
The Clusius dataset consists of 1600 letters written to and by one of the
most important sixteenth-century botanists, Carolus Clusius (1526-1609).
It was provided by the Huygens Institute for the History of the
Netherlands (Royal Netherlands Academy of Arts and Sciences), which is
creating a digital edition of Clusius’ correspondence in the
collaborative editing tool eLaborate.
The letters were written by 330 different authors, in 6 different languages, and from 12 European countries. A unique feature of this correspondence is that the authors come from different backgrounds: scholars, physicians, and aristocrats, but also chemists and gardeners, and many of them were women. This variety provides an extremely diverse glimpse into the linguistic and handwriting characteristics of the second half of the sixteenth century, ranging from the clear Latin handwriting of Clusius himself to the (for us) almost unreadable handwriting, in Viennese dialect, of a Lower Austrian noblewoman. The correspondence is mainly about the exchange of plants and information, but also comprises news on politics, friends and family, court gossip, etc. An example image can be seen in figure 1 (left).
While the preserved correspondence comprises 1600 letters in total, only the 1175 letters held by Leiden University Library have been digitized and could therefore be used for the experiment. All scribes had already been identified, though some letters have co-authors, and most aristocrats had secretaries.
The Schuchardt dataset was provided by the project Network of
Knowledge
This eminent scholar, who displayed an extraordinary networking
capability, left more than 13,000 letters addressed to him by more
than 2000 individual writers and ca. 100 institutions over a timespan of
77 years (1850-1927). The letters are in more than 20 languages.
Although Schuchardt’s correspondence has been categorized manually, as mentioned above, reliable writer recognition could be used on Schuchardt’s papers to attribute the loose sheets and notes preserved in the section »Werkmanuskripte«. These were often sent to Schuchardt by his correspondents but were separated from the letters originally containing them and bear no signature. In the future, testing and improving writer recognition on a large dataset of identified scribes from a heterogeneous collection might pave the way for its use in inventorying handwritten archives.
The dataset images are evaluated in a leave-one-page-out scheme: one query image is tested against all remaining images, resulting in an ordered list in which the first returned image has the highest probability of having been written by the same author. From these retrieval lists, we can compute the accuracy of the algorithm. As error metrics, we use the ›Soft‹ TOP-k, the ›Hard‹ TOP-k, and the mean Average Precision (mAP). The Soft TOP-k denotes the precision at rank k, i.e., the probability that the correct writer is among the first k retrieved documents. In contrast, the Hard TOP-k gives the probability that all of the first k documents were written by the same author as the query document. The mean Average Precision is a metric commonly used in information retrieval: for each query document, the average precision over the relevant documents in the retrieval list is computed, where the precision at rank k is the number of relevant documents up to rank k divided by k.
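The three metrics just defined can be written down compactly. The sketch below assumes, for each query, a ranked list of writer labels (most similar first, the query itself excluded):

```python
def soft_top_k(ranked_labels, query_label, k):
    """True if the correct writer appears among the first k results."""
    return query_label in ranked_labels[:k]

def hard_top_k(ranked_labels, query_label, k):
    """True if ALL of the first k results are by the query's writer."""
    return all(label == query_label for label in ranked_labels[:k])

def average_precision(ranked_labels, query_label):
    """Mean of precision@k over the ranks k of all relevant documents."""
    precisions, hits = [], 0
    for k, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / k)   # precision at this relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0
```

The reported mAP is then simply the mean of `average_precision` over all query pages.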
First, we evaluated the datasets as a whole, meaning that we did not separate the datasets into independent training and test sets. Thus, the background model stems from the same data to be evaluated. We decided to use all pages from each scribe who contributed at least two pages. For the Clusius dataset, this resulted in 2029 pages from 182 different scribes. The Schuchardt dataset has 12,846 pages written by 193 different scribes. Note that we discarded several images from both datasets that are not associated with the actual letters (or postcards).
Initial results showed that the TOP-k accuracies are very promising (table 1). In 82% and 89% of all cases, the author of the query page was identified correctly for the Clusius and the Schuchardt datasets, respectively. The high TOP-10 rates of 90% and 97% also suggest a quick detection of the correct writer in the shortlists.
However, the rather low mAP values of 29% and 34% indicate that some pages in the datasets are very difficult to identify. Most likely this is related to images containing very little text, such as letters or postcards containing only the address. Note also that the number of documents per author is very unbalanced; the Clusius dataset, for instance, has six authors who each contributed more than 50 images. Document pages from these authors are likely to be identified more easily than pages from authors who appear more rarely in the dataset.
To allow for a fairer comparison, we conducted another set of experiments in which we picked exactly four pages from each author. The remaining pages of these authors, as well as the images of authors who contributed fewer than four pages, were used solely to train the background model. Table 2 shows how the identification accuracy drops in comparison to the full datasets: the TOP-1 accuracy is about 60%, and the correct author appears among the ten most similar images in only about 75% of the cases. Not surprisingly, this indicates once again that the more training images are available, the easier it becomes to identify the correct scribe.
This experiment also makes it possible to compare the results with a
clean competition dataset.
We see a number of highly interesting use cases for AWI and would like to discuss some of them:
(1) First of all, AWI offers the chance to query large collections of
archival documents based on the writing style of a single person. Without
having seen or transcribed the page images, a user of this type of search
interface would be able to access the writings of a specific hand or scribe.
This method provides completely new ways to search an archive and supports
the idea that it is more important to digitize large collections of
documents and to enrich them automatically, rather than combining manual
metadata labeling with the digitization workflow itself. One could therefore
think of applications in which a user could collect the writings of a person
of interest within several archives using a pre-trained model of this hand,
for example. A prototype implementation of such an AWI-based search and
retrieval tool will become available to the public via the Transkribus
platform. Three kinds of queries are conceivable:
(a) Give me all those page images which are very likely
written by the same hand as my query page (query by example). The
result will be a ranked list of page images according to their distance from
the query page.
(b) Another, similar query could be: I have trained a
writer model based on several dozen or hundreds of page images. Give me
all page images which are similar to this writer model (query by
writer).
(c) If several writer models are available, another query could be: Order all page images according to the writers contained
in the collection. It must be emphasized that in all three cases,
the user will need to deal with probabilities and may need to review and
discard a relatively large number of false alarms but would still be able to
access the writings of a specific person in a previously unknown way.
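A query-by-example search as in case (a) reduces to ranking the global page descriptors by their distance to the query descriptor. A minimal sketch, assuming the descriptors are stored as rows of a matrix (the function name and storage layout are hypothetical):

```python
import numpy as np

def query_by_example(descriptors, query_idx):
    """Rank all pages by cosine similarity to the query page.

    descriptors: (n_pages, d) array of global page descriptors, as
    produced by an AWI pipeline. Returns page indices ordered from most
    to least similar, excluding the query page itself.
    """
    norms = np.linalg.norm(descriptors, axis=1, keepdims=True)
    unit = descriptors / norms            # unit-length rows
    sims = unit @ unit[query_idx]         # cosine similarity to the query
    order = np.argsort(-sims, kind="stable")
    return [int(i) for i in order if i != query_idx]
```

The user-facing result is exactly the ranked list described above; thresholding the similarities would let the interface flag low-confidence matches as probable false alarms.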
(2) A second use case deals with digital editions: As a matter of fact, many
documents which stem from famous persons are not written by these famous
persons themselves. To mention just one prominent example: Jeremy Bentham
(1748–1832) left behind tens of thousands of pages of unpublished works,
personal papers, and correspondence. These papers are currently being
transcribed and published by the Transcribe Bentham project; here, AWI could help to distinguish pages written in Bentham's own hand from those written by copyists or secretaries.
(3) A third use case takes a similar direction. The description of different hands is currently often based on the scholar's tacit knowledge and on simple examples. A scholar may ›know‹ a specific hand and describe its features, such as how certain characters are formed, but there are hardly any objective measures that can be used to identify a specific hand. Since AWI computes specific descriptors for every text region, the distance between text regions (in our case, pages or documents, but also smaller units such as blocks or lines) can be computed and may provide an objective view of a given hand.
In this article, we provided a first large-scale analysis of Automatic Writer Identification on historical data. Our results reveal that the current algorithm shows significant potential for practical use. At this stage, the technology can already be used to search through a large amount of data and return a short list of the authors most likely to be correct. This can dramatically reduce the manual effort involved in searching for the writing of a specific person or in clustering a given collection. More broadly, it provides new ways to search through archive material and confirms that the digitisation of large amounts of archival material provides significant benefits even without cost-intensive manual metadata editing.
Further research should be dedicated to better layout analysis and to
algorithms that are less error-prone with respect to faulty binarization.
To this end, it would be helpful to provide computer scientists with more
training and evaluation data for testing their algorithms. With more data,
new technologies such as ›deep learning‹ could also be applied to writer
identification.