Abstract

This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work lies in the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. Similar patterns are then grouped into clusters so that the distinct letterforms of a book can be extracted and analysed to compute redundancy rates. This process significantly reduces the number of letterforms to be recognized. Once the letterforms have been clustered, a user may assign a label to each cluster using the second package, RETRO. Each label is then automatically assigned to every character in the corresponding cluster to transcribe the text of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical font recognition methods to improve the recognition results of OCR systems on rare or specific books.
The identification of typographic materials could also be useful for studying both the aesthetic aspects of the history of printing (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and its economic aspects. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.
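
The cluster-then-label workflow described above can be illustrated with a minimal sketch. It is not the AGORA or RETRO implementation: the bitmap format, the greedy Hamming-distance clustering, the threshold `max_dist`, and all function names are assumptions chosen only to show how labelling one representative per cluster transcribes every occurrence, and how a redundancy rate can be computed.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical format: one letterform as a flattened binary bitmap.
Glyph = Tuple[int, ...]


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))


def cluster_glyphs(glyphs: Sequence[Glyph], max_dist: int) -> List[int]:
    """Greedy clustering (an assumed stand-in for the real method).

    Each glyph joins the first cluster whose representative differs by at
    most max_dist pixels; otherwise it becomes a new cluster representative.
    Returns one cluster id per glyph, in reading order.
    """
    representatives: List[Glyph] = []
    assignment: List[int] = []
    for g in glyphs:
        for cid, rep in enumerate(representatives):
            if hamming(g, rep) <= max_dist:
                assignment.append(cid)
                break
        else:
            representatives.append(g)
            assignment.append(len(representatives) - 1)
    return assignment


def redundancy_rate(assignment: Sequence[int]) -> float:
    """Fraction of glyphs that need no manual label (all but one per cluster)."""
    if not assignment:
        return 0.0
    return 1.0 - len(set(assignment)) / len(assignment)


def transcribe(assignment: Sequence[int], labels: Dict[int, str]) -> str:
    """Propagate each cluster's user-assigned label to all of its members."""
    return "".join(labels.get(cid, "?") for cid in assignment)


# Toy 2x2 bitmaps standing in for extracted letterforms.
A = (1, 0, 0, 1)
A_NOISY = (1, 0, 0, 0)  # one pixel away from A, e.g. ink loss
B = (0, 1, 1, 0)

page = [A, B, A_NOISY, A, B]
assignment = cluster_glyphs(page, max_dist=1)
rate = redundancy_rate(assignment)
text = transcribe(assignment, {0: "a", 1: "b"})
```

On this toy page the five letterforms fall into two clusters, so the user labels two representatives and the remaining three occurrences (a redundancy rate of 0.6) are transcribed automatically as part of the string "abaab". With a 90% redundancy rate, the same propagation would leave one letterform in ten to label by hand.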
