Transkribus and the Altona Case Team

This post first appeared on the blog of the Centre for Privacy Studies: https://privacy.hypotheses.org/987


In the Altona Case Team at PRIVACY, we are working with two versions of a late 18th-century text by Johann Peter Willebrand. The text appears in French as Abrégé de la police, accompagné de réflexions sur l'accroissement des villes and in German as Innbegriff der Policey: nebst Betrachtungen über das Wachsthum der Städte.

Cover page of Abrégé de la police

To make our lives easier, our team decided to run the PDFs through OCR, so that we would have searchable and editable texts to work with. However, accuracy varied hugely from one OCR tool to another.

We started with the French version of the text, which we downloaded in PDF format from Google Books. First, we tried Abbyy FineReader. This is a very good (proprietary) app for running OCR on scanned text written in modern languages, but with our early modern material the results were far from acceptable. Next, we tried Transkribus. For 18th-century printed French, we could choose among three publicly available trained models. We tried all of them, with very different results.

Below you will see a screenshot of the four results we obtained with the different tools, displayed side by side.

[caption id="attachment_991" align="alignnone" width="1915"]Comparison of OCR tools: 1) Abbyy FineReader (modern French), 2) Transkribus model French_18thC_Print, 3) Transkribus model LaMOP-Livre_Rouge_1, 4) Transkribus model Parallèle des Anciens et des Modernes M2[/caption]

The clear winner was the Transkribus model French_18thC_Print (developed at the KB National Library of the Netherlands).
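We judged the comparison above by eye. If you want to put a number on such differences, one common measure is the character error rate (CER): the number of character insertions, deletions and substitutions needed to turn the OCR output into a hand-corrected transcription, divided by the length of that transcription. Here is a minimal Python sketch, assuming you have transcribed a sample page by hand. The function names and the sample strings are invented for illustration and are not taken from our text or from either tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical sample: a hand-corrected line and two made-up OCR outputs.
    reference = "la police des villes"
    outputs = {
        "tool A (invented)": "la police des viIles",
        "tool B (invented)": "1a po1ice dcs vi11es",
    }
    for name, text in outputs.items():
        print(f"{name}: CER = {character_error_rate(reference, text):.2%}")
```

A lower CER means fewer corrections per character, so the same hand-corrected page can be used to rank any number of tools or models.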

Then came a surprise: as we fiddled with the Transkribus interface to learn more about this particular model, we discovered that it was one of the models developed by Annemieke Romein, with whom we had already done a couple of Transkribus workshops online. What a serendipitous discovery!

Now we are even more excited to bring Annemieke to UCPH for an in-person training, in a cooperation between our team at the Centre for Privacy Studies and Prof. Gunvor Simonsen and her team from the ERC project In the Same Sea: The Lesser Antilles as a Common World of Slavery and Freedom, at the Saxo Institute.

Here is the model's description on Transkribus:

[Transkribus model French_18thC_Print] is based on printed texts in French (Romantype Font) that was used in Flanders (Low Countries), during the 18th century. The type of sources used for this model, are books of ordinances, which contained the norms ('laws') at the time.
This model has been the result of one of the KB National Library of the Netherlands Researcher-in-Residence position 2019. The project was called 'Entangled Histories'. The books used for this specific model, have been provided by the Bodleian Library Oxford (RECUEIL DES ÉDITS, DÉCLARATIONS, LETTRES-PATENTES, &c. ENREGISTRÉS AU PARLEMENT DE FLANDRES). For more information regarding the background of the model and how to cite it, please visit: https://lab.kb.nl/dataset/entangled-histories-ordinances-low-countries

We look forward to working with Annemieke. If you want to learn more about her work with Transkribus, check out her website, caromein.nl, or find her on Twitter @CARomein.
