Building a New Digital Resource for European Global Studies

Annotation of Asia Directories in Mirador

Atomizer for OCRed Data

Tokenizing Workflow to Correct Extracted Data

The Global Information at a Glance project is a multi-year research investment by the Europainstitut (EIB). It was inaugurated in 2015 to extract digital information from an entire edition of the Asia Directories, published annually between 1863 and 1941, and its outputs are already supporting research programmes at the Swiss Diplomatic Records Service (DODIS), at Aix-Marseille University and at ENS-Lyon. For researchers at EIB, the scale of new information about the European presence in the Far East - drawn from thousands of densely tabulated pages per volume - is providing unprecedented access to the processes that originally structured Europe's relationships with today's Pacific Rim nations.

Overview

The Directories project is remarkable for its integration of new technologies, which are enabling the development of a comprehensive digital resource for the research community rather than simply a digital version of the printed volumes. OCR output, including the position and recognition certainty of each printed character, forms the input to multiple fully automated and scholar-assisted workflows. These identify text fragments describing persons, occupations, corporations etc. and build local dictionaries, for example by comparing words encountered in adjacent years. Consequently a high level of automation is possible in correcting word-level recognition errors, in disambiguating typographical errors, and in accommodating the representation conventions that evolved over the extended period of publication of the edition. The text fragments can be tokenized in place, depending on their regularity, or exported for analysis by a range of specialized instruments and then reintegrated.

As a result, verified instances of hundreds of thousands of persons, employment, corporations, products and locations, as well as events, maps and published notices including advertisements, are being extracted at EIB as digital corpora in their own right, with growing networks of relationships and discovery mechanisms.

All text fragment boundaries, whether generated automatically from OCR, by machine learning or by manual identification, are stored using the Web Annotation Data Model (WADM), making them sustainable and easy to convert for ongoing analysis. The International Image Interoperability Framework (IIIF) is used to deliver page image fragments defined by WADM into extraction workflows and for efficient verification and contextual research by scholars. Data Futures' freizo platform enables rapid prototyping of research workflows and manages page images, OCR data and WADM, as well as extracted collections of entities, advertisements, maps and other outputs.
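To make the pairing of WADM and IIIF more concrete, the following is a minimal Python sketch of how a text fragment boundary might be recorded as a WADM annotation and how the same rectangle could be requested as an image fragment through the IIIF Image API. It is an illustration under stated assumptions, not the project's actual implementation: the `Region` class, the function names, the sample text and all `example.org` URIs are hypothetical, and the real freizo data layout is not described here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Pixel rectangle on a page image (top-left origin). Hypothetical helper."""
    x: int
    y: int
    w: int
    h: int

    def as_xywh(self) -> str:
        return f"{self.x},{self.y},{self.w},{self.h}"


def make_annotation(anno_id: str, page_uri: str, region: Region, text: str) -> dict:
    """Build a WADM annotation whose target is a rectangular page fragment."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "motivation": "describing",
        "body": {"type": "TextualBody", "value": text, "format": "text/plain"},
        "target": {
            "source": page_uri,
            # Media-fragment selector: the boundary of the text fragment.
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={region.as_xywh()}",
            },
        },
    }


def iiif_region_url(image_base: str, region: Region, size: str = "max") -> str:
    """IIIF Image API pattern: {base}/{region}/{size}/{rotation}/{quality}.{format}."""
    return f"{image_base}/{region.as_xywh()}/{size}/0/default.jpg"


if __name__ == "__main__":
    # All identifiers below are placeholders, not real project URIs.
    region = Region(120, 340, 600, 48)
    annotation = make_annotation(
        "https://example.org/anno/1",
        "https://example.org/iiif/volume/canvas/p42",
        region,
        "Smith, J., clerk",
    )
    print(json.dumps(annotation, indent=2))
    print(iiif_region_url("https://example.org/iiif/volume-p42", region))
```

Because the selector and the IIIF region share the same `x,y,w,h` values, a stored annotation is enough to retrieve the matching image fragment for verification, which is one plausible reason for combining the two standards.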

freizo also converts data for exchange with external analysis instruments and maintains an Invenio-3 repository, which provides a range of standards-based metadata representations and services that together form the Asian Directories and Chronicles Digital Resource (ADCDR). Invenio supports a wide range of plug-in technologies for connection with the library community and external research groups, and is the long-term data management technology for OpenAIRE.

Outlook

During the autumn of 2018 the first components of ADCDR will be launched as services to the research community. In addition to user documentation enabling effective use of this resource, descriptions of extraction processes and techniques will be released to support analysis preservation and independent verification. These documents, which will be added to this website under the headings below, will describe the analysis workflows and the technical infrastructure for managing multiple groups of contributors, verifiers and users in distributed organizations, which is essential for extracting such large-scale corpora.

Contents

  1. Digitization, OCR and management of page asset files.
  2. Construction of a digital collection to support research workflows, presentation and protection of investment.
  3. Organizing the contributor community: authentication, permissions, tracking, curation and verification.
  4. Preliminary analysis to establish structure of volumes, enabling extraction planning.
  5. Automated primary analysis of tabulated sections of volumes; contributor-assisted secondary analysis workflows.
  6. Task definition and contributor-assisted analysis workflows for unstructured volume sections.
  7. Task-based heat-mapping to increase efficiency of OCR error management.
  8. Automated creation of metadata and sustainability using WADM annotation.
  9. Intermediate dataset generation for contributor workpackages; bi-directional transfer with external instruments.
  10. Derived corpora and intermediate presentation of outcomes.
  11. Snapshots of primary and derived corpora to the Invenio-3 repository; persistent identifier strategy.
  12. Long-term internet presentation, access for the research community, and preservation.
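
Until these documents are released, the OCR error management named in item 7 can be illustrated with the dictionary comparison described in the Overview. The Python sketch below is a simplified assumption, not the project's method: it builds a local dictionary from words already verified in the adjacent years and replaces only low-certainty OCR tokens that closely resemble a dictionary entry. The function names, the thresholds and the sample words are hypothetical, and the standard-library `difflib` similarity measure stands in for whatever matching the project actually uses.

```python
from difflib import get_close_matches


def build_lexicon(verified_by_year: dict, year: int) -> list:
    """Collect words verified in the years adjacent to `year` (sorted for determinism)."""
    words = set()
    for neighbour in (year - 1, year + 1):
        words |= set(verified_by_year.get(neighbour, ()))
    return sorted(words)


def correct_token(token: str, certainty: float, lexicon: list,
                  min_certainty: float = 0.9, cutoff: float = 0.8) -> str:
    """Return a corrected token, or the original when no safe correction exists.

    Assumed thresholds: tokens recognized with certainty >= min_certainty are
    trusted as printed; others are replaced only by a close dictionary match.
    """
    if token in lexicon or certainty >= min_certainty:
        return token
    matches = get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


if __name__ == "__main__":
    # Placeholder data: words verified in the 1900 and 1902 volumes.
    verified = {
        1900: {"Jardine", "Matheson"},
        1902: {"Jardine", "Shanghai"},
    }
    lexicon = build_lexicon(verified, 1901)
    for token, certainty in [("Jardinc", 0.55), ("Shanghai", 0.97), ("Xyzzy", 0.20)]:
        print(token, "->", correct_token(token, certainty, lexicon))
```

Leaving unmatched tokens unchanged keeps genuinely new names intact for later scholar-assisted review, rather than forcing every uncertain word onto an existing entry.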