This paper looks at how deep integration between text and data is attempted in The Codex project. Standoff properties are used to mediate between the plain text stream and entities modelled in the Neo4j graph database. A dynamic standoff property text editor was constructed to enable real-time changes to text and annotations without invalidating standoff property indexes. The multidimensional affordances offered by standoff properties are examined, with reference to how annotations and graph entities can combine to construct an ›atlas of history‹ using Codex.
This paper presents The Codex, a project in which texts can be annotated multidimensionally on the basis of standoff properties and the annotations can be embedded in a graph database. In addition, unlike in many other standoff markup systems, both the base text and the annotations remain editable. Annotations themselves can in turn be annotated, which could make scholarly discourse easier to trace.
- 1. Introduction
- 2. Annotations
- 2.1 Presentation
- 2.1.1 Page, line, paragraph, sentence, column, etc.
- 2.1.2 Hyphens as a zero point annotation
- 2.2 Semantic
- 2.2.1 Agents
- 2.2.2 Claims
- 2.2.3 Texts
- 2.2.4 Meta-Relations
- 2.2.5 Concepts
- 2.2.6 Data Sets and Data Points
- 2.2.7 Times
- 3. Standoff Properties
- 4. The Modelling of Doubt
- 5. Implications
- 6. Conclusion
- Bibliographic References
- List of Figures with Captions
1. Introduction
The Codex is a project that aims to achieve deep integration between text and structured data. Although data can be ›extracted‹ from texts and stored in a database – whether that data be persons or places referenced, events recounted, numerical quantities reported, etc. – the non-sequential nature of databases can make it difficult to put the data back into context. While the data becomes amenable to computational analysis, the crucial narrative or argumentative structure is usually lost. Text itself can be considered to be a kind of database, one that is on the one hand constrained by its sequential presentation but on the other hand makes it possible to model thought in its multidimensional complexity. XML is a powerful tool for the modelling of text by allowing regions of text to be marked up with semantic tags or elements, and languages such as XPath can be used to query the XML document model. However, the use of markup itself introduces a discontinuity between the text and the data annotated. Markup changes the very text it marks up by creating a new marked-up document. Also, overlapping annotations – such as are commonly required in manuscript presentation annotations – cannot be directly expressed in XML’s tree-like hierarchical structure, necessitating workarounds such as standoff markup which further degrade the readability of the XML document. Without freely overlapping annotations the most essential multidimensional aspects of text (its multitude of meanings) cannot be adequately marked up.
The Codex aims to bridge the divide between database and text – to achieve ›deep integration‹ – by eschewing markup entirely. Standoff properties are used, instead, to represent annotations. As defined by Desmond Allan Schmidt, the use of standoff properties is a technique for recording textual properties that do not conform to a context-free grammar, and can freely overlap. While standoff properties have been proposed in the digital humanities several times, Codex offers a practical solution to adding annotations to text, allowing the user to make changes to the text in real-time without breaking existing annotations. In addition, Codex offers a selection of stylistic, presentation, and semantic annotation types, as well as tools like named entity recognition (NER) and pronoun selection. The user can easily link annotations to entities in the graph database, or create new entities and their dependencies, entirely within modal windows, allowing the user to capture and construct data within the editor. They can also add footnotes, marginalia, etc., within a modal window editor that offers the same features as the main editor window. Annotations themselves can be annotated with editorial commentary entered via the modal editor. In sum, through the use of a real-time standoff properties editor, a powerful modal-window entity management system, and the Neo4j graph database, Codex aims to provide an environment suitable for annotating all varieties of text, and linking annotations back to the text at the character level.
One of the main goals of Codex, building on the deep integration of text and data, is to construct an ›atlas of history‹ from primary source texts. This phrase is a metaphor for both the kinds of annotations used as well as the graphical tools employed to visualise connections: ›atlas‹ here refers to a network of relations that are amenable to projection in various ways. In other words, the purpose of Codex is not simply to annotate entities in text, but to represent these entities as nodes and relationships in the Neo4j graph database. The end result is both an annotated document and a graph dataset, which means that the more entity annotations are added to texts the richer the network of relations between texts becomes. As a basic example, if the corpus were the collected letters of Michelangelo and references to Michelangelo’s father Lodovico Buonarroti were annotated, it would be trivial to query the database to find all other texts referring to Lodovico. But to make clearer what can be annotated and represented in the Codex system, we propose to give an overview of the main annotation types.
2. Annotations
The types of annotations in Codex currently fall into three main categories: stylistic; presentation; and semantic. Stylistic annotations include commonly used typographical styles like italics, bold, underline, strikethrough, subscript, superscript, and forms of emphasis like spaced and uppercased text. The size, colour, and font type can also be set.
However, the more interesting categories are presentation and semantic. These are broken down as follows.
2.1 Presentation
2.1.1 Page, line, paragraph, sentence, column, etc.
These annotations denote regions of text corresponding to the presentation (or layout) of a page in a manuscript or a publication. Because standoff properties can overlap freely, presentation annotations in Codex pose no danger of truncating text with presentation markup.
2.1.2 Hyphens as a zero point annotation
Hyphens present a challenge for annotating medieval manuscripts; while the editor wishes to record the location of the hyphen, it is not desirable to intersperse the plain text with hyphens as this will confound literal text searches. The hyphen annotation in Codex avoids this problem by representing hyphenation with zero-dimensional annotations. A zero-dimensional annotation is a special case of a standoff property that has a start index value but no end index value; in this way, an annotation effectively refers to a position in the text between characters. The hyphen itself is not stored in the text (leaving the words unhyphenated), but the annotation indicates the location of the hyphen in the original. Zero point annotation is a generalizable feature of standoff properties and can be used for other cases as well.
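The zero-point idea can be illustrated with a minimal Python sketch. It is not taken from the Codex code base: the `Annotation` class, the `restore_hyphens` function, and the convention that a zero-point annotation sits immediately before the character at its start index are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified annotation record (not the Codex schema)."""
    type: str
    start_index: int
    end_index: Optional[int] = None  # None marks a zero-point annotation


def restore_hyphens(text: str, annotations: list[Annotation]) -> str:
    """Re-insert line-break hyphens recorded as zero-point annotations.

    Assumption: a zero-point annotation refers to the position
    immediately before the character at start_index.
    """
    points = sorted(
        (a.start_index for a in annotations
         if a.type == "hyphen" and a.end_index is None),
        reverse=True,  # work right to left so earlier positions stay valid
    )
    for pos in points:
        text = text[:pos] + "-\n" + text[pos:]
    return text


plain = "the manuscript was copied"
hyphens = [Annotation("hyphen", 8)]  # between "manu" and "script"

# The stored text stays unhyphenated, so a literal search still succeeds.
print("manuscript" in plain)
# The original line break can be rebuilt on demand for display.
print(restore_hyphens(plain, hyphens))
```

The stored text is never modified; the hyphenated form is derived only when the original layout needs to be shown.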
2.2 Semantic
Before proceeding we should note that for each semantic annotation there is a corresponding semantic entity in the system. The types of annotations and entities available in Codex are defined by the application code and not by an existing standard, meaning that the system can be configured by a programmer with more annotations and entities as desired. Each entity is modelled as a combination of nodes and edges in the graph database, sometimes expressed in hypernode structures (in other words, an entity modelled over a cluster of nodes). Creating a semantic annotation for a text is tantamount to either selecting a pre-existing entity, or creating a new, corresponding entity in the modal window interface. Conversely, entities can also be created and managed outside of the Codex editor in sections of the Codex interface specific to that entity type. Therefore, entities can be considered as a data set independent of texts but capable of integration with texts via semantic annotations. This means that entities can be exported or imported irrespective of texts in which they may or may not be mentioned: entities don’t have to be mentioned in (or inferred from) texts to exist in the system.
2.2.1 Agents
An agent refers to any kind of entity that is mentioned in the text. A non-exhaustive list of agent types includes people, places, and objects (natural and man-made). Collective agents like organisations, families and other groups can be represented, as well as aspects of an agent at a point in time. Metadata about an agent can be recorded in associated property nodes which are dynamically defined in the interface as needed. For example, one can record the gender of a person, their height, weight, etc., and create new property types as required.
Agents can also be related to each other via dynamically-created relations. (In Codex, these are called meta-relations, for reasons which are discussed later in the section on this entity.) The example below shows some of Lorenzo de’ Medici’s genealogical relationships, as well as his connection to transient agent collectives like his presence in a group of six Florentine ambassadors to Rome in 1471, and in another embassy to Rome in 1483.
2.2.2 Claims
A claim refers to a statement concerning one or more agents, usually with respect to a place and a time. A claim usually takes the form of an event (an event claim), but can also represent a thought or an opinion. A claim entity in Codex is not taken as a statement of fact, substantiated or otherwise, but is rather a data structure resembling a verb phrase with prepositional agents. For example, the statement that »Lorenzo de' Medici died on his estate at Careggi« made by Luca Landucci in his diary entry of April 8th, 1492, is modelled in Codex as ›(Subject) Lorenzo de’ Medici‹, ›(Event) died‹, ›(At) Careggi‹, as visualised in the Codex interface and Neo4j database browser in Figure 4 and Figure 5. Our approach is not to assert whether or not the statement reports a fact, merely to allow the editor to annotate the statement.
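To make the shape of such a claim concrete, the following Python sketch models the Landucci example as an event with role-labelled agents. The `Claim` class, its field names, and the `describe` helper are hypothetical illustrations, not the Codex graph schema.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """Hypothetical claim record: an event plus role-labelled agents."""
    event: str                                            # e.g. "died"
    roles: dict[str, str] = field(default_factory=dict)   # role -> agent name
    reported_by: str = ""                                  # who makes the statement
    reported_on: str = ""                                  # date of the source entry


def describe(claim: Claim) -> str:
    """Render the claim as a verb phrase with its prepositional agents."""
    parts = []
    if "Subject" in claim.roles:
        parts.append(f"(Subject) {claim.roles['Subject']}")
    parts.append(f"(Event) {claim.event}")
    parts += [
        f"({role}) {agent}"
        for role, agent in claim.roles.items()
        if role != "Subject"
    ]
    return ", ".join(parts)


claim = Claim(
    event="died",
    roles={"Subject": "Lorenzo de' Medici", "At": "Careggi"},
    reported_by="Luca Landucci",
    reported_on="1492-04-08",
)
print(describe(claim))
```

Keeping the reporter and the date of the entry on the record reflects the point made above: the structure records that a statement was made, not that it is true.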
2.2.3 Texts
A text entity in Codex is composed of plain text and a collection of standoff properties. For convenience, texts can be assigned a ›type‹ indicating their function (such as ›body‹, ›footnote‹, ›margin note‹, etc.) but there are no limitations on the kind of text stored. Therefore, a text can contain as much of or as little of the source text as appropriate. In the case of the Luca Landucci Diary, each diary entry (of which there are several per page in the source) is stored in a separate text node, whereas each Michelangelo letter as a whole is stored in a text node. Presentation annotations are used to mark sections of the text corresponding to the source (e.g., pages, columns, etc.), and the structure annotation can be used to link texts to structure entities which represent arbitrary sections of a publication (e.g., the chapters the text belongs to).
A text annotation in Codex is an annotation that relates a region of text in a text entity to a different text entity. There are two annotation topologies available to text annotations: they can either be applied to a region of text, like most annotations; or they can be inserted ›invisibly‹ between characters. We can think of these topologies as one-dimensional and zero-dimensional annotations. It will be remembered that zero-dimensional annotations are also used to represent hyphen annotations; in the case of text annotations they could function as footnote numbers that the editor wishes to position in a text, but doesn’t want to be included in the text per se.
2.2.4 Meta-Relations
A meta-relation entity is a relationship between agents with a number of features distinguishing it from simple relationships or edges in a graph database:
- It is dynamically definable in the Codex interface. We have found the ability to create new relationship types in an ad hoc way to be invaluable for capturing the fluidity of relationships in texts. When the selection of relationship types is hard-wired into a program it becomes burdensome to create new ones; enabling the user to create them freely in the interface encourages the spontaneous creation of relationship types that are more fit-for-purpose;
- It is bidirectional, meaning the user is able to specify both directions of the relationship (e.g., ›parent of‹/›child of‹) rather than being forced to adopt a single direction (e.g., ›parent of‹) imposed by graph edges. This encourages the user to think in terms of the overall relationship – e.g., ›parentage‹ – rather than forcing them to arbitrarily choose a single relationship to represent by implication a bidirectional relationship. One advantage of this in constructing Cypher queries is that an agent’s participation in a meta-relation can be found without needing to consider their role in the relationship, although that role is recorded and can still be explicitly queried if desired;
- It is composable within a hierarchy, effectively allowing relationships themselves to be treated as a graph. For example, one can define an overarching relation type like ›interpersonal relationships‹ and nest subordinate types beneath it, like ›social relationships‹, ›family relationships‹, ›professional relationships‹, etc., allowing one to query relationships between agents (e.g., people) at an abstract level. Rather than being limited to finding simply the ›friends of‹ a person, one can expand a query to also retrieve ›associates of‹, ›acquaintances of‹, ›confidantes of‹, etc.
A meta-relation annotation, therefore, is an annotation that refers to a meta-relation entity. Its practical purpose is to allow the editor to annotate agent relationships from texts, extending the network of relationships between agents. Ultimately, such networks allow the user to find indirect connections between texts on the basis of the relationships between agents established in the network. In Figure 6, the orange line beneath ›son of Antonio‹ is a meta-relation annotation indicating the source of the statement that Luca Landucci was the son of Antonio Landucci.
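The bidirectional and hierarchical features described above can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the Codex data model or a Cypher query: the class names, the `TYPES` and `RELATIONS` collections, and the `relations_of` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetaRelationType:
    name: str                     # the overall relationship, e.g. "parentage"
    forward: str                  # label read from the first agent
    reverse: str                  # label read from the second agent
    parent: Optional[str] = None  # broader type in the hierarchy, if any


@dataclass(frozen=True)
class MetaRelation:
    type_name: str
    first: str
    second: str


TYPES = {t.name: t for t in [
    MetaRelationType("interpersonal relationships", "related to", "related to"),
    MetaRelationType("family relationships", "relative of", "relative of",
                     "interpersonal relationships"),
    MetaRelationType("social relationships", "associate of", "associate of",
                     "interpersonal relationships"),
    MetaRelationType("parentage", "parent of", "child of",
                     "family relationships"),
]}

RELATIONS = [
    MetaRelation("parentage", "Antonio Landucci", "Luca Landucci"),
    MetaRelation("parentage", "Lodovico Buonarroti", "Michelangelo"),
]


def is_a(type_name: Optional[str], ancestor: str) -> bool:
    """True if type_name is the ancestor itself or nested beneath it."""
    while type_name is not None:
        if type_name == ancestor:
            return True
        type_name = TYPES[type_name].parent
    return False


def relations_of(agent: str, under: Optional[str] = None) -> list[tuple[str, str]]:
    """Find an agent's relations whatever its role, optionally below a broad type."""
    found = []
    for r in RELATIONS:
        if under is not None and not is_a(r.type_name, under):
            continue
        t = TYPES[r.type_name]
        if r.first == agent:
            found.append((t.forward, r.second))
        elif r.second == agent:
            found.append((t.reverse, r.first))
    return found


print(relations_of("Luca Landucci"))
print(relations_of("Antonio Landucci", under="interpersonal relationships"))
```

The caller does not need to know whether an agent is the parent or the child in order to find the relation, yet the role is still reported in the result; the `under` argument widens or narrows the query along the type hierarchy.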
2.2.5 Concepts
A concept in Codex is a class or type that, taken as a whole, is part of a common ontology in the system. Note that the ontology is not common in the sense of being a ›universal‹ or ›world‹ ontology, but merely functions as a subgraph that is shared among other entity types (such as agents, claims, meta-relations, etc.) for the purposes of a common, reusable reference. Rather than constituting a universal, top-down ontology, the concept subgraph is in practice composed of any number of open or idiolectic ontologies defined by the user. Codex already contains a number of idiolectic ontologies, such as ontologies for types of events, places, relationships, professions, etc. For example, claim entities reference the ›Events‹ subset of the concept subgraph. The concept ›Event‹ is the root node of the Events ontology and contains all types (and subtypes) of events that occur in the project domain. Note that concepts can have more than one parent, if desired, as a graph is not bound by the limitations of a tree. Changing the structure of the ontology (e.g., moving a child concept to a different parent) is easily done through the Codex interface, meaning that ontological structures can be kept fluid to suit the evolving understanding of the project domain.
A concept annotation, therefore, is an annotation that refers to a concept entity in the common ontology. In Figure 7, the green underlines represent concept annotations on the words ›flautist‹ and ›prodigy‹.
2.2.6 Data Sets and Data Points
A data point in Codex is defined as a quantity with a unit of measurement that is attributable to a place and a time. A data set is a collection of data points. In the example below, the statement that »three people fell dead on this day« translates to a data point where the value is »3«, the unit of measure is »people«, the place is »Florence«, and the time is »April 23rd, 1483« (the date of Landucci’s diary entry).
The idea of a data point is to enable the editor to extract numerical data from a text that may be of statistical interest. As indicated in the above Figure 8, a data point annotation links the text to a data point entity.
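As a rough sketch of the definition above, a data point can be written as a small record and a data set as a collection of such records. The `DataPoint` and `DataSet` classes and the data set name are hypothetical; the values come from the Landucci example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataPoint:
    value: float
    unit: str
    place: str
    time: str  # kept as a string here; historical dates are often imprecise


@dataclass
class DataSet:
    name: str
    points: list[DataPoint] = field(default_factory=list)

    def total(self, unit: str) -> float:
        """Sum all values recorded in the given unit of measure."""
        return sum(p.value for p in self.points if p.unit == unit)


point = DataPoint(value=3, unit="people", place="Florence", time="1483-04-23")
deaths = DataSet(name="sudden deaths (illustrative name)", points=[point])
print(deaths.total("people"))
```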
Some practical examples of data sets that can be extracted from historical sources include epidemiological data, weather records, census figures, crime statistics, etc.
2.2.7 Times
A time entity in Codex is a representation of a date and time in various degrees of precision. The entity is composed of nine components, which are all optional. Options for c. include ›on‹, ›before‹, ›after‹, and ›circa‹; options for Section include ›early‹ and ›late‹; and options for Season include ›Winter‹, ›Summer‹, ›Autumn‹, ›Spring‹. The variety of options is meant to reflect the realities of date representation in historical texts. (We intend to review the W3C Time Ontology for guidance on refining this model.)
A time annotation is applied to any part of the text with an identifiable date / time, even if the text does not state a numerical date. For example, in Figure 8 above the time annotation (blue underline) links the text »this day« to the stated date of the diary entry (April 23rd, 1483).
Now that an overview of the main annotation types in Codex has been given, it is necessary to examine the standoff properties model as it forms the basis of Codex’s approach to text-as-graph.
3. Standoff Properties
Aside from markup formats, word chains present another approach to annotation. A word chain is a graph model of a text where each word is treated as a token node and structure is indicated by relating each node to its next sibling in the sentence. Lexical and presentational annotations can also be linked to token nodes to model the structure of paragraphs and larger units.
Like standoff properties, word chains represent a markup-free alternative to XML document formats, and offer a solution to the overlapping annotations problem. However, at present updates to word chains are managed through graph database queries, which requires some programming expertise. The Codex standoff property editor offers a trade-off between the multidimensional affordances of the graph and the technical simplicity, endurance, and sustainability of the text stream. Another distinction between word chains and standoff properties is that word chains take the word as the smallest token, which poses challenges for annotations inside of words (let alone how one chooses to define word boundaries).
The simplest solution from an annotation standpoint is conceivably to treat the character and not the word as the smallest token unit; however, managing a chain of character nodes as a graph data structure would be far more unwieldy than a chain of word nodes. This assumes, though, that the characters themselves need to be represented as nodes; in fact, what defines an annotation is that it is a region of text with a certain intention (whether stylistic, presentational, semantic, etc.). If one moves away from the token node concept to an annotation node concept, then the text of the document can be stored in a plain text format (sans markup) and annotations can annotate the text by using start and end character indexes.
The removal of embedded markup makes the text stream easily readable to both humans and machines. It also solves the overlapping annotation problem because the properties are stored apart from the text, and not subject to hierarchical encoding conflicts. Multiple properties can refer to the same regions of text – or overlapping regions – by virtue of the start and end character indexes. Standoff properties are inherently discrete objects which coexist in a ›flat‹ hierarchy, that is to say, with no imposed hierarchy at all. If a standoff property references a linked entity in the database, it can be easily connected via an edge, allowing full traceability from entities to the regions of text they are referenced by.
A standoff property, then, is essentially a data structure (tuple) that models the following attributes:
- Type. A string representing the name (i.e., the type) of the annotation.
- StartIndex. An integer representing the index position of the first character of the annotation: 0 <= x < n, where x is the index and n is the length of the text.
- EndIndex. An integer representing the index position of the last character of the annotation, subject to the same bounds as the StartIndex.
In practice one would wish to extend this with a fourth attribute:
- Value. A string representing data specific to the annotation, such as the unique identifier of a referenced entity, or alternatively a colour value, text size, font, etc.
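The four attributes can be sketched as a Python tuple type, using the Landucci sentence quoted earlier. The `StandoffProperty` and `covered_text` names are hypothetical, the `Value` strings are placeholders rather than real identifiers, and the requirement that the start index not exceed the end index is an added assumption.

```python
from typing import NamedTuple, Optional


class StandoffProperty(NamedTuple):
    type: str
    start_index: int              # index of the first annotated character
    end_index: int                # index of the last annotated character (inclusive)
    value: Optional[str] = None   # e.g. an entity identifier or a colour


def covered_text(text: str, prop: StandoffProperty) -> str:
    """Return the characters an annotation refers to."""
    # Both indexes must satisfy 0 <= x < n; start <= end is an assumption.
    if not (0 <= prop.start_index <= prop.end_index < len(text)):
        raise ValueError("indexes fall outside the text")
    return text[prop.start_index:prop.end_index + 1]


text = "Lorenzo de' Medici died on his estate at Careggi"
props = [
    StandoffProperty("agent", 0, 17, "placeholder-agent-id"),
    StandoffProperty("claim", 0, 47, "placeholder-claim-id"),
    StandoffProperty("italics", 41, 47),
]

for p in props:
    print(p.type, "->", covered_text(text, p))
```

The three properties overlap freely: the claim spans the whole sentence while the agent and the italics each cover part of it, and the text itself contains no markup.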
Building on our technical definition of a standoff property, Codex extends this model further to include attributes that aid with database integration.
- GUID. A 32-character string functioning as a unique identifier of the standoff property. This is required for saving the property to the database.
- UserGUID. A GUID (see above) representing the user who created the annotation.
- Index. An optional integer representing the order in which the standoff property was created.
- Text. An optional string representing the source text referred to by the annotation. This is an optional attribute to make it easier to view standoff properties in the database. The StartIndex and EndIndex attributes are the source of truth with respect to the location of the annotation in the text.
- Layer. An optional string representing arbitrary groups (layers) that the annotation is assigned to. For example, if the user wanted to group several agent annotations referencing artists, they could assign the annotations to an ›artist‹ layer. This grouping could be used for filtering either in the database, or in the editor itself.
- IsZeroPoint. A boolean value indicating whether the standoff property is a zero-point annotation; that is, an annotation that refers to an invisible index point in the text. This can be thought of as an annotation that refers to the space between two characters.
- IsDeleted. A boolean value indicating whether the standoff property has been marked as deleted.
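Because the StartIndex and EndIndex attributes are the source of truth, any edit to the text has to keep them aligned. The Python sketch below shows one naive way to do this for an insertion, by shifting indexes at or after the insertion point. It is an assumption made for illustration and is not presented as the mechanism the Codex editor uses; the class layout, the `insert_text` function, and the use of `uuid4().hex` to produce a 32-character identifier are likewise hypothetical.

```python
import uuid
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class StandoffProperty:
    type: str
    start_index: int
    end_index: Optional[int]      # inclusive; None for a zero-point annotation
    value: Optional[str] = None
    layer: Optional[str] = None
    is_zero_point: bool = False
    is_deleted: bool = False
    guid: str = field(default_factory=lambda: uuid.uuid4().hex)  # 32 characters


def insert_text(text, props, position, inserted):
    """Insert a string before `position` and shift affected indexes."""
    shift = len(inserted)
    new_text = text[:position] + inserted + text[position:]
    moved = []
    for p in props:
        start = p.start_index + shift if p.start_index >= position else p.start_index
        end = p.end_index
        if end is not None and end >= position:
            end += shift  # grows the range if the insertion falls inside it
        moved.append(replace(p, start_index=start, end_index=end))
    return new_text, moved


text = "Lorenzo died at Careggi"
props = [
    StandoffProperty("agent", 0, 6),                         # "Lorenzo"
    StandoffProperty("agent", 16, 22, layer="places"),       # "Careggi"
    StandoffProperty("hyphen", 19, None, is_zero_point=True),
]

new_text, moved = insert_text(text, props, 7, " de' Medici")
for p in moved:
    if not p.is_deleted:
        print(p.type, p.start_index, p.end_index)
```

Text typed directly after an annotation is not absorbed into it in this sketch; a real editor would need a policy for that case, as well as for deletions that remove part or all of an annotated range.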
To give a real-world example, below are images of a text in the Codex editor as well as a representation of a portion of it as a JSON export.
Following is an extract from the JSON export of the above text, with yellow highlights showing the typical parts of the standoff property data structure. The green parts show which text the annotation covers. The blue part shows the text itself.
"text": "Lorenzo de' Medici died on his estate at Careggi; and it was said that when he heard the news of the effects of the thunderbolt, being so ill, he asked where it had fallen, and on which side; ... ",
"text": "Lorenzo de' Medici",
"text": "Lorenzo de' Medici died on his estate at Careggi",
4. The Modelling of Doubt
Before proceeding to reflect on the possibilities of deep integration between text and data, it is important to review how the above discussion bears on the subject of the January 2018 graph-conference, »The modelling of doubts«, hosted by the Akademie der Wissenschaften und der Literatur, Mainz. Although Codex wasn’t designed with the intention of modelling doubt in the strict sense of quantifying it, it can be said that it aims at modelling interpretation in the following ways:
- The ability to freely overlap annotations means that the same text regions can be annotated multiple times, allowing multiple interpretations to be captured. For instance, if two editors disagree about the identity of a person in the text, they could each add their own agent annotation to the same text region as overlaps are permitted;
- Comments can be added to annotations themselves, enabling editorial discussions about the annotation validity to be recorded;
- The ability to annotate agent properties, event claims, and meta-relations leads to full transparency around statements that are often just presented as established fact. For example, rather than simply accepting as fact the claim that Lorenzo de’ Medici died at Careggi on April 8th, 1492, the claim annotation in Codex is traceable back to the precise section of the Luca Landucci text it occurs in.
5. Implications
The full potential of the standoff property model for the integration of text and graph data structures has yet to be documented. Aside from the convenience of plain text free of markup, of overlapping annotations, and of annotations whose sources are traceable back to precise text ranges, there are two features of standoff properties that suggest possibilities for computational analysis:
- Standoff properties can be grouped into layers, where a layer is defined either implicitly by the annotation type or explicitly by a value stored in the Layer attribute;
- The StartIndex and EndIndex attributes offer the possibility of combining annotations that are contained by or overlap the same text range.
These qualities of layering and combination have already led to useful features in the Codex editor (such as modal windows for managing named-entity and pronoun annotation candidates), but they offer computational insights as well. It is trivial, for example, to write a Cypher query to find all annotations in a text (or in all texts) that overlap each other in various ways. Overlapping annotations have the potential to enrich the text they annotate with their combined meanings. One approach that we are exploring in Codex is the comparison of manually-entered annotations (such as agents and event claims) with machine-generated syntactic annotations, providing a natural language analysis of texts. This is an example of how standoff properties can support a multidimensional analysis of the text, allowing human-generated and machine-generated annotations to exist together.
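The kinds of overlap mentioned above can be distinguished with simple index comparisons. The following Python sketch works on an in-memory list rather than through a Cypher query; the `Span` type, the `relation` and `overlapping_pairs` functions, and the ›verb‹ annotation standing in for a machine-generated syntactic annotation are assumptions made for the example.

```python
from itertools import combinations
from typing import NamedTuple


class Span(NamedTuple):
    type: str
    start_index: int
    end_index: int  # inclusive


def relation(a: Span, b: Span) -> str:
    """Classify how the range of `a` relates to the range of `b`."""
    if a.end_index < b.start_index or b.end_index < a.start_index:
        return "disjoint"
    if (a.start_index, a.end_index) == (b.start_index, b.end_index):
        return "same range"
    if a.start_index <= b.start_index and b.end_index <= a.end_index:
        return "contains"
    if b.start_index <= a.start_index and a.end_index <= b.end_index:
        return "contained by"
    return "partial overlap"


def overlapping_pairs(spans: list[Span]) -> list[tuple[str, str, str]]:
    """List every pair of annotations that share at least one character."""
    pairs = []
    for a, b in combinations(spans, 2):
        kind = relation(a, b)
        if kind != "disjoint":
            pairs.append((a.type, kind, b.type))
    return pairs


# Indexes refer to "Lorenzo de' Medici died on his estate at Careggi".
spans = [
    Span("agent", 0, 17),   # "Lorenzo de' Medici"
    Span("claim", 0, 47),   # the whole statement
    Span("verb", 19, 22),   # "died", standing in for a machine-generated annotation
    Span("place", 41, 47),  # "Careggi"
]

for pair in overlapping_pairs(spans):
    print(pair)
```

Here the manually entered claim contains both the agent and the syntactic annotation, which is the sort of combination that comparing human-generated and machine-generated layers would rely on.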
6. Conclusion
The Codex uses standoff properties to integrate annotations with text in a graph database. Annotations map back to text at the character level and can be overlapped without constraint. The ability to overlap annotations and to comment on them offers a convenient system for capturing discussions around doubt. The variety of semantic annotations – including event claims, meta-relations, agent properties, etc. – leads to a system where ›factual data‹ can be easily traced back to its text sources. Beyond the modelling of interpretations, Codex seeks to enable project editors to build an ›atlas of relations‹ from their source texts, integrating graph entities with text at the character level. The intended result is not so much a marked-up document (although this is a given) but a graph dataset with ›deep roots‹ in its constituent source texts, mediated through layers of standoff property annotations. Codex’s real-time standoff property editor and modal-window entity management system are tools that we hope will assist editors in exploring the connections between structured data and text in their own digital editions.