Recommended citation: Anna Siebold: Review of Sarah Lang: Fine-Tuning Machine Learning with Historical Data. An Alchemical Object Detection Dataset for Early Modern Scientific Illustrations. In: Zeitschrift für digitale Geisteswissenschaften. 24.04.2025. https://doi.org/10.17175/2025_002_R1
The metadata describes the dataset in as much detail as possible. It contains human- and machine-readable license information. Authority data is referenced in a meaningful way (e.g. GND, Wikidata, GeoNames):
Applicable

Comment: The metadata is described well. I have one suggestion: it could be stated more explicitly, somewhere at the beginning of the text, that an annotation comprises not only the segmentation and bounding box but also the classification.
Data and metadata follow a common, suitable data standard (in that specific discipline) and are available in a common file format. The data standards used (and, if applicable, their versions) are named:
Applicable

Comment: The GitHub link to the .py script used and the underlying image repository could be referenced earlier.
The authors of the dataset (and any other contributors) are clearly named and can be persistently identified (e.g. via the GND number, ORCiD), their functions are described in the metadata (e.g. via CRediT, MARC Relator or TaDiRAH):
Applicable

Comment: The text switches to "us" and "we" in [17] to [20], whereas beforehand the reader gets the impression that there is one author who builds on previous work with other (named) persons involved.
The data set and its subsets are clearly and meaningfully named and well structured. The data can be accessed directly via the PID without specialized tools, free of charge and barrier-free. Any access or usage restrictions are clearly justified (e.g. for copyright or privacy reasons):
Applicable

Comment: The GitHub link and the description of where the data can be viewed could be placed more prominently in the text for easier access.
The data is available in a file format that is suitable for long-term storage or in several file formats in order to increase the probability of long-term availability. A strategy for keeping the data and metadata permanently up-to-date, available and usable is in place and documented in the data set; the long-term archiving of the data and metadata is ensured. Changes are made transparent and traceable through versioning:
Applicable

The data publication is an innovative tool that can be used by others and combined with other data sets. The data set contains information on completeness, context of origin, collection and processing methods and the quality of the data. Data gaps, uncertainties or difficulties are clearly stated:
Partly applicable

Comment: The dataset contains information on the context of origin, collection and processing methods, but not on completeness (in relation to other manuals and textbooks containing relevant imagery). Data gaps, uncertainties or difficulties are clearly stated. Whether it is an innovative tool, I am not able to assess.
The data paper presents the context in which the data publication was created and the basic objective in a detailed and comprehensible manner:
Applicable

For all published data, the data paper provides comprehensive information on the methodology of data collection and, where applicable, data cleansing and preparation. The choice of data schemes and file formats used is clearly explained:
Applicable

Gaps in the data and weaknesses in the methodology are transparently stated and justified:
Applicable

The use of the data by the authors is described in a comprehensible manner, and research contributions published using the data are referenced. Reference is also made to other data publications that are similar or can be usefully combined with the data described:
Applicable

Comprehensible potential usage scenarios are drafted and described:
Applicable

Accept (high quality)
Concluding Comment: The article convincingly frames annotation as a hermeneutic act, showing how the transformation of historical artefacts into data shapes ML outcomes. The emphasis on pre-processing, which often goes unmentioned, is insightful, and the discussion of failed approaches and revisions adds transparency. In my opinion, the paper is generally well balanced: it includes historical information, describes its methodology, and reflects on epistemological issues. I would suggest some minor changes: the distinction between DH and CH at the beginning is, in my opinion, not necessary; section 1.3 could be streamlined; the shift from a neutral perspective to "we" mentioned above should be addressed; and the link to GitHub should be featured more prominently for easier access.