Recommended citation: Anna Siebold: Review of Sarah Lang: Fine-Tuning Machine Learning with Historical Data. An Alchemical Object Detection Dataset for Early Modern Scientific Illustrations. In: Zeitschrift für digitale Geisteswissenschaften. 24.04.2025. https://doi.org/10.17175/2025_002_R1
The metadata describes the dataset in as much detail as possible. It contains human- and machine-readable license information. Authority data is referenced in a meaningful way (e.g. GND, Wikidata, GeoNames):
Applicable

Comment: The metadata is described well. I have one suggestion: it could be stated more explicitly, somewhere at the beginning of the text, that an annotation comprises not only the segmentation and bounding box but also the classification.
Data and metadata follow a common, suitable data standard (in that specific discipline) and are available in a common file format. The data standards used (and, if applicable, their versions) are named:
Applicable

Comment: The GitHub link to the .py script used and the underlying image repository could be referenced earlier.
The authors of the dataset (and any other contributors) are clearly named and can be persistently identified (e.g. via the GND number, ORCiD), their functions are described in the metadata (e.g. via CRediT, MARC Relator or TaDiRAH):
Applicable

Comment: The text switches to "us" and "we" in [17] to [20], whereas beforehand the reader gets the impression that there is one author who builds on previous work with other (named) persons involved.
The data set and its subsets are clearly and meaningfully named and well structured. The data can be accessed directly via the PID without specialized tools, free of charge and barrier-free. Any access or usage restrictions are clearly justified (e.g. for copyright or privacy reasons):
Applicable

Comment: The GitHub link and the description of where the data can be viewed could be placed more prominently in the text for easier access.
The data is available in a file format that is suitable for long-term storage or in several file formats in order to increase the probability of long-term availability. A strategy for keeping the data and metadata permanently up-to-date, available and usable is in place and documented in the data set; the long-term archiving of the data and metadata is ensured. Changes are made transparent and traceable through versioning:
Applicable

The data publication is an innovative tool that can be used by others and combined with other data sets. The data set contains information on completeness, context of origin, collection and processing methods and the quality of the data. Data gaps, uncertainties or difficulties are clearly stated:
Partly applicable

Comment: The dataset contains information on the context of origin, collection and processing methods, but not on completeness (in relation to other manuals and textbooks containing relevant imagery). Data gaps, uncertainties or difficulties are clearly stated. Whether it is an innovative tool, I am not able to assess.
The data paper presents the context in which the data publication was created and the basic objective in a detailed and comprehensible manner:
Applicable

For all published data, the data paper provides comprehensive information on the methodology of data collection and, where applicable, data cleansing and preparation. The choice of data schemes and file formats used is clearly explained:
Applicable

Gaps in the data and weaknesses in the methodology are transparently stated and justified:
Applicable

The use of the data by the authors is described in a comprehensible manner, and research contributions published using the data are referenced. Reference is also made to other data publications that are similar or can be usefully combined with the data described:
Applicable

Comprehensible potential usage scenarios are drafted and described:
Applicable

Accept (high quality)
Concluding Comment: The article convincingly frames annotation as a hermeneutic act, showing how the transformation of historical artefacts into data shapes ML outcomes. The emphasis on pre-processing, which often goes unmentioned, is insightful, and the discussion of failed approaches and revisions adds transparency. In my opinion, the paper is generally well balanced: it includes historical information, describes its methodology, and reflects on epistemological issues. I would suggest some minor changes: the distinction between DH and CH at the beginning is, in my opinion, not necessary; section 1.3 could be streamlined; the shift from a neutral perspective to "we" mentioned above should be addressed; and the link to GitHub should be featured more prominently for easier access.