Over 60,000 Dutch archaeological research reports are available online, and this number is growing by around 4,000 a year. Much of this grey literature threatens to end up in a proverbial graveyard, unread and unknown. Currently it is only possible to search the metadata of these documents, mainly via the Archis database and the DANS repository. However, these metadata are often limited and sometimes inconsistent. To index the texts effectively, Named Entity Recognition (NER) is needed to correctly identify and distinguish between archaeological concepts. Standard approaches to NER are insufficient to deal with the peculiarities of these concepts. Some research has already been done on NER in archaeological texts, e.g. in the ARIADNE and Open Boek projects, but these efforts are not combined with full-text search, or tend to focus on a limited set of entity types rather than the full breadth of archaeological concepts. This paper will present the first phase of AGNES (Archaeological Grey literature Named Entity Search), in which machine learning is used to perform NER. The initial experiments use Conditional Random Fields and a feature set fine-tuned to archaeological concepts. The identified entities are combined with a full-text index to create an effective online search, allowing researchers to address research questions that currently cannot be answered. The project is carried out in cooperation with the Leiden Institute of Advanced Computer Science (LIACS), which provides a high-performance computing cluster, allowing for the use of more resource-intensive techniques and short iterative development cycles. (Alex Brandsen, Milco Wansleeben, Suzan Verberne)
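
To make the idea of a CRF feature set more concrete, the following is a minimal sketch of per-token feature extraction of the kind typically fed to a Conditional Random Field toolkit. It is an illustration only: the function names (`token_features`, `sentence_features`), the specific features, and the example sentence are assumptions, not the feature set actually used in AGNES.

```python
import re


def token_features(tokens, i):
    """Build a feature dict for tokens[i] (hypothetical, illustrative features)."""
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),          # word endings can hint at entity type
        "is_title": word.istitle(),            # capitalisation, e.g. period names
        "has_digit": any(ch.isdigit() for ch in word),
        "is_year_like": bool(re.fullmatch(r"\d{3,4}", word)),  # e.g. "1200"
    }
    # Context features: neighbouring words, or sentence boundary markers.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats


def sentence_features(tokens):
    """Return one feature dict per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Assumed example sentence (Dutch for "pottery from the Bronze Age").
example = ["Aardewerk", "uit", "de", "Bronstijd"]
features = sentence_features(example)
```

A list of such dictionaries per sentence, paired with a label sequence, is the usual input format for CRF libraries; domain-specific features (for instance, lookups in a thesaurus of archaeological terms) would be added to the same dictionary.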
