Namescape Datasets

Datasets processed in the Namescape project

 

The following datasets have been processed:

Tagged named entity data, manually verified

1) Namescape training corpus: about 1M tokens sampled randomly from Corpus Sanders (public deliverable, delivered as part of D2)

Tagged named entity data

  1. Corpus-Sanders (550 books)
  2. Corpus-Huygens (22 books)
  3. Corpus-eBooks (~7000 books)
  4. Corpus-SONAR (~100 books) (public deliverable)
  5. Corpus-Gutenberg Dutch (public deliverable)

Named entity lexica

1) Based on Corpus Sanders (public deliverable)
2) Based on extended corpus eBooks

Corpus Sanders

A corpus of 582 Dutch novels written and published between 1970 and 2009. Part of this corpus (550 novels) is available to the INL in XML format. The remaining 32 novels (some of which are English novels translated into Dutch) are in ASCII. All of the material is based on OCR scans of the originals and may contain scanning errors. Metadata (title, year of publication, publisher, page numbers) is available but needs restructuring and completion (e.g. adding ISBNs). The metadata will be made CLARIN-compliant (CMDI).

 

Corpus Huygens

Consists of 22 novels manually tagged with detailed named entity information according to the standards defined in Part I. IPR for this corpus do not allow distribution.

 

Corpus eBooks

Consists of 7000+ Dutch eBooks tagged automatically with basic NER features and person name part information. IPR for this corpus do not allow distribution.

Corpus Sonar Books

Consists of 105 NE-tagged Dutch books; the TEI was converted from FoLiA and CMDI metadata.

Corpus Gutenberg Dutch

Consists of 530 NE-tagged TEI files converted from the EPUB versions of the corresponding Gutenberg documents.

 

Gold standard data

The Namescape gold standard corpus consists of random paragraphs from Corpus Sanders. Its size is about 1 million tokens, that is, about 2000 tokens (roughly 16 paragraphs) per book.

The reason for choosing our training data in such a way is twofold:

  • We hope to avoid IPR problems
  • Choosing training data in this way has a positive effect on the machine learning procedures
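
The per-book sampling described above can be illustrated with a minimal sketch. This is not the procedure actually used in the project: the function name sample_paragraphs, the whitespace tokenisation, the fixed seed, and the stopping rule (stop once the budget is reached) are all assumptions made for illustration.

```python
import random


def sample_paragraphs(paragraphs, token_budget=2000, seed=0):
    """Pick random paragraphs from one book until the token budget is reached.

    Hypothetical helper: the name, the whitespace tokenisation and the
    stopping rule are assumptions, not the project's actual procedure.
    """
    rng = random.Random(seed)          # fixed seed keeps the sample reproducible
    order = list(range(len(paragraphs)))
    rng.shuffle(order)                 # visit paragraphs in random order

    chosen, total = [], 0
    for i in order:
        if total >= token_budget:
            break
        chosen.append(i)
        total += len(paragraphs[i].split())  # crude token count (assumption)

    # Return the selected paragraphs in their original document order.
    return [paragraphs[i] for i in sorted(chosen)]


if __name__ == "__main__":
    # Toy book: 100 paragraphs of 125 whitespace tokens each.
    book = [" ".join([f"p{i}"] + ["w"] * 124) for i in range(100)]
    sample = sample_paragraphs(book, token_budget=2000, seed=1)
    print(len(sample), "paragraphs sampled")
```

With paragraphs of 125 tokens each, a budget of 2000 tokens yields 16 paragraphs, which matches the rough per-book figures given above. A book shorter than the budget is returned in full.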