Namescape: Mapping the Landscape of Names in Modern Dutch Literature is a demonstrator project granted in the third CLARIN-NL call. It runs from 15 March 2012 until 1 April 2013. Partners in the project are the Huygens Institute for the History of the Netherlands, the University of Amsterdam, and the Institute for Dutch Lexicology (CLARIN center).
Namescape summary: Recent research has conclusively proven that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names, either in the same text or in related material (the onymic landscape or “namescape”). Research on large corpora is needed to gain a better understanding of, for example, what is characteristic of a certain period, genre, author or cultural region. The data necessary for research on this scale does not exist yet. The proposed project aims to fill this need by annotating a substantial number of literary works with a rich tag set, thereby enabling the participating parties to perform their research in more depth than previously possible. Several exploratory visualization tools will help the scholar to answer old questions and uncover many more new ones, which can be addressed using the demonstrator. The main tools will be made available as CLARIN-compliant web services for use in other contexts.
From the project proposal
Research Question(s)
The research questions are: What is the usage and what are the (stylistic and narrative) functions of names in literary texts? Starting out from simple objectives like studying the relative proportion of names versus non-names in a text and the proportion of use of different types of named entities in typical literary works, quantitative overviews of names and name types will be used to draw conclusions about functions of names that could not be highlighted before in a verifiable and repeatable way. A pilot analysis of 32 Dutch novels has been done, but its findings need to be tested and expanded on a much larger corpus.
A few examples of questions emerging from earlier work on the pilot corpus serve to illustrate the type of research:
- Earlier work (van Dalen-Oskam, 2005) suggests that the ratio between the use of first names and family names is indicative of the level of intimacy in a novel.
- Another interesting element is the distribution of so-called “plot internal” versus “plot external” names. Plot external names refer to persons, places or objects outside the fiction (Obama, Buenos Aires, Lord of the Rings) which seem mostly used as characterizations of plot internal, fictional characters. Different novels and authors etc. make a different use of these plot external names and for different reasons. It would be extremely useful to be able to get a quick overview of the ratio between plot internal and plot external names in a large corpus of novels, to learn more about their possible functions.
- Quantitative study of the pilot corpus led to the identification of two functions of geographical names that have not been noticed before: a higher use of different geographical names embodied a geographical taboo in a novel, and the pronouncing of lists of geographical names functioned as a calming mantra for the main character.
The demonstrator will enable scholars to check these observations in a corpus tagged for named entities that is much larger than would otherwise have been humanly possible to analyze. The exploratory function of the visualization tools in the demonstrator is expected to lead to many more new observations, questions, and inspirations.
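Once names are tagged, the ratios mentioned in the examples above reduce to simple counts. The following is a minimal sketch in Python, assuming a hypothetical representation in which each tagged name is a (surface form, name type, plot scope) tuple. The labels `first_name`, `family_name`, `internal` and `external` are illustrative only and are not the project's annotation standard.

```python
from collections import Counter

# Hypothetical tagged names from one novel: (surface form, name type, plot scope).
# The label values are illustrative assumptions, not the project's tag set.
tagged_names = [
    ("Anna", "first_name", "internal"),
    ("Anna", "first_name", "internal"),
    ("De Vries", "family_name", "internal"),
    ("Buenos Aires", "location", "external"),
    ("Obama", "family_name", "external"),
]

def ratio(counts, numerator, denominator):
    """Return counts[numerator] / counts[denominator], or None if undefined."""
    if counts[denominator] == 0:
        return None
    return counts[numerator] / counts[denominator]

# Count name types and plot scopes separately.
type_counts = Counter(name_type for _, name_type, _ in tagged_names)
scope_counts = Counter(scope for _, _, scope in tagged_names)

first_to_family = ratio(type_counts, "first_name", "family_name")
internal_to_external = ratio(scope_counts, "internal", "external")
```

Computed per novel and then compared across authors, genres or periods, such ratios would give the kind of quantitative overview the research questions call for.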
Research Data
A corpus of 582 Dutch novels written and published between 1970 and 2009 will be used. Part of this corpus (550 novels) is available to the INL in XML format. The remaining 32 novels (some of which are English novels translated into Dutch) are in ASCII.
All of the material is based on OCR scans of the originals and may contain scanning errors. Metadata (title, year of publication, publisher, page numbers) is available but needs restructuring and completion (e.g. adding ISBNs). The metadata will be made CLARIN-compliant (CMDI).
IPR restrictions prevent the corpus as a whole, or any substantial part of it, from being made publicly available. Cf. section 7, D1. Open information repositories, such as Wikipedia, will be used for resolving named entities.
Technology
INL proposes to use its own adaptation of the Stanford named entity recognizer (Finkel et al., 2005), which handles OCR errors robustly. It will be trained and customized to handle a more fine-grained annotation of named entities. The tagged names will be mapped to real entities (named entity resolution) using the Dutch and English Wikipedia by means of an extension of the technique from (Meij et al., 2011).
In a bootstrapping cycle, the texts are tagged by a preliminary NE tagger and results are inspected and corrected in a browser-based annotation tool (already available at INL).
The demonstrator is based on an XML database system which allows for complex content and structure search (the open source eXist DB system) and a website which is built solely using standardized XML languages (RELAX NG, XSLT, XQuery). Via the database, all books will be linked to other sources using their ISBN as a key (Google Books, DBNL, Amazon.com, Wikipedia, summary (“uittreksel”) sites).
To quickly inspect the occurrence of entities in one book, in all books by an author, or in a group of books in one genre we will use the barcode browsing technique developed at UvA. This depicts a source (a book) as a column of tiny bars, each representing a paragraph. Bars containing an entity are coloured according to the entity type. This technique improves upon the well-known State of the Union visualization created by the NY Times.
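The data behind such a barcode can be sketched as follows. This is an illustrative Python sketch only: the input format (one dictionary of entity name to entity type per paragraph), the colour palette and the function name `barcode` are assumptions, and the actual UvA implementation may differ.

```python
# Hypothetical colour per entity type; the demonstrator's real palette is not specified.
COLOURS = {"person": "red", "location": "blue", "organisation": "green"}
EMPTY = "lightgrey"  # colour for paragraphs that do not contain the entity

def barcode(paragraphs, entity):
    """Return one colour per paragraph for the selected entity.

    `paragraphs` is a list with one dict per paragraph, mapping each entity
    name found in that paragraph to its entity type.
    """
    bars = []
    for mentions in paragraphs:
        entity_type = mentions.get(entity)  # None if the entity is absent
        bars.append(COLOURS.get(entity_type, EMPTY))
    return bars

# A four-paragraph "book" with two entities.
book = [
    {"Anna": "person"},
    {},
    {"Anna": "person", "Amsterdam": "location"},
    {"Amsterdam": "location"},
]
bars = barcode(book, "Amsterdam")
```

Rendering each element of `bars` as a thin horizontal bar, stacked top to bottom, gives the column described above. Placing several such columns side by side allows comparison across the books of one author or genre.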
The entities in each novel will be modeled as nodes of an undirected weighted network where the weighted edges are given by co-occurrence counts. Each network is made available as a valid GraphML (http://graphml.graphdrawing.org/) XML file. For each entity, a parsimonious language model [Hiemstra et al 2004] is created based on the lemmatized words occurring around the entity. We visualize network and language models together as described in [Kaptein et al 2009] and recently demonstrated in the attackogram of the Algemene Beschouwingen 2011 (http://debat.politiekinzicht.com) (UvA). In case of many entities and relations we first perform hierarchical cluster analysis and present the network as in the Prinsjesdag dictionary (http://www.inl.nl/over-het-inl/wat-doet-het-inl/nieuws/318-prinsjesdagwoordenboek) of the INL.
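The network construction can be illustrated with a short Python sketch that uses only the standard library. It assumes that co-occurrence is counted per paragraph, which the proposal does not specify. The function names and the key ids `label` and `w` are hypothetical, and the output is not validated against the GraphML schema here.

```python
from collections import Counter
from itertools import combinations
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def cooccurrence_counts(paragraphs):
    """Count how often each unordered pair of entities shares a paragraph."""
    counts = Counter()
    for entities in paragraphs:
        # Sorting makes (a, b) and (b, a) the same key; set() ignores repeats.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

def to_graphml(counts):
    """Serialise the weighted, undirected network as a GraphML string."""
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    ET.SubElement(root, "key", {"id": "label", "for": "node",
                                "attr.name": "label", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "w", "for": "edge",
                                "attr.name": "weight", "attr.type": "int"})
    graph = ET.SubElement(root, "graph", {"id": "G", "edgedefault": "undirected"})
    names = sorted({name for pair in counts for name in pair})
    # Use generated ids (n0, n1, ...) because entity names may contain spaces.
    ids = {name: f"n{i}" for i, name in enumerate(names)}
    for name in names:
        node = ET.SubElement(graph, "node", {"id": ids[name]})
        ET.SubElement(node, "data", {"key": "label"}).text = name
    for (a, b), weight in sorted(counts.items()):
        edge = ET.SubElement(graph, "edge", {"source": ids[a], "target": ids[b]})
        ET.SubElement(edge, "data", {"key": "w"}).text = str(weight)
    return ET.tostring(root, encoding="unicode")

# Entities found in three paragraphs of a hypothetical novel.
paragraphs = [["Anna", "Amsterdam"], ["Anna", "Amsterdam", "Piet"], ["Piet"]]
counts = cooccurrence_counts(paragraphs)
graphml = to_graphml(counts)
```

Entities that occur only alone (never together with another entity) do not appear in this sketch, since nodes are derived from the edge list. A fuller version would add them as isolated nodes.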
Description
The named entity tagging and resolution enable quantitative and repeatable research where previously only guesswork and anecdotal evidence were possible. The visualisation module will enable researchers with a less technical background to draw conclusions about functions of names in literary work and help them to explore the material in search of more interesting questions (and answers).
Users from other communities (sociolinguistics, sentiment analysis, …) will also benefit from the NE tagged data, especially since the NE recognizer will be available as a web service, enabling researchers to annotate their own research data.
Plan
Type: Demonstrator Project
Tasks:
• Data curation of corpus data
Metadata for the corpus (title, year of publication, publisher, page numbers) is available. It needs restructuring and completion and will be made CLARIN-compliant (CMDI).
Persons involved: INL, Huygens ING – KNAW
Efforts: 1 month (0.5 + 0.5)
• Development and registration of the annotation standard
The annotation standard will be largely based on a pilot corpus, already available at Huygens. The annotation standard will be registered in ISOcat.
Persons involved: Huygens ING – KNAW
Efforts: 0.5 month
Deliverable: D1
• Gold Standard
The machine learning algorithms in a NE classifier need Gold Standard training data. A part of the proposed corpus will be used as such. A bootstrapping cycle is suggested whereby untagged texts are tagged by a preliminary NE tagger and results are inspected and corrected in a browser-based annotation tool (already available at INL).
Persons involved: INL, Huygens ING – KNAW
Efforts: 2 months (1.5 + 0.5)
Technology used: browser-based annotation tool available at INL
• NE Recognizer
The tag set the Stanford NE classifier uses will be tailored to project specifications. The tagger will be trained on the Gold Standard corpus described above. The trained tagger will be made available.
Persons involved: INL
Efforts: 3 months
Technology used: INL’s adapted version of Stanford’s named entity recognizer
Deliverable: D3
• NE Recognizer as a web service with user interface
The trained NE recognizer described above will be implemented as a web service (cf. deliverable D3). Short texts or entire books can be sent to the web service, which will return a version enriched with named entity tags.
Persons involved: INL
Efforts: 1 month
Technology used: to be investigated
Deliverable: D4
• NE resolution
All recognized NEs are disambiguated and mapped to the Dutch and English Wikipedia. Each mapping is given a confidence score. The module is based on the technique from [Meij et al 2011], tuned to the Dutch language and to NEs occurring in novels.
Persons involved: UvA
Efforts: 2.5 months
Technology used: [Meij et al 2011]
Deliverable: D5
• Named Entity Lexicon
Based on the results of NE recognition and NE resolution, a lexicon will be compiled of all named entities in the corpus and their attestations (occurrences in context). The lexicon will be in LMF extended format and will be CLARIN-compliant and ISOcat registered. Cf. deliverable D2.
Persons involved: INL
Efforts: 3 months
Deliverable: D2
• Demonstrator backend and frontend (XML DB, Advanced Search interface, Linked Data, Barcode Browser, Webapplication)
See section 6.3 for more information on the XML DB and the visualization using the barcode browser. The complete Linked Data dataset will also be made available in CSV/Excel format for offline use. The technique behind the advanced search interface is based on that used for the parliamentary proceedings search interface developed within NWO PoliticalMashup and also used within the CLARIN 2 WarInParliament project. The advanced search interface will present aggregates over novels in two ways: 1) for each entity occurring in more than one novel in our corpus, we present all novels in which it occurs, together with counts; 2) we list aggregates and group them in several ways (by author, by genre). These lists give direct answers to questions like: Which authors overuse proper names within the thriller genre? Which book by W.F. Hermans contains more locations than he used on average?
Persons involved: UvA
Effort: 4 months
Technology: eXist DB, Relax NG, XSLT, XQuery, barcode browsing technique
Deliverable: D6, D7, D8
• Network Visualiser
See section 6.3 for a description of the network visualizer.
Persons involved: UvA, Huygens
Effort: 2 months (1.5 + 0.5)
Technology: diverse (see 6.3)
Deliverable: D8
• Implementation of the demonstrator within INL CLARIN center
Installation, followed by testing of the installation using tools that load-test functional behaviour and measure performance.
Persons involved: INL, UvA
Effort: 2 days UvA, 2 weeks INL.
Deliverable: D10
• Enriched Publication
Scholarly article published on a website, incorporating the data and the means of visualization.
Deliverable: D9
Persons involved: INL, Huygens ING – KNAW, UvA
Effort: 4.5 months (0.5+3+1) (0.5 pm matched by INL, 1pm matched by UvA)
• Document with demonstration scenario
Persons involved: All.
Efforts: 1 week
Deliverable: D11
• Document with description of requirements and desiderata for the CLARIN infrastructure
Persons involved: All.
Efforts: 1 week
Deliverable: D12
• Document with description of harvestable metadata
Persons involved: All
Efforts: 1 week
Deliverable: D13
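The aggregate questions named under the advanced search interface task above (for example, which book by an author contains more locations than that author uses on average) amount to grouping per-novel counts. The following is a minimal Python sketch, assuming hypothetical per-novel rows of (author, title, number of location names). The demonstrator itself is planned on eXist DB and XQuery, so this only illustrates the logic, not the implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-novel counts: (author, title, number of location names).
novels = [
    ("Author A", "Book 1", 40),
    ("Author A", "Book 2", 10),
    ("Author A", "Book 3", 25),
    ("Author B", "Book 4", 5),
]

def above_author_average(rows):
    """For each author, list the titles whose count exceeds that author's mean."""
    by_author = defaultdict(list)
    for author, title, count in rows:
        by_author[author].append((title, count))
    result = {}
    for author, books in by_author.items():
        average = mean(count for _, count in books)
        result[author] = [title for title, count in books if count > average]
    return result

outliers = above_author_average(novels)
```

Grouping by genre instead of author only requires a different key in `by_author`.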
Deliverables and Milestones
Milestones
• Milestone M1: Completion of metadata curation and ISOcat extension for metadata.
When: April 2012
• Milestone M2: All novels NE tagged.
When: June 2012
• Milestone M3: All data processing done and stored in XML Backend.
When: September 2012
• Milestone M4: Enriched publication, and final version Web Demonstrator
When: December 2012
Deliverables
• D1: Annotation standard for tagging relevant features of named entities in literary text
Document, XML schema and ISOcat registration of the categories used.
• D2: NE lexicon
Due to IPR issues, a full version of the corpus enriched with named entity tags cannot be made available. Instead, a lexicon will be compiled of the named entities found throughout the entire corpus, supplemented with their attestations (occurrences in context). The lexicon will be in LMF extended format and will be CLARIN-compliant and ISOcat registered.
• D3: Trained NE recognizer
A version of the Stanford named entity recognizer as adapted for the IMPACT project by the INL, trained on the Gold Standard data set proposed in this document, will be made available on a CLARIN server at the INL.
• D4: NE recognizer as web service and application
A version of the Stanford named entity recognizer as adapted for the IMPACT project by the INL, trained on the Gold Standard data set proposed in this document, will be made available as a web service on an INL web server.
• D5: NE resolution module
• D6: eXist XML database backend containing all curated data (including linked data to third party sources).
• D7: Advanced search interface to XML data.
• D8: Visualisation module including Barcode browser for the 582 books with advanced grouping possibilities and Network visualiser
• D9: Enriched publication
• D10: Deployment of search interface to XML data and Barcode browser on INL CLARIN center
• D11: Demonstration scenario (document)
• D12: Document describing requirements and desiderata for the CLARIN infrastructure
• D13: Document with description of harvestable metadata
IPR and Ethical Issues: Risks
A full version of the corpus cannot be made publicly available due to IPR issues. Instead, an annotated lexicon will be published containing names and their attestations (which will be of a length that does not violate IPR restrictions). The NE tagging tool (Stanford tool + separate INL module) and the NE resolution tool will be publicly available.