Oxford-NINJAL Corpus of Old Japanese (ONCOJ)

About the project

Overview

The Oxford-NINJAL Corpus of Old Japanese is a long-term project which is developing a lemmatized, parsed and comprehensively annotated corpus of all texts in Japanese from the Old Japanese period (mainly the Asuka and Nara periods of Japanese history). In its current version, the ONCOJ presents the full corpus of Old Japanese poetic texts, including the Man'yōshū. The ritualistic prose texts from the period, the Engi-shiki Norito and the Shoku-Nihongi Senmyō, are being prepared for inclusion as well. The ambition of the project is to make these important texts in Japanese from early Japanese literate civilization widely available and accessible to specialist scholars, students, and the interested general public, in a format that facilitates interactive engagement.

In this corpus, the texts are presented in a phonemic transcription, showing also the original script, and with extensive linguistic annotation for mode of writing (phonographic or logographic), morphology and syntactic parsing, displayed in the form of constituency trees. The texts are also lemmatized, with unique identifiers assigned to lexical items, and with cross-reference to and from a simple dictionary facility. The texts from the Man'yōshū are linked to corresponding texts in the NINJAL Corpus of Historical Japanese.

The longer-term goals of the project include:

Annotation: Provision of annotation for literary, biographical, historical and other information.
Translations: The texts will be supplied with translations into English, either produced within the project, or where possible incorporating already published translations.
Enhanced dictionary: In addition to the part of speech information and English translations already provided in the Dictionary, detailed information about usage (example sentences), related items, and other information will be included.

A brief history of the ONCOJ

The corpus was initially developed as a research tool in the University of Oxford research project Verb Semantics and Argument Realisation in Pre-Modern Japanese (VSARPJ) (2009-2014; Principal investigator: Bjarke Frellesvig), funded by the Arts and Humanities Research Council, UK. The VSARPJ project was designed and set up by Bjarke Frellesvig, Janick Wrona and Peter Sells.

The design of the corpus, the linguistic principles for annotation (using the phonological, morphophonological and morphological framework of Frellesvig’s A history of the Japanese language), the amount and detail of encoded information, the data structure, the format of encoding etc. were overall established in the early days of the VSARPJ project by Bjarke Frellesvig, Janick Wrona and Peter Sells, together with Stephen Wright Horn and Kerri L Russell, who were employed as post-doctoral researchers in the VSARPJ project, with important additional input from external members of the VSARPJ project (including Anton Antonov, Satoshi Kinsui, Tomohide Kinuhata, Yasuhiro Kondo, Matt Shibatani, Yuko Yanagida, Akira Watanabe, and John Whitman).

In 2011 the corpus was established as an independent project, the Oxford Corpus of Old Japanese (OCOJ), and in 2012 the OCOJ was adopted by the British Academy as an Academy Research Project.

From 2009 until 2015, the OCOJ corpus building and annotation were mainly carried out by Stephen Wright Horn and Kerri L Russell. Although substantial improvements have been made since then, it is to a large extent due to their dedicated work during this period that the ONCOJ has reached its current state.

Additional annotation work during this period was done by, amongst others, Anton Antonov, Yuhki King, Laurence Mann, Maria Telegina, Dan Trott and Zixi You, as well as Benjamin Cagan, Arthur Defrance, Alexander Dudok de Wit, Thomas Jo Johansen, Aimi Kuya, Katharine Kinoshita, Linda Lanz and Petter Mæhlum.

From OCOJ to ONCOJ

From their early beginnings, both the VSARPJ and OCOJ projects enjoyed close collaboration with NINJAL and with the diachronic corpus building projects at NINJAL. NINJAL project members (including Yasuhiro Kondō, Toshinobu Ogiso and Makiro Tanaka) visited Oxford on several occasions, and Tomoaki Kōno spent part of the summer of 2016 at Oxford as a visitor to the OCOJ. Both Horn and Russell spent three-month periods at NINJAL as visiting researchers (2010 and 2011, respectively, funded through grants from the AHRC, UK), Frellesvig was a Visiting Professor at NINJAL for six months in 2012, and Horn spent a further year at NINJAL in 2015-16 as a Research Fellow funded by the Hakuhō Foundation. From October 2016 to January 2019, Horn was employed as an Adjunct Researcher at NINJAL, in the “Development of and linguistic research with a parsed corpus of Japanese” project (Project leader: Prashant Pardeshi).

In 2016, more formal collaboration between the OCOJ and NINJAL was established, within NINJAL's The Construction of Diachronic Corpora and New Developments in Research on the History of Japanese project (Project leader: Toshinobu Ogiso), and in 2017, the name of the corpus was changed to its current name, the Oxford-NINJAL Corpus of Old Japanese, to reflect the collaborative nature of the project.

The work on the project, and its publication and maintenance through this website, from 2017 until the end of 2018 was in large part made possible through the expertise at NINJAL in both Ogiso's and Pardeshi's projects. Since then, significant support has continued to be provided through Ogiso's projects, as well as through Discretionary Funds from the Director-General of NINJAL and through the continued support of the British Academy.

Since April 2016, the annotation work has been carried out by Bjarke Frellesvig, Stephen Wright Horn and Anna Sharko, with crucial help and input from Tomoaki Kōno and Alastair Butler.

Data formats

The OCOJ was originally encoded in XML, using tags which followed the conventions of the Text Encoding Initiative (TEI). In February 2018, the XML data files were converted to a Penn Historical-style bracketed format. Over the course of 2022, the data was transformed to a table format, in which each tree is expressed as an ordered set of paths. This is now the core data format of the ONCOJ. Both the bracketed format and the table format are available for download.
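As a rough illustration of how a bracketed tree can be re-expressed as an ordered set of paths, the sketch below parses a toy bracketed string and lists one root-to-terminal path per word, in left-to-right order. The tree, the function names (parse_bracketed, tree_to_paths) and the slash-separated path notation are all assumptions made for this example; the actual column layout of the ONCOJ table format is not described here and may well differ.

```python
def parse_bracketed(text):
    """Parse one bracketed tree, e.g. '(NP (N kapi))', into nested
    (label, children) tuples. Terminals are kept as plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # terminal string
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return tree


def tree_to_paths(tree, prefix=()):
    """Return an ordered list of (path, terminal) rows, one per terminal,
    where path is the chain of node labels from the root down."""
    label, children = tree
    path = prefix + (label,)
    rows = []
    for child in children:
        if isinstance(child, str):
            rows.append(("/".join(path), child))
        else:
            rows.extend(tree_to_paths(child, path))
    return rows


# Toy tree, invented for illustration; not an actual ONCOJ annotation.
toy = "(NP (N kapi) (COP (PHON no)))"
for path, terminal in tree_to_paths(parse_bracketed(toy)):
    print(path, terminal, sep="\t")
```

Because the rows are emitted in left-to-right order and each row carries its full chain of labels, the original tree shape can be recovered from the list in simple cases like this one; a real table format would need extra information (for example node indices) to distinguish sibling nodes that share a label.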

Publication

In contrast with the present ONCOJ, the OCOJ was never published in full, but the (now inactive) website of the OCOJ posted a digital version of all the texts, as well as simple facilities for searching the poetic texts and displaying images of constituency trees for each poem. The data which was published online in those formats reflected the state and quality of the data in the OCOJ up until May 2015, at which point the available online data was frozen. This data is obsolete, and access to it has been terminated.

The ONCOJ, which is now published through this website, reflects extensive improvements to the corpus, particularly in lemmatization and syntactic analysis, work which began in earnest in April 2016 and continues to the present. In addition to text and syntactic tree displays, the online interfaces allow searches based on virtually any combination of conditions over every aspect of the annotation. Annotation data for the texts has been made fully downloadable, and the work as a whole is licensed under Creative Commons Attribution 4.0.

Search Interface

The ONCOJ is associated with a powerful user interface developed in collaboration with the “Development of and linguistic research with a parsed corpus of Japanese” project at NINJAL. The interface allows you to search for strings (terminal nodes) or parts of strings, for tags (higher node labels) or parts of tags, and for relationships between defined nodes. For example, enter a string that corresponds to a full word such as kapi (which corresponds to any of the words for ‘shell’, ‘worth’, ‘rearing’, ‘vale’, etc.), or a part-of-speech level node such as ADJ, or a phrase level node label such as NP (noun phrase), or a complex expression using structural conditions, such as no written phonographically and dominated by a node with the part-of-speech of “copula”:

no > (/^PHON/ >> /^COP/)
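To make the structural condition concrete, the sketch below checks it by hand on a toy tree. It assumes the operators read as they do in TGrep2-style query languages: ">" for "is immediately dominated by" and ">>" for "is dominated at any depth by", with /^PHON/ and /^COP/ as regular expressions over node labels. The toy tree, the LOG and P labels, and the function find_matches are invented for illustration and are not part of the search interface.

```python
import re

# Toy tree as nested (label, children) tuples; terminals are plain strings.
# Labels and words are invented for this example, not taken from the corpus.
TOY_TREE = ("IP", [
    ("NP", [("N", [("PHON", ["kapi"])])]),
    ("P", [("PHON", ["no"])]),    # phonographic "no", but not under a copula
    ("COP", [("LOG", ["no"])]),   # under a copula, but not phonographic
    ("COP", [("PHON", ["no"])]),  # satisfies both conditions
])


def find_matches(tree, ancestors=()):
    """Yield the label chain (root first) for every terminal "no" whose
    parent label starts with PHON and whose parent has some ancestor
    whose label starts with COP."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            parent_ok = re.match(r"^PHON", label) is not None
            ancestor_ok = any(re.match(r"^COP", a) for a in ancestors)
            if child == "no" and parent_ok and ancestor_ok:
                yield ancestors + (label,)
        else:
            yield from find_matches(child, ancestors + (label,))


matches = list(find_matches(TOY_TREE))
print(matches)
```

Only the last of the three instances of no in the toy tree is reported: the first fails the copula condition and the second fails the phonographic condition.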

Going forward

We encourage users to become accustomed to the newer interface. Not only is it more powerful and easier to maintain into the future, but it shares features with a growing number of corpora covering, for example, present-day Japanese and English, Japanese regional dialects, and Japanese child language development. These corpora can be accessed from the Parsed Corpora Portal Main page.