About the project

Overview

The Oxford-NINJAL Corpus of Old Japanese is a long-term project which is developing a lemmatized, parsed and comprehensively annotated corpus of all texts in Japanese from the Old Japanese period (mainly the Asuka and Nara periods of Japanese history). In its current version, the ONCOJ presents the full corpus of Old Japanese poetic texts, including the Man’yōshū. The ritualistic prose texts from the period, the Engi-shiki Norito and the Shoku-Nihongi Senmyō, are being prepared for inclusion as well. The ambition of the project is to make these important texts in Japanese from early Japanese literate civilization widely available and accessible to specialist scholars, students, and the interested general public, in a format that facilitates interactive engagement.

In this release, the texts are presented in a phonemic transcription, showing also the original script, and with extensive linguistic annotation for mode of writing (phonographic or logographic), morphology and syntactic parsing, displayed in the form of constituency trees. The texts are also lemmatized, with unique identifiers assigned to lexical items, and with cross-reference to and from a simple dictionary facility. The texts are linked to texts in the NINJAL Corpus of Historical Japanese.

The longer-term goals of the project include:

Annotation: Provision of annotation for literary, biographical, historical and other information.

Translations: The texts will be supplied with translations into English, either produced within the project, or where possible incorporating already published translations.

Dictionary: A full bilingual Old Japanese – English dictionary is being planned, alongside and as an integrated part of the corpus.

Project structure

The ONCOJ is overseen by a Project Committee which currently has the following membership:

Project Director

Bjarke Frellesvig (University of Oxford)

Committee Members

Alastair Butler (Hirosaki University)

Stephen Wright Horn (NINJAL)

Toshinobu Ogiso (NINJAL)

Peter Sells (University of York)

A brief history of the ONCOJ

The corpus was initially developed as a research tool in the University of Oxford research project Verb Semantics and Argument Realisation in Pre-Modern Japanese (VSARPJ) (2009-2014; Principal investigator: Bjarke Frellesvig), funded by the Arts and Humanities Research Council, UK. The VSARPJ project was designed and set up by Bjarke Frellesvig, Janick Wrona and Peter Sells.

The design of the corpus, the linguistic principles for annotation (using the phonological, morphophonological and morphological framework of Frellesvig’s A history of the Japanese language), the amount and detail of encoded information, the data structure, the format of encoding etc. were overall established in the early days of the VSARPJ project by Bjarke Frellesvig, Janick Wrona and Peter Sells, together with Stephen Wright Horn and Kerri L Russell, who were employed as post-doctoral researchers in the VSARPJ project, with important additional input from external members of the VSARPJ project (including Anton Antonov, Satoshi Kinsui, Tomohide Kinuhata, Yasuhiro Kondo, Matt Shibatani, Yuko Yanagida, Akira Watanabe, and John Whitman).

In 2011 the corpus was established as an independent project, the Oxford Corpus of Old Japanese (OCOJ), and in 2012 the OCOJ was adopted by the British Academy as an Academy Research Project.

The OCOJ corpus building and annotation was mainly carried out from 2009 until 2016 by Stephen Wright Horn and Kerri L Russell. Although substantial improvement has been made since then, it is to a large extent due to their dedicated work during this period that the ONCOJ has reached its current state.

Additional annotation work during this period was done by, amongst others, Anton Antonov, Yuhki King, Laurence Mann, Maria Telegina, Dan Trott and Zixi You, as well as Benjamin Cagan, Arthur Defrance, Alexander Dudok de Wit, Thomas Jo Johansen, Aimi Kuya, Katharine Kinoshita, Linda Lanz, Petter Mæhlum and Daniel Millichip.

From OCOJ to ONCOJ

From their early beginnings, both the VSARPJ and OCOJ projects enjoyed close collaboration with NINJAL and with the diachronic corpus building projects at NINJAL. NINJAL project members (including Yasuhiro Kondō, Toshinobu Ogiso and Makiro Tanaka) visited Oxford on several occasions, and Tomoaki Kōno spent part of the summer of 2016 at Oxford as a visitor to the OCOJ. Both Stephen Wright Horn and Kerri L Russell spent three-month periods at NINJAL as visiting researchers (2010 and 2011, respectively, funded through grants from the AHRC, UK), Bjarke Frellesvig was a Visiting Professor at NINJAL for six months in 2012, and Horn spent a further year at NINJAL in 2015-16 as a Research Fellow funded by the Hakuhō Foundation. Horn has since October 2016 been employed as a Researcher at NINJAL, in the Development of and linguistic research with a parsed corpus of Japanese project at NINJAL (Project leader: Prashant Pardeshi).

In 2016, more formal collaboration between the OCOJ and NINJAL was established, within NINJAL’s The Construction of Diachronic Corpora and New Developments in Research on the History of Japanese project (Project leader: Toshinobu Ogiso), and in 2017, the name of the corpus was changed to its current name, the Oxford-NINJAL Corpus of Old Japanese, to reflect the collaborative nature of the project.

The work on the project since then, and the publication and maintenance through this website, have in large part been made possible through the expertise at NINJAL in both Ogiso’s and Pardeshi’s projects, as well as through Discretionary Funds from the Director-General of NINJAL, and through the continued support of the British Academy.

The annotation work since April 2016 has been carried out by Bjarke Frellesvig and Stephen Wright Horn, with crucial help and input from Tomoaki Kōno and Alastair Butler.

Data formats

The OCOJ was originally encoded in XML, using tags which followed the conventions of the Text Encoding Initiative (TEI). In February 2018, the XML data files were converted to a Penn Historical-style bracketed format. This is now the core data format of the ONCOJ, but a revised version of the XML format will still be kept in step with the bracketed format, in a TEI-convertible form (that is, a format that can be converted to a TEI-compatible file with some simple processing). At the moment only the bracketed format is available for download, but the XML format will also become downloadable as soon as practically possible.
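For readers unfamiliar with bracketed tree formats, the following is a minimal sketch in Python of how a Penn-style bracketed tree can be read into a nested structure. The sample tree, its labels (IP-MAT, NP-SBJ, N, VB) and the function names are illustrative assumptions only; they are not taken from the ONCOJ data files and do not document the project's actual tag set or tooling.

```python
from typing import List, Union

# A tree is either a token (label or word) or a list of subtrees.
Tree = Union[str, List["Tree"]]

# Hypothetical sample in a Penn-style bracketed notation (not real ONCOJ data).
SAMPLE = "(IP-MAT (NP-SBJ (N hana)) (VB saku))"


def tokenize(text: str) -> List[str]:
    """Split a bracketed string into '(' , ')' and plain tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(text: str) -> Tree:
    """Parse one bracketed tree into nested Python lists."""
    tokens = tokenize(text)
    pos = 0

    def read() -> Tree:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            children: List[Tree] = []
            while True:
                if pos >= len(tokens):
                    raise ValueError("unbalanced brackets")
                if tokens[pos] == ")":
                    pos += 1
                    return children
                children.append(read())
        if tok == ")":
            raise ValueError("unexpected ')'")
        return tok

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after tree")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Collect the words of pre-terminal nodes of the form [label, word]."""
    if isinstance(tree, str):
        return []
    if len(tree) == 2 and all(isinstance(x, str) for x in tree):
        return [tree[1]]
    words: List[str] = []
    for child in tree:
        words.extend(leaves(child))
    return words


if __name__ == "__main__":
    print(parse(SAMPLE))
    print(leaves(parse(SAMPLE)))
```

Nested lists keep the sketch free of dependencies; established treebank tools would normally be preferable for real work with the downloadable files.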

Publication

The OCOJ was never published in full, but the (now inactive) website of the OCOJ posted a digital version of all the texts, as well as simple facilities for searching the poetic texts and displaying images of constituency trees for each poem. The data which was published online in those formats reflected the state and quality of the data in the OCOJ up until May 2015, at which point the available online data was frozen and not updated.

This data will now no longer be available as it is considerably out of date, due to the extensive work on the corpus, in particular on lemmatization and improvements in syntactic analysis, which has been undertaken especially since April 2016.

Instead, the ONCOJ is now published through this website.
