THE OXFORD-NINJAL CORPUS OF OLD JAPANESE
The Oxford-NINJAL Corpus of Old Japanese (abbreviated ONCOJ) is a long-term collaborative research project between the University of Oxford and the National Institute for Japanese Language and Linguistics (NINJAL), which is developing a lemmatized, parsed and comprehensively annotated digital corpus of all texts in Japanese from the Old Japanese period. The ONCOJ is supported and recognized by the British Academy as an Academy Research Project, and at NINJAL it forms part of the large project The Construction of Diachronic Corpora and New Developments in Research on the History of Japanese.
Old Japanese is the earliest attested stage of the Japanese language, largely the Japanese language of the Asuka and Nara periods of Japanese history (7th and 8th century AD). This is the formative literate period upon which the development of Japanese civilization is based, and these texts are of paramount importance for the study and understanding of the origins and development of the civilization of Japan, including language, writing, literature, religion, history, and culture.
The ONCOJ has been ongoing since 2011 (until 2017 under the name “The Oxford Corpus of Old Japanese (OCOJ)”). It was initially conceived of and designed as a resource for linguistic research, but it also includes features useful for work in history, literature, culture, etc. In its present version, the ONCOJ contains the full corpus of Old Japanese poetic texts, including the Man’yōshū.
Release, features and mistakes
This website presents the fourth public release of the ONCOJ (Version 2021.3, released 7 March 2021, superseding Version 2020.1, released 31 January 2020), alongside a suite of powerful on-line search tools that enable search using virtually any aspect of the annotation, plus interfaces for downloading search results in the form of annotated data. The data presented here reflect the state of the corpus on 6 March 2021, with around 90,000 words of poetic text (at the latest count 99,828 simple lexical items (either in isolation or in compounds) combined with 15,635 bound morphemes). The texts are fully lemmatized and have annotation for mode of writing (phonographic or logographic), morphology and syntactic parsing.
The texts can be accessed from here in a simple form, with original script and phonemic transcription side by side, and with links to views of text in the form of constituency trees (among other options) in the Search interface. The texts may also be accessed directly through the Search interface which provides a variety of search tools and download facilities. Where available, links to corresponding texts in the Nara Period Series of the NINJAL Corpus of Historical Japanese are provided in the constituency tree view in the Search interface. The website also provides a full List of words and morphemes appearing in the corpus with English glosses, and with lemma ID numbers that can be entered into the Search interface. There is also a Download page explaining how to access the data for the entire corpus in several ways.
As with any annotated text corpus, there are mistakes in the ONCOJ. The corpus is ongoing work and is under continuous improvement and correction. Mistakes will be corrected as we become aware of them and as time allows. We will be grateful to be made aware of mistakes and will endeavour to eliminate mistakes with the help of users (contact). Mistakes may be found in all four main areas of annotation: lemmatization, mode of writing, morphology, and syntactic parsing, as well as in original text and transcription, and in glosses in the dictionary.
The website will be updated regularly to reflect corrections, improvements and changes in the data in the ONCOJ, and to add more user functionality. Substantial expansion and structural improvements are planned and ongoing in several areas.
Old Japanese source texts
The ONCOJ draws its data from critical editions of OJ texts, with texts transcribed phonemically and edited to parallel the content from corresponding items in well-known critical editions and their interpretations. Where editions differ we have in most cases followed the authority of the Nihon koten bungaku taikei (Iwanami Shoten). The OJ poetic texts are the following (showing here also the standard abbreviations for text loci, e.g., MYS for texts from the Man’yōshū).
- Kojiki kayō (KK; 古事記歌謡) : 112 poems; 2,527 words. Compiled 712 CE
- Nihon shoki kayō (NSK; 日本書紀歌謡) : 133 poems; 2,444 words. Compiled 720 CE
- Fudoki kayō (FK; 風土記歌謡) : 20 poems; 271 words. Compiled 730s CE
- Bussokuseki-ka (BS; 仏足石歌) : 21 poems; 337 words. Compiled after 753 CE
- Man’yōshū (MYS; 万葉集) : 4,685 poems; 83,706 words. Compiled after 759 CE
- Shoku nihongi kayō (SNK; 続日本紀歌謡) : 8 poems; 134 words. Compiled 797 CE
- Jōgū shōtoku hōō teisetsu (JSHT; 上宮聖徳法王帝説) : 4 poems; 60 words. Date unknown
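As a quick check on how the per-text figures in this list relate to the overall size of around 90,000 words quoted for the corpus, the counts can simply be added up. The following is a minimal sketch; the variable names are ours and the numbers are copied from the list above.

```python
# Figures copied from the list above: (abbreviation, poems, words).
texts = [
    ("KK", 112, 2527),
    ("NSK", 133, 2444),
    ("FK", 20, 271),
    ("BS", 21, 337),
    ("MYS", 4685, 83706),
    ("SNK", 8, 134),
    ("JSHT", 4, 60),
]

total_poems = sum(poems for _, poems, _ in texts)
total_words = sum(words for _, _, words in texts)

# Share of the word total contributed by the Man'yōshū alone.
mys_words = dict((abbr, words) for abbr, _, words in texts)["MYS"]
mys_share = mys_words / total_words

print(total_poems)   # 4983
print(total_words)   # 89479
print(f"{mys_share:.1%}")
```

The listed word counts thus sum to 89,479, which is consistent with the rounded figure of around 90,000 words, and the Man'yōshū accounts for by far the largest part of that total.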
Form of the data
The original texts of the corpus were written in Chinese characters, employed both phonographically and logographically. The corpus re-casts these texts in a phonemic transcription in letters of the alphabet, richly annotated with lexical, morphological and syntactic information and structure.
The text of the annotated corpus is in the form of a phonemic transcription from the original script of the source texts, in Frellesvig-Whitman notation. The following table summarizes the differences in the way the distinction between kō-rui (甲類) and otsu-rui (乙類) syllables is represented in various notation systems (including Ohno Susumu’s system as used for example in the Iwanami kogo jiten and the Yale system of Samuel E. Martin’s The Japanese language through time).
The following table presents some examples of how these different systems write words of Old Japanese, with the ‘NJ’ column showing the shape of the word in Modern Japanese.
|Gloss|NJ|Frellesvig & Whitman|Index notation|Yale|Modified Mathias-Miller|Ohno|
|---|---|---|---|---|---|---|
|‘ear (of rice)’|ho|po|po|po|po|po|
Modes of writing
Old Japanese writing practice employed two basic modes of writing: phonographic writing, in which Chinese characters represent OJ syllables, and logographic writing, in which Chinese characters represent OJ words or morphemes. In transcription, phonographically written text is shown in italics and logographically written text is shown in plain text; text portions which have no direct representation in writing (usually functional elements of some kind) are shown in underlined plain text.
awone ga take no
tare ka orikyemu
tatenuki nasi ni
“The moss matting / of Aonegatake / in Miyoshino / — who must have woven it, / with neither warp nor weft?” (MYS.7.1120)
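The display convention described above (italics for phonographic text, plain text for logographic text, underlined plain text for material with no direct representation) can be sketched as a small rendering function. This is only an illustration: the HTML tags, the function name `render`, and in particular the mode labels assigned to the example words are our assumptions, not the corpus's actual annotation of MYS.7.1120.

```python
from html import escape

# Markup per mode of writing, following the display convention described above.
# PLOG and ILL are deliberately left out: how they are displayed is not stated here.
MARKUP = {
    "PHON": "<i>{}</i>",  # phonographic: italics
    "LOG": "{}",          # logographic: plain text
    "NLOG": "<u>{}</u>",  # no direct representation: underlined plain text
}


def render(words):
    """Render a list of words, each a list of (mode, text) segments, as HTML."""
    return " ".join(
        "".join(MARKUP[mode].format(escape(text)) for mode, text in word)
        for word in words
    )


# HYPOTHETICAL segmentation and mode labels, for illustration only.
example = [
    [("LOG", "tate"), ("LOG", "nuki")],
    [("PHON", "nasi")],
    [("NLOG", "ni")],
]

print(render(example))  # tatenuki <i>nasi</i> <u>ni</u>
```

Segments belonging to the same word are joined without a space, so a word written in more than one mode still appears as a single word in the rendered line.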
The morphological analysis in the ONCOJ follows the framework set out in A history of the Japanese language (Bjarke Frellesvig, Cambridge University Press, 2010). It differs from the kokugogaku tradition in some respects, mainly in positing full inflectional paradigms for verbs (and inflecting auxiliary suffixes) which include forms which in the kokugogaku tradition are divided into a verb ‘form’ (e.g., mizenkei, ren’yōkei, izenkei, etc.) and a ‘particle’ (e.g., ba, do, na, te, etc.). Thus, a form like akuredo ‘although it dawns’ is presented as the concessive form of ake- ‘to dawn’ and not as the izenkei of aku followed by the particle do.
Case and other particles (like genitive ga or focus so), modal extensions (like presumptive rasi) and the copula are presented as individual words and not as enclitics.
Constituency tree structure
Words in a text form units (constituents) which combine with other words to form larger constituents (phrases and clauses), and ultimately whole sentences. In the Search interface, the organization of units is presented in the form of tree structures.
In the trees, text is segmented into terminal nodes (strings). Strings are separated primarily by boundaries between morphemes, but within morphemes strings may also be segmented according to changes in the mode of writing. In the ONCOJ data, the mode of writing for every string is indicated using labeled nodes, the principal ones being PHON (phonographic), LOG (logographic), and NLOG (no direct representation). There is an additional category for place names, PLOG, and another for illegible items, ILL. Thus a word may be composed of one or more morphemes, and a full morpheme may in turn be composed of several segments, depending on how it is written in the original text.
Immediately above (or “directly dominating”) the nodes indicating the mode of writing for the segment(s) of each full morpheme is a node whose label specifies the lemma ID number for that morpheme. Directly dominating that is a node whose label specifies the part of speech for that morpheme (e.g., PFX = prefix; N = noun; P = particle; VB = verb; ADJ = adjective). So, for example, #L000503 appears above the mode of writing node for the terminal node ga and under the part of speech node P-CASE-GEN. Part of speech nodes can label either simplex words or morphemes that combine into complex words with their own part of speech nodes.
Immediately dominating nodes at the word level are phrasal nodes (e.g., NP = noun phrase; PP = particle phrase; IP = inflectional phrase; CP = complementizer phrase). Labeled nodes frequently appear with extensions that specify functional information (e.g., NP-OB1 = direct object noun phrase; PP-SBJ = subject particle phrase; IP-ADV = adverbial inflectional phrase). In the trees, the structure of an inflectional phrase (clause) is flat, with no verb phrases or functional projections. Hence, local argument dependencies and modifier dependencies are defined by sisterhood to the target head. The basic format is similar to that used in the Penn Parsed Corpora of Historical English. For an overview of these principles applied to NJ, see the annotation manual in the Keyaki Treebank (Butler et al. 2017).
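The layering described above (part of speech node, then lemma ID node, then mode of writing nodes, then strings) can be made concrete with a small sketch that reads a Penn-style bracketed tree and lists its morphemes. The bracketed string below is a hand-made illustration, not an excerpt from the corpus files: the exact file serialization, the lemma ID `#L999999`, the noun and its split into two segments are all assumptions. Only the labels P-CASE-GEN, #L000503, the mode labels and the string ga come from the description above.

```python
import re

MODES = {"PHON", "LOG", "NLOG", "PLOG", "ILL"}


def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # terminal string
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def morphemes(tree, parent_label=None):
    """Yield (part of speech, lemma ID, [(mode, string), ...]) per morpheme."""
    label, children = tree
    if children and all(isinstance(c, tuple) and c[0] in MODES for c in children):
        # This node is a lemma ID node; its parent is the part of speech node.
        yield parent_label, label, [(c[0], c[1][0]) for c in children]
        return
    for child in children:
        if isinstance(child, tuple):
            yield from morphemes(child, label)


# HYPOTHETICAL tree: '#L999999' and the noun segmentation are invented.
sample = "(PP (NP (N (#L999999 (LOG awo) (PHON ne)))) (P-CASE-GEN (#L000503 (PHON ga))))"

for pos_label, lemma_id, segments in morphemes(parse(sample)):
    print(pos_label, lemma_id, segments)
```

Under these assumptions the second morpheme is reported as `P-CASE-GEN #L000503 [('PHON', 'ga')]`, mirroring the example in the text, while the invented noun shows how a single morpheme can span several mode of writing segments.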
Note that in a poem of n lines, line breaks are numbered from 0 to n-1, and the original text for the k-th line is included under the line break numbered k-1. In the figure above (the fifth line of MYS.7.1120), the original text for line 5 appears under line break number 4.
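The off-by-one relation between poem lines and line break numbers is easy to get wrong when processing downloaded data, so here is a minimal sketch of the conversion. The function names are ours and purely illustrative.

```python
def line_break_number(line_number: int) -> int:
    """Return the line break number under which a 1-based poem line is stored."""
    if line_number < 1:
        raise ValueError("poem lines are counted from 1")
    return line_number - 1


def line_break_numbers(n_lines: int) -> list[int]:
    """Return all line break numbers for a poem of n_lines lines (0 to n-1)."""
    return list(range(n_lines))


print(line_break_number(5))   # 4, as for the fifth line of MYS.7.1120
print(line_break_numbers(5))  # [0, 1, 2, 3, 4]
```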
Using the corpus for advanced research
The corpus is currently associated with a powerful on-line search interface, introduced here, which allows the results of specific searches to be downloaded. The interface allows searches to be defined using regular expressions, node labels, and structural conditions.
The new interface uses Tregex (Levy and Andrew, 2006) and TGrep2 as search engines. Tregex is a language in the TGrep family having considerable precision and power, allowing the simple expression of basic logical relations, the use of extended regular expressions, and the ability to label groups and back-reference named nodes. In this interface the user also has control over the range of data to be searched. This interface has the additional advantage of directly accessing the most up-to-date working files of the ONCOJ database.
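To give a feel for what a search combining regular expressions, node labels and a structural condition does, the following Python sketch finds every node whose label matches one pattern and which immediately dominates a child whose label matches another. It is not Tregex and not the interface's own code. It roughly corresponds to what we would expect a Tregex pattern such as `/^PP/ < /^P-CASE-GEN$/` to express. The tree is hypothetical and simplified (lemma ID and mode of writing nodes are omitted), and the function names are ours.

```python
import re


def subtrees(tree):
    """Yield a (label, children) node and all nodes below it, top-down."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def immediately_dominates(tree, parent_pattern, child_pattern):
    """Return nodes matching parent_pattern that have a direct child matching child_pattern."""
    parent_re = re.compile(parent_pattern)
    child_re = re.compile(child_pattern)
    return [
        node
        for node in subtrees(tree)
        if parent_re.search(node[0])
        and any(isinstance(c, tuple) and child_re.search(c[0]) for c in node[1])
    ]


# HYPOTHETICAL, simplified clause built from labels mentioned in the text.
tree = (
    "IP-ADV",
    [
        ("PP-SBJ", [("NP", [("N", ["awone"])]), ("P-CASE-GEN", ["ga"])]),
        ("VB", ["ori"]),
    ],
)

hits = immediately_dominates(tree, r"^PP", r"^P-CASE-GEN$")
print([node[0] for node in hits])  # ['PP-SBJ']
```

Because the clause structure in the corpus is flat, many useful conditions can be stated in this way as relations between a node and its direct children or between sister nodes.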
An earlier interface gave access to the static data published on this website using string searches and TGrep-lite, a version of the TGrep family of search languages with less expressive power and a slightly more complicated syntax. Unfortunately, the design of this interface made it difficult to maintain, and its use was discontinued on 10 October 2021. We hope that users will take advantage of the newer interface.
The available on-line interface is a powerful and flexible tool sufficient for many research purposes. The full corpus is also available for download for use with off-line search tools like local installations of Tregex (Levy and Andrew, 2006) and CorpusSearch2 (Randall, 2009).
This work is licensed under a Creative Commons Attribution 4.0 International License.
Presentations of research results using the Oxford-NINJAL Corpus of Old Japanese should include a citation taking the general form of the example below (with appropriate modifications depending on the version and the date of access):
National Institute for Japanese Language and Linguistics (2020) “Oxford-NINJAL Corpus of Old Japanese” (Version 2020.1) https://oncoj.ninjal.ac.jp/ (accessed 31 January 2020)