Oxford-NINJAL Corpus of Old Japanese (ONCOJ)

The Oxford-NINJAL Corpus of Old Japanese (ONCOJ) is a long-term collaborative research project between the University of Oxford and the National Institute for Japanese Language and Linguistics (NINJAL). The project is developing a lemmatized, parsed, and comprehensively annotated digital corpus of all texts in Japanese from the Old Japanese period. The ONCOJ is supported and recognized by the British Academy as an Academy Research Project, and at NINJAL it forms part of the larger project The Construction of Diachronic Corpora and New Developments in Research on the History of Japanese.

Old Japanese is the earliest attested stage of the Japanese language, largely the Japanese language of the Asuka and Nara periods of Japanese history (7th and 8th centuries AD). This is the formative literate period upon which the development of Japanese civilization is based, and these texts are of paramount importance for the study and understanding of the origins and development of the civilization of Japan, including language, writing, literature, religion, history, and culture.

The ONCOJ has been ongoing since 2011 (until 2017 under the name “The Oxford Corpus of Old Japanese (OCOJ)”). It was initially conceived of and designed as a resource for linguistic research, but it also includes features useful for work in history, literature, culture, etc. In its present version, the ONCOJ contains the full corpus of Old Japanese poetic texts, including the Man'yōshū.

Release, features and mistakes

This website gives the user access to the source data of the ONCOJ. At present the List of words reflects the lexicon extracted from the fifth version of the ONCOJ (Version 2021.10, released 14 October 2021, superseding Version 2021.3, released 7 March 2021). However, the Dictionary and the source data to which it links are being updated constantly to reflect improvements in analysis. Through the Dictionary, the Texts, or the Search interface, the user has direct access to a suite of powerful on-line search tools that enable search using virtually any aspect of the annotation, plus interfaces for downloading search results in the form of annotated data. The data presented here comprise a corpus of around 90,000 words of poetic text (at the latest count 99,828 simple lexical items (either in isolation or in compounds) combined with 15,635 bound morphemes). The texts are fully lemmatized and have annotation for mode of writing (phonographic or logographic), morphology and syntactic parsing.

The texts are accessible here in a simple form, with original script and phonemic transcription side by side, and with links to views of text in the form of constituency trees (among other options) in the Search interface. The texts may also be accessed directly through the Search interface, which provides a variety of search tools and download facilities. The Search interface also includes a Dictionary with entries for words appearing in the corpus, including UIDs for lemmas, part-of-speech information, and glosses.

The website also provides a full List of words and morphemes appearing in the corpus with English glosses, and with links to the Dictionary function in the Search interface. There is also a Download page enabling access to the data for the entire corpus in several formats.

As with any annotated text corpus, there are mistakes in the ONCOJ. The corpus is ongoing work and is under continuous improvement and correction. Mistakes will be corrected as we become aware of them and as time allows. We will be grateful to be made aware of mistakes and will endeavour to eliminate mistakes with the help of users (contact). Mistakes may be found in all four main areas of annotation: lemmatization, mode of writing, morphology, and syntactic parsing, as well as in original text and transcription, and in glosses in the dictionary.

The website will be updated regularly to reflect corrections, improvements and changes in the data in the ONCOJ, and to add more user functionality. Substantial expansion and structural improvements are planned and ongoing in several areas. The source data of the corpus itself, to which users have direct access, is updated in real time as editors make changes.

Old Japanese source texts

The ONCOJ draws its data from critical editions of OJ texts, with texts transcribed phonemically and edited to parallel the content from corresponding items in well-known critical editions and their interpretations. Where editions differ we have in most cases followed the authority of the Nihon koten bungaku taikei (Iwanami Shoten). The OJ poetic texts are the following (showing here also the standard abbreviations for text loci, e.g., MYS for texts from the Man'yōshū).

Form of the data

The original texts of the corpus were written in Chinese characters, employed both phonographically and logographically. The corpus re-casts these texts in a phonemic transcription in letters of the alphabet, richly annotated with lexical, morphological and syntactic information and structure.

Transliteration

The text of the annotated corpus is in the form of a phonemic transcription from the original script of the source texts, in Frellesvig-Whitman notation. The following table summarizes how the distinction between kō-rui (甲類) and otsu-rui (乙類) syllables is represented in various notation systems (including Ohno Susumu's system as used for example in the Iwanami kogo jiten and the Yale system of Samuel E. Martin’s The Japanese language through time).

Syllable type   Index notation   Ohno   Modified Mathias-Miller   Yale   Frellesvig & Whitman
Kō-rui          Ci1              Ci     Cî                        Cyi    Ci
Otsu-rui        Ci2              Cï     Cï                        Ciy    Cwi
Neutral         Ci               Ci     Ci                        Ci     Ci
Kō-rui          Ce1              Ce     Cê                        Cye    Cye
Otsu-rui        Ce2              Cë     Cë                        Cey    Ce
Neutral         Ce               Ce     Ce                        Ce     Ce
Kō-rui          Co1              Co     Cô                        Cwo    Cwo
Otsu-rui        Co2              Cö     Cö                        Co     Co
Neutral         Co               Co     Co                        Co     Co

The following table presents some examples of how these different systems write words of Old Japanese, with the 'NJ' column showing the shape of the word in Modern Japanese.

Gloss             NJ    Frellesvig & Whitman   Index notation   Yale   Modified Mathias-Miller   Ohno
'sun'             hi    pi                     pi1              pyi    pî                        pi
'fire'            hi    pwi                    pi2              piy    pï                        pï
'blood'           chi   ti                     ti               ti     ti                        ti
'woman'           me    mye                    me1              mye    mê                        me
'eye'             me    me                     me2              mey    më                        më
'hand'            te    te                     te               te     te                        te
'child'           ko    kwo                    ko1              kwo    kô                        ko
'this'            ko    ko                     ko2              ko     kö                        kö
'ear (of rice)'   ho    po                     po               po     po                        po
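
The correspondence between index notation and Frellesvig & Whitman notation in the tables above is regular enough to express as a small conversion routine. The following is a minimal sketch, not part of the ONCOJ tooling: the function name index_to_fw and the simple syllable pattern are assumptions made for illustration, and only the i, e, and o vowels carry a kō-rui/otsu-rui index.

```python
import re

# Vowel spellings read off the notation tables:
# index notation (vowel + 1 or 2) -> Frellesvig & Whitman spelling.
INDEX_TO_FW = {
    "i1": "i",  "i2": "wi",
    "e1": "ye", "e2": "e",
    "o1": "wo", "o2": "o",
}

# Assumed syllable shape: optional non-vowel letters, one vowel, optional index digit.
SYLLABLE = re.compile(r"([^aiueo\d]*)([aiueo])([12]?)")


def index_to_fw(word: str) -> str:
    """Convert a word in index notation (e.g. 'pi2') to Frellesvig & Whitman notation."""
    def convert(match):
        onset, vowel, index = match.groups()
        # Neutral syllables (no index digit) keep their vowel unchanged.
        return onset + INDEX_TO_FW.get(vowel + index, vowel)

    return SYLLABLE.sub(convert, word)
```

With this sketch, index_to_fw("pi2") gives "pwi" ('fire') and index_to_fw("ko1") gives "kwo" ('child'), matching the example table; neutral forms such as "ti" pass through unchanged.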

Modes of writing

Old Japanese writing practice employed two basic modes of writing: phonographic writing, in which Chinese characters represent OJ syllables, and logographic writing, in which Chinese characters represent OJ words or morphemes. In transcription, phonographically written text is shown in italics and logographically written text is shown in plain text; text portions which have no direct representation in writing (usually functional elements of some kind) are shown in underlined plain text.

 

三芳野之
miyosinwo no
青根我峯之
awone ga take no
蘿席
kokemusiro
誰将織
tare ka orikyemu
經緯無二
tatenuki nasi ni

 

'The moss matting / of Aonegatake / in Miyoshino /
-- who must have woven it, / with neither warp nor weft?'

(MYS.7.1120)

 

[Aonegamine, Ōmine range, Nara, Japan]

Morphological analysis

The morphological analysis in the ONCOJ follows the framework set out in A history of the Japanese language (Bjarke Frellesvig, Cambridge University Press, 2010). It differs from the kokugogaku tradition in some respects, mainly in positing full inflectional paradigms for verbs (and inflecting auxiliary suffixes) which include forms which in the kokugogaku tradition are divided into a verb ‘form’ (e.g., mizenkei, ren'yōkei, izenkei, etc.) and a ‘particle’ (e.g., ba, do, na, te, etc.). Thus, a form like akuredo ‘although it dawns’ is presented as the concessive form of ake- ‘to dawn’ and not as the izenkei of aku followed by the particle do.

Case and other particles (like genitive ga or focus so), modal extensions (like presumptive rasi) and the copula are presented as individual words and not as enclitics.

Constituency tree structure

Words in a text form units (constituents) which combine with other words to form larger constituents (phrases and clauses), and ultimately whole sentences. In the Search interface, the organization of units is presented in the form of tree structures.

In the trees, text is segmented into terminal nodes (strings). Strings are separated into segments primarily at boundaries between morphemes, but within a morpheme a string may be segmented further where the mode of writing changes. In the ONCOJ data, the mode of writing of every segment is indicated by a labeled node, the principal labels being PHON (phonographic), LOG (logographic), and NLOG (no direct representation). There is an additional category for place names, PLOG, and another for illegible items, ILL. Thus a word may be composed of one or more morphemes, and a full morpheme may in turn be composed of several segments, depending on how it is written in the original text.

Immediately above the segment(s) for each full morpheme is a lemma ID. For example, L000503 appears immediately above (or “directly dominates”) the string ga. Each lemma ID corresponds to a morpheme with a specific part-of-speech. Directly dominating the lemma ID node is the part-of-speech node, which has a label specifying the part-of-speech for that morpheme (e.g., PFX = prefix; N = noun; P = particle; VB = verb; ADJ = adjective, etc.). Thus, P-CASE-GEN is the label for the genitive case particle ga with lemma ID L000503. Part-of-speech nodes can either label simplex words or they can label morphemes that combine into complex words, which in turn receive their own part of speech nodes.

Directly dominating nodes at the word level are phrasal nodes (e.g., NP = noun phrase; PP = particle phrase; IP = inflectional phrase; CP = complementizer phrase, etc.). Labeled nodes frequently appear with extensions that specify functional information (e.g., NP-OB1 = direct object noun phrase; PP-SBJ = subject particle phrase; IP-ADV = adverbial inflectional phrase, etc.). In the trees, the structure of an inflectional phrase (clause) is flat, with no verb phrases or functional projections. Hence, local argument dependencies and modifier dependencies are defined by sisterhood to the target head. The basic format is similar to that used in the Penn Parsed Corpora of Historical English. For an overview of these principles applied to NJ, see the annotation manual in the Keyaki Treebank (Butler et al. 2017).
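
The layering described above (segment, then lemma ID, then part of speech, then phrase) can be illustrated with a short sketch that reads a Penn-style bracketed tree and lists its morphemes. The bracketed string below is an assumed serialization written for this example, not a verbatim extract from the corpus: only the genitive particle ga with lemma ID L000503 and label P-CASE-GEN comes from the description above, while the noun, its lemma ID L999999, the segment labels, and the function names are hypothetical.

```python
import re

# Hypothetical example tree; only (P-CASE-GEN (L000503 ... ga)) follows the text above.
EXAMPLE = "(PP (NP (N (L999999 (LOG awone)))) (P-CASE-GEN (L000503 (PHON ga))))"

LEMMA_ID = re.compile(r"L\d+$")  # lemma ID labels such as L000503


def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            leaf = tokens[pos]
            pos += 1
            return leaf
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(node())
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def morphemes(tree, parent_label=""):
    """Return (part of speech, lemma ID, [(mode, string), ...]) for every lemma node."""
    if isinstance(tree, str):
        return []
    label, children = tree
    if LEMMA_ID.match(label):
        # Each child of a lemma node is a mode-of-writing node over its string(s).
        segments = [(mode, " ".join(strings)) for mode, strings in children]
        return [(parent_label, label, segments)]
    found = []
    for child in children:
        found.extend(morphemes(child, label))
    return found
```

Calling morphemes(parse(EXAMPLE)) yields one entry per morpheme, pairing the label of the directly dominating part-of-speech node with the lemma ID and its written segments.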

[Figure: MYS.7.1120, line 5]

Note that line breaks are numbered from 0 to n-1, and the original text for the n-th line of a poem is included under the line break numbered n-1. In the figure above (the fifth line of MYS.7.1120), the original text for line 5 appears under line break number 4#.

Using the corpus for advanced research

The corpus is currently associated with a powerful on-line search interface, introduced here, which allows the results of specific searches to be downloaded. The interface allows searches to be defined using regular expressions, node labels, and structural conditions.
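
To give a sense of what a search combining a regular expression on node labels with a structural condition involves, here is a minimal off-line sketch. It is a stand-in written for illustration and does not reproduce the query syntax of the ONCOJ Search interface or of any existing tool; the Node class, the search function, the bracketed input format, and the lemma ID L999999 are all assumptions.

```python
import re

# Hypothetical bracketed tree used only to exercise the sketch.
EXAMPLE = "(PP (NP (N (L999999 (LOG awone)))) (P-CASE-GEN (L000503 (PHON ga))))"


class Node:
    """A labeled tree node; terminal strings are stored as childless nodes."""

    def __init__(self, label):
        self.label = label
        self.children = []

    def descendants(self):
        """Yield every node below this one, in document order."""
        for child in self.children:
            yield child
            yield from child.descendants()


def parse(text):
    """Build a tree under an artificial ROOT node from bracketed text."""
    root = Node("ROOT")
    stack = [root]
    for token in re.findall(r"\(|\)|[^\s()]+", text):
        if token == "(":
            stack.append(None)                 # placeholder until the label arrives
        elif token == ")":
            stack.pop()
        elif stack[-1] is None:                # first token after "(" is the label
            node = Node(token)
            stack[-2].children.append(node)
            stack[-1] = node
        else:                                  # any other token is a terminal string
            stack[-1].children.append(Node(token))
    return root


def search(root, label, dominates=None):
    """Return nodes whose label fully matches the regex `label` and which,
    if `dominates` is given, dominate a node whose label matches it."""
    hits = []
    for node in root.descendants():
        if not re.fullmatch(label, node.label):
            continue
        if dominates is None or any(
            re.fullmatch(dominates, d.label) for d in node.descendants()
        ):
            hits.append(node)
    return hits
```

For example, search(parse(EXAMPLE), r"P-.*", dominates="PHON") returns the particle nodes that dominate a phonographically written segment, which in this example tree is the single P-CASE-GEN node.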

The available on-line interface is a powerful and flexible tool sufficient for many research purposes, but note that the data on which it operates is updated in real time. For extended research projects it is often advisable to download a release version for use off-line. This not only provides the advanced researcher a stable set of data, but also allows for manipulation of the data to reflect analyses that are not included in the present online corpus but are necessary for a given research project. The full corpus is available for download for use with off-line search tools like Tregex (Levy and Andrew, 2006) and CorpusSearch2 (Randall, 2009).

Attribution

Presentations of research results using the Oxford-NINJAL Corpus of Old Japanese should include a citation taking the general form of the example below (with appropriate modifications depending on the date of access):

National Institute for Japanese Language and Linguistics (2020) “Oxford-NINJAL Corpus of Old Japanese” http://oncoj.ninjal.ac.jp/ (accessed 31 January 2020)

Terms of use

This work is licensed under a Creative Commons Attribution 4.0 International License.
