The Oxford-NINJAL Corpus of Old Japanese (abbreviated ONCOJ) is a long-term collaborative research project between the University of Oxford and the National Institute for Japanese Language and Linguistics, which is developing a lemmatized, parsed and comprehensively annotated digital corpus of all texts in Japanese from the Old Japanese period. The ONCOJ is supported and recognized by the British Academy as an Academy Research Project, and at NINJAL it forms part of the large project “The Construction of Diachronic Corpora and New Developments in Research on the History of Japanese”.
Old Japanese is the earliest attested stage of the Japanese language, largely the Japanese language of the Asuka and Nara periods of Japanese history (7th and 8th century AD). This is the formative literate period upon which the development of Japanese civilization is based, and these texts are of paramount importance for the study and understanding of the origins and development of the civilization of Japan, including language, writing, literature, religion, history, and culture.
The ONCOJ has been ongoing since 2011 (until 2017 under the name “The Oxford Corpus of Old Japanese (OCOJ)”). It was initially conceived of and designed as a resource for linguistic research, but it also includes features useful for work in history, literature, culture, etc. In its present version, the ONCOJ contains the full corpus of Old Japanese poetic texts, including the Man'yōshū.
This website gives the user access to the source data of the ONCOJ. At present the List of words reflects the lexicon extracted from the fifth version of the ONCOJ (Version 2021.10, released 14 October 2021, superseding Version 2021.3, released 7 March 2021). However, the Dictionary and the source data to which it links are being updated constantly to reflect improvements in analysis. Through any one of the Dictionary, Texts, or the Search interface, the user has direct access to a suite of powerful on-line search tools that enable search using virtually any aspect of the annotation, plus interfaces for downloading search results in the form of annotated data. The data presented here comprise a corpus of around 90,000 words of poetic text (at the latest count, 99,828 simple lexical items, either in isolation or in compounds, combined with 15,635 bound morphemes). The texts are fully lemmatized and are annotated for mode of writing (phonographic or logographic), morphology and syntactic structure.
The texts are accessible here in a simple form, with original script and phonemic transcription side by side, and with links to views of text in the form of constituency trees (among other options) in the Search interface. The texts may also be accessed directly through the Search interface which provides a variety of search tools and download facilities. The Search interface also includes a Dictionary with entries for words appearing in the corpus, including UIDs for lemmas, part-of-speech information, and glosses.
The website also provides a full List of words and morphemes appearing in the corpus with English glosses, and with links to the Dictionary function in the Search interface. There is also a Download page enabling access to the data for the entire corpus in several formats.
As with any annotated text corpus, there are mistakes in the ONCOJ. The corpus is ongoing work and is under continuous improvement and correction. Mistakes will be corrected as we become aware of them and as time allows. We will be grateful to be made aware of mistakes and will endeavour to eliminate mistakes with the help of users (contact). Mistakes may be found in all four main areas of annotation: lemmatization, mode of writing, morphology, and syntactic parsing, as well as in original text and transcription, and in glosses in the dictionary.
The website will be updated regularly to reflect corrections, improvements and changes in the data in the ONCOJ, and to add more user functionality. Substantial expansion and structural improvements are planned and ongoing in several areas. The source data of the corpus itself, to which users have direct access, is updated in real time as editors make changes.
The ONCOJ draws its data from critical editions of OJ texts, with texts transcribed phonemically and edited to parallel the content from corresponding items in well-known critical editions and their interpretations. Where editions differ we have in most cases followed the authority of the Nihon koten bungaku taikei (Iwanami Shoten). The OJ poetic texts are the following (showing here also the standard abbreviations for text loci, e.g., MYS for texts from the Man'yōshū).
The original texts of the corpus were written in Chinese characters, employed both phonographically and logographically. The corpus re-casts these texts in a phonemic transcription in letters of the alphabet, richly annotated with lexical, morphological and syntactic information and structure.
The text of the annotated corpus is in the form of a phonemic transcription from the original script of the source texts, in Frellesvig-Whitman notation. The following table summarizes how the distinction between kō-rui (甲類) and otsu-rui (乙類) syllables is represented in various notation systems (including Ohno Susumu's system as used for example in the Iwanami kogo jiten and the Yale system of Samuel E. Martin’s The Japanese language through time).
|Syllable type||Index notation||Ohno||Modified Mathias- Miller||Yale||Frellesvig & Whitman|
The following table presents some examples of how these different systems write words of Old Japanese, with the 'NJ' column showing the shape of the word in Modern Japanese.
|Gloss||NJ||Frellesvig & Whitman||Index notation||Yale||Modified Mathias-Miller||Ohno|
|'ear (of rice)'||ho||po||po||po||po||po|
Old Japanese writing practice employed two basic modes of writing: phonographic writing, in which Chinese characters represent OJ syllables, and logographic writing, in which Chinese characters represent OJ words or morphemes. In transcription, phonographically written text is shown in italics and logographically written text is shown in plain text; text portions which have no direct representation in writing (usually functional elements of some kind) are shown in underlined plain text.
awone ga take no
tare ka orikyemu
tatenuki nasi ni
'The moss matting / of Aonegatake / in Miyoshino /
-- who must have woven it, / with neither warp nor weft?'
[Aonegamine, Ōmine range, Nara, Japan]
Case and other particles (like genitive ga or focus so), modal extensions (like presumptive rasi) and the copula are presented as individual words and not as enclitics.
Words in a text form units (constituents) which combine with other words to form larger constituents (phrases and clauses), and ultimately whole sentences. In the Search interface, the organization of units is presented in the form of tree structures.
In the trees, text is segmented into terminal nodes (strings). Strings are separated into segments primarily by boundaries between morphemes, but within morphemes strings may also be segmented according to changes in the mode of writing. In the ONCOJ data, the mode of writing for every segment is indicated by a labeled node, the principal labels being PHON (phonographic), LOG (logographic), and NLOG (no direct representation). There is an additional category for place names, PLOG, and another for illegible items, ILL. Thus a word may be composed of one or more morphemes, and a full morpheme may in turn be composed of several segments, depending on how it is written in the original text.
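The relation between mode-of-writing labels and the display conventions described earlier (italics for phonographic text, plain text for logographic text, underlining for text with no direct representation) can be illustrated with a minimal Python sketch. The `Segment` class, the `render` function, and the use of HTML-like `<i>` and `<u>` tags are all hypothetical and are not part of the ONCOJ tools; rendering PLOG and ILL as plain text is likewise an assumption, since the display of those categories is not specified here.

```python
from dataclasses import dataclass

# Mode-of-writing labels named in the text.
MODES = {"PHON", "LOG", "NLOG", "PLOG", "ILL"}


@dataclass
class Segment:
    """One segment of a morpheme (hypothetical container, not an ONCOJ type)."""
    mode: str  # one of MODES
    text: str  # transcribed string for this segment


def render(segments):
    """Concatenate segments, marking the mode of writing of each one.

    PHON -> italics, NLOG -> underlined, LOG -> plain.
    PLOG and ILL are rendered as plain text (an assumption of this sketch).
    """
    parts = []
    for seg in segments:
        if seg.mode not in MODES:
            raise ValueError(f"unknown mode of writing: {seg.mode}")
        if seg.mode == "PHON":
            parts.append(f"<i>{seg.text}</i>")
        elif seg.mode == "NLOG":
            parts.append(f"<u>{seg.text}</u>")
        else:
            parts.append(seg.text)
    return "".join(parts)


# A made-up morpheme written in two segments with different modes
# (placeholder strings, not corpus data).
print(render([Segment("LOG", "ab"), Segment("PHON", "cd")]))  # ab<i>cd</i>
```

Keeping the mode on each segment rather than on the morpheme mirrors the point made above: a single morpheme can change mode of writing partway through.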
Immediately above the segment(s) for each full morpheme is a lemma ID. For example, L000503 appears immediately above (or “directly dominates”) the string ga. Each lemma ID corresponds to a morpheme with a specific part-of-speech. Directly dominating the lemma ID node is the part-of-speech node, which has a label specifying the part-of-speech for that morpheme (e.g., PFX = prefix; N = noun; P = particle; VB = verb; ADJ = adjective, etc.). Thus, P-CASE-GEN is the label for the genitive case particle ga with lemma ID L000503. Part-of-speech nodes can either label simplex words or they can label morphemes that combine into complex words, which in turn receive their own part of speech nodes.
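The layering just described (part-of-speech node over lemma ID over mode-of-writing segment over string) can be read off a bracketed tree with a few lines of Python. This is only a sketch under stated assumptions: it assumes a Penn-style bracket serialization of the form `(P-CASE-GEN (L000503 (PHON ga)))`, and the PHON label chosen for ga is purely illustrative. The function names `parse` and `morphemes` are hypothetical, and the actual layout of the downloadable files should be checked before relying on this.

```python
import re


def parse(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # terminal string
                pos += 1
        pos += 1  # consume ")"
        return [label, *children]

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after tree")
    return tree


def morphemes(tree):
    """Yield (part_of_speech, lemma_id, [(mode, string), ...]) per morpheme.

    A lemma node is recognised by a label of the form 'L' plus digits;
    the node directly dominating it is taken as the part-of-speech node.
    """
    label, *children = tree
    for child in children:
        if not isinstance(child, list):
            continue
        if re.fullmatch(r"L\d+", child[0]):
            segments = [(seg[0], seg[1]) for seg in child[1:]
                        if isinstance(seg, list)]
            yield (label, child[0], segments)
        else:
            yield from morphemes(child)


# Assumed serialization of the genitive particle discussed above.
tree = parse("(P-CASE-GEN (L000503 (PHON ga)))")
print(list(morphemes(tree)))
# [('P-CASE-GEN', 'L000503', [('PHON', 'ga')])]
```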
Directly dominating nodes at the word level are phrasal nodes (e.g., NP = noun phrase; PP = particle phrase; IP = inflectional phrase; CP = complementizer phrase, etc.). Labeled nodes frequently appear with extensions that specify functional information (e.g., NP-OB1 = direct object noun phrase; PP-SBJ = subject particle phrase; IP-ADV = adverbial inflectional phrase, etc.). In the trees, the structure of an inflectional phrase (clause) is flat, with no verb phrases or functional projections. Hence, local argument dependencies and modifier dependencies are defined by sisterhood to the target head. The basic format is similar to that used in the Penn Parsed Corpora of Historical English. For an overview of these principles applied to NJ, see the annotation manual in the Keyaki Treebank (Butler, et al. 2017).
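Because clauses are flat, the dependents of a head can be found simply by listing its sisters. The following Python sketch illustrates this on a schematic clause. The nested-list representation, the function names, and the tree itself are invented for illustration (the leaves are placeholders, not corpus text); only the node labels are taken from the examples above.

```python
def split_label(label):
    """Split a node label into (category, extension); extension may be ''."""
    category, _, extension = label.partition("-")
    return category, extension


def sisters_of_head(clause, head_label="VB"):
    """Return the labels of all sisters of the first daughter named head_label.

    `clause` is a nested list [label, daughter, ...]. In a flat clause,
    these sisters are the local arguments and modifiers of the head.
    """
    daughters = [d for d in clause[1:] if isinstance(d, list)]
    labels = [d[0] for d in daughters]
    if head_label not in labels:
        return []
    i = labels.index(head_label)
    return labels[:i] + labels[i + 1:]


# Schematic flat clause with placeholder leaves (not corpus data).
clause = ["IP-ADV", ["PP-SBJ", "w1"], ["NP-OB1", "w2"], ["VB", "w3"]]

print(sisters_of_head(clause))   # ['PP-SBJ', 'NP-OB1']
print(split_label("NP-OB1"))     # ('NP', 'OB1')
```

No tree traversal is needed to relate the verb to its subject and object here, which is the practical consequence of omitting verb phrases and functional projections.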
Note that line breaks are numbered from 0 to n-1, and the original text for the n-th line of a poem is included under the line break numbered n-1. In the figure above (the fifth line of MYS.7.1120), the original text for line 5 appears under line break number 4.
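The off-by-one relation between poem lines and line-break numbers can be stated as a small Python helper. The function name is hypothetical and is not part of any ONCOJ tool; it only restates the numbering rule given above.

```python
def line_break_number(line_number):
    """Return the line-break number under which the original text of a
    poem line is stored. Poem lines count from 1, line breaks from 0."""
    if line_number < 1:
        raise ValueError("poem lines are counted from 1")
    return line_number - 1


# Line 5 of a poem is found under line break number 4.
print(line_break_number(5))  # 4
```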
The corpus is currently associated with a powerful on-line search interface, introduced here, which allows the results of specific searches to be downloaded. The interface allows searches to be defined using regular expressions, node labels, and structural conditions.
The available on-line interface is a powerful and flexible tool sufficient for many research purposes, but note that the data on which it operates is updated in real time. For extended research projects it is often advisable to download a release version for use off-line. This not only provides the advanced researcher a stable set of data, but also allows for manipulation of the data to reflect analyses that are not included in the present online corpus but are necessary for a given research project. The full corpus is available for download for use with off-line search tools like Tregex (Levy and Andrew, 2006) and CorpusSearch2 (Randall, 2009).
Presentations of research results using the Oxford-NINJAL Corpus of Old Japanese should include a citation taking the general form of the example below (with appropriate modifications depending on the date of access):
National Institute for Japanese Language and Linguistics (2020) “Oxford-NINJAL Corpus of Old Japanese” http://oncoj.ninjal.ac.jp/ (accessed 31 January 2020)
This work is licensed under a Creative Commons Attribution 4.0 International License.