Oxford NINJAL Corpus of Old Japanese (ONCOJ)


As of March 2022, all data accessible from the online Search Interface is derived directly from the source data as it is updated in real time. Updates for webpages, new functionalities, etc. will be described here as they are published.

Until March 2022, the data for this corpus was presented in periodic updates. The first update took place on 21 September 2018, replacing Version 2018.3. In addition to significant improvements in lemmatization and the correction of mistakes in text, analysis or tagging, the main systematic changes introduced in Version 2018.9 are:

- The format of lemma ID numbers has been uniformized, as "l0xxxxx".

- A phrase tag has been created for nominalized clauses: "IP-NMZ". These are nominal phrases and often their role is further specified as "SBJ" = subject, "OB1" = object, "PRD" = nominal predicate, etc. Nominalized clauses generally have a predicate in the adnominal ("ADN" or "ADC") or nominal ("NML") form.

- Part-of-speech tags have been introduced for pro-forms ("PRO-N" = pronoun, "PRO-ADV" = pro-adverb) and wh-forms ("WH-N" = interrogative pronoun, "WH-ADJ-STM" = interrogative adjective, "WH-ADV" = interrogative adverb, "WH-NUM" = interrogative numeral).

- Information about makura-kotoba, place names, and personal names has been moved from lemma IDs to part-of-speech tags ("MK" = makura-kotoba, "PLN" = place name, "PEN" = personal name).

- The structure of makura-kotoba has been uniformized, such that (a) each makura-kotoba is treated as one word, with part-of-speech and lemma information about constituent parts included where known, but without internal structure, and (b) makura-kotoba are tagged syntactically as "IP-EPT" = epithetical IP, a modifying constituent, but without specifying the modificational relation.

- For complex inflected forms (e.g., verbal syntagms), inflectional information has been duplicated at the highest word level, so that it is no longer necessary to look inside a complex inflected word to find its inflection type, e.g. siranu [VB-ADN [VB-STM sira] [VAX-NEG-ADN nu] ].

- Adjectives are now uniformly marked as "ADJ-STM" = adjective stem at the lowest level, regardless of the kind of adjective. These can, rarely, form an "ADJ" on their own, but are usually either compounded with nouns, e.g. opokimi [N [ADJ-STM opo] [N kimi] ], or followed by "ACP" = adjectival copula or "COP" = copula to form an "ADJ" = inflected adjective form, e.g. takaki [ADJ-ADN [ADJ-STM taka] [ACP-ADN ki] ], which here also specifies that the adjectival copula, and therefore the entire form, is in the "ADN" = adnominal form.

- The structures in which numerals occur have been simplified such that numerals combine with either an "N" = noun or a "CL" = classifier, and always take part in forming an "N" = noun, e.g. pitoywo [N [NUM pito] [N ywo] ]. Thus, "NUMCLP" = numeral classifier phrases have been eliminated.

- Use of the part-of-speech "WORD" has been reduced significantly, and it is now only used in a few cases where part of speech is completely unknown.

- A number of superfluous and/or very inconsistently implemented semantic roles have been eliminated.
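
The bracketed examples in the list above (siranu, opokimi, takaki, pitoywo) all use the same labelled bracketing, with the tag first and its constituents after it. As an illustration only, the following Python sketch reads such a string into a nested structure and then reads off the top-level tag and the surface form. The function names parse_bracketed and surface are hypothetical and are not part of any ONCOJ tooling; the sketch assumes balanced, well-formed input of the kind printed on this page.

```python
import re


def parse_bracketed(text):
    """Parse a labelled bracketing such as '[N [NUM pito] [N ywo] ]'.

    Returns nested (label, children) tuples; leaf children are plain strings.
    Hypothetical helper for illustration, assuming balanced brackets.
    """
    # Tokens are '[', ']', or any run of characters without spaces or brackets.
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "[":
            raise ValueError(f"expected '[' at token {pos}")
        pos += 1
        label = tokens[pos]  # the tag directly after '['
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())       # embedded constituent
            else:
                children.append(tokens[pos])  # word form
                pos += 1
        pos += 1  # consume the closing ']'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def surface(tree):
    """Concatenate the leaf strings of a tree, left to right."""
    _label, children = tree
    return "".join(c if isinstance(c, str) else surface(c) for c in children)


tree = parse_bracketed("[VB-ADN [VB-STM sira] [VAX-NEG-ADN nu] ]")
top_tag = tree[0]     # inflection type is available at the highest word level
form = surface(tree)  # the word as written, rebuilt from its parts
```

Because the inflection type is duplicated on the outermost bracket, top_tag can be read without descending into the word, which is the point of the change to complex inflected forms described above.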

The fourth update was released on 7 March 2021 and includes the results of the first round of hand-checked lemmatization, among other improvements.

The fifth update was released on 14 October 2021, updating the List of words to reflect the state of the corpus data at that point. The domain was changed and the functionalities of the Search Interface were increased. From this date onward, the data accessible through the online interface has been taken directly from the source data for the corpus as it is updated in real time.

In January 2023, with the publication of a fully operational Dictionary function within the Search Interface, the List of words page was removed from the present site. Over the course of 2022, the data was transformed into a table format, in which each tree is expressed as an ordered set of paths. This is now the core data format of the ONCOJ.
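
The column layout of the ONCOJ table format is not described on this page, so the following Python sketch shows only one possible reading of "an ordered set of paths": each word form is paired with the chain of tags leading from the root of its tree down to it, and the rows are kept in text order. The function name tree_to_paths, the "/" separator and the two-column row shape are all assumptions made for illustration.

```python
def tree_to_paths(tree, prefix=()):
    """Flatten a (label, children) tree into ordered (path, word) rows.

    Hypothetical illustration: the real ONCOJ table layout may differ.
    """
    label, children = tree
    here = prefix + (label,)
    rows = []
    for child in children:
        if isinstance(child, str):
            # A leaf: record the root-to-leaf tag path and the word form.
            rows.append(("/".join(here), child))
        else:
            # An embedded constituent: recurse, keeping left-to-right order.
            rows.extend(tree_to_paths(child, here))
    return rows


# takaki [ADJ-ADN [ADJ-STM taka] [ACP-ADN ki] ], written as nested tuples
tree = ("ADJ-ADN", [("ADJ-STM", ["taka"]), ("ACP-ADN", ["ki"])])
rows = tree_to_paths(tree)
```

In this sketch the order of the rows carries the word order. A real table would also need some way to tell apart sibling constituents that share a tag, for example an index on each step of the path; that detail is left out here.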