Search interface

The ONCOJ is associated with a powerful user interface developed by Professor Alastair Butler in collaboration with the “Development of and linguistic research with a parsed corpus of Japanese” project at NINJAL. The interface allows you to search for strings (terminal nodes) or parts of strings, for tags (higher node labels) or parts of tags, and for relationships between defined nodes. For example, you can enter a string that corresponds to a full word such as kapi (‘shell’, ‘worth’, ‘rear’, ‘vale’, etc.), a part-of-speech level node such as ADJ, a phrase level node label such as NP (noun phrase), or a complex expression using structural conditions, such as no written phonographically and dominated by a node with the part-of-speech “copula”.

Tregex-based search function

The present corpus interface was launched in the summer of 2020, and as of 10 October 2021 is the only search interface for the online corpus. This interface can be accessed here.

This new interface has the advantage of enabling search with Tregex and TGrep2, powerful search languages that support extended regular expressions and offer a simple syntax for structural conditions and logical relations. As an example of the use of generalized node descriptions and structural conditions, consider how to find genitive-marked subjects preceding accusative-marked objects within the same clause, where the accusative marking is written phonographically. To set the scope of the search to include all of the 26 files that comprise the corpus, enter 1,26 in the Files dialog box. In the TGrep2 Search Expression box, enter the expression that corresponds to the data structure and click Search:

/SBJ/ < /GEN/ $.. (/OB/ < (/^P\b/ << (/PHON/ < wo)))

This describes a subject phrase that directly dominates a genitive-marked element and also precedes a sibling object phrase. The object phrase itself directly dominates a particle node, which dominates a phonographically written character node, which in turn directly dominates the word wo. In other words, the expression finds a genitive-marked subject that precedes an object marked by the accusative case particle wo, written phonographically. For any tree in the interface, selecting the “source” option for display will show how the data structure is organized. Together with the simple syntax of TGrep2 and some familiarity with the tag set in the corpus, this knowledge allows the user to search for virtually any pattern in the corpus.
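To make the structural conditions in the expression concrete, the following Python sketch re-implements just this one pattern by hand over a bracketed tree. It is not how the interface or TGrep2 works internally. The toy tree is invented for illustration and is not taken from the ONCOJ, its labels (NP-SBJ, P-GEN, NP-OB1, P-ACC, PHON) are simplified stand-ins that may not match the corpus tag set, and the helper names (parse, find_matches, and so on) are hypothetical.

```python
import re

def parse(text):
    """Parse a bracketed tree into (label, children) tuples; terminals are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        lab = tokens[pos]
        pos += 1
        kids = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                kids.append(node())
            else:
                kids.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (lab, kids)

    return node()

def label(n):
    return n if isinstance(n, str) else n[0]

def children(n):
    return [] if isinstance(n, str) else n[1]

def descendants(n):
    for c in children(n):
        yield c
        yield from descendants(c)

def has_child(n, pattern):
    """The '<' relation: n directly dominates a node whose label matches pattern."""
    return any(re.search(pattern, label(c)) for c in children(n))

def is_wo_object(n):
    """Checks /OB/ < (/^P\\b/ << (/PHON/ < wo)) for a single node."""
    if not re.search(r"OB", label(n)):
        return False
    for p in children(n):
        if re.search(r"^P\b", label(p)):
            for d in descendants(p):          # the '<<' relation
                if re.search(r"PHON", label(d)) and any(
                    label(c) == "wo" for c in children(d)
                ):
                    return True
    return False

def find_matches(tree):
    """Return every /SBJ/ node with a /GEN/ child and a later sibling that is a wo-object."""
    hits = []

    def walk(n):
        kids = children(n)
        for i, kid in enumerate(kids):
            if (
                re.search(r"SBJ", label(kid))
                and has_child(kid, r"GEN")
                and any(is_wo_object(s) for s in kids[i + 1:])  # the '$..' relation
            ):
                hits.append(kid)
            walk(kid)

    walk(tree)
    return hits

# Invented toy tree with simplified labels (an assumption, not ONCOJ data).
TOY = "(IP-MAT (NP-SBJ (N kimi) (P-GEN (PHON ga))) (NP-OB1 (N pana) (P-ACC (PHON wo))) (VB miru))"

if __name__ == "__main__":
    for hit in find_matches(parse(TOY)):
        print(label(hit))
```

In this sketch the regular expression ^P\b accepts labels such as P or P-ACC but rejects PP, because \b requires a word boundary immediately after the P. Swapping the order of the subject and object phrases in the toy tree yields no match, since the object must follow the subject as a sibling.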

The interface has additional features that will be of interest, including an easily viewed morpheme-by-morpheme correspondence between Old Japanese and English dictionary entries for all of the texts, and a separate Dictionary resource. This interface also has the advantage of directly accessing the most up-to-date working files of the ONCOJ database. The documentation for using the interface can be found under the “Help” button.

TGrep-lite-based search function

A limited search interface, most useful for browsing the content of the ONCOJ, had been associated with the ONCOJ since its first publication in the spring of 2018.
As of 10 October 2021, this search interface has been discontinued.

Going forward

We encourage users to become accustomed to the present interface.
Not only is it more powerful and easier to maintain into the future than the old interface, but it shares features with a growing number of corpora covering, for example, present-day Japanese and English, Japanese regional dialects, and Japanese child language development.