Search interfaces

The ONCOJ is associated with two user interfaces. Both were developed by Professor Alastair Butler in collaboration with the “Development of and linguistic research with a parsed corpus of Japanese” project at NINJAL. Both allow you to search for strings (terminal nodes) or parts of strings, for tags (higher node labels) or parts of tags, and for relationships between the nodes so defined. For example, you can enter a string that corresponds to a full word, such as “kapi” (shell, worth, rear, vale, etc.); a part-of-speech level node label, such as “ADJ”; a phrase level node label, such as “NP” (noun phrase); or a complex expression using structural conditions, such as “no” written phonographically and dominated by a node with the part of speech “copula”.

Tregex based search function

A new corpus interface was launched in the summer of 2020, and we anticipate that this will become the principal search interface for the online corpus. This interface can be accessed here.

This new interface has the advantage of enabling search with Tregex, a powerful search tool that offers extended regular expressions and a simple syntax for structural conditions and logical relations. As an example of the use of generalized node descriptions and structural conditions, consider how to find genitive-marked subjects preceding accusative-marked objects within the same clause, where the accusative marking is written phonographically. The expression corresponds to the data structure in a straightforward manner:

/SBJ/ < /GEN/ $.. (/OB/ < (/^P\b/ << (/PHON/ < wo)))

This describes a subject phrase that directly dominates a genitive-marked element and precedes a sibling object phrase. That object phrase in turn directly dominates a particle node, which dominates a phonographically written character node that directly dominates the word wo. In other words, the expression finds a genitive-marked subject that precedes an object marked by the accusative case particle wo, written phonographically. For any tree in the interface, selecting the “source” option for display will show how the data structure is organized. Together with the simple syntax of Tregex and some familiarity with the tag set in the corpus, this knowledge allows the user to search for virtually any pattern in the corpus.
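To make the three relations in the expression concrete, the following is a minimal Python sketch that checks the same conditions by hand on a bracketed tree: "<" (directly dominates), "<<" (dominates at any depth) and "$.." (is a sibling of and precedes). It is not how the interface or Tregex itself is implemented. The toy tree, its labels (NP-SBJ, P-GEN, NP-OB1, P-ACC) and the placeholder words NOUN1, NOUN2 and VERB are assumptions made for illustration and are not taken from the ONCOJ data; the helper names (parse, find_matches and so on) are likewise hypothetical.

```python
import re

def parse(text):
    """Parse a bracketed tree into (label, children) tuples; a word is (word, [])."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append((tokens[pos], []))   # terminal node (a word)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def descendants(n):
    """Yield every node below n: the '<<' relation."""
    for child in n[1]:
        yield child
        yield from descendants(child)

def is_wo_particle(n):
    # /^P\b/ << (/PHON/ < wo)
    if not re.search(r"^P\b", n[0]):
        return False
    return any(
        re.search(r"PHON", d[0]) and any(c[0] == "wo" for c in d[1])
        for d in descendants(n)
    )

def is_object(n):
    # /OB/ < (particle as above): the particle must be a direct child
    return bool(re.search(r"OB", n[0])) and any(is_wo_particle(c) for c in n[1])

def is_subject(n):
    # /SBJ/ < /GEN/
    return bool(re.search(r"SBJ", n[0])) and any(re.search(r"GEN", c[0]) for c in n[1])

def find_matches(tree):
    """Return subject nodes that have a matching object as a later sibling ('$..')."""
    matches = []

    def walk(n):
        kids = n[1]
        for i, kid in enumerate(kids):
            if is_subject(kid) and any(is_object(s) for s in kids[i + 1:]):
                matches.append(kid)
            walk(kid)

    walk(tree)
    return matches

# Hypothetical toy tree; labels and words are illustrative only.
TOY_TREE = (
    "(IP (NP-SBJ (N NOUN1) (P-GEN (PHON ga)))"
    " (NP-OB1 (N NOUN2) (P-ACC (PHON wo)))"
    " (VB VERB))"
)

if __name__ == "__main__":
    print([m[0] for m in find_matches(parse(TOY_TREE))])
```

On this toy tree the sketch reports the single NP-SBJ node. If the two phrases are swapped so that the object comes first, nothing is reported, because "$.." requires the subject to precede its sibling.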

There are additional features that will be of interest, including an easily viewed morpheme-by-morpheme correspondence between Old Japanese and English dictionary entries for all of the texts, and a separate Dictionary resource. This interface also has the advantage of directly accessing the most up-to-date working files of the ONCOJ database. The documentation for using the interface can be found under the “Help” button.

TGrep-lite based search function

A limited search interface, most useful for browsing the content of the ONCOJ, has been associated with the ONCOJ since its first publication in the spring of 2018. This search interface can be accessed here. This older interface makes accessible the static data uploaded to the present website on the date of its latest official update.

The syntax for the older interface is called TGrep-lite and is fully documented within the search interface. Unfortunately, it is less powerful than Tregex, and although many of the same searches are possible with TGrep-lite, the expressions are more complicated. For example, the search described above for genitive-marked subject phrases preceding sibling accusative-marked object phrases takes this form in TGrep-lite:

([SBJ] < [GEN]) $.. ([OB] < ([^P\b] << {PHON} == wo))

In the older interface, submitting a well-formed search expression opens a results page with attestations matched with text ID numbers that double as links to the corresponding trees. Once inside the Search interface, navigation to all the available functions (Tags, Dictionary, String search, Tree search) can be done through the buttons at the top of the Search interface page. Full documentation for each function is available through the “About” buttons (e.g., “About tree search”). There is also a button for returning to the ONCOJ front page. In the older interface, once you arrive at an Analysis view, you can click the lemma ID of any item and trigger a Dictionary search which opens a search result page in the Dictionary with the entry for that item.

Going forward

In addition to the differences between search languages in the two interfaces, there are also some minor differences in the way data access is organized. Nevertheless, we encourage users to become accustomed to the newer interface. Not only is it more powerful and easier to maintain into the future, but it shares features with a growing number of corpora covering, for example, present-day Japanese and English, Japanese regional dialects, and Japanese child language development.