Split compound nouns based on a dictionary
This uses the machine-readable dictionary available at kaikki.org to drive a token filter which splits compound nouns.
The noun list can be generated by running

```sh
cargo xtask kaikki < path/to/kaikki.org-dictionary-German.json
```

which will produce `data/nouns.bin`.
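For illustration, the extraction step might look roughly like the following sketch. It assumes the kaikki.org export is JSON Lines with `word` and `pos` fields per entry; the actual `xtask` and the on-disk format of `nouns.bin` will differ (here we just write lowercased words line by line):

```rust
use std::fs::File;
use std::io::{self, BufRead, BufWriter, Write};

use serde::Deserialize;

/// The few fields of a kaikki.org entry we care about (assumed names).
#[derive(Deserialize)]
struct Entry {
    word: String,
    pos: String,
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut out = BufWriter::new(File::create("data/nouns.bin")?);
    for line in stdin.lock().lines() {
        // Each line of the export is one JSON object; skip entries
        // that do not parse or are not nouns.
        if let Ok(entry) = serde_json::from_str::<Entry>(&line?) {
            if entry.pos == "noun" {
                writeln!(out, "{}", entry.word.to_lowercase())?;
            }
        }
    }
    Ok(())
}
```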
This is then used to configure `CompoundNounTokenFilter`, which aims to find consecutive matches against this dictionary and produce additional tokens, e.g. "hochwasserereignisse" is split into "hochwasser" and "ereignisse", both during indexing and querying.
One remaining issue is the performance of debug builds: debug builds (and hence tests) are noticeably slowed down, especially if a DFA is built instead of an NFA, which matches more efficiently but is much slower to compile. I made this dependent on `#[cfg(debug_assertions)]`, i.e. only release builds get the `.dfa(true)` treatment.
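In sketch form, assuming the automaton is built with the `aho-corasick` 0.7 builder (which is where a `.dfa(true)` option exists); the `cfg!` macro stands in for the actual `#[cfg(debug_assertions)]` gate, and the function name is hypothetical:

```rust
use aho_corasick::{AhoCorasick, AhoCorasickBuilder};

fn build_noun_automaton(nouns: &[String]) -> AhoCorasick {
    let mut builder = AhoCorasickBuilder::new();
    // The DFA is faster to match with but much slower to build, so
    // only optimized (non-debug) builds opt into it.
    if !cfg!(debug_assertions) {
        builder.dfa(true);
    }
    builder.build(nouns)
}
```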
This is no longer a problem now that we use the dev-opt build profile.
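For reference, a custom Cargo profile along these lines would do the job; the name matches the one above, but the concrete settings here are assumptions, not necessarily the project's actual profile:

```toml
# Optimized code with dev-style debug assertions, so tests can afford
# the DFA construction cost without a full release build.
[profile.dev-opt]
inherits = "dev"
opt-level = 3
```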
Furthermore, I am still trying to contact the team at kaikki.org to find out whether we can use their data and what kind of attribution they would like. If this works out, we need to consider which files to put under version control so that the tests work, or whether those should use a synthetic dictionary instead.
Finally, I chose the minimal variant of the dictionary for now: the full kaikki.org dataset contains many additional forms of the given words, but including those could significantly increase the build and run time of the resulting automata. See #112 for a follow-up issue collecting options for improvement.
Closes #99