Split compound nouns based on a dictionary
This uses the machine-readable dictionary available at kaikki.org to drive a token filter which splits compound nouns.
The noun list can be generated by running

```sh
cargo xtask kaikki < path/to/kaikki.org-dictionary-German.json
```

which will produce `data/nouns.bin`.
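For illustration, the extraction step might look roughly like the following sketch. It assumes the kaikki.org export is JSON Lines with `word` and `pos` fields per entry; the actual `xtask` and the on-disk format of `nouns.bin` will differ (here we just write lowercased words line by line):

```rust
use std::fs::File;
use std::io::{self, BufRead, BufWriter, Write};

use serde::Deserialize;

/// The few fields of a kaikki.org entry we care about (assumed names).
#[derive(Deserialize)]
struct Entry {
    word: String,
    pos: String,
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut out = BufWriter::new(File::create("data/nouns.bin")?);
    for line in stdin.lock().lines() {
        // Each line of the export is one JSON object; skip entries
        // that do not parse or are not nouns.
        if let Ok(entry) = serde_json::from_str::<Entry>(&line?) {
            if entry.pos == "noun" {
                writeln!(out, "{}", entry.word.to_lowercase())?;
            }
        }
    }
    Ok(())
}
```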
This is then used to configure `CompoundNounTokenFilter`, which aims to find consecutive matches against this dictionary and produce additional tokens, e.g. "hochwasserereignisse" is split into "hochwasser" and "ereignisse", both during indexing and querying.
One remaining issue is the performance of debug builds: debug builds (and hence tests) are noticeably slowed down, especially if a DFA is built instead of an NFA, which matches more efficiently but is much slower to compile. I made this dependent on `#[cfg(debug_assertions)]`, i.e. only release builds get the `.dfa(true)` treatment.
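In sketch form, assuming the automaton is built with the `aho-corasick` 0.7 builder (which is where a `.dfa(true)` option exists); the `cfg!` macro stands in for the actual `#[cfg(debug_assertions)]` gate, and the function name is hypothetical:

```rust
use aho_corasick::{AhoCorasick, AhoCorasickBuilder};

fn build_noun_automaton(nouns: &[String]) -> AhoCorasick {
    let mut builder = AhoCorasickBuilder::new();
    // The DFA is faster to match with but much slower to build, so
    // only optimized (non-debug) builds opt into it.
    if !cfg!(debug_assertions) {
        builder.dfa(true);
    }
    builder.build(nouns)
}
```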
This is no longer a problem now that we use the dev-opt build profile.
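For reference, a custom Cargo profile along these lines would do the job; the name matches the one above, but the concrete settings here are assumptions, not necessarily the project's actual profile:

```toml
# Optimized code with dev-style debug assertions, so tests can afford
# the DFA construction cost without a full release build.
[profile.dev-opt]
inherits = "dev"
opt-level = 3
```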
Furthermore, I am still trying to contact the team at kaikki.org to find out whether we can use their data and what kind of attribution they would like. If this works out, we need to consider which files to put under version control so that the tests work, or whether those should use a synthetic dictionary instead.
Finally, I chose the minimal variant of the dictionary for now: the full kaikki.org dataset contains many additional forms of the given words, but including those could significantly increase the build and run time of the resulting automata. See #112 for a follow-up issue collecting options for improvement.
Closes #99