Adam Reichold authored
This uses the machine-readable dictionary available at kaikki.org to drive a token filter which splits compound nouns.

The noun list can be generated by running

```console
> cargo xtask kaikki < path/to/kaikki.org-dictionary-German.json
```

which will produce `data/nouns.bin`. This is then used to configure `CompoundNounTokenFilter`, which aims to find consecutive matches against this dictionary to produce additional tokens, e.g. "hochwasserereignisse" is split into "hochwasser" and "ereignisse", both during indexing and querying.

I chose the minimal and naive variant of the dictionary for now. The kaikki.org dataset contains many additional forms of the given words, but including them could significantly increase the build and run time of the resulting automata.
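To illustrate the idea of "consecutive matches", here is a minimal sketch and not the actual `CompoundNounTokenFilter`. It assumes the noun list is held in a plain `HashSet` instead of the automata built from `data/nouns.bin`, and the function name `split_compound` is hypothetical. It tries to cover the whole token with dictionary entries, preferring longer prefixes, and only reports a split if at least two parts are found.

```rust
use std::collections::HashSet;

/// Tries to cover `word` completely with consecutive dictionary entries.
/// Returns `None` unless the word decomposes into at least two known nouns.
/// NOTE: illustrative only; a `HashSet` stands in for the real automata.
fn split_compound<'a>(word: &'a str, nouns: &HashSet<&str>) -> Option<Vec<&'a str>> {
    fn go<'a>(rest: &'a str, nouns: &HashSet<&str>, parts: &mut Vec<&'a str>) -> bool {
        if rest.is_empty() {
            // A single match covering the whole word is not a compound.
            return parts.len() >= 2;
        }

        // Candidate prefix ends at char boundaries, longest prefix first.
        let mut ends: Vec<usize> = rest
            .char_indices()
            .map(|(idx, ch)| idx + ch.len_utf8())
            .collect();
        ends.reverse();

        for end in ends {
            let (head, tail) = rest.split_at(end);
            if nouns.contains(head) {
                parts.push(head);
                if go(tail, nouns, parts) {
                    return true;
                }
                // Backtrack: this prefix does not lead to a full cover.
                parts.pop();
            }
        }

        false
    }

    let mut parts = Vec::new();
    if go(word, nouns, &mut parts) {
        Some(parts)
    } else {
        None
    }
}
```

With a dictionary containing "hochwasser" and "ereignisse", calling `split_compound("hochwasserereignisse", &nouns)` yields both parts, while a token that cannot be fully covered by dictionary entries yields `None` and would be left unsplit.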
832ff2dc