  • Split compound nouns based on dictionary · 832ff2dc
    Adam Reichold authored
    This uses the machine-readable dictionary available at kaikki.org
    to drive a token filter which splits compound nouns.
    
    The noun list can be generated by running
    
    ```console
    > cargo xtask kaikki < path/to/kaikki.org-dictionary-German.json
    ```
    
    which will produce `data/nouns.bin`.
    
    This is then used to configure `CompoundNounTokenFilter`, which aims
    to find consecutive matches against this dictionary and emit them as
    additional tokens, e.g. "hochwasserereignisse" is split into
    "hochwasser" and "ereignisse", both during indexing and querying.
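
    The splitting idea can be illustrated with a minimal sketch. It assumes
    a plain `HashSet` lookup instead of the automata built from
    `data/nouns.bin`, and the names `split_compound` and `cover` are
    hypothetical, not part of the actual filter. It prefers longer
    dictionary words and only reports a split when the whole token is
    covered by at least two consecutive matches.

    ```rust
    use std::collections::HashSet;

    /// Recursively cover `rest` with dictionary words, longest prefix first.
    /// Hypothetical helper; the real filter is driven by automata instead.
    fn cover<'t>(rest: &'t str, nouns: &HashSet<&str>, parts: &mut Vec<&'t str>) -> bool {
        if rest.is_empty() {
            // A single match is the token itself, not a compound.
            return parts.len() >= 2;
        }
        // Collect all valid UTF-8 prefix ends, then try the longest first.
        let mut ends: Vec<usize> = rest
            .char_indices()
            .map(|(idx, ch)| idx + ch.len_utf8())
            .collect();
        ends.reverse();
        for end in ends {
            let (head, tail) = rest.split_at(end);
            if nouns.contains(head) {
                parts.push(head);
                if cover(tail, nouns, parts) {
                    return true;
                }
                // Backtrack and try a shorter prefix.
                parts.pop();
            }
        }
        false
    }

    /// Split `token` into consecutive dictionary words, if possible.
    fn split_compound<'t>(token: &'t str, nouns: &HashSet<&str>) -> Option<Vec<&'t str>> {
        let mut parts = Vec::new();
        if cover(token, nouns, &mut parts) {
            Some(parts)
        } else {
            None
        }
    }

    fn main() {
        // Tiny stand-in for the noun list; the entries are illustrative only.
        let nouns: HashSet<&str> = ["hochwasser", "ereignisse", "hoch", "wasser"]
            .into_iter()
            .collect();
        println!("{:?}", split_compound("hochwasserereignisse", &nouns));
        println!("{:?}", split_compound("wasser", &nouns));
    }
    ```

    A token filter built on this idea would emit the returned parts as
    additional tokens next to the original one, so that a query for
    "hochwasser" can also match documents containing the compound.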
    
    I chose the minimal and naive variant of the dictionary for now.
    The kaikki.org dataset contains many additional forms of the given
    words, but including them could significantly increase the build and
    run time of the resulting automata.