Improve quality of dictionary used for splitting compounds
The quality of the splits provided by the mechanism introduced by !113 (merged) is mainly determined by the quality of the dictionary, which is currently built by naively taking all base nouns from the kaikki.org dataset.
There are many possibilities to improve this, e.g.:
- Consider stemming, e.g. when encountering "backen", also add "back" so that "Brotbackautomat" is split. (We already have stemmers in our dependency closure.)
- kaikki.org sometimes provides data on the parts of a compound, e.g. for "Brotbackautomat", and we might want to include only the parts, not the whole.
- But we have to be careful with the above, e.g. we probably do not want to split "Hochwasser" into "hoch" and "wasser", which would lose the specific meaning of the token.
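
A minimal sketch of how the three points above could fit together when building the dictionary. Everything here is an assumption for illustration: the entry shape (`word`, optional `parts`) is hypothetical and not the actual kaikki.org schema, `naive_stem` is a crude stand-in for one of the real stemmers we already depend on, and `keep_whole` is a hypothetical hand-maintained set of compounds that should stay intact.

```python
from typing import Iterable, Optional, Set


def naive_stem(word: str) -> Optional[str]:
    """Crude placeholder for a real stemmer: strip a trailing "-en"."""
    if word.endswith("en") and len(word) > 4:
        return word[:-2]
    return None


def build_dictionary(entries: Iterable[dict], keep_whole: Set[str]) -> Set[str]:
    """Build a lowercase splitting dictionary from hypothetical entries."""
    vocab: Set[str] = set()
    for entry in entries:
        word = entry["word"].lower()
        parts = [p.lower() for p in entry.get("parts", [])]
        if parts and word not in keep_whole:
            # Known compound: add only its parts, not the whole word.
            vocab.update(parts)
        else:
            # Base word, or a compound whose meaning we want to preserve.
            vocab.add(word)
    # Also add stems, so that "backen" contributes "back".
    for word in list(vocab):
        stem = naive_stem(word)
        if stem:
            vocab.add(stem)
    return vocab


entries = [
    {"word": "backen"},
    {"word": "Brotbackautomat", "parts": ["Brot", "backen", "Automat"]},
    {"word": "Hochwasser", "parts": ["hoch", "Wasser"]},
]
vocab = build_dictionary(entries, keep_whole={"hochwasser"})
```

With this input, `vocab` contains "back", "brot" and "automat" but not "brotbackautomat", and it contains "hochwasser" as a whole. Whether the splitter then actually leaves "Hochwasser" unsplit depends on how it treats whole-word matches, which would need to be checked separately.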