
[Figure: emores database diagram]

The aim of emores, an abductive Empirical MOrphological REaSoning engine based on the Stuttgart Finite State Transducer Tools (SFST), is to facilitate a guided brute-force attack on a specific problem of word morphology in computational linguistics: extending the lexicon from corpus data. For a particular inflected natural language, it requires a hand-crafted SFST finite state transducer and a seed lexicon covering all of the language's regular inflection classes. When fed new word forms from a corpus text, it guesses which lemmas could have generated them (induction) and which other word forms those lemmas could explain (deduction). The generated data is written to the database (see the database diagram) and analysed using SQL to measure the explanatory power of the guessed lemmas with respect to the corpora (abduction). If all possible word forms for a guessed lemma are found in the data, the lemma counts as saturated.
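The saturation check described above can be sketched with a small in-memory database. This is a minimal illustration only, not the actual emores schema or code: the table names `generated` and `corpus`, the function `explanatory_power`, and the toy English word forms are all assumptions made for this example.

```python
import sqlite3


def explanatory_power(guesses, corpus_forms):
    """Return {lemma: (found, total, saturated)} for each guessed lemma.

    guesses maps a guessed lemma to the word forms it would generate
    (the deduction step); corpus_forms are the word forms actually seen.
    The schema below is hypothetical, not the emores database layout.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE generated (lemma TEXT, form TEXT)")
    con.execute("CREATE TABLE corpus (form TEXT PRIMARY KEY)")
    con.executemany(
        "INSERT INTO generated VALUES (?, ?)",
        [(lemma, form) for lemma, forms in guesses.items() for form in forms],
    )
    con.executemany(
        "INSERT OR IGNORE INTO corpus VALUES (?)",
        [(form,) for form in corpus_forms],
    )
    # For each lemma, count how many of its distinct generated forms
    # are attested in the corpus (COUNT(c.form) skips unmatched rows).
    rows = con.execute(
        """
        SELECT g.lemma, COUNT(c.form) AS found, COUNT(*) AS total
        FROM (SELECT DISTINCT lemma, form FROM generated) AS g
        LEFT JOIN corpus AS c ON c.form = g.form
        GROUP BY g.lemma
        ORDER BY g.lemma
        """
    ).fetchall()
    con.close()
    # A lemma is saturated when every generated form was found.
    return {lemma: (found, total, found == total) for lemma, found, total in rows}


# Toy data: two competing guesses for the observed token "walked".
guesses = {
    "walk": ["walk", "walks", "walked", "walking"],
    "walke": ["walke", "walkes", "walked", "walking"],
}
corpus_forms = ["walk", "walks", "walked", "walking"]
result = explanatory_power(guesses, corpus_forms)
```

In this toy setup, the guess "walk" is saturated because all four of its forms occur in the corpus, while "walke" explains only two of its four forms and would rank lower.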

Have a look at the use case doctest for en_X or the use case doctest for de_DE to see emores in action.

[Figure: cluster dendrogram for the generative power of the word "Schritt"]

The code currently works for both languages, but the use case for de_DE is heavily restricted, because the complexity of the full German morphology would otherwise make it run impractically long. The figure is a statistical cluster dendrogram visualising the generative power of the lemmas guessed for the token "Schritt". The use case for de_DE shows that emores is capable of narrowing these 307 possible lemmas down to 8.



$Date: 2008-06-18 18:24:52 +0200 (Wed, 18 Jun 2008) $