Python imports and setup:
>>> import emores
>>> from emores import config
>>> from emores.morphology import Morphology
>>> from emores.text import Text
>>> from emores.induction import Induction
>>> from emores.deduction import Deduction
>>> from emores.abduction import Abduction
This time the language must be set explicitly:
>>> conf = config.LanguageConfig()
>>> conf.set_language('de_DE')
>>> conf.language
'de_DE'
Because the SMOR morphology used for de_DE is very complex, initialisation takes a long time. To follow its progress, watch the log with tail -F /var/log/emores/emores.log.
>>> morph = Morphology(conf)
>>> morph.new('usecase_de_DE', '1.0.0')
>>> morph = None # make sure to reload the object
First, we load the morphology just stored.
>>> morph = Morphology(conf)
>>> morph.load('usecase_de_DE')
This is the same text as for the en_X example, but in the original language:
>>> text = Text(conf)
>>> text.new('Kafka', 'Die Verwandlung', 'http://www.gutenberg.org/files/22367/22367-8.zip')
If the induction (and hence the deduction and abduction) were limited to 100 tokens as in the en_X example, it would take far too long (around 50 days) to run in the test suite. It is therefore restricted to a single token that is saturable:
>>> induct = Induction(morph)
>>> induct.induce(token='Schritt')
>>> deduct = Deduction(morph)
>>> deduct.deduce()
For the abduction, we restrict the saturation and deduction_count levels accordingly to obtain the most likely lemmas. A saturation of 1.0 means that all tokens the lemma can generate have been found in the corpus; the deduction_count value of 4 is the empirical maximum observed in the unrestricted results.
>>> abduct = Abduction(morph)
>>> saturated = abduct.saturation(token='Schritt',
...     saturation=1.0, deduction_count=4)
>>> for row in saturated:
...     print row[0]
<Base_Stems>Schritt<NN><base><frei><NMasc-s/sse>
<Base_Stems>Schritt<NN><base><fremd><NMasc-s/sse>
<Base_Stems>Schritt<NN><base><nativ><NMasc-s/sse>
<Base_Stems>Schritt<NN><base><nativ><NNeut-s/sse>
<Base_Stems>schritt<NN><base><frei><NMasc-s/sse>
<Base_Stems>schritt<NN><base><fremd><NMasc-s/sse>
<Base_Stems>schritt<NN><base><nativ><NMasc-s/sse>
<Base_Stems>schritt<NN><base><nativ><NNeut-s/sse>
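The notion of saturation used above can be illustrated with a minimal stand-alone sketch. It assumes that saturation is simply the fraction of a lemma's generated word forms that occur in the corpus. The function name saturation_of and the toy corpus are hypothetical and not part of the emores API, whose internal computation may differ.

```python
def saturation_of(generated_forms, corpus_tokens):
    """Fraction of a lemma's generated word forms that occur in the corpus.

    Assumption: this mirrors the idea of saturation described in the text;
    it is not the emores implementation.
    """
    generated = set(generated_forms)
    if not generated:
        return 0.0
    found = generated & set(corpus_tokens)
    return len(found) / float(len(generated))


# Toy corpus (hypothetical): every generated form occurs at least once.
corpus = ['Schritt', 'Schritte', 'Schritt', 'Schritten', 'Schrittes', 'Schritte']
forms = ['Schritt', 'Schritte', 'Schritten', 'Schrittes']

full = saturation_of(forms, corpus)                   # all four forms found
partial = saturation_of(forms + ['Schritts'], corpus) # one form missing
```

Under this assumption, a lemma whose paradigm also contained a form absent from the corpus would fall below 1.0 and be dropped by the saturation=1.0 restriction.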
This is the maximal restriction that the linguistically agnostic part of the emores machinery can theoretically achieve. But compared to the deduction_binary_crosstab.de_DE.dat file (the data for the example cluster dendrogram), the abductive narrowing mechanism is fairly powerful: it cuts the original guess of 307 lemmas down to 8, and half of those could be ruled out by the simple linguistic (and hence language-specific) convention of lemmatising German nouns in upper case.
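The upper-case convention can be sketched as a simple post-filter over the eight candidate lemmas. This is an illustration only: the regular expression and the helper is_capitalised_noun are assumptions about how such a language-specific rule could be written, not something emores provides.

```python
import re

# The eight candidate lemmas returned by the saturation step.
candidates = [
    '<Base_Stems>Schritt<NN><base><frei><NMasc-s/sse>',
    '<Base_Stems>Schritt<NN><base><fremd><NMasc-s/sse>',
    '<Base_Stems>Schritt<NN><base><nativ><NMasc-s/sse>',
    '<Base_Stems>Schritt<NN><base><nativ><NNeut-s/sse>',
    '<Base_Stems>schritt<NN><base><frei><NMasc-s/sse>',
    '<Base_Stems>schritt<NN><base><fremd><NMasc-s/sse>',
    '<Base_Stems>schritt<NN><base><nativ><NMasc-s/sse>',
    '<Base_Stems>schritt<NN><base><nativ><NNeut-s/sse>',
]

# Capture the stem of a noun entry, i.e. the text before the <NN> tag.
NOUN_STEM = re.compile(r'^<Base_Stems>([^<]+)<NN>')


def is_capitalised_noun(lemma):
    """True if the entry is a noun whose stem starts with an upper-case letter."""
    match = NOUN_STEM.match(lemma)
    return bool(match) and match.group(1)[:1].isupper()


kept = [lemma for lemma in candidates if is_capitalised_noun(lemma)]
```

Applied to the list above, the filter keeps the four entries with the stem Schritt and discards the four with schritt.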
A quick look at the deduction for the correct lemma gives evidence that it indeed produces all word forms:
>>> sorted(abduct.deduction(u'<Base_Stems>Schritt<NN><base><nativ><NMasc-s/sse>'))
[(u'Schritt', 2), (u'Schritte', 2), (u'Schritten', 1), (u'Schrittes', 1)]
For a native speaker, however, the result is still questionable: the genitive word form "Schritts" would be possible, too, although the form with an "e" is preferable to avoid a sequence of three consonants~\citep[pg. 222]{duden}. In that case, the \citeauthor{sfst} morphology seems to be too restrictive. But as this choice is a matter of style, it is very unlikely to find both forms in the same text or even in different texts from the same author.
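Such a gap can be made visible by comparing the deduced word forms with a hand-written paradigm. The sketch below is independent of emores; the expected set is an assumption written down by hand for this one noun, and the deduced set is copied from the output shown above.

```python
# Word forms deduced for the correct lemma (taken from the output above).
deduced = {'Schritt', 'Schritte', 'Schritten', 'Schrittes'}

# Hand-written paradigm including both genitive singular variants
# (assumption: compiled manually for illustration, not produced by emores).
expected = {'Schritt', 'Schrittes', 'Schritts', 'Schritte', 'Schritten'}

# Forms a native speaker would accept but the morphology does not generate.
missing = sorted(expected - deduced)

# Forms the morphology generates that the paradigm does not list.
unexpected = sorted(deduced - expected)
```

Here missing contains only Schritts and unexpected is empty, which matches the observation that the morphology is slightly too restrictive rather than too permissive.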