Python imports and setup:
>>> import emores
>>> from emores import config
>>> from emores.morphology import Morphology
>>> from emores.text import Text
>>> from emores.induction import Induction
>>> from emores.deduction import Deduction
>>> from emores.abduction import Abduction
We set the language explicitly, even though it is already preselected as the default language:
>>> conf = config.LanguageConfig()
>>> conf.set_language('en_X')
>>> conf.language
'en_X'
The central access point for compiling a morphology is the Morphology class. Once make has been run, its new method compiles the morphology and stores it in the database.
>>> morph = Morphology(conf)
>>> morph.new('usecase_en_X', '1.0.0')
>>> morph = None # make sure to reload the object
Now the morphology engineer has finished his work and passes the laptop over to the lexicon extender to process a corpus text.
First, we load the morphology just stored.
>>> morph = Morphology(conf)
>>> morph.load('usecase_en_X')
Then we retrieve additional corpus data. Multiple instances of the same text (identical corpus, name and URL) are not supported; in that case the text is retrieved only once. This is a rather slow operation because every character not belonging to the language is filtered out one by one (try to use finite state complementation with Unicode...).
>>> text = Text(conf)
>>> text.new('Kafka', 'Metamorphosis', 'http://www.gutenberg.org/files/5200/5200.zip')
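The one-by-one character filtering described above can be pictured with the following minimal sketch. It is not the emores implementation: the function name filter_chars and the ALLOWED alphabet are hypothetical and chosen only for illustration.

```python
import string

# Hypothetical alphabet for illustration only; the character set that
# emores actually uses for a language is not shown in this document.
ALLOWED = set(string.ascii_letters + " '-\n")


def filter_chars(raw, allowed=ALLOWED):
    """Keep only characters from the allowed set, testing one character at a time."""
    return "".join(ch for ch in raw if ch in allowed)


sample = "Gregor Samsa's room, 1915!"
print(filter_chars(sample))
```

Testing every character individually is linear in the length of the text, which is one plausible reason why retrieving a whole book feels slow.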
Now we perform the induction step: for any newly found token which isn't yet analysable with the initial lexicon, all lemmas are guessed and inserted into the database. Because emores operates somewhat slowly, we limit the amount of data processed.
>>> induct = Induction(morph)
>>> induct.induce(limit=100)
This step really takes a long time. To monitor progress, connect to the emores_test database using a tool like pgadmin3 and compare the number of tokens with the number of asserted single inductions:
SELECT COUNT(*) FROM token;
SELECT COUNT(*) FROM induction;
Once the induction has finished, there are induction database entries for each token that could be analysed in some way. After the induction, the deduction step can be run; it also takes a lot of time:
>>> deduct = Deduction(morph)
>>> deduct.deduce()
Now there are deduction database entries for each lemma guessed in the induction step and the results can be examined by analysing the database directly using SQL or by using canned queries provided by the abduction module:
>>> abduct = Abduction(morph)
>>> saturated = abduct.saturation()
Now we audit one guessed lemma for the verb "to consider":
>>> for row in saturated:
...     if row[0].find(u'consider<V>') > 0:
...         row_consider = row
>>> print row_consider
(u'<Stem>consider<V><base><native><VerbReg>', 0.75, 4L, 5L)
The lemma is not completely saturated: the corresponding column is not 1 but only 0.75, at a generative lemma productivity of 4. To investigate which tokens lead to that saturation level of the lemma, we use the deduction reporting method of the Abduction class:
>>> sorted(abduct.deduction(u'<Stem>consider<V><base><native><VerbReg>'))
[(u'consider', 3), (u'considered', 1), (u'considering', 1), (u'considers', 0)]
There were 4 word forms found in the deduction range, but the fourth, "considers", did not occur in our corpus, so the saturation level remains 3/4.
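The arithmetic behind the 3/4 can be retraced with a small sketch. It assumes that saturation is the share of deduced word forms attested at least once in the corpus, which matches the numbers above but is not taken from the emores source; the helper name saturation_of is hypothetical.

```python
def saturation_of(form_counts):
    """Return (saturation, productivity) for a list of (word form, count) pairs.

    Assumption for illustration: saturation is the fraction of deduced
    word forms that occur at least once in the corpus.
    """
    productivity = len(form_counts)  # number of deduced word forms
    if productivity == 0:
        return 0.0, 0
    attested = sum(1 for _, count in form_counts if count > 0)
    return attested / float(productivity), productivity


forms = [(u'consider', 3), (u'considered', 1),
         (u'considering', 1), (u'considers', 0)]
print(saturation_of(forms))
```

With the four forms listed above, three are attested, so the sketch yields a saturation of 0.75 at a productivity of 4, in line with the row reported by abduct.saturation().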