A little computer algebra system (CAS) for finite state transducers

algebra.txt

Contents

Python imports and setup:

>>> import sfst
>>> setup_sfst()
>>> from sfst import compile, solve
>>> import exceptions

Symbolic algebra in SFST

The Algebra class implements a little computer algebra system (CAS) for finite state transducers written in the SFST-PL.

The motivation for building a symbolic variable resolver for SFST came from the emores project: it needs to solve a complex morphology for the lemma to generate the literal lexeme instead of the analysis string.

Usually the lexicon containing the lemmas (with the lexemes) is embedded in a composition chain of various intermediate transducers. The definition of composition reads (in the SFST-Manual.pdf):

r||s maps a to c iff r maps a to some b and s maps b to c.

Justification of the laws used

To justify the symbolic algebra operating over compositions let's build a paradigmatic example:

>>> src_ac = '''
... $AB$ = a:b
... $BC$ = b:c
... $AC$ = $AB$ || $BC$
... $AC$
... '''
>>> t_ac = compile(src_ac)
>>> t_ac.generate()
['a:c']
>>> t_ac.analyze('c')
['a']

Now lets assume that we want to generate the string that was mapped within the composition. We somehow need to bring the b to either edge of the resulting transducer. This can be achieved in two ways, either by extracting $AB$ or $BC$ out of the composition by applying inversion (^_r) to one of the composition operands:

  1. solve for $AB$:

    >>> src_ab = '''
    ... $AB$ = a:b
    ... $BC$ = b:c
    ... $AC$ = $AB$ || $BC$
    ... % algebraic transformation:
    ... $CB$ = ^_ $BC$
    ... $AB$ = $AC$ || $CB$
    ... $AB$
    ... '''
    >>> t_ab = compile(src_ab)
    >>> [(b, a)] = t_ab.generate(separate=True)
    >>> b
    'b'
    
  2. solve for $BC$:

    >>> src_bc = '''
    ... $AB$ = a:b
    ... $BC$ = b:c
    ... $AC$ = $AB$ || $BC$
    ... % algebraic transformation:
    ... $BA$ = ^_ $AB$
    ... $BC$ = $BA$ || $AC$
    ... $BC$
    ... '''
    >>> t_bc = compile(src_bc)
    >>> [(c, b)] = t_bc.generate(separate=True)
    >>> b
    'b'
    

The second form has the advantage of analysing the same (surface) string c as the original transducer yielding the intermediate string b that we solved the equation for. This is why constructing a lemma guesser transducer preferably uses the second transformation, as for the first one the analysis a would be needed beforehand:

>>> t_bc.analyze('c')
['b']

Another important fact about finite state transducers is that the associative law of composition holds, so we can use above transformation for extensive composition chains:

>>> src_abcd = '''
... $AB$ = a:b
... $BC$ = b:c
... $CD$ = c:d
... '''
>>> src_abcd1 = src_abcd + '''
... $ABCD1$ = ($AB$ || $BC$) || $CD$
... $ABCD1$
... '''
>>> abcd1 = compile(src_abcd1)
>>> src_abcd2 = src_abcd + '''
... $ABCD2$ = $AB$ || ($BC$ || $CD$)
... $ABCD2$
... '''
>>> abcd2 = compile(src_abcd2)
>>> abcd1.generate() == abcd2.generate()
True

There are two features of the SFST-PL which break the algebra presented when not handled properly:

1. Variables can be redefined later on and therefore would become ambiguous in a symbolic transformation of an SFST-PL source, as the defining order gets lost n the variable solving algorithm.

2. The ALPHABET is implicitly scoped: It potentially affects any transducer definition, e.g. wildcards or two-level rules (see the corresponding chapter in the SFST-Manual.pdf for a complete list) until the compiler encounters the next ALPHABET definition (which terminates the scope of the previous one).

Both features would forbid rearranging SFST-PL code lines. But once variables are (unambiguously) defined, they're no more affected by the ALPHABET. Therefore the pysfst resolver first disambiguates all variable names and then extracts the variable resolved for out of the variable definitions to bring it explicitly to the final transducer definition.

Any symbolic algebra operating solely on that definition can rearrange free (i.e. not embedded in unhandled operators) variables freely, effectively circumventing the described problems.

Resolving concatenation

I had to add the restriction "free variables" to above statement after discovering that concatenating variables causes problems, as the next example shows. I'm using lower case letters for the variable names to stress their usage as "matching exactly that string". Therefore $a$ should be replaceable by $a$ || $a$ according to above laws.

>>> conc_src = '''
... $ab$ = ab
... $a$ = a
... $b$ = b
... $T$ = $ab$ || $a$ $b$               % concatenation of $a$ and $b$
... $T1$ = $ab$ || $a$ || $a$ $b$       % $a$ replaced by $a$ || $a$
... $T$
... '''
>>> conc_orig = sfst.compile(conc_src)
>>> conc_orig.analyze('ab')
['ab']
>>> conc_src1 = conc_src.replace('$T$\n', '$T1$\n')
>>> conc1 = sfst.compile(conc_src1)
>>> conc1.analyze('ab')
[]

The variable replacement in $T1$ in the context of a conjunction obviously was wrong. It seems that the composition binds stronger than the conjunction in that case. Because in SFST-PL conjunction binds stronger than composition, we need to use parentheses:

>>> conc_src2 = conc_src1.replace(
... '$T1$ = $ab$ || $a$ || $a$ $b$',
... '$T1$ = $ab$ || ($a$ || $a$) $b$')
>>> conc2 = sfst.compile(conc_src2)
>>> conc2.analyze('ab')
['ab']

Final result: It seems that concatenation is distributive over composition if we want to get rid of the parentheses (which is a requirement for being able to solve for a particular variable involved, otherwise the parentheses block splitting the composition chain):

>>> conc_src3 = conc_src2.replace(
... '$T1$ = $ab$ || ($a$ || $a$) $b$',
... '$T1$ = $ab$ || $a$ $b$ || $a$ $b$')
>>> conc3 = sfst.compile(conc_src2)
>>> conc3.analyze('ab')
['ab']

To solve specifically for $a$, we would need to remove the $b$. I was unable to handle that problem within our little CAS, as symbol deletion in SFST usually is done using an ALPHABET (see e.g. "map1.fst" in XMOR), but even if it would be possible to extract the character range from (the analysis of) a transducer ($b$ in the example), this would unintentionally delete symbols if there is a non-empty intersection of the character ranges from $a$ and $b. This emptiness constraint cannot be enforced in a general variable solver.

Concatenation with arbitrary transducers

Above derivation used only transducers mapping strings to themselves. For concatenating arbitrary string mappings, the distribution needs to carry on the mapping over the whole length. Above naive distribution yields an empty analysis for the concatenated input string, as in the concatenation of the distributed $b$ the last one produces a B which is not recognized by the preceding one:

>>> conc_t_src = '''
... $ab$ = a b:B
... $a$ = a
... $b$ = B:b
... $T$ = $ab$ || $a$ $b$ || $a$ $b$
... $T$
... '''
>>> conc_t = sfst.compile(conc_t_src)
>>> conc_t.analyze('ab')
[]

The distributed $b$ can be reformulated to pass trough the analysis string over arbitrary many compositions by using the domain operator _ (extracting the analysis/deep language), as shown in this example:

>>> pass_src = '''
... $b$ = B:b
... $TP$ = _$b$ ||  _$b$ ||  _$b$ || $b$
... $TP$
... '''
>>> pass_t = sfst.compile(pass_src)
>>> pass_t.analyze('b')
['B']

Note that any but the last $b$ transducer has been replaced with the inversion composition. We can therefore reformulate the conc_t_src using that method:

>>> conc_t_src_inv = conc_t_src.replace(
... '$T$ = $ab$ || $a$ $b$ || $a$ $b$',
... '$T$ = $ab$ || $a$ _$b$ || $a$ $b$')
>>> conc_t_inv = sfst.compile(conc_t_src_inv)
>>> conc_t_inv.analyze("ab")
['ab']

The pysfst solver: simple example

Here's an example for a very abbreviated lemma guesser with some pattern as $LEX$. It doesn't require the distribution law:

>>> simple_src = '''
... ALPHABET = [a-z]
... $LEX$ = lem .*
... $MAP1$ = {analysis}:{lemma}
... $MAP2$ = {lemma}:{complicated}
... $LEX$ = $MAP1$ || $LEX$ || $MAP2$
... $MOR$ = {complicated}:{surface}
... $MOR$ = $LEX$ || $MOR$
... $MOR$
... '''
>>> simple_analyser = compile(simple_src)
>>> simple_analyser.analyze('surface')
['analysis']

The analyser successfully analyses the 'surface' with 'analysis'. However, we're interested in the string the first occurrence of $LEX$ actually matched. With pysfst, we can simply solve for $LEX$:

>>> simple_solved_src = solve(simple_src, '$LEX$')
>>> simple_guesser = compile(simple_solved_src)
>>> simple_guesser.analyze('surface')
['lemma']

Hooray! There is our lemma the embedded $LEX$ actually matched.

The pysfst solver: example with concatenation

Above "Hooray!" was too early on the first attempt. Let's try with one that more resembles a real morphology and uses concatenation for recognizing the inflection:

>>> src_concat = '''
... % Experimental 'expose the problem' pseudo morphology
...
... % any symbols must pass trough the $MAP1$ wildcards
... ALPHABET = [a-z]<form><stem><reginf>
...
... % delete unwanted symbols in the analysis:
... % the 1st .* matches the lexeme, the 2nd .* matches the <form> tag
... $MAP1$ = <>:<stem> .* <>:<reginf> .*
...
... % A 'regular inflection' lemma
... $LEX$ = <stem>lexeme<reginf>
...
... % delete unwanted symbols on the "surface"
... % lexeme< -> <stem>lexeme<reginf>
... $MAP2$ = <stem>:<> .* <reginf>:<>
...
... $LEX$ = $MAP1$ || $LEX$ || $MAP2$
...
... % The "regular" morphology:
... $LEXEME_SURFACE$ = {lexeme}:{surface}
... $REGMOR$ = $LEX$ || $LEXEME_SURFACE$
...
... % A dummy "inflection"
... $INFL$ = <form>:<>
...
... % Problem culprit: analyser somehow mapping "lexeme" to a
... % completely different "analysis"
... $ANALYSIS$ = .* {analysis}:{lexeme} .*
...
... % Put together the morphology
... $MORPH$ = $ANALYSIS$ || $REGMOR$ $INFL$
...
... $MORPH$
... '''
>>> morph_concat = sfst.compile(src_concat)
>>> morph_concat.analyze('surface')
['analysis<form>']

First, we check whether just resolving the variables containing $LEX$ (without bringing $LEX$ to the analysis) still analyses the surface in the same way.

>>> algebra_concat = sfst.Algebra()
>>> algebra_concat.solve(src_concat, '$LEX$')
>>> src_concat_1 =  algebra_concat.result(solve=None)
>>> morph_concat_1 = sfst.compile(src_concat_1)
>>> morph_concat_1.analyze('surface')
['analysis<form>']

OK, this seems plausible. Now let's really solve for $LEX$ without directly using an Algebra instance:

>>> solved_src_concat = sfst.solve(src_concat, '$LEX$')

Just for the effect, show the finally solved transducer:

>>> solved_src_concat.split('\n')[-1:][0]
'(_$LEX$) (_$INFL$) || ^_ ( $ANALYSIS$ || $MAP1$ (_$INFL$) ) || $ANALYSIS$ || $MAP1$ (_$INFL$) || $LEX$ (_$INFL$) || $MAP2$ (_$INFL$) || $LEXEME_SURFACE$ $INFL$'

Well, doesn't look this cryptic expression nicely "CASy"? Let's try it out:

>>> guesser_concat = sfst.compile(solved_src_concat)
>>> guesser_concat.analyze('surface')
['<stem>lexeme<reginf><form>']

This is literally our lemma containing the original lexeme, but still concatenated with the $INFL$ analysis string <form>. As already said, I was unable to reliably remove it within this little SFST CAS itself.

For examining what happens "behind the curtain"; the Algebra has a verbose mode which pretty prints the variable expansion history:

>>> solved = sfst.solve(src_concat, '$LEX$', verbose=True)
---- Preprocessed source: begin ----
ALPHABET = [a-z]<form><stem><reginf>
$MAP1$ = <>:<stem> .* <>:<reginf> .*
$LEX$ = <stem>lexeme<reginf>
$MAP2$ = <stem>:<> .* <reginf>:<>
$LEX_pysfst_2$ = $MAP1$ || $LEX$ || $MAP2$
$LEXEME_SURFACE$ = {lexeme}:{surface}
$REGMOR$ = $LEX_pysfst_2$ || $LEXEME_SURFACE$
$INFL$ = <form>:<>
$ANALYSIS$ = .* {analysis}:{lexeme} .*
$MORPH$ = $ANALYSIS$ || $REGMOR$ $INFL$
$MORPH$
---- Preprocessed source: end ----
Expanding
    lhs='$LEX$' with
    rhs='None' in
    '$MAP1$ || $LEX$ || $MAP2$'
Result is '$MAP1$ || $LEX$ || $MAP2$'
Expanding
    lhs='$LEX_pysfst_2$' with
    rhs='$MAP1$ || $LEX$ || $MAP2$' in
    '$LEX_pysfst_2$ || $LEXEME_SURFACE$'
Result is '$MAP1$ || $LEX$ || $MAP2$ || $LEXEME_SURFACE$'
Expanding
    lhs='$REGMOR$' with
    rhs='$MAP1$ || $LEX$ || $MAP2$ || $LEXEME_SURFACE$' in
    '$ANALYSIS$ || $REGMOR$ $INFL$'
    Distributing lhs='', rhs='$INFL$' over
            '$MAP1$ || $LEX$ || $MAP2$ || $LEXEME_SURFACE$'
    Result is '$MAP1$ (_$INFL$) || $LEX$ (_$INFL$) || $MAP2$ (_$INFL$) || $LEXEME_SURFACE$ $INFL$'
Result is '$ANALYSIS$ || $MAP1$ (_$INFL$) || $LEX$ (_$INFL$) || $MAP2$ (_$INFL$) || $LEXEME_SURFACE$ $INFL$'
Expanding
    lhs='$MORPH$' with
    rhs='$ANALYSIS$ || $MAP1$ (_$INFL$) || $LEX$ (_$INFL$) || $MAP2$ (_$INFL$) || $LEXEME_SURFACE$ $INFL$' in
    '$MORPH$'
Result is '$ANALYSIS$ || $MAP1$ (_$INFL$) || $LEX$ (_$INFL$) || $MAP2$ (_$INFL$) || $LEXEME_SURFACE$ $INFL$'
Expanding
    lhs='$LEX$' with
    rhs='None' in
    '$MAP1$ || $LEX$ || $MAP2$ || $LEXEME_SURFACE$'
Result is '$MAP1$ || $LEX$ || $MAP2$ || $LEXEME_SURFACE$'
Expanding
    lhs='$LEX$' with
    rhs='None' in
    '$ANALYSIS$ || $MAP1$ (_$INFL$) || $LEX$ (_$INFL$) || $MAP2$ (_$INFL$) || $LEXEME_SURFACE$ $INFL$'
Result is '$ANALYSIS$ || $MAP1$ (_$INFL$) || $LEX$ (_$INFL$) || $MAP2$ (_$INFL$) || $LEXEME_SURFACE$ $INFL$'
Expanding
    lhs='$LEX$' with
    rhs='None' in
    '$ANALYSIS$ || $MAP1$ (_$INFL$) || $LEX$ (_$INFL$) || $MAP2$ (_$INFL$) || $LEXEME_SURFACE$ $INFL$'
Result is '$ANALYSIS$ || $MAP1$ (_$INFL$) || $LEX$ (_$INFL$) || $MAP2$ (_$INFL$) || $LEXEME_SURFACE$ $INFL$'
Resolved transducer to:
$ANALYSIS$ || $MAP1$ (_$INFL$) || $LEX$ (_$INFL$) || $MAP2$ (_$INFL$) || $LEXEME_SURFACE$ $INFL$
Result transducer is:
(_$LEX$) (_$INFL$) || ^_ ( $ANALYSIS$ || $MAP1$ (_$INFL$) ) || $ANALYSIS$ || $MAP1$ (_$INFL$) || $LEX$ (_$INFL$) || $MAP2$ (_$INFL$) || $LEXEME_SURFACE$ $INFL$

The pysfst solver: the SFST-Manual.pdf example

Finally, let's try the simple example provided by the SFST-Manual.pdf:

>>> src_manual = '''
... % Define the set of valid symbol pairs for the two-level rules.
... % The symbol # is used to mark the boundary between the stem and
... % the inflectional suffix. It is deleted here.
...
... ALPHABET = [A-Za-z] y:i [e#]:<>
...
... % Read the lexical items from a separate file
... % each line of which contains a form like "dark"
... $WORDS$ = dark | greedy
...
... % Define a rule which replaces y with i
... % if a morpheme boundary and an e follows
... % easy#er -> easier
... $R1$ = y<=>i (#:<> e)
...
... % Define a rule which eliminates e before "#e"
... % late#er -> later
... $R2$ = e<=><> (#:<> e)
...
... % Compute the intersection of the two rule transducers
... $R$ = $R1$ & $R2$
...
... % Define a transducer for the inflectional endings
... $INFL$ = <ADJ>:<> (<pos>:<> | <comp>:{er} | <sup>:{est})
...
... % Concatenate the lexical forms and the inflectional endings and
... % put a morpheme boundary in between which is not printed in the analysis
... $S$ = $WORDS$ <>:# $INFL$
...
... % Apply the two level rules
... % The result transducer is stored in the output file
... $S$ || $R$
... '''
>>> morph_manual = sfst.compile(src_manual)
>>> morph_manual.analyze('greediest')
['greedy<ADJ><sup>']
>>> morph_manual.analyze('darkest')
['dark<ADJ><sup>']

The raw source is not solvable:

>>> try:
...     solved_src_manual = sfst.solve(src_manual, '$WORDS$')
... except exceptions.ValueError, inst:
...     print inst
Expression '<>:# $INFL$' is not a pure variable concatenation.

We therefore need to introduce an intermediate variable for being able to solve the source. For morphologies not specifically designed to be solvable for a particular variable, it is probably a quite common requirement to introduce such intermediate variables:

>>> src_manual_var = src_manual.replace(
... '$S$ = $WORDS$ <>:# $INFL$',
... '''
... $MORPHEME_BOUNDARY$ = <>:#
... $S$ = $WORDS$ $MORPHEME_BOUNDARY$  $INFL$
... ''')
>>> solved_src_manual = sfst.solve(src_manual_var, '$WORDS$')
>>> solved_morph_manual = sfst.compile(solved_src_manual)

Here we're encountering the special case that the lexicon already was the leftmost expression in the original source, and therefore we "solving" it in fact did nothing:

>>> solved_morph_manual.analyze('greediest')
['greedy<ADJ><sup>']
>>> solved_morph_manual.analyze('darkest')
['dark<ADJ><sup>']

Optimizing for fast compilation of the variable solved for

For the emores project, it is crucial to minimise the compilation time for a morphology with one explicit lemma. Because precompiling as much as possible is tightly coupled to the variable solving mechanism, I decided to put it here. We use again our src_concat from above:

>>> (precompiled, includes) = sfst.solve_precompile(src_concat, '$LEX$')
>>> precompiled
'(_$LEX$) (_"<pysfst_rconcat>") || "<pysfst_lhs>" ||  $LEX$ "<pysfst_rconcat>"  || "<pysfst_rhs>"'

The precompiled source consists only of included binary transducers and the variable solved for. The transducers are returned in the includes dictionary as stored binray file in a string, not as sfst.Transducer instances. This should be faster and makes the result easily pickleable (even if it was easily possible, unpickling version-outdated sfst.Transducer instances could lead to interesting debugging sessions - in that case the pre-OO-ish separation of algorithms and data makes really sense). If a transducer is needed, it can simply be constructed out of the binary string:

>>> includes.keys()
['pysfst_rconcat', 'pysfst_lhs', 'pysfst_rhs']
>>> type(includes['pysfst_lhs'])
<type 'str'>
>>> type(sfst.Transducer(includes['pysfst_lhs']))
<class 'sfst.Transducer'>

To compile the new transducer, explicitly insert the variable $LEX$ at the beginning in a separate line. The result transducer analyses the surface just like the non-optimized guesser_concat:

>>> lex = '$LEX$ = <stem>lexeme<reginf>\n'
>>> precompiled_src = lex + precompiled
>>> morph_precompiled = sfst.compile(precompiled_src, include=includes)
>>> morph_precompiled.analyze('surface')
['<stem>lexeme<reginf><form>']

Limitations of the algebra

Disjunction

The SFST-PL algebra presented cannot handle any disjunctions. It is therefore a prerequisite to reformulate the morphology it should operate over by removing all and any disjunctions in the resolution path of the variable the morphology is solved for.

In the XMOR example shipped with the SFST tools, this concerns the (in fact optional) "defaultstems.fst" include for "generation of default derivational and compounding stems".and the "combination of the different derivations".

Compounding is a problem not handled at all: It would require to solve somehow for two distinct lemmas compounded together (which usually reside both in the lexicon which is a disjunction of lemmata).

At present, the algebra is also not capable of handling the syntax that is used in the SMOR example.

Exploding solution compliations

It turned out that the solved XMOR example requires at least more than three gigabytes to compile even with the shipped fst-compiler. To avoid damage to your keyboard because of squeezing <Ctrl>+<Alt>+<F2> to get a terminal to terminate the compiler while the whole OS gets swapped out, a technique similar to the fst-parse tool has to be used: First analyse the word form with the non-solved guesser, then "analyse" the result(s) with the first part of the solved guesser which can be obtained with the keyword argument solve='reverse_solved':

>>> reverse_solved = sfst.solve(src_concat, '$LEX$',
...                             solve='reverse_solved')
>>> reverse_solved.split('\n')[-1:][0]
'(_$LEX$) (_$INFL$) || ^_ ( $ANALYSIS$ || $MAP1$ (_$INFL$) )'

So we re-use the already compiled non-solved transducer to get the analysis we need for applying the reverse_solved transducer:

>>> analysis = morph_concat.analyze('surface')
>>> analysis
['analysis<form>']
>>> morph_reverse_solved = sfst.compile(reverse_solved)
>>> lemma = morph_reverse_solved.analyze(analysis[0])
>>> lemma
['<stem>lexeme<reginf><form>']

We get the same result as for the guesser_concat (as expected). We even can compose the two transducers after compiling (which also prevents us from being busted by the compiler):

>>> morph_combined = morph_reverse_solved << morph_concat
>>> morph_combined.analyze('surface')
['<stem>lexeme<reginf><form>']