Quick introduction to pysfst, the Python interface to SFST

intro.txt

Python imports and setup:

>>> import sfst
>>> setup_sfst()    # for the doctest

One-Line Transducers

After installing pysfst, the module to import is called sfst. Its usage is similar to that of the re regular expression module. To match the string "fly" and annotate it with "<N>" using re, we would write:

>>> import re
>>> re.compile('fly').sub('fly<N>', 'fly')
'fly<N>'

The same functionality using sfst:

>>> sfst.compile('{fly<N>}:{fly}').analyze('fly')
['fly<N>']

Unlike a regular expression substitution, however, sfst compiles programs written in a finite-state transducer language, and a transducer can return several analyses for the same input:

>>> sorted(sfst.compile('fly(<N>:<>|<V>:<>)').analyze('fly'))
['fly<N>', 'fly<V>']
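
The difference can be illustrated without sfst at all. The following is a minimal plain-Python sketch, not part of pysfst: the ANALYSES table and the analyze helper are hypothetical stand-ins for the relation a transducer encodes. It shows that re.sub always yields exactly one string, while an analysis is a one-to-many lookup that may return zero, one, or several results.

```python
import re

# Hypothetical table standing in for the relation a transducer encodes.
# This is an illustration only; it does not use or imitate the sfst API.
ANALYSES = {
    "fly": ["fly<N>", "fly<V>"],
}

def analyze(surface):
    """Return every analysis recorded for a surface form (possibly none)."""
    return sorted(ANALYSES.get(surface, []))

# A regular expression substitution produces exactly one result string.
single = re.compile("fly").sub("fly<N>", "fly")

print(single)            # one string
print(analyze("fly"))    # a list with two analyses
print(analyze("walk"))   # an empty list: no analysis recorded
```

A real transducer computes this relation from a compact expression rather than storing it in a table, which is what makes it practical for large lexicons.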

SFST Programs

It is possible to embed arbitrarily large SFST programs in Python. Here is the "easy" example taken from the SFST distribution:

>>> transducer = sfst.compile("""
... % the set of valid character pairs
... ALPHABET = [A-Za-z] y:i [#e<JJ><JJR><JJS>]:<>
...
... $WORDS$ = (\
... easy |\
... late |\
... early |\
... happy |\
... white |\
... black \
... )
...
... % rule replacing y with i
... $R1$ = y<=>i (#:<> e)
...
... % rule eliminating e
... $R2$ = e<=><> (#:<> e)
...
... $R$ = $R1$ & $R2$
...
... $S$ = $WORDS$#(<JJ>|er<JJR>|est<JJS>)
...
... $S$ || $R$
... """)

This little program analyses inflected forms of the adjectives listed in $WORDS$:

>>> transducer.analyze('easy')
['easy#<JJ>']
>>> transducer.analyze('easier')
['easy#er<JJR>']
>>> transducer.analyze('easiest')
['easy#est<JJS>']
>>> transducer.analyze('late')
['late#<JJ>']
>>> transducer.analyze('later')
['late#er<JJR>']
>>> transducer.analyze('latest')
['late#est<JJS>']
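
To make the effect of the two rules concrete, here is a rough plain-Python approximation of the program above. It is a sketch under stated assumptions and does not use SFST: the names surface, build_index and analyze are hypothetical, the rules are applied one after the other with regular expressions instead of being intersected, and analysis works by precomputing every surface form, which is only feasible because the word list is small and finite.

```python
import re

# The lexicon and suffixes from the SFST program above.
WORDS = ["easy", "late", "early", "happy", "white", "black"]
SUFFIXES = ["<JJ>", "er<JJR>", "est<JJS>"]

def surface(analysis):
    """Map an analysis string such as 'easy#er<JJR>' to its surface form."""
    # Approximates $R1$: y becomes i before a morpheme boundary followed by e.
    s = re.sub(r"y(?=#e)", "i", analysis)
    # Approximates $R2$: e is dropped before a morpheme boundary followed by e.
    s = re.sub(r"e(?=#e)", "", s)
    # The boundary marker and the tags have no surface realisation.
    return re.sub(r"#|<[^>]*>", "", s)

def build_index():
    """Generate every analysis string and index it by its surface form."""
    index = {}
    for word in WORDS:
        for suffix in SUFFIXES:
            analysis = word + "#" + suffix
            index.setdefault(surface(analysis), []).append(analysis)
    return index

INDEX = build_index()

def analyze(form):
    """Return all analysis strings whose surface form equals `form`."""
    return INDEX.get(form, [])

for form in ["easy", "easier", "easiest", "late", "later", "latest"]:
    print(form, analyze(form))
```

The sketch only runs the rules in the generation direction and inverts the result with a dictionary. A transducer needs no such table, because the same compiled network can be traversed in either direction.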