| Authors: | Flammie Pirinen |
|---|---|
| Software version: | 20101026 (draft only, please send feedback to authors) |
| Documentation license: | Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 Unported |
| SVN Revision: | 292 |
| SVN Date: | 2010-08-31 |
Omorfi is an open source morphological analyser of the Finnish language, licensed under GPLv3. It uses lexical data from Nykysuomen sanalista v1, published by the Research Institute for the Languages of Finland, and from Joukahainen, an open source project for collecting lexical data.
This documentation is intended for end users of the morphology. Chapter 1 is an introduction that everyone may want to read. Chapter 2 describes the implemented morphological analyser and its contents. Chapter 3 describes applications of omorfi and gives references to their use.
This documentation is built from the doc/ directory of the omorfi source distribution. This document describes version 20101026; more up-to-date versions may be found via the Omorfi website.
Omorfi is an open source implementation of automatic morphological analysis of the Finnish language, built on traditional finite-state models and open source tools. The main development tools of omorfi are the Helsinki Finite-State Technology (HFST) tools. As lexical data sources omorfi uses Nykysuomen sanalista v1, a GNU LGPL word list by the Research Institute for the Languages of Finland, and Joukahainen, a GNU GPL word list developed by the open source community.
Omorfi was started at the University of Helsinki as a master's thesis project. It has since been continued as the author's side project and used for research purposes in the development of FST tools. Omorfi is also used by some external projects for basic morphological analysis or as a full-form dictionary.
This document describes what the morphological analyser contains, what its analyses mean, and why it has been built like this. This documentation is intended to be long-term background material for omorfi. The basic usage and the details that may vary a lot are covered in the Omorfi basic usage guide, which contains the specific commands, software versions and filenames relevant for current versions of the software. For a developer's reference, you may skim through this documentation and then read the Omorfi developer's guide to find out the coding conventions, contribution guidelines, and the places in the code to modify.
Throughout the document, hyperlinks such as VISK (referring to the descriptive grammar of the Finnish language) are given. For academic citations, the [VISK10] notation is used. These citations are listed as endnotes of the document.
This monospaced style is used for the input and output of command line tools and for morphological analysis strings that are meant for automatic parsing. An offset monospaced section is used for longer sets of examples. For most examples, an informal parsing formula, such as the following, is provided:
word+tags*
The purpose of this is to give a very short overview of what kind of output you may expect when parsing data relevant to the description. The notation is a free-form variant of regular expressions, where an asterisk * signifies repetition of the previous structure zero or more times and a plus + once or more; the example above means one or more words, possibly followed by tags. A question mark ? marks the previous structure as optional, and parentheses () group multiple structures together for asterisks or question marks.
Note
In this document, a note admonition like this one is used for normative references, such as those to VISK, the official Finnish grammar.
Warning
A warning admonition, such as this one, is used to point out something unexpected, or something that deviates from what users of previous systems may be used to.
The possible morphological readings relevant to a section are enumerated in a list of examples accompanied by a table in a standard format. The examples show all the values of the analyses using forms of an example word; the examples are then explained in the table:
example READINGS...NAME=VALUE1
example READINGS...NAME=VALUE2
| VARIABLE NAME | Explanation | Examples |
|---|---|---|
| VARIABLE VALUE1 | interpretation of NAME=VALUE1 | example in Finnish (translation or gloss) |
| VARIABLE VALUE2 | interpretation of NAME=VALUE2 | example will analyse with NAME=VALUE described |
| ... | ... | ... |
Note that the examples have been obtained by running the analyser and copying its output here. The examples may have been cleaned up and rewrapped manually.
Omorfi is not intended to create new linguistic descriptions. We merely aim to capture in the analyser as much as possible of the contemporary linguistic knowledge about the morphology of word-forms. The primary source of linguistic knowledge about the Finnish language is Iso suomen kielioppi, from now on referred to as VISK or the official grammar of the Finnish language. Everything implemented in Omorfi is described in VISK or in another scholarly resource on Finnish morphology, and the intent of this document is to list all parts of omorfi along with specific references to the grammar or the relevant scientific source describing the implemented feature.
The main emphasis on morphological analysis in Omorfi means that it has been built to analyse word-forms in isolation, based on the information present in the word-form. While Omorfi is, and will be, used for purposes other than morphological analysis, such as those listed under applications, the core morphological analyses will be kept as unchanged as possible. Necessary additions may be made by extending the description.
There are also conventions used in past and contemporary morphological analysers that have affected some of the design decisions of Omorfi. The basic tagging conventions of omorfi do not retain direct compatibility with past systems, but they are designed so that conversion downwards can typically be supported. For this reason the analyses of omorfi are occasionally lengthier than needed, since they aim to contain a reasonable superset of the features that have been used in other systems. This has been done to facilitate the usage and comparison of Omorfi and other systems in the traditional, basic tasks of a morphological analyser.
The examples in this document are given in omorfi's own notation for analyses. While this notation may change between versions, it should always cover the same features and information, and therefore I believe this verbose notation is the most useful one for the examples in this documentation.
The implementation of morphology in Omorfi deals with inflection, derivation and compounding of word-forms. This creates a morphological analyser that can map words to their dictionary forms, referred to as lemmas, and to their morphological analyses. This chapter describes the implementation of the morphology in terms of morphological combinatorics and shows how the analyses are formatted by the default analyser in Omorfi. While the main target of the Omorfi source code is a morphological analyser, it is used for various purposes, such as spell-checking and correction, hyphenation, morpho-syntactic disambiguation and machine translation, so the morphological analyses provided may vary by end application.
The morphological analyses are, in the end, encoded as linear strings. The format of these strings varies wildly depending on the application, but the omorfi source tree aims to cater for all end applications by providing rules that allow different encodings of the morphological analyses. By default omorfi has its own, rather elaborate tags (also referred to as the omor tagging style), which are geared more towards machine parsing than human readability. The tags are always written in capitals in the form [NAME=VALUE]. If you intend to parse them with scripts, you only need to capture each bracketed [...] unit and split its contents at = to extract a feature structure map. The same name is very likely to occur multiple times because of compounding, so you will also need to decide how to handle this in your application. It is suggested to give the rightmost reading the most weight, since compounding and derivation in Finnish always extend to the right, but in some applications this is of course not the ideal solution.
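The parsing advice above can be sketched in a few lines of Python. This is only an illustration, not part of omorfi: the function names parse_tags and rightmost_features are made up here, and the tag pattern assumes that names consist of capital letters only, as in the examples of this document.

```python
import re

# Matches one omor-style tag such as [POS=NOUN] or [LEMMA='talo'].
# Assumption: tag names are written in capital letters only.
TAG_RE = re.compile(r"\[([A-Z]+)=([^\]]*)\]")

def parse_tags(analysis):
    """Return (NAME, VALUE) pairs in the order they occur in the analysis."""
    return [(name, value.strip("'")) for name, value in TAG_RE.findall(analysis)]

def rightmost_features(analysis):
    """Collapse the pairs into a dict; a later (rightmost) tag overrides an earlier one."""
    features = {}
    for name, value in parse_tags(analysis):
        features[name] = value
    return features

analysis = ("[BOUNDARY=LEXITEM][LEMMA='kissa'][POS=NOUN][KTN=9][NUM=SG][CASE=NOM]"
            "[BOUNDARY=COMPOUND][GUESS=COMPOUND]"
            "[LEMMA='koira'][POS=NOUN][KTN=10][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]")
print(rightmost_features(analysis)["LEMMA"])  # koira
```

Keeping the ordered list from parse_tags alongside the collapsed dict lets an application fall back to its own policy when the rightmost reading is not what it needs.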
Other tagging representations are modelled on specific applications, requirements or standards. Currently you may try some of the recoded analysers emulating the Constraint Grammar, Finnish text collection or apertium styles. For interactive use, the colorterm variant gives the most terse output, but it relies on colour coding working in the terminal. The Constraint Grammar style is a string of the form lemma+X+Y+Z, the Finnish text collection style correspondingly lemma X Y Z, and the apertium style lemma<X><Y><Z>. Here is an example passage parsed first in the omor format and then in the format emulating Constraint Grammar:
lukemaan [BOUNDARY=LEXITEM][LEMMA='lukea'][POS=VERB][KTN=58][KAV=D]
[VOICE=ACT][INF=MA][NUM=SG][CASE=ILL][BOUNDARY=LEXITEM]
lukemaan [BOUNDARY=LEXITEM][LEMMA='lukea'][POS=VERB][KTN=58][KAV=D]
[VOICE=ACT][PCP=MA][CMP=POS][NUM=SG][CASE=ILL][BOUNDARY=LEXITEM]
ja [BOUNDARY=LEXITEM][LEMMA='ja'][POS=PARTICLE][BOUNDARY=LEXITEM]
ja [BOUNDARY=LEXITEM][LEMMA='ja'][POS=CONJUNCTION][BOUNDARY=LEXITEM]
selittämään [BOUNDARY=LEXITEM][LEMMA='selittää'][POS=VERB][KTN=53]
[KAV=C][VOICE=ACT][INF=MA][NUM=SG][CASE=ILL][BOUNDARY=LEXITEM]
selittämään [BOUNDARY=LEXITEM][LEMMA='selittää'][POS=VERB][KTN=53]
[KAV=C][VOICE=ACT][PCP=MA][CMP=POS][NUM=SG][CASE=ILL][BOUNDARY=LEXITEM]
sitä [BOUNDARY=LEXITEM][LEMMA='se'][POS=PRONOUN][NUM=SG]
[CASE=PAR][BOUNDARY=LEXITEM]
sitä [BOUNDARY=LEXITEM][LEMMA='sitä'][POS=PARTICLE][BOUNDARY=LEXITEM]
ennen [BOUNDARY=LEXITEM][LEMMA='ennen'][POS=ADVERB][BOUNDARY=LEXITEM]
ennen [BOUNDARY=LEXITEM][LEMMA='ennen'][POS=ADPOSITION]
[BOUNDARY=LEXITEM]
kaikkea [BOUNDARY=LEXITEM][LEMMA='kaikki'][POS=PRONOUN]
[NUM=SG][CASE=PAR][BOUNDARY=LEXITEM]
kouluissa [BOUNDARY=LEXITEM][LEMMA='koulu'][POS=NOUN][KTN=1]
[NUM=PL][CASE=INE][BOUNDARY=LEXITEM]
ja [BOUNDARY=LEXITEM][LEMMA='ja'][POS=PARTICLE][BOUNDARY=LEXITEM]
ja [BOUNDARY=LEXITEM][LEMMA='ja'][POS=CONJUNCTION][BOUNDARY=LEXITEM]
muissa [BOUNDARY=LEXITEM][LEMMA='muu'][POS=ADJECTIVE][KTN=18]
[CMP=POS][NUM=PL][CASE=INE][BOUNDARY=LEXITEM]
oppilaitoksissa [BOUNDARY=LEXITEM][LEMMA='oppilaitos'][POS=NOUN][KTN=39]
[NUM=PL][CASE=INE][BOUNDARY=LEXITEM]
eri [BOUNDARY=LEXITEM][LEMMA='eri'][POS=PARTICLE][BOUNDARY=LEXITEM]
maiden [BOUNDARY=LEXITEM][LEMMA='maa'][POS=NOUN][KTN=18]
[NUM=PL][CASE=GEN][BOUNDARY=LEXITEM]
ja [BOUNDARY=LEXITEM][LEMMA='ja'][POS=PARTICLE][BOUNDARY=LEXITEM]
ja [BOUNDARY=LEXITEM][LEMMA='ja'][POS=CONJUNCTION][BOUNDARY=LEXITEM]
alueiden [BOUNDARY=LEXITEM][LEMMA='alue'][POS=NOUN][KTN=48]
[NUM=PL][CASE=GEN][BOUNDARY=LEXITEM]
poliittisista [BOUNDARY=LEXITEM][LEMMA='poliittinen'][POS=ADJECTIVE]
[KTN=38][CMP=POS][NUM=PL][CASE=ELA][BOUNDARY=LEXITEM]
oloista [BOUNDARY=LEXITEM][LEMMA='olo'][POS=NOUN][KTN=1]
[GUESS=DERIVE][DRV=INEN][CMP=POS][NUM=SG][CASE=PAR][BOUNDARY=LEXITEM]
oloista [BOUNDARY=LEXITEM][LEMMA='olo'][POS=NOUN][KTN=1]
[NUM=PL][CASE=ELA][BOUNDARY=LEXITEM]
oloista [BOUNDARY=LEXITEM][LEMMA='oloinen'][POS=ADJECTIVE]
[KTN=38][CMP=POS][NUM=SG][CASE=PAR][BOUNDARY=LEXITEM]
CG format:
lukemaan lukea+V+Act+Inf3+Sg+Ill
lukemaan lukea+V+Act+AgPcp+Pos+Sg+Ill
ja ja+Part
ja ja+Conj
selittämään selittää+V+Act+Inf3+Sg+Ill
selittämään selittää+V+Act+AgPcp+Pos+Sg+Ill
sitä se+Pron+Sg+Par
sitä sitä+Part
ennen ennen+Adv
ennen ennen+Adp
kaikkea kaikki+Pron+Sg+Par
kouluissa koulu+N+Pl+Ine
ja ja+Part
ja ja+Conj
muissa muu+A+Pos+Pl+Ine
oppilaitoksissa oppilaitos+N+Pl+Ine
eri eri+Part
maiden maa+N+Pl+Gen
ja ja+Part
ja ja+Conj
alueiden alue+N+Pl+Gen
poliittisista poliittinen+A+Pos+Pl+Ela
oloista olo+N+Der/inen+Pos+Sg+Par
oloista olo+N+Pl+Ela
oloista oloinen+A+Pos+Sg+Par
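The three linear string forms described above differ only in how the lemma and the tags are joined. The following Python sketch shows just that joining step; render is a hypothetical helper, and the tag names are reused from the Constraint Grammar example purely for illustration (the real Finnish text collection and apertium tag sets may use different names).

```python
def render(lemma, tags, style):
    """Join a lemma and already-converted tag names in one of three linear styles."""
    if style == "cg":        # Constraint Grammar style: lemma+X+Y+Z
        return "+".join([lemma] + tags)
    if style == "ftc":       # Finnish text collection style: lemma X Y Z
        return " ".join([lemma] + tags)
    if style == "apertium":  # apertium style: lemma<X><Y><Z>
        return lemma + "".join("<%s>" % tag for tag in tags)
    raise ValueError("unknown style: %r" % style)

tags = ["V", "Act", "Inf3", "Sg", "Ill"]
print(render("lukea", tags, "cg"))        # lukea+V+Act+Inf3+Sg+Ill
print(render("lukea", tags, "apertium"))  # lukea<V><Act><Inf3><Sg><Ill>
```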
The tags of omorfi analyses are strings of the form [NAME=VALUE]. For example, in the tag [POS=NOUN] the NAME is POS and the VALUE is NOUN. The listing below is a short reference for the tags and their values. Definitions and examples are provided in the following chapters.
Boundary tags are of the form [BOUNDARY=VALUE]. The outermost boundaries of each lexical item are marked explicitly with the value LEXITEM. If an analysis consists of more than one lemma, the word boundaries inside the lexical item are also marked, with the value COMPOUND. The word boundaries of other multi-word expressions are marked by an orthographic space only.
Some special symbols can delimit sentences or paragraphs; their analysis has a boundary field with the value SENTENCE or PARAGRAPH. This feature is experimental.
Thus all analyses of a lexical unit are formed as:
[BOUNDARY=LEXITEM]...[BOUNDARY]*...[BOUNDARY=LEXITEM]
The following examples demonstrate both types of boundaries:
talo [BOUNDARY=LEXITEM][LEMMA='talo'][POS=NOUN][KTN=1]
[NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
kissakoira [BOUNDARY=LEXITEM][LEMMA='kissa'][POS=NOUN][KTN=9]
[NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]
[LEMMA='koira'][POS=NOUN][KTN=10][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
| BOUNDARY | Meaning | Examples |
|---|---|---|
| LEXITEM | ultimate boundary of lexical item | kissa (cat) two boundaries |
| COMPOUND | word boundary of a generated compound | kissakoira (cat dog) one boundary |
Kissakoira is a made-up, but perfectly valid, compound that is unlikely to be in a dictionary and is therefore most likely formed by productive compounding. Applications that do not make use of productive compounding will not have these forms.
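As a minimal sketch of how the boundary structure can be used, the following Python function strips the outer LEXITEM boundaries and splits an analysis at its COMPOUND boundaries. The name split_compound is hypothetical, and the function assumes the whole analysis has been joined into a single string without line wraps.

```python
LEXITEM = "[BOUNDARY=LEXITEM]"
COMPOUND = "[BOUNDARY=COMPOUND]"

def split_compound(analysis):
    """Return the parts of one analysis, split at compound boundaries."""
    body = analysis
    if body.startswith(LEXITEM):
        body = body[len(LEXITEM):]
    if body.endswith(LEXITEM):
        body = body[:-len(LEXITEM)]
    return body.split(COMPOUND)

kissakoira = ("[BOUNDARY=LEXITEM][LEMMA='kissa'][POS=NOUN][KTN=9][NUM=SG][CASE=NOM]"
              "[BOUNDARY=COMPOUND][GUESS=COMPOUND]"
              "[LEMMA='koira'][POS=NOUN][KTN=10][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]")
for part in split_compound(kissakoira):
    print(part)
```

A simplex word such as talo yields a single part, so callers can treat simple and compound analyses uniformly.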
In the analyses used with omorfi, the lemma systematically refers to the root form of a word as it is presented in the original lexical data source. For the data of Nykysuomen sanalista this means the word form you can use to look the word up in Kielitoimiston sanakirja, i.e. the official dictionary. By default the values are coded in a tag with the name LEMMA and an arbitrary value in single quotation marks:
[BOUNDARY][LEMMA='.*']...
This also means that when derivational or compounding processes create a new word form, the lemmas will refer to the forms that can be found in the dictionary. End users preferring otherwise should look into adding the compounded or derived form to the lexical data:
kissa [BOUNDARY=LEXITEM][LEMMA='kissa'][POS=NOUN][KTN=9]
[NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
kissatar [BOUNDARY=LEXITEM][LEMMA='kissa'][POS=NOUN][KTN=9]
[GUESS=DERIVE][DRV=TAR][POS=NOUN][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
kissakoira [BOUNDARY=LEXITEM][LEMMA='kissa'][POS=NOUN][KTN=9]
[NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]
[LEMMA='koira'][POS=NOUN][KTN=10][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
| LEMMA | Meaning | Examples |
|---|---|---|
| 'kissa', 'koira', ... | the dictionary form of words analyzed | kissa (cat) one lemma, kissakoira (cat dog) two lemmas |
Warning
By default, generated compounds have analyses of the form [LEMMA][TAGS][LEMMA][TAGS]...; only lexicalised compounds have analyses of the form [LEMMALEMMA][TAGS]. If you need the traditional form of compound analysis, one option is to add your compounds to the lexicon.
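For scripts that consume the default output, the lemmas of a generated compound can be collected with a short regular expression. The sketch below is an illustration only: lemmas and joined_lemma are hypothetical names, and the plain concatenation in joined_lemma is a naive assumption that holds for kissakoira, where the first part is in the nominative singular, but not necessarily for compounds whose first part is inflected.

```python
import re

LEMMA_RE = re.compile(r"\[LEMMA='([^']*)'\]")

def lemmas(analysis):
    """Return every LEMMA value of an analysis, left to right."""
    return LEMMA_RE.findall(analysis)

def joined_lemma(analysis):
    """Naively concatenate the lemmas into one compound lemma (an approximation)."""
    return "".join(lemmas(analysis))

kissakoira = ("[BOUNDARY=LEXITEM][LEMMA='kissa'][POS=NOUN][KTN=9][NUM=SG][CASE=NOM]"
              "[BOUNDARY=COMPOUND][GUESS=COMPOUND]"
              "[LEMMA='koira'][POS=NOUN][KTN=10][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]")
print(lemmas(kissakoira))        # ['kissa', 'koira']
print(joined_lemma(kissakoira))  # kissakoira
```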
The parts of speech are indicated in the field named POS. All word-form analyses typically start with the part of speech right after the lemma data:
[BOUNDARY][LEMMA][POS]...
The morphological division of Finnish words has three classes: verbs, nominals and others. Verbs are identified by personal, temporal, modal and non-finite inflection. Nominals are identified by number and case inflection. The others are, apart from being the rest, identified by defective or missing inflection.
The classes are further subdivided by syntactic features. The nominals consist of nouns (substantiivi), adjectives, pronouns and numerals. The others are subdivided into adpositions, adverbs and particles. Omorfi also separates conjunctions from the other particles, a distinction that is not present in the grammar but is so useful for language technology that it has been deemed necessary.
The POS values in omorfi are based on this finer, morphosyntactic classification.
The list below introduces the default example words for each of the parts of speech:
talo [BOUNDARY=LEXITEM][LEMMA='talo'][POS=NOUN][KTN=1]
[NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
kutoa [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52]
[KAV=F][VOICE=ACT][INF=A][NUM=SG][CASE=LAT][BOUNDARY=LEXITEM]
kaunis [BOUNDARY=LEXITEM][LEMMA='kaunis'][POS=ADJECTIVE]
[KTN=41][CMP=POS][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
minä [BOUNDARY=LEXITEM][LEMMA='minä'][POS=PRONOUN]
[SUBCAT=PERSONAL][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
yksi [BOUNDARY=LEXITEM][LEMMA='yksi'][POS=NUMERAL][KTN=31]
[SUBCAT=CARD][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
nopeasti [BOUNDARY=LEXITEM][LEMMA='nopeasti'][POS=ADVERB]
[BOUNDARY=LEXITEM]
hei [BOUNDARY=LEXITEM][LEMMA='hei'][POS=INTERJECTION]
[BOUNDARY=LEXITEM]
no [BOUNDARY=LEXITEM][LEMMA='no'][POS=PARTICLE]
[BOUNDARY=LEXITEM]
että [BOUNDARY=LEXITEM][LEMMA='että'][POS=CONJUNCTION]
[SUBCAT=SUBORD][BOUNDARY=LEXITEM]
| POS | Meaning | Example |
|---|---|---|
| NOUN | noun (Finnish substantiivi) | talo (house) |
| VERB | verb | kutoa (knit) |
| ADJECTIVE | adjective | kaunis (beautiful) |
| PRONOUN | pronoun | minä (I) |
| NUMERAL | numeral | yksi (one) |
| ADVERB | adverb | nopeasti (fast) |
| INTERJECTION | interjection | hei (hey!) |
| ADPOSITION | adposition | päällä (over) |
| PARTICLE | particle | no (well) |
| CONJUNCTION | conjunction | että (so that) |
Note
VISK:
- definitions > S > sanaluokka
- § 438 <http://scripta.kotus.fi/visk/sisallys.php?p=438>
- § 63 onwards explains morphological features of parts of speech <http://scripta.kotus.fi/visk/sisallys.php?p=63>.
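An application that only needs the coarse three-way division can map the POS values back to it with a lookup table. The following Python sketch follows the classification described in this section; the dictionary and function names are made up for illustration, and placing INTERJECTION under the others is an assumption, since the text does not classify it explicitly.

```python
# Coarse morphological class for each POS value listed in the table above.
COARSE_CLASS = {
    "VERB": "verbal",
    "NOUN": "nominal",
    "ADJECTIVE": "nominal",
    "PRONOUN": "nominal",
    "NUMERAL": "nominal",
    "ADPOSITION": "other",
    "ADVERB": "other",
    "PARTICLE": "other",
    "CONJUNCTION": "other",   # split off from particles by omorfi
    "INTERJECTION": "other",  # assumption: not classified explicitly in the text
}

def coarse_class(pos):
    """Return 'verbal', 'nominal' or 'other' for an omorfi POS value."""
    return COARSE_CLASS[pos]

print(coarse_class("NUMERAL"))  # nominal
```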
Nominal parts of speech share a common nominal declension consisting of 16 cases in singular and plural, each of which can combine with any possessive suffix and with any clitics. In total this gives some thousands of word forms per word. The nominal parts of speech include nouns, adjectives, numerals and pronouns. The nominalised forms of verbs also take the nominal declension. The format of the noun analysis string is:
...[NUM][CASE][POSS]?[CLIT]*[BOUNDARY]
Examples of nouns in the tables of this section are given with forms of the word valo (light), which does not have any stem variation in inflection. Here are a few examples of valo's inflectional pattern:
valolle [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=ALL][BOUNDARY=LEXITEM]
valolleni [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=ALL][POSS=SG1][BOUNDARY=LEXITEM]
valoillenikokaan [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1]
[NUM=PL][CASE=ALL][POSS=SG1][CLIT=KO][CLIT=KAAN][BOUNDARY=LEXITEM]
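The informal formula for the end of a nominal analysis translates quite directly into a real regular expression. The Python sketch below is one possible rendering, not something shipped with omorfi; the names NOMINAL_TAIL and nominal_tail are hypothetical, and the value patterns assume capital letters (plus digits in POSS) as seen in the examples.

```python
import re

# ...[NUM][CASE][POSS]?[CLIT]*[BOUNDARY] at the end of an analysis string.
NOMINAL_TAIL = re.compile(
    r"\[NUM=(?P<num>[A-Z]+)\]"
    r"\[CASE=(?P<case>[A-Z]+)\]"
    r"(?:\[POSS=(?P<poss>[A-Z0-9]+)\])?"
    r"(?P<clitics>(?:\[CLIT=[A-Z]+\])*)"
    r"\[BOUNDARY=[A-Z]+\]$"
)

def nominal_tail(analysis):
    """Return the number, case, possessive and clitics of the final word, or None."""
    match = NOMINAL_TAIL.search(analysis)
    if match is None:
        return None
    return {
        "num": match.group("num"),
        "case": match.group("case"),
        "poss": match.group("poss"),  # None when there is no possessive suffix
        "clitics": re.findall(r"\[CLIT=([A-Z]+)\]", match.group("clitics")),
    }

analysis = ("[BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=PL][CASE=ALL]"
            "[POSS=SG1][CLIT=KO][CLIT=KAAN][BOUNDARY=LEXITEM]")
print(nominal_tail(analysis))
```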
Nominals inflect in number to mark plurality. NUM for nouns is either singular or plural, or in some cases underspecified. The number ending comes first after the word stem, but it is often more or less fused with the case ending, and it usually causes stem variation:
valo [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=NOM][BOUNDARY=LEXITEM]
valot [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=PL]
[CASE=NOM][BOUNDARY=LEXITEM]
| NUM | Meaning | Example |
|---|---|---|
| SG | Singular | valo (light) |
| PL | Plural | valot (lights) |
Note
VISK § 79–80 <http://scripta.kotus.fi/visk/sisallys.php?p=79>
CASE for nominals has 16 possible values. The cases of Finnish nominals mark syntactic roles (nominative, partitive, accusative-genitive) and semantics (the others, though partially also the syntactic cases). The syntactic designation or semantic gloss is given in the Meaning column. The translations in the Example column are approximate, since there is no 1:1 correspondence between the semantic cases of Finnish and the prepositions of English.
While many cases have only one distinct ending, some combinations of number and case can exhibit up to 6 distinct case markers:
valo [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=NOM][BOUNDARY=LEXITEM]
valoa [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=PAR][BOUNDARY=LEXITEM]
valon [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=GEN][BOUNDARY=LEXITEM]
valossa [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=INE][BOUNDARY=LEXITEM]
valosta [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=ELA][BOUNDARY=LEXITEM]
valoon [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=ILL][BOUNDARY=LEXITEM]
valolla [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=ADE][BOUNDARY=LEXITEM]
valolta [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=ABL][BOUNDARY=LEXITEM]
valolle [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=ALL][BOUNDARY=LEXITEM]
valona [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=ESS][BOUNDARY=LEXITEM]
valoksi [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=TRA][BOUNDARY=LEXITEM]
valotta [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG]
[CASE=ABE][BOUNDARY=LEXITEM]
valoine [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=9]
[CASE=CMT][BOUNDARY=LEXITEM]
valoin [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=9]
[NUM=PL][CASE=INS][BOUNDARY=LEXITEM]
| CASE | Meaning | Example |
|---|---|---|
| NOM | Nominative (subject) | valo (light) |
| PAR | Partitive (partial object) | valoa (some light) |
| GEN | Genitive (attribute/possessive) | valon (light's) |
| INE | Inessive (in inside) | valossa (in light) |
| ELA | Elative (away from inside) | valosta (from (inside of) light) |
| ILL | Illative (into inside) | valoon (to light) |
| ADE | Adessive (on surface/vicinity) | valolla (on/nearby light) |
| ABL | Ablative (from surface/vicinity) | valolta (from (nearby of) light) |
| ALL | Allative (on to surface/vicinity) | valolle (towards the light) |
| ESS | Essive (as) | valona (as light) |
| TRA | Translative (become as) | valoksi (into light) |
| ABE | Abessive (without) | valotta (without light) |
| CMT | Comitative (with/in company of) | valoine (with lights) |
| INS | Instructive (with/by using) | valoin (using lights) |
Note
VISK § 81–94 <http://scripta.kotus.fi/visk/sisallys.php?p=81>
A possessive ending indicates ownership and always attaches after a case ending. POSS can take six possible values, for first, second and third person references in singular and plural, where the third person form is always ambiguous in number. The third person form also has two allomorphs, the latter of which typically only occurs after long vowels. Here are the example readings of the word valo (light):
valoni [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1]
[NUM=SG][CASE=NOM][POSS=SG1][BOUNDARY=LEXITEM]
valosi [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1]
[NUM=SG][CASE=NOM][POSS=SG2][BOUNDARY=LEXITEM]
valonsa [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1]
[NUM=SG][CASE=NOM][POSS=SG3][BOUNDARY=LEXITEM]
valonsa [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1]
[NUM=SG][CASE=NOM][POSS=PL3][BOUNDARY=LEXITEM]
valomme [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1]
[NUM=SG][CASE=NOM][POSS=PL1][BOUNDARY=LEXITEM]
valonne [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1]
[NUM=SG][CASE=NOM][POSS=PL2][BOUNDARY=LEXITEM]
| POSS | Meaning | Example |
|---|---|---|
| SG1 | First person singular | valoni (my light) |
| SG2 | Second pers. singular | valosi (your light) |
| SG3, PL3 | third person singular or plural | valonsa (his/her/their light) |
| PL1 | First person plural | valomme (our light) |
| PL2 | Second pers. plural | valonne (your light) |
Note
VISK § 95–97 <http://scripta.kotus.fi/visk/sisallys.php?p=95>
Nouns currently have only one subcategory, proper nouns (names). Proper nouns are usually written with an initial capital or, more recently, with totally arbitrary capitalisation, as in the brand names nVidia and ATi. Proper nouns have full inflectional morphology exactly like other nouns, but work slightly differently in derivation and compounding. Some capitalised nouns may also lose their capitalisation in derivation. Here are examples of the semantic subclasses of proper nouns:
Pekka [BOUNDARY=LEXITEM][LEMMA='Pekka'][POS=NOUN]
[SUBCAT=PROPER][KTN=9][KAV=A][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
Virtanen [BOUNDARY=LEXITEM][LEMMA='Virtanen'][POS=NOUN]
[SUBCAT=PROPER][KTN=38][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
Helsinki [BOUNDARY=LEXITEM][LEMMA='Helsinki'][POS=NOUN]
[SUBCAT=PROPER][KTN=5][KAV=G][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
| SUBCAT | Meaning | Examples |
|---|---|---|
| PROPER | proper noun | Pekka (personal name), Virtanen (surname), Helsinki (geographical name) |
Note
VISK § 98 <http://scripta.kotus.fi/visk/sisallys.php?p=98>
Certain nominal cases have multiple surface forms, which some applications need to tell apart. For these cases the omor tagset provides the ALLO tag. The value of ALLO is the morphophonemic representation of the morpheme, written in capitals, such as A for the partitive ending a or ä.
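To illustrate what a capital letter such as A stands for, the following Python sketch realises a morphophonemic ending on a stem using a simplified version of Finnish vowel harmony. It is not how omorfi itself computes surface forms; the function realise and the tables are hypothetical, the O and U entries are added by analogy with A, and the rule used (the last harmonic vowel of the stem decides, front vowels by default) ignores stem changes and other exceptions.

```python
# Surface realisations of the capital-letter vowels (simplified).
BACK = {"A": "a", "O": "o", "U": "u"}
FRONT = {"A": "ä", "O": "ö", "U": "y"}

def realise(stem, ending):
    """Attach an ending to a stem, choosing back or front vowels by vowel harmony."""
    harmonic = [c for c in stem.lower() if c in "aouäöy"]
    # The last harmonic vowel decides; stems with only e and i take front vowels.
    use_back = bool(harmonic) and harmonic[-1] in "aou"
    table = BACK if use_back else FRONT
    return stem + "".join(table.get(c, c) for c in ending)

print(realise("valo", "A"))  # valoa
print(realise("kylä", "A"))  # kylää
```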
Adjectives are effectively inflected as nouns, with an additional level of comparison forms before the regular nominal inflection. Adjectives are also very unlikely to have possessive suffixes. The adjective analysis strings are of the form:
[POS=ADJECTIVE][KTN][KAV]?[CMP][NUM][CASE][POSS]?[CLIT]?
The examples in this section are given with nopea (fast). Here is an example of how the comparison forms feed into nominal inflection:
nopea [BOUNDARY=LEXITEM][LEMMA='nopea'][POS=ADJECTIVE][KTN=15]
[CMP=POS][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
nopeampi [BOUNDARY=LEXITEM][LEMMA='nopea'][POS=ADJECTIVE]
[KTN=15][CMP=CMP][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
nopein [BOUNDARY=LEXITEM][LEMMA='nopea'][POS=ADJECTIVE]
[KTN=15][CMP=SUP][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
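The informal formulas used in this document, such as the adjective formula above, can be turned into working regular expressions mechanically. The following Python sketch is one way to do it, under the assumption that a formula consists only of [NAME] and [NAME=VALUE] units followed by optional ? or * marks; formula_to_regex is a hypothetical helper, not part of omorfi.

```python
import re

def formula_to_regex(formula):
    """Compile an informal formula like [CMP][NUM][CASE][POSS]? into a regex."""
    def convert(match):
        body = match.group(1)
        if "=" in body:
            # [NAME=VALUE] must match literally.
            return "(?:" + re.escape("[" + body + "]") + ")"
        # [NAME] matches the tag with any value.
        return "(?:" + r"\[" + body + r"=[^\]]*\]" + ")"
    return re.compile(re.sub(r"\[([A-Z=]+)\]", convert, formula))

ADJECTIVE = formula_to_regex(
    "[POS=ADJECTIVE][KTN][KAV]?[CMP][NUM][CASE][POSS]?[CLIT]?")

nopeampi = ("[BOUNDARY=LEXITEM][LEMMA='nopea'][POS=ADJECTIVE][KTN=15]"
            "[CMP=CMP][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]")
print(bool(ADJECTIVE.search(nopeampi)))  # True
```

Each unit is wrapped in a non-capturing group so that a following ? or * applies to the whole tag rather than to its closing bracket.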
Note
VISK §
Comparison has three levels, marked by the CMP tag. In the modern grammar, comparison is treated as derivation rather than regular inflection, which also makes sense for Omorfi, since each form of comparison has a full set of nominal inflection. The comparative suffixes precede the nominal inflection.
| CMP | Meaning | Example |
|---|---|---|
| POS | Positive | nopea |
| CMP | Comparative | nopeampi |
| SUP | Superlative | nopein |
Note
VISK § 300 <http://scripta.kotus.fi/visk/sisallys.php?p=300>
Numerals have no inflection of their own beyond that of nouns. They do, however, have special compounding restrictions and patterns. Numerals are also a typical part of speech in other systems, so they are included here as a separate class. The analysis of numeral compounds is detailed in the compounding section, but otherwise numerals follow the basic nominal pattern. Note that this means full nominal inflection: Finnish numerals have both singular and plural forms. The analysis strings are as with nouns:
[POS=NUMERAL][KTN][KAV]?[SUBCAT][NUM][CASE][POSS]?[CLIT]*
The numerals are, of course, an infinite yet closed class of words. The implementation of Omorfi aims to recognise all of the numeral words and their compounds, using systematic names for very large numerals. The systematic names consist of a Latin-derived prefix x and a suffix part, giving xillions and xilliards (i.e. like long scale English numerals). So the scale goes from miljoona (10^6, million), miljardi (10^9, milliard), biljoona, biljardi, triljoona, and so on for the prefixes kvadri-, kvinti-, septi-, ..., until sentiljoona (10^600). Here are a few examples:
yksi [BOUNDARY=LEXITEM][LEMMA='yksi'][POS=NUMERAL][KTN=31]
[SUBCAT=CARD][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
kaksitoista [BOUNDARY=LEXITEM][LEMMA='kaksi'][POS=NUMERAL]
[KTN=31][SUBCAT=CARD][NUM=SG][CASE=NOM]
[BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='toista']
[POS=NUMERAL][BOUNDARY=LEXITEM]
satakaksikymmentäkolmemiljoonaaneljäsataaviisikymmentäkuusituhattaseitsemänsataakahdeksankymmentäyhdeksän
[BOUNDARY=LEXITEM][LEMMA='sata'][POS=NUMERAL][KTN=9]
[KAV=F][SUBCAT=CARD][NUM=SG][CASE=NOM]
[BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='kaksi']
[POS=NUMERAL][KTN=31][SUBCAT=CARD][NUM=SG][CASE=NOM]
[BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='kymmenen']
[POS=NUMERAL][KTN=32][SUBCAT=CARD][NUM=SG][CASE=PAR]
[BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='kolme']
[POS=NUMERAL][KTN=7][SUBCAT=CARD][NUM=SG][CASE=NOM]
[BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='miljoona']
[POS=NUMERAL][KTN=10][SUBCAT=CARD][NUM=SG][CASE=PAR]
[BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='neljä']
[POS=NUMERAL][KTN=10][SUBCAT=CARD][NUM=SG][CASE=NOM]
[BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='sata']
[POS=NUMERAL][KTN=9][KAV=F][SUBCAT=CARD][NUM=SG]
[CASE=PAR][BOUNDARY=COMPOUND][GUESS=COMPOUND]
[LEMMA='viisi'][POS=NUMERAL][KTN=27][SUBCAT=CARD]
[NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]
[LEMMA='kymmenen'][POS=NUMERAL][KTN=32][SUBCAT=CARD]
[NUM=SG][CASE=PAR][BOUNDARY=COMPOUND][GUESS=COMPOUND]
[LEMMA='kuusi'][POS=NUMERAL][KTN=27][SUBCAT=CARD]
[NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]
[LEMMA='tuhat'][POS=NUMERAL][KTN=46][SUBCAT=CARD]
[NUM=SG][CASE=PAR][BOUNDARY=COMPOUND][GUESS=COMPOUND]
[LEMMA='seitsemän'][POS=NUMERAL][KTN=10][SUBCAT=CARD]
[NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]
[LEMMA='sata'][POS=NUMERAL][KTN=9][KAV=F][SUBCAT=CARD]
[NUM=SG][CASE=PAR][BOUNDARY=COMPOUND][GUESS=COMPOUND]
[LEMMA='kahdeksan'][POS=NUMERAL][KTN=10][SUBCAT=CARD]
[NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]
[LEMMA='kymmenen'][POS=NUMERAL][KTN=32][SUBCAT=CARD]
[NUM=SG][CASE=PAR][BOUNDARY=COMPOUND][GUESS=COMPOUND]
[LEMMA='yhdeksän'][POS=NUMERAL][KTN=10][SUBCAT=CARD]
[NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
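Because each part of a numeral compound carries its own lemma, the numeric value of the compound can be recovered from the sequence of lemmas. The Python sketch below shows one way to do this; it is an illustration rather than a feature of omorfi, the names VALUES and numeral_value are hypothetical, and the value table covers only the lemmas that occur in the examples above.

```python
import re

# Values of the numeral lemmas seen in the examples (not a complete list).
VALUES = {
    "yksi": 1, "kaksi": 2, "kolme": 3, "neljä": 4, "viisi": 5,
    "kuusi": 6, "seitsemän": 7, "kahdeksan": 8, "yhdeksän": 9,
    "kymmenen": 10, "sata": 100, "tuhat": 1000, "miljoona": 10 ** 6,
}

def numeral_value(analysis):
    """Compute the value of a cardinal numeral compound from its LEMMA tags."""
    total = group = digit = 0
    for lemma in re.findall(r"\[LEMMA='([^']*)'\]", analysis):
        if lemma == "toista":          # teens: kaksitoista = 2 + 10
            group += digit + 10
            digit = 0
            continue
        value = VALUES[lemma]
        if value < 10:                 # a digit waits for a possible multiplier
            digit = value
        elif value < 1000:             # tens and hundreds multiply the digit
            group += (digit or 1) * value
            digit = 0
        else:                          # thousands and up multiply the whole group
            group += digit
            total += (group or 1) * value
            group = digit = 0
    return total + group + digit

kaksitoista = ("[BOUNDARY=LEXITEM][LEMMA='kaksi'][POS=NUMERAL][KTN=31][SUBCAT=CARD]"
               "[NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND]"
               "[LEMMA='toista'][POS=NUMERAL][BOUNDARY=LEXITEM]")
print(numeral_value(kaksitoista))  # 12
```

Applied to the lemma sequence of the long example above (sata, kaksi, kymmenen, kolme, miljoona, ...), the same loop yields 123456789.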
Note
VISK § 99 <http://scripta.kotus.fi/visk/sisallys.php?p=99>
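The long scale naming pattern described in this section follows simple arithmetic: the n-th -iljoona numeral is 10^(6n) and the corresponding -iljardi numeral is 10^(6n+3). The Python sketch below spells this out for the prefixes mentioned in the text. The names PREFIX_STEMS and systematic_names are made up for illustration, and the spelled-out forms beyond triljoona are assumed to be formed by simply joining the prefix and the suffix.

```python
# Prefix stems mentioned in the text, keyed by their ordinal n (mi-, bi-, tri-, ...).
PREFIX_STEMS = {1: "m", 2: "b", 3: "tr", 4: "kvadr", 5: "kvint", 7: "sept"}

def systematic_names(n):
    """Return {name: power of ten} for the n-th -iljoona and -iljardi numerals."""
    stem = PREFIX_STEMS[n]
    return {stem + "iljoona": 6 * n, stem + "iljardi": 6 * n + 3}

for n in (1, 2, 3):
    print(systematic_names(n))
```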
Numerals have functional subcategories for semantics, which have been used in most of the other systems and retained here as well. The distinction is made between cardinal and ordinal numbers, and is purely semantic:
kolme [BOUNDARY=LEXITEM][LEMMA='kolme'][POS=NUMERAL][KTN=7]
[SUBCAT=CARD][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
neljäs [BOUNDARY=LEXITEM][LEMMA='neljäs'][POS=NUMERAL]
[KTN=45][SUBCAT=ORD][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
| SUBCAT | Meaning | Example |
|---|---|---|
| CARD | cardinal | kolme (three) |
| ORD | ordinal | neljäs (fourth) |
For some numerals there are special derived forms with an approximative meaning. These forms are often not fully inflected, or not inflected at all, and they do not participate in compounding:
kuutisen
toistasataa
| SUBCAT | Meaning | Example |
|---|---|---|
| APPROX | approximative | kuutisen (about six), toistasataa (100–200) |
Note
VISK §
Pronouns inflect mostly like nouns, but have their own POS. Pronouns are also the only nominals with explicit, phonemically distinct accusative markers. Many pronouns have a defective paradigm, e.g. only singular or only plural forms, or a heteroclitic paradigm. Pronoun analyses are of the same form as those of other nominals:
[POS=PRONOUN][KTN][KAV]?[NUM][CASE][POSS]?[CLIT]*
Note
VISK § 100 <http://scripta.kotus.fi/visk/sisallys.php?p=100>
Some of the pronouns have the accusative as a separate case:
minut [BOUNDARY=LEXITEM][LEMMA='minä'][POS=PRONOUN]
[SUBCAT=PERSONAL][NUM=SG][CASE=ACC][BOUNDARY=LEXITEM]
| CASE | Meaning | Examples |
|---|---|---|
| ACC | Accusative (object) | minut (me) |
Note
VISK §
Pronouns are divided into semantic classes by use. The classification is taken directly from the modern grammar:
minä [BOUNDARY=LEXITEM][LEMMA='minä'][POS=PRONOUN]
[SUBCAT=PERSONAL][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
tämä [BOUNDARY=LEXITEM][LEMMA='tämä'][POS=PRONOUN]
[SUBCAT=DEMONSTR][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
kuka [BOUNDARY=LEXITEM][LEMMA='kuka'][POS=PRONOUN]
[SUBCAT=INTERROG][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
joka [BOUNDARY=LEXITEM][LEMMA='joka'][POS=PRONOUN]
[SUBCAT=RELATIVE][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
kukaan [BOUNDARY=LEXITEM][LEMMA='kukaan'][POS=PRONOUN]
[SUBCAT=QUANTOR][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
itse [BOUNDARY=LEXITEM][LEMMA='itse'][POS=PRONOUN]
[SUBCAT=REFLEX][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
toinen [BOUNDARY=LEXITEM][LEMMA='toinen'][POS=PRONOUN]
[SUBCAT=RECIPROC][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
| SUBCAT | Meaning | Examples |
|---|---|---|
| PERSONAL | Personal | minä (I) |
| DEMONSTR | Demonstrative | tämä (this) |
| INTERROG | Interrogative | kuka (who?) |
| RELATIVE | Relative | joka (who) |
| QUANTOR | Quantifier | kukaan (no one) |
| REFLEX | Reflexive | itse (self) |
| RECIPROC | Reciprocal | toinen (each other) |
Note
VISK § 101–104 <http://scripta.kotus.fi/visk/sisallys.php?p=101>
Adverbs and adpositions are typically derived or inflected word forms with lexicalised meanings and defective inflection patterns. Habitive adverbs (mainly, but not only, sti derivations) have comparison and clitics; locative adverbs have partial locative cases, possessives and clitics; temporal adverbs have only clitics. Prolatives and similar forms (e.g. yli ~ ylitse) may also have only clitics. Many inflected forms of adverbs are further lexicalised into separate adverbs (i.e. all forms of one adverb have dictionary entries). Intensifying adverbs might not take clitics at all. The analysis strings of adverbs therefore vary case by case. Mostly they fall under the simple forms:
[POS=ADVERB][CASE]?[POSS]?[CLIT]?
[POS=ADPOSITION][CASE]?[POSS]?[CLIT]?
Note
VISK § 678 (discriminating adverb from adposition) <http://scripta.kotus.fi/visk/sisallys.php?p=678>
As noted earlier, many adverbs are nominals with current or archaic case endings, and the endings may be marked in omorfi as long as they are clear. The sti derivation of adjectives is also productive in the class of manner adverbs. Certain types of adverbs that are mostly productively derived may be available in omorfi:
nopeasti [BOUNDARY=LEXITEM][LEMMA='nopea'][POS=ADJECTIVE][KTN=15][CMP=POS][GUESS=DERIVE][DRV=STI][POS=ADVERB][BOUNDARY=LEXITEM]
meritse [BOUNDARY=LEXITEM][LEMMA='meri'][POS=NOUN][KTN=24][GUESS=DERIVE][NUM=PL][DRV=TSE][POS=ADVERB][CASE=PRL][BOUNDARY=LEXITEM]
taloittain [BOUNDARY=LEXITEM][LEMMA='talo'][POS=NOUN][KTN=1][GUESS=DERIVE][NUM=PL][DRV=TTAIN][POS=ADVERB][CASE=DIS][BOUNDARY=LEXITEM]
| CASE | Meaning | Example |
|---|---|---|
| PRL | prolative | meritse (by sea) |
| DIS | distributive | taloittain (house by house) |
Adpositions are, like adverbs, current or archaic inflectional forms of regular nominals. They are further sub-categorised by their syntactic behaviour into prepositions and postpositions. Prepositions appear at the front of the adpositional phrase and postpositions at the back. Many adpositions can appear in both positions.
Acronyms in omorfi are shortened nominals that inflect. Their inflected forms are built by adding a colon to the acronym and most of the inflectional endings after the colon. Acronyms may be inflected in three ways: the endings after the colon may follow either the inflection of the last letter of the acronym or that of its last word, and acronyms that form phonotactically valid words may be inflected as regular nouns. Inflection by the last word is only implemented if the lexical source records the last word of the acronym. For example STT, short for Suomen tietotoimisto (Finland's information office), is inflected as STT:hen in the illative, since the letter tee (T) is teehen in the illative, but STT:oon is also a valid illative, since -toimisto is -toimistoon in the illative (the additional o is an orthographic convention):
STT [BOUNDARY=LEXITEM][LEMMA='STT'][POS=ACRONYM][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
STT:hen [BOUNDARY=LEXITEM][LEMMA='STT'][POS=ACRONYM][NUM=SG][CASE=ILL][BOUNDARY=LEXITEM]
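The colon convention makes it straightforward to separate an inflected acronym from its ending before any further processing. This is a small sketch with a hypothetical helper name, `split_acronym`; it only splits the surface string and does not decide which of the inflection styles was used.

```python
def split_acronym(word_form):
    """Split an inflected acronym such as 'STT:hen' into (acronym, ending).

    A form without a colon is returned with an empty ending.
    """
    acronym, _, ending = word_form.partition(":")
    return acronym, ending


for form in ("STT", "STT:hen", "STT:oon"):
    print(form, split_acronym(form))
```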
Acronyms that form phonotactically valid words may often be inflected as regular nouns. Since their inflection follows the regular noun pattern, e.g. KELA (Kansaneläkelaitos, the social security office) is inflected like the noun kela (reel), they should be treated as regular nouns in all parts of the morphology. Some of these words lose their acronym interpretation and become regular nouns written in lowercase, such as laser. The lowercase variants are also allowed for other words:
AIDSilla [BOUNDARY=LEXITEM][LEMMA='AIDS'][POS=NOUN][KTN=5] [NUM=SG][CASE=ADE][BOUNDARY=LEXITEM]
The non-inflecting abbreviations are described in their own section.
Verb conjugation includes voice (in Finnish grammars also verbal genus), tense (tempus), mood (modus), personal endings or negation marker, and clitics. The analysis strings of verb inflection are not as systematic as those of nouns, as most categories collapse together in forms: for example, the voice distinction does not exist in all moods and tenses, and the tense distinction only exists in one mood. Instead of underdefining analyses, tags are often omitted, so verb analysis strings vary. Part of the regular derivation of verbs is typically included in the inflection, as has been done in traditional grammars. These infinite forms have nominal declension. The analysis string for finite verb forms is:
[POS=VERB][KTN][KAV]?[VOICE][MOOD][TENSE]?[PRS]?[NEG]?[CLIT]?
The infinite forms of verbs may have voice included. The infinite forms are split into infinitives, participles and derivations. The analysis string after these markers is the same as for all nominals:
[POS=VERB][KTN][KAV]?[VOICE][INF][NUM]?[CASE]?
For participles the part after [VOICE] is the same as in nominal declension. For infinitives, only some of the CASE values may appear, and a full listing of those cases can be found below.
Note
VISK § 105 <http://scripta.kotus.fi/visk/sisallys.php?p=105>
Verbs have only one special subcategory for negation verb ei, which has partial inflection:
en [BOUNDARY=LEXITEM][LEMMA='ei'][POS=VERB][SUBCAT=NEG][VOICE=ACT][PRS=SG1][BOUNDARY=LEXITEM]
| SUBCAT | Meaning | Example |
|---|---|---|
| NEG | negation verb | en (I don't) |
Note
Marking the negation verb as a specific sub-category of verbs, and marking the verb form that only goes along with it as conneg, has some history in fennistics, but I do not know the origin of the practice and it is not in VISK. In fact this practice was added for interoperability with Sámi language morphologies, which follow the same tagging.
The finite inflection of verbs concerns actual verbal inflection in person, mood, tense.
The personal ending of a verb defines the actors. PRS has seven possible values: six for the singular and plural forms of the first, second and third person, and one specifically for the passive. The passive personal form is encoded as fourth person passive, which has been the common practice in past systems and is accurate naming:
kudon [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=INDV][TENSE=PRES][PRS=SG1][BOUNDARY=LEXITEM]
kudot [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=INDV][TENSE=PRES][PRS=SG2][BOUNDARY=LEXITEM]
kutoo [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=INDV][TENSE=PRES][PRS=SG3][BOUNDARY=LEXITEM]
kudomme [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=INDV][TENSE=PRES][PRS=PL1][BOUNDARY=LEXITEM]
kudotte [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=INDV][TENSE=PRES][PRS=PL2][BOUNDARY=LEXITEM]
kutovat [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=INDV][TENSE=PRES][PRS=PL3][BOUNDARY=LEXITEM]
kudotaan [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=PSS][MOOD=INDV][TENSE=PRES][PRS=PE4][BOUNDARY=LEXITEM]
| PRS | Meaning | Example |
|---|---|---|
| SG1 | 1st person singular | kudon (I knit) |
| SG2 | 2nd person singular | kudot (you knit) |
| SG3 | 3rd person singular | kutoo (he/she/it knits) |
| PL1 | 1st person plural | kudomme (we knit) |
| PL2 | 2nd person plural | kudotte (you knit) |
| PL3 | 3rd person plural | kutovat (they knit) |
| PE4 | 4th person passive | kudotaan (knitting is being done) |
Note
VISK § 106–107 <http://scripta.kotus.fi/visk/sisallys.php?p=106>
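The seven PRS values follow a regular naming scheme, so an application can decode them without a lookup table for every value. The sketch below is only an illustration; the function name `describe_prs` and the returned tuple layout are assumptions, not part of omorfi.

```python
# Number prefixes used in the six active personal forms.
NUMBER_NAMES = {"SG": "singular", "PL": "plural"}


def describe_prs(value):
    """Decode a PRS value into (number or 'passive', person)."""
    if value == "PE4":
        # The passive is encoded as a separate fourth person.
        return ("passive", 4)
    prefix, person = value[:2], value[2:]
    if prefix not in NUMBER_NAMES or person not in ("1", "2", "3"):
        raise ValueError("unknown PRS value: " + value)
    return (NUMBER_NAMES[prefix], int(person))


for prs in ("SG1", "PL3", "PE4"):
    print(prs, describe_prs(prs))
```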
Verbs have specific forms going together with negation verb (which has partial inflection itself). This form is marked with a NEG tag with value CON. The existence of negated form varies between moods, voices and tenses:
kudo [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][MOOD=INDV][TENSE=PRES][NEG=CON][BOUNDARY=LEXITEM]
kudota [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=PSS][MOOD=INDV][TENSE=PRES][PRS=PE4][NEG=CON][BOUNDARY=LEXITEM]
| NEG | Meaning | Example |
|---|---|---|
| CON | Negated form | (en) kudo (I don't knit), (ei) kudota (no knitting) |
Note
VISK § 109 <http://scripta.kotus.fi/visk/sisallys.php?p=109>
Verb inflection has two categories for active and passive voice, marked in tag named VOICE. For finite verb forms active voice is tied to personal forms and passive voice to non-personal verb endings. The voice is also marked in some of the infinite verb forms:
kudon [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=INDV][TENSE=PRES][PRS=SG1][BOUNDARY=LEXITEM]
kudotaan [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=PSS][MOOD=INDV][TENSE=PRES][PRS=PE4][BOUNDARY=LEXITEM]
| VOICE | Meaning | Example |
|---|---|---|
| ACT | active | kudon (I knit) |
| PSS | passive | kudotaan (knitting) |
Note
VISK § 110 <http://scripta.kotus.fi/visk/sisallys.php?p=110>, on the passive
Verbs may inflect to mark up tense. TENSE has two values. For moods other than indicative the tense is not distinctive in surface form, and therefore not marked in the analyses. The morphologically distinct forms in Finnish form only distinctions between past and non-past tenses, which should be noted since some historical systems have talked about imperfect and present:
kudon [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=INDV][TENSE=PRES][PRS=SG1][BOUNDARY=LEXITEM]
kudoin [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=INDV][TENSE=PAST][PRS=SG1][BOUNDARY=LEXITEM]
| TENSE | Meaning | Example |
|---|---|---|
| PRES | non-past | kudon (I knit) |
| PAST | past | kudoin (I knitted) |
Note
VISK § 112 <http://scripta.kotus.fi/visk/sisallys.php?p=112>, § 111 for tenses and moods collectively
Finite verb forms inflect to mark up moods. Mood is systematically included in analysis strings, even with unmarked indicative. Only indicative mood includes full set of temporal and personal inflection, others have limited inflection in current use. Some forms may also be covered by theoretical or archaic word forms, which are included in some versions of Omorfi. MOOD has four possible values:
kudon [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=INDV][TENSE=PRES][PRS=SG1][BOUNDARY=LEXITEM]
kudo [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=IMPV][PRS=SG2][BOUNDARY=LEXITEM]
kutoisin [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=COND][PRS=SG1][BOUNDARY=LEXITEM]
kutonen [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][MOOD=POTN][PRS=SG1][BOUNDARY=LEXITEM]
| VALUE | Meaning | Example |
|---|---|---|
| INDV | indicative | kudon (I knit) |
| IMPV | imperative | kudo (do knit!) |
| COND | conditional | kutoisin (I would knit) |
| POTN | potential | kutonen (I might knit) |
Note
VISK § 115–118 <http://scripta.kotus.fi/visk/sisallys.php?p=115>, § 111 for tenses and moods collectively
Infinite verb forms are in principle nominal derivations from verbs, included in morphology as inflection by long linguistic tradition. It is especially notable that the A infinitive with lative case marking is still considered the dictionary form of the verb.
INF has 4 possible values. Also one fully productive derivational form used to be marked infinitive in old grammars. In traditional grammars the infinitive forms were called I, II, III, IV and V infinitive, the modern grammar replaces the first three with A, E and MA respectively. The IV infinitive, which has minen suffix marker, has been reanalysed as derivational and this is reflected in Omorfi. The V infinitive is also assumed to be mainly derivational, but included here for reference.
The short form of the A infinitive is in the lative case, which is extinct in nominal declension. The long form of the A infinitive is translative and requires a possessive suffix. For the E infinitive the possible cases are inessive and instructive; the possessive suffix is optional for both, but rare for the instructive form. For the MA infinitive the possible cases are abessive, adessive, elative, illative, inessive and instructive; the possessive ending is very rare, since it usually indicates the agent participle instead. The mAisillA derivation is theoretically already in the adessive case (of the inen derivation of the mA infinitive, but this re-analysis is not performed in omorfi) and therefore has no case inflection; its possessive endings are optional but common. The minen derivation creates a noun root form and has standard nominal inflection:
kutoa [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][INF=A][NUM=SG][CASE=LAT][BOUNDARY=LEXITEM]
kutoen [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][INF=E][NUM=SG][CASE=INS][BOUNDARY=LEXITEM]
kutomatta [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][INF=MA][NUM=SG][CASE=ABE][BOUNDARY=LEXITEM]
kutominen [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][DRV=MINEN][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
kutomaisillani [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][DRV=MAISILLA][POSS=SG1][BOUNDARY=LEXITEM]
| INF | Meaning | Examples |
|---|---|---|
| A | A infinitive | kutoa (to knit) |
| E | E infinitive | kutoen (by knitting) |
| MA | Ma infinitive | kutomatta (without knitting) |
| DRV=MINEN | IV infinitive | kutominen (knitting n.) |
| DRV=MAISILLA | V infinitive | kutomaisillani (I am about to knit) |
Note
VISK § 120–121 <http://scripta.kotus.fi/visk/sisallys.php?p=120>, § 119 for infinite forms collectively
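The case restrictions on infinitives described above can be written down as a small table and used, for example, to sanity-check analyses from other sources. This is a sketch under assumptions: the tag names LAT, INS, ABE, ADE and ILL appear in examples in this document, whereas TRA, INE and ELA are assumed names for translative, inessive and elative, and `infinitive_case_ok` is a hypothetical helper.

```python
# Cases that each infinitive may take, following the prose description.
# TRA, INE and ELA are assumed tag names for translative, inessive, elative.
ALLOWED_CASES = {
    "A": {"LAT", "TRA"},
    "E": {"INE", "INS"},
    "MA": {"ABE", "ADE", "ELA", "ILL", "INE", "INS"},
}


def infinitive_case_ok(inf, case):
    """Return True if the CASE value is possible for the given INF value."""
    return case in ALLOWED_CASES.get(inf, set())


print(infinitive_case_ok("A", "LAT"))   # kutoa
print(infinitive_case_ok("MA", "ABE"))  # kutomatta
print(infinitive_case_ok("E", "ABE"))   # not a possible combination
```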
There are 4 participle forms. As with infinitives, participles in traditional grammars were named I and II where modern grammars use NUT and VA. The agent and negation participles have sometimes been considered to be outside regular inflection, but modern Finnish grammars place them alongside the other participles, so they are included in inflection in omorfi as well. In some grammars the NUT and VA participles have been called past and present participles respectively, drawing parallels from other languages, but these names are misleading and should usually be avoided. The participles work mostly as adjectival or nominal derivations, and may have full nominal inflection:
kutonut [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][PCP=NUT][CMP=POS][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
kutova [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][PCP=VA][CMP=POS][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
kutoma [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][PCP=MA][CMP=POS][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
kutomaton [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][VOICE=ACT][PCP=NEG][CMP=POS][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
| PCP | Meaning | Example |
|---|---|---|
| NUT | Nut participle | kutonut (having knitted) |
| VA | Va participle | kutova (knitting) |
| MA | Agent participle | kutomani (which I knitted) |
| NEG | Negated participle | kutomaton (unknitted) |
Warning
Be aware that some traditional commercial software for Finnish morphology mistakenly analyses agent participles as MA infinitives, which results in different taggings in some reference corpora you may see. Apart from semantics, the agent participle can be distinguished from the MA infinitive in that it almost always requires a possessive suffix and only rarely specifies the agent by syntactic means. Also, participles allow all cases, whereas the set of cases used with infinitives is limited.
Note
VISK § 122 <http://scripta.kotus.fi/visk/sisallys.php?p=122>, § 119 for infinite forms collectively
Clitics are suffixes which can attach almost anywhere at the ends of words, both verb forms and nominals. They also attach to the end of other clitics, forming theoretically infinite chains. In practice it is usual to see at most three in one word form. Two clitics have limited use: -s only appears in a few verb forms and combined with other clitics, and -kA only appears with a few adverbs and the negation verb. Their meaning also varies largely with context and even intonation, so the glosses below are only vaguely relevant:
valohan [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG][CASE=NOM][CLIT=HAN][BOUNDARY=LEXITEM]
valokaan [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG][CASE=NOM][CLIT=KAAN][BOUNDARY=LEXITEM]
valokin [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG][CASE=NOM][CLIT=KIN][BOUNDARY=LEXITEM]
valoko [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG][CASE=NOM][CLIT=KO][BOUNDARY=LEXITEM]
valopa [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][NUM=SG][CASE=NOM][CLIT=PA][BOUNDARY=LEXITEM]
tules [BOUNDARY=LEXITEM][LEMMA='tulla'][POS=VERB][KTN=67][VOICE=ACT][MOOD=IMPV][PRS=SG2][CLIT=S][BOUNDARY=LEXITEM]
eikä [BOUNDARY=LEXITEM][LEMMA='ei'][POS=VERB][SUBCAT=NEG][VOICE=ACT][PRS=SG3][CLIT=KA][BOUNDARY=LEXITEM]
| CLIT | Meaning | Example |
|---|---|---|
| HAN | -hAn (even, also) | valohan (even light) |
| KAAN | -kAAn (not even) | valokaan (not even light) |
| KIN | -kin (also, as well) | valokin (also light) |
| KO | -kO (question) | valoko (light?) |
| PA | -pA (indeed, esp.) | valopa (light indeed) |
| S | -s (moderate) | tules (do come) |
| KA | -kA (negation) | eikä (nor) |
Note
VISK § 126– <http://scripta.kotus.fi/visk/sisallys.php?p=126>, § 131 on combinatorics
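Because clitics can chain, an application usually wants the whole sequence of CLIT values rather than a single one. The following sketch collects them in order. The helper name `clitic_chain` is an assumption, and the second analysis string is a hand-built, hypothetical example of a two-clitic chain (valo + kO + hAn), not verified output of omorfi.

```python
import re

CLIT_RE = re.compile(r"\[CLIT=([A-Z]+)\]")


def clitic_chain(analysis):
    """Return the CLIT values of an analysis in the order they occur."""
    return CLIT_RE.findall(analysis)


single = ("[BOUNDARY=LEXITEM][LEMMA='ei'][POS=VERB][SUBCAT=NEG]"
          "[VOICE=ACT][PRS=SG3][CLIT=KA][BOUNDARY=LEXITEM]")
# Hypothetical analysis, written by hand for illustration only.
chained = ("[BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1]"
           "[NUM=SG][CASE=NOM][CLIT=KO][CLIT=HAN][BOUNDARY=LEXITEM]")

print(clitic_chain(single))
print(clitic_chain(chained))
```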
Many numerals are written in digits or other codified expressions. Even digit sequences inflect and participate in compounding in Finnish.
| SUBCAT | Meaning | Example |
|---|---|---|
| DIGIT | numeral written in digits | 3,141 (3.141), XIV:ttä (of 14th) |
There are several parts of speech in omorfi that do not have any inflection and do not participate in derivation or compounding. The official grammar uses the name particle for all non-inflecting words; here the syntactic and semantic division into conjunctions, interjections and the rest (named particles here and in old grammars) has been retained.
Note
VISK § 792 <http://scripta.kotus.fi/visk/sisallys.php?p=792>
Conjunctions are non-inflecting words that join syntactic structures together. The conjunctions have two subcategories according to the type of syntactic relation they make. The analysis string of a conjunction is:
[POS=CONJUNCTION][SUBCAT]
Note
VISK § 812 <http://scripta.kotus.fi/visk/sisallys.php?p=812>
The conjunctions are divided into two classes depending on whether they act as subordinating or co-ordinating their respective syntactic units, this is marked by SUBCAT values SUBORD and COORD:
kun [BOUNDARY=LEXITEM][LEMMA='kun'][POS=CONJUNCTION][SUBCAT=SUBORD][BOUNDARY=LEXITEM]
ja [BOUNDARY=LEXITEM][LEMMA='ja'][POS=CONJUNCTION][SUBCAT=COORD][BOUNDARY=LEXITEM]
| SUBCAT | Meaning | Examples |
|---|---|---|
| SUBORD | Subordinating | kun (when) |
| COORD | Co-ordinating | ja (and) |
Note
VISK § 816 <http://scripta.kotus.fi/visk/sisallys.php?p=816> (the classification differs, SUBORD is for unifying with other systems)
Interjections are usually characterisations of speech acts, and may often consist of more or less arbitrary series of characters, sometimes onomatopoetic. Also minimal turns in dialogue, mumbling, swearing, and so on are interjections. They always have analysis string:
[POS=INTERJECTION]
Note
VISK § 856 <http://scripta.kotus.fi/visk/sisallys.php?p=856>
Abbreviations are shortened word forms that do not inflect. Most of the abbreviations are written with lowercase letters and end in full stop. Some of the old abbreviations use colon as marker of omission inside the word. The analysis string must be:
[POS=ABBREVIATION]
Particles are leftover part of speech for non-inflected words that didn't find their way elsewhere. The analysis string is always:
[POS=PARTICLE]
Derivation forming is an experimental feature and is not present in all versions and applications using omorfi. The derived forms should be considered guesses at best. The form of derived analysis strings varies depending on the root word, but the typical form is:
[POS][INFLECTIONS...][GUESS][DRV][POS]...
The first POS is the POS of the dictionary word, the second is the POS of the derived form. Currently the following DRV values are formed:
nopeasti [BOUNDARY=LEXITEM][LEMMA='nopea'][POS=ADJECTIVE][KTN=15][CMP=POS][GUESS=DERIVE][DRV=STI][POS=ADVERB][BOUNDARY=LEXITEM]
kutoja [BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F][GUESS=DERIVE][DRV=JA][POS=NOUN][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
valoinen [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][GUESS=DERIVE][DRV=INEN][POS=ADJECTIVE][CMP=POS][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
valotar [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][GUESS=DERIVE][DRV=TAR][POS=NOUN][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
valollinen [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][GUESS=DERIVE][DRV=LLINEN][POS=NOUN][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
valoton [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][GUESS=DERIVE][DRV=TON][POS=NOUN][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
valoitse [BOUNDARY=LEXITEM][LEMMA='valo'][POS=NOUN][KTN=1][GUESS=DERIVE][NUM=PL][DRV=TSE][POS=ADVERB][CASE=PRL][BOUNDARY=LEXITEM]
| DRV | Meaning | Examples |
|---|---|---|
| STI | manner of A | nopeasti (fast) |
| JA | actor of V | kutoja (knitter) |
| INEN | having N | valoinen (lightful) |
| TAR | feminine N | valotar (lightress) |
| LLINEN | owner of N | valollinen (lighted) |
| TON | without N | valoton (lightless) |
| TSE | via N | valoitse (by light) |
| VS | N-ness | valous (lightness) |
For most applications derivations must be removed from the morphological process and added to lexical data source as needed.
Note
VISK § 155– <http://scripta.kotus.fi/visk/sisallys.php?p=155>
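Since a derived analysis carries two POS tags, an application has to decide which one it needs: the POS of the dictionary word or the POS of the derived form. A minimal sketch of extracting both, together with the DRV values, is given here; `derivation_summary` and its result layout are hypothetical and not part of omorfi.

```python
import re

TAG_RE = re.compile(r"\[\s*([A-Z_]+)=([^\]]*?)\s*\]")


def derivation_summary(analysis):
    """Return the root POS, the DRV values and the final POS of an analysis."""
    tags = [(name, value.strip("'")) for name, value in TAG_RE.findall(analysis)]
    pos_values = [value for name, value in tags if name == "POS"]
    if not pos_values:
        raise ValueError("analysis has no POS tag")
    return {
        "root_pos": pos_values[0],     # POS of the dictionary word
        "derivations": [value for name, value in tags if name == "DRV"],
        "final_pos": pos_values[-1],   # POS of the derived form
    }


kutoja = ("[BOUNDARY=LEXITEM][LEMMA='kutoa'][POS=VERB][KTN=52][KAV=F]"
          "[GUESS=DERIVE][DRV=JA][POS=NOUN][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]")
print(derivation_summary(kutoja))
```

For an analysis without derivation the root and final POS coincide and the list of derivations is empty.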
Compounding is a productive morphological process in Finnish. Typically any nominals can be joined to form ad hoc compounds as needed, but there are many restrictions on the word forms allowed in compounds. Productive nominal compounds are always formed by a chain of nominals in the genitive, nominative or a special compound form, followed by a final nominal that carries the inflectional suffixes. The nominals may also be nominalised verb forms.
There are also less productive compounds, where the initial parts of the compound may have forms other than those listed above; these should be added to the lexical data, since they are typically lexicalised. There is also a set of adjective-initial compounds where inflection in standard Finnish is said to agree across all parts of the compound; these cases are few and becoming rarer in general use, so they should be listed as exceptions.
The numeral compounds agree in all parts, except in the nominative, where multiplicands take partitive forms. This complexity is hard-coded into the morphology. In numeral compounds the multipliers must also appear in order of decreasing magnitude.
The table below illustrates possible chains by some examples:
talonmies [BOUNDARY=LEXITEM][LEMMA='talo'][POS=NOUN][KTN=1][NUM=SG][CASE=GEN][BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='mies'][POS=NOUN][KTN=42][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
salaattikastike [BOUNDARY=LEXITEM][LEMMA='salaatti'][POS=NOUN][KTN=5][KAV=C][NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='kastike'][POS=NOUN][KTN=48][KAV=A][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
isänisänisä [BOUNDARY=LEXITEM][LEMMA='isä'][POS=NOUN][KTN=10][NUM=SG][CASE=GEN][BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='isä'][POS=NOUN][KTN=10][NUM=SG][CASE=GEN][BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='isä'][POS=NOUN][KTN=10][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
naislääkäri [BOUNDARY=LEXITEM][LEMMA='nainen'][POS=NOUN][KTN=38][COMPOUND_FORM=S][BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='lääkäri'][POS=NOUN][KTN=6][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
| Compound pattern | Examples |
|---|---|
| N GEN + N | talonmies (house's man = janitor) |
| N NOM + N | salaattikastike (salad dressing) |
| N GEN* N | isänisänisänisän...isä (paternal great great ... grand father) |
| N CMP + N | naislääkäri (« nainen + lääkäri, female doctor) |
| A X + N X | vanhallepojalle (« vanha + poika, old boy = bachelor) |
| NUM X* | kahdeksisadaksikolmeksikymmeneksineljäksi (into 234) |
The productive compounding is typically required to gain any coverage with the analyzer, but it's also endless source of problems with ambiguity. In omorfi the method to deal with compounds combines list of verified compounds with estimate of likelihood of compound in weighted analyzer. The end applications may need to ignore productive compounds or decide threshold for accepted compounds.
Note
VISK § 398- <http://scripta.kotus.fi/visk/sisallys.php?p=398>
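In a compound analysis the parts are separated by [BOUNDARY=COMPOUND] and each part has its own LEMMA, so the chain of lemmas can be recovered by splitting at the boundary tags. The sketch below shows one way to do this; the function name `compound_lemmas` is an assumption for illustration.

```python
import re

LEMMA_RE = re.compile(r"\[LEMMA='([^']*)'\]")
COMPOUND_BOUNDARY_RE = re.compile(r"\[BOUNDARY=COMPOUND\s*\]")


def compound_lemmas(analysis):
    """Return the lemma of each compound part, in order."""
    lemmas = []
    for part in COMPOUND_BOUNDARY_RE.split(analysis):
        match = LEMMA_RE.search(part)
        if match:
            lemmas.append(match.group(1))
    return lemmas


talonmies = ("[BOUNDARY=LEXITEM][LEMMA='talo'][POS=NOUN][KTN=1]"
             "[NUM=SG][CASE=GEN][BOUNDARY=COMPOUND]"
             "[GUESS=COMPOUND][LEMMA='mies'][POS=NOUN][KTN=42]"
             "[NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]")
print(compound_lemmas(talonmies))
```

The number of parts is also a rough indicator of how speculative an analysis is, which an application could use when deciding a threshold for accepted compounds.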
Some analyses include an indication that the word does not have a dictionary-based root form, but contains a root that is generated morphologically. The value of GUESS identifies the process that created the new root form. Currently two processes produce new base forms: compounding and derivation.
| VALUE | Meaning | Example |
|---|---|---|
| COMPOUND | compound | kissakoira |
| DERIVE | derive | kissatar |
For most applications guesses are supposed to be moved into the dictionaries as new base forms if recognised as proper used words.
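An application that receives several analyses for one word form may prefer those without a GUESS tag and fall back to guesses only when nothing else is available. The following is a sketch of that policy, not a description of what omorfi itself does; the function name is hypothetical and the first analysis string is a hand-written example of a lexicalised entry.

```python
def prefer_dictionary_analyses(analyses):
    """Drop guessed analyses if at least one guess-free analysis exists."""
    certain = [a for a in analyses if "[GUESS=" not in a]
    return certain if certain else list(analyses)


# Hypothetical lexicalised analysis, written by hand for illustration.
lexical = ("[BOUNDARY=LEXITEM][LEMMA='talonmies'][POS=NOUN][KTN=42]"
           "[NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]")
guessed = ("[BOUNDARY=LEXITEM][LEMMA='talo'][POS=NOUN][KTN=1]"
           "[NUM=SG][CASE=GEN][BOUNDARY=COMPOUND]"
           "[GUESS=COMPOUND][LEMMA='mies'][POS=NOUN][KTN=42]"
           "[NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]")

print(prefer_dictionary_analyses([guessed, lexical]))
print(prefer_dictionary_analyses([guessed]))
```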
Many lexical sources record notes on style or area of usage with the words. This kind of lexical data may be indicated with an additional STY tag. The existing uses of the style feature classify common misspellings and substandard forms, as well as dialectal, rare and archaic forms:
seitsämän [BOUNDARY=LEXITEM][LEMMA='seitsemän'][STY=NONSTANDARD][POS=NUMERAL][KTN=10][SUBCAT=CARD][NUM=SG][CASE=GEN][BOUNDARY=LEXITEM]
mie [BOUNDARY=LEXITEM][LEMMA='mie'][POS=PRONOUN][SUBCAT=PERSONAL][NUM=SG][CASE=NOM][BOUNDARY=LEXITEM]
| VALUE | Meaning | Example |
|---|---|---|
| NONSTANDARD | non-standard | seitsämän → seitsemän (seven) |
| RARE | rare | |
| DIALECTAL | dialectal | mie (I) |
| ARCHAIC | archaic | |
Not all applications and versions of Omorfi include all of these forms.
These analyses come from traditional dictionaries released by Research Institute of Languages in Finland (RILF). They are usable in few applications so they are retained as analyses, but mostly they only provide slight disambiguation for some words with equal dictionary forms. For specification of these values the original documentation may be available from kotus kaino resource for Nykysuomen sanalista. The KTN takes values from 1 to 49 for nominals and 52 to 78 for verbs. 50 and 51 were originally used to mark up compounds, but these markings were mostly removed even from the source lexical data omorfi was built on, so they have not been restored. The example tables in the indicated site are comprehensive enough not to be reproduced here.
The KTN and KAV data is additional data related to words, and may not be included in all future versions of omorfi.
Note
VISK § 63 <http://scripta.kotus.fi/visk/sisallys.php?p=63>, gives a short introduction to inflection of Finnish
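The KTN ranges given above can be used to tell nominal and verbal inflection classes apart when only the number is available. A small sketch, with a hypothetical function name `ktn_kind`:

```python
def ktn_kind(ktn):
    """Classify a KTN number by the ranges of Nykysuomen sanalista."""
    if 1 <= ktn <= 49:
        return "nominal"
    if ktn in (50, 51):
        # Originally compound markers; not restored in omorfi.
        return "compound"
    if 52 <= ktn <= 78:
        return "verb"
    raise ValueError("KTN out of range: %d" % ktn)


for number in (1, 42, 52, 67):
    print(number, ktn_kind(number))
```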
Most versions of omorfi can read words written in titlecase or uppercase as variants of the regular lowercased words. For some applications this data is necessary and is saved in omor tagset in tag named CASECHANGE. The casing tag has values NONE, UPFIRST and UPALL for retained case, titlecase and uppercase respectively.
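The three CASECHANGE values can be illustrated with a preprocessing function that lowercases a surface form and records what was changed. This is only a sketch of the idea and makes no claim about how omorfi computes the tag internally; the function name `normalise_case` is an assumption.

```python
def normalise_case(surface):
    """Return (form to look up, CASECHANGE value) for a surface form."""
    if len(surface) > 1 and surface.isupper():
        return surface.lower(), "UPALL"      # TALO -> talo
    if surface[:1].isupper() and surface[1:] == surface[1:].lower():
        return surface.lower(), "UPFIRST"    # Talo -> talo
    return surface, "NONE"                   # talo, or mixed case kept as is


for form in ("talo", "Talo", "TALO", "STT:hen"):
    print(form, normalise_case(form))
```

Mixed-case forms such as inflected acronyms are left untouched, so they can still be matched against lexicon entries written in uppercase.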
The main application of omorfi is the morphological analyser: the task of reading text and giving morphological analyses of potential word forms. There are also other applications that use omorfi. The main ones, such as the writer's tools, are distributed along with omorfi. More complex ones, such as morphosyntactic disambiguation using Constraint Grammar, rule-based machine translation using apertium, or the spell-checking library voikko, are separate software packages that depend on omorfi. This chapter gives a brief overview of this software and its relation to omorfi.
Morphological analysis is provided in omorfi as a number of different finite-state automata, which contain different encodings or analysis styles, and different degrees of certainty for morphological analyses. In addition to the default omorfi tag set, encoding models have been added for the needs of external applications, such as CG or apertium. A third encoding, FTC, has been made to facilitate comparison of omorfi analyses with those in the commercial corpora of the Finnish text collection.
Omorfi analyser includes some statistical unigram based as well as some rule based crude disambiguation schemes. This is implemented by simply learning the most common word forms from a corpus and using simple rules to reduce the likelihood of unlikely word forms or compounds. This system is detailed in [LIN09a].
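The general idea of unigram weighting with a simple compound rule can be sketched in a few lines. The code below is an illustration only and does not reproduce the system of [LIN09a]: the use of negative log probabilities as weights, the penalty values and all function names are assumptions.

```python
import math
from collections import Counter


def unigram_weights(tokens):
    """Map each word form to -log of its relative frequency (lower is better)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {form: -math.log(count / total) for form, count in counts.items()}


def analysis_weight(surface, analysis, weights, unseen=20.0, compound_penalty=5.0):
    """Weight of one analysis: word form weight plus a penalty per compound boundary."""
    boundaries = analysis.count("[BOUNDARY=COMPOUND")
    return weights.get(surface, unseen) + compound_penalty * boundaries


corpus = ["talo", "talo", "valo", "kissa"]  # toy corpus
weights = unigram_weights(corpus)
print(weights)
```

With such weights the analyses of a word form can be sorted so that frequent forms and analyses with fewer compound boundaries come first.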
The writer's tools are word-error spell-checking and correction, and automatic hyphenation. The automata necessary for this basic functionality are provided with omorfi. For basic spell-checking there is a modified version of omorfi's dictionary. Spelling correction is provided by an error model that is applied in conjunction with the spell-checking dictionary. This system is detailed in [PIR10a]. The practical implementation of this spell-checking system is included in the spell-checking library voikko.
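The combination of a dictionary and an error model can be illustrated without automata. In the sketch below the dictionary is a plain Python set and the error model is a single edit (deletion, transposition, replacement or insertion); this is a simplification for illustration and not the finite-state implementation described in [PIR10a]. All names are hypothetical.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def edits1(word):
    """All strings one edit away from word: the toy error model."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | swaps | replaces | inserts


def suggest(word, dictionary):
    """Return the word itself if it is known, else known words one edit away."""
    if word in dictionary:
        return [word]
    return sorted(edits1(word) & dictionary)


dictionary = {"seitsemän", "valo", "talo"}  # toy dictionary
print(suggest("seitsämän", dictionary))
```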
Warning
The CG ruleset and parsers are not included in omorfi distribution
The morphosyntactic disambiguating system of Finnish uses Constraint Grammar, originated from [KAR90]. The omorfi includes CG compatible analyser, converted automatically from the main analyser. The Constraint Grammar works by extending morphological analyses with syntactic readings and removing illegal readings. Example usage of CG grammar can be found from apertium's Finnish machine translation pairs.
Warning
Apertium machine translation is not included in omorfi distribution
The apertium machine translation system contains some Finnish translation pairs which use omorfi for basic morphological analysis. The analyses are decorated with Constraint Grammar, transferred to/from the target/source language, and generated. The omorfi package contains an apertium-compatible analyser, which may be used for these purposes.
| [KOS83] | Kimmo Koskenniemi (1983), Two-level morphology (doctoral thesis) |
| [KAR90] | Fred Karlsson (1990) Constraint grammar as a framework for parsing running text in 13th international conference on Computational linguistics |
| [PIR08] | Tommi Pirinen (2008), Automatic morphological finite-state analyzer of Finnish language using open source tools (Master's thesis) |
| [LIN09a] | Krister Lindén, Tommi Pirinen (2009), Weighting Finite-State morphological analysers using HFST tools |
| [PIR10a] | Tommi A Pirinen, Krister Linden (2010), Finite-State Spell-Checking with Weighted Language and Error Models in LREC 2010 Saltmil workshop |
| [VISK10] | Auli Hakulinen, Maria Vilkuna, Riitta Korhonen, Vesa Koivisto, Tarja Riitta Heinonen ja Irja Alho (2010): Iso suomen kielioppi, Verkkoversio |