Morpheus is the morphological analysis engine that underlies all the Perseus language tools. It is described in Crane 1991 (see note 1). Support for Greek is detailed and robust. Support for Latin is reasonably complete, though some archaic or late forms may be missing, and the vocabulary is not as full as in Greek. Support for Italian is rudimentary.
The component parts of Morpheus are:
Code and data for Morpheus are stored in a single directory tree, conventionally ..../morph. The tree can be anywhere, though it is often placed in the sgml tree as a sibling of ..../sgml/texts and ..../sgml/xml. This tree has the following structure:
morph
    bin (executables)
    src (several subdirectories; see below)
    stemlib
        Greek
        Latin
        Italian
In what follows, these components will be described in order of increasing complexity.
The primary user interface to Morpheus itself is cruncher (see note 2).
In simplest terms, it reads words from standard input and writes their possible morphological analyses to
standard output. Here is an example; user input is in green and
program replies in brown.
If you enter a word that Morpheus does not recognize, it will simply echo it back to you. This can
happen when the word is mis-spelled or is not correct Greek, as in the example; it can also happen
with legitimate words or forms that are not known to Morpheus. (This will be very rare in Greek,
will happen occasionally in classical Latin, and will be fairly common in Italian.)
The following are the commonly used command-line switches.
Analysis: the end-user view
$ cruncher
ai)/louros
<NL>N ai)/louros masc/fem nom sg os_ou</NL>
ai)/louroz
ai)/louroz
^D
$
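The session above can also be read mechanically. The following is a minimal Python sketch, not part of Morpheus: it assumes that each input word produces exactly one reply line and that every analysis is wrapped in an <NL> element, as in the example. The name read_replies is hypothetical.

```python
import re

# Assumption: every analysis is wrapped in <NL>...</NL>, as in the session above.
NL_ELEMENT = re.compile(r"<NL>(.*?)</NL>")

def read_replies(inputs, replies):
    """Pair each input word with the analyses found in its reply line.

    Assumes one reply line per input word. A word that is simply echoed
    back (no <NL> element) gets an empty list: it was not recognized.
    """
    results = {}
    for word, reply in zip(inputs, replies):
        results[word] = NL_ELEMENT.findall(reply)
    return results

# The two exchanges from the example session.
session = read_replies(
    ["ai)/louros", "ai)/louroz"],
    ["<NL>N ai)/louros masc/fem nom sg os_ou</NL>", "ai)/louroz"],
)
```

Here session maps the recognized word to a one-element list and the mis-spelled word to an empty list.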
The default language is Greek, and Greek must be entered in beta-code.3
For Italian, use the beta-code convention for accents, writing à as a\.
Switch | Use |
---|---|
-L | sets language to Latin |
-I | sets language to Italian |
-S | turn off Strict case. For Greek, allows words with an initial capital to be recognized, so that for example the personification *tu/xhs at Soph. OT 1080 is recognized as the genitive singular of tu/xh. For languages in the Roman alphabet, allows words with initial capital or in all capitals to be recognized. |
-n | ignore accents. Allows words with no accents or breathings, or with incorrect ones, to be recognized. |
The following other switches are supported.
Switch | Use |
---|---|
-d | database format. This switch changes the output from "Perseus format" to "database format." Output appears in a series of tagged fields. |
-e | ending index. Instead of showing the analysis in readable form, this switch gives the indices of the tense, mood, case, number, and so on (as appropriate) in the internal tables. |
-k | keep beta-code. When "Perseus format" is enabled (the default), this switch does nothing. When "Perseus format" is off, Greek output is normally converted to the old Greek Keys encoding. This switch disables that conversion so that Greek output stays in beta-code. Note that the handling of this switch was not updated when Latin was implemented, so when "Perseus format" is disabled, Latin and Italian will also be converted to this Greek font encoding. Hence if you are disabling Perseus format in those languages, you should also set the -k switch. |
-l | show lemma. When this switch is set, instead of printing the entire analysis, cruncher will only show the lemma or headword from which the given form is made. |
-P | turn off Perseus format. Output will be in the form $feminam& is^M &from$ femina^M $fe_minam^M [&stem $fe_min-& ]^M & a_ae fem acc sg^M Note the returns, without line feeds, between the fields. |
-V | analyze Verbs only. When this switch is set, words that are not verbs will not be recognized, and words that could be analyzed as either verb forms or noun forms will be treated as certainly verbs. |
The following switches, which appear in the main routine, do nothing.
Switch | Use |
---|---|
-a | sets the SHOW_ANAL flag, which is never checked |
-b | sets the BUFFER_ANALS flag, which is no longer checked |
-c | sets the CHECK_PREVERB flag, which is no longer checked |
-i | sets the SHOW_FULL_INFO flag, which is never checked |
-m | sets the SHOW_MISSES flag, which is never checked |
-p | sets the PARSE_FORMAT flag, which is unconditionally turned on later anyway |
-s | sets the DBASESHORT flag, which is checked only in a routine that is never called |
-x | sets the LEXICON_OUTPUT flag, which is checked only in a routine that is never called |
Morpheus recognizes inflected words by comparing the given forms to known stems and endings. Stems are defined to belong to particular inflectional classes, for example first-declension nouns or second-conjugation verbs. Making a new word available to Morpheus involves adding it to the appropriate stems files.
Stems files are in the stemsrc directory under the appropriate language in the Morpheus tree. For example, stems for Latin are in ..../morph/stemlib/Latin/stemsrc. Stems for verbs and nouns are filed separately, because they are compiled by different routines. Indeclinable words, by convention, go into the nouns files. Adjectives are not distinguished from nouns.
The existing stem files for each supported language include one each for irregular nouns and verbs, one each for nouns and verbs extracted from the major dictionary, and one or more additional files for words that are not in the dictionary. These additional files are typically used for words appearing in texts outside the classical period (for example in Byzantine Greek or Neo-Latin) or for proper names. Most such words are nouns, but there is no reason there could not be additional verb files as well. It is convenient for maintenance to use a separate stem file for each new group of unusual words. For example, in Latin, nom.01 contains common quasi-regular words, nom.02 mostly contains words from Plautus, plus the larger numbers, nom.03 mostly contains words from Glass's biography of George Washington, and nom.04 contains words from the Vulgate.
The format of a stem file entry is like this:
:le:lemma
:xx:stem class other
Lines in the file that do not begin with a keyword enclosed in colons are ignored. Each line begins with a keyword identifying the type of word. The first line must have the :le: keyword, for the lemma or headword. The next line has a "part of speech" keyword. There may be more than one "part of speech" line for a given lemma. In each "part of speech" line, the first field is the stem. It must be followed by a tab. The rest of the line contains codes for inflectional class and gender, separated by spaces.
The lemma is given in its ordinary form. Vowel quantities are marked only in the stem field, not the lemma. Long vowels are marked by a following underscore, short vowels by a following up-arrow. It is not necessary to mark the quantities of unambiguous Greek letters (eta, epsilon, omega, omicron), vowels whose quantity is clear from the accent, or vowels in closed syllables; vowels otherwise not marked are considered short. In Greek, the stem field has no accent, though it must have a breathing if the word begins with a vowel.
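To make the quantity notation concrete, here is a small Python sketch that separates the marks from the letters of a stem. It is only an illustration of the convention just described (underscore for long, up-arrow for short); the function name split_quantities is hypothetical and nothing like it is claimed to exist in the Morpheus source.

```python
def split_quantities(stem):
    """Return (plain_stem, marks) for a stem field such as 'fe_mi^n'.

    marks maps the index of a vowel in the plain stem to 'long' or 'short'.
    A mark with no preceding letter is kept as an ordinary character.
    """
    plain = []
    marks = {}
    for ch in stem:
        if ch == "_" and plain:
            marks[len(plain) - 1] = "long"    # underscore follows a long vowel
        elif ch == "^" and plain:
            marks[len(plain) - 1] = "short"   # up-arrow follows a short vowel
        else:
            plain.append(ch)
    return "".join(plain), marks
```

For the Latin stem fe_mi^n this yields the plain stem femin, with the e marked long and the i marked short.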
Here are some examples.
Latin nouns:
:le:femina
:no:fe_mi^n a_ae fem
:le:amor
:no:am or_oLris masc
:le:Americanus
:aj:America_n us_a_um
Latin verbs:
:le:quiesco
:vs:quiesc conj3
:vs:quie_v perfstem
:vs:quie_t pp4
:le:creo
:de:cre are_vb
Greek nouns:
:le:ai)/louros
:no:ai)elour os_ou masc fem
:no:ai)lour os_ou masc fem
:le:deino/s
:aj:dein os_h_on suff_acc
Greek verbs:
:le:nomi/zw
:de:nom izw
:le:gra/fw
:vs:gra^f aor2_pass @ fut
The following are the keywords recognized in stems files.
keyword | indicates |
---|---|
:le: | lemma or headword |
:wd: | indeclinable form (preposition, adverb, interjection, etc.) or unanalyzed irregular form |
:aj: | adjective; must have an inflectional class |
:no: | noun; must have an inflectional class and a gender |
:vb: | verb form; for unanalyzed irregular forms |
:de: | derivable verb; must have an inflectional class |
:vs: | verb stem, one of the principal parts; must have an inflectional class |
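The entry format and keywords can be illustrated with a short reader. This is a sketch in Python, not the Morpheus compiler: it assumes that the stem is the first whitespace-delimited field after the keyword and that everything after it is a list of codes. The name parse_stems is hypothetical.

```python
import re

# A stems-file line of interest starts with a keyword enclosed in colons.
KEYWORD_LINE = re.compile(r"^:([a-z]+):(.*)$")

def parse_stems(text):
    """Group the lines of a stems file into one record per :le: lemma."""
    entries = []
    for line in text.splitlines():
        match = KEYWORD_LINE.match(line)
        if not match:
            continue  # lines without a leading :keyword: are ignored
        keyword, rest = match.groups()
        if keyword == "le":
            entries.append({"lemma": rest.strip(), "stems": []})
        elif entries:
            fields = rest.split()
            if fields:
                # First field is the stem; the rest are class and gender codes.
                entries[-1]["stems"].append(
                    {"keyword": keyword, "stem": fields[0], "codes": fields[1:]}
                )
    return entries
```

Applied to the quiesco entry above, this gives one record whose stems list has three :vs: items, with codes conj3, perfstem, and pp4.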
The inflectional class codes are different for each language. They are the base names of the files in ..../morph/stemlib/language/endtables/source. In general the easiest way to determine the correct class codes is to look at a similar word -- another noun of the same declension, for example. Gender codes are masc, fem, neut, masc/fem, masc/neut, the latter two used when endings for the two genders are the same. Use "masc fem" for a noun that can be of either gender. Other codes, for number, person, tense, mood, voice, or case, usually only appear in the stems files for irregular forms; these codes are listed under "Adding and changing endings."
In general the class code for a noun declension will look like the nominative and genitive, for example a_ae for the Latin first declension. For an adjective, it will look like the three nominative forms, for example os_h_on for Greek first-and-second declension adjectives. Verbs are a bit more complex since the several stems usually need to be specified separately, except for highly predictable groups like the Latin first conjugation.
Most of the new words that will need to be added are regular, because virtually all of the irregular
words are already in the stems files (even for Italian), since they are the most common words in the
language.
Once you have added your words, you need to compile the database. In the next
directory up from the stems files, that is ..../morph/stemlib/language/stemlib, you
will find a make file; simply make all. Note several assumptions in these make files:
The compilation utilities, like cruncher itself, rely on the MORPHLIB environment
variable. This must be set to ..../morph/stemlib, wherever that is on your system. All of the
code is in ..../morph/bin, which must be on the path.
The compilation will produce various messages, most of which can be ignored. True errors
will be reported by make in the usual way. Here are examples of the most common messages:
In general if you mis-type inflectional class information in a stems file, you will not get a
message from the compilation process. You should therefore check your new words once your compilation has
finished. Do this by running cruncher and entering several forms of the new words. If they
are not recognized, then you have mis-typed something in the stems file.
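The check can be scripted. The following Python sketch is one possible approach, under these assumptions: cruncher is on the path, MORPHLIB is either already set or passed in, and an unrecognized word is echoed back unchanged on a line of its own, as described for interactive use. The function names are hypothetical; only the cruncher program, its switches, and MORPHLIB come from this document.

```python
import os
import subprocess

def run_cruncher(words, switches=(), morphlib=None):
    """Feed words to cruncher on standard input and return its reply lines.

    switches is a sequence such as ("-L",) for Latin. morphlib, if given,
    overrides the MORPHLIB environment variable for this run only.
    """
    env = dict(os.environ)
    if morphlib is not None:
        env["MORPHLIB"] = morphlib
    result = subprocess.run(
        ["cruncher", *switches],
        input="\n".join(words) + "\n",
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    return result.stdout.splitlines()

def unrecognized(words, reply_lines):
    """Return the words that appear echoed back verbatim in the replies."""
    echoed = {line.strip() for line in reply_lines}
    return [word for word in words if word in echoed]
```

A typical use would be unrecognized(forms, run_cruncher(forms, ("-L",))) after compiling; an empty list suggests that every form was analyzed.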
Although the main morphological classes for the supported languages are all defined,
it is occasionally necessary to correct a problem, or to add a dialect form. Endings
are defined in the ..../morph/stemlib/language/endtables directory
and its subdirectories. Two subdirectories, basics and source, contain
files that can be edited; the others, ascii, indices, and out,
contain the compiled representations of the input files.
The files in ..../endtables/source define the inflectional classes. The names
of these files are the inflectional class codes that appear in the stems files. For example,
the endings for Latin fifth-declension nouns are defined in es_ei.end and those nouns
are listed in the stems files like this:
Here is the content of es_ei.end:
The codes for genders are as in the stems files, masc, fem, neut, masc/fem, masc/neut.
Number codes are sg, dual, pl. Cases are nom, gen, dat, acc, abl, voc.
For verbs, persons are 1st, 2nd, 3rd, numbers are as for nouns, and voices are act, mid, pass, mp.
Tenses are pres, imperf, fut, aor, perf, plup, futperf. Moods are ind, subj, opt, imperat, inf, part,
supine, gerundive (there is no code for the gerund as distinct from the gerundive).
Other modifying codes include early, poetic, attic, doric, and so on. All of these codes
are defined in morphkeys.h in the src/morphlib subdirectory (see below).
The endings file for the Latin fifth declension is not typical. More often, an inflectional
class is defined by reference to another class. For example, participles use the endings of adjectives,
and several different verb tenses and moods use the same groups of endings. To express these relationships,
Morpheus defines basic endings and then references them in inflectional class files. For example, the
Greek noun class c_ktos (as in anax) is defined like this:
The @decl3 reference is to a file in the ..../endtables/basics directory. That
directory contains groups of endings that can be re-used. The format of the "basics" files is the
same as that of the ordinary inflectional class endings files, and their names are also *.end.
To use a basic endings group in an inflectional class file, put its name, preceded by an at sign, in
the place of the actual endings -- or even parts of endings, as in the example above.
There is a further way to relate different inflectional classes, using the derivs directory.
Files in ..../derivs/source pull together information about stem formation and inflectional classes.
They are only used for verb classes. For example, Latin fourth-conjugation verbs are defined in
ire_vb.deriv as follows:
There is one further complication to endings files. In the rule_files directory are two files
that determine whether inflectional class files apply to verbs, nouns, or adjectives. The derivtypes.table
file must list every file from the derivs/source directory. The stemtypes.table file
must list every file from the endtables/source directory. If you add a new inflectional class,
you will also need to declare it here. In each of these tables, the second field is a serial number and the
third describes what kind of object is being declared.
Once you have created or modified endings files, you can add or update stems entries to use them;
you do not need to compile the database first. But once you're finished with all the modifications,
to endings and stems, then you must compile the database, as described above.
Source code for Morpheus, written in C (mostly, though not entirely, ANSI C), is in the
..../morph/src directory tree. There is a make file at top level in the src
directory which controls compilation of the six libraries and twenty-six main programs that
make up Morpheus. Those programs are installed into ..../morph/bin.
The main routine for cruncher is ..../morph/src/anal/stdiomorph.c. The
actual work happens in subroutine checkstring and its subsidiaries, all in file
..../morph/src/anal/checkstring.c. Most of the significant modifications and bug fixes
over the past three years have been in this file as well.
The executables used in compiling the database (see above) are
Most header files are in ..../morph/src/includes, though some are in the code directories. Directories
..../morph/src/greeklib and ..../morph/src/morphlib contain utility routines which get linked into
object libraries. Each of the code directories anal, gener, gkdict, and gkends also has an
object library for its subroutines. Executables are statically linked against all these object libraries.
Other directories in the source tree contain related code which is not actually part of Morpheus. Directory
auto contains code for character encoding conversions. Directory retr has a search engine for
the TLG CD. Directory scan has initial experiments toward scansion. Directory tlg has a
one-file TLG search engine; a comment at the head of the file calls it "unbelievably ugly and impossible to figure out."
Finally, directory play is a space for toy routines.
The main loop of cruncher is quite simple: it reads a string from stdin, drops white space, and
passes the trimmed string to checkstring. It then displays the result on the output file, typically stdout.
This continues until end of file on input.
The real work is driven by checkstring. This routine comes in five layers: checkstring calls
checkstring1, which calls checkstring2, which calls checkstring3, which calls checkstring4.
In each case, if the next lower layer does not recognize the word, we adjust -- for crasis, enclisis, dialect forms, or the like --
and try again. The innermost layer, checkstring4, calls checkword (in checkword.c), which ultimately
calls the routines in ..../morph/src/gkdict/dictio.c to look up the word in the actual tables. In the case of a
simple word, such as ego (in either language), this is all we need to do. For inflected words, checknom and
checkverb peel off letters one at a time from the beginning of the word until what remains is recognizable as an ending. If the peeled-away
part is recognizable as a stem (or a stem with a prefix), then this is a possible analysis.
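The stem-and-ending idea can be sketched in a few lines of Python. This is a simplified illustration, not the logic of checknom or checkverb: it ignores vowel quantity, prefixes, and accents, and the two toy tables hold a single stem and a single ending taken from the femina examples in this document. All names are hypothetical.

```python
# Toy tables: stem -> [(lemma, class)], ending -> [(class, features)].
STEMS = {"femin": [("femina", "a_ae")]}
ENDINGS = {"am": [("a_ae", "fem acc sg")]}

def candidate_analyses(word, stems=STEMS, endings=ENDINGS):
    """Try every split of word into stem + ending and keep the valid ones.

    A split is valid when the stem is known, the ending is known, and both
    belong to the same inflectional class.
    """
    found = []
    for cut in range(1, len(word) + 1):
        stem, ending = word[:cut], word[cut:]
        for lemma, stem_class in stems.get(stem, []):
            for ending_class, features in endings.get(ending, []):
                if stem_class == ending_class:
                    found.append((lemma, stem, ending, features))
    return found
```

With these tables, feminam splits as femin + am and is reported as a form of femina, while a string with no matching split yields an empty list.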
If checkword does not find any analyses, then checkstring4 looks for spelling variations: cun-
for sun- or -ss- for -tt- in Greek. If checkstring4 does not find any analyses, then
checkstring3 looks at capitalization, elision or prodelision, attached enclitics (Greek -per,
Italian pronouns, Latin -que, -ve, -ne), and alternation between i and j or u and v.
If checkstring3 does not find any analyses, checkstring2 tries various Greek dialects. If checkstring2
does not find any analyses, checkstring1 looks at initial prodelision in Greek. And if checkstring1 does
not find any analyses, checkstring assumes the word is simply not recognized.
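The layering can be pictured as a chain of wrappers, each of which asks the layer below and retries with an adjusted word only on failure. The Python sketch below shows that pattern with two layers and a toy lexicon. It is not a transcription of checkstring.c: the adjustment functions, the lexicon entries, and the direction of the -tt-/-ss- rewrite are all assumptions made for illustration.

```python
TOY_LEXICON = {
    "arma": ["arma (toy entry)"],
    "virum": ["vir (toy entry)"],
    "qa/lassa": ["qa/lassa (toy entry)"],
}

def checkword(word):
    """Innermost lookup: exact match against the toy lexicon."""
    return TOY_LEXICON.get(word, [])

def make_layer(inner, adjustments):
    """Wrap inner so that failed lookups are retried with adjusted words."""
    def layer(word):
        found = inner(word)
        if found:
            return found
        for adjust in adjustments:
            candidate = adjust(word)
            if candidate != word:        # only retry if something changed
                found = inner(candidate)
                if found:
                    return found
        return []                        # this layer gives up
    return layer

def tt_to_ss(word):                      # spelling variation (assumed direction)
    return word.replace("tt", "ss")

def lower_initial(word):                 # capitalization
    return word[:1].lower() + word[1:]

def strip_que(word):                     # attached Latin enclitic
    return word[:-3] if word.endswith("que") else word

spelling_layer = make_layer(checkword, [tt_to_ss])
outer_layer = make_layer(spelling_layer, [lower_initial, strip_que])
```

Each adjustment is tried on its own against the original word, so this sketch does not combine two adjustments in a single layer; it is meant only to show the retry-on-failure structure.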
The main data structure behind all this is the gk_word structure, accessed throughout checkstring
by the pointer Gkword. Structure gk_word is defined in ..../morph/src/includes/gkstring.h.
It includes character buffers for the original word and the working form of the word (as adjusted for spelling, dialect,
and so on). It also includes flags for various options, including those that can be set on the cruncher
command line.
The gk_word structure is not manipulated directly but via
routines like set_workword and set_prntflags. That
is, although the code is written in C and long pre-dates C++, it uses
data hiding principles similar to those of object-oriented languages.
Maintainers are strongly urged to respect this design.
Finally, it should be noted that the earliest stages of development took place
without the use of a version management system. Later, when Perseus adopted such
a system, it took a while for everyone to become comfortable with its use. As a result,
there are many commented-out sections of code (rarely necessary when it is easy to
inspect older versions), and many un-informative log messages ("Nightly backup" is common),
making it hard to recover the early history of the code. But the last three years' work
should be reasonably well accounted for.
Most Perseus users see Morpheus only in the context of the Word Study Tool links on Greek, Latin,
and Italian words. On text pages, these links are created by routine morph_links
(in ..../cgi-bin/IncPerl/FilterText.pm); on Lookup Tool pages, they come from the
independent program ..../cgi-bin/Support/umorphck. Each of these routines works on the
generated HTML page just before it is returned to the client browser. Greek is identified by
<G> elements, Latin by <L>, and Italian by <IT>. These elements are inserted by the
transformation routines whenever an element has a suitable lang attribute. The
morphology linking routines must tokenize the stream into words, skipping over embedded
HTML elements, then insert links to morphindex. So that users are not annoyed
by links that do not produce results, morph_links and umorphck use a
cache of known forms to determine which words should be linked. Mis-spelled words
and words that are not known to Morpheus will not receive Word Study links.
The cache is created by compilation routines in the XML build directory ..../sgml/xml
and stored as ..../cgi-bin/DBs/language/mdb.4 The
langinstall target in the XML Makefile handles building this cache for each language and
copying it to the correct place. At each build, we make ..../sgml/xml/morph/language.words,
a list of all the words of the given language in the system that have not been seen before. At the
same time, we merge the previous language.words file with language.words.old, thus
updating the list of previously-seen words. We then run cruncher over any words in the
new language.words file. Those for which cruncher finds analyses are added to the cache;
those for which it does not are listed in ..../sgml/xml/morph/language.failed. It is convenient
to delete all of the files in ..../sgml/xml/morph/, or all the files for one language, whenever
you have made significant changes to the Morpheus stems or endings tables, because those changes will not
be reflected in the cache otherwise. That is, if you have added a stem, but its forms are already in the
already-seen-words list, those forms will not be re-analyzed and will therefore not have links in the
on-line system. To force re-analysis, then, delete the files in ..../sgml/xml/morph/.
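The reason stale entries persist is easier to see in miniature. The Python sketch below models the build step with sets and a dictionary instead of the language.words files; the function name update_cache and the analyze callback (standing in for a cruncher run) are hypothetical.

```python
def update_cache(corpus_words, seen_before, cache, analyze):
    """One build: analyze only words not seen in any earlier build.

    cache is updated in place with the words that received analyses.
    Returns (seen_now, failed), where failed lists the new words for
    which analyze returned nothing.
    """
    new_words = sorted(set(corpus_words) - set(seen_before))
    failed = []
    for word in new_words:
        analyses = analyze(word)
        if analyses:
            cache[word] = analyses
        else:
            failed.append(word)
    seen_now = set(seen_before) | set(new_words)
    return seen_now, failed
```

If a word failed in one build and its stem is added afterwards, the next build skips it because it is already in the seen set. Passing an empty seen set, which corresponds to deleting the files, forces re-analysis.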
Other compiled files include the wmdb database, the freqs database, and the inflex
database. The wmdb database (..../cgi-bin/DBs/language/wmdb) gives, for each analyzed
form, the list of headwords it might come from, with weights. For example, Latin facies could be
either a form of the noun facies or a form of the verb facio, so it is given a weight of 1/2
for each of those headwords. (That it could, in fact, be any of three forms of the noun is not relevant here,
only that it could come from either of two words.) This database is created from the output of cruncher
by program ..../sgml/xml/weightmorph. It is used by the Lookup Tool, to relate user input to headwords
for lemmatized searching, and, during compilation, by lemsens, to create the lemmatized sentence files.
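The facies example suggests a simple weighting rule, sketched here in Python. The equal split among distinct headwords is an assumption generalized from that one two-way example; the actual weightmorph program may do more. The function name is hypothetical.

```python
from fractions import Fraction

def headword_weights(headwords):
    """Give each distinct headword an equal share of the form's weight.

    headwords holds one entry per analysis of the form, so the same
    headword may appear several times; repeats do not raise its weight.
    """
    distinct = sorted(set(headwords))
    return {head: Fraction(1, len(distinct)) for head in distinct}
```

For facies, with three noun analyses and one verb analysis, the headwords facies and facio each receive 1/2.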
The freqs database contains the number of occurrences of each word in each corpus. Keys
look like facies#perseus,author,Plautus and values look like 50 23 36.5. That is, the
key is a headword followed by the name of a corpus (whose official name might be Perseus:corpus:perseus,author,Plautus;
see corpora.xml), and the value is the maximum, minimum, and weighted occurrences of this word
in this corpus. This database is also stored in the language database directory. It is used by the Word Study Tool,
the lexicon display routines, and the frequency tool. During compilation, it is used by catalog and
collect_coll, the routines that underlie the Vocabulary Tool.
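A freqs record can be unpacked as follows. This Python sketch works on the key and value strings exactly as shown above; it assumes that the maximum and minimum are whole numbers and says nothing about how the database file itself is opened. The function name is hypothetical.

```python
def parse_freq_record(key, value):
    """Split a freqs key and value into named fields.

    The key is 'headword#corpus'; the value is 'max min weighted'.
    """
    headword, _, corpus = key.partition("#")
    maximum, minimum, weighted = value.split()
    return {
        "headword": headword,
        "corpus": corpus,
        "max": int(maximum),
        "min": int(minimum),
        "weighted": float(weighted),
    }
```

For the example record, the headword is facies, the corpus is perseus,author,Plautus, and the three counts are 50, 23, and 36.5.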
The inflex database gives, for each headword, a list of all the inflected forms of that word
attested in the texts. It is currently used only by the frequency tool, to verify that the word of interest is
actually a headword; it used to be used by psearch, the full-text search routine that preceded the
unified Lookup Tool.
In the compilation process, cruncher is only run once for each language, in the step that
creates the known-words cache. In normal operation of the run-time system, it should not be run at all,
though it can be called by morphindex if it is run interactively and the user enters a form that
is not in the known-words cache. None of the other Morpheus programs is run other than during compilation
of the Morpheus database itself.
Notes
1. See the bibliography of Perseus publications.
2. On a standard Perseus development system, this program will be in your path
and the necessary environment variables will be set. For more on this see the instructions
on compiling the database.
MorphFopen: could not open [/data/sgml/morph/stemlib/Latin/endtables/source/or_uris.end]
This indicates that there is a reference to inflectional class or_uris somewhere in the definitions
of endings, but no actual definition for its endings. The stray reference may be in
an ordinary endings file (in directory ..../morph/stemlib/language/endtables/source),
a basic endings file (..../morph/stemlib/language/endtables/basics),
a derivation file (..../morph/stemlib/language/derivs/source),
or a rules file (..../morph/stemlib/language/rule_files). If you intend to use this inflectional
class, you will need to create its endings file.
If you see this message, you will also see
could not open [or_uris.end] or [endtables/source/or_uris.end] and, from indendtables,
MorphFopen: could not open [/data/sgml/morph/stemlib/Latin/endtables/out/or_uris.out]
endtables/ascii/a_ae.asc
This is a progress message.
stype 14000
stype [14000]
output file:endtables/indices/nendind
This is a success message indicating the output file the program has created.
1000) [2quamquam :quisquam:indef:fem:acc:sg]
This is a progress message.
out of qsort
done with i=46975, 0
about to index [steminds/nomind]
have just indexed [steminds/nomind]
bufsiz 5631748 bytes
allocated 5631748 bytes successfully!
stemcount 46975
This is a progress message.
processing 5000: Bacchylid :Bacchylides:es_is:masc
This is a progress message.
rval 0 stembuf [br] global [] deriv [o_stem] tk [vn,-mm,h_hs]
This indicates that no verb conjugation information could be deduced for the partial stem br.
compiling deriv [ire_vb]
derivs/ascii/ire_vb.asc
This is a progress message.
[reg_conj] not a regular conj [1000003] [2000000]
This indicates that the given verb derivation rule (in ..../morph/stemlib/language/rule_files/derivtypes.table) is
not flagged as a regular derivation.
output file:derivs/indices/derivind
This is a successful completion message.
Adding and changing endings
:le:facies
:no:fa^ci^ es_ei fem
e_s masc fem nom sg
ei_ gen sg
ei_ dat sg
em masc fem acc sg
e_ abl sg
e_s masc fem nom pl
e_rum gen pl
e_bus dat pl
e_s masc fem acc pl
e_bus abl pl
e_ dat sg early poetic
e_ gen sg early poetic
In this file, blank lines are ignored. Non-blank lines have two fields, separated by a tab.
The first field is the ending and the second tells where it is used. For example, the first
line of the file says that e_s (that is, -es with a long e) is the ending
for masculine and feminine nominative singular. The gender could in fact have been omitted,
as it is for other cases, since all fifth-declension nouns have the same endings regardless
of gender. (Moreover, every noun in this declension is feminine except dies and its
compounds.) Long vowels are marked as in the stems files, with a following underscore. Short
vowels are not marked.
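The two-field layout is simple to read programmatically. The following Python sketch follows the description above (blank lines ignored, fields separated by a tab); the function name parse_endings is hypothetical and this is not the Morpheus endings compiler.

```python
def parse_endings(text):
    """Return a list of (ending, usage_codes) pairs from an endings file."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are ignored
        # First field is the ending, second says where it is used.
        ending, _, usage = line.partition("\t")
        rows.append((ending.strip(), usage.split()))
    return rows
```

A line such as e_s, a tab, then masc fem nom sg becomes the pair ("e_s", ["masc", "fem", "nom", "sg"]).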
c c_ktos masc fem nom voc sg
* c_ktos neut nom voc acc sg
kt@decl3 c_ktos
This says that masculine and feminine nouns of this class have their nominatives ending in c,
neuters have simply the stem for the nominative, and the remaining cases end in -kt- plus the
appropriate third-declension ending.
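One way to picture the @ reference is as a textual expansion: the part before the at sign is prefixed to every ending in the named basics group. The Python sketch below shows that reading. It is an interpretation, not the Morpheus implementation, and the BASICS table is an illustrative three-ending subset, not the contents of the real decl3 file.

```python
# Illustrative subset only; the real basics file defines many more endings.
BASICS = {"decl3": [("os", "gen sg"), ("i", "dat sg"), ("a", "acc sg")]}

def expand(ending, usage, basics=BASICS):
    """Expand an ending field such as 'kt@decl3' into concrete endings.

    An ending without an at sign is returned unchanged as a single row.
    """
    prefix, at_sign, name = ending.partition("@")
    if not at_sign:
        return [(ending, usage)]
    return [
        (prefix + tail, (usage + " " + codes).strip())
        for tail, codes in basics[name]
    ]
```

Under this reading, the line kt@decl3 with usage c_ktos expands to ktos, kti, and kta, each carrying the class name and the codes from the basics group.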
* conj4
* ivperf
i_ perfstem
i_t pp4
Here the second field contains references to basics files. This file says that this
class of words takes the endings of conj4 and of ivperf, and that the perfect
stem is formed by adding long i and the fourth principal part by adding -it with a long i.
Verbs can then be declared in the stems files to be of this class, for example:
:le:munio
:de:mun ire_vb
An introduction to the code
There are other executable routines (see the makefiles in the various directories), but they are not currently used.
What you can do with the results