A phonology in an OLD application represents the relation between the underlying phonemic shape of a word and its surface phonetic realization. This relation is encoded as a finite state transducer (FST).
Multiple phonologies can be created within a single OLD application. This permits, for example, the specification of distinct phonologies for different dialects or speakers or simply for analytical experimentation.
A finite state transducer (FST) is a computational formalism that can be thought of as representing the relation between two languages, where "language" is understood in the formal, computational sense as a set of strings. FSTs can be used to represent, among other things, the phonology of a language, i.e., the relationship between the underlying phonemic shape of a word and its surface realization.
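To make the idea of a relation between two sets of strings concrete, here is a minimal Python sketch. It lists a relation extensionally as (underlying, surface) pairs, which a real FST never does: an FST encodes the relation compactly as states and transitions. The function names apply_down and apply_up are hypothetical helpers named after the foma commands of the same name, and the pairs are taken from the test comments of the Blackfoot example script on this page.

```python
# A toy relation listed pair by pair. This is only an illustration of what
# "a relation between two languages" means; it is not how foma stores an FST.
RELATION = {
    ("nit-waanii", "nitaanii"),
    ("w-ínni", "ónni"),
    # One underlying form may be related to more than one surface form.
    ("áak-oto-apinnii-wa", "áakotaapinniiwa"),
    ("áak-oto-apinnii-wa", "áakotapinniiwa"),
}


def apply_down(relation, underlying):
    """Return the set of surface forms related to an underlying form."""
    return {surface for (under, surface) in relation if under == underlying}


def apply_up(relation, surface):
    """Return the set of underlying forms related to a surface form."""
    return {under for (under, surf) in relation if surf == surface}


if __name__ == "__main__":
    print(apply_down(RELATION, "áak-oto-apinnii-wa"))
    print(apply_up(RELATION, "ónni"))
```

Because the relation is symmetric in this sense, the same object can be used for generation (underlying to surface) and for parsing (surface to underlying).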
There are several computer programs that can be used to create FSTs and do conversions on strings. Probably the most widely known of these is XFST, the Xerox Finite State Tool. The OLD uses the open source program foma. Other similar tools are SFST, HFST & PC-Kimmo. Programs like XFST and foma define languages that facilitate the specification of FSTs. The FST specification language implemented in foma allows one to generate FSTs via scripts containing SPE-style context-sensitive rewrite rules. It is in writing such a script that one specifies a phonology for a language being analyzed and documented on an OLD application.
Give your phonology a name. You may also provide a description.
Write the phonology foma script in the "Foma Script" input. The script must meet the following requirements:
Overview of foma's scripting language:
A -> B || C _ D
    means "rewrite A as B only when it occurs between C and D"
A (->) B || C _ D
    means "optionally rewrite A as B only when it occurs between C and D"
[..]
    denotes the empty symbol in a rewrite rule, e.g., [..] -> s || i _ t means "insert an 's' between an 'i' and a 't'"
define name expression
    assigns the FSM/FST generated by expression to name, e.g., define iLoss i -> 0 || y _ "-" [a | o];
.o.
    denotes the composition operation that forms a single FST from two or more, e.g., define phonology assimilation .o. devoicing;
#
    is used to comment out lines
"#"
    is used to reference word boundaries, i.e., the left or right edge of a word. Before using the parser to analyze a word, an OLD application will first enclose it in "#" symbols.
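For readers more familiar with regular-expression substitution than with rewrite rules, the following Python sketch gives a rough intuition for obligatory rules of the form A -> B || C _ D and for composition. It is an approximation under stated assumptions: the helpers rewrite and compose are hypothetical, contexts are passed as Python regex fragments, the left context must be fixed-width (a restriction of Python lookbehind, not of foma), and optional rules with (->) are not covered. foma compiles rules into transducers and does not work this way internally.

```python
import re


def rewrite(target, replacement, left="", right=""):
    """Emulate the obligatory rule `target -> replacement || left _ right`.

    The contexts become lookbehind/lookahead assertions, so they constrain
    the match without being consumed or rewritten themselves.
    """
    parts = []
    if left:
        parts.append("(?<=%s)" % left)
    parts.append(target)
    if right:
        parts.append("(?=%s)" % right)
    pattern = re.compile("".join(parts))
    return lambda word: pattern.sub(replacement, word)


def compose(*rules):
    """Emulate `.o.`: feed the output of each rule into the next one."""
    def composed(word):
        for rule in rules:
            word = rule(word)
        return word
    return composed


# [..] -> s || i _ t   (insertion: the target is the empty string)
s_insertion = rewrite("", "s", left="i", right="t")

# Two illustrative rules, not taken from any real phonology:
# n -> m || _ "-" p   and   "-" -> 0
assimilation = rewrite("n", "m", right="-p")
break_delete = rewrite("-", "")
phonology = compose(assimilation, break_delete)

if __name__ == "__main__":
    print(s_insertion("it"))        # ist
    print(phonology("in-portare"))  # importare
```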
For detailed information on writing foma scripts, please consult the documentation on the foma home page, in particular the morphological analysis tutorial.
Here is an example foma script which implements a phonology of the Blackfoot language. The rules are taken (with some modification and interpretation) from Frantz (1997).
################################################################################
# The phonological rules of Frantz (1997) as FSTs
################################################################################
# How to understand this file
################################################################################
# - "A -> B || C _ D" means "rewrite A as B only when it occurs between C and D"
# - "(->)" is optional rewrite
# - "define name expression" assigns the FSM/FST generated by "expression" to
# "name"
# - Special characters (e.g., "-") need to be enclosed in quotes
# - ".o." denotes the composition operation
# - "[..]" denotes the empty symbol in a rewrite rule (using "0" in an insertion
# rule will result in 1 or more (!) insertions)
# Comments
################################################################################
# Some interpretation of the ordered rewrite rules of Frantz (1997) was
# required:
# - what to do with the morpheme segmentation symbol "-" in the rules
# - Frantz (1997) provides a partial ordering: some decisions had to be made
#test nit-waanIt-k-wa nitaanikka
#test nit-waanIt-aa-wa nitaanistaawa
#test nit-siksipawa nitssiksipawa
#test nit-ssikópii nitsssikópii
#test á-sínaaki-wa áísínaakiwa
#test nikáá-ssikópii nikáíssikópii
#test káta'-simi-wa kátai'simiwa
#test áak-oto-apinnii-wa áakotaapinniiwa áakotapinniiwa
#test w-ínni ónni
#test w-iihsíssi ohsíssi
#test áak-Ipiima áaksipiima
#test kitsí'powata-oaawa kitsí'powatawaawa
#test á-Io'kaa-wa áyo'kaawa
#test yaatóó-t aatóót
#test waaníí-t aaníít
#test w-óko'si óko'si
#test á-yo'kaa-o'pa áyo'kao'pa
#test imitáá-iksi imitáíksi
#test á-yo'kaa-yi-aawa áyo'kaayaawa
#test á-ihpiyi-o'pa áíhpiyo'pa
#test á-okstaki-yi-aawa áókstakiiyaawa áókstakiyaawa
#test á-okska'si-o'pa áókska'so'pa
#test nit-Ioyi nitsoyi
#test otokska'si-hsi otokska'ssi
#test otá'po'taki-hsi otá'po'takssi
#test pii-hsini pissini
#test áak-yaatoowa áakaatoowa
#test nit-waanii nitaanii
#test kikáta'-waaniihpa kikáta'waaniihpa
#test áíhpiyi-yináyi áíhpiiyináyi áíhpiyiyináyi
#test áókska'si-hpinnaan áókska'sspinnaan
#test nit-it-itsiniki nitsitsitsiniki
#test á'-omai'taki-wa áó'mai'takiwa
#test káta'-ookaawaatsi kátaookaawaatsi
#test káta'-ottakiwaatsi kátaoottakiwaatsi
#test á'-isttohkohpiy'ssi áíisttohkohpiy'ssi
#test á'-o'tooyiniki áó'tooyiniki
#test káta'-ohto'toowa kátao'ohto'toowa kátaohto'toowa
#test nit-ssksinoawa nitssksinoawa
#test á-okska'siwa áókska'siwa
#test atsikí-istsi atsikíístsi
#test kakkóó-iksi kakkóíksi
#test nit-ihpiyi nitsspiyi
define phonemes [p|t|k|m|n|s|w|y|h|"'"|a|i|o|á|í|ó];
define vowels [a|i|o|á|í|ó];
define accentedVowels [á|í|ó];
define consonants [p|t|k|m|n|s|w|y];
define obstruents [p|t|k|m|n|s];
define stops [p|t|k|m|n];
define plosives [p|t|k];
define glides [w|y];
# 1. C1-C2 -> C2C2
# Gemination
define pGem plosives "-" -> p || _ p;
define tGem plosives "-" -> t || _ t;
define kGem plosives "-" -> k || _ k;
define gemination pGem .o. tGem .o. kGem;
# 2. It -> Ist
# s-Insertion (assumes that "breaking I" is a phoneme)
define sInsertion [..] -> s || I _ t;
# 3.a. C-s -> Css
# s-Connection A
define sConnectionA "-" -> s || stops _ s;
# 3.b. V(')-s -> V(')-is
# s-Connection B
# condition: where 's' is not part of a suffix
# present implementation: rule is optional
define sConnectionB [..] (->) i || vowels ("'") "-" _ s;
# 4. o-a -> aa
# o-Replacement
# note: for some speakers the o is deleted
# condition: where 'a' is not part of a suffix
# present implementation: rule is optional
define oReplacementA o (->) [a | 0] || _ "-" a;
define oReplacementB ó (->) [á | 0] || _ "-" a;
define oReplacementC [o | ó] (->) [á | 0] || _ "-" á;
define oReplacement oReplacementA .o. oReplacementB .o. oReplacementC;
# 5. w-i(i) -> o
# Coalescence
define coalescenceA w "-" i (i) -> o || _ [p|t|k|m|n|s|w|y|h|"'"];
define coalescenceB w "-" í (í) -> ó || _ [p|t|k|m|n|s|w|y|h|"'"];
define coalescence coalescenceA .o. coalescenceB;
# 6. k-I -> ksi
# Breaking
define breaking "-" -> s || k _ I;
# 7. I -> i
# Neutralization
define neutralization I -> i;
# 8.a. V-iV -> VyV
# Desyllabification A
define desyllabificationA "-" i -> y || vowels _ vowels;
# 8.b. V-oV -> VwV
# Desyllabification B
define desyllabificationB "-" o -> w || vowels _ vowels;
# 9. #G -> 0
# Semivowel Drop
define semivowelDrop glides -> 0 || "#" _;
# 10. V1V1-V -> V1V
# Vowel Shortening
define vowelShorteningA [a | á] -> 0 || [a | á] _ "-" vowels;
define vowelShorteningI [i | í] -> 0 || [i | í] _ "-" vowels;
define vowelShorteningO [o | ó] -> 0 || [o | ó] _ "-" vowels;
define vowelShortening vowelShorteningA .o. vowelShorteningI .o. vowelShorteningO;
# 11. Vyi-{a,o} -> Vy{a,o}
# i-Loss
define iLossA [i|í] -> 0 || [a|á|o|ó] y _ [a|á|o|ó];
define iLossB i y [i|í] -> i (i) y || _ [a|á|o|ó];
define iLossC í y [i|í] -> í (í) y || _ [a|á|o|ó];
define iLoss iLossA .o. iLossB .o. iLossC;
# 12. si{a,o} -> s{a,o}
# i-Absorption
define iAbsorption [i|í] ("-") -> 0 || s _ [a|á|o|ó];
# 13. sihs -> ss
# ih-Loss
define ihLoss [i|í] "-" h -> 0 || s _ s;
# 14. ihs -> ss
# Presibilation
define presibilation [i|í] "-" h -> s || _ s;
# 15. CG -> C , where C ne "'"
# Semivowel Loss
define semivowelLoss "-" glides -> 0 || obstruents _;
# 16. Ciyiy -> Ciiy
# y-Reduction (optional)
define yReduction y (->) 0 || [obstruents | "'"] [i|í] _ [i|í] ("-") y;
# 17. sih -> ss
# Postsibilation
define postsibilation [i|í] ("-") h -> s || s _;
# 18. ti -> tsi
# t-Affrication
define tAffrication "-" -> s || t _ [i|í];
# 19. V'VC -> VV'C
# Glottal Metathesis
define glottalMetathesisA "'" "-" a -> "-" a "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisAccA "'" "-" á -> "-" á "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisALong "'" "-" a a -> "-" a a "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisAccALong "'" "-" á á -> "-" á á "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisI "'" "-" i -> "-" i "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisAccI "'" "-" í -> "-" í "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisILong "'" "-" i i -> "-" i i "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisAccILong "'" "-" í í -> "-" í í "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisO "'" "-" o -> "-" o "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisAccO "'" "-" ó -> "-" ó "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisOLong "'" "-" o o -> "-" o o "'" || vowels _ [consonants|"'"|h];
define glottalMetathesisAccOLong "'" "-" ó ó -> "-" ó ó "'" || vowels _ [consonants|"'"|h];
define glottalMetathesis glottalMetathesisA .o. glottalMetathesisAccA .o.
glottalMetathesisALong .o. glottalMetathesisAccALong .o.
glottalMetathesisI .o. glottalMetathesisAccI .o. glottalMetathesisILong .o.
glottalMetathesisAccILong .o. glottalMetathesisO .o.
glottalMetathesisAccO .o. glottalMetathesisOLong .o.
glottalMetathesisAccOLong;
# 20. VV1V1'C -> VV1V1C
# Glottal Loss
define glottalLossA a a "'" -> a a || vowels ("-") _ consonants;
define glottalLossAccA á á "'" -> á á || vowels ("-") _ consonants;
define glottalLossI i i "'" -> i i || vowels ("-") _ consonants;
define glottalLossAccI í í "'" -> í í || vowels ("-") _ consonants;
define glottalLossO o o "'" -> o o || vowels ("-") _ consonants;
define glottalLossAccO ó ó "'" -> ó ó || vowels ("-") _ consonants;
define glottalLoss glottalLossA .o. glottalLossAccA .o. glottalLossI .o.
glottalLossAccI .o. glottalLossO .o. glottalLossAccO;
# 21. V'(s)CC -> VV(s)CC , where C ne 's'
# Glottal Assimilation
define glottalAssimilationA a "'" -> a a || _ (s) [p p | t t | k k | m m | n n];
define glottalAssimilationAAcc á "'" -> á á || _ (s) [p p | t t | k k | m m | n n];
define glottalAssimilationI i "'" -> i i || _ (s) [p p | t t | k k | m m | n n];
define glottalAssimilationIAcc í "'" -> í í || _ (s) [p p | t t | k k | m m | n n];
define glottalAssimilationO o "'" -> o o || _ (s) [p p | t t | k k | m m | n n];
define glottalAssimilationOAcc ó "'" -> ó ó || _ (s) [p p | t t | k k | m m | n n];
define glottalAssimilation glottalAssimilationA .o. glottalAssimilationAAcc .o.
glottalAssimilationI .o. glottalAssimilationIAcc .o. glottalAssimilationO .o.
glottalAssimilationOAcc;
# 22. '' -> '
# Glottal Reduction
define glottalReduction "'" "'" -> "'";
# 23. V1'h -> V1'V1h
# Vowel Epenthesis
# note: In place of this rule, some speakers have the following rule:
# ' -> 0 / _ h
define vowelEpenthesisA a "'" -> [a "'" a | a] || _ h;
define vowelEpenthesisAAcc á "'" -> [á "'" á | á] || _ h;
define vowelEpenthesisI i "'" -> [i "'" i | i] || _ h;
define vowelEpenthesisIAcc í "'" -> [í "'" í | í] || _ h;
define vowelEpenthesisO o "'" -> [o "'" o | o] || _ h;
define vowelEpenthesisOAcc ó "'" -> [ó "'" ó | ó] || _ h;
define vowelEpenthesis vowelEpenthesisA .o. vowelEpenthesisAAcc .o.
vowelEpenthesisI .o. vowelEpenthesisIAcc .o.
vowelEpenthesisO .o. vowelEpenthesisOAcc;
# 24. sssC -> ssC
# sss-Shortening
define sssShortening s -> 0 || _ s s [stops | glides];
# 25.
# Accent Spread
define accentSpreadA a -> á || accentedVowels "-" _;
define accentSpreadI i -> í || accentedVowels "-" _;
define accentSpreadO o -> ó || accentedVowels "-" _;
define accentSpread accentSpreadO .o. accentSpreadA .o. accentSpreadI;
# 26. - -> 0
# Break-Delete
define breakDelete "-" -> 0;
define phonology semivowelLoss .o.
gemination .o.
coalescence .o.
sInsertion .o.
sConnectionB .o.
yReduction .o.
breaking .o.
oReplacement .o.
ihLoss .o.
sConnectionA .o.
presibilation .o.
sssShortening .o.
semivowelDrop .o.
vowelShortening .o.
neutralization .o.
tAffrication .o.
postsibilation .o.
iAbsorption .o.
desyllabificationB .o.
desyllabificationA .o.
glottalMetathesis .o.
vowelEpenthesis .o.
glottalReduction .o.
glottalLoss .o.
glottalAssimilation .o.
accentSpread .o.
breakDelete .o.
iLoss;
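The order of the rules in the final composition above matters. As a rough illustration, the Python sketch below approximates four of the rules (neutralization, tAffrication, iAbsorption and breakDelete) with regular-expression substitutions and applies them to the test input nit-Ioyi in two different orders. The regex versions are simplifications written for this example, not a translation of the whole phonology.

```python
import re


def neutralization(word):
    # Rule 7: I -> i
    return word.replace("I", "i")


def t_affrication(word):
    # Rule 18: "-" -> s || t _ [i|í]
    return re.sub(r"(?<=t)-(?=[ií])", "s", word)


def i_absorption(word):
    # Rule 12: [i|í] ("-") -> 0 || s _ [a|á|o|ó]
    return re.sub(r"(?<=s)[ií]-?(?=[aáoó])", "", word)


def break_delete(word):
    # Rule 26: "-" -> 0
    return word.replace("-", "")


def apply_in_order(word, rules):
    """Apply each rule to the output of the previous one."""
    for rule in rules:
        word = rule(word)
    return word


# Same relative order as in the composition above.
script_order = [neutralization, t_affrication, i_absorption, break_delete]
# t-Affrication moved before neutralization: it no longer sees an "i".
swapped_order = [t_affrication, neutralization, i_absorption, break_delete]

if __name__ == "__main__":
    print(apply_in_order("nit-Ioyi", script_order))   # nitsoyi
    print(apply_in_order("nit-Ioyi", swapped_order))  # nitioyi
```

With the script's order, the result matches the test comment "#test nit-Ioyi nitsoyi". With the two rules swapped, t-Affrication finds no "i" after the morpheme break, so i-Absorption has no "s" to its left and the expected form is never produced.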
Find below a simple Python script that can be used to test a phonology against a series of word/analysis pairs and print a report. This is especially useful when one is trying to get the correct ordering for composing a large number of rules. Simply include comments of the form
#test morpheme-break morphemebreak
#test bush-s bushes
in your foma script and the Python script will tell you whether your phonology correctly phonologizes "bush-s" to "bushes", etc.
To get the script, copy and paste its contents into a text editor and save the resulting file as 'phonology_tester.py'. The script contains its own usage instructions.
"""Script takes a foma script and tests its phonology regex against its tests.

The tests are comments prefixed by "#test". The tests are of the format

    #test input output1 (output2)

For example:

    #test imitaa-iksi imitaiksi

Assuming the above example, this script would report whether "imitaa-iksi"
phonologizes to "imitaiksi".

How to use this script
================================================================================

1. Make sure you have foma and Python installed.
2. Place this script in the same directory as your foma script.
3. Make sure you have some test comments in your foma script (see above).
4. Make sure your foma phonology script is named phonology.foma or else replace
   the value of phonologyFileName with your phonology script's name.
5. Run "python phonology_tester.py" at the command line. Run it with the "-g"
   option to (re)generate the binary FST without running the tests.
"""

from __future__ import print_function

import codecs
import os
import subprocess
import sys

phonologyFileName = 'phonology.foma'
phonologyTestingFileName = 'phonology_testing.foma'
phonologyTestingShellScriptName = 'phonology_testing.sh'
phonologyTestingBinaryFileName = 'phonology_testing.foma.bin'

# Foma reserved symbols. See
# http://code.google.com/p/foma/wiki/RegularExpressionReference#Reserved_symbols
fomaReserved = [u'\u0021', u'\u0022', u'\u0023', u'\u0024', u'\u0025', u'\u0026',
    u'\u0028', u'\u0029', u'\u002A', u'\u002B', u'\u002C', u'\u002D', u'\u002E',
    u'\u002F', u'\u0030', u'\u003A', u'\u003B', u'\u003C', u'\u003E', u'\u003F',
    u'\u005B', u'\u005C', u'\u005D', u'\u005E', u'\u005F', u'\u0060', u'\u007B',
    u'\u007C', u'\u007D', u'\u007E', u'\u00AC', u'\u00B9', u'\u00D7', u'\u03A3',
    u'\u03B5', u'\u207B', u'\u2081', u'\u2082', u'\u2192', u'\u2194', u'\u2200',
    u'\u2203', u'\u2205', u'\u2208', u'\u2218', u'\u2225', u'\u2227', u'\u2228',
    u'\u2229', u'\u222A', u'\u2264', u'\u2265', u'\u227A', u'\u227B']


def escapeFomaReserved(i):
    """Enclose every foma reserved symbol in the string i in double quotes."""
    def escape(ii):
        if ii in fomaReserved:
            ii = u'"%s"' % ii
        return ii
    return u''.join([escape(x) for x in i])


def getTests(phonologyList):
    """Return one (input, [expected outputs]) tuple per "#test" line."""
    tests = []
    for line in phonologyList:
        if line[:5] == "#test":
            fields = line.split()
            tests.append((fields[1], fields[2:]))
    return tests


def getPhonologyTestingFile(phonologyList, tests):
    """Return the phonology script extended with a lexicon of the test inputs."""
    regexes = [u' '.join([escapeFomaReserved(x) for x in t[0]]) for t in tests]
    morphotactics = u'define morphotactics "#" [%s] "#";' % u' | \n    '.join(
        regexes)
    return u'%s\n\n\n%s\n\n\n%s\n\n\n%s' % (
        u''.join(phonologyList),
        morphotactics,
        u'define morphophonology morphotactics .o. phonology;',
        u'regex morphophonology;')


def phonologize(underlyingWord):
    """Return the surface forms that flookup generates for underlyingWord."""
    binaryFilePath = os.path.join(
        os.getcwd(), phonologyTestingBinaryFileName)
    process = subprocess.Popen(
        ['flookup', '-x', '-i', binaryFilePath],
        shell=False,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE)
    output = process.communicate(underlyingWord.encode('utf-8'))[0]
    result = output.decode('utf-8').split('\n')
    return [x.replace('#', '') for x in result if x]


def generateFST():
    """Write the testing foma script and compile it to a binary FST."""
    phonologyFile = codecs.open(phonologyFileName, 'r', 'utf-8')
    phonologyList = phonologyFile.readlines()
    phonologyFile.close()

    # Get tests to perform on the phonology
    tests = getTests(phonologyList)

    # Write the phonology foma script to be used in the testing. The file is
    # closed before foma runs so that no buffered text is missing from it.
    phonologyTestingFile = codecs.open(phonologyTestingFileName, 'w', 'utf-8')
    phonologyTestingFile.write(getPhonologyTestingFile(phonologyList, tests))
    phonologyTestingFile.close()

    # Create the shell script for generating the binary FST
    shellScript = open(phonologyTestingShellScriptName, 'w')
    cmd = 'foma -e "source %s" -e "save stack %s" -e "quit"' % (
        phonologyTestingFileName, phonologyTestingBinaryFileName)
    shellScript.write(cmd)
    shellScript.close()
    os.chmod(phonologyTestingShellScriptName, 0o755)

    # Generate the binary file FST for the phonology
    scriptFullPath = os.path.join(os.getcwd(), phonologyTestingShellScriptName)
    process = subprocess.Popen([scriptFullPath], shell=True,
                               stdout=subprocess.PIPE)
    output = process.communicate()[0].decode('utf-8')
    print(output)


def performTests():
    """Run every "#test" in the phonology script and print a report."""
    if phonologyTestingBinaryFileName not in os.listdir(os.getcwd()):
        print('HAVE TO GENERATE')
        print(os.listdir(os.getcwd()))
        generateFST()
    phonologyFile = codecs.open(phonologyFileName, 'r', 'utf-8')
    phonologyList = phonologyFile.readlines()
    phonologyFile.close()
    tests = getTests(phonologyList)
    report = u''
    failures = u''
    for test in tests:
        passed = u'GOOD'
        details = u''
        result = phonologize(u'#' + test[0] + u'#')
        expecteds = test[1]
        for e in expecteds:
            if e in result:
                details += u'\t%s is a surface realization\n' % e
            else:
                details += u'\t%s IS NOT A SURFACE REALIZATION\n' % e
                passed = u'BAD'
        details += u'\tsurface realizations: %s' % u', '.join(result)
        tmp = u'%s: %s\n%s\n' % (passed, test[0], details)
        report += tmp
        if passed == u'BAD':
            failures += tmp
    print(report)
    if failures:
        print('TESTS FAILED:\n%s' % failures)


if __name__ == '__main__':
    option = sys.argv[1] if len(sys.argv) > 1 else None
    if option == '-g':
        generateFST()
    else:
        performTests()
Enter a string of morphemes (i.e., a morphological analysis of a word) in the input box and click 'Phonologize'. The system will use the currently open phonology to return the surface phonetic representation(s) corresponding to the morphemic analysis you have entered. This allows you to see whether your phonology is behaving as you want it to.
(Note: in foma terms, this is the output of running apply down morpheme-string from within foma or running echo "morpheme-string" | flookup -x -i phonology.foma.bin at the command line.)
The program foma must be installed on the server in order for an OLD phonology to function.
Your system administrator must install foma. This involves installing libreadline, zlib1g-dev, foma and flookup. On a Debian-based system, try the following:
apt-get install libreadline-dev
apt-get install zlib1g-dev
wget http://dingo.sbs.arizona.edu/~mhulden/foma-0.9.15alpha.tar.gz
tar -xzvf foma-0.9.15alpha.tar.gz
cd foma
touch *.c *.h
make
make install
The above installed foma but not the flookup utility. In order to install flookup, I downloaded the flookup binary (available here) and copied it to /usr/local/bin.
The Test Phonology on Itself function searches the foma script that represents this phonology for test comments, i.e., lines whose first five characters are #test. The system expects such lines to consist of '#test' followed by a space, then a morphemic segmentation of a word, then one or more strings representing the surface representation of that word. For each such line, the system applies the phonology to the first element, i.e., the morphemic segmentation, and tests whether each subsequent predicted surface representation is in the set of surface representations returned by the phonology.
For example, the comment #test in-portare importare would result in the system applying the phonology to 'in-portare' and testing whether 'importare' is in the set of results returned.
This functionality facilitates the process of determining whether a given phonology captures a carefully chosen set of data. It is especially useful when trying to figure out an acceptable ordering for composing a large number of rules.