english package notes

english - a package of classes for processing utterances written in the English language.

version 1.01 2002 February - Author: James A. Mason

At present this package contains three classes: EnglishWord, EnglishNoun, and Nounphrase1.

EnglishWord is a class which has static methods for computing morphological and other properties of English words. The current version is only a beginning. It provides just two capabilities:

1. a method isPreposition for determining whether a given word belongs to a list of (currently 57) prepositions in the English language. Note: This list does not include multi-word prepositions like "in front of" or "next to".

2. a method processApostrophe for analyzing words that contain apostrophes, and for expanding them into strings of one or more separate lexical elements. Such words include possessives, like "boy's", "girls'", "man's", and "women's", contractions like "it's", "can't", "'tisn't", "goin'", and "I'd", and plurals like "x's" and "y's".
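
The two capabilities above can be pictured with a minimal, self-contained sketch. It is not the EnglishWord source: the class name WordSketch, the method name expandApostrophe, the short preposition list, the contraction table, and the choice of a list of strings as the return type are all assumptions made for illustration. The real isPreposition consults a list of 57 prepositions, and the real processApostrophe handles many more cases.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WordSketch {
    // Illustrative subset only; not the 57-entry list used by EnglishWord.
    private static final Set<String> PREPOSITIONS = new HashSet<>(Arrays.asList(
        "about", "at", "by", "for", "from", "in", "of", "on", "to", "under", "with"));

    // Hypothetical table; each contraction is given a single reading here,
    // although "it's" could also stand for "it has".
    private static final Map<String, List<String>> CONTRACTIONS = new HashMap<>();
    static {
        CONTRACTIONS.put("can't", Arrays.asList("can", "not"));
        CONTRACTIONS.put("it's", Arrays.asList("it", "is"));
    }

    // Simple membership test against the single-word list.
    static boolean isPreposition(String word) {
        return PREPOSITIONS.contains(word.toLowerCase());
    }

    // Splits a word containing an apostrophe into separate lexical elements.
    static List<String> expandApostrophe(String word) {
        String lower = word.toLowerCase();
        if (CONTRACTIONS.containsKey(lower)) {
            return CONTRACTIONS.get(lower);          // known contraction
        }
        if (lower.endsWith("'s")) {                  // singular possessive, e.g. "boy's"
            return Arrays.asList(lower.substring(0, lower.length() - 2), "'s");
        }
        if (lower.endsWith("s'")) {                  // plural possessive, e.g. "girls'"
            return Arrays.asList(lower.substring(0, lower.length() - 1), "'");
        }
        return Arrays.asList(lower);                 // nothing to expand
    }

    public static void main(String[] args) {
        System.out.println(isPreposition("under"));
        System.out.println(expandApostrophe("can't"));
        System.out.println(expandApostrophe("boy's"));
        System.out.println(expandApostrophe("girls'"));
    }
}
```

The contraction table is checked before the possessive rules so that a word like "it's" is not mistaken for a possessive.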

EnglishNoun is a class which has static methods for computing morphological and other properties of English nouns. The current version provides the following methods:

1. a method isPlural for determining whether a given string can be a plural English noun.

2. a method isSingular for determining whether a given string can be a singular English noun.

3. a method pluralOf for computing a plural form of a given string, which is presumed to be an English noun. The given noun may be singular, or it may already be plural. If the word may have more than one plural form, the method produces the most common one (e.g., "brothers" rather than "brethren" as plural of "brother").

4. a method singularOf for computing a singular form of a given string, which is presumed to be an English noun. The given noun may be plural, or it may already be singular. If the word may have more than one singular form, the method produces the most common one (e.g., "axe" rather than "axis" as singular of "axes").
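
As a rough illustration of the kind of rule a method like pluralOf might apply, here is a minimal sketch. It is not the EnglishNoun implementation: the class name PluralSketch, the three-entry irregular table, and the suffix rules are assumptions. Unlike the pluralOf described above, this sketch assumes its argument is a singular, lower-case noun and does not detect words that are already plural.

```java
import java.util.HashMap;
import java.util.Map;

class PluralSketch {
    // Hypothetical table of irregular plurals; a real one would be much longer.
    private static final Map<String, String> IRREGULAR = new HashMap<>();
    static {
        IRREGULAR.put("man", "men");
        IRREGULAR.put("woman", "women");
        IRREGULAR.put("child", "children");
    }

    // Returns a plural form for a noun assumed to be singular and lower-case.
    static String pluralOf(String noun) {
        String irregular = IRREGULAR.get(noun);
        if (irregular != null) {
            return irregular;                        // irregular form takes priority
        }
        int n = noun.length();
        if (noun.endsWith("s") || noun.endsWith("x") || noun.endsWith("z")
                || noun.endsWith("ch") || noun.endsWith("sh")) {
            return noun + "es";                      // sibilant ending: box -> boxes
        }
        if (n >= 2 && noun.endsWith("y") && "aeiou".indexOf(noun.charAt(n - 2)) < 0) {
            return noun.substring(0, n - 1) + "ies"; // consonant + y: city -> cities
        }
        return noun + "s";                           // default rule: boy -> boys
    }

    public static void main(String[] args) {
        for (String w : new String[] {"boy", "city", "box", "man", "brother"}) {
            System.out.println(w + " -> " + pluralOf(w));
        }
    }
}
```

Checking the irregular table first keeps the regular suffix rules simple; choosing "the most common" plural among several candidates, as the real method does, would need a larger table with a preferred entry per word.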

Nounphrase1 is an example that provides semantic interpretations for an ASDGrammar, nounphrasesimple.grm, of English noun phrases. It also provides a graphical user interface for accepting phrases from a user and attempting to parse them as English noun phrases. When a parse is successful, Nounphrase1 displays the set of features and feature values which represents the meaning of the parsed phrase.

The current versions of nounphrasesimple.grm and Nounphrase1 do not provide a full-coverage grammar of English noun phrases. Instead, they are intended to provide a reasonably good start at building a grammar of English noun phrases, and to provide examples of interesting ways in which the ASDEditor and ASDParser can be used. In particular, this example illustrates:

* use of sets of semantic features and feature values to represent meanings of phrases and subphrases;

* use of the ASDParser raiseFeatures method to copy features and values from a subphrase to the higher-level phrase in which the subphrase occurs;

* use of semantic feature values to enforce grammatical agreement requirements (e.g., singular or plural agreement) and to terminate parses or parse paths which are syntactically or semantically anomalous. For example, see the methods NOUN_1_action and QUANTITY_1_action in Nounphrase1, and the corresponding nodes (NOUN 1) and (QUANTITY 1) in nounphrasesimple.grm or in nounphrasesimple0.grm.

* morphological analysis of words before a phrase is parsed (See the morphologicallyAnalyze method in Nounphrase1.java.);

* further morphological analysis of words while a phrase is being parsed (See method NOUN_1_action in Nounphrase1.java.);

* use of settings of a boolean flag (named "strict" in this example) to permit relaxation of semantic feature rules to accept semi-grammatical phrases. For example, try parsing "two boy", "the one boys", "a boys", and the like.

* use of methods associated with grammar nodes to produce side effects, such as output to the user, during a parse. For example, see the methods NOUN_1_action, QUANTITY_1_action, UNKNOWNWORD_1_action, UNKNOWNWORD_2_value and UNKNOWNWORD_3_value.

* use of the UNKNOWN keyword in a grammar, to accept unknown words in given phrases and interpret them (in this example) as hyphenated CARDINALs, or ADJECTIVEs, or NOUNs;

* use of a separate ASDParser instance to parse parts of a given phrase (in this example, hyphenated words which may yield CARDINALs), so that one parser can handle things differently from the other. See the method UNKNOWNCARDINAL_action, and try the phrase "twenty-five x-rays". Notice that the two ASDParsers handle hyphenated words differently: The main parser instance does not treat hyphens as separate lexical items, but the second parser instance is set to treat hyphens as separate lexical items. So the second parser can correctly parse "twenty-five" as a CARDINAL.

* updating of the grammar by the parser to include an entry for a previously unknown word. In this example, in the method UNKNOWNCARDINAL_action, a new entry is created for a hyphenated word that has been parsed successfully by taking it apart and treating the hyphen as a separate lexical item. Note: When this occurs during a run of the ASDParser, and the parser has been told to save all uniquely parsed subphrases, the effect of the new entry in the grammar may not be noticed immediately when the parser backtracks. That is because the recognition of the hyphenated word as an UNKNOWNWORD may have been judged by the parser to be a uniquely parsed subphrase.

* use of merged ASDGrammars (nounphrasesimple0.grm merged into cardinal.grm with an ASDEditor to create nounphrasesimple.grm). Creating a grammar in modular form, by constructing separate grammar files and later merging them together, has at least two advantages: It keeps the ASDEditor's display pane from becoming too cluttered, and it permits some of the grammar modules to be re-used in constructing other merged grammars.

However, there are several things to consider when grammars are produced by merging:

1. There should be only one UNKNOWN node in the combined grammar; otherwise an ASDParser will find essentially the same phrase structures, redundantly, by using different instances of UNKNOWN. In this example, since there is an instance of UNKNOWN in cardinal.grm, the first of the two grammars to be written, there is no other instance of UNKNOWN in nounphrasesimple0.grm. (In any case, nounphrasesimple0.grm was not written as a stand-alone grammar, but was written to be merged with cardinal.grm.)

2. When grammars are merged, the order in which they are merged can be significant. If a grammar, A, is loaded into an ASDEditor and another grammar, B, is merged with it, then for lexical items which are found in both A and B, the instances in grammar B are re-numbered so as not to conflict with instances in grammar A.

For instances which are initial ones, this will affect the order in which an ASDParser tries them during parsing, because an ASDParser tries the instances in order by their numbers. In the example of the ASDGrammar nounphrasesimple.grm and the Nounphrase1 program, for parsing phrases like "twenty-five x-rays", it is better to load cardinal.grm and merge nounphrasesimple0.grm into it, rather than the other way around. That way, the ASDParser will try the UNKNOWNWORD instance from cardinal.grm before it tries the UNKNOWNWORD instances from nounphrasesimple0.grm. So it will recognize a hyphenated word like "twenty-five" as a CARDINAL before it tries interpreting it as an ADJECTIVE or a NOUN.

Also, it is important to be aware that merging in a grammar file will generally re-number some of the instances from that file. If the methods for semantic action and semantic value computations are named to correspond to the grammar instances with which they are associated (e.g., method $$_1_action to correspond to node ($$ 1) in nounphrasesimple0.grm), then that correspondence will no longer be meaningful if the grammar instance in question is re-numbered. For example, in nounphrasesimple.grm, which was obtained by merging nounphrasesimple0.grm into cardinal.grm, node ($$ 1) is renumbered to ($$ 3) but still retains '$$_1_action' in its semantic action field.
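
The renumbering effect described above can be simulated with a small sketch. This is not the ASDEditor merge code or the ASDGrammar file format: the class MergeSketch, the map-of-lists representation, and the two placeholder action names for cardinal.grm are assumptions made only to show how an instance's number can change while the action name stored in its field stays the same. The sketch assumes that cardinal.grm already contains two instances of $$, which is what the renumbering of ($$ 1) to ($$ 3) implies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeSketch {
    // A grammar is modelled as: word -> semantic action field of each instance.
    // The instance number is the list position plus one.
    static Map<String, List<String>> merge(Map<String, List<String>> a,
                                           Map<String, List<String>> b) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : a.entrySet()) {
            merged.put(e.getKey(), new ArrayList<>(e.getValue()));   // A keeps its numbers
        }
        for (Map.Entry<String, List<String>> e : b.entrySet()) {
            // B's instances are appended after A's, so they receive higher numbers.
            merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
        }
        return merged;
    }

    // Instance number (1-based) of the instance whose action field is `action`.
    static int instanceNumber(Map<String, List<String>> g, String word, String action) {
        return g.get(word).indexOf(action) + 1;
    }

    // Merges a stand-in for nounphrasesimple0.grm into a stand-in for cardinal.grm.
    static int demo() {
        Map<String, List<String>> cardinal = new LinkedHashMap<>();
        // Placeholder action names; the real fields in cardinal.grm are not shown here.
        cardinal.put("$$", Arrays.asList("placeholderActionA", "placeholderActionB"));
        Map<String, List<String>> nounphrase0 = new LinkedHashMap<>();
        nounphrase0.put("$$", Arrays.asList("$$_1_action"));
        Map<String, List<String>> merged = merge(cardinal, nounphrase0);
        return instanceNumber(merged, "$$", "$$_1_action");
    }

    public static void main(String[] args) {
        System.out.println("($$ 1) from the merged-in grammar is now ($$ " + demo() + ")");
    }
}
```

Swapping the arguments to merge reverses which grammar's instances keep their original numbers, which is why the merge order matters for initial instances such as UNKNOWNWORD.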
