Dept. of Computer Science
York University
Toronto, Ontario, Canada M3J 1P3
smackenzie@acm.org
Posted: 17-May-01
Updated: 3-Sep-02
In evaluations of text
entry methods, participants are typically asked to input text using the
technique under investigation.
Evaluations are often comparative, pitting one technique against
another. The experimental software
typically presents participants with phrases of text to enter. This research note examines issues
pertaining to the phrase sets used in such evaluations.
Experimental research has several
desirable properties, including internal validity and external
validity. Internal validity implies
that the controlled variables actually produced the effects observed, while
external validity means the results are generalizable to other subjects and
situations. This seems simple enough;
however, there is a tension between these properties, in that attending too
strictly to one tends to compromise the other.
This research note pertains to one such point of tension in research on
text entry methods: the text entered by the participants.
Text entry research
typically pits one entry method against another. Thus, “entry method” is the controlled variable, and it is
manipulated over two or more levels, for example, Multitap vs. Letterwise
in an experiment comparing text entry techniques for mobile phones [1], or Qwerty vs. Opti in an experiment
comparing soft keyboard layouts [2].
Allowing participants to
freely enter “whatever comes to mind” seems desirable, since this mimics
typical usage (i.e., the results are generalizable). Although of unquestionable merit in gauging the overall usability
of a system or implementation, such a methodology is fraught with problems. Because the procedure lacks control, the measurements
include spurious behaviours, such as pondering or attending to secondary tasks.
Evaluations of an
interaction technique usually focus on performance measures (speed, accuracy,
and learning trends), and, therefore, are better served through controlled
experiments. For this reason, the
preferred procedure is to present participants with pre-selected phrases of
text. Phrases are retrieved randomly
from a set and are presented to participants one by one to enter.
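The random retrieval and presentation of phrases can be sketched as follows. This is a minimal illustration, not the actual experimental software; the class name PhrasePresenter and the fixed seed are assumptions for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class PhrasePresenter {
    private final List<String> phrases;
    private final Random random;

    public PhrasePresenter(List<String> phrases, long seed) {
        this.phrases = phrases;
        this.random = new Random(seed); // fixed seed for a repeatable order
    }

    // Retrieve a randomly selected phrase from the set.
    public String nextPhrase() {
        return phrases.get(random.nextInt(phrases.size()));
    }

    public static void main(String[] args) {
        // In practice, the phrases would be read from a file such as phrases2.txt.
        List<String> phrases = Arrays.asList(
                "video camera with a zoom lens",
                "that is very unfortunate",
                "this is a very good idea");
        PhrasePresenter presenter = new PhrasePresenter(phrases, 42L);
        for (int i = 0; i < 3; i++)
            System.out.println(presenter.nextPhrase());
    }
}
```

A fixed seed makes the presentation order repeatable, which can be helpful when debugging the experimental software; for the evaluation proper, a per-participant seed avoids order effects tied to one particular sequence.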
Participants enter each
phrase using the technique under investigation, while the experimental software
collects low-level data on participant actions. The raw measurements usually consist of timestamps and the
coincident characters or key codes.
These data are used to calculate dependent measures such as speed (in
words per minute) and accuracy (percentage of characters in error, see [4]).
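The calculation of these dependent measures can be sketched as follows. This is an illustrative fragment with hypothetical class and method names; the error rate here normalizes the Levenshtein distance between the presented and transcribed text by the length of the longer string, in the spirit of the minimum-string-distance approach of [4].

```java
public class TextEntryMeasures {
    // Entry speed in words per minute, where one "word" is defined as
    // five characters, including spaces.
    public static double wordsPerMinute(String transcribed, double seconds) {
        return (transcribed.length() / 5.0) / (seconds / 60.0);
    }

    // Character-level error rate: Levenshtein distance between the
    // presented and transcribed text, normalized by the longer length.
    public static double errorRate(String presented, String transcribed) {
        int d = levenshtein(presented, transcribed);
        int max = Math.max(presented.length(), transcribed.length());
        return max == 0 ? 0.0 : (double) d / max;
    }

    // Standard dynamic-programming Levenshtein string distance.
    static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                                    dp[i - 1][j - 1] + cost);
            }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        String presented = "the quick brown fox";
        String transcribed = "the quikc brown fox";
        System.out.printf("wpm = %.1f%n", wordsPerMinute(transcribed, 12.0));
        System.out.printf("error rate = %.3f%n", errorRate(presented, transcribed));
    }
}
```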
One of the first issues to
consider in designing a text entry evaluation is the phrase set. The goal is to use phrases that are moderate
in length, easy to remember, and with letter frequencies typical of the target
language.
In a study comparing two
soft keyboards, we used a set of 70 phrases [2]. We recently
expanded the set to 500 phrases. A few
examples from the set follow:
have
a good weekend
video camera with a zoom lens
what a monkey sees a monkey will do
that is very unfortunate
the back yard of our house
I can see the rings on Saturn
this is a very good idea
The phrases contain no
punctuation symbols, and just a few instances of uppercase characters. (Participants are instructed to ignore case,
and enter all characters in lowercase.)
The complete set is
available in a file called phrases2.txt. Researchers wishing to use
this phrase set are welcome to do so.
Some minor modifications may be necessary to convert Canadian spellings
to American spellings (e.g., colour vs. color).
The phrase set should be
representative of the target language.
We have automated the analysis of the phrase set through a simple Java
class called AnalysePhrases. Below
is an invocation with our 500-phrase set:
PROMPT>java AnalysePhrases < phrases2.txt
---------------------------------------
phrases: 500
minimum length: 16
maximum length: 43
average phrase length: 28.61
---------------------------------------
words: 2712
unique words: 1163
minimum length: 1
maximum length: 13
average word length: 4.46
words containing non-letters: 0
---------------------------------------
letters: 14304
correlation with English: 0.9541
---------------------------------------
PROMPT>
As seen, the phrases vary from 16
characters to 43 characters (mean = 28.61).
The set contains 2712 words, of which 1163 are unique. Words vary from 1 to 13 characters (mean =
4.46). The correlation in the last line
of output is with the letter frequencies of Mayzner and Tresselt [3], as given in [5]. The five
most frequent letters are as follows:
Letter    Frequency    Probability
  e         1523         .1064
  t         1080         .0755
  o         1005         .0702
  a          921         .0644
  i          879         .0614
The AnalysePhrases program is available for download to facilitate similar analyses on other
phrase sets.
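The core of such an analysis, tallying letter frequencies and correlating them against a reference distribution for English, might look as follows. This is an illustrative fragment, not the AnalysePhrases source itself; the reference frequencies would be taken from a table such as [3].

```java
import java.util.Arrays;
import java.util.List;

public class LetterFrequency {
    // Tally the counts of the letters a-z in a list of phrases
    // (case-insensitive; non-letters are ignored).
    public static int[] letterCounts(List<String> phrases) {
        int[] counts = new int[26];
        for (String phrase : phrases)
            for (char c : phrase.toLowerCase().toCharArray())
                if (c >= 'a' && c <= 'z')
                    counts[c - 'a']++;
        return counts;
    }

    // Pearson correlation between two frequency vectors of equal length.
    public static double correlation(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;
        return cov / Math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
    }

    public static void main(String[] args) {
        int[] counts = letterCounts(Arrays.asList("video camera with a zoom lens"));
        System.out.println("count of 'e': " + counts['e' - 'a']);
    }
}
```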
We have also compiled a list of
unique words and their frequencies in the phrase set. The list is available in
two forms, sorted by frequency, or sorted by word. Not surprisingly, ‘the’ is
the most frequent word (n = 189).
The five most frequent words are as follows:
Word    Frequency    Probability
the       189          .0697
a         108          .0398
is         85          .0313
to         57          .0210
of         54          .0199
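Compiling such a word-frequency list is straightforward; a minimal sketch follows (the class name WordFrequencies is hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordFrequencies {
    // Count the frequency of each word in a list of phrases and return
    // the entries sorted by descending frequency.
    public static List<Map.Entry<String, Integer>> byFrequency(List<String> phrases) {
        Map<String, Integer> counts = new HashMap<>();
        for (String phrase : phrases)
            for (String word : phrase.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    counts.merge(word, 1, Integer::sum);
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((e1, e2) -> e2.getValue() - e1.getValue());
        return entries;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList(
                "the back yard of our house",
                "the rings on Saturn");
        for (Map.Entry<String, Integer> e : byFrequency(phrases))
            System.out.println(e.getKey() + " " + e.getValue());
    }
}
```

Sorting the same entries alphabetically by key instead of by value yields the second form of the list mentioned above.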
An issue that surfaces frequently in
discussions on the evaluation of text entry techniques is whether or not to
include punctuation or other characters in the phrase set. Here we see another point of tension between
internal and external validity. The
main argument in favour of including such characters is that the evaluation
more closely mimics real-life interaction, and, therefore, the results are
generalizable. This improves the external
validity of the experiment.
The main argument against is that
the entry of non-alpha characters introduces a confounding source of variation
in the dependent measures, and, therefore, the results are less likely to
attain statistical significance. So,
should punctuation and other characters be included in the phrase set? It depends.
The key issues are elaborated below.
In designing a controlled
experiment, practice dictates that all behaviours potentially influencing the
dependent variables (viz. speed and accuracy) are controlled, or held constant,
except those directly attributable to the variables under investigation. The variables under investigation are the
“factors” in the experiment.
Typically, the evaluation seeks to compare one text entry technique
against another; so, “interaction technique” is the critical factor. It is varied over (at least) two levels,
such as “qwerty layout” vs. “other layout”, or “T9” vs. “Multitap”.
The preferred experimental design is
one that constrains participants’ behaviours to mechanisms that differentiate
the interaction techniques. For text
entry, the most significant point of differentiation is the basic mechanism
to enter letters, words, and phrases.
If the techniques under
investigation include the same mechanism to enter punctuation and other
characters, then it is best to exclude these from the interaction, because they
do not serve to differentiate the techniques.
Instead, they represent an additional and undesirable source of
variation.
However, if the techniques under
investigation include different mechanisms to enter punctuation and other
characters, then including these merits serious consideration. If included, they represent an additional
source of variation, and therefore reduce the likelihood of attaining
statistically significant results. One
possible approach is to include “character set” as an additional factor in the
design of the experiment, with “alpha-only” and “alpha-plus-punctuation” as the
levels.
1. MacKenzie, I. S., Kober, H.,
Smith, D., Jones, T., and Skepner, E. LetterWise: Prefix-based disambiguation
for mobile text input, Proceedings of the ACM Symposium on User
Interface Software and Technology - UIST 2001. New York: ACM, 2001, 111-120.
2. MacKenzie, I. S., and Zhang, S.
X. The design and evaluation of a high-performance soft keyboard, Proceedings of the ACM Conference on Human
Factors in Computing Systems - CHI '99.
New York: ACM, 1999, 25-31.
3. Mayzner, M. S., and Tresselt, M.
E. Table of single-letter and digram frequency counts for various word-length
and letter-position combinations, Psychonomic
Monograph Supplements 1 (1965),
13-32.
4. Soukoreff, R. W., and MacKenzie,
I. S. Measuring errors in text entry tasks: An application of the Levenshtein
string distance statistic, Extended
Abstracts of the ACM Conference on Human Factors in Computing Systems -- CHI
2001. New York: ACM, 2001, 319-320.
5. Soukoreff, R. W., and MacKenzie, I.
S. Theoretical upper and lower bounds on typing speeds using a stylus and soft
keyboard, Behaviour & Information
Technology 14 (1995), 370-379.
-----
If you have any comments or suggestions,
please contact Scott MacKenzie at smackenzie@acm.org.