Dept. of Computer Science
York University
Toronto, Ontario, Canada M3J 1P3
smackenzie@acm.org
Posted: 17-May-01
Updated: 3-Sep-02
In evaluations of text
entry methods, participants are typically asked to input text using the
technique under investigation.
Evaluations are often comparative, pitting one technique against
another. The experimental software
typically presents participants with phrases of text to enter. This research note examines issues
pertaining to the phrase sets used in such evaluations.
Experimental research has several
desirable properties, including internal validity and external
validity. Internal validity implies
that the controlled variables actually produced the effects observed, while
external validity means the results are generalizable to other subjects and
situations. This seems simple enough;
however, there is a tension between these properties, in that attending too
strictly to one tends to compromise the other.
This research note pertains to one such point of tension in research on
text entry methods: the text entered by the participants.
Text entry research
typically pits one entry method against another. Thus, “entry method” is the controlled variable, and it is
manipulated over two or more levels, for example, Multitap vs. Letterwise
in an experiment comparing text entry techniques for mobile phones [1], or Qwerty vs. Opti in an experiment
comparing soft keyboard layouts [2].
Allowing participants to
freely enter “whatever comes to mind” seems desirable, since this mimics
typical usage (i.e., the results are generalizable). Although of unquestionable merit in gauging the overall usability
of a system or implementation, such a methodology is fraught with problems. Because the procedure lacks control, the measurements
include spurious behaviours, such as pondering or attending to secondary tasks.
Evaluations of an
interaction technique usually focus on performance measures (speed, accuracy,
and learning trends), and, therefore, are better served through controlled
experiments. For this reason, the
preferred procedure is to present participants with pre-selected phrases of
text. Phrases are retrieved randomly
from a set and are presented to participants one by one to enter.
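The random retrieval and presentation of phrases can be sketched as follows. This is a minimal illustration, not the actual experimental software; the class name PhrasePresenter and the fixed seed are assumptions for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class PhrasePresenter {
    private final List<String> phrases;
    private final Random random;

    public PhrasePresenter(List<String> phrases, long seed) {
        this.phrases = phrases;
        this.random = new Random(seed); // fixed seed for a repeatable order
    }

    // Retrieve a randomly selected phrase from the set.
    public String nextPhrase() {
        return phrases.get(random.nextInt(phrases.size()));
    }

    public static void main(String[] args) {
        // In practice, the phrases would be read from a file such as phrases2.txt.
        List<String> phrases = Arrays.asList(
                "video camera with a zoom lens",
                "that is very unfortunate",
                "this is a very good idea");
        PhrasePresenter presenter = new PhrasePresenter(phrases, 42L);
        for (int i = 0; i < 3; i++)
            System.out.println(presenter.nextPhrase());
    }
}
```

A fixed seed makes the presentation order repeatable, which can be helpful when debugging the experimental software; for the evaluation proper, a per-participant seed avoids order effects tied to one particular sequence.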
Participants enter each
phrase using the technique under investigation, while the experimental software
collects low-level data on participant actions. The raw measurements usually consist of timestamps and the
coincident characters or key codes.
These data are used to calculate dependent measures such as speed (in
words per minute) and accuracy (percentage of characters in error, see [4]).
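The calculation of these dependent measures can be sketched as follows. This is an illustrative fragment with hypothetical class and method names; the error rate here normalizes the Levenshtein distance between the presented and transcribed text by the length of the longer string, in the spirit of the minimum-string-distance approach of [4].

```java
public class TextEntryMeasures {
    // Entry speed in words per minute, where one "word" is defined as
    // five characters, including spaces.
    public static double wordsPerMinute(String transcribed, double seconds) {
        return (transcribed.length() / 5.0) / (seconds / 60.0);
    }

    // Character-level error rate: Levenshtein distance between the
    // presented and transcribed text, normalized by the longer length.
    public static double errorRate(String presented, String transcribed) {
        int d = levenshtein(presented, transcribed);
        int max = Math.max(presented.length(), transcribed.length());
        return max == 0 ? 0.0 : (double) d / max;
    }

    // Standard dynamic-programming Levenshtein string distance.
    static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                                    dp[i - 1][j - 1] + cost);
            }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        String presented = "the quick brown fox";
        String transcribed = "the quikc brown fox";
        System.out.printf("wpm = %.1f%n", wordsPerMinute(transcribed, 12.0));
        System.out.printf("error rate = %.3f%n", errorRate(presented, transcribed));
    }
}
```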
One of the first issues to
consider in designing a text entry evaluation is the phrase set. The goal is to use phrases that are moderate
in length, easy to remember, and with letter frequencies typical of the target
language.
In a study comparing two
soft keyboards, we used a set of 70 phrases [2]. We recently
expanded the set to 500 phrases. A few
examples from the set follow:
have
a good weekend
video camera with a zoom lens
what a monkey sees a monkey will do
that is very unfortunate
the back yard of our house
I can see the rings on Saturn
this is a very good idea
The phrases contain no
punctuation symbols, and just a few instances of uppercase characters. (Participants are instructed to ignore case,
and enter all characters in lowercase.)
The complete set is
available in a file called phrases2.txt. Researchers wishing to use
this phrase set are welcome to do so.
Some minor modifications may be necessary to convert Canadian spellings
to American spellings (e.g., colour vs. color).
The phrase set should be
representative of the target language.
We have automated the analysis of the phrase set through a simple Java
class called AnalysePhrases. Below
is an invocation with our 500-phrase set:
PROMPT>java AnalysePhrases < phrases2.txt
---------------------------------------
phrases: 500
minimum length: 16
maximum length: 43
average phrase length: 28.61
---------------------------------------
words: 2712
unique words: 1163
minimum length: 1
maximum length: 13
average word length: 4.46
words containing non-letters: 0
---------------------------------------
letters: 14304
correlation with English: 0.9541
---------------------------------------
PROMPT>
As seen, the phrases vary from 16
characters to 43 characters (mean = 28.61).
The set contains 2712 words, of which 1163 are unique. Words vary from 1 to 13 characters (mean =
4.46). The correlation in the last line
of output is with the letter frequencies of Mayzner and Tresselt [3], as given in [5]. The five
most frequent letters are as follows:
Letter    Frequency    Probability
  e         1523         .1064
  t         1080         .0755
  o         1005         .0702
  a          921         .0644
  i          879         .0614
The AnalysePhrases program is available for download to facilitate similar analyses on other
phrase sets.
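The core of such an analysis, tallying letter frequencies and correlating them against a reference distribution for English, might look as follows. This is an illustrative fragment, not the AnalysePhrases source itself; the reference frequencies would be taken from a table such as [3].

```java
import java.util.Arrays;
import java.util.List;

public class LetterFrequency {
    // Tally the counts of the letters a-z in a list of phrases
    // (case-insensitive; non-letters are ignored).
    public static int[] letterCounts(List<String> phrases) {
        int[] counts = new int[26];
        for (String phrase : phrases)
            for (char c : phrase.toLowerCase().toCharArray())
                if (c >= 'a' && c <= 'z')
                    counts[c - 'a']++;
        return counts;
    }

    // Pearson correlation between two frequency vectors of equal length.
    public static double correlation(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;
        return cov / Math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
    }

    public static void main(String[] args) {
        int[] counts = letterCounts(Arrays.asList("video camera with a zoom lens"));
        System.out.println("count of 'e': " + counts['e' - 'a']);
    }
}
```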
We have also compiled a list of
unique words and their frequencies in the phrase set. The list is available in
two forms, sorted by frequency, or sorted by word. Not surprisingly, ‘the’ is
the most frequent word (n = 189).
The five most frequent words are as follows:
Word    Frequency    Probability
the       189          .0697
a         108          .0398
is         85          .0313
to         57          .0210
of         54          .0199
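Compiling such a word-frequency list is straightforward; a minimal sketch follows (the class name WordFrequencies is hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordFrequencies {
    // Count the frequency of each word in a list of phrases and return
    // the entries sorted by descending frequency.
    public static List<Map.Entry<String, Integer>> byFrequency(List<String> phrases) {
        Map<String, Integer> counts = new HashMap<>();
        for (String phrase : phrases)
            for (String word : phrase.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    counts.merge(word, 1, Integer::sum);
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((e1, e2) -> e2.getValue() - e1.getValue());
        return entries;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList(
                "the back yard of our house",
                "the rings on Saturn");
        for (Map.Entry<String, Integer> e : byFrequency(phrases))
            System.out.println(e.getKey() + " " + e.getValue());
    }
}
```

Sorting the same entries alphabetically by key instead of by value yields the second form of the list mentioned above.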
An issue that surfaces frequently in
discussions on the evaluation of text entry techniques is whether or not to
include punctuation or other characters in the phrase set. Here we see another point of tension between
internal and external validity. The
main argument in favour of including such characters is that the evaluation
more closely mimics real-life interaction, and, therefore, the results are
generalizable. This improves the external
validity of the experiment.
The main argument against is that
the entry of non-alpha characters introduces a confounding source of variation
in the dependent measures, and, therefore, the results are less likely to
attain statistical significance. So,
should punctuation and other characters be included in the phrase set? It depends.
The key issues are elaborated below.
In designing a controlled
experiment, practice dictates that all behaviours potentially influencing the
dependent variables (viz. speed and accuracy) are controlled, or held constant,
except those directly attributable to the variables under investigation. The variables under investigation are the
“factors” in the experiment.
Typically, the evaluation seeks to compare one text entry technique
against another; so, “interaction technique” is the critical factor. It is varied over (at least) two levels,
such as “qwerty layout” vs. “other layout”, or “T9” vs. “Multitap”.
The preferred experimental design is
one that constrains participants’ behaviours to mechanisms that differentiate
the interaction techniques. For text
entry, the most significant point of differentiation is the basic mechanism
to enter letters, words, and phrases.
If the techniques under
investigation include the same mechanism to enter punctuation and other
characters, then it is best to exclude these from the interaction, because they
do not serve to differentiate the techniques.
Instead, they represent an additional and undesirable source of
variation.
However, if the techniques under
investigation include different mechanisms to enter punctuation and other
characters, then including these merits serious consideration. If included, they represent an additional
source of variation, and therefore reduce the likelihood of attaining
statistically significant results. One
possible approach is to include “character set” as an additional factor in the
design of the experiment, with “alpha-only” and “alpha-plus-punctuation” as the
levels.
1. MacKenzie, I. S., Kober, H.,
Smith, D., Jones, T., and Skepner, E. LetterWise: Prefix-based disambiguation
for mobile text input, Proceedings of the ACM Symposium on User
Interface Software and Technology - UIST 2001. New York: ACM, 2001, 111-120.
2. MacKenzie, I. S., and Zhang, S.
X. The design and evaluation of a high-performance soft keyboard, Proceedings of the ACM Conference on Human
Factors in Computing Systems - CHI '99.
New York: ACM, 1999, 25-31.
3. Mayzner, M. S., and Tresselt, M.
E. Table of single-letter and digram frequency counts for various word-length
and letter-position combinations, Psychonomic
Monograph Supplements 1 (1965),
13-32.
4. Soukoreff, R. W., and MacKenzie,
I. S. Measuring errors in text entry tasks: An application of the Levenshtein
string distance statistic, Extended
Abstracts of the ACM Conference on Human Factors in Computing Systems -- CHI
2001. New York: ACM, 2001, 319-320.
5. Soukoreff, R. W., and MacKenzie, I.
S. Theoretical upper and lower bounds on typing speeds using a stylus and soft
keyboard, Behaviour & Information
Technology 14 (1995), 370-379.
-----
If you have any comments or suggestions,
please contact Scott MacKenzie at smackenzie@acm.org.