Class ErrorMatrix

java.lang.Object
  extended by ErrorMatrix

public class ErrorMatrix
extends java.lang.Object

ErrorMatrix - a class to store the counts of presented-transcribed characters in evaluations of text entry methods.

Related publication:

Briefly, the counts are built by processing the set of optimal alignments for presented and transcribed text strings. For example, if the presented and transcribed text strings are

     quickly
     qucehkly
 
then the set of optimal alignments is

     quic--kly
     qu-cehkly 

     quic-kly 
     qucehkly

     qui-ckly 
     qucehkly 

     qu-ickly 
     qucehkly
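
One way such a set could be enumerated is sketched below. The sketch assumes that "optimal" means minimum string distance (Levenshtein distance), as the "MSD Error Rate" lines in the sample output further down suggest. The names AlignmentSketch, msdTable, walk, and alignments are hypothetical and are not part of ErrorMatrix.

```java
import java.util.ArrayList;
import java.util.List;

class AlignmentSketch {
    // Fill the standard minimum-string-distance (Levenshtein) table.
    static int[][] msdTable(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d;
    }

    // Walk back from cell (i, j), following every step that lies on an optimal path.
    static void walk(String a, String b, int[][] d, int i, int j,
                     String top, String bottom, List<String[]> out) {
        if (i == 0 && j == 0) {
            out.add(new String[] { top, bottom });
            return;
        }
        if (i > 0 && j > 0) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            if (d[i][j] == d[i - 1][j - 1] + cost) {   // correct entry or substitution
                walk(a, b, d, i - 1, j - 1,
                     a.charAt(i - 1) + top, b.charAt(j - 1) + bottom, out);
            }
        }
        if (i > 0 && d[i][j] == d[i - 1][j] + 1) {     // deletion: c2 = '-'
            walk(a, b, d, i - 1, j, a.charAt(i - 1) + top, "-" + bottom, out);
        }
        if (j > 0 && d[i][j] == d[i][j - 1] + 1) {     // insertion: c1 = '-'
            walk(a, b, d, i, j - 1, "-" + top, b.charAt(j - 1) + bottom, out);
        }
    }

    // Return every optimal alignment as a {presented, transcribed} pair.
    static List<String[]> alignments(String presented, String transcribed) {
        List<String[]> out = new ArrayList<>();
        int[][] d = msdTable(presented, transcribed);
        walk(presented, transcribed, d, presented.length(), transcribed.length(), "", "", out);
        return out;
    }

    public static void main(String[] args) {
        for (String[] pair : alignments("quickly", "qucehkly")) {
            System.out.println(pair[0]);
            System.out.println(pair[1]);
            System.out.println();
        }
    }
}
```

For the example strings the sketch finds the same four alignments listed above, although not necessarily in the same order.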
 
Counts are tallied for each occurrence of a c1-c2 character pair, where each count is the reciprocal of the alignment count for the particular presented/transcribed text string. Thus, for any phrase the counts are weighted by the number of alignments in which the c1-c2 entries occurred.

"c1" is a character in the top alignment ("presented text"). "c2" is a character in the bottom alignment ("transcribed text"). Categories of c1-c2 entries include correct entries (c1 = c2), insertion errors (c1 = "-"), substitution errors (c1 != c2), and deletion errors (c2 = "-"). Various summary statistics are retrievable via the instance methods.

The first c1-c2 pair in the example above is q-q. There are four alignments, so the first q-q pair is given a count of 1 / 4 = 0.25. However, q-q appears in each of the 4 alignments; so the weighted count for q-q is 4 x 0.25 = 1.0 for the presented-transcribed text string. Clearly, "q" was entered correctly.

Although the processing above seems convoluted, it accommodates the situation where an error occurred and the type of error is ambiguous. Consider the "i" in "quickly" above. By examining the transcribed text string, it is evident there was an error. But what was the error? Since there are four alignments, there are four possible explanations. First, there may have been a deletion error, as seen in the top alignment with c2 = "-" (weight = 0.25). However, it is also possible there was a substitution error, as seen in the bottom three alignments. In these cases, we see two i-c substitutions (weight = 2 x 0.25 = 0.50) and one i-e substitution (weight = 0.25). In most text entry evaluations, it is not known which explanation is correct. The process described here accommodates this by weighting all explanations according to their presence in the alignments.
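
The weighting can be checked with a short, self-contained sketch. It hard-codes the four alignments of the example and tallies each c1-c2 pair with a weight of 1 / 4. The class name WeightedTallySketch, the method tally, and the "c1>c2" key format are hypothetical choices made for illustration; they are not how ErrorMatrix stores its counts.

```java
import java.util.Map;
import java.util.TreeMap;

class WeightedTallySketch {
    // The four optimal alignments of "quickly" / "qucehkly" from the example.
    static final String[][] EXAMPLE = {
        { "quic--kly", "qu-cehkly" },
        { "quic-kly",  "qucehkly"  },
        { "qui-ckly",  "qucehkly"  },
        { "qu-ickly",  "qucehkly"  },
    };

    // Tally each c1-c2 pair, weighting every alignment by 1 / (number of alignments).
    static Map<String, Double> tally(String[][] alignments) {
        Map<String, Double> counts = new TreeMap<>();
        double weight = 1.0 / alignments.length;
        for (String[] pair : alignments) {
            String top = pair[0];     // presented text, '-' marks an insertion
            String bottom = pair[1];  // transcribed text, '-' marks a deletion
            for (int k = 0; k < top.length(); k++) {
                String key = top.charAt(k) + ">" + bottom.charAt(k);
                counts.merge(key, weight, Double::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Double> e : tally(EXAMPLE).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

In this sketch the entries for presented "i" are i>- = 0.25, i>c = 0.5, and i>e = 0.25, which sum to 1.0, and q>q is 1.0, matching the weights worked out above.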

The output is a matrix of size n x (n + 1), where n is the number of characters in the charDef array. The charDef array is assumed to contain the set of characters that appear in the presented text. The default charDef array contains 28 characters:

_ a b c d e f g h i j k l m n o p q r s t u v w x y z -
The "_" symbol represents the SPACE character. Rows represent the presented character, while columns represent the transcribed character. The dash ("-") at the end represents an insertion error when it identifies a row, or a deletion error when it identifies a column (see below).

The organization of the matrix is illustrated as follows:

As noted above, the error matrix also holds the counts for correct entries. These appear along the diagonal; that is, where the row index ( i ) equals the column index ( j ). So, getRowSum(i) retrieves the total number of occurrences of the character identified by index i. This includes the number of correct entries, as well as the number of substitution and deletion errors.

If just the number of errors is desired, use one of getRowSubCount(i), getRowSubOtherCount(i), getRowDelCount(i), or getRowTotCount(i).

For an insertion error, c1 = "-". Thus, these errors do not appear along the row for a particular character but along the bottom row. The number of insertion errors for a character is retrieved using getCell(rows - 1, j), where j is the index of the character.
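
The layout described above can be pictured with a small stand-alone sketch. It is not the ErrorMatrix implementation: the class MatrixLayoutSketch and its members are hypothetical, and the sketch assumes that the extra (n + 1)th column collects transcribed characters that are not in the character set. Only the "i" row and the insertion row of the "quickly" example are entered.

```java
class MatrixLayoutSketch {
    // Default symbols: '_' for SPACE, a-z, and '-' last.
    static final char[] CHARS = "_abcdefghijklmnopqrstuvwxyz-".toCharArray();

    final int rows = CHARS.length;         // 28 rows; the last row holds insertions
    final int columns = CHARS.length + 1;  // 29 columns; last one assumed to be "other"
    final double[][] cell = new double[rows][columns];

    // Return the index of c in CHARS, or -1 if it is absent.
    static int index(char c) {
        for (int i = 0; i < CHARS.length; i++) {
            if (CHARS[i] == c) return i;
        }
        return -1;
    }

    // Add count to the cell for presented c1 (row) and transcribed c2 (column).
    void enter(char c1, char c2, double count) {
        int r = index(c1);
        int c = index(c2);
        if (r < 0) return;           // presented character outside the set: ignored here
        if (c < 0) c = columns - 1;  // unknown transcribed character: "other" column
        cell[r][c] += count;
    }

    double rowSum(int r) {
        double sum = 0;
        for (double v : cell[r]) sum += v;
        return sum;
    }

    // Weighted entries for presented 'i' and for insertions in the "quickly" example.
    static MatrixLayoutSketch example() {
        MatrixLayoutSketch m = new MatrixLayoutSketch();
        m.enter('i', '-', 0.25);  // deletion of i
        m.enter('i', 'c', 0.50);  // i-c substitution
        m.enter('i', 'e', 0.25);  // i-e substitution
        m.enter('-', 'e', 0.50);  // insertions land in the bottom row
        m.enter('-', 'h', 0.50);
        m.enter('-', 'c', 0.25);
        return m;
    }

    public static void main(String[] args) {
        MatrixLayoutSketch m = example();
        System.out.println("row sum for i:   " + m.rowSum(index('i')));
        System.out.println("insertions of e: " + m.cell[m.rows - 1][index('e')]);
    }
}
```

Here the insertion count for "e" is read from the bottom row, cell[rows - 1][index('e')], mirroring the getCell(rows - 1, j) call described above.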

The ErrorMatrix class also includes a main method, serving as an application to build error tables or error matrices.

Example invocations:

     PROMPT>java ErrorMatrix
     usage: java ErrorMatrix file [-et] [-em] [-a] [-nd] [-pr] [-co]
     
     where file = a file containing presented/transcribed strings
           -et  = output error table
           -em  = output error matrix
           -a   = output alignments (use for debugging/demo)
           -nd  = null diagonal cell entries in error matrix
           -pr  = use probabilities instead of counts in error matrix
           -co  = console output (looks better on display)
           (Note: default is no output)
        
     PROMPT>java ErrorMatrix ds2-phrases.txt -et -co
     Files: 10
     Phrases: 673
     MSD Error Rate =  2.2251% (mean across characters)
     MSD Error Rate =  2.1535% (mean across phrases)
     MSD Error Rate =  2.1102% (mean across files)

     Chr        Count     Ins       Sub       Del       Total
     --------------------------------------------------------
     _      2956.0000    0.0000    0.0027    0.0185    0.0213
     a      1248.0000    0.0000    0.0083    0.0089    0.0172
     b       231.0000    0.0000    0.0412    0.0151    0.0563
     c       457.0000    0.0000    0.0241    0.0153    0.0394
     d       527.0000    0.0000    0.0063    0.0127    0.0190
     e      2051.0000    0.0000    0.0060    0.0073    0.0133
     f       335.0000    0.0000    0.0119    0.0030    0.0149
     g       355.0000    0.0000    0.0087    0.0047    0.0134
     h       723.0000    0.0000    0.0055    0.0083    0.0138
     i      1154.0000    0.0000    0.0117    0.0082    0.0199
     j        45.0000    0.0000    0.0081    0.0141    0.0222
     k       178.0000    0.0000    0.0000    0.0056    0.0056
     l       647.0000    0.0000    0.0104    0.0100    0.0205
     m       387.0000    0.0000    0.0129    0.0103    0.0233
     n      1005.0000    0.0000    0.0162    0.0112    0.0274
     o      1356.0000    0.0000    0.0156    0.0095    0.0251
     p       324.0000    0.0000    0.0093    0.0031    0.0123
     q        35.0000    0.0000    0.0000    0.0000    0.0000
     r      1058.0000    0.0000    0.0112    0.0082    0.0194
     s      1056.0000    0.0000    0.0149    0.0123    0.0271
     t      1431.0000    0.0000    0.0054    0.0100    0.0154
     u       492.0000    0.0000    0.0129    0.0094    0.0224
     v       234.0000    0.0000    0.0085    0.0085    0.0171
     w       293.0000    0.0000    0.0205    0.0068    0.0273
     x        52.0000    0.0000    0.0000    0.0000    0.0000
     y       408.0000    0.0000    0.0025    0.0025    0.0049
     z        21.0000    0.0000    0.0635    0.0317    0.0952
     -        41.6463    1.0000    1.0000    0.0000    1.0000
     --------------------------------------------------------
     Cnt:  19100.6463   41.6463  183.7073  199.6463  425.0000
     --------------------------------------------------------
     Weightd mns(%):     0.2180    0.9618    1.0452    2.2251
     --------------------------------------------------------
     Presented characters: 19059
     Transcribed characters: 18901
     Alignment characters: 19100.64634146329
     Alignment Error_rate: 2.225056%
     --------------------------------------------------------
     Number of alignments by count...
     Occurrences:           642  20   5   1   1   1   0   0   0   3
     Number_of_Alignments:    1   2   3   4   5   6   7   8   9  >10
     Max= 495
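
The summary rows of this output are related by simple ratios: each percentage appears to be an error count from the "Cnt" row divided by the number of alignment characters. The sketch below recomputes them from the rounded figures printed above. The class ErrorRateSketch and the method percent are hypothetical and only illustrate the arithmetic.

```java
class ErrorRateSketch {
    // Percentage of alignment characters accounted for by a given error count.
    static double percent(double errors, double alignmentChars) {
        return 100.0 * errors / alignmentChars;
    }

    public static void main(String[] args) {
        double alignmentChars = 19100.6463;  // "Cnt" row, Count column
        double ins = 41.6463;                // "Cnt" row, Ins column
        double sub = 183.7073;               // "Cnt" row, Sub column
        double del = 199.6463;               // "Cnt" row, Del column
        double total = 425.0;                // "Cnt" row, Total column

        System.out.println("Ins   % = " + percent(ins, alignmentChars));
        System.out.println("Sub   % = " + percent(sub, alignmentChars));
        System.out.println("Del   % = " + percent(del, alignmentChars));
        System.out.println("Total % = " + percent(total, alignmentChars));
    }
}
```

The results agree with the "Weightd mns(%)" row and the "Alignment Error_rate" line to the precision shown there.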
 
The phrases file used in the example invocation above was built (using a separate program) from the sd1 data files from a text entry experiment. It is for the "Datestamp Method #2" condition described in "Mobile text entry using three keys", by MacKenzie (NordiCHI 2002).

If the data are destined for importing into a spreadsheet, it's best to use the ErrorMatrix application without the -co option. The table portion of the output, in this case, is comma-delimited, full precision. For example, use

     PROMPT>java ErrorMatrix ds2-phrases.txt -et
 
to build an error table (similar to the example above), or

     PROMPT>java ErrorMatrix ds2-phrases.txt -em -nd
 
to build an error matrix.

The matrix data are useful for creating a "confusion matrix" -- a matrix showing the counts (or probabilities) of presented characters vs. transcribed characters. For the above invocation, the data can be saved in a file and then imported into Excel. It's a simple matter to generate a chart such as the following:


A better looking chart can be obtained using gnuplot:

Author:
Scott MacKenzie, 2002-2011

Field Summary
 char[] charDef
          The default characters associated with the rows and columns in this ErrorMatrix.
 int columns
          An int representing the number of columns in the error matrix
 int rows
          An int representing the number of rows in the error matrix
 
Constructor Summary
ErrorMatrix()
          Construct an ErrorMatrix using the default character set
ErrorMatrix(char[] custom)
          Construct an ErrorMatrix using a custom character set.
 
Method Summary
 void enter(char c1, char c2, double count)
          Enter the specified count for a c1-c2 character pair into this ErrorMatrix.
 void enter(StringPair[] sp)
          Enter the counts for an array of presented/transcribed text phrases into this ErrorMatrix.
 void enter(java.lang.String s1, java.lang.String s2)
          Enter the counts for a presented/transcribed text phrase into this ErrorMatrix.
 double getCell(int row, int col)
          Return the contents of the specified cell
 double getColInsCount(int idx)
          Return the number of insertions of the character in column idx
 double getColInsProb(int idx)
          Return the probability of an Insertion of the character in column idx.
 double getColSum(int col)
          Return the sum of the entries in the specified column
 double[] getColSumArray()
          Return an array containing the column sums
 double getDelCount()
          Return the number of Deletion errors.
 double getDelProb()
          Return the Deletion error probability.
 java.lang.String getHeader()
          Return a comma-delimited string identifying the columns.
 int getIndex(char c)
          Return the index of the specified character, or -1 if the character is not in the charDef array.
 double getInsCount()
          Return the number of Insertion errors.
 double getInsProb()
          Return the Insertion error probability.
 double[][] getMatrix()
          Return the error matrix.
 double[] getRow(int idx)
          Return an array containing the specified row
 double getRowDelCount(int idx)
          Return the Deletion error count for the specified row
 double getRowDelProb(int idx)
          Return the Deletion error probability for the specified row
 double getRowInsProb(int idx)
          Return the Insertion error probability for the specified row.
 double getRowSubCount(int idx)
          Return the Substitution error count for the specified row
 double getRowSubOtherCount(int idx)
          Return the Substitution 'other' error count for the specified row
 double getRowSubProb(int idx)
          Return the Substitution error probability for the specified row
 double getRowSum(int row)
          Return the sum of the specified row
 double[] getRowSumArray()
          Return an array containing the row sums
 double getRowTotCount(int idx)
          Return the total error count for the specified row
 double getRowTotProb(int idx)
          Return the error probability for the specified row
 double getSubCount()
          Return the number of Substitution errors.
 double getSubOtherCount()
          Return the number of Substitution errors where c2 == "Other"
 double getSubProb()
          Return the Substitution error probability.
 double getSum()
          Returns the sum of all entries in the matrix.
 java.lang.String getSymbol(int idx)
          Return a one-character String representing the symbol associated with the entries in a row or column.
 double getTotCount()
          Return the total number of errors
 double getTotProb()
          Return the total error probability
static void main(java.lang.String[] args)
          An application that uses the ErrorMatrix class.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

charDef

public char[] charDef
The default characters associated with the rows and columns in this ErrorMatrix.

The first entry is '_', representing the SPACE character. The last entry is '-', representing an Insertion error for rows, or a Deletion error for columns. The remaining entries are the characters appearing in the presented text phrases, namely, a-z.


rows

public int rows
An int representing the number of rows in the error matrix


columns

public int columns
An int representing the number of columns in the error matrix

Constructor Detail

ErrorMatrix

public ErrorMatrix()
Construct an ErrorMatrix using the default character set


ErrorMatrix

public ErrorMatrix(char[] custom)
Construct an ErrorMatrix using a custom character set. The last character in the array should be '-', representing an insertion error for rows and a deletion error for columns.

Method Detail

enter

public void enter(char c1,
                  char c2,
                  double count)
Enter the specified count for a c1-c2 character pair into this ErrorMatrix.

Parameters:
c1 - the presented character
c2 - the transcribed character
count - the amount to increment the corresponding cell by

getIndex

public int getIndex(char c)
Return the index of the specified character, or -1 if the character is not in the charDef array.


enter

public void enter(java.lang.String s1,
                  java.lang.String s2)
Enter the counts for a presented/transcribed text phrase into this ErrorMatrix. This method does the work of converting the presented/transcribed strings into a set of alignments, and then scanning the alignments character-by-character to determine the appropriate increment for each c1-c2 pair. Use this method in a loop until all phrases are entered into the matrix, or put the phrases in a StringPair array and use a single call to the one-arg version of enter.

Parameters:
s1 - the presented text string
s2 - the transcribed text string

enter

public void enter(StringPair[] sp)
Enter the counts for an array of presented/transcribed text phrases into this ErrorMatrix.

Parameters:
sp - an array of StringPair objects containing presented and transcribed text phrases.

getMatrix

public double[][] getMatrix()
Return the error matrix. Here it is, just in case you need this for some special snooping around!


getSum

public double getSum()
Returns the sum of all entries in the matrix. Due to the weightings, this sum is the total of the "alignment characters", not the total number of presented or transcribed characters. The return value also includes counts contained in the "other" array.


getCell

public double getCell(int row,
                      int col)
Return the contents of the specified cell


getHeader

public java.lang.String getHeader()
Return a comma-delimited string identifying the columns. The last entry is "OTHER".


getSymbol

public java.lang.String getSymbol(int idx)
Return a one-character String representing the symbol associated with the entries in a row or column.


getRow

public double[] getRow(int idx)
Return an array containing the specified row


getRowSum

public double getRowSum(int row)
Return the sum of the specified row


getRowSumArray

public double[] getRowSumArray()
Return an array containing the row sums


getRowTotCount

public double getRowTotCount(int idx)
Return the total error count for the specified row


getRowTotProb

public double getRowTotProb(int idx)
Return the error probability for the specified row


getColSum

public double getColSum(int col)
Return the sum of the entries in the specified column


getColSumArray

public double[] getColSumArray()
Return an array containing the column sums


getInsCount

public double getInsCount()
Return the number of Insertion errors.


getInsProb

public double getInsProb()
Return the Insertion error probability. This is the ratio of the number of Insertion errors to the total number of characters.


getRowInsProb

public double getRowInsProb(int idx)
Return the Insertion error probability for the specified row. The return value is 1.0 if the index is size - 1 (i.e., the row associated with Insertion errors). Otherwise, the return value is 0.0


getColInsCount

public double getColInsCount(int idx)
Return the number of insertions of the character in column idx


getColInsProb

public double getColInsProb(int idx)
Return the probability of an Insertion of the character in column idx. The probability is the ratio of the number of Insertion errors of the specified character to the total number of Insertion errors.
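
The ratio described here can be illustrated with a minimal stand-alone sketch. The class InsertionShareSketch and the method shares are hypothetical, and returning 0.0 for every character when there are no insertions is an assumption made for the sketch, not documented behaviour of ErrorMatrix.

```java
class InsertionShareSketch {
    // Divide each character's insertion count by the total number of insertions.
    static double[] shares(double[] insertionCounts) {
        double total = 0;
        for (double v : insertionCounts) total += v;
        double[] p = new double[insertionCounts.length];
        if (total == 0) return p;  // assumption: no insertions means all shares are 0.0
        for (int i = 0; i < p.length; i++) {
            p[i] = insertionCounts[i] / total;
        }
        return p;
    }

    public static void main(String[] args) {
        // Weighted insertion counts for c, e, h in the "quickly" / "qucehkly" example.
        double[] p = shares(new double[] { 0.25, 0.50, 0.50 });
        System.out.println("c: " + p[0] + "  e: " + p[1] + "  h: " + p[2]);
    }
}
```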


getSubCount

public double getSubCount()
Return the number of Substitution errors.


getSubProb

public double getSubProb()
Return the Substitution error probability. This is the ratio of the number of Substitution errors to the total number of characters.


getSubOtherCount

public double getSubOtherCount()
Return the number of Substitution errors where c2 == "Other"


getRowSubCount

public double getRowSubCount(int idx)
Return the Substitution error count for the specified row


getRowSubProb

public double getRowSubProb(int idx)
Return the Substitution error probability for the specified row


getRowSubOtherCount

public double getRowSubOtherCount(int idx)
Return the Substitution 'other' error count for the specified row


getDelCount

public double getDelCount()
Return the number of Deletion errors.


getDelProb

public double getDelProb()
Return the Deletion error probability. This is the ratio of the number of deletion errors to the total number of characters.


getRowDelCount

public double getRowDelCount(int idx)
Return the Deletion error count for the specified row


getRowDelProb

public double getRowDelProb(int idx)
Return the Deletion error probability for the specified row


getTotCount

public double getTotCount()
Return the total number of errors


getTotProb

public double getTotProb()
Return the total error probability


main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
An application that uses the ErrorMatrix class. Execute without command-line arguments to get a usage message.

Throws:
java.io.IOException