1 Demographics

Participants in the experiment are Mechanical Turk workers holding a US bachelor’s degree and reporting a “job function” (though not necessarily a degree) in Information Technology.

1.1 Sex and Age

A total of 41 participants responded to the call, 13 female and 28 male. Their ages range from 23 to 69 (median 40) and, as per the recruitment qualification, all hold a bachelor’s degree from a US university.


Figure 1.1: Age Distribution By Gender.

Gender  Freq
Female  13
Male    28

Total Age Distribution:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.00   33.00   40.00   39.73   43.00   69.00

1.2 Field / Background

Of the 41 participants, 34 declare an academic or occupational background in Science, Technology and Engineering; 11 in Business and Economics; and 4 in Social Sciences, Humanities or Fine Arts (combinations were allowed).

Field                                Freq
Business and Economics               11
Fine Arts                            1
Science, Technology and Engineering  34
Social Sciences / Humanities         3

2 Treatment of Intention Models

Intention models are treated in the analysis in three different modes. The first is the flat mode, in which the various metrics consider all four concepts as-is. In the second and third modes, the concepts {goal, objective} are merged into one higher-level concept (called intention) and {claim, assertion} into another, called statement. These higher-level concepts, which are never revealed to participants, also allow us to devise authoritative classifications for all expressions, so as to obtain measures of accuracy. In the second mode, the between-concepts mode, measures of agreement and accuracy are based on whether agreement (respectively, accuracy) occurs at the level of the merged categories. For example, in this mode, classification of an expression as goal by one participant and as objective by another constitutes an agreement, as both categories are intentions; accuracy is understood similarly. In the third mode, the within-concepts mode, measures of agreement are based on whether agreement is observed within each high-level category, after excluding classifications that do not belong to that category. For example, an expression authoritatively classified as intention may be classified as goal by 6 participants, as objective by 4, and as claim by 1. We measure agreement between the 6 and the 4 participants, excluding the one response that falls outside the intention category. If more than one classification needs to be excluded, however, the entire expression is excluded from the analysis.
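The merging and exclusion rules above can be sketched in R; the mapping vector and example responses below are hypothetical, chosen to mirror the 6/4/1 example in the text:

```r
# Map the four low-level concepts to the two merged higher-level categories
concept.map <- c(Goal = "Intention", Objective = "Intention",
                 Claim = "Statement", Assertion = "Statement")

# Hypothetical responses for an expression authoritatively classified as Intention
responses <- c(rep("Goal", 6), rep("Objective", 4), "Claim")

# Between-concepts mode: recode each response to its merged category
merged <- unname(concept.map[responses])

# Within-concepts mode: drop responses outside the authoritative category,
# but discard the whole expression if more than one must be dropped
outside <- sum(merged != "Intention")
if (outside <= 1) {
  within <- responses[merged == "Intention"]  # the 6 goals and 4 objectives
}
```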

3 Randomness Analysis

3.1 Multinomial Tests

For each description, we perform a simple multinomial test against the hypothesis that participants classify randomly, i.e. choose one of the four options as if rolling a die. For goal models the random hypothesis is frequently rejected: for 64% of descriptions the null is rejected with \(p\leq 0.05\), and for 79% of the cases with \(p\leq 0.2\), where \(p\) is the probability of acquiring the responses through a uniform random process (die rolling). Performing a similar test for intention models in the between-concepts mode, we reject the uniform hypothesis with \(p\leq 0.05\) in 80% of the descriptions, versus 9% in the within-concepts mode. These patterns follow our expectations and offer some initial evidence in support of instrument reproducibility.

The following table shows the exact proportions of descriptions for each probability threshold.

Language                     Descriptions with p<0.05   Descriptions with p<0.2
Goal Models                  0.6410256                  0.7948718
Intention Models - Flat      0.7843137                  0.9019608
Intention Models - Between   0.8039216                  0.8725490
Intention Models - Within    0.0853659                  0.2317073

3.2 Calculation Note

The test used is xmulti(responses, probabilityVector) from the XNomial package, with the LLR method for calculating the probability, as per the package author’s advice. The example below gives the between-concepts call pattern, in which the first argument is the vector of response counts.

xmulti(as.integer(
  table(factor(c("Intention","Intention","Belief","Intention"),
               levels = c("Belief","Intention")))),
  c(0.5,0.5))
## 
## P value (LLR) = 0.625

The call for within-concepts randomness is similar to the one below; note that responses belonging to the other category are removed.

xmulti(as.integer(
  table(factor(c("Goal","Goal","Objective","Objective"),
               levels = c("Goal","Objective")))),
  c(0.5,0.5))
## 
## P value (LLR) = 1
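For the flat mode, the same call pattern applies with all four categories and a uniform probability vector; a sketch with hypothetical responses:

```r
library(XNomial)

# Flat mode: all four concepts kept separate, uniform null probabilities
responses <- factor(c("Goal", "Goal", "Objective", "Claim", "Goal", "Assertion"),
                    levels = c("Assertion", "Claim", "Goal", "Objective"))
xmulti(as.integer(table(responses)), rep(0.25, 4))
```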

4 Agreement Exploration

4.1 Agreement per Concept (GpC)

The graph below shows GpC measures per expression set, considering the three modes of intention models.


Figure 4.1: Agreement index per Concept, Language, Description and Focus.

4.2 Total Agreement Comparison (GT)

We further calculate the GT (total agreement) that each language produces, expecting that both intention models in between-concepts mode and goal models will offer higher agreement than intention models in within-concepts mode. The result is seen below: participants agree substantially in (indirectly) distinguishing between intentions and statements in intention models – an effect that can again explain the moderate level of agreement in the flat mode – but disagree in classifications within those concepts. Goal models lie in the middle of the range \(m(sd) = 0.41(0.25)\). We also notice the sizeable standard deviations, indicating sensitivity to the choice of expression.


Figure 4.2: Average Agreements

The numbers are as follows:

Language                     mean       sd         n    min       max
Goal Models                  0.4085470  0.2469132  78   0.040000  1.00
Intention Models - Flat      0.3522195  0.1529430  102  0.030303  0.76
Intention Models - Between   0.7519981  0.3090908  102  0.000000  1.00
Intention Models - Within    0.1785445  0.2091792  102  0.000000  1.00
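A simple per-expression agreement index is the proportion of agreeing response pairs; the sketch below shows how such an index could be computed, though the exact index used for GT in the paper may differ:

```r
# Proportion of agreeing pairs of responses for a single expression
pairwise.agreement <- function(responses) {
  counts <- table(responses)
  sum(choose(counts, 2)) / choose(sum(counts), 2)
}

# Hypothetical responses: 3 of 10 pairs agree
pairwise.agreement(c("Goal", "Goal", "Goal", "Task", "Quality"))  # 0.3
```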

4.3 Focus on Goal Models

As will also become apparent in the ApC measures, in goal models, expressions designated as goals do not generate much agreement.

Authoritative  Agreement
Belief         0.4613333
Goal           0.3120000
Quality        0.5155556
Task           0.5000000

5 Accuracy Analysis

5.1 Accuracy per Participant - Descriptives

Accuracy measures can be meaningfully compared between goal models and intention models in between-concepts mode, as there is no authoritative response in the within-concepts mode. The data can be seen in the boxplot and table below.

Language          Set            ApP.mean   ApP.sd     ApP.median
Goal Models       Archimate      0.5558824  0.1602163  0.6176471
Goal Models       Comprehension  0.7950000  0.2038446  0.8500000
Goal Models       iStar          0.5875000  0.1862900  0.6666667
Intention Models  Archimate      0.9159664  0.0994299  0.9411765
Intention Models  Authoritative  0.9571429  0.1121224  1.0000000
Intention Models  Comprehension  0.9841270  0.0566363  1.0000000
Intention Models  iStar          0.8293651  0.1249338  0.8333333

Language          Version  ApP.mean   ApP.sd     ApP.median
Goal Models       A        0.6173203  0.2297594  0.6470588
Goal Models       B        0.6749346  0.1885329  0.6568627
Intention Models  A        0.8914706  0.1372761  0.9705882
Intention Models  B        0.9490865  0.0847610  1.0000000

5.2 Accuracy Per Participant (ApP) - ANOVA Analysis

For the statistical analysis, two separate repeated-measures MANOVA analyses – 2x3 for goal models and 2x4 for intention models in between-concepts mode – are run. Description Set and Version are treated as the within- and between-subjects factors, respectively.

5.2.1 2x3 for Goal Models

We first test some assumptions for MANOVA.

## Samples per cell for Goal Models:
## # A tibble: 2 x 3
## # Groups:   Language [1]
##   Language    Version     N
##   <chr>       <chr>   <int>
## 1 Goal Models A          10
## 2 Goal Models B          10
## Shapiro's Multivariate Normality Test for Goal Models: PASSES
## # A tibble: 1 x 2
##   statistic p.value
##       <dbl>   <dbl>
## 1     0.955   0.447
## Multicollinearity Inspection for Goal Models
## # A tibble: 3 x 4
##   rowname       Archimate iStar Comprehension
## * <chr>             <dbl> <dbl>         <dbl>
## 1 Archimate          1     0.54          0.48
## 2 iStar              0.54  1             0.44
## 3 Comprehension      0.48  0.44          1
## Box's test for Multivariate Homoscedasticity: PASSES
## # A tibble: 1 x 4
##   statistic p.value parameter method                                            
##       <dbl>   <dbl>     <dbl> <chr>                                             
## 1      10.5   0.104         6 Box's M-test for Homogeneity of Covariance Matric~

They seem to support the application of parametric MANOVA. Let us have a glimpse at the data.

head(ApP.plain.goals,4)
## # A tibble: 4 x 6
##   Participant                    Language  Version Archimate Comprehension iStar
##   <chr>                          <chr>     <chr>       <dbl>         <dbl> <dbl>
## 1 s.0f4fac34-3808-451f-93a1-b0e~ Goal Mod~ A           0.706           0.8 0.917
## 2 s.11d36ace-0607-4c23-90bf-879~ Goal Mod~ A           0.294           0.4 0.25 
## 3 s.3938fafc-a086-405d-b5d5-abb~ Goal Mod~ B           0.529           0.7 0.75 
## 4 s.3b43a420-5578-4775-b23f-4f0~ Goal Mod~ B           0.647           0.6 0.333
#
# PERFORMING PARAMETRIC MANOVA FOR GOALS
#
model.goals <- lm(cbind(Archimate,iStar,Comprehension) ~ Version, ApP.plain.goals)
Set <- ordered(c("Comprehension","iStar","Archimate"),
               levels = c("Comprehension","iStar","Archimate"))
idata <- data.frame(Set)
aov.Res <- Manova(model.goals, idata=idata, idesign=~Set, type="III")
print(aov.Res,test = "Pillai")
## 
## Type III Repeated Measures MANOVA Tests: Pillai test statistic
##             Df test stat approx F num Df den Df    Pr(>F)    
## (Intercept)  1   0.90434  170.157      1     18 1.302e-10 ***
## Version      1   0.03954    0.741      1     18    0.4006    
## Set          1   0.41492    6.028      2     17    0.0105 *  
## Version:Set  1   0.01947    0.169      2     17    0.8461    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We note that the absence of an effect for Version could be a Type II error, due to our low power; we can only remain inconclusive.

We observe an effect of Expression Set. To find out where exactly, we perform pairwise post-hocs.

# Where is the set effect?
ApP.plain.goals.long <- ApP.plain.goals %>%
  pivot_longer(cols = c(Archimate, Comprehension, iStar),
               names_to = "Set", values_to = "Measure")
ApP.plain.goals.long %>%
  wilcox_test(Measure ~ Set, paired = TRUE, p.adjust.method = "bonferroni")
## # A tibble: 3 x 9
##   .y.    group1     group2       n1    n2 statistic       p   p.adj p.adj.signif
## * <chr>  <chr>      <chr>     <int> <int>     <dbl>   <dbl>   <dbl> <chr>       
## 1 Measu~ Archimate  Comprehe~    20    20         3 1.49e-4 4.47e-4 ***         
## 2 Measu~ Archimate  iStar        20    20        76 2.87e-1 8.61e-1 ns          
## 3 Measu~ Comprehen~ iStar        20    20       194 9.47e-4 3.00e-3 **

The comprehension set is the first one presented to the users; it does not include a context (a description based on a scenario and a persona) and contains some expressions from authoritative examples of the iStar 2.0 guide.

5.2.2 2x4 for Intention Models

We perform similar assumption tests for intention models. These fail, apparently due to ceiling effects in the data.

head(ApP.plain.int,4)
## # A tibble: 4 x 7
##   Participant      Language  Version Archimate Comprehension iStar Authoritative
##   <chr>            <chr>     <chr>       <dbl>         <dbl> <dbl>         <dbl>
## 1 s.211304bc-157a~ Intentio~ B           1                 1 0.917             1
## 2 s.2225a442-9977~ Intentio~ B           0.824             1 0.833             1
## 3 s.2e38d39c-6ca4~ Intentio~ B           1                 1 0.917             1
## 4 s.2efc9f6e-ff0e~ Intentio~ A           0.824             1 0.833             1
## Samples per cell for Intention Models:
## # A tibble: 2 x 3
## # Groups:   Language [1]
##   Language         Version     N
##   <chr>            <chr>   <int>
## 1 Intention Models A          10
## 2 Intention Models B          11
## Shapiro's Multivariate Normality Test for Intention Models: FAILS
## # A tibble: 1 x 2
##   statistic      p.value
##       <dbl>        <dbl>
## 1     0.374 0.0000000168
## Multicollinearity Test for Intention Models: some evidence
## # A tibble: 4 x 5
##   rowname       Archimate iStar Authoritative Comprehension
## * <chr>             <dbl> <dbl>         <dbl>         <dbl>
## 1 Archimate         1      0.25        -0.075        -0.031
## 2 iStar             0.25   1            0.49          0.33 
## 3 Authoritative    -0.075  0.49         1             0.87 
## 4 Comprehension    -0.031  0.33         0.87          1
## Box's test for Multivariate Homoscedasticity: Nope..
## # A tibble: 1 x 4
##   statistic p.value parameter method                                            
##       <dbl>   <dbl>     <dbl> <chr>                                             
## 1       Inf       0        10 Box's M-test for Homogeneity of Covariance Matric~

Given the above problems, we resort to a non-parametric test for the within factor and treat the version factor separately, remaining cognizant of the error inflation.

## Friedman's Test of within factors
## # A tibble: 1 x 6
##   .y.       n statistic    df           p method       
## * <chr> <int>     <dbl> <dbl>       <dbl> <chr>        
## 1 ApP      21      30.8     3 0.000000920 Friedman test
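The test above can be reproduced with rstatix’s friedman_test on long-format data; the data frame and column names below are assumptions based on the earlier chunks:

```r
library(rstatix)

# Friedman test of the within factor (Set), blocking on Participant
ApP.plain.int.long %>%
  friedman_test(ApP ~ Set | Participant)
```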

We, again, perform pairwise post-hocs:

## Wilcox pairwise post-hocs
## # A tibble: 6 x 9
##   .y.   group1      group2         n1    n2 statistic       p p.adj p.adj.signif
## * <chr> <chr>       <chr>       <int> <int>     <dbl>   <dbl> <dbl> <chr>       
## 1 ApP   Archimate   Authoritat~    21    21       14  5.40e-2 0.325 ns          
## 2 ApP   Archimate   Comprehens~    21    21       12  3.60e-2 0.219 ns          
## 3 ApP   Archimate   iStar          21    21      144  1.20e-2 0.069 ns          
## 4 ApP   Authoritat~ Comprehens~    21    21        1  5.20e-2 0.314 ns          
## 5 ApP   Authoritat~ iStar          21    21      144. 2.00e-3 0.009 **          
## 6 ApP   Comprehens~ iStar          21    21      153  2.85e-4 0.002 **

The iStar example appears to induce noticeably lower accuracy.

## # A tibble: 1 x 7
##   .y.   group1 group2    n1    n2 statistic      p
## * <chr> <chr>  <chr>  <int> <int>     <dbl>  <dbl>
## 1 ApP   A      B         40    44       680 0.0458
## Wilcox pairwise simple effects post-hocs for version
## # A tibble: 4 x 8
##   Set           .y.   group1 group2    n1    n2 statistic      p
## * <chr>         <chr> <chr>  <chr>  <int> <int>     <dbl>  <dbl>
## 1 Archimate     ApP   A      B         10    11      48   0.626 
## 2 Authoritative ApP   A      B         10    11      27.5 0.0105
## 3 Comprehension ApP   A      B         10    11      54   0.945 
## 4 iStar         ApP   A      B         10    11      29   0.0676

5.3 Accuracy Per Participant (ApP) - Interaction Graph

5.4 Accuracy Per Concept (ApC)

For goal models, for expressions that the designers would classify as goals and tasks, slightly less than half and 60% of the participants, respectively, agree on average. The graph shows the normalized figures, 0.3 and 0.47 respectively, which are lower as they account for randomness. The result indicates that expressions authoritatively classified as goals and tasks may tend to be classified otherwise by participants. For intention models the measures are higher, despite normalization.
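The reported normalized figures are consistent with the standard chance correction \((a - 1/k)/(1 - 1/k)\) for \(k = 4\) equiprobable categories; whether this is the exact normalization used here is an assumption:

```r
# Chance-corrected accuracy for k equiprobable categories (an assumption
# consistent with the reported figures)
normalize <- function(a, k = 4) (a - 1/k) / (1 - 1/k)

normalize(0.475)  # 0.3   (goals: slightly less than half)
normalize(0.60)   # ~0.47 (tasks)
```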

A point of caution also concerns the use of accuracy measures for comparing conceptualizations of different sizes: the proposed normalizations allow for rough qualitative comparisons, but a theory of such comparisons is yet to be developed. For instance, the differences between goal and intention models observed above do not trivially imply that the latter inspire more accuracy (i.e. should be preferred). Notice that this comparison is non-trivial and has theoretical importance: is a decrease in conceptual granularity always (as a law) accompanied by an increase in accuracy and agreement, and, if yes, how do we control for this increase for a truly fair comparison? This remains to be explored.

6 Overlap Analysis

We use a heatmap-style visualization we call concept overlap maps to visually explore overlaps between concepts. The higher the value of a tile (and the darker its shade), the greater the overlap. The visualizations below are produced per expression set for each model separately, and combined for all expressions, as reported in the paper.
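Such a map can be rendered with ggplot2’s geom_tile; a minimal sketch, in which the data frame and its overlap values are hypothetical:

```r
library(ggplot2)

# Hypothetical long-format overlap data: one row per concept pair
overlap.df <- data.frame(
  Concept1 = c("Goal", "Goal", "Task"),
  Concept2 = c("Task", "Quality", "Quality"),
  Overlap  = c(0.45, 0.20, 0.15)
)

ggplot(overlap.df, aes(x = Concept1, y = Concept2, fill = Overlap)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "grey20")  # darker = more overlap
```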

6.1 Goal Models

6.1.1 Concept Overlap Map - Goal Models

6.1.2 Alternate Visual

6.2 Intention Models

6.2.1 Concept Overlap Map - Intention Models

6.2.2 Alternate Visual

6.3 Comparison between Models

The visualizations below, reported in the paper, compare the two languages; intention models are considered in flat mode. Starting with intention models, the categories within intentions and statements exhibit substantial overlap compared to other pairs, as strongly expected. Goal models, on the other hand, show that goals overlap with tasks and, less so, with qualities. This overlap seems to explain the lack of ApC that goals exhibited.

7 Instrument Consistency Analysis

In this part of the exploration, we examine whether observed measures (e.g. ApC) and reported measures are consistent with each other.

7.1 Accuracy vs. Intuition (Goal Models only)

Intuitiveness seems to be independent of accuracy or agreement measures (ApC, GpC). For example, goals are thought to be more intuitive than qualities (\(0.76(0.15)\) vs. \(0.63(0.3)\)), though expressions of quality are more accurately identified (\(0.63(0.18)\) vs. \(0.3(0.2)\)). Beliefs, on the other hand, are both accurately identified and deemed intuitive.

7.2 Measured vs. Perceived Overlap - Goal Models

We perform a comparison between observed and self-reported overlap between concepts. Overlap can be expressed by pairs (VpI) or by concepts (VpC), by aggregating the overlap measures that a concept participates in. The graphs below plot the two measures per expression set and then as an aggregate over all sets.

7.2.1 Observed Overlap vs. Reported Overlap - Per Concept

7.2.2 Observed Overlap vs. Reported Overlap - Per Pair

Similar overlaps are observed when viewed in pairs.

7.3 Measured vs. Perceived Overlap - Intention Models

7.3.1 Observed Overlap vs. Reported Overlap - Per Concept

7.3.2 Observed Overlap vs. Reported Overlap - Per Pair

Similar overlaps are observed when viewed in pairs.

7.4 Comparison Between Languages

A more concise picture of perceived overlap and its relationship to observed overlap is seen below. Three points to observe in the graphs are: (a) the successful detection by participants of the contrived overlaps of the intention models (goal-objective and claim-assertion), (b) the successful detection of overlaps between tasks and goals, and (c) some tendency for observed and reported measures to vary synchronously, except in the case of VpC for intention models, in which all concepts have a relatively equal share of participation in an overlap.

For display quality in the paper, we manually move one of the labels slightly, as follows:

allOverlap.pC.compare.dodge <- allOverlap.pC.compare %>%
  mutate(Reported = ifelse(Concept == "Quality", Reported - 0.002, Reported))

For displaying pairs, more adjustments are needed to avoid overlapping labels.

allOverlap.pI.compare.dodge <- allOverlap.pI.compare %>%
  mutate(Reported = ifelse(Pair == "Quality-Task", Reported - 0.025, Reported)) %>%
  mutate(Observed = ifelse(Pair == "Quality-Task", Observed - 0.03, Observed)) %>%
  mutate(Reported = ifelse(Pair == "Task-Belief", Reported + 0.025, Reported)) %>%
  mutate(Observed = ifelse(Pair == "Goal-Quality", Observed + 0.015, Observed)) %>%
  mutate(Reported = ifelse(Pair == "Objective-Assertion", Reported + 0.02, Reported))

8 Intuitiveness Responses

With the evidence at hand that intuitiveness does not seem to correlate with classification behavior (i.e. participants may find a concept intuitive without being able to recognize its instances, and vice versa), we explore whether there are differences between the languages on the intuitiveness measure per se.

8.1 Descriptives

8.1.1 Bar-plot for the responses

8.1.2 Alternate Visualization

8.2 Statistical Test

There is substantial debate with regard to what kind of statistical analysis Likert-type scales afford. We here take the conservative route and use the Wilcoxon rank sum test (otherwise known as the Mann-Whitney test), testing the null hypothesis “that the distributions of the [two sets of responses] differ by a location shift of mu and the alternative is that they differ by some other location shift”; in our case we leave mu at its default \(mu=0\).

intu %>% group_by(Language) %>% summarise(Mean = mean(Response),`St. Dev.` = sd(Response),n = n())
## # A tibble: 2 x 4
##   Language          Mean `St. Dev.`     n
##   <fct>            <dbl>      <dbl> <int>
## 1 Goal Models       3.86      0.951    80
## 2 Intention Models  3.21      1.14     84
wilcox.test(data = intu, Response ~ Language,paired = FALSE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Response by Language
## W = 4385.5, p-value = 0.0003249
## alternative hypothesis: true location shift is not equal to 0

9 Summary

  • Tests for deviation from uniform randomness reveal that participants answer deliberately or randomly where each is respectively expected.
  • Measures of agreement produce a signal consistent with our intuition about the input languages: there is more agreement within reasonably distinct concepts and least agreement within pairs of synonyms.
  • Accuracy measures offer insights for goal models that do not counter our intuition: the goal concept is recognized less frequently by participants.
  • The concept overlap maps, based on the introduced overlap measure, allow us to quickly identify in favor of which other concepts goals are not chosen – tasks and, less so, qualities, as it turns out – in agreement with our intuition.
  • Per-participant aggregation of accuracy allows meaningful analysis of the effects of expression set and version, the former showing an effect that is not directly explainable and the latter requiring more study.
  • Observed measures of overlap seem to match both expectation and self-reported measures (pending a follow-up study to show this statistically).
  • Per-concept intuitiveness rating is not a predictor of accuracy with respect to the concept.
  • Per-concept intuitiveness rating may be used in aggregate for assessment of the overall conceptualization.