Power in acceptability judgment experiments and the reliability of data in syntax.
Jon Sprouse, Diogo Almeida
May 2012

There has been a consistent pattern of criticism of the reliability of acceptability judgment data in syntax for at least 50 years (e.g., Hill 1961), culminating in several high-profile criticisms within the past ten years (e.g., Edelman & Christiansen 2003, Ferreira 2005, Wasow & Arnold 2005, Featherston 2007, Gibson & Fedorenko 2010a, 2010b). One of the fundamental claims of these critics is that traditional acceptability judgment collection methods lead to an intolerably high number of false negative results (i.e., low statistical power), and that this can be remedied by the use of more formal methods of data collection. We empirically assessed this claim by conducting a series of experiments designed to derive comprehensive estimates of statistical power for different types of acceptability judgment experiments. We tested 47 phenomena (94 sentence types) randomly sampled from Linguistic Inquiry (2001-2010) and spanning a large range of effect sizes (Cohen’s d from 0.15 to 1.96), using all four major judgment tasks normally used in syntactic research (magnitude estimation, Likert scale, yes-no, and forced-choice), with four samples of 144 participants each. We then ran re-sampling simulations to empirically estimate statistical power for every combination of effect size, sample size (5-100), and task. The results provide the first comprehensive evaluation of statistical power in acceptability judgments, which can be used to (i) evaluate the statistical power of previously published studies, (ii) plan appropriately powered studies in the future, and most importantly, (iii) establish a common vocabulary for assessing whether the more traditional methods, however they are defined, can be seen as well-powered experiments in their own right. We discuss the relative power of the four types of experiments, the power of acceptability judgment experiments relative to experiments in other domains of psychology, and the empirical coverage of each experiment type.
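
The re-sampling approach to power estimation mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the per-participant difference scores are synthetic, sampling is done with replacement, the significance test is a two-sided exact sign test, and the names `sign_test_p` and `estimate_power` are hypothetical. The idea is simply to draw many samples of a given size from a large participant pool, test each sample, and report the proportion of significant results as the estimated power.

```python
import math
import random


def sign_test_p(diffs):
    """Two-sided exact sign test p-value for paired differences (zeros are dropped)."""
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # Probability of k or fewer minority-sign outcomes under a fair coin.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def estimate_power(population_diffs, sample_size, n_sims=1000, alpha=0.05, rng=None):
    """Proportion of re-samples (with replacement; an assumption) that reach significance."""
    rng = rng or random.Random()
    hits = 0
    for _ in range(n_sims):
        sample = rng.choices(population_diffs, k=sample_size)
        if sign_test_p(sample) < alpha:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic pool of 144 difference scores (condition A minus condition B).
    # The mean of 0.5 and SD of 1.0 are made-up values, not results from the paper.
    pool = [rng.gauss(0.5, 1.0) for _ in range(144)]
    for n in (5, 10, 25, 50, 100):
        print(n, estimate_power(pool, n, n_sims=500, rng=rng))
```

Repeating the loop over several pools with different effect sizes, and swapping in a test suited to each task, would give one power estimate per combination of effect size, sample size, and task. The sign test is used here only because it needs nothing beyond the standard library.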
keywords: acceptability judgments, syntactic theory, linguistic methodology, quantitative standards, experimental syntax, statistical power, syntax
previous versions: v3 [May 2012]
v2 [September 2011]
v1 [September 2011]
