Introduction

New York State, like many other states in the United States, recently underwent one of the most significant and massive educational reforms in history (Hursh, 2013; Leonardatos and Zahedi, 2014; Isaacs, 2014; Mitchell and Purchase, 2014), namely the intensive shift to the Common Core Learning Standards (CCLS), followed by the shift to the Next Generation Learning Standards, in New York (The New York State Education Department [NYSED], 2013). As part of the process, local educational agencies (LEAs) are required to use state assessment scores, whenever possible, for all teachers and principals as part of the annual professional performance review (APPR) process (NYSED, 2011). As articulated in the policy, the results of the APPR can be used to make employment decisions (NYSED, 2015a). The current and future APPR of teachers and principals relies heavily on the performance of students on the large-scale Grades Three through Eight New York State Testing Program (NYSTP), which focuses mostly on the content areas of English Language Arts (ELA) and Mathematics.

For ELA and mathematics state testing, essay and constructed response questions have become as important as multiple-choice questions. Students are required to write essays or compositions for the ELA tests and to answer constructed response questions for the mathematics tests. Unlike the assessment of multiple-choice questions, the assessment of essay and constructed response questions, which usually involves human raters, has long been a problematic area in large-scale standardized assessments around the globe (Huang, 2008, 2012; Huang et al. 2023; Weigle, 2002; Zhao and Huang, 2020). Research shows that several factors related to human raters can affect the scoring variability and reliability of essay and constructed response questions (Barkaoui, 2011; Huang, 2009; Huang et al. 2023). These factors include, but are not limited to, a) raters’ professional backgrounds (e.g., Huang and Foote, 2010; Weigle et al. 2003); b) the scoring methods that raters use (e.g., Han and Huang, 2017; Huang et al. 2023; Li and He, 2015); c) the scoring criteria that raters use (e.g., Huang, 2009; Huang et al. 2023; Weigle et al. 2003); d) raters’ tolerance for grammatical errors (e.g., Huang, 2009; Janopoulos, 1992); e) rater training (e.g., Weigle, 1998, 2002); and f) the number of raters used in the scoring process (e.g., Liu and Huang, 2020; Lee et al. 2002; Zhao and Huang, 2020).

Generally, the assessment variability and reliability of essay and constructed response questions is an area of major concern across the international measurement community, given the myriad possible interpretations of the written work produced by examinees (Huang, 2009, 2011, 2012; Huang et al. 2023; Gamaroff, 2000). New York State’s implementation of No Child Left Behind (NCLB) and Race to the Top (RTTT) demands that each constructed response task or question be holistically scored only once by a single rater (NYSED, 2015d). Such a practice raises concerns about the score reliability of the constructed response questions in the NYSTP ELA and Mathematics Assessments (American Educational Research Association [AERA], American Psychological Association [APA], National Council on Measurement in Education [NCME], 2014). Using generalizability (G-) theory (Cronbach et al. 1972) as a research method, this study examined the impact of the current one-rater holistic scoring practice on rater variability and reliability in the NYSTP grades four and six ELA and grades four and five mathematics assessments. As the accountability stakes of large-scale standardized assessments increase, it is important for educational assessment policy makers to ensure that such assessments are highly reliable and accurate. Therefore, the findings of this study have critical implications for educational assessment policy makers in the state of New York, other states in the United States, and other countries worldwide.

The NYSTP

In 1999, New York State administered the grade four and grade eight tests in ELA and mathematics for the first time; these assessments stand as the grandparents to the accountability assessments of today (NYSED, n.d.). These norm-referenced assessments, which predominated between 1999 and 2005, focused on measuring students’ basic skills, or minimum competencies, in ELA and mathematics (NYSED, n.d.). Although 2006 ushered in a new era for the NYSTP, with annual grades three through eight assessments of ELA and mathematics administered in compliance with the expectations of NCLB, the rigor of these exams did not change from that of their predecessors, as basic skills were still the emphasis.

For the 2010 assessments, the NYSED changed the cut scores in the interest of providing a more accurate picture of achievement relative to the National Assessment of Educational Progress (NAEP), as well as the newly adopted expectations of the CCLS and the notion of college- and career-readiness (NYSED, 2013). As a result, scores, even on the basic skills assessments, plummeted. With the implementation of the newest iteration of the grades three through eight assessments in ELA and mathematics, which are based on New York State’s CCLS, the purpose of the assessments changed dramatically, from basic skills to the enhanced levels of rigor and depth described by the grade-level expectations of the CCLS and the standards’ reflection of college- and career-readiness (NYSED, 2015a, 2015b). Each guide, for each content area, describes the assessments as “more advanced and more complex than […] prior assessments” (NYSED, 2015a, p. v). Again, this change in language represents a marked shift in the purpose of the NYSTP from 1999 to today.

In 2015, each of the NYSTP assessments, for both ELA and mathematics, was administered over three days; in general, each testing session allotted between fifty and ninety minutes for administration, depending on the grade level (NYSED, 2015a, 2015b). No student without an individualized education plan (IEP), 504 plan, or English as a second language (ESL) designation was afforded the opportunity to work beyond the articulated time for the content area and grade level (NYSED, 2015a, 2015b).

Regarding the composition of the assessments, each contained a mixture of multiple-choice questions and constructed response questions, as well as a range of texts for ELA. The number of multiple-choice and constructed response items varied based on the content area and grade level (NYSED, 2015a, 2015b, 2015c, 2015d). For example, the grade three ELA assessment included eleven reading passages, a mixture of literary and informational texts, thirty-seven multiple-choice items, and ten constructed response items, two of which were considered extended or essay responses (NYSED, 2015a, 2015b). In contrast, the grade eight ELA assessment increased the number of passages by one and added twelve multiple-choice questions; however, the constructed response total and composition remained the same (NYSED, 2015a, 2015b). Mathematics followed a similar pattern. The grade three mathematics assessment comprised forty-eight multiple-choice questions and six constructed response questions, while the grade eight mathematics assessment increased the totals to fifty-five multiple-choice questions and ten constructed response questions (NYSED, 2015a, 2015b).

Immediately following the NYSTP assessments, there is a very specific window during which assessments must be scored, submitted to the regional information center for scanning, and the scoring data sent along to the NYSED through official data reporting systems (NYSED, 2015e). To be eligible to score or rate the constructed response questions on an NYSTP assessment, a potential scorer needs to be qualified: a teacher or other qualified educator, representing grades three through eight, and designated by a school principal (NYSED, 2015e). As part of the scoring process, table facilitators are needed to monitor the scoring process, as well as to train or assist in training scorers; table facilitators are expected to be experienced scorers. An experienced scorer is defined as an individual having experience with the use of rubrics to evaluate student performance (NYSED, 2015e). Scoring leaders, at any level of site used for scoring (e.g., one district, multiple districts, or a regional site, usually through the leadership of a board of cooperative educational services), facilitate the training and learning regarding rating processes (NYSED, 2015e).

As part of the training, anyone who could possibly rate, whether table facilitator or scorer, is taught the ins and outs of the NYSTP rating process. Each possible rater is given a training set, which includes anchor papers for each possible score; a practice set, on which to practice rating under the guidance of the scoring leader or table facilitator; and a consistency assurance set, which scorers use to test their ability to rate accurately. The person leading the training has the answers and reviews the results of the consistency assurance set for reliability prior to handing live papers over to the rater (NYSED, 2015d). Each of the constructed response questions is scored holistically by a single rater, as demanded by the NYSED scoring leader handbook (multiple-choice questions are answered on bubble sheets that are scored at the scanning center) (NYSED, 2015d). As a check and balance, table facilitators are expected to do “read behinds” as a method for ensuring accuracy; should the table facilitator have concerns with a scorer, he or she should work with the scorer and the training materials to re-anchor the scorer through use of the state-provided training papers (NYSED, 2015d). However, neither a table facilitator nor a scoring leader should change the score on a paper once it is rated and marked, as each constructed response question may only be scored once (NYSED, 2015d).

The NYSTP results are shared publicly via official release from the New York State Commissioner of Education’s Office sometime after the administration; the 2015 results were released on August 12, 2015, for assessments administered in April 2015 (NYSED, 2015c). In terms of the quantitative results, each student’s raw scores are converted to scale scores using a combination of item response theory and number-correct scoring (NYSED, 2015d). The scale scores are then converted into performance levels and percentile ranks; in the media, only aggregate performance levels by school and/or district are reported (NYSED, 2015c). Each individual school or district receives a score report per assessment, per student, which is sent home through traditional communication means (NYSED, 2015c).

The importance of this study is clear, as there are few empirical studies relative to the validity and reliability of the New York State Testing Program as an accountability measure. As Grant (2000) noted, “the professional literature is replete with debates about tests as a means for accountability, as measures of performance, and as levers for change” (p. 4). However, the referenced literature focuses more on country-wide, or even global, aspects of implementing testing regimes as policy levers, as opposed to critically investigating the implementation and impact of the NYSTP. While Grant’s investigation studied teachers’ perceptions of the New York State Testing Program, and, in particular, how teachers felt about the changes in the level of expectation reflected in the tests, there is no reference to the rating of the exams and the critical role quality rating plays in the validity and reliability of the NYSTP results. As noted above, with the increased demand for students to show critical thinking by completing more, and more varied, constructed response items, it is paramount to consider the impact that rating and raters have on accountability assessment results. This study attempts to fill this void in the professional literature, specifically as it pertains to New York State.

Reliability as a major indicator of educational assessment quality

There is considerable consensus regarding the critical and foundational indicators of educational assessment quality. Led primarily by the joint standards of AERA, APA, and NCME (2014), a vast array of educational researchers reports and supports the document’s position regarding the importance of reliability, validity, and fairness to the overall quality of educational assessments (AERA et al. 2014; Popham, 2010).

In general, reliability is synonymous with the notion of consistency as it relates to educational assessment (AERA et al. 2014; Popham, 2011). Consistency, both across and within tests and testing administrations, as well as between and among raters, receives critical evaluation when ensuring that inferences based on assessment results are as accurate as possible. As Popham (2010) argued, “reliable tests can provide evidence from which valid or invalid score-based inferences can be made, but valid score-based inference cannot be made from unreliable tests” (p. 56). Here, Popham (2010) clearly articulates the maxim regarding reliability that is generally accepted by psychometricians and others involved in educational and psychological testing: valid inferences presume reliability.

Within the framework of classical test theory (CTT), there are three predominant forms of reliability: a) stability reliability, b) equivalency reliability, and c) internal consistency reliability (Popham, 2010, 2011). In the CTT framework, a reliability coefficient can be generated for each type of reliability; the closer the coefficient is to 1.0, the more reliable the assessment (Brennan, 2001; Cronbach et al. 1972). The purpose of the generated coefficients is to estimate the level of stability and consistency of test scores for the individuals or groups taking the assessment (Popham, 2010, 2011). More fundamentally, the goal of reliability examination is to estimate the magnitude of the random error that contributes to a test score, an estimate that works in concert with test development and administration to improve the consistency of the scores from which inferences will be drawn (Brennan, 2001; Cronbach et al. 1972).
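In standard CTT notation (provided here as a reference point rather than drawn verbatim from the cited sources), this coefficient can be written as the proportion of observed score variance attributable to true scores, \(\rho_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}\), where \(\sigma_T^2\) is the true score variance and \(\sigma_E^2\) is the random error variance; as error shrinks relative to true differences among examinees, the coefficient approaches 1.0.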

Other important types of reliability concern the interaction of raters and the assessment. Within the framework of G-theory (Cronbach et al. 1972), reliability is defined as the ratio of the universe score variance to the expected observed score variance, and the G-coefficient is analogous to the reliability coefficient in the CTT framework (Brennan, 2001; Cronbach et al. 1972). Perhaps the most pervasive concern of assessment scholarship is the concept of inter-rater and intra-rater reliability as they relate to performance assessments, including the scoring of essay and constructed response questions (Brennan, 2001; Huang, 2008, 2012; Huang et al. 2023). Inter-rater reliability indicates the extent to which independent raters obtain the same result when using the same rating criteria to rate an examinee’s performance, and intra-rater reliability indicates the extent to which the same rater obtains the same result on the same assessment on two or more occasions (Brennan, 2000, 2001). Further, error stemming from the clarity of rubrics and student expectations, criteria that shift during rating, and raters’ lenient or stringent tendencies significantly impacts inter-rater reliability (Brennan, 2001; Huang, 2009). As a result, the testing community responded with increased support for holistic scoring, when reliability coefficients reached an acceptable threshold, as a means of mitigating errors attached to inter-rater reliability (Brennan, 2001; Huot, 1990).

In the context of large-scale standardized performance assessments, where examinees are often asked to complete a specific task, e.g., an essay or constructed response question, and their written work is scored by human raters according to previously established rating criteria (Popham, 2011), inter-rater and intra-rater reliability have become very important (Huang, 2008, 2012; Huang et al. 2023; Huang and Foote, 2010; Li and Huang, 2022). Rating reliability is essential to sound performance assessments because it indicates the rating precision of examinee performance (Brennan, 2001; Popham, 2010).

The impact of holistic scoring and the number of scorers on performance assessment

Direct writing assessments, predominantly in the form of short or essay constructed response items, stand as one of the most ubiquitously applied approaches to assessing critical curricular aims within the accountability framework across the United States (Huot, 1990; Popham, 2011). There is an inherent subjectivity in the writing and rating of constructed response assessment items, regardless of whether the items are short answer or essay questions, given that these items demand that a judgment be made. As a result, factors such as the method of holistic scoring and the number of raters reflect potential sources of variance affecting the consistent, accurate, and fair interpretation of assessment results (Barkaoui, 2011; Weigle, 2002).

Holistic and analytic rating methods stand as commonly accepted and employed options for scoring the types of essay and constructed response items innate to current North American accountability assessments (Barkaoui, 2011; Weigle, 2002). Holistic scoring is predicated on the rater employing a single rating scale to generate an overall impression regarding the quality of the writing sample and derive an evaluative grade, for example, a three out of four (Huang, 2009; Bacha, 2001). Analytic scoring focuses the rater on different features of writing, e.g., organization, ideas, and sentence fluency, each of which receives an individual score based on the articulated rating scale (Huang, 2009; Bacha, 2001). Both scoring methods have benefits and drawbacks regarding their utility in the rating process.

Holistic scoring could easily be adopted as an attractive rating method given its economical use of time. Huot (1990) cites several studies that denote the flexible and diverse nature of holistic scoring as it relates to writing assessment. As a rating method, holistic scoring was derived from the need to ensure a high measure of reliability, specifically inter-rater reliability, in direct writing assessment (Huot, 1990). A review of the literature reflects the broad use of holistic scoring of direct writing assessments due to its efficiency and its relative reliability in reflecting overall writing proficiency (Bacha, 2001; Çetin, 2011). Çetin (2011) argued, on the basis of his study, that inter-rater reliability results under holistic scoring were high and positively correlated; these results were considered significant given the field’s tendency to hold analytic scoring as the more reliable method.

However, Çetin’s (2011) conclusion reflects Huot’s (1990) argument regarding the positive use of holistic scoring methods when raters are appropriately trained as a means for increasing rater agreement, which can impact measures of reliability. With all that said, the major drawback of holistic rating as a critical method of scoring assessments is the lack of diagnostic interpretation of the results (Bacha, 2001; Barkaoui, 2011; Weigle, 2002). However, this may speak more to a disconnection between the intended purpose of the assessment data and the actual use of the assessment data. Tied closely to this notion, holistic scoring can be susceptible to raters drifting away from the original criteria designed for the assessment, which can cause a reduction in consistency or reliability (Barkaoui, 2011; Huang and Han, 2013).

Conversely, the previously articulated drawback of holistic rating is the inherent strength of analytic rating: “[…] analytic marking […] provide[s] criterion-level information intended for diagnostic purposes” (Heldsinger and Humphry, 2013, p. 221). Simply put, analytic scoring provides more information regarding an examinee’s specific writing ability than holistic scoring (Meadows and Billington, 2005). Bacha (2001) argued that analytic scoring methods more adequately defined a writer’s readiness for advanced writing coursework. Earlier studies comparing holistic and analytic rating methods (Huang and Han, 2013; Meadows and Billington, 2005) found that analytic rating was associated with higher reliability, as well as less variation in rater scores. The reduction in variation could be attributed to analytic rating minimizing the effects of rater factors by keeping specific focus on singular aspects of the written product (Han and Huang, 2017; Huang, 2009). Analytic marking is argued to improve not only inter-rater reliability but intra-rater reliability as well. Furthermore, analytic scoring methods can lead to greater precision, as each category being assessed can be treated as multiple items, or greater sampling (Barkaoui, 2011; Huang and Han, 2013).

Regarding issues with the employment of analytic methods in the rating of writing assessments, analytic scoring is more time- and labor-intensive (Meadows and Billington, 2005). As a result, the use of analytic methods in the type of accountability framework prevalent in North America, and more specifically in New York State, is impractical (Badjadi, 2013). Despite their popularity, analytic scoring tools need further experimental study to flesh out aspects of their practical and effective use in writing assessment (Andrade et al. 2008; Brookhart, 2005; Rezaei and Lovorn, 2010).

In the context of large-scale standardized English as a foreign language writing assessments such as the TOEFL (Test of English as a Foreign Language) and IELTS (International English Language Testing System), examinees’ essays are normally marked by at least two independent raters to ensure assessment reliability (ETS, 2018; IELTS, 2018). Reliability is considered an important quality indicator of an educational assessment, and a high-quality writing assessment, therefore, must first be reliable (AERA et al. 2014). It is evident that increasing the number of writing tasks and human raters would increase writing assessment reliability (Huang, 2008; Liu and Huang, 2020; Lee et al. 2002; Zhao and Huang, 2020). However, in some high-stakes, large-scale language assessments a single-rater scenario is still practiced (Liu and Huang, 2020; Zhang, 2009; Zhao and Huang, 2020). Such a practice raises concerns about the score reliability of performance assessments (AERA et al. 2014; Huang, 2008, 2012).

G-theory as a methodology for detecting rater variation

CTT and G-theory are both used as methods for detecting rater variation. However, G-theory is the more powerful approach for detecting rater variation (Brennan, 2001). Its strength as a methodology, when situated within the realm of performance assessment, derives from the framework’s ability to separate and articulate multiple sources of error, as well as each source’s magnitude, in an attempt to generate and generalize a “universe score,” or proficiency score (Gebril, 2009; Schoonen, 2005; Shavelson and Webb, 1991). It is the recognition of the various potential sources of error native to the performance assessment process that makes it unique when juxtaposed with closed assessment processes, and that demands a statistical method that can more effectively estimate the sources of error affecting scores and their interpretations (Huang, 2011; Huang et al. 2023; Li and Huang, 2022; Schoonen, 2005; Shavelson et al., 1993).

Brennan (2000) urges the use of G-theory as a method for examining performance assessments because of its focus on discerning which facets are fixed and which are random in an effort to improve reliability. Again, while reliability reflects the major concern of psychometric analysis, validity, or accuracy, also receives special attention through the use of G-theory, particularly through the lens of task as a facet (Behizadeh and Engelhard, 2011; Schoonen, 2005). Given the proliferation of performance assessments within current accountability assessment regimes, G-theory provides both the measurement and educational communities with the necessary framework to improve the interpretation of scores derived from these assessments (Schoonen, 2005; Shavelson et al., 1993).

To sum up, although there is a significant research base regarding assessment reliability across the psychometric and educational canon, it is important to consider sources of variance, as well as important indicators of quality, most specifically rater reliability, on assessments that are administered to an entire population of students regardless of any categorical designation, i.e., the NYSTP in ELA and mathematics. Accountability is a critical element of consideration regarding the administration and uses of these assessments and the inferences derived from them. This study specifically examined the rating of accountability assessments, as well as the impact of the sources of variance on the assessment inferences being made, as a means of ensuring that valid and reliable decision-making can be applied to mandated, large-scale, high-stakes state assessments for all students in New York State.

Research questions

The purpose of this study was to examine the scoring variability and reliability of the constructed response questions in the New York State elementary and intermediate ELA and mathematics assessments and to provide implications for policy making at the local school district and state levels. Specifically, the following two research questions were asked: a) what were the sources of rater variability in scoring the constructed response questions in the New York State ELA and mathematics assessments across grades (i.e., grades 4 and 6 ELA and grades 4 and 5 mathematics)? and b) what were the rater reliabilities (i.e., G-coefficients for norm-referenced score interpretations) in scoring these constructed response questions across subjects and grades?

Methodology

Institutional review board (IRB) approval

Before any data were collected, ethics approval was required from the IRB at the university where the joint first author studied for his doctoral degree under the supervision of the first and corresponding author. This process was honored, and IRB approval was obtained prior to this study.

Selection of constructed response samples

The selection of the constructed response samples was a crucial step in designing this study. The joint first author worked with a school district to procure copies of actual assessment responses from the New York State 2015 grade four and grade six ELA assessments, which are part of the New York State testing program in response to NCLB and RTTT. For each ELA assessment, the joint first author obtained three “good,” three “fair,” and three “poor” responses, based on actual response scores, to two four-point constructed response questions for each grade level. A total of 18 written responses from each grade level comprised the ELA sample, for a grand total of 36 written responses that were scored by the invited raters. The researchers ensured that all the selected constructed response samples were of relatively similar length so as not to falsely suggest a correlation between length of response and score.

The selection of the constructed response samples for mathematics followed identical procedures. Once again, in conjunction with a school district, three “good,” three “fair,” and three “poor” responses were collected as rating samples for each item. Specifically, two two-point constructed response items and two three-point constructed response items per grade level constituted the mathematics rating packet. A total of 36 written responses from each grade level comprised the mathematics sample, for a grand total of 72 written responses that were scored by the invited raters. Similarly, relatively similar length was considered in selecting these constructed responses.

The selection of raters

The selection of raters followed a convenience sampling process. Raters were selected based on the following three criteria: a) certification area, b) years in teaching, and c) years of experience in accountability assessment rating. The joint first author used an elementary and middle school principal ListServ to reach out to potential raters. Although this gave a large number of potential candidates equal opportunity to participate, special consideration was taken when selecting raters to ensure that multiple schools and districts were represented. Additionally, multiple personal communications went to another seven districts to garner appropriately diverse raters. In total, ten raters holding certification as educators in New York State were selected for this study.

Among the ten selected raters, five were male and five were female; six had 1–15 years of teaching experience and four had 16–25 years; four had less than five years of assessment experience and six had over five years; and, finally, their certification levels were evenly distributed, with two raters representing each of the five certification levels, i.e., a) pre-kindergarten through grade 6, b) reading/literacy, c) secondary ELA grades 7–12, d) students with disabilities, and e) school administrator supervisor/school district administrator.

As noted above, the process used to collect a pool of possible raters was convenience sampling. Given the budgetary restraints of the study, it was difficult to entice rating participants. However, after employing the electronic and personal communication strategies noted above, the joint first author filtered all potential participants through the criteria of certification area, years of teaching experience, and years of accountability assessment experience. The focus was both on providing a balance of participants across these criteria and on accurately reflecting the types of raters most likely to rate these types of exams in an actual rating setting, based on the joint first author’s nearly twenty years of experience rating grades three through eight exams in New York State. Certainly, the study was limited by the number of participants willing to be involved.

Rater training

The rater training mirrored the training outlined by New York State, as described earlier in this paper, to the degree possible given the access to materials. The participants first received training in scoring ELA constructed responses, followed by training in scoring mathematics constructed responses. Specifically, the participants were first trained on the holistic application of the rubrics for the ELA constructed response questions, which consisted of a two-point rubric and a four-point rubric. Since the New York State rubrics are general rather than task specific, the participants were trained on both the grade four and grade six rubrics for the ELA constructed response questions. After a brief break, the participants received similar training on the two general scoring rubrics for the grade four and grade five mathematics assessments, respectively. For the mathematics assessments, there were also two rubrics from which to score, a two-point rubric and a three-point rubric.

Each rater was given a Training Set, which included anchor papers for each possible score; a Practice Set, on which to practice rating under the guidance of the Scoring Leader; and a Consistency Assurance Set, which scorers used to test their ability to rate accurately. The person leading the training had the answers and reviewed the results of the Consistency Assurance Set for reliability prior to handing live papers over to the rater (NYSED, 2015). Each of the constructed response questions was rated holistically by a single rater; this understanding was part of the training (NYSED, 2015). The exact same process was implemented for the analytic rating process. Each training session, by rating method, lasted approximately three hours.

Scoring procedures

Immediately after the participants had received training, they were invited to follow the New York State scoring rubrics to score the ELA and mathematics constructed response questions holistically. The ten raters were asked to score these samples independently. It is important to note that each writing sample was assigned an alpha-numeric notation so that the data generated could be effectively analyzed and anonymity would be ensured.

Data analyses

Using the GENOVA computer program (Crick and Brennan, 1983), the following two analyses were performed: a) 12 paper-by-rater (p x r) random effects G-studies, one for each task across subjects and grades (i.e., ELA grade 4 tasks 1 and 2, ELA grade 6 tasks 1 and 2, mathematics grade 4 tasks 1 through 4, and mathematics grade 5 tasks 1 through 4); and b) the calculation of G-coefficients for each constructed response task across subjects and grades. It is important to note that the tasks under each subject were analyzed separately because they differ in nature and their scoring rubrics also differ.
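For readers less familiar with GENOVA, the following minimal sketch (an illustration only, not part of the actual analyses; the score matrix and all names are hypothetical) shows how the three variance components of a p x r random effects G-study can be estimated from a papers-by-raters score matrix using the standard ANOVA expected mean square equations:

import numpy as np

def p_by_r_variance_components(scores):
    """Estimate variance components for a papers-by-raters (p x r)
    random effects G-study; rows are papers, columns are raters."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    paper_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Sums of squares for a two-way design without replication
    ss_p = n_r * np.sum((paper_means - grand) ** 2)
    ss_r = n_p * np.sum((rater_means - grand) ** 2)
    ss_pr = np.sum((scores - grand) ** 2) - ss_p - ss_r  # residual

    # Mean squares
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

    # Expected mean square solutions for the random effects design
    var_pr = ms_pr                            # paper-by-rater residual
    var_p = max((ms_p - ms_pr) / n_r, 0.0)    # paper (object of measurement)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)    # rater (unwanted variance)
    return var_p, var_r, var_pr

# Hypothetical example: 18 papers scored by 10 raters on a 0-4 rubric
rng = np.random.default_rng(0)
demo = rng.integers(0, 5, size=(18, 10)).astype(float)
var_p, var_r, var_pr = p_by_r_variance_components(demo)
total = var_p + var_r + var_pr
print(f"paper {100 * var_p / total:.2f}%, rater {100 * var_r / total:.2f}%, "
      f"residual {100 * var_pr / total:.2f}%")

Each estimated component is then expressed as a percentage of the total variance, which is how the results in Table 1 are reported.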

Results

Twelve paper-by-rater random effects G-studies results

A total of 12 paper-by-rater (p x r) random effects G-studies, one for each task across subjects and grades, were performed. Each G-study yielded the following three sources of variation: paper (p), rater (r), and paper-by-rater (pr). Among the three variance components in each G-study, the variance associated with paper (p) is considered wanted variance because the paper is the object of measurement, and the examinees who constructed these written responses were expected to differ in their ELA and mathematics performance. The variance associated with rater (r) is considered unwanted variance because it indicates inconsistency in how leniently or stringently raters scored these constructed response questions: the larger the rater variance component, the larger the scoring inconsistency. Finally, the paper-by-rater (pr) residual variance component contains the variability due to the interaction between papers and raters, together with other unexplained systematic and unsystematic sources of error. A large residual variance component indicates a large amount of unexplained variance in the design, suggesting that facets not included in the design contributed to it. Table 1 presents the results for all 12 G-studies.

Table 1 Variance components for random effects p x r G-studies across subjects and grades.

As shown in Table 1, the results for both constructed response tasks for grade four ELA show that the wanted variance associated with the object of measurement (p) explained the largest share of score variance (57.18 and 62.42% of the total variance, respectively). The residual yielded the second largest share (40.34 and 29.36% of the total variance, respectively). Finally, the rater variance component explained only 2.47% of the total variance for the first task but 8.22% for the second task, suggesting that raters scored the first task more consistently than they scored the second task.

Similarly, for both constructed response tasks for grade six ELA, the wanted variance associated with the object of measurement explained the largest share of score variance (48.91 and 62.14% of the total variance, respectively). The residual yielded the second-largest share (34.98 and 34.13% of the total variance, respectively). Finally, the rater variance component explained 16.11% of the total variance for the first task but only 3.73% for the second task, suggesting that raters scored the second task over four times more consistently than they scored the first task.

Also as shown in Table 1, the results for the four constructed response tasks for grade four mathematics show that the wanted variance associated with the object of measurement (p) explained the largest share of score variance (52.50, 60.90, 81.65, and 74.50% of the total variance, respectively). The residual yielded the second largest share (45.23, 38.16, 17.38, and 24.50% of the total variance, respectively). Finally, the rater variance component explained 2.27% of the total variance for the first task and only 0.94, 0.98, and 1% of the total variance for the second, third, and fourth tasks, respectively, suggesting that raters scored all four constructed response tasks for grade four mathematics extremely consistently.

Similarly, for all four constructed response tasks for grade five mathematics, the wanted variance associated with the object of measurement explained the largest share of score variance (81.20, 70.58, 63, and 91.22% of the total variance, respectively). The residual yielded the second largest share (17.15, 25.66, 34.68, and 7.48% of the total variance, respectively). Finally, the rater variance component explained only 1.65, 3.77, 2.33, and 1.12% of the total variance for the four tasks, respectively. As for grade four mathematics, these results indicate that the raters scored all four constructed response tasks for grade five mathematics extremely consistently.

The calculation of G-coefficients for reliability interpretations

Using the formula \(\rho^2 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_\delta^2}\), where \(\sigma_p^2\) is the universe score (paper) variance and \(\sigma_\delta^2 = \sigma_{pr}^2 / n_r\) is the relative error variance for \(n_r\) raters, the G-coefficients for the reliability of ratings across subjects, grade levels, and tasks were calculated. The results are reported in Table 2.

Table 2 A summary of G-coefficients for rater ratings across subjects, grades, and tasks.

As shown in Table 2, the G-coefficients for the two constructed response tasks for grade four ELA were 0.59 and 0.68, respectively, for the current New York State one-rater scenario. If three raters score the first task and two raters score the second task, these G-coefficients would increase to 0.81 for both constructed response tasks. Very similarly, the G-coefficients for the two constructed response tasks for grade six ELA were 0.58 and 0.65, respectively, for the current New York State one-rater scenario. If three raters score the two tasks, these G-coefficients would increase to 0.81 and 0.85 for the first and second constructed response tasks, respectively.
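These projections follow directly from the formula above, since the relative error variance shrinks as the residual is divided by the number of raters. As an illustrative sketch only (ours, not GENOVA output), the following lines reproduce the grade four ELA task 1 pattern from the variance percentages reported for Table 1:

def g_coefficient(var_p, var_pr, n_raters=1):
    """G-coefficient for a p x r design: universe score variance over
    itself plus the relative error variance (residual / number of raters)."""
    return var_p / (var_p + var_pr / n_raters)

# Grade four ELA task 1 components (as % of total variance): p = 57.18, pr = 40.34
for n in (1, 2, 3):
    print(n, round(g_coefficient(57.18, 40.34, n), 2))
# one rater yields about 0.59; three raters raise the coefficient to about 0.81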

Also as shown in Table 2, the G-coefficients for the four constructed response tasks for grade four mathematics were 0.96, 0.61, 0.82, and 0.75, respectively, for the current New York State one-rater scenario. If three raters score the second task and two raters score the fourth task, these G-coefficients would increase to 0.83 and 0.86 for these two constructed response tasks, respectively.

Slightly different from the four constructed response tasks for grade four mathematics, the G-coefficients for the four constructed response tasks for grade five mathematics were 0.83, 0.73, 0.64, and 0.92, respectively, for the current New York State one-rater scenario. If two raters score the second task and three raters score the third task, these G-coefficients would increase to 0.85 and 0.84 for these two constructed response tasks, respectively.

Discussion

This study examined the impact of the one-rater holistic scoring practice in the assessment of constructed response questions on the rater variability and reliability of the NYSTP in grades four and six ELA and grades four and five mathematics. The G-theory results for both the grades four and six ELA assessments indicated that the object of measurement yielded the largest variance component. This variance component is considered wanted variance because the participants were expected to differ considerably in their ability to answer the constructed response questions. In addition, the residual yielded the second largest variance component in all these G-studies, indicating that there was still a large amount of unexplained variance and suggesting that some hidden facets were not considered in the design (Brennan, 2001). However, rater variance did differ between grades and tasks: in the grade four ELA assessment, raters scored the first task more consistently than they scored the second task; in the grade six ELA assessment, raters scored the second task over four times more consistently than they scored the first task.

Similarly, the G-theory results for the grades four and five mathematics assessments indicated that the wanted object of measurement variance component was the largest in all G-studies. Again, the residual yielded the second largest score variance in all G-studies. Interestingly, the rater variance component explained less than 3% and 4% of the total variance for all constructed response questions in the grades four and five mathematics assessments, respectively. These results suggest that raters scored all four constructed response tasks in each of the grades four and five mathematics assessments extremely consistently.

In terms of the rater reliability of the NYSTP grades four and six ELA assessments, the current one-rater scoring practice would not yield an acceptable G-coefficient of ≥0.80. The number of raters would need to increase to three for the first task and two for the second task of the grade four ELA assessment, and to three for both tasks of the grade six ELA assessment, in order to reach the minimum threshold of ≥0.80.

The mathematics assessments presented a slightly different picture: the current one-rater scoring practice would yield an acceptable G-coefficient of ≥0.80 only for the first and third tasks of the grade four assessment and the first and fourth tasks of the grade five assessment. For the remaining constructed response questions, the number of human raters would need to increase to two or three in order to achieve acceptable reliability coefficients.

The current practice for the rating of accountability assessments as part of the NYSTP is to have each constructed response question rated by a single rater using the holistic scoring method; that is, one constructed response question is rated once by rater 1, the second constructed response question is rated by rater 2, and so on, using the “general impression” ideology of holistic scoring. In practice, no student response is rated more than once.

Rater variability, and thus concern about rater reliability, was evident in the ELA assessments across grades and tasks; in the mathematics assessments, by contrast, rater variability appeared minimal and rater reliability was largely apparent across grades and tasks. These important findings appear to contradict the current rating process implemented when scoring the accountability assessments in New York State. Further research is needed to validate these findings.

Limitations

The present study was limited in several ways. First, the small sample sizes for both the constructed response samples and the rater participants restrict the generalizability of the outcomes. Given the security of the New York State grades three through eight assessments, the researchers could only use tasks that the State had released publicly, which made it very difficult to gather many constructed response samples at different quality levels. As a result, the training materials were very basic given the lack of released student samples. In actual live scoring of New York State assessments, each level of the rubrics is represented, usually more than once, with accompanying practice papers and consistency assurance sets.

Second, only quantitative methods were used in this study. As a result, a fuller and richer picture based on participants’ perceptions of the process is missing, which could have enhanced the analysis of the quantitative results. Conducting interviews with the raters at the end of each rating session could have provided a well-developed picture of how participants used the scoring rubrics to make their scoring decisions.

The results reported in this study agree with previously noted claims regarding holistic scoring processes (Barkaoui, 2011; Huang et al. 2023; Huang and Han, 2013; Liu and Huang, 2020; Zhao and Huang, 2020). For example, raters may tend to move away from clearly articulated criteria due to personal interpretations and decisions made throughout the rating process (Barkaoui, 2011; Weigle, 1998).

Further, in keeping with the theoretical framework of G-theory, the results of this study surfaced the magnitude of the various sources of error inherent in the rating procedures (Schoonen, 2005; Shavelson et al., 1993). These were discussed above, where the percentages of total variance attributed to components other than the object of measurement were noted. Furthermore, given the performance assessment nature of the tasks in this study, the results reflect the need to infuse G-theory into the interpretation of accountability assessment results across the educational community (Huang et al. 2023; Li and Huang, 2022; Liu and Huang, 2020; Schoonen, 2005; Shavelson et al., 1993; Zhao and Huang, 2020).

Finally, the results of this study depart from their predecessors regarding the assessment concept of reliability. As reported above, in every ELA assessment situation, the results pointed to the need for more than one rater per constructed response question per student in order to meet an appropriate G-coefficient of ≥0.80; in many cases for the ELA tasks, the number of raters needed reached three, well beyond the current practice for the NYSTP. These results are counterpoints to studies previously conducted regarding the reliability and efficiency of the holistic rating of writing assessments (Bacha, 2001; Çetin, 2011). Likewise, the holistic rating of the mathematics assessments also required more than one rater to reach the appropriate reliability coefficient in this study: in four out of eight tasks, holistic rating needed multiple raters to be considered reliable.

Implications

The results provide information that could impact policy at the local and state levels when considering the use of assessment results in an accountability framework, like that of §3012-c and §3012-d. The ushering in of the Every Student Succeeds Act in December 2015 cemented the use of accountability assessments first brought into law with NCLB. In other words, these assessments are here to stay. As a result, their use must be as valid, fair, and reliable as possible so that the most appropriate inferences can be made to improve education for all students at both the LEA and state levels. In most cases, multiple raters would be needed to ensure reliable assessment results, suggesting a need for change in the scoring policy applied in New York to the accountability assessments. Considering the results presented here may make appropriate interpretation possible, but to do so, money cannot be the driving force for policy making. As reported above, the number of raters needs to be considered in order to honor the process, as well as the inferences being made from accountability assessments.

However, this is not just a policy issue. Given the results of this study, it is clear that assessment literacy for educators must be improved. Greater training on the concepts tied to reliability could go a long way toward mitigating some issues tied to rating; rating could become less of a chore and more of an active application of quality principles. Equal time needs to be given to ensuring that rating participants clearly understand what quality assessment procedures entail, rather than relying on indicators of assessment quality being applied only to the test itself.

This study provides some directions for future research in the area of variability and reliability in the rating of constructed response questions in high-stakes state assessments. First, future research should expand the number of papers, tasks, and participants to enhance the generalizability of the results. In order to do so, assessment results need to become more readily accessible: student papers, with no identifying information, would need to be accessible to researchers attempting to improve the effective use of assessment results for accountability purposes. Second, with the impending promise of computer-based testing (CBT), it would be necessary to apply a process similar to the one outlined in this study to the CBT world. In New York State, CBTs for the New York State Testing Program should be completely implemented by 2020; at this point, it is still unclear how performance assessment tasks will be rated for these new tests. Finally, the framework of this study should be applied to the Regents Examinations in New York as well. Based on the previous APPR and the impending APPR, Regents scores will be used as accountability measures. Therefore, it is suggested that a similar study be completed for the five examinations that must be passed in order to graduate from high school in New York State. By doing so, the focus can once again be on ensuring a reduction in the impact of unwanted variance on score variability and an increase in the reliability of score inferences.