Sunday, October 11, 2015

Post 40: What it's all really like.

For post forty I figured I would share exactly what I do as a student and what my material looks like, and invite some peer review. Correspondence courses tend to demand more deliverables and independent study, but as a research community we're working on ways to include social presence and epistemic engagement. Presented here is a paper.
Hixson, S. (2015, October 11). Critical thinking application II: Standardized test report card 
     [Course deliverable for a course on evaluation and assessment OTL-541K].
     School of Education, Colorado State University Global Campus.
Critical Thinking Application II: Standardized Test Report Card

          To establish the background perceptions related to this report, and to disclose any cultural or situational bias, I state my opinion that the current standardized assessment system for secondary students in Colorado has not successfully demonstrated statistical validity or reliability. Because the author's opinion may result in confirmation bias, it is my responsibility to mention that I noticed a peculiarity in the data represented as scale means (sets of assessment scores that were processed and averaged on a base score) because I have a background in psychology, and not because I am active in the political situation surrounding standardized testing in Jefferson County, Colorado.

Part I: Statewide Assessment Data and Report

          Bennett (2015) provides a stratification that describes the development of standardized assessment in three tiers: a tier in which the assessments attempt to reproduce conventional paper assessments and rely on the validity of the reproduced measures; a tier in which the testing system's infrastructure is developed and expanded for accountability and efficiency reasons; and a tier in which the testing system includes rich data and student performance activities that are immersive, cognitively engaging, multi-step processes. The State of Colorado, in my assessment, has a second-tier testing system. Though there are grave concerns about the construct validity and reliability of the instruments themselves, the assessment scores and processed data are easily available, and reports can be generated automatically by the Colorado Department of Education (CDE, n.d.). The existing infrastructure provides access to many different measures of key variables in the statewide assessment data, generates and exports electronic reports of the thousands of scores in the population, and demonstrates well-thought-out data storage and technology utilization. However, the tests themselves cannot be shown to demonstrate construct validity, as there are confounding variables that cause the data to have extreme and unrelated variation in overall mean scores statewide. In cognitive psychology, reading comprehension and writing are thought to be strongly related: people who are skilled at reading and vocabulary comprehension are, overall, also skilled in writing and communication. While poorly demonstrated writing skills may not necessarily be related to intelligence or achievement, writing depends on reading-related cognitive abilities, and tests of writing and reading tend to show strong correlations in a wide range of studies (Bruning, Schraw & Ronning, 1999).
Bruning, Schraw & Ronning generalize in a second-print textbook, “These relationships are not surprising when we consider that frequent reading exposes students to many more samples of writing” (p. 303). The data presented in Figure 1 are the statewide population mean scale scores of secondary public education students on the reading and writing tests created by the Colorado Department of Education. They yield a Pearson correlation coefficient of r = 0.0016, where df = 9 and r-crit = 0.602 at a probability of error p = 0.05, so the analysis fails to reject the null hypothesis that reading and writing are not related.
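
The significance check described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis actually run for this report: the score lists are hypothetical placeholders rather than the CDE values, the function names are my own, and the critical t value of 2.262 (two-tailed, p = 0.05, df = 9) is taken from a standard t table.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two lists of equal length with at least 3 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def critical_r(t_crit, df):
    """Convert a critical t value into the equivalent critical r."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


# Assumption: two-tailed critical t for p = 0.05 and df = 9 (standard t table).
T_CRIT_DF9 = 2.262

# Hypothetical placeholder scores for eleven years; NOT the CDE data.
reading = [660, 662, 659, 661, 663, 660, 662, 664, 661, 663, 662]
writing = [545, 540, 548, 543, 541, 547, 544, 542, 546, 545, 543]

df = len(reading) - 2  # eleven yearly means give df = 9
r = pearson_r(reading, writing)
r_crit = critical_r(T_CRIT_DF9, df)
reject_null = abs(r) >= r_crit

print(f"r = {r:.4f}, r-crit = {r_crit:.3f}, reject null: {reject_null}")
```

With eleven yearly means, df = 11 - 2 = 9, and converting the tabled t value gives the r-crit of about 0.602 used in the text.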

Figure 1. Colorado statewide mean scale scores on reading and writing 2004-2014. (CDE, n.d.)       

         These data raise critical concerns about construct validity and about which of the assessments among the three major subjects (reading, writing, and math) is in error; they also demonstrate no relationship in the variability of scores generated within a given year. Figure 2 demonstrates that these data are reproduced at the school, district, and state levels, which makes it appear either that statewide reading scores are inflated or that the assessments for math and writing are not accurately measuring performance.

Figure 2. Comparisons between large and small schools, district and state averages. (CDE, n.d.)

         These data warrant further investigation into the causes of the outlying and variable data, which seem otherwise unrelated between measures and within groups. One confounding variable in these data is that Lakewood High School is a very large school while Jeffco Open School is a very small democratic school, with an estimated N=2,000 and N=200 students, respectively.

Part II: Reflections.
          Goodwin & Hubbell (2013) include only self-report measures of knowledge based on multiple-choice or open-ended sentence questions in their survey of assessments of skill. In a related assignment, available for viewing online, I created a formative assessment based on a Likert scale and continued to investigate the possibility that other measures of performance may be more accurate and statistically valid (Hixson, 2015). A previously cited study of the automated scoring of writing reports an average Pearson r=0.85 between scores of a subject's content knowledge after a standard video presentation and scores generated by an automated essay scoring system (Kersting, Sherin & Stigler, 2014). Kersting, Sherin & Stigler fail to report the number of human grades performed in the analysis of human versus computer scoring, so no value of r-crit could be found, and move forward with their research claiming, “We reasoned that if the average correlations between rater and computer-generated scores exceed .80, an argument can be made for the convergent validity of machine scores with human scores” (p. 965). In statistical terms, this says that the researchers are unaware of whether the correlations they found were significant, and that they chose to publish data which clearly demonstrate a confirmation bias for automated testing systems in order to show that automated essay readers may work.
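
To illustrate why the number of scored pairs matters, the sketch below converts a correlation into a t statistic and compares it against tabled critical values. The sample sizes are hypothetical, since the study's actual n is not known here; the function names are my own, and the critical values are two-tailed, p = 0.05 entries from a standard t table.

```python
import math

# Assumption: two-tailed critical t values at p = 0.05 from a standard t table,
# keyed by degrees of freedom (df = n - 2). Only a few rows are included.
T_CRIT_05 = {3: 3.182, 4: 2.776, 5: 2.571, 8: 2.306, 10: 2.228, 20: 2.086, 30: 2.042}


def t_statistic(r, n):
    """t statistic for a Pearson r computed from n paired scores."""
    df = n - 2
    return r * math.sqrt(df) / math.sqrt(1 - r ** 2)


def is_significant(r, n):
    """True if r exceeds the tabled critical value for df = n - 2."""
    return abs(t_statistic(r, n)) > T_CRIT_05[n - 2]


# Hypothetical sample sizes; the same r = 0.85 is tested at each one.
for n in (5, 6, 12):
    print(n, round(t_statistic(0.85, n), 3), is_significant(0.85, n))
```

Under these assumptions, the same r = 0.85 fails the test with five pairs and passes with six, which is why a correlation reported without its sample size cannot be checked against r-crit.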

          What I am seeing, in terms of the quest for statistical validity among outcome measures for secondary students, is extensive confirmation bias about the ability of the system to accurately represent and aggregate student knowledge, with grandiose technological infrastructure and little construct validity, measures of variance, or confidence intervals related to the reliability of mean scores or scoring systems. The system appears to be financially driven, and the data that the system has generated over a decade are not supported by research or best practices in psychology.


Bennett, R. E. (2015). Chapter 10: The changing nature of educational assessment. 39, 370-407.

Bruning, R., Schraw, G. & Ronning, R. (1999). Cognitive psychology and instruction (3rd ed.).
     Upper Saddle River, NJ: Prentice Hall.

CDE (n.d.). SchoolView data lab report. Retrieved from: CDE Website

Goodwin, B. & Hubbell, E. (2013). The 12 touchstones of good teaching: A checklist for staying 
     focused every day. Alexandria, VA: Association for Supervision & Curriculum Development.

Hixson, S. (2015, May 31). Teaching portfolio [Course deliverable for a class on teaching and
     learning methods OTL-502-1]. School of Education, Colorado State University Global Campus.
     Retrieved from http://www.dxed.org/teaching-portfolio

Kersting, N., Sherin, B. & Stigler, J. (2014). Automated scoring of teachers’ open-ended responses
     to video prompts: Bringing the classroom-video-analysis assessment to scale. Educational and
     Psychological Measurement, 74(6), 950-974.
