Interpreting Student Ratings of Instructors: A Literature Review

The following review is the work of UVU faculty, provided here by Alex Simon, former President of the UVU Chapter of the AFT, as part of his ongoing research on a subject that has troubled, and continues to trouble, colleagues denied tenure or promotion largely on the basis of SRI scores and comments.


[the AFT and the AAUP have joined forces nationally and at UVU as well]


Interpreting Student Ratings of Instruction: A Literature Review by the Utah Valley University Chapters of the American Association of University Professors and American Federation of Teachers

 

 

A significant and established body of peer-reviewed research provides the following results and guidelines for interpreting student ratings of instruction (SRIs) / student evaluations of teaching (SETs):

 

 

1.     Peer-reviewed research has repeatedly shown that SRIs are not a highly valid measure of teaching effectiveness.

a.     Peer-reviewed research on the relation between SRIs and teaching effectiveness stretches back over 45 years, with numerous independent researchers studying the topic using a variety of methods. Over the last 10+ years, independent researchers have conducted multiple meta-analyses showing that the relation between SRIs and student learning or teaching effectiveness is usually weak or non-existent (Clayson, 2009; Kreitzer & Sweet-Cushman, 2022; Uttl et al., 2017). Research shows that even in instances where there is a weak relation between SRIs and learning or teaching effectiveness, the relation is situational and dependent on factors irrelevant to teaching ability, such as the instructor or the discipline (Clayson, 2009).

                                               i.     Uttl and colleagues note “…students do not learn more from professors who receive higher SET [SRI] ratings.” (2017, p. 40), and Kreitzer and Sweet-Cushman note “Student Evaluations of Teaching have low or no correlation with learning.” (2022, p. 1).

b.     SRIs are such poor measures of teaching effectiveness that peer-reviewed studies have shown that more effective teachers sometimes receive lower SRI ratings than less effective teachers (Emery, Kramer & Tian, 2003), and that greater teaching effectiveness is sometimes related to lower student evaluation scores (Braga et al., 2014). A recent study also showed that poor instructors often receive high SRIs and effective instructors often receive low SRIs (Esarey & Valdes, 2020).

                                               i.     Summarizing the lead author, a 2020 Inside Higher Ed article ( https://www.insidehighered.com/news/2020/02/27/study-student-evaluations-teaching-are-deeply-flawed ) stated: “… Esarey [the author] said that unless the correlation between student ratings and teaching quality is ‘far, far stronger than even the most optimistic empirical research can support,’ then common administrative uses of SETs [SRIs] ‘very frequently lead to incorrect [RTP] decisions.’ Those professors with the very highest evaluations ‘are often poor teachers,’ he added, ‘and those with the very lowest evaluations are often better than the typical instructor.’ Consequently, Esarey said that he and Valdes would expect ‘any administrative decisions made using SET [SRI] scores as the primary basis for judgment to be quite unfair.’”

c.     A primary reason why SRIs are not valid measures of teaching effectiveness is the fact that SRIs are affected by numerous factors which have nothing to do with teaching effectiveness. These factors are described below.

 

2.     Peer-reviewed research has repeatedly shown that higher academic rigor is related to, and causes, lower SRI scores (Greenwald & Gillmore, 1997a, 1997b; Stroebe, 2016).

a.     Higher expected grades among students are related to higher SRI ratings (Boring et al., 2016). Higher perceived grading leniency is related to higher SRIs (Griffin, 2004; Olivares, 2001). Higher student satisfaction with grades is related to higher SRI ratings (Kogan, Genetin, Chen & Kalish, 2022). Courses with lighter workloads and higher grades receive higher SRI ratings (Kreitzer & Sweet-Cushman, 2022).

b.     Experiments have shown that the relation between grades and SRIs is causal. In one experiment, giving students lower grades than they expected caused their SRI ratings to decrease significantly (Holmes, 1972). In another experiment, more liberal grading caused students to provide higher SRI ratings (Vasta & Sarmiento, 1979). In a natural experiment, SRI ratings decreased significantly after Wellesley College implemented an anti-grade-inflation policy that increased rigor and lowered grades (Butcher, McEwan, & Weerapana, 2014).

 

3.     Peer-reviewed research has repeatedly shown that women receive lower SRIs than men, and that persons of color receive lower SRIs than their white counterparts.

a.     The relations between instructor sex, instructor ethnicity, and SRIs have been well documented, with studies consistently showing that female instructors and persons of color receive lower SRIs than male instructors and white instructors (e.g., Boring, 2017; Kreitzer & Sweet-Cushman, 2022).

b.     Experimental research has shown that the relations between instructor sex, instructor ethnicity, and SRIs are causal (e.g., Chávez & Mitchell, 2019).

c.     Summarizing this well-established field of research, Kreitzer and Sweet-Cushman note: “scholars using different data and different methodologies routinely find that women faculty, faculty of color, and other marginalized groups are subject to a disadvantage in SETs [SRIs].” (2022, p. 1).

 

4.     Peer-reviewed research shows that numerous additional factors that have nothing to do with teaching quality affect SRIs.

a.     Non-elective courses receive lower SRIs than electives. Quantitative courses receive lower SRIs. Upper-level discussion-based courses receive higher SRIs than lower-level large courses. SRI ratings vary across disciplines: natural science courses receive the lowest SRIs, while humanities courses receive the highest. An instructor's accent, sexual orientation, and disability status also impact SRIs (Kreitzer & Sweet-Cushman, 2022).

b.     SRIs can also be affected purposefully by instructor actions that have nothing to do with teaching quality. For instance, providing chocolate chip cookies to students in a randomized, controlled trial resulted in higher SRI scores (Hessler et al., 2018).

c.     As recently as 2022, peer-reviewed research showing the limitations of SRIs has been the focus of multiple high-profile articles in media outlets such as Inside Higher Ed and The Chronicle of Higher Education. For instance:

                                               i.     https://www.insidehighered.com/news/2022/01/19/study-grade-satisfaction-major-factor-student-evals

                                              ii.     https://www.insidehighered.com/news/2021/02/17/whats-really-going-respect-bias-and-teaching-evals

                                             iii.     https://www.insidehighered.com/news/2020/02/27/study-student-evaluations-teaching-are-deeply-flawed

                                            iv.     https://www.chronicle.com/article/why-we-must-stop-relying-on-student-ratings-of-teaching/

 

5.     The average of SRI scores is a poor and biased measure of SRI performance. This is because the distribution of SRI scores is skewed, and in skewed distributions averages are unduly affected by outlier scores, such as those from a small minority of students who rate an instructor low on SRIs. Put differently, in a skewed distribution of SRI scores where most students rate the instructor “4” or “5” out of 5, the small minority of students who rate the instructor “1” or “2” will pull the average down at a disproportionate rate. Because of this undue influence of outliers, it is possible for most students to rate an instructor “4” or “5” while the average falls well below 4. Below is a graph and description depicting the undue influence of outliers in a skewed distribution of hypothetical SRI data for a hypothetical instructor.

 

 

The graph shows that 500 students rated the instructor “5”, 300 rated them “4”, 200 rated them “3”, 100 rated them “2”, and 100 rated them “1”. 

 

800 students rated the instructor either a “4” or a “5”, but only 400 students rated the instructor “3” or lower. In other words, twice as many students rated the instructor “4” or “5” as all the students combined who rated the instructor “1”, “2”, or “3”.

 

Clearly, a typical rating for this instructor is somewhere between “4” and “5”. Indeed, there were more scores of “5” than any other score, and far more scores of either “4” or “5” than scores of “1”, “2”, or “3”.

 

However, because the average is unduly affected by outliers in a skewed distribution, the average score of this data is 3.83 (even though 67% of the students rated the instructor “4” or “5”). The average is not a good indicator of SRI performance because in skewed distributions like this, outlier scores—that is, the small minority of “1” or “2” scores—have a disproportionately large effect on the average (Linse, 2017).

 

Because of this, averages of SRI scores should not be used for interpreting SRI performance (Stark & Freishtat, 2014).

 

The median (middle) score and the mode (most frequent) score provide a more accurate measure of SRI performance, given the skewed distribution of SRI data.

 

For the hypothetical data in the graph above, the median (middle) score is 4. The mode (most frequent) score of the data is 5. Both the median and mode provide an accurate depiction of the typical score.

 

Medians and modes are less biased by outlier scores; they are therefore better measures of SRI performance and should be the focus of SRI interpretation (Linse, 2017). Generally, a mode SRI score of 5 out of 5 indicates good SRI performance (Linse, 2017).
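
To make the arithmetic above concrete, the following is a minimal sketch (in Python, chosen here only for illustration) that recomputes the mean, median, and mode of the hypothetical rating counts shown in the graph. The counts themselves come from the example above; everything else is illustrative.

```python
# Minimal sketch: recompute summary statistics for the hypothetical SRI
# distribution described above (500 "5"s, 300 "4"s, 200 "3"s, 100 "2"s, 100 "1"s).
from statistics import mean, median, mode

# Hypothetical rating counts from the graph: rating -> number of students
counts = {5: 500, 4: 300, 3: 200, 2: 100, 1: 100}

# Expand the counts into one entry per student (1,200 ratings in total)
ratings = [score for score, n in counts.items() for _ in range(n)]

print(f"mean   = {mean(ratings):.2f}")   # 3.83: pulled down by the minority of low scores
print(f"median = {median(ratings)}")     # 4.0: both middle scores are 4
print(f"mode   = {mode(ratings)}")       # 5: the single most common rating
share_4_or_5 = sum(n for s, n in counts.items() if s >= 4) / len(ratings)
print(f"rated 4 or 5: {share_4_or_5:.0%}")  # 67%
```

Run as-is, the sketch reproduces the figures discussed above: a mean of 3.83 even though 67% of students rated the instructor “4” or “5”, a median of 4, and a mode of 5.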

 

6.     Single SRI items should not be interpreted in isolation. SRIs must be interpreted based on aggregated data of numerous SRI items (Boysen, 2015; Linse, 2017).

a.     Single SRI items are often not reliable or valid measures of performance. Data are more reliable and valid in aggregate. High-stakes decisions such as RTP decisions should not be based on a single SRI item or a single issue within a portfolio (Boysen, 2015; Linse, 2017).

 

7.     RTP decisions should be based on more than just SRI data (Esarey & Valdes, 2020; Linse, 2017).

 

8.     SRI data should not be compared between faculty (Franklin, 2001; Linse, 2017), because differences in average SRI scores between faculty often do not indicate a real difference in teaching (Boysen, 2015; Franklin, 2001; Linse, 2017), and because SRI performance is affected by numerous factors that differ across instructors but have nothing to do with teaching effectiveness (e.g., Clayson, 2009). Peer-reviewed research suggests that the SRI data and overall portfolio of a given faculty person should be compared only to the established RTP criteria, not to the SRI data or portfolios of other faculty. There are several reasons the SRI data of one faculty person should not be compared to the SRI data of other faculty:

a.     SRI differences between faculty do not necessarily represent meaningful differences in teaching. For example, there is no scientific evidence that an SRI score of 3.85 represents meaningfully different teaching than a score of 4.35 (Boysen, 2015; Franklin, 2001). This is partly because SRIs are not highly valid measures of teaching effectiveness (Boysen, 2015; Clayson, 2009; Kreitzer & Sweet-Cushman, 2022). Research has shown that in some cases, more effective teachers can receive lower SRIs than less effective teachers (Braga et al., 2014; Emery, Kramer & Tian, 2003; Esarey & Valdes, 2020). A faculty person may have an SRI score of 3.75 because they teach a challenging course with low grades in a discipline that is not inherently interesting to most students, while another faculty person has a score of 4.95 because they teach an inherently entertaining discipline, grade leniently, and give students only praise as feedback.

b.     The statistical procedures that could be used to compare one instructor’s SRI data to the aggregate SRI data of a department or college require that the data be normally distributed, but SRI data of competent instructors are heavily skewed and are not normally distributed (Linse, 2017); see the sketch following this list.

                                               i.     As Stark and Freishtat (2014) note: “Personnel reviews routinely compare instructors’ average [SRI] scores to departmental averages. Such comparisons make no sense, as a matter of Statistics.” (p. 2)

c.     Because numerous factors differ between instructors and are known to affect SRI scores—such as discipline, instructor sex, instructor ethnicity, and grading—comparing the SRI scores of one instructor to another would require controlling for those factors.  
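
To illustrate point 8b above, the following is a minimal sketch (assuming Python with NumPy and SciPy available; the Shapiro-Wilk test is used only as one common example of a normality check) applied to the hypothetical rating distribution from point 5. The strong negative skew and the rejected normality test illustrate why procedures that assume normally distributed data are a poor fit for SRI scores.

```python
# Minimal sketch (assumed tooling: NumPy + SciPy): the hypothetical SRI
# distribution from point 5 is heavily skewed and clearly not normal.
import numpy as np
from scipy import stats

counts = {5: 500, 4: 300, 3: 200, 2: 100, 1: 100}
ratings = np.repeat(list(counts.keys()), list(counts.values())).astype(float)

print("skewness:", stats.skew(ratings))   # clearly negative: a long tail of low scores
w, p = stats.shapiro(ratings)             # one common normality check (illustrative choice)
print("Shapiro-Wilk p-value:", p)         # effectively zero: normality is rejected
```

Any standard normality check could stand in for Shapiro-Wilk here; the point is simply that normality-assuming comparisons of average scores are a poor match for data shaped like this (Linse, 2017; Stark & Freishtat, 2014).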

 

 

 

 

Works Cited

 

Boring, A. (2017). Gender biases in student evaluations of teaching. Journal of Public Economics, 145, 27-41.

Boring, A., Ottoboni, K., & Stark, P. B. (2016). Student evaluations of teaching (mostly) do not measure teaching effectiveness. ScienceOpen Research. DOI: 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1

Boysen, G. A. (2015). Uses and misuses of student evaluations of teaching: The interpretation of differences in teaching evaluation means irrespective of statistical information. Teaching of Psychology, 42(2) 109-118. DOI: 10.1177/0098628315569922

Braga, M., Paccagnella, M., & Pellizzari, M. (2014). Evaluating students’ evaluations of professors. Economics of Education Review, 41, 71-88. http://dx.doi.org/10.1016/j.econedurev.2014.04.002

Butcher, K. F., McEwan, P. J., & Weerapana, A. (2014). The effects of an anti-grade-inflation policy at Wellesley College. Journal of Economic Perspectives, 28(3), 189-204.

Chávez, K. & Mitchell, K.M.W. (2019). Exploring bias in student evaluations: Gender, race, and ethnicity. PS: Political Science & Politics, 53(2), 270-274. doi:10.1017/S1049096519001744

Clayson, D. E. (2009). Student evaluations of teaching: Are they related to what students learn? Journal of Marketing Education 31 (1), 16-30.

Emery, C. R., Kramer, T. R., & Tian, R. G. (2003). Return to academic standards: A critique of student evaluations of teaching effectiveness. Quality Assurance in Education, 11 (1), 37-46.

Esarey, J. & Valdes, N. (2020). Unbiased, reliable, and valid student evaluations can still be unfair. Assessment & Evaluation in Higher Education, 45 (8), 1106-1120. https://doi.org/10.1080/02602938.2020.1724875

Franklin, J. (2001). Interpreting the numbers: Using a narrative to help others read student evaluations of your teaching accurately. New Directions for Teaching and Learning, 87, 85-100.

Greenwald, A. G., & Gillmore, G. M. (1997a). Grading leniency is a removable contaminant of student ratings. American Psychologist, 52(11), 1209–1217. https://doi.org/10.1037/0003-066X.52.11.1209

Greenwald, A. G., & Gillmore, G. M. (1997b). No pain, no gain? The importance of measuring course workload in student ratings of instruction. Journal of Educational Psychology, 89(4), 743–751. https://doi.org/10.1037/0022-0663.89.4.743

Griffin, B. W. (2004). Grading leniency, grade discrepancy, and student ratings of instruction. Contemporary Educational Psychology, 29, 410-425.

Holmes, D. S. (1972). Effects of grades and disconfirmed grade expectancies on students’ evaluations of their instructor. Journal of Educational Psychology, 63 (2), 130-133.

Kogan, V., Genetin, B., Chen, J., & Kalish, A. (2022). Students’ grade satisfaction influences evaluations of teaching: Evidence from individual-level data and an experimental intervention. EdWorkingPaper No. 22-513. https://doi.org/10.26300/spsf-tc23

Kreitzer, R. J. & Sweet-Cushman, J. (2022). Evaluating student evaluations of teaching: A review of measurement and equity bias in SETs and recommendations for ethical reform. Journal of Academic Ethics, 20, 73-84. https://doi.org/10.1007/s10805-021-09400-w

Hessler, M., Pöpping, D. M., Hollstein, H., Ohlenburg, H., Arnemann, P. H., Massoth, C., Seidel, L. M., Zarbock, A., & Wenk, M. (2018). Availability of cookies during an academic course session affects evaluation of teaching. Medical Education, 52(10), 1064-1072. doi: 10.1111/medu.13627

 Linse, A. R. (2017). Interpreting and using student ratings data: Guidance for faculty serving as administrators and on evaluation committees. Studies in Educational Evaluation, 54, 94-106. http://dx.doi.org/10.1016/j.stueduc.2016.12.004

Olivares, O. J. (2001). Student interest, grading leniency, and teacher ratings: A conceptual analysis. Contemporary Educational Psychology, 26, 382-399.

Stark, P. B. & Freishtat, R. (2014). An evaluation of course evaluations. ScienceOpen Research, 0(0), 1-7. DOI: 10.14293/S2199-1006.1.SOR-EDU.AOFRQA.v1

Stroebe, W. (2016). Why good teaching evaluations may reward bad teaching: On grade inflation and other unintended consequences of student evaluations. Perspectives on Psychological Science, 11(6), 800-816. DOI: 10.1177/1745691616650284

Uttl, B., White, C. A., & Wong Gonzalez, D. (2017). Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies in Educational Evaluation, 54, 22-42. http://dx.doi.org/10.1016/j.stueduc.2016.08.007

Vasta, R., & Sarmiento, R. F. (1979). Liberal grading improves evaluations but not performance. Journal of Educational Psychology, 71(2), 207–211. https://doi.org/10.1037/0022-0663.71.2.207

 

 

 
