A quick guide to understanding item statistics and how to use them.
By Rob McEntarffer, PhD, and Jen Schlicht, MS | December 14, 2021
- Teaching Psychology as a Subject
- Testing, Assessment, and Measurement
Cite This Article
McEntarffer, R., & Schlicht, J. (2021, December 14). Interpreting item statistics. https://www.apa.org/ed/precollege/psychology-teacher-network/activities-books/interpreting-item-statistics
Some teachers use online assessment systems (often embedded inside district Student Information Systems) that report “item statistics” for their assessments. These item statistics are potentially useful but can be confusing. The purpose of this quick guide is to help teachers interpret item statistics in order to revise items on assessments.
The screenshot below shows output from an online assessment system. This particular output is from GradeCam, but several other online assessment tools produce similar reports.
(Note: only part of the output for the 28-item test is included in this screenshot.)
Interpreting the Item Reliability Primary Key screenshot
The top of the report shows overall descriptive statistics about the test (average/mean, range, median, standard deviation, and Cronbach's alpha/KR20).
Cronbach's Alpha/KR20 = a measure of how well a whole test "hangs" together.
- If I design a 20-item test all about research methods, there should be some level of "connection" between all the items since they are all about the same "thing" in a broad sense. Students who do well on one item should do well on the test as a whole (probably) and students who do well on the test as a whole should (mostly) do well on individual items.
- The Cronbach's alpha/KR20 for the whole test = .75. That's OK for a 28-item test; .7 and above is a rough “standard.” (A sketch of how this statistic can be computed appears below.)
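For the curious, here is a minimal sketch of how KR-20 (the version of Cronbach's alpha for right/wrong items) can be computed by hand. The `scores` matrix here is a hypothetical made-up example, not output from any particular system.

```python
# A minimal sketch of KR-20 (Cronbach's alpha for dichotomous items).
# `scores` is a hypothetical matrix: one row per student, one 0/1 entry per item.

def kr20(scores):
    """KR-20 reliability for a matrix of 0/1 item scores (rows = students)."""
    n_students = len(scores)
    n_items = len(scores[0])
    # p = proportion of students answering each item correctly
    p = [sum(row[i] for row in scores) / n_students for i in range(n_items)]
    # Sum of item variances; for a 0/1 item the variance is p * (1 - p)
    sum_pq = sum(pi * (1 - pi) for pi in p)
    # Variance of students' total scores (population variance)
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

# Example with made-up data: 4 students, 3 items
scores = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, 0]]
print(round(kr20(scores), 2))
```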
In the item-by-item list below the descriptive statistics, you get “item statistics” for each item.
- The Correct % is the difficulty score (% of students who got the item right).
- The Pt Biserial (point biserial) is the discrimination score. You can think of a discrimination score as a correlation between how students answer that item (correct or incorrect) and how they score on the test overall (high or low). Discrimination scores can be a "red flag" about a problematic item. If an item has a low discrimination score (or worse: a negative discrimination score), it means that students who score well on the test overall do not score well on that item, or vice versa.
- When you have point biserials/discrimination scores below .1 (close to zero or negative), you want to look very closely at those items and probably change something. If they are between .1 and .4, the item may be fine, but it's still worth checking whether anything about it stands out that you may want to change so it measures exactly what you want to measure.
- The Cronbach Del statistic stands for “Cronbach Deleted”: it is what your overall Cronbach's alpha would be if that item were deleted. If you spot items whose deletion would increase the overall Cronbach's alpha, that's another hint that the item may not be as effective as it should be. (A sketch of how all three columns can be computed follows this list.)
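Putting the three columns together, here is a rough sketch of how these statistics can be computed. The exact formulas vendors use can differ (for example, some tools correlate the item against the total *minus* that item, a "corrected" point biserial). This sketch assumes the `kr20` function from the earlier example is in scope and uses the same hypothetical 0/1 `scores` matrix.

```python
# A sketch of the three item statistics described above, for a hypothetical
# 0/1 `scores` matrix (rows = students). Assumes `kr20` from the earlier sketch.

def pearson(x, y):
    """Plain Pearson correlation; with a 0/1 x, this is the point biserial."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # undefined when everyone answers the same way
    return cov / (sx * sy)

def item_statistics(scores):
    n_students, n_items = len(scores), len(scores[0])
    totals = [sum(row) for row in scores]
    stats = []
    for i in range(n_items):
        item = [row[i] for row in scores]
        difficulty = 100 * sum(item) / n_students   # Correct %
        pt_biserial = pearson(item, totals)         # discrimination score
        # "Cronbach Del": overall alpha recomputed with this item removed
        rest = [[row[j] for j in range(n_items) if j != i] for row in scores]
        stats.append((i + 1, difficulty, pt_biserial, kr20(rest)))
    return stats
```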
Using item statistics to improve a test: Case study on item 8
- Glancing at the stats for these eight items, the teacher wonders about item 8. When we analyze item statistics, we look carefully at any item with a difficulty score above about 80-90% and a discrimination score below .4. Together, those two statistics indicate that an item might not be measuring much (e.g., almost everyone gets it correct no matter how well they do on the rest of the test).
- Conversely, very difficult items (difficulty scores below 20-30%) with low discrimination scores may not be measuring much either (e.g., hardly anyone gets them correct no matter how well they do on the rest of the test). A sketch of these screening rules appears after the case study below.
- Item 8 is an “easy” item (92% correct), but it doesn't discriminate at all (Pt Biserial = .03). The teacher decides to look carefully at that item to see if the right answer is obvious for some reason.
8. The biologist Jane Goodall lived with wild chimpanzees for years, researching social systems while striving to not interfere with the chimpanzee groups in any way if possible. Which of the following is the most accurate term for the research method she used?
A. Survey
B. Case study
C. Experiment
D. Correlational research
E. Naturalistic observation
This Jane Goodall question might be okay—it just may be something students know really well. The teacher decides to try to remove some of the “clues” in the stem of the question (because the teacher wants to measure whether students just know that Goodall did naturalistic observation work without any other information provided).
8. Jane Goodall’s research with chimpanzees primarily involved which of the following research methods?
A. Survey
B. Case study
C. Experiment
D. Correlational research
E. Naturalistic observation
This may or may not be a better question! The teacher will try that item out with students the next time they give the test and will monitor item statistics and adjust accordingly.
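The review rules the teacher applied above can be written down as a small screening pass over the item statistics. This is only a sketch of the heuristics described in this article; the thresholds (85%, 25%, .4, .1) are rough judgment calls from the discussion above, not hard rules.

```python
# A sketch of the screening heuristics from this article. The thresholds are
# rough cutoffs, not hard rules; adjust them to your own tests.

def flag_items(stats, easy=85.0, hard=25.0, weak=0.4, very_weak=0.1):
    """stats: (item number, difficulty %, point biserial, alpha-if-deleted) tuples."""
    flags = []
    for item_num, difficulty, pt_biserial, _alpha_deleted in stats:
        if pt_biserial < very_weak:
            flags.append((item_num, "discrimination near zero or negative: revise"))
        elif difficulty > easy and pt_biserial < weak:
            flags.append((item_num, "very easy and weakly discriminating: review"))
        elif difficulty < hard and pt_biserial < weak:
            flags.append((item_num, "very hard and weakly discriminating: review"))
    return flags

# Item 8 from the case study gets flagged: 92% correct, Pt Biserial = .03
# (the alpha-if-deleted value is unknown here, so it is left as None)
print(flag_items([(8, 92.0, 0.03, None)]))
```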
Conclusion
Many classroom teachers aren’t familiar with these kinds of item statistics and how to use them (we have more important things to worry about). Hopefully this short guide will help some teachers use information provided by online assessment systems in ways that may improve our classroom assessments and better enable us to measure what students know and are able to do.
About the authors
Rob McEntarffer, PhD, taught psychology, AP psychology, and philosophy for 13 years at Lincoln Southeast High School in Lincoln, Nebraska, and was involved with the AP psychology reading and APA/TOPSS for many years. While teaching, he became interested in educational measurement issues and got a master’s degree in educational measurement (qualitative and quantitative methods) from the University of Nebraska–Lincoln in 2003. McEntarffer started his work as an assessment/evaluation specialist with Lincoln Public Schools in 2005 and works with the district on large-scale and classroom assessment issues. McEntarffer earned his PhD in teaching, learning, and teacher education in 2013, focusing his research on how teachers make room for formative assessment processes in their classrooms. McEntarffer lives with his wife, two kids, dog, and cat in Lincoln, Nebraska. He is an AP Research teacher and an assessment/evaluation specialist for Lincoln Public Schools.
Jennifer Schlicht teaches AP psychology and intro to psychology at Olathe South High School in the suburbs of Kansas City. She is the past chair of TOPSS, part of the National Council for the Social Studies Psychology Community Leadership Team, and an AP reading table leader. Schlicht enjoys traveling and watching trash television and lives in Olathe, Kansas, with her husband and Australian shepherd. This is her 23rd year teaching high school.