What should Teaching American History (TAH) evaluation programs evaluate? Of course, the most obvious answer would be that they should evaluate the success of the programs. But what constitutes success? This is a much more challenging question.
Our team of researchers at the University of Maryland has been conducting evaluations of TAH programs since first-round grants were vetted. This fall we begin evaluating the fifth of these programs here in Maryland. As we began our evaluation work, we conceptualized the question of measuring success around trying to understand knowledge growth among the history teachers who participated in these programs. After all, it seemed to us, that's what these programs were fundamentally designed to do—enhance the knowledge of participants in order to better prepare them to teach history. Here again, we encountered tough questions: What does it mean to enhance teachers' historical knowledge? What do we mean when we say knowledge? And how do you measure gains?
Conceptualizing Evaluation Criteria
Drawing from a growing body of research in history education, we conceptualized that knowledge as comprising three tightly interwoven types: (a) foreground substantive knowledge, (b) background substantive knowledge, and (c) procedural or strategic knowledge.
We defined foreground substantive knowledge as ideas and understandings of what happened in the American past, engaged in by whom, for what reasons, and to what ends. This form of knowledge is what we typically read about in American history books: accounts of what happened and what it meant. Background substantive knowledge turns on ideas historical investigators impose on an unruly, broadly temporalized past in order to corral its unwieldy nature and give it some meaning useful to readers. Ideas such as historical significance, causation, change over time, chronological sweep, evidence, and historical contextualization make up concepts of the background type. Procedural or strategic knowledge involves using background concepts together with cognitive processes in order to arrive at foreground substantive understandings. Being able to ask historical questions, to seek out and assess sources as evidence for making claims, to know how to evaluate the validity and reliability of sources, and to build interpretations requires strategic knowledge.
Time-series Design
To assess change in teachers' knowledge of the three types, we created a complex instrument that we could use in a time-series design. This meant that we could administer the instrument before teachers began the TAH program and again after they had completed it, or at various intervals along the way to the end of the three-year funding cycle. This allowed us to measure baseline knowledge against changes brought about by the program's intervention elements. It also allowed us to ask TAH program directors to solicit comparison group teachers to take the assessment so that we could compare scores between participants and nonparticipants in a quasi-experimental design. This has proven workable and productive, although it sometimes has been difficult to get comparison group teachers to return to take the assessment a second time.
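The pre/post comparison at the heart of this design can be sketched in a few lines. This is a hypothetical illustration, not the actual HKTA analysis: the scores, group sizes, and simple difference-in-gains calculation below are all invented for the example.

```python
# Hypothetical sketch of the pre/post, quasi-experimental comparison
# described above. All data are invented for illustration.

def mean(xs):
    return sum(xs) / len(xs)

def gain_scores(pre, post):
    """Per-teacher knowledge gain: post-test score minus pre-test score."""
    return [b - a for a, b in zip(pre, post)]

# Invented example data: assessment scores before and after the program.
participants_pre  = [52, 61, 47, 58]
participants_post = [68, 72, 55, 70]
comparison_pre    = [50, 63, 49]
comparison_post   = [53, 64, 50]

participant_gain = mean(gain_scores(participants_pre, participants_post))
comparison_gain  = mean(gain_scores(comparison_pre, comparison_post))

# The difference in mean gains estimates the program's effect,
# net of whatever change the comparison group shows on its own.
effect = participant_gain - comparison_gain
print(f"Participant gain: {participant_gain:.1f}")
print(f"Comparison gain:  {comparison_gain:.1f}")
print(f"Estimated effect: {effect:.1f}")
```

A real analysis would of course attend to attrition (the comparison teachers who never return for the second administration) and to statistical uncertainty, but the baseline-versus-change logic is the one shown here.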
Our most significant challenge involved figuring out what sort of items to create to measure these differing types of knowledge. The assessment needed to be relatively efficient to administer, repeatable without practice effects, reasonably reliable, and high in construct validity. We settled on a rather heavy reliance on multiple-choice, forced-choice items in each of the knowledge types. However, because history is an ill-structured knowledge domain (meaning that problems worth studying can be defined in multiple ways with varying interpretive results), we turned the multiple-choice items effectively upside down. By this I mean that, instead of positing only one correct answer to each item, we offered three acceptable possibilities, with only one of the four options being a patently incorrect distractor.
With considerable effort, we structured the three acceptable options into a descending order from most to least acceptable and weighted them accordingly. This structure has allowed us to disaggregate item scores to show the direction of movement in teachers' responses (toward stronger or weaker knowledge) and to map the multiply-interpretive, ill-structured nature of history domain knowledge onto the items themselves.
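As a rough illustration, the weighted-item scheme might be implemented as follows. The specific weights (3 for the most acceptable option down to 0 for the patently incorrect distractor), the item names, and the example responses are all assumptions for the sketch; the text does not specify the HKTA's actual weighting.

```python
# Hypothetical weighted scoring for "upside-down" multiple-choice items:
# three acceptable options ordered most-to-least acceptable, plus one
# patently incorrect distractor. Weights and items are invented.

# Each item maps its four answer choices to an assumed weight.
ITEM_WEIGHTS = {
    "item_1": {"a": 3, "b": 2, "c": 1, "d": 0},  # "d" is the incorrect distractor
    "item_2": {"a": 0, "b": 3, "c": 2, "d": 1},  # "a" is the incorrect distractor
}

def score_response(item, choice):
    """Weight earned for choosing a given option on an item."""
    return ITEM_WEIGHTS[item][choice]

def movement(item, pre_choice, post_choice):
    """Direction of movement between two administrations:
    positive = toward stronger knowledge, negative = toward weaker."""
    return score_response(item, post_choice) - score_response(item, pre_choice)

print(movement("item_1", "c", "a"))  # moved toward the strongest option
print(movement("item_2", "b", "d"))  # moved toward a weaker option
```

Disaggregating scores this way shows not just whether a teacher answered "correctly," but which of the defensible interpretations they favored and how that preference shifted over the program.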
Assessment Tools
To augment these items, we constructed a DBQ-style essay we ask teachers to write. We purposely chose events about which a variety of interpretations are possible based on conflicting testimony provided in the four documents teachers read and on the basis of which they are asked to craft their responses. We score these essays using a complex 21-point rubric that has five key categories (e.g., contextualizes interpretation, assesses the status of sources used). This single essay, we have found, is the most knowledge-sensitive element of the assessment and correlates highly with the three types of knowledge the multiple-choice items measure.
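A rubric total of this kind can be sketched as below. Only the two category names mentioned above are from the actual rubric; the remaining category names and the per-category point allocations (chosen here to sum to the 21-point total) are invented for illustration.

```python
# Hypothetical sketch of scoring the DBQ-style essay against a
# 21-point, five-category rubric. Two category names come from the
# text; the other three, and all point allocations, are invented.

RUBRIC_MAX = {
    "contextualizes interpretation": 5,
    "assesses the status of sources used": 5,
    "hypothetical category 3": 4,
    "hypothetical category 4": 4,
    "hypothetical category 5": 3,
}
assert sum(RUBRIC_MAX.values()) == 21  # rubric total from the text

def score_essay(category_scores):
    """Sum category scores after checking each against its maximum."""
    for category, points in category_scores.items():
        if not 0 <= points <= RUBRIC_MAX[category]:
            raise ValueError(f"{category}: {points} out of range")
    return sum(category_scores.values())

# Partial scoring of the two categories named in the text.
print(score_essay({
    "contextualizes interpretation": 4,
    "assesses the status of sources used": 3,
}))  # prints 7
```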
We also borrowed from the research literature in educational psychology to design two additional scales that we include in the instrument—interest and epistemological stance. We know from the research literature that if an intervention program does not elicit interest from participants, their knowledge is unlikely to change. We also know from a different research literature that to think historically in ways that enable deeper historical understandings, teachers need to conceptualize history as an interpretive domain, ill-structured in its problem spaces, and prone to regular revision.
To understand history as such, those who investigate it and apply its forms of knowledge need to work from a set of criteria for what counts in making sense of the past. Assuming that history falls from the sky, authorless and ready-made, tends to cognitively handcuff teachers, especially when facing conflicting testimonies from the past. The epistemology scale attempts to measure changes in teachers' understandings of the bases and warrants for historical knowledge and correlates them with other items on the assessment.
This instrument, called the HKTA (Historical Knowledge and Teaching Assessment), produces a rich array of powerful data. It sheds considerable light on what teacher participants know, what they can do with what they know, and how their ideas change (or not) across the programs' durations. Most importantly, results provide project partners with feedback on the strengths and weaknesses of the interventions and on ways they can make changes, as the programs evolve, to grow participants' knowledge of American history.
What We Learned
We have learned many things from using this assessment tool. Because of its complexity and number of scales, we have struggled to keep its length reasonable so it can be administered in a relatively short timeframe. We have found that after about an hour's duration, teachers begin to tire (although generally they take the assessment in good spirit and sometimes seek out their personal scores which we release only to them on individual request). Given the richness of ideas and constructs we are trying to sample—so as to provide sound feedback to project partners—this creates tradeoffs for us that we have had to manage carefully. Rich data collection has to be weighed against economies of efficiency in assessment administration time.
The epistemology scale has created additional concerns. Its Likert-scale items are prone to social-desirability effects in item selection. To date we have been reluctant to release this scale's outcomes because we are still sorting out how validly it measures teachers' epistemological stances.
The most important lesson from administering the assessment has come when we report data back to project partners. As I noted, the HKTA exposes both strengths AND weaknesses in the TAH programs. Though this is as intended, we have frequently found it difficult to communicate weaknesses to partners who invest so much energy in producing powerful programs.
The struggle here often turns on helping historians, who serve as content experts, understand what the assessment tells us about transferring that content into history lessons for pre-collegiate students. This is a language historians are, understandably, least familiar with. In particular, the assessment reveals gaps between the efforts of the historians and those of the pedagogy experts assigned to the projects. Such gaps can be delicate observations to convey. It has helped that the various scales on the HKTA generate data useful for strengthening these connections.