Abstract

Background and objectives

Universities throughout the USA increasingly offer undergraduate courses in evolutionary medicine (EvMed), which creates a need for pedagogical resources. Several resources offer course content (e.g. textbooks) and a previous study identified EvMed core principles to help instructors set learning goals. However, assessment tools are not yet available. In this study, we address this need by developing an assessment that measures students’ ability to apply EvMed core principles to various health-related scenarios.

Methodology

The EvMed Assessment (EMA) consists of questions containing a short description of a health-related scenario followed by several likely/unlikely items. We evaluated the assessment’s validity and reliability using a variety of qualitative (expert reviews and student interviews) and quantitative (Cronbach’s α and classical test theory) methods. We iteratively revised the assessment through several rounds of validation. We then administered the assessment to undergraduates in EvMed and Evolution courses at multiple institutions.

Results

We used results from the pilot to create the EMA final draft. After conducting quantitative validation, we deleted items that failed to meet performance criteria and revised items that exhibited borderline performance. The final version of the EMA consists of six core questions containing 25 items, and five supplemental questions containing 20 items.

Conclusions and implications

The EMA is a pedagogical tool supported by a wide range of validation evidence. Instructors can use it as a pre/post measure of student learning in an EvMed course to inform curriculum revision, or as a test bank to draw upon when developing in-class assessments, quizzes or exams.

Lay Summary

Universities increasingly offer undergraduate courses in evolutionary medicine (EvMed); this has created a need for instructional resources. We provide a new assessment that measures students’ ability to apply the core concepts of EvMed to health-related scenarios. Instructors can use it to inform curriculum development by measuring student learning.

Introduction

Evolutionary medicine (EvMed) is a relatively new field that applies an evolutionary lens to deepen our understanding of health and disease [1, 2]. Given its potential to influence medicine and biomedical research, there have been calls to include EvMed in pre-medical education [1]. Efforts to answer this call can be seen in the growing number of courses teaching EvMed at the postsecondary level [6, 7]. A majority of research-intensive universities in the USA now offer at least one undergraduate course that either focuses on EvMed exclusively or includes EvMed as a unit within a more general evolution or health-related curriculum [6]. These courses are typically housed within biology or anthropology departments or cross-listed between the two [6]. The increasing prevalence of EvMed instruction at the undergraduate level has created a need for pedagogical resources, which has been partially addressed by a growing number of instructor resources including textbooks [8], lesson plans or activities [9, 10] and journal articles that include insights for teaching EvMed [3, 11]. While these resources focus on what to teach in EvMed or how to teach EvMed, resources that help assess student understanding of EvMed are largely absent. To fill this gap, we developed and validated the EvMed Assessment (EMA), 45 true-or-false items distributed across 11 question sets.

Assessment and core principles in backward design

The EvMed Assessment was developed as part of a larger effort to assemble pedagogical resources for EvMed using backward design [9]. Backward design is an educational method that stresses the importance of alignment among curricular components [12–14]. With backward design, the instructor first specifies the learning goals, then develops assessments to evaluate mastery of the learning goals, and lastly develops instructional materials that match the assessments and learning goals. The key benefit of this approach is that it facilitates internal alignment of what students are taught, what is assessed, and which concepts the instructor deems most important [15].

To provide EvMed instructors with a resource for setting learning goals, our team (D.G., R.N. and S.B.) previously conducted a study to identify the theoretical principles that underlie EvMed. In that study, we iteratively surveyed 56 biologists, anthropologists, medical doctors and other researchers who contributed to the field of EvMed via publications and participation in the International Society for Evolution, Medicine and Public Health [16]. This survey yielded a list of 14 core principles of EvMed: (i) types of explanation (proximate vs ultimate causes), (ii) evolutionary processes, (iii) reproductive success, (iv) sexual selection, (v) constraints on adaptation, (vi) general evolutionary trade-offs, (vii) life history theory, (viii) levels of selection, (ix) phylogeny, (x) coevolution, (xi) developmental plasticity, (xii) immune defenses, (xiii) environmental mismatch and (xiv) the impact of cultural practices on health and disease [16]. The inclusion of each core principle on this list was approved by at least 80% of the panellists. Agreement with the inclusion of each principle ranged from 80% for sexual selection to 100% for types of explanation, evolutionary processes, cultural practices, and life history theory.

These core principles provide guidance for EvMed instructors as they construct course learning goals [15] and course assessments. However, creating effective assessment questions can be time consuming and difficult, making questions that have been tested and validated across multiple contexts a useful resource. Creating this type of resource that aligns with the core principles for use by EvMed instructors was our goal in developing the EMA.

Assessments with validation evidence

Assessment validity is the extent to which the assessment measures what it claims to measure. More specifically, it is the extent to which inferences drawn from assessment scores accurately reflect students’ actual understanding of the relevant material. Assessments with validation evidence have been rigorously tested to demonstrate that various extraneous factors, such as unclear wording, questionable scientific accuracy of the content, or student use of general test-taking skills, appear to have minimal influence on the scores [17–19]. Two common types of assessments with validation evidence are concept inventories [21] and programmatic assessments [22–24]. Concept inventories are designed to measure students’ understanding of a relatively narrow scientific idea, such as natural selection [20, 25], genetic drift [21] or macroevolution [26]. In contrast, programmatic assessments are designed for an entire field, such as ecology and evolution [22], physiology [24] and general biology [23]. They typically measure a range of core concepts that have been established by a group of disciplinary experts (e.g. the Vision and Change report on undergraduate biology education) and are not limited to one particular course; rather, they assess student learning within a field throughout an entire undergraduate program. Both concept inventories and programmatic assessments aim to produce aggregate results that enable inferences to be made about a curriculum’s efficacy and collective student learning.

While assessments with validation evidence share structural similarities with instructor-generated course exams, these assessment types differ in their primary purpose and design process. Instructor-generated exams usually lack validation evidence and cover a portion of the course, which may or may not be based on specific learning goals [27]. These primarily measure individual students’ learning to provide a summative assessment (i.e. a grade), but are not always effective at measuring collective student learning, which is a primary function of concept inventories and programmatic assessments. This additional function requires higher standards of validity and reliability [17, 28].

Validity and reliability are frameworks for evaluating the quality of the inferences that can be drawn from a measurement tool, such as an assessment or survey [17]. Validity addresses the question of whether an assessment truly measures what researchers want it to measure. This is typically evaluated using both quantitative and qualitative methods that examine (i) the extent to which the assessment accurately represents the relevant knowledge domain, (ii) the extent to which it requires students to use the skills and thought processes that researchers want them to use and (iii) whether assessment items that are designed to measure the same concept get answered the same way [17–19, 29]. Reliability addresses the consistency of an assessment’s measurements. This is typically evaluated using quantitative methods that look at test–retest consistency across individual students or at the consistency of students’ answers across items within a single assessment [17–19, 29]. Creating an exam with this level of validation is time consuming, and thus not practical for typical instructor-generated exams.

Many of the EvMed core principles are general evolutionary concepts that are covered within existing concept inventories; for example, understanding of phylogeny can be measured using the Evolutionary Tree Concept Inventory [30], and understanding of coevolution can be measured using the Host-Pathogen Interactions Concept Inventory [31] (see Supplementary Table 1 for a list of concept inventories that cover various EvMed core principles [20, 21, 26, 30–34]). These existing concept inventories can be used to measure student understanding of individual evolutionary concepts within various non-human contexts. However, the application of evolution to novel contexts in human health and disease may influence how students operationalize these concepts, so there is a need for a new EvMed-specific assessment.

Project goals

Our goal was to provide EvMed instructors with an assessment resource with validation evidence that uses questions at high levels on Bloom’s taxonomy [3, 35] to measure student understanding of the EvMed core principles by requiring students to apply the core principles to a range of novel health-related scenarios. The design of the EMA is similar to that of a programmatic assessment because it covers multiple core concepts within a field, rather than a single relatively narrow idea. However, unlike a programmatic assessment, which is typically administered across multiple courses in an undergraduate program, we expect the EMA to be used more like a concept inventory, primarily within individual courses. We acquired validity evidence for this assessment through expert feedback, student interviews and administration of the assessment in 11 course offerings at universities throughout the USA.

Development

Assessment format and question development

The EMA uses a modified version of the multiple-true-false (MTF) format, in which each question contains an informational premise (stem) followed by several true-or-false statements (items) (Fig. 1) [22–24]. The advantages of the MTF format over the standard multiple-choice format are that (i) students can answer more items in a set amount of time, (ii) the assessment can be used to explore student understanding of multiple principles applied to the same scenario, and (iii) the MTF format has been shown to mimic free-response reasoning by revealing individual misconceptions within a chain of reasoning [21, 23, 36]. We modified the MTF format by replacing the options of ‘true’ and ‘false’ with ‘likely’ and ‘unlikely’ [22, 24]. This helped accommodate the inherent uncertainty in applying evolutionary principles to health-related topics and made it easier to create items that were scientifically accurate and sufficiently difficult. This approach reflects the tentative nature of science [37–39] and has previously been used in evolution and ecology assessment [22].
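To make the scoring of this format concrete, the sketch below (in Python) shows one way a modified MTF question could be scored, with each likely/unlikely item graded independently against a keyed answer; the item identifiers and answer key are hypothetical and not taken from the EMA.

    # Minimal sketch of scoring a likely/unlikely (modified MTF) question.
    # Item identifiers and keyed answers are hypothetical examples.
    from typing import Dict

    ANSWER_KEY: Dict[str, str] = {
        "item_1a": "likely",
        "item_1b": "unlikely",
        "item_1c": "likely",
    }

    def score_question(responses: Dict[str, str]) -> int:
        """Score each item independently; return the number of items
        answered in agreement with the keyed answer."""
        return sum(responses.get(item) == key for item, key in ANSWER_KEY.items())

    # A student who marks every item 'likely' gets credit for two of the three items.
    print(score_question({"item_1a": "likely", "item_1b": "likely", "item_1c": "likely"}))  # 2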

Figure 1. Question format on the EvMed Assessment

The question stems describe a scenario in one to two paragraphs, often accompanied by a table or figure. These scenarios cover a wide range of health-related topics, including genetic conditions, cancer, infectious disease, ageing and mental health. We included a variety of health-related topics to reflect the broad scope of EvMed as a discipline [1, 3, 4]. For each stem, we wrote items that present a single inference or prediction based on information in the stem and test competencies in one or more core principles. While each item was intended to primarily assess one core principle, inherent conceptual overlap led to some items covering additional closely related principles (e.g. life history theory and trade-offs) [16]. The entire set of items within a question covers several core principles that can be applied to the given scenario. To ensure that items assess conceptual understanding rather than knowledge or familiarity with technical terms, we limited the use of scientific and medical jargon.

To develop the first draft of the EvMed Assessment (version 1.0), D.G. and T.M. wrote questions based on the core principles of EvMed [16]. Both researchers had extensive prior teaching and coursework experience in EvMed and general familiarity with the EvMed literature. We wrote stems by condensing information from a wide range of sources and generating hypothetical scenarios based on well-studied phenomena. This is a standard approach to question development for concept inventories that require students to transfer conceptual knowledge to novel scenarios [21–23]. R.N. and S.B. provided feedback on questions. R.N. is an international expert in EvMed, having written several books on the field, and S.B. is an expert in assessment development in biology education. Version 1.0 of the assessment included a total of 70 likely/unlikely items across 14 question stems (Fig. 2). This version included items covering every principle except sexual selection, where we found it impossible to identify examples of evolutionary inferences that could be conveyed with a brief description, had a clear-cut answer [39], and did not resort to gender essentialism.

Figure 2. Development and validation process of the EvMed Assessment. Parenthetical numbers indicate sample size

Qualitative validation: version 1.0 to 2.0

Version 1.0 was revised by a process of iterative evaluation and improvement of its content and substantive validity based on multiple rounds of student interviews and expert reviews that addressed issues identified with individual questions and items. See Fig. 2 for an overview of the entire assessment development process. The Institutional Review Board of Arizona State University approved the procedures for this study (ASU IRB #00010655).

Content validity addresses whether the assessment offers an accurate representation of the relevant knowledge domain; it is typically evaluated using expert review [17–19, 29]. To assess content validity, seven scholars active in teaching or conducting research in EvMed reviewed the assessment for accuracy on two separate occasions. A single expert (R.N.) reviewed the initial draft of the assessment (version 1.0). The other six experts (one PhD student, four postdoctoral scholars and one Assistant Professor) reviewed the updated version 1.2. The reviewers were asked to (i) point out any information that is inaccurate or presented as being more scientifically settled than it really is, (ii) select the correct answer for each item and rate their confidence in their answer on a scale of 1–10 and (iii) provide any other concerns they had (e.g. about clarity or difficulty level). We revised or deleted items if at least one reviewer answered incorrectly with high confidence or deemed the item overly complex or debatable. We also addressed any noted issues with accuracy or clarity as recommended by the reviewers.

Substantive validity addresses whether students are using the skills and thought processes that the developers want them to use when answering items on an assessment [17–19, 29]. This involves checking for issues like students answering correctly despite using incorrect reasoning or answering incorrectly despite using correct reasoning (e.g. because they misinterpreted what the item was asking). To assess substantive validity, we conducted a total of 31 think-aloud interviews with undergraduate students on versions 1.0–1.3 of the assessment. During interviews, students read the question stem out loud and explained the reasoning behind their answer choices. The interviewer took notes and asked clarifying questions about the students’ reasoning. The interviewer noted if the student (i) interpreted the stems and items correctly, (ii) was able to select the correct answer despite using the wrong reasoning or (iii) selected the wrong answer despite using the correct reasoning. We were watchful for the latter two issues because an effective assessment needs to consistently differentiate between correct use of the concepts being tested versus the presence of misconceptions [22–24].

The 31 think-aloud interviews were distributed across four initial rounds of revision (Fig. 2). In order to obtain a sample that reflects the student population for which this assessment is intended, we recruited participants who were currently enrolled in an EvMed course at one institution and recruited late in the semester to ensure they had some level of content knowledge. All interviews were conducted by either T.M. or D.G. and audio recorded. We revised an item or stem any time two or more students displayed the same issue, such as when the use of a particular misconception led to the correct answer, or when multiple students were able to answer an item using only common knowledge without engaging with core concepts. This process resulted in version 2.0 of the assessment, with 14 questions containing 62 items across all questions (Fig. 2).

Quantitative validation of version 2.0

Data collection and processing

To quantitatively evaluate the psychometric properties of the assessment, we conducted a large-scale pilot test of version 2.0. Given the length and cognitive complexity of the assessment, we were concerned about the potential effect of survey fatigue on student attentiveness [22, 40]. To keep testing time under the recommended 30-min time limit for most respondents [22, 40], we reduced the length of the assessment by dividing version 2.0 into seven core questions and seven supplemental questions. When taking the assessment, all students received the seven core questions along with a random sample of five supplemental questions. The core questions were chosen by identifying the minimum number of questions needed to cover all the EvMed core principles. To do so, we categorized items by the principles they tested and mapped out which question sets would provide coverage over an appropriate range of principles while also presenting a variety of health topics.
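As an illustration of this administration scheme, the sketch below assembles one student's question set from the seven core questions plus a random sample of five supplemental questions; the question identifiers are hypothetical placeholders.

    # Sketch of assembling one student's pilot form: all core questions plus
    # five randomly sampled supplemental questions (IDs are placeholders).
    import random

    CORE_QUESTIONS = [f"core_{i}" for i in range(1, 8)]          # 7 core questions
    SUPPLEMENTAL_QUESTIONS = [f"supp_{i}" for i in range(1, 8)]  # 7 supplemental questions

    def build_form(rng: random.Random) -> list:
        """Return the ordered list of questions shown to one student."""
        return CORE_QUESTIONS + rng.sample(SUPPLEMENTAL_QUESTIONS, k=5)

    print(build_form(random.Random(0)))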

Version 2.0 of the assessment was administered online as an out-of-class assignment to 732 undergraduate students across seven different courses at six institutions (Table 1). The assessment was accompanied by a demographic survey. Four of the participating courses administered the assessment in a single semester, two administered it in two consecutive semesters, and one administered it in three consecutive semesters. Thus, the assessment was administered to students in 11 total course offerings; 10 of these were EvMed courses and one was an evolution course (Table 1). We administered the assessment toward the end of term so that the sample population would reflect the target population for the assessment. However, targeting EvMed courses made it difficult to obtain a large sample size because EvMed courses tend to be relatively small. To increase our sample size and evaluate how the assessment performs with students who have a background in evolution, but not EvMed, we also administered the assessment in one high-enrolment evolution course. Participants received a small amount of extra credit based on completion.

Table 1. Overview of courses that administered version 2.0 of the EvMed Assessment

Course | University | Number of semesters | Total students | Final student sample
EvMed | Non-selective R1 HSI; southwest | 1 | 21 | 16
Evolution | Non-selective R1 HSI; southwest | 1 | 305 | 179
EvMed | Non-selective R1; mountain west | 2 | 70 | 54
EvMed | Non-selective M1; southeast | 2 | 32 | 25
EvMed | Very selective R1; west coast | 1 | 45 | 31
EvMed | Selective R1; southeast | 1 | 13 | 11
EvMed | Selective R1; west coast | 3 | 246 | 183

HSI, Hispanic-serving institution. Only students who took at least 10 min to complete the entire assessment were included in the final sample.

Of the 732 students who completed the assessment, 684 (93.4%) consented to their data being used for the study. Because this was a low-stakes assignment, we were concerned that some students may have put in minimal effort (e.g. quickly clicked through the questions). Thus, we inspected how long students took to complete the entire survey and determined that a minimum of 10 min appeared to be an appropriate cut-off, which is consistent with the cut-offs used in similar studies [22, 23]. This eliminated 184 total students from our sample. One additional student was removed because they left most items blank, leaving 499 students (68.3% of initial population) in our final sample. The majority of students in this sample identified as women (76.4%), with a smaller percent of men (21.6%) and non-binary students (2%). Most students identified as White/European American (45.6%), Asian/Asian American (22.8%) or Hispanic/Latinx (12.6%) (see Table 2 for full demographics).
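A minimal sketch of these data-cleaning steps is shown below, assuming a hypothetical response file with columns for consent, completion time in minutes and one column per item; it is not the authors' actual processing script.

    # Sketch of the data-cleaning rules described above (hypothetical column names).
    import pandas as pd

    raw = pd.read_csv("ema_pilot_responses.csv")     # hypothetical file name
    item_cols = [c for c in raw.columns if c.startswith("item_")]

    cleaned = raw[raw["consented"]]                  # keep consenting students (boolean column)
    cleaned = cleaned[cleaned["duration_min"] >= 10]  # drop completions under 10 min
    # Drop respondents who left most items blank.
    cleaned = cleaned[cleaned[item_cols].notna().sum(axis=1) > len(item_cols) / 2]

    print(f"Final sample: {len(cleaned)} of {len(raw)} respondents")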

Table 2. Demographics of students who took part in think-aloud interviews and the pilot study

Student characteristic | Interviews (N = 39), N (%) | Quantitative validation (N = 499), N (%)
Gender
 Men | 10 (25.6%) | 108 (21.6%)
 Women | 29 (74.4%) | 382 (76.4%)
 Nonbinary | 0 (0%) | 10 (2%)
Race/Ethnicity
 Asian/Asian American | 9 (23.1%) | 114 (22.8%)
 Black/African American | 3 (7.7%) | 18 (3.6%)
 Hispanic/Latinx | 11 (28.2%) | 63 (12.6%)
 Native American/Native Hawaiian | 1 (2.6%) | 5 (1%)
 White | 15 (38.4%) | 228 (45.6%)
 Other | 0 (0%) | 72 (14.4%)
Academic year
 Lower (first year/sophomore) | 22 (56.4%) | 140 (28%)
 Upper (junior/senior) | 15 (38.5%) | 353 (70.6%)
 Graduate | 2 (5.1%) | 5 (1%)
Major
 Biology-related fields | 20 (64.5%) | n/a
 Anthropology-related fields | 5 (16.1%) | n/a
 Other | 6 (19.4%) | n/a
 Missing | 8 (20.5%) | n/a
Highest parental education level
 Completed bachelor’s degree | 29 (76.3%) | n/a
 Did not complete bachelor’s degree | 9 (23.6%) | n/a
First language
 English | 28 (71.8%) | n/a
 Not English | 11 (28.2%) | n/a

n/a, not applicable.

Interview sample includes 31 students who participated in the initial qualitative validation (version 1.0–1.3) and 8 who participated in additional post-pilot validation (version 2.0).

Statistical analyses

We used classical test theory (CTT) to examine the psychometric properties of the assessment. CTT is a theoretical framework that is widely used to assess educational and psychological measurement tools, including their overall validity and reliability, as well as the performance of individual items within an assessment [41, 42]. While other frameworks such as item response theory (IRT) have more recently been developed, CTT remains appropriate to use when the sample size does not meet the recommended minimum for IRT [23, 24].

Assessment-level reliability and validity

An assessment’s reliability is based on the consistency of its measurements. One form of reliability is internal consistency, which describes the extent to which all the items in a test measure the same concept. We used Cronbach’s α to measure the internal consistency of version 2.0. For assessments that aim to measure a single construct, internal consistency is considered ‘excellent’ when α > 0.9, ‘good’ when 0.8 < α < 0.9 and ‘acceptable’ when 0.7 < α < 0.8 [43]. Cronbach’s α was 0.71 for the entire assessment and 0.73 for the core questions alone. Given that the EMA measures understanding of multiple concepts, this is an acceptable level of internal consistency.
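For reference, Cronbach's α for k items is α = k/(k − 1) × (1 − sum of item variances / variance of total scores). A minimal sketch of this computation on a students-by-items matrix of 0/1 scores is shown below; the demo data are fabricated, and this is a standard formula rather than the authors' analysis script.

    # Cronbach's alpha for a students x items matrix of 0/1 scores.
    # Demo data are fabricated; this is a standard formula, not the study's script.
    import numpy as np

    def cronbach_alpha(scores: np.ndarray) -> float:
        """scores: rows = students, columns = items (1 = correct, 0 = incorrect)."""
        k = scores.shape[1]
        sum_item_vars = scores.var(axis=0, ddof=1).sum()  # sum of per-item variances
        total_var = scores.sum(axis=1).var(ddof=1)        # variance of total scores
        return (k / (k - 1)) * (1 - sum_item_vars / total_var)

    rng = np.random.default_rng(0)
    demo_scores = (rng.random((499, 25)) > 0.35).astype(int)  # fabricated demo matrix
    print(round(cronbach_alpha(demo_scores), 2))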

An assessment’s validity is based on how well the scores reflect the construct of interest. One form of validity is criterion validity, which is based on the idea that test scores should correlate with variables theorized to be associated with the construct being measured. To evaluate criterion validity, we compared students’ performance on the assessment to their year of study and whether they were enrolled in an evolution course or an EvMed course. Test scores were higher for students who were further along in their educational career and likely had a stronger evolution background. A small sample of first-year students who did particularly well on the assessment were a notable exception, although they may be exceptional students, as evidenced by their enrolment in upper-level courses. Further, we found that students enrolled in EvMed courses scored higher (M = 22.38, SD = 4.53) than students enrolled in the evolution course (M = 19.84, SD = 4.34) (t(389) = 6.13, P < 0.001). These results provide criterion validity evidence that the assessment measures students’ EvMed knowledge (Table 3).
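The course comparison above corresponds to an independent-samples t-test on total scores. The sketch below illustrates the test with placeholder data generated to match the reported group means and standard deviations; the paper does not state which t-test variant was used, so a Welch's test is shown as an assumption.

    # Illustration of the EvMed vs evolution course comparison using placeholder
    # scores drawn to match the reported means and SDs (not the actual study data).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    evmed = rng.normal(loc=22.38, scale=4.53, size=320)      # EvMed-course placeholder scores
    evolution = rng.normal(loc=19.84, scale=4.34, size=179)  # evolution-course placeholder scores

    t, p = stats.ttest_ind(evmed, evolution, equal_var=False)  # Welch's t-test (assumed)
    print(f"t = {t:.2f}, p = {p:.3g}")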

Table 3. Mean student scores on the EvMed Assessment, listed according to academic year and course enrolment

Student group | Percent correct on core | Percent correct on core + supplement
Year in college
 First (n = 8) | 65.6 (6.25) | 65.2 (5.45)
 Second (n = 132) | 63.5 (1.19) | 62.0 (1.03)
 Third (n = 136) | 67.3 (1.30) | 66.5 (1.07)
 Fourth (n = 195) | 69.0 (0.99) | 67.1 (0.87)
 Fifth+ undergraduate (n = 22) | 72.6 (2.72) | 71.5 (2.34)
 Graduate student (n = 5) | 83.1 (3.64) | 78.3 (4.13)
Course
 EvMed (n = 320) | 70.2 (0.79) | 68.7 (0.67)
 Evolution (n = 179) | 62.2 (1.01) | 60.9 (0.86)

Standard errors are in parentheses. One student did not provide their academic year.

Item-level statistics

Validity and reliability are properties of the assessment as a whole. However, it is also important to analyse the performance of individual items. To do this, we examined item difficulty and item discrimination. Item difficulty (P) is the proportion of students who answered the item correctly; it ranges from 0.0 (everyone answered incorrectly) to 1.0 (everyone answered correctly). While there are no standard cut-off values for item difficulty, some recommend values between 0.3 and 0.7 [44]. Item discrimination (D) is a measure of how well an item distinguishes between students with high test scores versus those with low test scores. Discrimination scores range from −1.0 to 1.0; values near 1.0 indicate that mainly high-scoring students answered the item correctly, values near 0.0 indicate that high-scoring and low-scoring students performed equally, and negative values indicate that low-scoring students were more likely to answer correctly [44]. Discrimination scores above 0.2 are recommended [45]. Using this cut-off value, we flagged items with D < 0.2 for further review, as these items are potentially problematic due to insufficient differentiation. A total of 29 out of 62 items were flagged. Items that had both good difficulty scores (0.3–0.7) and good discrimination scores (≥0.2) were deemed suitable for inclusion in version 3.0 without any additional review.
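The paper does not specify the exact discrimination formula used, so the sketch below illustrates these item statistics with two common choices: difficulty as the proportion of students answering correctly, and discrimination as the difference in proportion correct between the top and bottom 27% of total scorers. The demo matrix is fabricated.

    # Item difficulty (P) and an upper-lower (27%) discrimination index (D).
    # The study's exact discrimination formula is not specified; this is one common choice.
    import numpy as np

    def item_statistics(scores: np.ndarray, group_frac: float = 0.27):
        """scores: rows = students, columns = items (1 = correct, 0 = incorrect)."""
        totals = scores.sum(axis=1)
        order = np.argsort(totals)
        n_group = max(1, int(group_frac * scores.shape[0]))
        lower = scores[order[:n_group]]                  # lowest-scoring students
        upper = scores[order[-n_group:]]                 # highest-scoring students
        difficulty = scores.mean(axis=0)                 # P: proportion correct per item
        discrimination = upper.mean(axis=0) - lower.mean(axis=0)  # D per item
        return difficulty, discrimination

    rng = np.random.default_rng(2)
    demo = (rng.random((499, 62)) > 0.3).astype(int)     # fabricated demo matrix
    P, D = item_statistics(demo)
    print(f"Items flagged for D < 0.2: {int((D < 0.2).sum())}")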

Final draft of the assessment (version 3.0)

Creating version 3.0

We began our review of the 29 flagged items by examining each item’s discrimination score, difficulty score, think-aloud interview results and expert feedback from the previous validation stage. We deleted two questions because a majority of items within these questions were flagged for low discrimination, which indicates that these questions were performing poorly as a whole. Additionally, we deleted a third full question and three individual items from other questions because expert reviewers had expressed mild concern about their overall quality. We initially kept these in version 2.0 to further assess their performance, but deleted them after they were flagged for low discrimination. In total, we deleted eight flagged items in this manner.

Conversely, we kept seven flagged items as-is because (i) their discrimination scores were only moderately below the recommended cut-off (0.13 ≤ D < 0.2), (ii) their difficulty scores were within the recommended range and (iii) prior student interviews suggested no validity issues. Based on prior interviews, most of these items appear to display somewhat low discrimination because they test principles for which student misconceptions are particularly widespread.

The remaining 14 flagged items were revised to improve clarity and then re-assessed for substantive validity via additional student interviews. We conducted eight additional interviews with students recruited from an upper-level biology course; all students had taken at least one upper-level course on evolution, but no courses on EvMed specifically (see Table 2 for demographics). Based on these interviews, we deleted one item and kept the remaining 13 items for the final version of the assessment.

Overview of version 3.0

The final version of the EvMed Assessment consists of six core questions and five supplemental questions; each question contains 3–6 items. The six core questions present scenarios within six different health-related topics: genetic conditions, autoimmune disorders, infectious disease, cancer, mental health and drug function. Items in the core questions cover all of the EvMed core principles at least once, with the exception of sexual selection (Supplementary Table 2).

The supplemental questions were designated as such because they (i) address core principles that already have coverage in the core questions and (ii) cover health-related topics that are either already covered in the core questions (e.g. infectious disease) or are less central to EvMed (e.g. zoology). All supplemental questions passed the validity checks and can be added into the EvMed Assessment as one sees fit. A complete copy of the EvMed Assessment can be found in the Supplementary Material.

Discussion

Using the EvMed Assessment

The EMA is a multi-purpose assessment tool that can be used to (i) gather data on student learning to improve instruction, (ii) identify which core principles students find most difficult, (iii) provide a pool of questions with validity evidence to draw from when designing in-class activities or course exams and (iv) compare the efficacy of different teaching settings or styles. When administered at the very beginning of a course, the EMA can be used to identify which core principles are least familiar to students and adjust instruction for that term accordingly. When administered at the end of a course, the EMA can be used to identify which core principles students continue to struggle with and revise future iterations of the course to address those principles more effectively. Instructors can also use the EMA as a test bank of high Bloom’s level questions with validation evidence, which can be used for formative assessments (e.g. worksheets or clicker questions) and summative assessments (e.g. exams). When using the EMA as a test bank, we encourage instructors to pick and choose content from both the core questions and supplemental questions to match the learning goals of the lesson or unit.
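As one illustration of use (ii), the sketch below aggregates per-item scores by the core principle each item primarily targets in order to surface the principles students find most difficult; the item-to-principle mapping and scores are hypothetical placeholders, not the EMA's actual key.

    # Sketch: rank core principles by mean item performance (hypothetical data).
    import pandas as pd

    scores = pd.DataFrame({          # rows = students, columns = items (0/1)
        "item_1a": [1, 0, 1, 1],
        "item_1b": [0, 0, 1, 0],
        "item_2a": [1, 1, 1, 0],
    })
    item_to_principle = {            # which principle each item primarily targets
        "item_1a": "environmental mismatch",
        "item_1b": "trade-offs",
        "item_2a": "environmental mismatch",
    }

    by_principle = (
        scores.mean()                        # proportion correct per item
        .groupby(item_to_principle)          # group items by principle
        .mean()
        .sort_values()                       # hardest principles listed first
    )
    print(by_principle)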

Limitations and future directions

One limitation of the validation process was that most of the expert reviews were conducted by early-career scientists who may not have the same depth of knowledge as more senior researchers. This limitation, combined with the fact that EvMed as a discipline continues to evolve, means that instructors may disagree about whether some assessment items have settled, clear-cut answers. We encourage instructors to leverage their own expertise and the latest EvMed research when making judgement calls about whether to include all items when administering this assessment.

One limitation of the development process of this assessment stems from the fact that education research in EvMed is a relatively new area of inquiry [3–6, 16], and we did not have an extensive literature base on student misconceptions about EvMed core principles to draw upon when designing this assessment. For this reason, we relied on general misconceptions about evolution [46] and our own observations during teaching and learning. However, future research into student misconceptions would be helpful in further development of EvMed education resources. Such research would be particularly useful for principles that are uniquely important to EvMed and where student understanding has not been systematically investigated, such as evolutionary mismatch or trade-offs. This type of work could lay the foundation for more specific concept inventories in EvMed. We encourage other researchers to build on our current work by developing and refining additional assessment resources for EvMed.

Acknowledgements

We thank the EvMed scholars who provided expert reviews of the EMA, the students who participated in think-aloud interviews, the instructors who agreed to distribute the EMA in their courses, and the students who completed the EMA during pilot testing. We also thank the ASU Brownell Biology Education Research Lab for thoughtful discussions about this project.

Author contributions

Taya Misheva (Data curation [equal], Formal analysis [equal], Project administration [lead], Validation [equal], Writing—original draft [lead]), Randolph Nesse (Conceptualization [equal], Funding acquisition [equal], Writing—review & editing [equal]), Daniel Grunspan (Conceptualization [equal], Data curation [equal], Formal analysis [equal], Funding acquisition [equal], Project administration [supporting], Supervision [equal], Writing—review & editing [equal]) and Sara Brownell (Conceptualization [equal], Funding acquisition [equal], Supervision [equal], Writing—review & editing [equal]).

Funding

This project was supported by National Science Foundation Division of Undergraduate Education grant nos. 1712188 and 1818659.

Conflict of interest

None declared.

References

1. Moltzau Anderson J, Horn F. (Re-) Defining evolutionary medicine. Ecol Evol 2020;10:10930–6.
2. Natterson-Horowitz B, Aktipis A, Fox M et al. The future of evolutionary medicine: sparking innovation in biomedicine and public health. Front Sci 2023;1:01–18. DOI: 10.3389/fsci.2023.997136.
3. Graves JL, Reiber C, Thanukos A et al. Evolutionary science as a method to facilitate higher level thinking and reasoning in medical training. Evol Med Public Health 2016;2016:358–68.
4. Nesse RM, Bergstrom CT, Ellison PT et al. Making evolutionary biology a basic science for medicine. Proc Natl Acad Sci 2010;107:1800–7.
5. Antolin MF, Jenkins KP, Bergstrom CT et al. Evolution and medicine in undergraduate education: a prescription for all biology students. Evol Int J Org Evol 2012;66:1991–2006.
6. Grunspan DZ, Moeller KT, Nesse RM et al. The state of evolutionary medicine in undergraduate education. Evol Med Public Health 2019;2019:82–92.
7. Hidaka BH, Asghar A, Aktipis CA et al. The status of evolutionary medicine education in North American medical schools. BMC Med Educ 2015;15:38.
8. Perlman R. Evolution and Medicine. Oxford, New York: Oxford University Press, 2013.
9. International Society for Evolution, Medicine, and Public Health. ISEMPH—EvMedEd. EvMedEd 2023.
10. Michigan State University. Evo-Ed: making evolution relevant and accessible. EvoEd 2023.
11. Smith VH, Rubinstein RJ, Park S et al. Microbiology and ecology are vitally important to premedical curricula. Evol Med Public Health 2015;2015:179–92.
12. Wiggins G, McTighe J. Understanding by Design. Alexandria, VA: ASCD, 2005.
13. Reynolds HL, Kearns KD. A planning tool for incorporating backward design, active learning, and authentic assessment in the college classroom. Coll Teach 2017;65:17–27.
14. Teasdale R, Aird H. Aligning multiple choice assessments with active learning instruction: more accurate and equitable ways to measure student learning. J Geosci Educ 2023;71:87–106.
15. Orr RB, Csikari MM, Freeman S et al. Writing and using learning objectives. CBE Life Sci Educ 2022;21:fe3.
16. Grunspan DZ, Nesse RM, Barnes ME et al. Core principles of evolutionary medicine: a Delphi study. Evol Med Public Health 2018;2018:13–23.
17. Campbell CE, Nehm RH. A critical analysis of assessment quality in genomics and bioinformatics education research. CBE Life Sci Educ 2013;12:530–41.
18. Mead LS, Kohn C, Warwick A et al. Applying measurement standards to evolution education assessment instruments. Evol Educ Outreach 2019;12:5.
19. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. The Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association, 2014.
20. Anderson DL, Fisher KM, Norman GJ. Development and evaluation of the conceptual inventory of natural selection. J Res Sci Teach 2002;39:952–78.
21. Price RM, Andrews TC, McElhinny TL et al. The genetic drift inventory: a tool for measuring what advanced undergraduates have mastered about genetic drift. CBE Life Sci Educ 2014;13:65–75.
22. Summers MM, Couch BA, Knight JK et al. EcoEvo-MAPS: an ecology and evolution assessment for introductory through advanced undergraduates. CBE Life Sci Educ 2018;17:ar18.
23. Couch BA, Wright CD, Freeman S et al. GenBio-MAPS: a programmatic assessment to measure student understanding of vision and change core concepts across general biology programs. CBE Life Sci Educ 2019;18:ar1.
24. Semsar K, Brownell S, Couch BA et al. Phys-MAPS: a programmatic physiology assessment for introductory and advanced undergraduates. Adv Physiol Educ 2019;43:15–27.
25. Nehm RH, Beggrow EP, Opfer JE et al. Reasoning about natural selection: diagnosing contextual competency using the ACORNS instrument. Am Biol Teach 2012;74:92–8.
26. Nadelson LS, Southerland SA. Development and preliminary evaluation of the measure of understanding of macroevolution: introducing the MUM. J Exp Educ 2009;78:151–90.
27. Wright C, Huang A, Cooper K et al. Exploring differences in decisions about exams among instructors of the same introductory biology course. Int J Scholarsh Teach Learn 2018;12:01–13. DOI: 10.20429/ijsotl.2018.120214.
28. Knight J. Biology concept assessment tools: design and use. Microbiol Aust 2010;31:5–8.
29. Artino AR, La Rochelle JS, Dezee KJ et al. Developing questionnaires for educational research: AMEE Guide No. 87. Med Teach 2014;36:463–74.
30. Kummer TA, Whipple CJ, Bybee SM et al. Development of an evolutionary tree concept inventory. J Microbiol Biol Educ 2019;20:20.2.42.
31. Marbach-Ad G, Briken V, El-Sayed NM et al. Assessing student understanding of host pathogen interactions using a concept inventory. J Microbiol Biol Educ 2009;10:43–50.
32. Smith JJ, Cheruvelil KS, Auvenshine S. Assessment of student learning associated with tree thinking in an undergraduate introductory organismal biology course. CBE Life Sci Educ 2013;12:542–52.
33. Kalinowski ST, Leonard MJ, Taper ML. Development and validation of the conceptual assessment of natural selection (CANS). CBE Life Sci Educ 2016;15:ar64.
34. Perez KE, Hiatt A, Davis GK et al. The EvoDevoCI: a concept inventory for gauging students’ understanding of evolutionary developmental biology. CBE Life Sci Educ 2013;12:665–75.
35. Anderson LW, Krathwohl DR (eds.). A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives. New York: Longman, 2001.
36. Brassil CE, Couch BA. Multiple-true-false questions reveal more thoroughly the complexity of student thinking than multiple-choice questions: a Bayesian item response model comparison. Int J STEM Educ 2019;6:16.
37. Popper K. The Logic of Scientific Discovery. Routledge, 1959.
38. Mueller S, Reiners CS. Pre-service chemistry teachers’ views about the tentative and durable nature of scientific knowledge. Sci Educ 2022. DOI: 10.1007/s11191-022-00374-8.
39. Morrow EH. The evolution of sex differences in disease. Biol Sex Differ 2015;6:5.
40. Revilla M, Höhne JK. How long do respondents think online surveys should be? New evidence from two online panels in Germany. Int J Mark Res 2020;62:538–45.
41. Lord FM, Novick MR. Statistical Theories of Mental Test Scores. Addison-Wesley, 1968.
42. Cappelleri JC, Lundy JJ, Hays RD. Overview of classical test theory and item response theory for quantitative assessment of items in developing patient-reported outcome measures. Clin Ther 2014;36:648–62.
43. Crocker LM, Algina J. Introduction to Classical and Modern Test Theory. New York: Holt, Rinehart, and Winston, 1986.
44. Ashraf ZA, Jaseem K. Classical and modern methods in item analysis of test tools. Int J Res Rev 2020;7:397–403.
45. Bichi AA. Classical test theory: an introduction to linear modeling approach to test and item analysis. 2016;02.
46. University of California Berkeley. Misconceptions about evolution. Underst Evol 2021.

Author notes

Daniel Z. Grunspan and Sara E. Brownell are co-last authors.

Retired.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.