Discotest Initiative

Theo L. Dawson, Ph.D., Lectica, Inc.

Author's note

Special thanks to Kurt W. Fischer, Zak Stein, Clint Fuhs, Aiden Thornton, and David Baptista for their contributions to this article. I am particularly indebted to Dr. Fischer—for Skill Theory, for his powerful record of working collaboratively across paradigms, and for his warmth and generosity as mentor and collaborator. I also want to express my appreciation to all of the DiscoTest Initiative’s advisors and long-term supporters, especially Becky Reese, Alden Blodget, Tom Curren, Marc Schwartz, and Sharon Solloway, each of whom has stood by this work for many years.

Abstract

As a doctoral candidate in human development at UC Berkeley 1996, I witnessed two disturbing trends—the first in educational assessment and the second in educational research. On the assessment side, large scale high stakes “accountability” testing was beginning to take hold. Technological and methodological constraints dictated that these tests would be composed of questions with right and wrong answers. In other words, they would focus on the application of facts, vocabulary, definitions, rules, and procedures, and be limited to domains of knowledge that would easily lend themselves to these targets, such as science & mathematics. There would be little attention to demonstrations of deep comprehension or skills for thinking, communicating, interacting, and lifelong learning. Given that humans tend to value what we test (Hofman, Goodwin, & Kahl, 2015), it seemed clear that these assessments would drive instruction toward more didactic and alienating pedagogies focused on facts, vocabulary, definitions, rules, and procedures, and away from more progressive, engaging, skill- and understanding-focused pedagogies (Firestone, Frances, & Schorr, 2004; Afflerbach, 2005; Hursh, 2008; Dawson & Stein, 2011; Amrein & Berliner, 2003). It also seemed likely that curricula would come to focus more on tested than untested subjects (Heafner & Fitchett, 2012; Pederson, 2007).

The second trend of concern was the increasing popularity of educational research focused on speeding up learning and development (Duckworth, 1979). High stakes tests of right and wrong answers, coupled with the unquestioned assumption that learning (or developing) faster was a good thing seemed like a recipe for disaster (Schwartz, Sadler, Sonnert, & Tai, 2009). It wasn’t difficult to imagine a future in which content standards would become increasingly unrealistic, and doing well on tests would become the primary driver of curricula and instruction—especially for disadvantaged populations (Lipman, 2004). This possibility raised important educational, ethical, and justice issues (Dawson & Stein, 2011; Stein, 2016). The work described in this article began with a simple question that stemmed from these concerns: Would it be possible to build a scalable and ethically defensible technology for delivering standardized educational assessments that (1) help preserve our children’s inborn love of learning, (2) measure and support deep understanding, and (3) build essential skills for thinking, communicating, interacting, and lifelong learning?

Keywords: cognitive development, Skill Theory, educational assessment, performance assessment, formative assessment, learning theory, virtuous cycle of learning, developmental maieutics

Learning and educational assessment

“Would it be possible to build a scalable and ethically defensible technology for delivering standardized educational assessments that (1) help preserve our children’s inborn love of learning, (2) measure and support deep understanding, and (3) build essential skills for thinking, communicating, interacting, and lifelong learning?”

This is the question that launched what is now known as the DiscoTest initiative—a 20-year journey during which my colleagues and I—including my mentor, Dr. Kurt W. Fischer— have developed research methods, conducted research, invented technologies, and listened to hundreds of educators and thousands of test-takers in an effort to chip away at the challenging task of translating developmental theory into educational practice. In this article, I will expand upon the three broad goals listed above, then share some of the highlights of our methods, research, technological innovations, and perhaps most importantly, what we’ve learned about translating theory into assessment practice.

I will also be introducing some necessary jargon, including the terms Goldilocks Zone, Virtuous Cycle of Learning (a.k.a., VCoL or VCoL+7), robust learning, developmental maieutics, Lectical Level, Lectical Dictionary, and lexicating.

Goal 1: preserving children’s love of learning

Dr. Fischer frequently claimed that “the most common way students finish the sentence ‘Learning is…’ is with the word, ‘boring.’” (personal communication) When I speak to lay audiences, I often open talks about interest and motivation in learning with a story from my first career, “When I was a midwife, I met over 500 newborns. All of them learned to walk. In fact, they were so intent on learning to walk they behaved like addicts. No matter how much discouragement they received in the form of bumps on the head and other injuries, they just got up and tried again. It seemed like nothing could stop them.” I then ask, “Who in the audience has children over age 10? Please raise your hand. Keep your hand up if your child seemed addicted to learning at age 1. Okay, now keep your hand up if your child seemed addicted to learning at age 4. Now, keep your hand up if your child seemed addicted to learning at age 8.” By the time we get to age 8, very few hands are still in the air. It could be that parents just get tired of holding up their hands, but I don’t think so. Something happens between age 4 and age 8 that changes the way most children experience learning—something that shouldn’t happen.

Children are born with a passion for learning (Gopnik, Meltzoff, & Kuhl, 1999). This love of learning is reinforced by the brain’s dopamine- and opioid-driven motivation-reward cycle when learning conditions are optimal—in other words, when (1) the learner is interested, (2) the task in view is appropriately challenging, and (3) the learner has access to necessary resources and support (Hamid et al., 2013; Klenowski, Morgan, & Bartlett, 2014; Laurent, Morse, Ashleigh K., Balleine, Bernard W., 2015). This intersection of interest, suitability, and support is what we call the Goldilocks Zone.

Learning to walk in infancy exemplifies optimal learning conditions. The learner is intrinsically interested, is in control of the difficulty of the task, and generally has access to all of the necessary resources, including social and physical supports. Learning in school, for most children, is different. Dr. Fischer estimated that the percentage of students who regularly have an opportunity to learn in optimal conditions is about 20% (personal communication).

Clearly, we don’t intentionally send our children to school to learn to dislike learning, yet this is exactly what happens to many of them—they lose their sense of learning as a pleasurable activity. Consider students in the “bottom half” of the class. They repeatedly experience class material as too difficult. The result is frequent failure and accompanying negative feedback in the form of grades, scores, and adult affect. Students at the top of the class don’t fare much better. Even if they qualify for enrichment, gifted students find most school material too easy, which often leads to chronic boredom. Students in both groups can become disengaged, alienated, and distrustful of authority figures. They may even look for other, less healthy ways to activate the motivation-reward cycle (McIntosh, MacDonald, & McKeganey, 2005; Paulson, Coombs, & Richardson, 1990).

People are more likely to become lifelong learners if they are educated in a way that preserves their inborn love of learning (Csikszentmihalyi, Rathunde, & Whalen, 1997). Educational practices that undermine this love of learning are unfortunate both for individuals and society (UNESCO, 2015; Stein, 2016).

Optimal learning conditions support optimal learning

When educators create learning conditions that support the brain’s motivation-reward cycle, they are also creating conditions for optimal learning. As many 20th-century educational researchers discovered and rediscovered, optimal learning involves an iterative process that is a kind of virtuous cycle, a feedback loop that begins with an appropriate learning objective, is put to the test through goal-directed action, and increases understanding and skill through feedback from the environment (Werner, 1957; Piaget, 1977; Skinner, 1984; Bandura, 1977; Fischer & Bidell, 2006; Campbell & Bickhard, 1987; Mareschal et al., 2007). This virtuous cycle view of learning is broadly applicable, characterizing neural-network models (Spitzer, 1999), language acquisition (Tomasello, 2005) and even political theory and governance (Buck & Villines, 2007).

Baldwin and Piaget first characterized this feedback loop as an ongoing attempt to achieve balance between assimilation and accommodation (Baldwin, 1906; Piaget, 1985). More recently, neuroscientists have observed chemical cycles in the brain that motivate learning and enhance memory (Lisman & Grace, 2005; Wittmanna, Schiltzb, Boehlerc, & Düzel, 2008). Whether viewing human learning from the perspective of neurons, psychological states, or behavior, success appears to depend on engaging in virtuous cycles—positive feedback loops—that involve appropriate learning objectives, knowledge acquisition, goal-directed actions that put new knowledge or skills to the test, and actionable feedback.

Getting learning conditions just right

Early in the 20th century, Vygotsky (1978), who like other early constructivists, viewed learning as a virtuous feedback cycle, argued against using tests of knowledge to determine intelligence. He thought that student assessment should focus on helping educators set ideal learning goals—goals that were just beyond what a given student could accomplish without support—thus optimizing the impact of education on learning. The range in which learning goals are “just challenging enough” is commonly referred to as the zone of proximal development (ZPD) (Valsiner & Van Der Veer, 1999).

Since Vygotsky introduced ZPD, other learning scientists have shown repeatedly that learning challenges that are “in the zone” are more engaging and support more effective learning (Bandura, 1977; Brophy, 1999; Wigfield, Eccles, Schiefele, Roeser, & Davis-Kean, 2007; Fischer & Immordino-Yang, 2002). Learning scientists have also found that learning challenges are more likely to engage interest if they involve knowledge-seeking or the application of knowledge—the kind of activities that build skill and deep understanding (Renninger, 1992; Silvia, 2008). There is also evidence that students are more engaged when they are able to understand what they are being taught and when they have a sense of agency (Taylor & Parsons, 2011; Ahlfeldta, Mehtab, & Sellnow, 2005; Kuh & Umbach, 2004; Strong, Silver, & Robinson, 1995). Recent research on flow (interest, concentration, and enjoyment) in learning, which like learning in the Goldilocks Zone is associated with both dopamine and opioids, has revealed that learners experience increased flow, “when the perceived challenge of the task and their own skills [are] high and in balance, the instruction [is] relevant, and the learning environment [is] under their control” (p. 18) (Shernoff, Csikszentmihalyi, Shneider, & Shernoff, 2003). More recently, and as noted above, neuroscientists have found that children’s inborn passion for learning stems from another virtuous cycle—the brain’s dopamine opioid motivation-reward cycle (Berridge, Robinson, Terry E., 1998; Stahl & Feigenson, 2015).

The cumulative evidence suggests that preserving children’s inborn love of learning may be as simple as ensuring that children’s interests, task difficulty, and support stay in good enough alignment to keep the brain’s motivation and reward cycle functioning optimally.

Sacks (1999) (p. 256-257) has argued that “Test-driven classrooms exacerbate boredom, fear, and lethargy, promoting all manner of mechanical behaviors on the part of teachers, students, and schools, and bleed school children of their natural love of learning.” Many other scholars have made similar arguments (Firestone, Frances, & Schorr, 2004; Afflerbach, 2005; Hursh, 2008; Dawson & Stein, 2011; Amrein & Berliner, 2003). However, as argued above, the right assessments have an essential role to play in preserving students’ inborn love of learning when they help educators and learners identify the intersection between interests, task difficulty, and support—the Goldilocks Zone—by pointing to what a learner is most likely to benefit from learning next.

Ideal assessments would help educators create virtuous cycles of learning that operate in harmony with the brain’s motivation-reward cycle—to preserve and enhance students’ inborn love of learning, and perhaps extend this love of learning to a wide range of subjects. They would do this by diagnosing students’ current level of understanding or skill and pointing to what comes next, and would be constructed with methods that could be employed in any subject area.

The challenges: In 1996 there were no metrics that measured level of skill and understanding with a precision that would help teachers identify the Goldilocks Zone. Indeed, existing assessment technologies were more suitable for ranking and selection than classroom diagnostics (Dawson & Stein, 2011). To build an assessment technology precise enough for diagnostic use and flexible enough to serve a wide range of subject areas, it would first be necessary to (1) develop a sufficiently precise and flexible learning metric, and (2) design methods for establishing “just right” learning goals.

Goal 2: Measuring and supporting deep understanding

Our second goal for the DiscoTest Initiative was to develop assessments that both measure and support deep understanding. We define deep understanding as, “learning in a way that richly connects new knowledge to existing knowledge, such that the new knowledge is “robust”—can be applied effectively in a range of real-world contexts—and can become well enough networked with existing knowledge to create a solid foundation for future learning.” This is not the kind of knowledge required to do well on conventional tests—even those that include items that present real-world problems (Salz & Toledo-Figueroa, 2009).

Even the world’s best standardized assessments are almost entirely composed of questions with right and wrong answers (Dawson, 2017b). Most items are still multiple choice, and written response questions are typically scored with:

lists of typical right and wrong answers;
rubrics that focus on the extent to which a student has gotten something correct, whether it be a definition, the application of a procedure, or the interpretation of a story; or
atheoretical data-analytics-based computerized scoring systems.

When Bloom (Bloom, Engelhart, Furst, Hill, & Krathwohl, 1956) developed his taxonomy of educational objectives, he drew a contrast between remembering and understanding that is widely accepted by researchers and educators (Mayer, 2002), although some scholars reject Blooms treatment of knowledge and remembering as synonyms, (Paul, 2012). Correctness, which always (except in the case of guessing) involves remembering, often does not require understanding. For example, children can learn to multiply accurately without being able to recognize a real-world instance in which multiplication would be useful, learn the definition of almost any word without understanding its meaning well enough to make use of it outside of a specific context in the classroom, or learn to write an essay that follows a specific format without being able to tell a coherent story (Kuhn, 2000). Although there is not anything inherently wrong with measuring correctness, it is important to keep in mind that although correctness may be evidence of recall, it is not necessarily evidence of understanding.

Researchers have taken several approaches to measuring understanding. Some have focused on building multiple-choice items in which each possible selection represents a specific way of understanding a given problem (Sadler, 1999). Others have focused on concept mapping (Edmonton, 1999), tests of transfer, (Mayer, 2002), or various forms of authentic assessment, such as portfolio or performance assessment (Wilson, 1996; Reeves & Okey, 1996; Stecher, 2010). There have even been several calls for standardized performance assessments e.g. (Peck, A, Singer-Gabella, Sloan, & Lin, 2014), and these are gradually making their way into the educational marketplace. However, these are being developed primarily for contexts in which skills are well defined and there are right and wrong ways of approaching or solving a problem, as in medical procedures e.g., (Stylopoulos et al., 2004). In fact, current standardized “authentic assessments” often do not measure deep understanding as it is defined here. Their targets are often similar to those of other forms of standardized assessment—correct actions or responses. And, as I have already argued, correctness is not adequate evidence of understanding.

Deep understanding, as the term is used here, requires robust learning—the kind of learning that lays the foundation for future learning (Detterman & Sternberg, 1993). It has long been accepted that asking students to support their answers with arguments increases their depth of understanding (Chi, de Leeuw, Chiu, & LaVancher, 1994; Entwistle, 2004). Indeed, assessments that pose real-world problems without pat answers (Hofman, Goodwin, & Kahl, 2015; Sawyer, 2006) have been shown to reveal a great deal about students’ depth of understanding (Frederiksen, 1984). When students make supporting arguments, their understanding of target concepts and skills becomes apparent in the connections they make, the evidence they call on, and the way in which they use ideas. This is why Piaget’s clinical interviewing technique—an approach specifically designed to reveal ways of understanding—was employed to elicit justifications (Kohlberg, 1984a).
Ideal assessments would be direct measures of deep understanding, and would deliver timely information and resources that would reward and support teaching practices that support learning for deep understanding.

The challenges

To accomplish goal 2—measuring and supporting deep understanding, we would first have to meet the challenges from Goal 1: (1) develop a sufficiently precise and flexible learning metric, and (2) design methods for establishing “just right” learning goals. Then, we would need to employ these tools to determine what different levels of understanding look like, which would require (3) employing our metric and methods to create, curate, and manage a (constantly expanding) body of knowledge about the ways in which learners build increasingly sophisticated understandings in a wide range of knowledge domains. Achieving these goals would require both methodological and technological breakthroughs.

Goal 3: Support the development of essential skills

Curricula that foster deep understanding are also likely to support the development of skills for thinking, communicating, interacting, and lifelong learning, but there is by no means a one-to-one correspondence (Pelligrino, W., Hilton, & National Research Council, 2012). From the start, we determined that both DiscoTests and the learning suggestions made on the basis of DiscoTest scores would explicitly require students to exercise these skills and support teachers in building them. We focus on 7 broad (and often overlapping) skills:

reflectivity,
self-monitoring and awareness,
seeking and evaluating information,
making connections,
applying new knowledge and skills,
seeking and working with input or feedback, and
awareness of cognitive and behavioral biases and skills for avoiding them.

We call these the +7 skills. They are attached to an iterating learning cycle known as VCoL (for Virtuous Cycle of Learning), which is composed of four components—goal setting, information gathering, application, and reflection. Together, VCoL and the +7 skills comprise the learning model that drives DiscoTest development, VCoL+7 (described in greater depth on page XX).

Skill 1: Reflectivity

The first skill, reflectivity, is actually a habit of mind. It is the disposition to reflect—a cultivated habit of reflecting on outcomes, information, emotions, or events as they occur. The evidence from both adult and childhood research indicates that people who are less reflective learn more slowly and make less effective decisions than those who are more reflective (Marsick, 1988; Giovannelli, 2003; Friedman, 2004). A reflective disposition has been shown to have a role in learning and performance in a number of contexts (West, Toplak, & Stanovich, 2008), including leadership development and skill, (Marsick, 1988) and teacher development and skill (Giovannelli, 2003), and as an essential component of problem based learning (Savery & Duffy, 1996). These authors suggest a simple mechanism: a disposition to reflect leads to more reflective activity and reflective activity acts as a catalyst for learning.

Ideal assessments would cultivate reflectivity by providing opportunities for reflection, rewarding reflective activity, and supporting pedagogical practices that cultivate students’ disposition to reflect.

Skill 2: Self-monitoring & awareness

The second set of skills, self-monitoring & awareness, also sometimes referred to as self-regulation, concurrent self-assessment, or mindful practice, involves the ability to attend to, recognize, and regulate one’s attention, thoughts, sensations, emotions, and behavior on a moment-to-moment basis (Puustinen & Pulkkinen, 2001; Karoly, 1993). These skills are important for learning and development because they not only aid in self-regulation, but also improve our ability to identify what we don’t know by increasing awareness of our knowledge and thought processes. They also open us to more sources and types of information by enhancing our ability to attend to our inner and outer environments—including the social environment.

Self-monitoring and awareness practices include:

observing and documenting our own behavior or feelings, as in journaling (Scardamalia & Bereiter, 1985);
self-evaluation, which generally involves evaluating our own behavior or responses relative to some kind of criterion, such as a rubric used as a check-list or guide (Nicol & MacFarlane-Dick, 2006); and
cultivating non-judgmental openness to experience, achieved through intentional attention to the judgments we make reflexively when receiving information or interacting with others (Tang et al., 2007; Karoly, 1993).

As with any kind of learning, students are more likely to benefit from self-monitoring and awareness practices when what the practices reveal poses challenges that are not overly difficult. If these revelations are too often negative or too frequently pose unreachable challenges, the practices may have negative effects (Kluger, Avraham N & DeNisi, 1996).

Ideal assessments would support self-monitoring and awareness by encouraging self-reflection and self-evaluation, foster openness to critical self-assessment by emphasizing the role of feedback in fostering robust learning, and help to ensure that students are learning in the Goldilocks Zone, so their self-evaluations are more likely to reveal attainable challenges.

Skill 3: Seeking and evaluating information

The third set of skills, seeking and evaluating information, evidence, and perspectives, is a large and varied set of skills that are critical for effective learning and decision-making (Pithers & Soden, 2010; Baron & Sternberg, 1987).

Information seeking takes two primary forms: The first of these is inquiry or “finding out for yourself.” Inquiry involves activities like observing, questioning, perspective-seeking, and experimenting. It can be formal or casual. Formal inquiry involves an established process such as scientific experimentation, cooperative learning, grounded theory, or action-inquiry. Casual inquiry is done within the context of everyday work or life—for example, to solve an immediate problem or help us make a workplace decision. The second form of information seeking involves finding out what others think, know, or have done. This includes seeking opinions and finding out what researchers have learned or what other people have done (e.g., best practices). This kind of research involves activities like listening (as in a classroom) or looking things up. A great deal of information seeking is a fundamentally social activity, because it involves direct or indirect communication with others, it can therefore be an activity that builds social skills (Garside, 1996).

Skills for evaluating information—often referred to as critical thinking skills—include identifying relevant perspectives, evaluating sources or expertise (including their authority, currency, objectivity), evaluating the quality of arguments, and evaluating evidence (including their validity, reliability, and accuracy) (Garside, 1996).

Skills for evaluating information are critical for optimal learning (Meyers, 1986). We’ve all heard the expression “garbage in, garbage out” with reference to computer programming. The same is true of human learning. The quality of the information we take in impacts the quality of our decisions. But more than the quality of individual decisions is at stake—the quality of the information we take in has a lasting impact on the quality of students’ developing knowledge networks. We cannot afford to neglect evaluation skills if we want to support optimal learning.

The most effective approaches to building skills for seeking and evaluating information involve learners in a variety of skill-focused interactive learning practices, such as collaborative learning (Garside, 1996), and problem-based learning (Savery & Duffy, 1995). The more learning time students spend in reflective, skill-focused interactive learning activities, the more likely it is that students will become proficient in information gathering and evaluation. Approaches to teaching critical thinking that primarily involve memorizing or remembering vocabulary, definitions, procedures, or rules do not build evaluative skills. To build evaluative skills learners must engage in a great deal of real-world, relevant practice—the kind of practice that contributes to deep understanding (Granott & Parziale, 2002; Mascolo & Fischer, 2010).

Ideal assessments would provide opportunities to practice skills for seeking and evaluating information, and support and reward teaching that provides these opportunities.

Skill 4: Making connections

The fourth skill, making connections, involves identifying and testing relations between ideas, information, perspectives, and evidence. Our knowledge exists in a dynamic living neural network (Crossleya et al., 2013; Posner & Rothbart, 2007; Finn et al., 2015). When we learn, information is added to this network, creating new connections, and reinforcing or pruning other connections. When students learn primarily by memorizing, relatively few connections are made. But when students learn actively and reflectively, connections are likely to be more numerous (Caine & Caine, 1991). We can influence the strength and quality of these connections by consciously engaging in the connection-making process. And when we deliberately engage in practices that support high quality (logic and evidence-based) connections, we not only develop our slow-thinking conscious brain, we also help our fast, but unconscious, connection-making brain to make more robust associations. This leads to higher quality decisions even when we don’t have much time for deliberation (Kahneman, 2011; Liao, 2008). Clearly, learning to make connections is a high priority.
Ideal assessments would help students learn to make robust connections, while supporting and rewarding the kind of teaching that provides these opportunities.

Skill 5: Applying new knowledge and skills

The fifth skill, applying new knowledge and skills, involves using new skills and knowledge to address issues or problems—as often as possible, real-world problems without tidy right and wrong answers. Applying knowledge involves an experimental attitude along with skills for identifying opportunities for application, deciding on a mode of application, and of course, actually employing knowledge to address issues, solve problems, or construct arguments. There are a number of practices that increase the skill with which students apply knowledge. These include action learning, inquiry learning, project-based learning, developing action plans, and persuasive writing or critical discourse (Garside, 1996; Savery & Duffy, 1995). Applying knowledge as part of the learning process increases engagement (Taylor & Parsons, 2011) and supports the process of connecting new knowledge to existing knowledge.

Ideal assessments would provide opportunities for students to apply their knowledge in ill-structured real-world contexts, while supporting and rewarding instructional practices that encourage the frequent application of knowledge.

Skill 6: Seeking and making use of feedback

The sixth set of skills, seeking and making use of feedback, are essential for evaluating the quality of our attempts to work with new knowledge (Askew & Lodge, 2000; Butler & Winne, 1995). Just as it is important to apply new knowledge, it is important to evaluate the success of these applications. For this, students need to value and desire feedback and must have skills for seeking and working with feedback. These skills include identifying good sources of feedback (i.e., knowledgeable people who provide constructive, actionable feedback), noticing subtle positive or negative signals from people or the environment, and experimenting with (applying) feedback.
High-quality feedback helps students see what works and what doesn’t, and can steer them toward new learning goals—goals personally tailored for their Goldilocks Zone. Moreover, students with good feedback seeking skills are better equipped to take charge of honing their own knowledge and skills (Boud & Molloy, 2012).

Ideal assessments would provide constructive, actionable feedback for students and teachers, and offer opportunities for students to practice providing and processing feedback.

Skill 7: Awareness of cognitive and behavioral biases and skills for avoiding them

The seventh and final set of skills, awareness of cognitive and behavioral biases and skills for avoiding them, involves a number of built-in mental biases and strategies for addressing them. According to dual-process theory, our brains have two mental systems, System 1, the unconscious and fast mental system, and System 2, the conscious and slow mental system (Kahneman, 2011). System 1 relies on scripts and associations to quickly draw conclusions and prescribe actions. Its default settings work well for hunter-gatherers and small children, but not so well for modern humans, who must deal with abstract knowledge and complexity. In today’s world System 2 must develop skills that reduce the effects of System 1’s tendency to rush to judgment with inadequate evidence (Amsel et al., 2008).

System 1 suffers from several built-in biases, including (but not limited to) overconfidence, prejudice, mistaking luck for cause, mistaking statistics for cause, and judging based on inadequate evidence. Thankfully, because System 2 can help educate System 1, we can combat these biases and possibly even teach System 1 to function more optimally (Cook, Lewandowsky, & Ecker, 2017; Kahneman, 2011).

Ideal assessments would provide opportunities for students to learn and practice skills for detecting and combating a range of cognitive biases.

Addressing the challenges

Overarching challenges

My colleagues and I learned several lessons as we applied cognitive-developmental theory in the "real world." Some of these lessons were more general than others. I have selected three of the “general” lessons to touch on here. They relate to naming things, the difficulties of doing something novel, and the dangers of reinventing the wheel.

“"Be careful how you name things"” Dr. Paul Holland, personal communication, 1998

On the day the last member of my Dissertation Committee signed off on my dissertation, Dr. Paul Holland took me to lunch in order to, “share some important advice.” The gist of his advice was, “take care in naming things.” At the time, Dr. Holland’s meaning was a mystery, but over the years as my colleagues and I have developed products, processes, and technologies, naming things has become a necessity. The name DiscoTest is a case in point. It hearkens to dancing and discourse and was designed to capture both the joy and interactive nature of learning and teaching. Although it may appear that playful punning is our primary naming goal, that is not the case. Our primary rule for naming new products or methods is, “As much as possible, do not hijack existing terms.”

“Everything is impossible until it’s done.” Nelson Mandela

Seemingly impossible undertakings like changing education by changing educational testing pose more than pragmatic hurdles. They come with all of the challenges that adhere to anything that threatens the status quo. If you plan to do something ambitious that challenges the status quo, it won’t be enough to succeed in producing a credible product. You will need many allies, a large body of evidence, strong conviction, good strategy, agility, lots of time, sheer stubbornness, luck, and perhaps most importantly, the ability to learn from mistakes and leverage setbacks (Perrini, 2006).

“Don’t reinvent the wheel.” Anonymous

Determining how to structure DiscoTests required a thorough understanding of existing thought and research. It is impossible to overstate the importance of leveraging existing research and thought or expanding one’s lens to multiple paradigms. Leveraging existing thought not only ensures that you don’t waste time by “reinventing the wheel,” but it reduces resistance to your approach by letting other scholars know that you understand your debt to their work. Expanding your lens to include other paradigms enriches your own perspectives and allows you to see where conclusions from different schools of thought converge and diverge. Convergence can increase confidence in the truth value of your conclusions, and divergence signals potential areas of weakness. As you are likely to notice when reading this narrative, the DiscoTest research lens spans neuroscience, cognitive psychology, system dynamics, evolutionary psychology, behaviorism, learning theory, clinical psychology, dual process theory, test theory, and philosophy.

Methodological and technological challenges

Most of the specific challenges faced by the DiscoTest Initiative have been methodological or technological. In this section, I briefly describe the methods and technologies employed to meet the five learning-related challenges outlined above, as well as a sixth challenge—electronic scoring—that relates more strongly to teachers’ needs.

Develop the Lectical Assessment System (a scoring system for Fischer’s skill scale).
Design developmental maieutics (methods for building DiscoTests).
Develop the first DiscoTests.
Develop the Lectical Dictionary (a curated taxonomy of meanings)
Invent an electronic scoring system.
Develop VCoL+7 (our learning model).

I should point out that this list of challenges is not presented in chronological order. In reality, the solutions to these challenges are deeply interrelated and were developed through many iterations in which advances in one area informed advances in other areas. In fact, all of our research, technology, strategy, and governance is deliberately designed to support virtuous cycles of learning at every level. Even our assessments and scoring systems are designed to develop over time.

Developing the Lectical Assessment System

As explained above, to build a system of standardized assessments that would support learning in the Goldilocks Zone, we needed to (1) develop a sufficiently precise and flexible learning metric, and (2) design methods for establishing “just right” learning goals. It turned out that these two objectives were interdependent, in the sense that a more precise metric would make it possible to describe learning sequences that would help teachers set “just right” learning goals, and that these learning sequences would, in their turn, contribute to refinements in scoring precision.

To meet these objectives, we relied primarily upon the work of cognitive developmentalists, especially the Piagetians and neo-Piagetians. In fact, the first challenge was substantially reduced by prior research and instrument development in this field. For example, Kohlberg (Kohlberg, 1984b) had created a bootstrapping method for developing domain-specific cognitive developmental scoring systems, and several researchers had followed with domain-specific scoring systems that employed this method (Armon, 1984; Selman & Byrne, 1974; King & Kitchener, 1994; Keller & Wood, 1989). These developmental sequences, which were aligned with Piaget’s (Piaget, 1971) stage model, provided several examples of domain-specific developmental scoring systems. But although this research was instructive, Kohlberg’s method was unsuitable for our purposes for several reasons. First, it took many years to collect the longitudinal data required to create valid sequences. We needed a solution that would allow us to measure the level of performance in multiple subject areas without decades of research. Second, sequences were domain-specific and there was no simple way to align levels from one assessment with those of another (King, Kitchener, Wood, & Davison, 1989) (e.g., Kohlberg’s moral judgment stage 3 and Kitchener & King’s (Kitchener & King, 1990) reflective judgment stage 3). To make it easy to compare growth across domains we needed a generalized developmental assessment system that would allow assessments in different domains to be calibrated to the same scale. Third, the measurable levels in these developmental systems spanned several years. To help educators identify the Goldilocks Zone with precision, we needed a more fine-grained account of the growth of understanding.

Without doubt, a different approach was required, one that would allow for scale refinement, and ideally, make it easier to calibrate many learning sequences to the same scale. We needed a domain-independent developmental scoring system. Fortunately, in 1996, a few domain independent developmental models were already in existence, including those of Fischer (Fischer, 1980), Case (Case, 1985), and Commons (Commons, Richards, with Ruf, Armstrong-Roche, & Bretzius, 1984). Additionally, there was one domain-independent developmental assessment system, Commons’ General Stage Scoring System (GSSS) (Commons et al., 1995).

In 1996, I was using the GSSS in my dissertation research—a study of the development of evaluative reasoning about education (Dawson-Tunik, 2004)—and was incorporating it into a new cross-sectional/longitudinal method for describing learning pathways calibrated to the General Stage Model scale (described in the following section).

Between 1998 and 2002, my colleagues and I conducted several validation studies, showing that the General Stage Scoring System measured the same dimension as the longitudinally validated domain-specific stage scoring systems of Armon and Kohlberg (Dawson, 2001; Dawson, 2002; Dawson, 2003; Dawson, Xie, & Wilson, 2003). The results demonstrated that domain-independent scoring was feasible and that the GSSS tapped the same underlying dimension as other cognitive-developmental scoring systems.

Unfortunately, the GSSS, like the existing domain-specific scoring systems, lacked the precision required for diagnostic assessment. Thus, toward the end of this period, we began looking to other developmental models to help with the identification of more fine-grained within-level differences, which was necessary for increasing the precision of scores. At the time, Commons’ model, being comprised of 13 levels organized in a simple hierarchy, provided little guidance. And although the results from several validation studies provided clear evidence of increasing conceptual elaboration during each General Stage Model stage (Dawson-Tunik, Commons, Wilson, & Fischer, 2005), it was not clear how we could use this knowledge to improve scoring.

Fischer’s Skill Theory (Fischer, 1980; Fischer & Bidell, 2006) provided two important clues. First, his skill levels were comprised of 13 levels organized into groups of levels nested in tiers, in which each tier represented a fundamentally different conceptual structure (a Piagetian stage), and each level represented a different logical structure within that tier. This nesting suggested a fractal structure to development, in which logical structures (levels) nested within conceptual structures (tiers), might contain even more fine-grained conceptual or logical structures.

Before we could investigate this possibility, it was necessary to create a new assessment system—the Lectical Assessment System—aligned to Fischer’s Skill Scale (Fischer, 1980). Initially, this scoring system was composed only of scoring rules for the original 13 Skill Scale levels, which are known as Lectical Levels. Since then, human scoring has been continuously refined, resulting in (1) scoring rules for a more fine grained 4-phase-per-level scale based on observed within-level development of both conceptual and logical structures, and (2) a curated taxonomy of meanings—the Lectical Dictionary—that has made it possible to provide scorers with increasingly sophisticated technological scaffolds (See “Developing the Lectical Dictionary” on page 37).

In 2003, we reported inter-rater agreement rates of 80-97% within 1/2 of a level (Dawson et al., 2003). Today, Lectica’s analysts typically agree with one another 90% of the time within 1/5 of a level, producing statistical reliabilities that are more than acceptable for low-stakes classroom assessment. To date, over 40,000 assessments and interviews (covering a variety of subjects in science and the humanities) have been scored with the Lectical Assessment System.

Developmental maieutics

In 1996, there was no scalable approach for describing the kind of fine-grained learning sequences required to align learning goals with assessment scores. Dawson’s (Dawson-Tunik, 2004) dissertation research had introduced a method for employing a domain-independent developmental metric and in-depth concept analysis to describe learning sequences, but these sequences were described at the whole stage level and were therefore not granular enough for diagnostic purposes.

However, by 2004, the Lectical Assessment System, which from year to year provided increasingly precise measurements, had made it possible to develop a set of methods—developmental maieutics—that would achieve this standard.

Developmental maieutics (Dawson & Stein, 2008) is a set of qualitative and quantitative research methods designed to inform the development of assessments that can tell teachers:

how a student currently understands and works with his or her knowledge;
what a student needs to master at his or her present level in order to construct optimal understandings at the next level; and
what that student is most likely to benefit from learning next.

The original formulation of developmental maieutics was an iteratively structured research approach that began with the establishment of a collaborative relationship with teachers (or practitioners of various types with interests in human development, e.g. coaches, therapists, etc.), with whom we selected assessment topics and constructs. We then constructed a rough sense of selected topics based on existing knowledge and used this initial sense of the assessment domain to build a set of developmental instruments targeting agreed upon concepts and learning goals. These instruments, in turn, were employed in preliminary interview research in which we collected longitudinal data from samples with a wide age-range (often age 5 through adulthood). Then, leveraging both cross-sectional and longitudinal evidence, we generated empirically grounded learning sequences for targeted constructs.

We define learning sequences (also known as a developmental pathway) as, “empirically and theoretically grounded reconstructions of pathways toward the acquisition of concepts, skills, or capabilities.” Well-conceived learning sequences have a wide range of applications. They can be used to improve our understanding of human development, craft curricula, inform assessment development, and when used diagnostically, help teachers determine what comes next for a given student (Achieve Inc., 2015).

The original maieutic approach to describing learning sequences involved submitting interview and assessment data to three forms of qualitative analysis. First, we analyzed texts for their developmental level using the Lectical Assessment System. Then, we independently—blind to Lectical Score—analyzed their conceptual content by closely examining the meanings expressed in these texts. Finally, the Lectical Scores and concept analyses were brought together as learning sequences through a theoretically grounded process of rational reconstruction (Habermas, 1993; Dawson & Gabrielian, 2003; Dawson & Stein, 2008).

Employing the maieutic method, we have described learning sequences for conceptions of leadership, good education, epistemology, learning, morality, and the self, as well as for critical thinking, decision-making, the physics of energy, and problem-solving (Dawson & Stein, 2008; Dawson & Stein, 2004b; Dawson & Stein, 2004a; Dawson & Stein, 2006). We discuss how these are employed below, in the section on DiscoTests.

Numerous researchers have developed learning sequences for a wide range of concepts and skills (Achieve Inc., 2015). The sequences developed for DiscoTests differ from most other sequences in four ways. First, they are all calibrated to the same learning metric. Second, they generally cover a substantial period of the lifespan, often from childhood through adulthood. Third, their granularity is designed to be “just right” for diagnostic assessment purposes. And fourth, they are part of an integrated network of learning sequences that serve diverse assessment and instructional functions. (It is important to note here that the methods of developmental maieutics have evolved over time, increasing in their scalability and technological sophistication, as will become apparent in the discussions of the Lectical Dictionary and electronic scoring system, below.)

Developing DiscoTests

The design parameters for DiscoTests were based on the learning and assessment goals described at the outset of this chapter. They stipulated that every assessment event would provide a learning experience for students, equip teachers with diagnostics for making evidence-based instructional decisions, and provide policy makers with information that could be employed to support teacher and curriculum development. In other words, they would be designed to support virtuous cycles of learning—for everyone involved.

DiscoTest items

Every DiscoTest is composed of open-ended questions that require thoughtful written responses with explanations, and is accompanied by a set of diagnostic and formative reports and instructional resources. At the core of all DiscoTests are one or more ill-structured, real-world problems or scenarios (Sinnott, 1989), focused on a specific set of skills or concepts. These are followed by a series of questions that require students to connect ideas, think through problems, and communicate their reasoning through explanation. Similar test formats were already being used in cognitive developmental research (Colby & Kohlberg, 1987) and the authentic/formative assessment movement (Myford, 1996). Our use of this format stemmed directly from the intention to create assessments that would provide students with opportunities to (a) build their understanding of targeted skills and concepts, (b) develop +7 learning and reasoning skills, and (c) hone written communication and argumentation skills, while rewarding instructional practices that contribute to the achievement of these goals.

The DiscoTest format was also determined by our intention to measure understanding—as opposed to correctness. Table 1 provides a comparison of targets and formats in DiscoTests and conventional standardized assessments.

Table 1: A comparison of DiscoTests with conventional standardized assessments

Feature	DiscoTests	Other standardized assessments
Scores	represent level of understanding based on a valid learning scale	number of correct answers
Target	the depth of an individual's understanding (demonstrated in the complexity of arguments and the way the test taker works with knowledge)	the ability to recall facts, or to apply rules, definitions, or procedures (demonstrated by correct answers)
Format	paragraph-length, written responses	right/wrong judgments or right/wrong applications of rules and procedures

Responses, judgments, and explanations

To illustrate this difference, let us examine an example from a DiscoTest focused on the conservation of matter. The example involves a scenario-based item in which students are presented with a question about the impact of oxidation on mass. Students are shown the image in Figure 1, which is accompanied by this description: “Sophia balances a pile of stainless steel wire and ordinary steel wire on a scale. After a few days the ordinary wire in the pan on the right starts rusting.”

Figure 1: Oxidation and mass scenario

The original multiple-choice version of this item posed the question: “What will happen to the pan with the rusting wire?” Students were presented with the following choices:

The pan will move up.
The pan will not move.
The pan will move down.
The pan will first move up and then down.
The pan will first move down and then up.

The DiscoTest version of this item posed this question: “What will happen to the height of the pan with the rusting wire? Please explain your answer thoroughly.” Student responses to this prompt demonstrate different levels of understanding. Here are some examples of responses from 12th graders:

Lillian: “The pan will move down because the rusted steel is heavier than the plain steel.”

Josh: “The pan will move down, because when iron rusts, oxygen atoms get attached to the iron atoms. Oxygen atoms don't weigh very much, but they weigh a bit, so the rusted iron will "gain weight," and the scale will to down a bit on that side.”

Ariana: “The pan will go down at first, but it might go back up later. When iron oxidizes, oxygen from the air combines with the iron to make iron oxide. So, the mass of the wire increases, due to the mass of the oxygen that has bonded with the iron. But iron oxide is non-adherent, so over time the rust will fall off of the wire. If the metal rusts for a long time, some of the rust will become dust and some of that dust will very likely be blown away.”

The correct answer to the multiple-choice question is, "The pan will move down." From a diagnostic perspective, the selection of this response tells us nothing about student understanding. The most we can legitimately infer from a correct answer is that the student has learned that when steel rusts, it gets heavier. That is evidence of recall, not understanding. In contrast, the DiscoTest item yields answers that reveal different levels of understanding. Most readers will immediately see that Josh's answer reveals more understanding than Lillian's, and that Ariana's reveals more understanding than Josh's.

Students’ responses expose another weakness in the multiple-choice format. Multiple-choice items’ focus on correctness can lead to serious under- or over-estimation of student capability. Ariana, based on her response, would be likely to select one of the incorrect multiple-choice answers, and Lillian and Josh are given equal credit for correctness even though the levels of understanding they demonstrate are not equally sophisticated. This not only leads to false conclusions about student capability, it is unjust (Dawson & Stein, 2011; Stein, 2016).

In addition to providing diagnostic information about understanding, the DiscoTest format supports learning by asking students to explain. Researchers in the rational constructivist tradition have found evidence that explanation appears to generate new knowledge “by encouraging learners to find underlying rules and regularities.” (Xu, 2016). The DiscoTest format also provides rich responses that can be used to determine a Lectical Score—and even make ratings on other dimensions, such as argumentation skill. Finally, the responses to DiscoTest items are data that increase our understanding of learning and development, while informing the construction of increasingly accurate diagnostics and feedback.

Scoring

All DiscoTests are calibrated to the Lectical Scale (Lectica’s 4-phase-per-level version of Fischer’s Skill Scale). This not only means that growth on any DiscoTest can be graphed on the Lectical Scale as shown in Figure 1, but also that all learning sequences, resources, feedback, and suggestions are calibrated to the scale. Thus, when a student receives a score on an assessment that targets conceptions of evidence, that score is associated with a...

description of the way the student is likely to be thinking about evidence;
description of how students at the next level generally think about evidence;
description of the students’ growth edge—what he or she is most likely to benefit from learning next;
learning activity designed to help the student gain a better understanding of this growth edge; and
learning activity designed to help move the student toward the next level of understanding or skill.

DiscoTests also provide graphic representations of student growth like the “report card” shown in Figure 2. This figure shows part of a teacher report for a 6th grader named Rebecca who has taken the LRJA—a set of DiscoTests focused on reflective judgment—several times over 2 1/2 years. In addition to showing Rebecca’s scores, the graph provides a growth trajectory based on her overall pattern of growth. (Student versions of the report card do not currently include scores.) The student version of this report card can function as a source of extrinsic motivation or gamification (Landers, 2014).

Figure 2: A DiscoTest “report card”

Developing the Lectical Dictionary

In 1998, I received a grant from the Spencer Foundation to explore the possibility of electronic developmental scoring. The published report stemming from this study (Dawson & Wilson, 2004) described an approach to electronic scoring—LAAS—that involved a deep study of the conceptual structure of verbal performances and the construction of a “database of developmentally organized concepts.” Although the approach was promising, it was also expensive and arduous—unlikely to be readily scalable, so the effort was put aside. Ten years later, while building a new scoring platform for our analysts, I revisited the notion of building a database of developmentally organized concepts, this time as a scaffold for human scoring. Thinking of database contents as a “curated taxonomy of meanings,” my colleagues and I began to experiment.

In 2013, leveraging a database containing over 30,000 scored assessments and interviews spanning first speech through adulthood, a group of Lectical Analysts began the process of identifying “Lectical Items”—words or phrases like "evidence," “good evidence,” and "reliable evidence" that carry meaning. These were defined as “words and short (up to 4-word) phrases whose simplest meanings are unlikely to be useful before a given phase (1/4 of a Lectical Level) of development.” Using an online interface and a process that came to be called lexicating, analysts identified potential Lectical Items, researched their empirical distribution in our database, and assigned them to developmental phases according to (1) the empirical evidence of their first appearance, and (2) an analysis of their meaning relative to Lectical Levels and Phases. The platform designed for this purpose allows for the continuous reexamination of item placement as new evidence and developmental insights emerge. The “curated taxonomy of meanings” that emerged is now called the Lectical Dictionary.

During the first two years of its development, the dictionary was employed primarily to scaffold online scoring by providing visual cues about the Lectical Level of ideas expressed in performances. As analysts scored, Lectical Items that represented the highest phase constructs in a performance were highlighted in red, those at the next highest phase were highlighted in orange, those at the next highest phase were highlighted in yellow, and those at the fourth highest phase were highlighted in green.

Within 12 months, inter-rater agreement had improved substantially, and two developmentally interesting patterns had begun to emerge. First, in the scoring interface, performances were beginning to look like rainbows, generally with more yellow and orange items than red and green items—regardless of the developmental phase of a given performance. Second, in the Lectical Dictionary, we were beginning to see clear patterns in the development of “conceptual strands”. For example, an examination of items containing the word evidence, revealed easily observed progressions in the development of its meaning, such as the following:

phase 09b: something that I know is true
phase 09c: good information that comes from something people have seen or proved (same as a fact)
phase 09d: more or less proven facts that you can use to persuade others
phase 10a: information that comes from good research and can be used to support arguments
phase 10b: information that comes from different kinds of research and sources and needs to be evaluated before you use it to support arguments

Today, Lectical Items are assigned to a Lectical Phase based on a combination of empirical evidence, the judgment of analysts, and a variety of helper algorithms. Lectical Dictionary entries begin with first speech and cover the full span of development. In addition to being assigned to a Lectical Phase, many of the items in the Lectical Dictionary have been assigned to thematic strands such as deliberation, conflict resolution, the physics of energy, or evidence. For example, at present, there are over 7000 terms in the Lectical Dictionary that relate to evidence.

Longitudinal and cross-sectional analyses of sequences like this one have demonstrated that each successive conception builds upon previous conceptions (Dawson & Gabrielian, 2003; Dawson-Tunik, 2004). These findings are consistent with the developmental theory upon which the Dictionary is based (Piaget, 1985; Fischer, 1980), and suggest that Lectical Items assigned to a particular phase can be said not only to represent the understandings of that phase but also the building blocks for future conceptions. The distribution of Lectical Items from different phases within a given performance can, therefore, be thought of as evidence of the historical pattern of an individual’s development.

The rate of Dictionary development has increased as what we learn about patterns in the acquisition of concepts is gradually integrated into our methods and technology. For example, we have learned that there are regularities in the progression of verb conjugation within particular developmental levels and that new single-word Lectical Items typically don’t appear next to “and” or “or” until the phase following the phase to which they have been assigned. Patterns like these, when adequately regular, allow a degree of automation in the lexicating process.

The Lectical Dictionary is continuously monitored, refined, and added to by a team of trained analysts. Each time we develop a new assessment, our analysts integrate new conceptual content into the Lectical Dictionary. Over time, as we build new assessments in new subject areas, the Lectical Dictionary will become an increasingly comprehensive taxonomy of learning.

As of this writing, the Lectical Dictionary contains over 200,000 Lectical Items, and the pattern of Lectical Item distribution across developmental phases has proven to be stable enough to provide the basis for an accurate and reliable electronic developmental scoring system—the first of its kind. Moreover, the level descriptions and learning resources in DiscoTests are now exclusively based on sequences from the Lectical Dictionary.

Inventing an electronic scoring system

Because their workloads are already full to overflowing, most teachers to whom we have spoken cannot see themselves adopting a new assessment technology unless it provides immediate benefits without increasing their burden. This means that before DiscoTests can be adopted widely, they will have to be scored electronically.

For decades, test developers have attempted to create accurate and reliable electronic scoring systems for texts. This goal has even been called the "holy grail" of educational assessment (Whittington & Hunt, 1999). Today, several essay-scoring systems for texts are in use (Zupanc & Bosnic, 2015). Most are created through the purely computational, big data analyses of essays scored by humans with a variety of standardized rubrics focused on correctness or “graded” texts (texts that have been determined through expert opinion to be suitable for specific school grades). Not one of them is grounded in a strong theory of learning and development and none are part of a coherent vision for the optimal education of our children. They have been created primarily in response to (1) pressure from the educational community for more written items in standardized tests and (2) the expense of human scoring (Zupanc & Bosnic, 2015).

Automatic essay scoring tends to focus primarily on aspects of texts that lend themselves to computational analysis—such as sentence length, syntax, morphology, punctuation, size of vocabulary, word length, and the presence of targeted vocabulary. Some of their developers claim that these proxies can be used to measure constructs like meaning (Pearson Education, 2010), coherence (Shermisand & Hamner, 2013), and effective writing (Rich, Schneider, & D’Brot, 2013), but these claims are difficult to evaluate due to the paucity of published information. Importantly, none of the existing automated essay scoring systems reported in the literature appear to be designed to measure learning as the growth of understanding.

One automated scoring system—the Lexile system—stands out from other electronic scoring systems for texts, in part because Lexile developers call their scale a developmental scale (Smith, 2009). Lexiles uses vocabulary and syntax analysis to evaluate the writing level of texts (Smith, 2009). The Lexile approach is based on two fundamental assumptions: Shorter sentences are easier to understand, and more common words are easier to understand (White & Clement, 2001). It is a largely an a-theoretical approach, developed through the quantitative analysis of a large corpus of graded texts.

The electronic scoring system invented to score DiscoTests—the Computerized Lectical Assessment System (CLAS)—differs from other solutions to automated scoring in several ways:

It measures level of understanding.
It is grounded in a strong theory of learning and development.
Its scores are diagnostic and rich in meaning, in that they are tied to richly-described, evidence-based learning sequences.
It can readily be calibrated to track the development of understanding in a wide range of subject areas.
It is the core of a platform for delivering any number of subject-specific assessments that are designed to support the development of understanding and skill.

Unlike the purely computational scoring algorithms of conventional automated scoring systems for texts, CLAS's algorithms leverage the Lectical Dictionary, which is developed through an analyst-machine-learner collaboration called lexicating. Our analysts are constantly engaged in “conversations” with CLAS about the placement of Lectical Items. These placements are based on the combination of evidence in test-taker performances, developmental theory, existing Dictionary entries (document patterns in the development of constructs with similar meanings), and current CLAS scores. This collaborative approach, because it results in continuous refinements to the Lectical Dictionary, ensures that CLAS’s algorithms continuously increase in accuracy and meaning.

When CLAS produces a score, this score tells us where an assessment performance lands on the Lectical Scale, what the score means in terms of a test-taker's mastery of the skills and concepts targeted by the assessment, and what the test taker is likely to benefit from learning next.

CLAS employs the Lectical Dictionary (described above) and discriminant analysis (Dawson & Wilson, 2004) to examine the developmental evidence in text performances. Rather than looking for the highest level attributes of texts, CLAS asks which developmental phase is the best fit to a performance, based on how test takers appear to have constructed meanings over time, as represented in the densities of terms from each Lectical Phase identified in the performance. On average, CLAS bases scoring decisions on 1.6 bits of information for every unique word in a given performance. For example, if a student uses 200 unique words, CLAS will have approximately 360 bits of information upon which to base a scoring decision. As of this writing, CLAS algorithms are based on human and computer analysis of over 45,000 human-scored texts (and growing), and CLAS scores on 6 different Lectical Assessments agree with human scores 85% of the time within 1/5 of a Lectical Level (Dawson, 2017c), which is very close to current human inter-rater agreement rates of 90% within 1/5 of a level.

The Lectical Dictionary is at the core of Lectica’s methods and technology. It not only makes CLAS possible, but as noted above, is an instrumental component of our methods for describing learning sequences and developing customized learning resources and report feedback. It has even been used to create a better spell checker—one that takes the developmental level of a performance into account when it makes spelling suggestions. This spell checker is triggered when the density of misspelled words in a performance exceeds 2%, providing test-takers with an opportunity to make corrections before CLAS calculates a score.

CLAS confers several advantages over human scoring, it

makes it possible, for the first time, to conduct large scale developmental assessment,
is entirely objective,
speeds the process of developing learning sequences and assessment resources and feedback, and
dramatically reduces the cost of scoring large numbers of assessments.

Since its introduction in 2014, CLAS has been used to score thousands of assessments for several research and evaluation projects. A demonstration is available online.

Developing VCoL+7

Perhaps the most difficult challenge we have faced as researchers attempting to build practical tools for real-world use, has been learning how to communicate with practitioners and lay audiences. In the early years, we often felt like we were banging our heads against a wall of resistance—especially when we tried to teach practitioners and lay audiences about developmental theory and how it relates to learning. Gradually, we were forced to accept that it might not be essential or even useful for practitioners to learn developmental theory. Indeed, what they wanted and needed was a framework they could put to work immediately, not an abstract theory they would have to translate into classroom practice. It was only when we understood this that VCoL+7 began to emerge.

In developing the VCoL model, our aims were threefold. First, the model would be solidly grounded in existing research and learning theories. Second, it would be simple enough to be understood in a useful way by both learners and practitioners. And third, it would be consistent with the cognitive-developmental model that undergirds our work. Our formulation of the VCoL model attempts to integrate and simplify numerous iterative learning models from the 20th century (Bandura, 1992; Lave & Wenger, 1991; Vygotsky, 1966; Dewey, ; Kolb, 1984; Piaget, 1985; Skinner, 1981; Simonsen et al., 2014). In its simplest form, VCoL includes setting an appropriate learning goal (in the Goldilocks Zone), acquiring information, applying that information, reflecting on outcomes, and setting a new goal based on these outcomes.

VCoLs can be thought of as simple virtuous feedback loops that leverage the brain’s built-in motivational mechanism and ensure, through practice and reflection, that each instance of learning has the largest possible positive impact on the development of an individual’s knowledge network. They can be applied to small learning tasks (Can I stay upright if I let go of the chair?) or an entire curriculum, and can be cultivated as a habit of mind to support effective lifelong learning.

Figure 3: VCoL model

ADD VCOL MODEL

The +7 skills extend the basic VCoL model, and draw the attention of educators toward the fundamental skills required to learn and reason effectively: (a) reflectivity, (b) self-monitoring and awareness, (c) seeking and evaluating information, (d) making connections, (e) applying new knowledge and skills, (f) seeking and working with input or feedback, and (g) awareness of cognitive and behavioral biases and skills for avoiding them.

This is accomplished in three important ways. First, simply implementing VCoL in the classroom provides students with practice applying knowledge and reflecting on outcomes. Second, DiscoTest reports include learning recommendations that target relevant +7 skills. For example, if a DiscoTest targets students’ understanding of evidence, their report might include a learning suggestion (presented as a VCoL) that targets skills for gathering and interpreting evidence, or if the DiscoTest targets students’ multiplication skills, their report might include a VCoL that involves making connections between different ways of multiplying. Third, through workshops and courses, including a MOOC that is currently under development, we will show teachers how to build +7 skills in the classroom.

During the last 10 years, we have participated in several studies of the effectiveness of experimental and existing curricula. This research has demonstrated that curricula that involve learners in more VCoLs are associated with steeper developmental trajectories than programs that incorporate fewer VCoLs (Dawson, 2017a; Dawson, 2016; Dawson & Thornton, 2017). Projections based on our K-13 database of over 20,000 assessments suggest that seniors in schools with the most integrated, problem-focused, and hands-on curricula—programs that are VCoL rich—are on average as much as 5 years ahead (on the Lectical Scale) of students in schools with programs that are the least VCoL rich (2-3 years ahead of average seniors in schools of similar socioeconomic status)( Dawson & Thornton, 2017). Some of our analyses suggest that the development of average seniors in the low VCoL group plateaus in grade 9 (Dawson, 2016). In practical terms, this means that average seniors in high VCoL schools graduate with the ability to explain reasons for differing perspectives in this way:

“The difference could be due to personal bias. People who believe one thing personally could be reluctant to look for counterexamples and so are then led to believe by their own studies that there is only one answer. It is also hard to quantify data gathered from children. Differences in data interpretation could also contribute to the differing views (LRJA 1100007675).”

The average student in a low VCoL school is more likely to explain reasons for differing perspectives in this way:

“Different research, tests, and conclusions would result in different analysis of the data. Also, if they believe a certain way, it is easier to have the data fit their belief than the other belief. Because of the different minds of the scientists, it is difficult for a same conclusion to come from results that can vary (LRJA 1100001808).”

The reasoning in the second example is not only less developmentally sophisticated, its logical coherence is poor. In fact, it is difficult to find a 12th-grade example from the low VCoL schools in which the reasoning is logically coherent. We have argued that this lack of logical coherence results from learning in ways that neglect deep understanding, and have shown that in grades 4–8, low coherence (after controlling for initial Lectical Level) predicts slower development on the Lectical Scale than high coherence (Dawson & Seneviratna, 2015; Dawson & Thornton, 2017).

Discussion

“There is now reasonable consensus that the Piagetian picture is not right, in fundamental ways” (Xu, 2016), p. 12).

The DiscoTest Initiative owes its existence to the Piagetian paradigm. What we have accomplished here would not have been possible without Piaget, yet statements like the one above are abundant in educational research today. Never mind that the “fundamental” flaws in the Piagetian picture are limited to aspects of his stage model or evidence that some of the phenomena he observed have now been observed in younger children (Xu, 2016). Never mind that without Piaget’s stage theory, his notion of reflective abstraction, his distinction between accommodation and assimilation, his elaborately articulated descriptions of logical structures, or his clinical method, Kohlberg, Fischer, Case, and Commons would have had no starting place and the DiscoTest initiative would have been a non-starter.

Sadly, widespread belief in the demise of the Piagetian paradigm has acted as a barrier to this work. On several occasions, our manuscripts, grant applications, and conference proposals have been rejected on the basis that their theoretical approach and methods are outdated or have been discredited.

One of the more difficult aspects of translating theory into practice involves the management of tensions between the demands of the academic world and the realities of the educational and business landscape. On the business side, we must fund our research by branding and marketing our work to educational institutions. But to publish our research, we must avoid looking like we have commercial interests. We have frequently been criticized by reviewers for branding, and one article was refused because we were unwilling to eliminate an account of how we are applying our approach to the development of DiscoTests. I do not dismiss concerns about the bias introduced by commercial interests, but would argue that the simple fact that research has been conducted by a business and may generate income should not, by itself, interfere with researchers’ ability to disseminate refereed research results. In fact, it could be argued that there are good reasons to encourage businesses to disseminate their research through refereed publications. Maintaining a divide between business and the academy leaves the purchasers of many educational products with no reliable basis for evaluating the educational merits of these products.

Obstacles to dissemination through academic publications notwithstanding, the DiscoTest Initiative has attracted several doctoral students over the years, resulting in a number of dissertations that use our assessment data or relate to our mission (Stein, 2014; Heikkinen, 2014; Jackson, 2003; Potter, 2016; Fuhs, 2015; Thornton, 2023).

The DiscoTest Initiative represents an ambitious attempt to translate theory into everyday educational practice by employing it to guide the development of a new form of educational assessment—one that drives instruction toward methods that (1) help preserve our children’s inborn love of learning, (2) measure and support deep understanding, and (3) build essential skills for thinking, communicating, interacting, and lifelong learning. It took 20 years to move from the initial idea to a practical scalable product and useable learning model—despite the excellent research foundation provided by our predecessors. As with all ambitious undertakings, there were many problems to solve, hurdles to traverse, and setbacks to manage along the way. (Needless to say, this narrative presents a much tidier picture than the reality.)

And there are new challenges to come. During the 2018-2019 school year, we plan to begin delivering the first DiscoTest, free of charge, to 4-12 teachers. These teachers will undoubtedly uncover flaws, suggest improvements, and steer us in unanticipated directions. Because we have implemented an iterative, design-oriented approach to the development of DiscoTest, we believe we are well-positioned to take on these challenges.

I close with an alternative to Xu’s view of the “Piagetian picture,” which from the perspective of the work described here, “provides much useful guidance for the development of practical learning tools that promise to help improve education in fundamental ways.” I hope that this application of Piagetian and neo-Piagetian theory—Skill Theory in particular—to testing and learning will provide the next generation of cognitive-developmentalists with some of the inspiration and evidence they will need to take this paradigm to the next level.

References

Achieve Inc. (2015). The role of learning progressions in competency-based pathways. Retrieved from https://www.achieve.org/files/Achieve-LearningProgressionsinCBP.pdf

Afflerbach, P. (2005). High stakes testing and reading assessment. National Reading Conference Policy Brief.

Ahlfeldta, S., Mehtab, S., & Sellnow, T. (2005). Measurement and analysis of student engagement in university classes where varying levels of PBL methods of instruction are in use. Higher Education Research & Development, 24, 5-20.

Amrein, A. L., & Berliner, D. C. (2003). The effects of high-stakes testing on student motivation and learning. Educational Leadership, 60(5), 32-38.

Amsel, E., Klaczynzki, P. A., Johnston, A., Bench, S., Close, J., Sadler, E., & Walker, R. (2008). A dual-process account of the development of scientific reasoning: The nature and development of metacognitive intercession skills. Cognitive Development, 23(4), 452-471.

Armon, C. (1984). Ideals of the good life: Evaluative reasoning in children and adults. Doctoral dissertation, Harvard University, Cambridge, MA.

Askew, S., & Lodge, C. (2000). Gifts, ping-pong, and loops—linking feedback and learning. In S. Askew (Ed.), Feedback for learning (pp. 1-18). London: Routledge-Falmer.

Baldwin, J. M. (1906). Mental development in the child and the race: Methods and processes. New York: The Macmillan Company.

Bandura, A. (1977). Social learning theory. Englewood Cliffs, NJ: Prentice-Hall.

Bandura, A. (1992). Social cognitive theory. In V. Ross (Ed.), Six theories of child development: Revised formulations and current issues (pp. 1-60). London, England: Jessica Kingsley.

Baron, J. B. E., & Sternberg, R. J. E. (Eds.). (1987). Teaching thinking skills: Theory and practice. W. H. Freeman & Co, Publishers; New York, NY, US.

Berridge, K. C., Robinson, Terry E. (1998). What is the role of dopamine in reward: hedonic impact, reward learning, or incentive salience? Brain Research Reviews, 28, 309-369.

Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives (Handbook 1: Cognitive domain). New York: McKay.

Boud, D., & Molloy, E. (2012). Rethinking models of feedback for learning: the challenge of design. Assessment & Evaluation in Higher Education, 38, 698-712.

Brophy, J. E. (1999). Toward a model of the value aspects of motivation in education: Developing Appreciation for Particular Learning Domains and Activities. Educational Psychologist, 34, 75-85.

Buck, J., & Villines, S. (2007). We the People: Consenting to a deeper democracy; A guide to sociocratic principles and methods. Washington DC: Sociocracy.info.

Butler, D. L., & Winne, P. H. (1995). Feedback and self-regulated learning: A theoretical synthesis. Review of Educational Research, 65(3), 245-281.

Caine, R. N., & Caine, G. (1991). Making connections: Teaching and the human brain. Wheaton, MD: Association for Supervision and Curriculum Development.

Campbell, R. L., & Bickhard, M. H. (1987). A deconstruction of Fodor’s anticonstructivism. Human Development, 30(1), 48-59.

Case, R. (1985). The four stages of development: a reconceptualization. In Intellectual development: Birth to adulthood (pp. 81-117). New York: Academic Press.

Chi, M. T., de Leeuw, N., Chiu, M.-H., & LaVancher, C. (1994). Eliciting self-explanations improves understanding. Cognitive Science, 18, 439-477.

Colby, A., & Kohlberg, L. (1987). The measurement of moral judgment: Standard issue scoring manual (1). New York: Cambridge University Press.

Commons, M. L., Richards, F. A., with Ruf, F. J., Armstrong-Roche, M., & Bretzius, S. (1984). A general model of stage theory. In M. Commons, F. A. Richards, & C. Armon (Eds.), Beyond Formal Operations (pp. 120-140). New York: Praeger.

Commons, M. L., Straughn, J., Meaney, M., Johnstone, J., Weaver, J. H., Lichtenbaum, E., Rodriquez, J. (1995). The general stage scoring system: How to score anything. Proceedings from Annual meeting of the Association for Moral Education, New York.

Cook, J., Lewandowsky, S., & Ecker, U. K. H. (2017). Neutralizing misinformation through inoculation: Exposing misleading argumentation techniques reduces their influence. PLOS One. Retrieved from http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0175799

Crossleya, N. A., Mechelli, A., Vértes, P. E., Winton-Brown, T. T., Patel, A. X., Ginestet, C. E., Bullmore, E. T. (2013). Cognitive relevance of the community structure of the human brain functional coactivation network. Proceedings of the Nationa Academy of Sciences of the United States of America, 110.

Csikszentmihalyi, M., Rathunde, K., & Whalen, S. (1997). Talented teenagers: The roots of success and failure. London: Cambridge University Press.

Dawson-Tunik, T. L. (2004). “A good education is…”The development of evaluative thought across the life-span. Genetic, Social, and General Psychology Monographs, 130(1), 4-112.

Dawson-Tunik, T. L., Commons, M. L., Wilson, M., & Fischer, K. W. (2005). The shape of development. The European Journal of Developmental Psychology, 2(2), 163-196.

Dawson, T. L. (2001). Layers of structure: A comparison of two approaches to developmental assessment. Genetic Epistemologist, 29, 1-10.

Dawson, T. L. (2002). A comparison of three developmental stage scoring systems. Journal of Applied Measurement, 3, 146-189.

Dawson, T. L. (2003). A stage is a stage is a stage: A direct comparison of two scoring systems. Journal of Genetic Psychology, 164, 335-364.

Dawson, T. L. (2016). Are our children learning robustly? Retrieved May 1, 2017, from https://theo-dawson.medium.com/are-your-children-learning-robustly-f5c6a68dc52d?

Dawson, T. L. (2017a). Reliability and validity. Retrieved May 1, 2017

Dawson, T. L. (2017b). What PISA measures. What we measure. Retrieved March 30, 2017, from https://theo-dawson.medium.com/what-pisa-measures-what-we-measure-e083621f2c87?

Dawson, T. L. (2017c, October 18-20). The calibration of CLAS, an electronic developmental scoring system. Proceedings from Annual Conference of the Northeastern Educational Research Association, Trumbull, CT.

Dawson, T. L., & Gabrielian, S. (2003). Developing conceptions of authority and contract across the life-span: Two perspectives. Developmental Review, 23, 162-218.

Dawson, T. L., & Seneviratna, G. (2015, July). New evidence that well-integrated neural networks catalyze development. Proceedings from ITC, Sonoma, CA.

Dawson, T. L., & Stein, Z. (2004a). National Leadership Study results. Hatfield, MA: Developmental Testing Service, LLC.

Dawson, T. L., & Stein, Z. (2004b). Epistemological development: It’s all relative. Proceedings from Annual Meeting of the Jean Piaget Society, Toronto.

Dawson, T. L., & Stein, Z. (2006). Mind Brain & Education study: Final report. Mind Brain & Education study

Dawson, T. L., & Stein, Z. (2008). Cycles of research and application in education: Learning pathways for energy concepts. Mind, Brain, & Education, 2(2), 90-103.

Dawson, T. L., & Stein, Z. (2011, June). Virtuous cycles of learning: redesigning testing during the digital revolution. Proceedings from The International School on Mind, Brain, and Education, Erice (Sicily), Italy.

Dawson, T. L., & Thornton, A. M. A. (2017, October 18-20). An examination of the relationship between argumentation quality and students’ growth trajectories. Proceedings from Annual Conference of the Northeastern Educational Research Association, Trumbull, CT.

Dawson, T. L., & Wilson, M. (2004). The LAAS: A computerized developmental scoring system for small- and large-scale assessments. Educational Assessment, 9, 153-191.

Dawson, T. L., Xie, Y., & Wilson, M. (2003). Domain-general and domain-specific developmental assessments: Do they measure the same thing? Cognitive Development, 18, 61-78.

Detterman, D. K., & Sternberg, R. J. (Eds.). (1993). Transfer on trial: Intelligence, cognition, and instruction. Ablex Publishing Corp; Norwood, NJ, US.

Dewey, J. ( ). My pedagogic creed. In John Dewey: The early works, 1882-1898 (Vol. 5, pp. 84-95). London: Southern Illinois University Press, Feffer & Simons, Inc.

Duckworth, E. (1979). Either we’re too early and they can’t learn it or we’re too late and they know it already: The dilemma of applying Piaget. Harvard Educational Review, 49, 297-312.

Edmonton, K. M. (1999). Assessing science understanding through concept maps. In J. J. Mintzes, J. H. Wandersee, & J. D. Novak (Eds.), Assessing science understanding: A human constructivist view (pp. 19-41). Burlington, MA: Elsevier Academic Press.

Entwistle, N. (2004). Promoting deep learning through teaching and assessment: conceptual frameworks and educational contexts. Proceedings of the ESRC Teaching and Learning Research Programme, First Annual Conference, Edinburgh.

Finn, E. S., Shen, X., Scheinost, D., Rosenberg, M. D., Huang, J., Chun, M. M., Constable, R. T. (2015). Functional connectome fingerprinting: identifying individuals using patterns of brain connectivity. Nat Neurosci, 18(11), 1664-1671.

Firestone, W. A., Frances, L., & Schorr, R. Y. (Eds.). (2004). The ambiguity of teaching to the test: standards, assessment, and educational reform. Mahwah, NJ: Erlbaum Associates.

Fischer, K. W. (1980). A theory of cognitive development: The control and construction of hierarchies of skills. Psychological Review, 87, 477-531.

Fischer, K. W., & Bidell, T. R. (2006). Dynamic development of action, thought, and emotion. In N. Eisenberg, W. Damon, & R. M. Lerner (Eds.), Handbook of child psychology: Theoretical models of human development (6 ed., Vol. 1, pp. 313-399). Hoboken, New Jersey: John Wiley & Sons.

Damon & R. M. Lerner (Eds.), Handbook of child psychology: Theoretical models of human development (6 ed., Vol. 1, pp. 313-399). New York: Wiley.

Fischer, K. W., & Immordino-Yang, M. H. (2002). Cognitive development and education: From dynamic general structure to specific learning and teaching. Spencer Foundation.

Perrini, F. (Ed.). (2006). The new social entrepreneurship: What awaits social entrepreneurial ventures? Cheltenham, UK: Edward Elgar.

Frederiksen, N. (1984). The real test bias: Influences of testing on teaching and learning. American Psychologist, 39, 193-202.

Friedman, A. A. (2004). The relationship between personality traits and reflective judgment among female students. Journal of Adult Development, 11(4), 297.

Fuhs, C. J. (2015). A latent growth analysis of hierarchical complexity and perspectival skills in adulthood. Doctoral Dissertation, Fielding Graduate University, Santa Barbara, CA.

Garside, C. (1996). Look who’s talking: A comparison of lecture and group discussion teaching strategies in developing critical thinking skills. Communication Education, 45.

Giovannelli, M. (2003). Relationship between reflective disposition toward teaching and effective teaching. Journal of Educational Research, 96, 293-309.

Gopnik, A., Meltzoff, A. N., & Kuhl, P. K. (1999). The scientist in the crib: Minds, brains, and how children learn. New York, NY: William Morrow & Co.

Granott, N., & Parziale, J. (Eds.). (2002). Microdevelopment: Transition processes in development and learning. Cambridge, UK: Cambridge University Press.

Habermas, J. (1993). Justification and application. Cambridge, MA: MIT Press.

Hamid, A. A., Pettibone, J. R., Mabrouk, O. S., Hetrick, V. L., Schmidt, R., Weele, V., Berke, Brandon J. (2013). Mesolimbic dopamine signals the value of work. Nature Neuroscience, 19, 117-126.

Heafner, T. L., & Fitchett, P. G. (2012). Tipping the scales: National trends of declining social studies instructional time in elementary schools. Journal of Social Studies Research, 36, 190-215.

Heikkinen, K. M. (2014). The development of social perspective coordination skills in grades 3-12. Ed.D. Doctoral Dissertation, Harvard, Cambridge, MA.

Hofman, P., Goodwin, B., & Kahl, S. (2015). Re-balancing assessment: Placing formative and performance assessment at the heart of learning and accountability. Retrieved April, 6, 2017 from http://www.measuredprogress.org/wp-content/uploads/2015/06/Re-Balancing-Assessment-White-Paper.pdf

Hursh, D. (2008). High-stakes testing and the decline of teaching and learning. New York: Rowman & Littlefield.

Jackson, B. (2003). In search of peak experiences through life. Seeking the arenas and strategies for replicating the flow experience: A developmental perspective. Doctoral dissertation, Fielding Graduate Institute, Santa Barbara.

Kahneman, D. (2011). Thinking, fast and slow. New York: Farrar, Straus, and Giroux.
Karoly, P. (1993). Mechanisms of self-regulation: A systems view. Annual Review of Psychology, 44(1), 23-52.

Keller, M., & Wood, P. (1989). Development of friendship reasoning: A study of interindividual differences in intraindividual change. Developmental Psychology, 25(5), 820-826.

King, P., & Kitchener, K. (1994). Developing reflective judgement: Understanding and promoting intellectual growth and critical thinking in adolescents and adults. San Francisco: Jossey-Bass Publishers.

King, P. M., Kitchener, K. S., Wood, P. K., & Davison, M. L. (1989). Relationships across developmental domains: A longitudinal study of intellectual, moral, and ego development. In M. L. Commons, J. D. Sinnot, F. A. Richards, & C. Armon (Eds.), Adult development. Volume 1: Comparisons and applications of developmental models (pp. 57-71). New York: Praeger.

Kitchener, K. S., & King, P. M. (1990). The reflective judgment model: ten years of research. In M. L. Commons, C. Armon, L. Kohlberg, F. A. Richards, T. A. Grotzer, & J. D. Sinnott (Eds.), Adult development (Vol. 2, pp. 62-78). New York: Praeger.

Klenowski, P., Morgan, M., & Bartlett, S. E. (2014). The role of δ-opioid receptors in learning and memory underlying the development of addiction. British Journal of Pharmacology, 172, 297-310.

Kluger, Avraham N, & DeNisi, A. (1996). The effects of feedback interventions on performance: A historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin, 119(2), 254-284.

Kohlberg, L. (1984a). Empirical methods and results. In The psychology of moral development: The nature and validity of moral stages (Vol. 2, pp. 387-425). San Francisco: Jossey Bass.

Kohlberg, L. (1984b). Moral stages and moralization: A cognitive developmental approach. In The psychology of moral development: The nature and validity of moral stages (Vol. 2, pp. 170-205). San Francisco: Jossey Bass.

Kolb, D. A. (1984). Experiential learning: Experience as a source of learning and development. Englewood Cliffs, NJ: Prentice-Hall.

Kuh, G. D., & Umbach, P. D. (2004). College and character: Insights from the National Survey of Student Engagement. New Directions for Institutional Research, 2004(122), 37.

Kuhn, D. (2000). Does memory development belong on an endangered topic list? Child Development, 71, 21-25.

Landers, R. N. (2014). Developing a Theory of Gamified Learning. Simulation & Gaming, 45(6), 752-768.

Laurent, V., Morse, Ashleigh K., Balleine, Bernard W. (2015). The role of opioid processes in reward and decision-making. British Journal of Pharmacology, 172, 449-459.

Lave, J., & Wenger, E. (1991). Situated learning: Legitimate peripheral participation. New York: Cambridge University Press.

Liao, S. M. (2008). A defense of intuitions. Philosophical Studies, 140, 247-262.

Lisman, J. E., & Grace, A. A. (2005). The Hippocampal-VTA Loop: Controlling the entry of information into long-term memory. Neuron, 5, 703-713.

Mareschal, D., Johnson, M., Sirois, S., Spratling, M., Thomas, M., & Westermann, G. (2007). Neuroconstructivism: Volumes I & II. Oxford: Oxford University Press.

Marsick, V. J. (1988). Learning in the workplace: The case for reflectivity and critical reflectivity. Adult Education Quarterly, 38, 187 - 198.

Martinez, M. E., & Lipson, J. I. (1989). Assessment for learning. Educational Leadership, April, 73-75.

Mascolo, M. F., & Fischer, K. W. (2010). The dynamic development of thinking, feeling and acting over the lifespan. In W. F. Overton (Ed.), Biology, cognition, and methods across the life-span. Volume 1 of the Handbook of life-span development (pp. 149-194). Hoboken, NJ: Wiley.

Mayer, R. E. (2002). Rote versus meaningful learning. Theory Into Practice, 41, 226-232.

McIntosh, J., MacDonald, F., & McKeganey, N. (2005). The reasons why children in their pre and early teenage years do or do not use illegal drugs. International Journal of Drug Policy, 16, 254-261.

Meyers, C. (1986). Teaching students to think critically. A guide for faculty in all disciplines. San Francisco: Jossey-Bass.

Myford, C. M. (1996). Authentic assessment in action: Studies of schools and students at work. American Journal of Education, V104(N2), 162-165.

Nicol, D. J., & MacFarlane-Dick, D. (2006). Formative assessment and self-regulated learning: A model and seven principles of good feedback practice. Studies in Higher Education, 31(2), 199-218.

Pandero, E., & Jonsson, A. (2013). The use of scoring rubrics for formative assessment purposes revisited: A review. Educational Research Review, 9, 129-144.

Paul, R. (2012). Bloom’s Taxonomy and critical thinking instruction: Recall is not knowledge. In Critical thinking: What every person needs to survive in a rapidly changing world (pp. 519-526). Foundation for Critical Thinking.

Lipman, P. (2004). High stakes education: Inequality, globalization, and school reform. Routledge/Falmer, New York.

Paulson, M. J., Coombs, R. H., & Richardson, M. A. (1990). School performance, academic aspirations, and drug use among children and adolescents. Journal of Drug Education, 20, 289 - 303.

Pearson Education. (2010). The Intelligent Essay Assessor. Retrieved from http://kt.pearsonassessments.com/download/IEA-FactSheet-20100401.pdf

Peck, C., A, Singer-Gabella, M., Sloan, T., & Lin, S. (2014). Driving blind: Why we need standardized performance assessment In teacher education. Journal of Curriculum and Instruction, 8, 8-30.

Pederson, P. V. (2007). What Is measured Is treasured: The Impact of the No Child Left Behind Act on nonassessed subjects. The Clearing House: A Journal of Educational Strategies, Issues and Ideas, 80(6), 287-291.

Pelligrino, J., W., Hilton, M. L., & National Research Council (Eds.). (2012). Education for life and work: developing transferable knowledge and skills in the 21st century. Washington, DC: The National Academies Press.

Piaget, J. (1971). The theory of stages in cognitive development. In D. R. Green (Ed.), Measurement and Piaget (pp. 1-11). New York: McGraw-Hill.

Piaget, J. (1977). Structuralism and dialectic. In H. E. Gruber & J. J. Von√©che (Eds.), The essential Piaget (pp. 775-779). New York: Basic Books.

Piaget, J. (1985). The equilibration of cognitive structures: The central problem of intellectual development (T. Brown & K. J. Thampy, Trans.). Chicago: The University of Chicago Press.

Pithers, R. T., & Soden, R. (2010). Critical thinking in education: a review. Educational Research, 42, 237-249.

Posner, M. I., & Rothbart, M. K. (2007). Expertise. In M. I. Posner & M. K. Rothbart (Eds.), Educating the human brain. Washington, DC: American Psychological Association.

Potter, P. (2016). Becoming a coach: Transformative learning and hierarchical complexity of coaching students. Ph.D. Fielding Graduate University., Santa Barbara.

Puustinen, M., & Pulkkinen, L. (2001). Models of self-regulated learning: A review. Scandinavian Journal of Educational Research, 45(3), 269-286.

Reeves, T. C., & Okey, J. R. (1996). Alternative assessment for constructivist learning environments. In B. G. Wilson (Ed.), Constructivist learning environments: Case studies in instructional design (pp. 191-191).

Renninger, K. A. (1992). Individual interest and development: Implications for theory and practice. In K. A. Renninger, S. Hidi, & et al. (Eds.), The role of interest in learning and development (pp. 361-395). Hillsdale, NJ, US: Lawrence Erlbaum Associates, Inc.

Rich, C. S., Schneider, M. C., & D’Brot, J. M. (2013). Applications of automated essay evaluation in West Virginia. In M. D. Shermis & J. Burstein (Eds.), in Handbook of automated essay evaluation: Current applications and new directions (pp. 99-123). New York: Routledge.

Sacks, P. (1999). Standardized minds: the high price of America’s testing culture and what we can do to change it. Cambridge, MA: Perseus Press.

Sadler, P. M. (1999). The relevance of multiple choice tests in assessing science understanding. In J. J. Mintzes, J. H. Wandersee, & J. D. Novak (Eds.), Assessing science understanding: A human constructivist view (pp. 251-278). Burlington, MA: Elsevier Academic Press.

Salz, S., & Toledo-Figueroa, D. (2009). Take the test: Sample questions from OECD’s PISA assessments. Retrieved from https://www.oecd.org/pisa/pisaproducts/Take%20the%20test%20e%20book.pdf

Savery, J. R., & Duffy, T. M. (1995). Problem-based learning: An instructional model and its constructivist framework. Educational Technology, 35, 31-38.

Savery, J. R., & Duffy, T. M. (1996). Problem-based learning: An instructional model and its constructivist framework. In B. G. Wilson (Ed.), Constructivist learning environments: Case studies in instructional design (pp. 135-135).

Sawyer, R. K. (2006). Educating for innovation. Thinking Skills and Creativity, 1, 41-48.

Scardamalia, M., & Bereiter, C. (1985). Fostering the development of self-regulation in children’s knowledge processing. In S. E. Chipman, J. W. Segal, & R. Glaser (Eds.), Thinking and learning skills: Research and open questions (Vol. 2, pp. 563-577). Hillsdale, NJ: Erlbaum.

Schwartz, M. S., Sadler, P. M., Sonnert, G., & Tai, R. H. (2009). Depth versus breadth: How content coverage in high school science courses relates to later success in college science coursework. Science Education, 93(5), 798-826.

Selman, R., & Byrne, D. F. (1974). A structural-developmental analysis of levels of role taking in middle childhood. Child Development, 45(3), 803-806.

Shermisand, M. D., & Hamner, B. (2013). Contrasting state-of-the-art automated scoring of essays: Analysis. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 313-346). NewYork: Routledge.

Shernoff, D. J., Csikszentmihalyi, M., Shneider, B., & Shernoff, E. S. (2003). Student engagement in high school classrooms from the perspective of flow theory. School Psychology Quarterly, 18, 158-176.

Silvia, P. J. (2008). Interest—The curious Emotion. Current Directions in Psychological Science, 17(1), 57-60.

Simonsen, J., Svabo, C. S., Strandvad, M., Samson, K., Hertzum, M., & Hansen, O. E. (Eds.). (2014). Situated design methods. Cambridge, MA: MIT Press.

Sinnott, J. D. (1989). A model for solution of ill-structured problems: Implications for everyday and abstract problem solving. In Everyday problem solving: Theory and applications (pp. 72-99). New York: Praeger Publishers.

Skinner, B. F. (1981). Selection by consequences. Science, 213(4507), 501-504.

Skinner, B. F. (1984). Selection by consequences. Behavioral & Brain Sciences, 7(4), 477-510.

Smith, M. (2009). The reading-writing connection. Retrieved May 1, 2017 from https://lexile-website-media-2011091601.s3.amazonaws.com/resources/materials/Reading-Writing_Connection.pdf

Spitzer, M. (1999). The mind within the net: Models of learning, thinking, and acting. Cambridge, MA: Massachussetts Institute of Technology.

Stahl, A. E., & Feigenson, L. (2015). Observing the unexpected enhances infants’ learning and exploration. Science, 348, 91-94.

Stecher, B. (2010). Performance assessment in an era of standards-based educational accountability. Retrieved January 28, 2018, from https://scale.stanford.edu/system/files/performance-assessment-era-standards-based-educational-accountability.pdf

Stein, Z. (2014). Tipping the scales: Social justice and educational measurement. Ed.D. Doctoral Dissertation, Harvard, Cambrige, MA.

Stein, Z. (2016). Social justice and educational measurement. New York, NY: Routledge.

Strong, R., Silver, H. F., & Robinson, A. (1995). Strengthening student engagement: What do students want. Educational Leadership, 53, 8-12.

Tang, Y., Ma, Y., Wang, J., Fan, Y., Feng, S., Lu., Q., Posner, M. I. (2007). Short-term meditation training improves attention and self-regulation. PNAS, 104(43), 17152-17156.

Taylor, L., & Parsons, J. (2011). Improving student engagement. Current Issues in Education, 14.

Tomasello, M. (2005). Constructing a language: a usage-based theory of language acquisition. Cambridge MA: Harvard University Press.

UNESCO. (2015). Rethinking education: Towards a global common good? Retrieved May 2, 2017, from http://unesdoc.unesco.org/images/0023/002325/232555e.pdf

Valsiner, J., & Van Der Veer, R. (1999). The encoding of distance: the concept of the zone of proximal development and its interpretations. In P. Lloyd & C. Fernyhough (Eds.), Lev Vygotsky: Critical Assessments, Volume 3 (pp. 2-30).

Vygotsky, L. S. (1966). Development of the higher mental function. In Psychological research in the U.S.S.R (pp. 44-45). Moscow: Progress Publishers.

Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes (M. Cole, V. J. Steiner, S. Schribner, & E. Souberman, Trans.). Cambridge, MA: Harvard University Press.

Werner, H. (1957). The concept of development from a comparative and organismic point of view. In D. B. Harris (Ed.), The concept of development (pp. 125-148). Minneapolis: University of Minnesota Press.

West, R. W., Toplak, M. E., & Stanovich, K. E. (2008). Heuristics and biases as measures of critical thinking: Associations with cognitive ability and thinking dispositions. Journal of Educational Psychology, 100, 930-941.

White, S., & Clement, J. (2001). Assessing the Lexile Framework: Results of a panel meeting. Retrieved May 1, 2017, from http://files.eric.ed.gov/fulltext/ED545932.pdf.

Whittington, D., & Hunt, H. (1999). Approaches to the computerized assessment of free text responses. Proceedings of the Third Annual Computer Assisted Assessment Conference, Loughborough.

Wigfield, A., Eccles, J. S., Schiefele, U., Roeser, R. W., & Davis-Kean, P. (2007). Development of achievement motivation. In W. Damon, R. M. Lerner, D. Kuhn, R. H. Siegler, & N. Eisenberg (Eds.), Child and Adolescent Development: An Advanced Course (pp. 406-425). Wilson, B. G. (1996). Constructivist learning environments: Case studies in instructional design. Englewood Cliffs, New Jersey: Educational Technology Publications.

Wittmanna, B. C., Schiltzb, K., Boehlerc, C. N., & Düzel, E. (2008). Mesolimbic interaction of emotional valence and reward improves memory formation. Neuropsychologia, 46, 1000-1008.

Xu, F. (2016). Preliminary thoughts on a rational constructive approach to cognitive development: Primitives, symbols, learning, and thinking. In D. Barner & A. S. Baron (Eds.), Core knowledge and conceptual change (pp. 11-28).

Zupanc, K., & Bosnic, Z. (2015). Advances in the field of automated essay evaluation. Informatica, 39, 383-395.

DiscoTest: Evolution & rationale