Hollands & Bakir (2015). Efficiency of automated detectors of learner engagement and affect compared with traditional observation methods

Hollands, F. M., & Bakir, I. (2015). Efficiency of Automated Detectors of Learner Engagement and Affect Compared with Traditional Observation Methods. http://doi.org/10.13140/RG.2.1.1332.8082

A really interesting article comparing direct human observation, video observation, and sensor-light and sensor-heavy observation. For small projects, direct human observation seems to work best (for now), but it costs too much to use at scale or over long periods. Technology-mediated observations cost more to develop and are less reliable, but they can be deployed at much larger scales and at much lower cost over time. The tools will get better, but I think focusing on log files is shortsighted: they only track what software has been adapted to log.

How might computational ethnographic methods help here? Perhaps machine learning could approximate direct human observation while using technologies that scale, both in the number of observations and in cost.

Notes:

p.4: Developing automated detectors of affect and engagement requires a significant upfront investment. — Highlighted Sep 11, 2016

p.4: However, given the ease with which the detectors can be applied to many hours of log files for many students, they can yield several hundred thousand to several million observation labels at a cost of 1-28 cents per label — Highlighted Sep 11, 2016

p.4: magnitude of cost being inversely related to the scale of application — Highlighted Sep 11, 2016
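
A quick sketch of why cost per label falls as scale grows: a one-time development cost is spread over more and more labels. The dollar figures and the names `FIXED` and `MARGINAL` are my own illustrative assumptions, not numbers from the report.

```python
def cost_per_label(fixed_cost, marginal_cost_per_label, n_labels):
    """Average cost per label when a one-time cost is spread over n_labels."""
    if n_labels <= 0:
        raise ValueError("n_labels must be positive")
    return fixed_cost / n_labels + marginal_cost_per_label


# Hypothetical figures for illustration only (not taken from the report).
FIXED = 50_000.0   # assumed one-time detector development cost, in dollars
MARGINAL = 0.001   # assumed cost of generating one additional label

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} labels -> ${cost_per_label(FIXED, MARGINAL, n):.4f} per label")
```

With these made-up inputs the average cost drops from tens of dollars per label at a thousand labels to a few cents at a million, which is the inverse relationship the highlight describes.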

p.5: While the low costs of applying automated detectors at scale are clearly attractive, accuracy of these detectors is less compelling. Agreement between machine-assigned labels and human coder labels averaged around 0.35 across all detectors we investigated, falling into Landis & Koch’s (1977) “fair agreement” range. — Highlighted Sep 11, 2016
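
To make the 0.35 figure concrete, here is a minimal sketch that computes Cohen's kappa for two sets of labels and maps it onto the Landis & Koch (1977) bands. I am assuming the agreement statistic is Cohen's kappa (the report mentions kappa elsewhere); the function names and the toy labels are mine.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each rater's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa):
    """Name the Landis & Koch (1977) agreement band for a kappa value."""
    if kappa < 0:
        return "poor"
    for upper, name in ((0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return name
    return "almost perfect"


# Toy labels, invented for illustration.
human = ["engaged", "engaged", "off-task", "off-task"]
machine = ["engaged", "off-task", "off-task", "engaged"]
print(cohens_kappa(human, machine), landis_koch(0.35))
```

The toy example agrees on half the items, which is exactly what chance would produce for these label frequencies, so kappa comes out at zero even though raw agreement is 50%.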

p.5: We conclude that for small-scale studies of engagement and affect, in-person classroom observations recorded using either pen and paper or a smartphone application are the least costly and the most reliable. For large-scale studies, automated detectors are vastly less costly per unit of data collected but are currently low in reliability. As automated detectors become more reliable in assessing learners’ affect and engagement, we expect they will be embedded in the software itself so that the learner’s state can be detected real-time and the software will respond accordingly with messages, talking agents, or different activities, just as a live teacher might change pace or activity if she sees students yawning or looking puzzled. — Highlighted Sep 10, 2016

p.6: Karweit and Slavin (1981) investigated the relationship between four different measures of time used in the classroom (scheduled time, actual instructional time, engaged time, and engaged rate) with mathematics achievement and found that the engagement measures were the most strongly related to achievement — Highlighted Sep 11, 2016

p.6: many protocols have been developed for the assessment of teacher and — Highlighted Sep 11, 2016

p.6: Carroll (1963) suggested that the most direct evidence available for validly assessing perseverance would come from observation of the amount of time the student is actively engaged in learning. However, he asserted that, at that time, measurements of perseverance were “practically nonexistent” — Highlighted Sep 11, 2016

p.7: Can these be automated and documented/interpreted/visualized in realtime using code? Hardware sensors + software interpretation? — Written Sep 11, 2016

p.7: student activity in classrooms (see Simon & Boyer Eds., 1970) and specifically of student engagement (see Volpe, DiPerna, Hintze, & Shapiro, 2005; Fredericks et al., 2011). Some measures of engagement rely on student self-reports, some on teacher reports, and some on observational measures. Fredericks et al. (2004) report that studies of student engagement generally attempt to capture one or two dimensions of student engagement but that ideally all three (behavior, emotion, and cognition) should be measured. — Highlighted Sep 11, 2016

p.7: But… if valid computational ethnographic methods could be developed, this cost could be drastically reduced and availability increased… — Written Sep 11, 2016

p.7: direct observation is more objective, more precise for evaluating specific target behaviors, and more externally valid as it assesses behavior as it is actually occurring in the school context. On the other hand, they note that direct observation is more costly in terms of time, money, and resources because a qualified observer must be in the classroom for sustained periods of time — Highlighted Sep 11, 2016

p.7: systematic direct observation of students provides one of the most useful strategies for establishing links between assessment and intervention. An alternative to direct observation in the classroom is video-taping students individually with a webcam — Highlighted Sep 11, 2016

p.7: video footage is viewed and coded ex-post. — Highlighted Sep 11, 2016

p.7: Observers using the protocol watched students sequentially in a classroom, recording an observation code every 30 seconds on a bubble sheet. Coding options included absent, engaged, off-task, and six additional codes to capture the kind of activity such as peer interaction, or engagement in a process-oriented or product-oriented mathematical task — Highlighted Sep 11, 2016

p.7: Requires a human observer to employ a coding scheme. Easier, but still a huge limiting factor… — Written Sep 11, 2016

p.7: Behavioral Observation of Students in Schools (BOSS) requires observers to code a student’s behavior every 15 seconds over a 15-minute observation period. Coding options include active engagement, passive engagement, off-task motor, off-task passive, and off-task verbal. While earlier applications of BOSS involved recording observation codes with pen and paper, recordings can now be made electronically using a $30 iPhone or iPad application — Highlighted Sep 11, 2016
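
For a sense of how much data one BOSS session produces, a small sketch: it counts the coding intervals in a session and summarizes a list of codes as proportions. The category names come from the highlight above; the helper functions and the sample session are my own invention.

```python
from collections import Counter

BOSS_CODES = (
    "active engagement",
    "passive engagement",
    "off-task motor",
    "off-task passive",
    "off-task verbal",
)


def intervals_in_session(session_minutes, interval_seconds):
    """Number of whole coding intervals that fit in one observation session."""
    return (session_minutes * 60) // interval_seconds


def summarize(codes):
    """Share of intervals assigned to each BOSS code."""
    if not codes:
        raise ValueError("no codes recorded")
    counts = Counter(codes)
    return {code: counts[code] / len(codes) for code in BOSS_CODES}


n = intervals_in_session(15, 15)  # one code every 15 s for 15 minutes
# Invented session: mostly engaged, with some off-task intervals.
session = (["active engagement"] * 30 + ["passive engagement"] * 18
           + ["off-task passive"] * 12)
print(n, summarize(session))
```

So a single 15-minute session yields 60 labels for one student, which helps explain why human observation gets expensive per label.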

p.7: HART affective categories include boredom, confusion, delight, engaged concentration, frustration and surprise — Highlighted Sep 11, 2016

p.8: HART synchronizes field observations to internet time so that BROMP data can be precisely synchronized to the log file data from educational software. This allows researchers to compare the user’s observed state of affect and engagement with her specific actions in the software. — Highlighted Sep 11, 2016

p.8: Sounds VERY intrusive, though. Key loggers on computers. Transdermal sensors… — Written Sep 11, 2016

p.8: D’Mello, Duckworth, and Dieterle (under review) describe state-of-the-art approaches to assessing student cognition, affect, and motivation during learning activities. These “AAA approaches” use “advanced computational techniques for the analytic measurement of fine-grained components of engagement in a fully automated fashion” (p.4). Computer-based assessments of engagement derived from sensor signals such as keystrokes, log files, facial or eye movements, posture, or electrodermal activity offer the advantage of objectivity and reliability compared with human assessments — Highlighted Sep 11, 2016

p.8: once machine-learning models have been built to detect patterns of behavior associated with specific states of affect or engagement, they can be applied at scale to new student data collected by automated sensors with low to negligible marginal costs — Highlighted Sep 11, 2016

p.8: “Sensor light” sounds like a good balance. Direct observation without intervention. Non-intrusive… — Written Sep 11, 2016

p.8: D’Mello et al. (under review) distinguish between sensor-free, sensor-light, and sensor-heavy detection methods. They provide several examples of studies which implement sensor-free measurement of engagement by relying on the log files of students working on computer-based activities (D’Mello, Craig, Witherspoon, McDaniel, & Graesser, 2008; Pardos, Baker, San Pedro, Gowda, & Gowda, 2013; Bixler & D’Mello, 2013; Baker et al., 2012; and Sabourin, Mott, & Lester, 2011). Sensor-light approaches use inexpensive, ubiquitous, and relatively unobtrusive devices such as webcams or microphones to collect signals (e.g., Whitehill, Serpell, Lin, Foster, & Movellan, 2011). Sensor-heavy approaches involve expensive equipment such as eye trackers, pressure pads, and physiological sensing devices (e.g., Kapoor & Picard, 2005) which are hard to use in the field at scale. Software is used to “read” and automatically categorize the signals collected by the various sensors. — Highlighted Sep 11, 2016

p.9: How can feedback from “sensor-light” techniques be used to shape the learning activities in a classroom in action? — Written Sep 11, 2016

p.9: In order to improve student engagement levels in learning activities, it is necessary not only to detect disengagement, but to understand the causes well enough to be able to design corrective responses — Highlighted Sep 11, 2016

p.10: If automated detectors of engagement and affect can be built into the software itself, the possibility arises of real-time automated responsiveness to the student’s emotional state as well as her academic performance. Given the substantial programming and instructional design resources required to build detectors and adaptiveness into any one software program, the question arises as to whether this strategy is economically viable within the price range generally tolerated for educational software. If such adaptive and responsive software is to be widely affordable to schools, the most cost-effective development strategies must be adopted and they must be applied to software programs or ITSs used at scale. Cost-effective strategies in this context would be those in which the least amount of resources are used to develop responsive ITSs that lead to the greatest improvement in student learning. — Highlighted Sep 11, 2016

p.10: Can it be used in F2F classes? I’d assume so, with a 1975 origin date for the method… — Written Sep 11, 2016

p.10: standard methodology for estimating costs for the purposes of economic evaluations of educational interventions is the “ingredients method” developed by Levin (1975) and further refined by Levin and McEwan (2001). This approach estimates the opportunity cost of all resource components required to implement the intervention — Highlighted Sep 11, 2016
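
As I understand the ingredients method, it amounts to listing every resource an intervention needs, valuing each one, and adding them up. A minimal sketch of that bookkeeping follows; the `Ingredient` class, the ingredient list, the prices, and the class size are all hypothetical and not taken from the report's tables.

```python
from dataclasses import dataclass


@dataclass
class Ingredient:
    name: str
    quantity: float    # e.g. hours of time, number of licences
    unit_price: float  # dollars per unit, valued at opportunity cost

    @property
    def cost(self) -> float:
        return self.quantity * self.unit_price


def total_cost(ingredients):
    """Sum the cost of every ingredient needed to run the intervention."""
    return sum(item.cost for item in ingredients)


# Hypothetical ingredients for a small classroom observation study.
ingredients = [
    Ingredient("observer time (hours)", 40, 50.0),
    Ingredient("observer training (hours)", 8, 50.0),
    Ingredient("observation app licence", 2, 30.0),
]
students_observed = 30  # assumed

total = total_cost(ingredients)
print(f"total ${total:,.2f}, per student ${total / students_observed:,.2f}")
```

The same total can then be divided by students, hours of observation, or labels produced, which is how the report's per-unit figures seem to be derived.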

p.11: But, log files only store what is programmed to be logged. And software must be adapted to generate the logs. What about activities that occur outside of dedicated software? How can activity logs or documentation be generated outside of a single system? Computational ethnography… — Written Sep 11, 2016

p.11: the most ubiquitous methods are classroom observations recorded using pen and paper, a smartphone, a tablet, or a computer. Video analysis is also fairly common. Physiological detectors are used rarely and most often in lab situations rather than in typical classrooms due to their high costs and the difficulty of transporting and setting up the equipment in the field. Most recently, there has been a growing use of automated detectors applied to the log files generated when learners engage with computer software — Highlighted Sep 11, 2016

p.14: Table 1. Summary of Studies and Coding Options — Highlighted Sep 11, 2016

p.16: Table 2. Summary Table of Costs of Observation Methods — Highlighted Sep 11, 2016

p.20: Ingredients used in this study and associated costs are also shown in Table B7. Data collection costs accounted for 78% of the total costs when teacher judgments were used. As before, total costs for data collection and self-judgments were $412 per student observed, $770 per hour of observation time, or $4.30 per affect label assigned every 20 seconds. Total costs when teachers were involved were higher due to the greater costs of their time: $425 per student observed, $793 per hour of observation time, or $4.43 per affect label. However, because the teachers did not undergo FACS training, the costs were lower than for trained judges. — Highlighted Sep 10, 2016
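
A rough check on how the per-hour and per-label figures in this highlight relate: one label every 20 seconds is 180 labels per hour, so dividing the hourly cost by 180 should land near the quoted per-label cost. The helper names are mine, and I am assuming labels are assigned continuously for the whole hour.

```python
def labels_per_hour(interval_seconds):
    """Labels produced in one hour at one label per interval."""
    return 3600 // interval_seconds


def cost_per_label_from_hourly(cost_per_hour, interval_seconds):
    """Per-label cost implied by an hourly cost and a labelling interval."""
    return cost_per_hour / labels_per_hour(interval_seconds)


for who, hourly in (("self-judgments", 770), ("teacher judgments", 793)):
    per_label = cost_per_label_from_hourly(hourly, 20)
    print(f"{who}: ${per_label:.2f} per label")
```

Both results come out within a few cents of the $4.30 and $4.43 quoted above; I would guess the small gap is rounding in the report, but that is only a guess.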

p.21: The cost of “observing” each student for affect and engagement was therefore $23, the cost per hour of observation was $4, and the cost per observation label was just over a penny — Highlighted Sep 10, 2016

p.21: But what about interactions that happen outside of keystrokes…? — Written Sep 11, 2016

p.21: Their tasks included cleaning the data files, synchronizing the observation labels with the Inq-ITS log files so that affect labels could be matched to user keystroke patterns, identifying patterns in the data that appeared to indicate a particular affective state (“feature engineering”), writing the machine learning algorithms to identify and count the instances of each pattern in the log files, and finally applying the detectors to new log file data to obtain machine-generated predictions of students’ affective state based on their keystrokes. — Highlighted Sep 11, 2016
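
The synchronization step is the part I can most easily picture in code. Below is a minimal sketch, under my own assumptions, of matching each timestamped observation label to the log events in the seconds before it and deriving one trivial feature (an action count). The data layout, the 20-second window, and the function names are hypothetical; this is not the Inq-ITS log format or the authors' pipeline.

```python
from bisect import bisect_left, bisect_right


def events_in_window(event_times, events, label_time, window_seconds):
    """Log events with label_time - window_seconds <= time <= label_time."""
    lo = bisect_left(event_times, label_time - window_seconds)
    hi = bisect_right(event_times, label_time)
    return events[lo:hi]


def build_training_rows(log_events, observations, window_seconds=20.0):
    """Pair each observed affect label with features from the preceding log window."""
    log_events = sorted(log_events, key=lambda event: event[0])
    times = [t for t, _ in log_events]
    rows = []
    for label_time, label in observations:
        window = events_in_window(times, log_events, label_time, window_seconds)
        rows.append({
            "label": label,
            "n_actions": len(window),  # a deliberately simple engineered feature
            "actions": [action for _, action in window],
        })
    return rows


# Invented data: (seconds since session start, action) and (seconds, observed label).
log = [(0.0, "open_task"), (5.0, "type"), (30.0, "run_trial"), (38.0, "type")]
obs = [(10.0, "engaged concentration"), (35.0, "confusion")]
for row in build_training_rows(log, obs):
    print(row)
```

Rows like these would be the training data for a detector. The sketch also makes my worry visible: anything the student does that never reaches the log simply has no row to learn from.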

p.23: How might computational ethnographic methods compare with these results? Likely higher cost to develop the tools, but lower costs (and higher reliability?) once up and running? — Written Sep 11, 2016

p.23: Studies that involved classroom observations as opposed to video analysis or automated detectors were the lowest cost overall, ranging from around $3,500-$7,500. Inter-rater reliability was more or less comparable for observations recorded using a pen and paper protocol and those recorded using a smartphone application. All of them fell into Landis and Koch’s (1977) “substantial agreement” range, with one achieving a kappa at the top of this range, most likely because the observers were more experienced in the use of the observation protocol. — Highlighted Sep 11, 2016

p.24: While the low costs of applying automated detectors at scale are clearly attractive, accuracy of these detectors is less compelling. Agreement between the machine-assigned labels and the human coder labels averaged around 0.35 across all detectors, falling into Landis & Koch’s (1977) “fair agreement” range — Highlighted Sep 11, 2016

p.24: automated detectors developed using data from a population of students belonging to one demographic grouping did not generalize well to populations drawn from other groupings, — Highlighted Sep 11, 2016

p.24: As automated detectors become more reliable in assessing learners’ affect and engagement, we expect they will be embedded in the software itself so that the learner’s state can be detected real-time and the software will respond accordingly with messages, talking agents, or different activities, just as a live teacher might change pace or activity if she sees students yawning or looking puzzled. — Highlighted Sep 11, 2016