The Usability of Multimedia Automated Psychological Tests to Screen for Alzheimer's Disease
Emory Hill, PhD, Screen, Inc., Seattle, WA
Kenric W. Hammond, MD VA Puget Sound Health Care System, Seattle, WA
Proceedings of the American Medical Informatics Association Symposium 2000:1030.
The task of developing a multimedia neuropsychological test battery to screen the elderly for Mild Cognitive Impairments and Alzheimer's disease revealed the value of "designing for usability" in the evolution of the instrument. This paper describes the systematic process that produced an instrument ready for psychometric validation and meeting the goal of self-administration. The method was progressive repetition of usability tests with target users alternating with expert reviews. With each version the subject pool was larger and more varied with respect to cognitive ability and age. A reliability evaluation was conducted on the final version.
Conclusions: Progressive expansion resulted in test interactions that were readily used by impaired as well as normal people. The instrument rated high in satisfaction, while measures of distracting cognitive interference remained low. All tests demonstrated significant reliability. These properties permit longitudinal testing to track changes and treatment effects even after significant impairments are detected.
The prevalence of Alzheimer's Disease (AD) doubles approximately every five years from age 65 to age 85(1). At present, there is a much greater likelihood of secondary prevention than either primary prevention or cure. Identifying people at risk for the disease would assist the selection of patients for intensive diagnostic evaluation and symptom retardation therapy. Presently, mental status screening is often delayed until symptoms are relatively advanced. Effective screens capable of signaling the risk of dementia in the target population are needed. The purpose of this study was to develop and evaluate the usefulness of a computerized neuropsychological detection tool designed to detect the mild cognitive impairments most predictive of Alzheimer's Disease. This effort began with the premise that an effective cognitive screening battery must meet these usability criteria:
1. Acceptance by target subjects.
To realize these goals, we adopted the following design principles:
1. Acceptable tests should be pleasant to take. They will produce little cognitive interference attributable to the testing.
Test Selection: Previous studies (2,4-7) indicate that scores on the following cognitive tasks can be combined to significantly enhance the predictive validity of a neuropsychological test battery for mild cognitive impairments predictive of AD:
1. Generating matching names for pictured objects.
Also included are direct and indirect measures (a Stroop test and within-test proactive interference, respectively) of "inhibitory breakdowns" that are postulated to occur between a variety of cognitions when AD is imminent.(8) Since spatial disorientation is reliably observed early in the course of AD(9), a Clock test measuring spatial relations was included.
Test design and response platforms: The multimedia authoring programs produced by Macromedia™ were selected to program interactions with the computer. They offered the integration of text, graphics, animation, digital video, and sound. They also permitted timing adjustments to minimize distracting pauses. For simplicity, a touch screen was chosen as the input method. A brief, simple set of orienting tasks was adequate to prepare subjects for testing. Automated tests cannot rely upon the validity and reliability studies used to justify measuring the same dimensions of cognition with their traditional counterparts(10). In this study, both the stimulus and response characteristics of the automated tests are markedly different from relevant traditional tests. For instance, the range of responses in the computerized tests was limited to touch screen responses. However, the substituted responses were found in previous studies to produce error rates similar to the rates produced by verbal responses(7,11). Major studies will be necessary to assess the validity of these new tests.
Test development involved three components of prototype testing and refinement:
1. A multidisciplinary evaluation by experts.
Steps (1) and (2) were repeated because of the often-observed lack of overlap between the usability problems identified at different stages of development by experts and those identified by end users in the target population(12).
Subjects: The professional inspectors were selected from different theoretical backgrounds. Two neuropsychologists, a medical informatics specialist, and a geriatric education specialist all performed inspections of two versions of each test. The target group end-users were recruited by advertisements requesting the paid participation of "healthy older adults" as well as "people with memory problems". Subjects had to meet the following inclusion criteria: 55-95 years old, adequate visual acuity, English speaking, adequate hearing, normal arm and hand agility, and the ability to sustain a seated position (40 minutes minimum). Subjects were excluded who had very recent surgery or illness, cognitive side effects of medication, or any recent history of alcohol abuse.
Procedures: Each subject was asked to attend two sessions, each lasting approximately one hour. Informed consent was obtained to begin the first session. On both occasions subjects were given the automated test battery, followed by questionnaires about satisfaction with, and cognitive interference during, testing. They completed the Wechsler Memory Scale III's Logical Memory 1 and 2 components (memory for 2 paragraphs), before and after automated testing, respectively. The Mattis Dementia Rating Scale(13), a test that samples a wide range of cognitive functions (attention, perseveration, construction, conceptualization, and memory), was completed before the second automated testing session (within six weeks of the first). These tests provided cognitive ability scores to demonstrate how broadly applicable the tests are and later serve as test validity measures. Table 1 shows the four cycles of end-user prototype testing that were performed, with each cycle administered to an increasing number of subjects:
Table 1. Progressive Sampling of Subjects
Version 1, Expert Inspection. Professional inspectors examined the interface and evaluated its compliance with recognized usability categories. These include its simplicity, use of natural dialog that speaks the user's language, consistency, feedback, and the prevention of errors due to general confusion(12). The inspectors were blinded to each other's opinions, and their comments were aggregated. Evaluators were encouraged to think aloud about their experience as they took the tests, and their questions were answered as they proceeded. During this first inspection, numerous comments were made concerning distracting inconsistencies in voice level, text dissolves, prompt wording, prompt timing, graphic substitutions, and discrepancies between the text and audio. The education specialist reported a need for more feedback after errors were made and recommended additional practice for one section. The computer specialist pointed out the graphic, transition, and audio inconsistencies. The neuropsychologists commented on measurement usefulness with respect to the abilities of older people (e.g. "Things move too quickly" and "No regrouping pauses after the tasks"). They pointed out the effects that earlier tests might have upon later ones, the value of interference items, and ways to alter the tests to more critically assess memory function.
Version 1, Subject Usability Inspection: Most defects in the tests could be detected by observing the subjects, even when the subjects were not aware of them. Almost all corrections were made as a result of observing subjects, not in response to their comments. The subjects made few comments about the tests except about test difficulty or about themselves. However, several noted a problem with button sensitivity (two quick touches counting as two answers rather than a single intended response).
Version 2, Subject Usability Inspection: All five elderly subjects who were administered Version 2 of the tests were without cognitive impairments, based upon the WMS Logical Memory scores. All five were able to complete the test battery without assistance. They all reported greater benefits of computer testing over in-person testing. They appreciated the clarity and repetition of instructions, when needed. Three of these subjects reported problems with automated testing associated with "speed" or the frustration of not being able to revise answers.
Version 3, Subject Usability Inspection: Only one of the three subjects who took Version 2 reported problems with Version 3. These problems involved "touching the screen too soon" and not being able to revise answers. The dimension of concentration enhancement was mentioned, directly or indirectly, as a benefit of automated testing by four of the eight subjects taking Version 3 at their first testing session. They reported being "less distracted" than during testing by a person and not feeling "pressured or graded". One said, "I can concentrate on the questions and not look to a face for inflections plus or minus." Another said, "I could concentrate better because my total attention was on the screen." The smooth functioning of self-administered computer tests seems to permit people to expose otherwise embarrassing limitations in a manner that avoids activating their interpersonal defenses. The primary problem with test accuracy was the double-touching problem. Delays between items were added to prevent double responses to the same item. The researcher observed three times as many technical problems encountered by subjects as the subjects reported themselves. All subjects, regardless of computer experience or cognitive ability (one had been diagnosed with "Probable Alzheimer's"), were able to complete all tests without assistance of any kind.

User Satisfaction: In the present context, satisfaction is based upon the ability to progress comfortably from task to task without procedural confusion.
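The double-touch fix described above amounts to a debounce: a touch arriving within a short lockout window after an accepted response is treated as part of that response rather than as a new answer. A minimal sketch in Python, where the class name and the half-second window are illustrative assumptions, not details from the original study:

```python
import time

class DebouncedTouch:
    """Ignore repeat touches within a lockout window (illustrative sketch).

    The 0.5-second default is a hypothetical value, not the delay
    actually used in the study's test battery.
    """

    def __init__(self, lockout_seconds=0.5, clock=time.monotonic):
        self.lockout = lockout_seconds
        self.clock = clock
        self._last_accepted = None

    def accept(self, touch_time=None):
        """Return True if this touch counts as a new answer."""
        now = self.clock() if touch_time is None else touch_time
        if self._last_accepted is not None and now - self._last_accepted < self.lockout:
            # Double touch: swallow it as part of the previous response.
            return False
        self._last_accepted = now
        return True
```

With this in place, two quick touches register as a single answer, while a touch after the window counts normally.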
As shown in Table 2, all 11 subjects taking Version 3 agreed that the automated tests gave adequate preparation before each test. They all reported that the tests measure parts of memory that are important. Almost all of these subjects found the automated testing to be "comfortable", "natural", and "relevant to skills used in every day in my life". None found the testing method to be irritating. The range of opinions was greater with respect to how complicated the tests felt and the extent to which the tests were "more like games than medical tests." Subjects who took Version 3 at their second testing were the most positive of all; they had seen the tests change from crude, awkwardly timed tests to "well-timed, polished" tests.
Version 4: Professional Usability Inspection: When the fourth version was nearly complete, the professional inspectors again performed a usability inspection. In this inspection, only one comment was made in each of the following categories: voice level, graphic clarity, instruction clarity, and feedback after correct or incorrect responses.
Table 2. Version 3 Satisfaction Scores
Final Subject Usability Inspection: Twenty-nine male and 44 female volunteers (n=73) between 55 and 93 years old (Mean Age = 77.3) were first tested on the final version. Seven of these subjects had already received diagnoses of dementia or probable Alzheimer's disease. Two had previously diagnosed mild cognitive impairments, and one had chronic schizophrenia.
Only one of the 81 total subjects taking the final version was unable to complete all of the tests. This subject had been diagnosed with Alzheimer's Disease two years previously and had by far the lowest total Mattis Dementia Rating Scale score, near the mean for AD patients. One other subject, diagnosed a year earlier with dementia, needed human guidance (in the form of encouragement) to continue the practice items for one test, after requesting eight times to see the instructions again. He was able to take the test, and, although his score was low, his score was measurable. It seemed unlikely that a human tester would have repeated the instructions eight times before concluding that he was unable to take the test. This subject was able to self-administer all the other tests in the intended manner. Four other subjects with diagnoses of AD were able to complete the tests; two needed minor assistance with the instructions on one test, and two completed all tests without assistance. Three of the 73 subjects with no previously known impairments required minor (single instance) assistance with instructions on one test. Over 90% of the subjects, including those with dementia, completed all tests without assistance of any kind.
Forty-eight percent of the subjects reported that they were at least "somewhat" afraid of not knowing what to do on the computer, but prior computer experience bore no relationship to satisfaction scores. Subjects first administered Versions 3 or 4 and satisfaction questionnaires (N=22) all described the automated tests as pleasant and comfortable to take. Their answers to open-ended questions comparing automated to traditional tests were categorized in Table 3. The only automation problem category was speed. Six comments were made about speed, almost all about doing poorly on the timed tests.
Table 3. Benefits of Automation Category
Ten items of the questionnaire, validated as measures of test anxiety by Sarason (14), measure the frequency of task-relevant and task-irrelevant interfering thoughts during test-taking on a five-point scale (Never =1, Once =2, A few times =3, Often =4, Very often =5). The mean scores for all subjects (N=22) on all scales were between 1 and 3, the highest being thoughts about the purpose of the tests (Mean = 2.5) and about "how often I got confused" (Mean = 2.6). Subjects were least distracted by thoughts about "something that made me tense" (Mean = 1.4) or "what the test giver would think of me" (Mean = 1.6) during testing. There were trends for subjects to report being less nervous before the second session than before the first, to remain nervous less often, and to have fewer intruding thoughts during their second sessions. The seven final version subjects with the lowest scores on both of the Wechsler Logical Memory Tests were compared on the questionnaire items to ten final version subjects with higher scores. Although there were no significant differences between the groups on any of the questionnaire items, the more memory-impaired subjects tended to report more thoughts during testing about the difficulty of the tests, how poorly they were doing, and how others have done on the tests. However, subjects with better Wechsler test scores tended to report higher levels of pre-test nervousness, remaining more nervous throughout the tests, with more thoughts about the purpose of the tests and about how they would feel if told their results.
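The item means reported above are simple averages of the 1-to-5 frequency ratings across subjects. A minimal scoring sketch, using made-up responses for illustration (the sample data are not the study's):

```python
# Frequency scale from the Sarason-validated questionnaire:
# Never=1, Once=2, A few times=3, Often=4, Very often=5.

def item_mean(responses):
    """Mean frequency rating for one questionnaire item across subjects."""
    return sum(responses) / len(responses)

# Illustrative responses for one item from a small sample
# (hypothetical data, not the study's actual ratings).
purpose_of_tests = [3, 2, 2, 3, 3, 2, 3, 2]
```

Means near 1 indicate an interfering thought almost never occurred; means near 3 indicate it occurred a few times per session.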
Reliability (These results have been supplemented with a much larger sample).
The stability of the component measures over time was assessed using test-retest reliability (Table 4). Correlation coefficients were calculated to compare the responses from Time 1 to Time 2. A one-month test-retest period was chosen for the reliability test. A short retest interval is preferred, even for tests of memory, because of the importance of limiting any significant changes in cognitive ability attributable to either actual change or changes in medication. The test-retest reliability correlation coefficients were all significant at the .001 level.

Table 4. Test-Retest Reliability of Final Version, Initial vs. Retest Scores (One-Month Retest Interval)
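Test-retest reliability of this kind is conventionally computed as the Pearson correlation between each subject's Time 1 and Time 2 scores on a subtest. A self-contained sketch, with illustrative paired scores rather than the study's data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between paired score lists (Time 1 vs. Time 2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical paired subtest scores for seven subjects,
# one month apart (not the study's actual data).
time1 = [12, 15, 9, 20, 14, 11, 18]
time2 = [13, 14, 10, 19, 15, 10, 17]
```

A coefficient near 1.0 indicates stable rank ordering of subjects across the retest interval; significance of r is then assessed against the sample size.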
DISCUSSION & CONCLUSION
All criteria used to assess the feasibility of this type of automated, self-administered testing were met in this sample of elderly volunteers. The primary criterion of usability - the range of elderly people who could fully self-administer the tests - was successfully met. Over 90% of 81 subjects, seven of whom already had Alzheimer's Disease, were able to complete the tests without assistance of any kind. Only two subjects were unable to complete all tests. Both had been previously diagnosed with dementia. All of the remaining subjects, some with significant limitations on standardized tests of cognitive ability, were able to complete all of the tests with minimal or no assistance. The secondary criterion for usability was user satisfaction and the ability to complete tests without marked cognitive interference during testing. This criterion was exceeded. Computer interactions were designed that provided a high degree of satisfaction with the tests, compared to face-to-face testing. Self-administered computer testing appears to offer people a way to expose their otherwise embarrassing limitations without activating their interpersonal defenses. The tests' progression of tasks and instructions minimized confusion and anxiety, producing low levels of anxiety-based cognitive interference during testing, as measured on standardized questionnaires for test anxiety(14). Subjects reported that they experienced very few distracting thoughts during testing. They were most likely to be distracted by thoughts about their confusion during testing and unlikely to be distracted by thoughts about what the tester thought of them.
Satisfaction with the advantages of computer testing was expressed for the clarity and availability of instructions, the visual interest evoked by the graphics, the relative absence of stress and interpersonal anxiety, and the general ability of the tests to hold interest. Open-ended questions confirmed that subjects were positive in their perception of computer testing. Dissatisfaction with the tests was expressed only with respect to the speed demands of the tests. There were no noticeable trends toward greater satisfaction with the final test version than with earlier versions, except those due to the reduction of anxiety associated with a second testing session.
The basic structure, style, and concern for user friendliness were present in early as well as late versions. This usability permits longitudinal follow-up testing to track the effects of treatment even after significant impairment is detected. The test-retest reliability correlation coefficients were all strong enough to provide a foundation for future studies of the tests' sensitivity and specificity. In addition to offering economical, standardized administration, another possible advantage these computer tests offer over paper-and-pencil methods is their ability to record response latency data. Validity research is needed to determine the sensitivity and specificity of the tests and whether or not longitudinal changes in latency measures can improve test validity.