[Bestplus] 2 BEST Plus updates
Schwerdtfeger, Jane
JaneS at doe.mass.edu
Mon Feb 14 18:13:44 EST 2005
Dear BEST Plus trainers: we have recently had two questions come up about
BEST Plus scoring, and since this is the time many programs are testing for
the second time with BEST Plus, I thought I'd mention this to all of you.
You may run across something similar at your own programs, and this may help
you. This is long, but hopefully useful!
1) One program recently had this issue:
This curious situation arose because we have a set of identical twin sisters
in our intermediate (that's SPL 3-4 at CNA) ESOL class. I gave them a
follow-up BEST Plus this morning. Their teacher and I were comparing their scores
and we noticed what appeared to be a discrepancy. They each scored 1.58 on
listening comprehension and 1.15 on language complexity. One scored 2.25 on
communication while the other had 2.16. So far, so good. The confusing part
was that the student with the higher (albeit only slightly) communication
score had a lower scale score (459 vs. 463). Naturally, they like to know
each other's score. It's difficult to explain why the overall score was
higher for the one with the lower subscale score. If the difference were in
more than one subscale, then I would guess that somehow items were weighted
differently, but in this case I can't account for it.
2) Another issue came up with comparing the scores from one student's pre-
and post-tests:
I would like to fax you 2 summary reports - a pre and post test for one
student - that a teacher sent to me asking for some explanation, which
neither Jane nor I was able to give him. The issue is, as you will see,
that the student's Subscale Scores all went up in the 2nd test from the 1st,
and are all in or above the SPL 4 range. However, the student's overall
Scale Score went down and the SPL is a 3.
(Attached is a Word document that charts out the student's two sets of pre-
and post-test scores for you to see.)
<<BP Summary Report Scores 2-05.doc>>
I thought it was how the questions were weighted, but wanted to confirm with
Carol. Her reply is below:
Hi Dori and Jane, Thanks for sending this information on to us and, Jane,
for passing the other 'mystery' on about the two sisters. These are great
questions, and we hope that as we get more pre- and post-test data back,
we will be able to study whether the subscale ranges need to be adjusted.
Right now they are based on field test data--however, the operational
version of the test has been tweaked. Data from actual use will inform any
adjustments. The short explanation is, Jane, you were pretty much right that
it has to do with the difficulty level of the questions. Below are the more
extensive explanations from Dorry that I "translated" a tiny bit. Let me
know if you want to talk about these or need more information.
Regarding the two sisters: The two sisters scored 4 points apart--that
is not a significant difference (esp. on a scale from 88 to 999!). It's quite
eerie when you think about how close they were. This actually shows that the
adaptive version works well. It's quite interesting that they scored exactly
the same in two scale categories. However, there is NOT a direct
relationship between raw scores (which is what the subscale scores
represent) and the scale score because the scale score takes into account
the difficulty of the questions.
Here is, perhaps, a more concrete example of what is going on. Let's say
that Jane and I each get four questions on a test. We each get three right
and one wrong. Both our raw scores are 3. However Jane's questions are much
more difficult than mine. Her ability (i.e., the scale score) will be higher
even though we got the same "raw" scores. The raw scores (e.g., the
averages) are only to be used diagnostically to show RELATIVE strengths and
weaknesses WITHIN an individual (as compared to averages in the SPLs). There
is no absolute way to interpret them and they cannot be used to show
differences BETWEEN two individuals. If they are causing trouble, they
should be removed from the score report. We will look into this as we get
more data. So please continue to report these issues. And let us know what
you think about having the subscales reported. Is this causing too much
confusion?
The example with the twin sisters shows how the adaptive nature of the BEST
Plus works. The different questions that they got, based on the ability
estimate, gave them different scale scores (even though the subscales--or
raw scores--appeared to be similar).
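The four-question example in the reply can be sketched in a few lines of Python. This is only an illustration, not the BEST Plus scoring procedure: it assumes a simple right/wrong Rasch-style model, and the item difficulties and function names are made up for the example. It shows how two people with the same raw score (3 of 4) end up with different ability estimates when one set of questions is harder.

```python
import math

def p_correct(ability, difficulty):
    # Assumed Rasch-style model: chance of a correct answer depends only on
    # the gap between the person's ability and the item's difficulty.
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(difficulties, raw_score, lo=-6.0, hi=6.0):
    # Maximum-likelihood estimate under that model: find the ability at which
    # the expected number correct equals the observed raw score (bisection).
    for _ in range(60):
        mid = (lo + hi) / 2.0
        expected = sum(p_correct(mid, d) for d in difficulties)
        if expected < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical difficulties: Jane's four questions are each 1.5 units harder.
my_items = [-1.0, -0.5, 0.0, 0.5]
jane_items = [0.5, 1.0, 1.5, 2.0]

# Both of us get three right and one wrong (raw score 3).
my_ability = estimate_ability(my_items, 3)
jane_ability = estimate_ability(jane_items, 3)
print(f"same raw score, abilities: me={my_ability:.2f}, Jane={jane_ability:.2f}")
```

Under these assumptions Jane's estimate comes out higher by the same 1.5 units that separate the two sets of questions, even though the raw scores are identical.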
Regarding the pre- and post- test scores of Dori's student:
First, there is no statistically significant difference between the two
scores. The first score was 448, the second 434. That's a difference of 14
points. With the
BEST Plus stopping rule, the standard error is 20 points. The two scores are
within one standard error. Most likely, the student's ability did not change
much from one test to the other. Unfortunately, she was right at the border
between SPL 4 and SPL 3. Her first score was only 9 points above the SPL 4
border, her second score only 4 points below the SPL 3 border. She's a low
4/high 3. There's not a big difference. If she had been a high 4 (e.g., 470)
and changed to a low 3 (e.g., 420), that's a difference of 50 points, which
is two and a half times the standard error. That would be something to be
concerned about, but the difference of 14 points is really not a big concern
although it crossed an SPL threshold and looks discouraging to the teacher
and the student.
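The arithmetic in this paragraph can be checked with a short Python sketch. The standard error of 20 points and all scores are taken from the message; the function name is hypothetical, and the comparison against one standard error simply mirrors the rule of thumb used in the reply.

```python
STANDARD_ERROR = 20  # points, as stated in the reply for the BEST Plus stopping rule

def difference_in_standard_errors(first_score, second_score, se=STANDARD_ERROR):
    # Express the gap between two scale scores as a multiple of the standard error.
    return abs(first_score - second_score) / se

# Dori's student: 448 on the pre-test, 434 on the post-test.
actual = difference_in_standard_errors(448, 434)

# The hypothetical case from the reply: a high 4 (470) dropping to a low 3 (420).
hypothetical = difference_in_standard_errors(470, 420)

# The twin sisters: 459 vs. 463.
twins = difference_in_standard_errors(459, 463)

for label, value in [("student", actual), ("hypothetical", hypothetical), ("twins", twins)]:
    verdict = "within one standard error" if value <= 1 else "more than one standard error"
    print(f"{label}: {value:.2f} SE ({verdict})")
```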
The second thing is: if the student's ability didn't change, but for some
reason she got several easier questions in the second administration, her
raw score (e.g., the average subscales) could well improve (that is, she was
scoring higher, because the questions were easier). However, her scoring
higher on easier items didn't push her overall score to make a measurable
difference. The average subscale scores can only be interpreted in relative
terms, not in absolute terms, for diagnostic purposes only.
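A minimal sketch of this second point, again assuming a simple right/wrong Rasch-style model rather than the actual BEST Plus scoring: the ability value and item difficulties below are invented for illustration. With ability held fixed, the expected raw average rises when the questions are easier.

```python
import math

def p_success(ability, difficulty):
    # Assumed Rasch-style probability of success on one item.
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def expected_raw_average(ability, difficulties):
    # Average expected success across the items a student happened to receive.
    return sum(p_success(ability, d) for d in difficulties) / len(difficulties)

ability = 0.0                    # hypothetical, unchanged between administrations
harder_items = [0.0, 0.5, 1.0]   # hypothetical first administration
easier_items = [-1.0, -0.5, 0.0] # hypothetical second administration

pre = expected_raw_average(ability, harder_items)
post = expected_raw_average(ability, easier_items)
print(f"expected raw average: pre={pre:.2f}, post={post:.2f}, ability unchanged")
```

In this sketch the raw average goes up between administrations while the underlying ability is, by construction, exactly the same.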
Remember those subscale averages were made on the fixed field test form, not
the adaptive version. As I mentioned above, if we get datasets for the gain
scores study (and I think MA and IL are sending us data), perhaps we can
adjust them.
I wouldn't say that this student has decreased ability, but she has not made
a significant measurable improvement.
It would be good to see her data and follow her through. There are
randomization and decision rules in the program that can be changed. We have
tried to do our best to prevent anyone from being disadvantaged. However,
special cases like this may show where further improvements could be made.
If the scores from the tests are still in the database, we would like to
have them. If you could make a backup of the pre- and post-test database using
the software management system (SMS) and send it to us, it would be helpful.
I have attached the instructions that we send to trainers in training to
email us their database. I think they will work. If you have trouble, let me
know. Email them to both me and Dorry.
Thanks again for being attentive to this. Carol
Carol also said, "As you talk to others, it would be great if they could
send us any data similar to Dori's. The data we are already
collecting from programs in MA will enable us to look at the operational
subscores and make adjustments as well."
Please do ask questions as you get them--it is helpful for us all to
know, since programs may experience similar problems, and also so CAL can
potentially fix any problems as they crop up. If you want to discuss
Carol's response further, please give me a call-- (781) 338-3855. Thank you
for your help as BEST Plus experts and trainers!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BP Summary Report Scores 2-05.doc
Type: application/msword
Size: 24576 bytes
Desc: not available
Url : http://lists.literacytent.org/pipermail/bestplus/attachments/20050214/cce11d7f/BPSummaryReportScores2-05.doc