I have a program in R for looking at grade distributions in my class. I recently found something weird with my 'ifelse' processing: the program seemed to be overcounting Cs and undercounting Bs.

I'm not sure what's going on. It happened in the case where I was adding extra credit, so I guess it has something to do with the addition operation and variable assignment. When I process untransformed data I get the correct number of Bs and Cs. But my goal with a program like this was flexibility: enter the data in Excel and process it in R. I'd prefer to add extra credit, curves, corrections, etc. in R rather than doing it all in Excel.

A very short modified version (without reading in the data from CSV) follows. Am I actually making a simple error in my math, or does R see these values differently than they appear? I'm admittedly not a regular user of the 'ifelse' logic in R. But even if it is wrong, it should give me the same wrong answer when applied to what appear to be the same values! If I run the summary function against all the variables, I get matching results for grades$grades2 and grades$ec (the modified grade).

grades2 <- c(0.72,0.56,0.84,0.84,1.04,0.48,0.96,
             0.8,0.68,0.92,0.72,0.6,0.92,0.72,0.88,0.88,0.76,
             0.96,0.76,0.52,1,0.88,0.88,0.88,0.64)
grades1 <- c(0.64,0.48,0.76,0.76,0.96,0.4,0.88,
             0.72,0.6,0.84,0.64,0.52,0.84,0.64,0.8,0.8,0.68,
             0.88,0.68,0.44,0.92,0.8,0.8,0.8,0.56)
grades <- data.frame(cbind(grades1,grades2))

# format grades
grades$letter1 <- ifelse(grades$grades1 >= .90, "A",
                  ifelse(grades$grades1 >= .80 & grades$grades1 < .90, "B",
                  ifelse(grades$grades1 >= .70 & grades$grades1 < .80, "C",
                  ifelse(grades$grades1 >= .60 & grades$grades1 < .70, "D", "F"))))

# letter grade distribution
table(grades$letter1)

grades$letter2 <- ifelse(grades$grades2 >= .90, "A",
                  ifelse(grades$grades2 >= .80 & grades$grades2 < .90, "B",
                  ifelse(grades$grades2 >= .70 & grades$grades2 < .80, "C",
                  ifelse(grades$grades2 >= .60 & grades$grades2 < .70, "D", "F"))))

# letter grade distribution
table(grades$letter2)

# bonus: add extra credit so that grades1 equals grades2
grades$ec <- grades1 + .08

# why is it that this misclassifies an .80 as a C in this case but not in the
# case of all of the previous grades1 and grades2 instances??
grades$letter.ec <- ifelse(grades$ec >= .90, "A",
                    ifelse(grades$ec >= .80 & grades$ec < .90, "B",
                    ifelse(grades$ec >= .70 & grades$ec < .80, "C",
                    ifelse(grades$ec >= .60 & grades$ec < .70, "D", "F"))))

# letter grade distribution
table(grades$letter.ec)

summary(grades)

# if you print grades you can see that it misclassifies an .80 as a C
# for obs 8 for the calculated ec grade, but correctly classifies
# an .80 as a B for cases of the grades1 and grades2 variables

Probably a round-off error, e.g. 0.72 + 0.08 evaluating to 0.79999... instead of exactly 0.8.

You can use:

grades$ec <- round(grades1 + .08, digits=2)

did you already check this?

I've thought about that, but I just don't believe the number is rounded. In the case where I have the problem, I'm just taking a value like .72 and adding .08, which gives me a 'clean' .80 that should be assigned a 'B' rather than a 'C'. That's why I entered the grades line by line for this example, just in case reading from Excel gave me some unexpectedly formatted values.

I agree that 'round' would be a good idea for future cases where my data is not so neat. I'll try it just to see if it works in this case. Thanks!

@anonymous - your correction worked. No experienced R user should be surprised, I suppose. I guess it's always good practice to round!

Thanks again.

Rounding has nothing to do with it, and this also has nothing in particular to do with R. It's simply a limitation of how a computer represents floating point numbers. You can't exactly represent every decimal fraction in binary: values such as 0.72, 0.08 and 0.8 are all stored as the nearest binary approximation. Can't be done any other way.

Do this:

x <- 0.72 + 0.08
options(digits = 17)
x

See? Many times a "simple" looking float doesn't have an exact representation in binary. Also note that all.equal() returns TRUE, as it should, because it compares within a small tolerance rather than bit for bit:

all.equal(grades$ec[8],0.8)
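To make the comparison above concrete, here is a minimal sketch using only base R. It contrasts an exact comparison with a tolerance-based one for the same sum. Wrapping `all.equal()` in `isTRUE()` is a common idiom, since `all.equal()` returns a character description rather than `FALSE` when the values differ.

```r
x <- 0.72 + 0.08

# Print more digits than the default to expose the stored values.
sprintf("%.17f", x)
sprintf("%.17f", 0.8)

x == 0.8                   # FALSE: the two doubles differ in their last bits
x >= 0.8                   # FALSE: this is why the grade falls through to "C"
isTRUE(all.equal(x, 0.8))  # TRUE: equal within all.equal()'s default tolerance
```

The second comparison is the one the nested ifelse() performs, which is why only the computed column is affected.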

One methodological change would be to represent all of the values as integers (e.g. if you never have fractions of a point, multiply them all by 100; if you use half points, multiply by 1000) and round the result to a whole number before doing any further arithmetic. Integers don't have any floating point inconsistencies.
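A rough sketch of the integer idea, assuming grades are recorded in whole percentage points. The four values are an illustrative subset of the post's data, and the names `pct1` and `pct_ec` are made up for this example.

```r
grades1 <- c(0.64, 0.72, 0.80, 0.88)   # illustrative subset of the post's data

# Convert to whole percentage points. round() absorbs any tiny binary error
# before as.integer() drops the fractional part.
pct1 <- as.integer(round(grades1 * 100))

# Add the extra credit as an integer, so no floating point sum is involved.
pct_ec <- pct1 + 8L

pct_ec          # 72 80 88 96
pct_ec >= 80L   # FALSE TRUE TRUE TRUE
```

The round() step matters: multiplying a double by 100 can itself land slightly below the intended whole number, and as.integer() truncates rather than rounds.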

Have you tried the recode function in the car package (car::recode)? It is a lifesaver most of the time.

I would program this entirely with integers to avoid the rounding. And have you heard of cut()?

cut(x=as.integer(round(grades$ec*100)),
    breaks=c(0L,60L,70L,80L,90L,1000L),
    labels=c('F',LETTERS[4:1]),right=FALSE)

Another approach to solve the problem would be to change the breakpoints, for example from 90 to 89.5 (that is, from 0.90 to 0.895 on the 0-1 scale used in the post).
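One way the shifted breakpoints could look, as a sketch only. It assumes grades are recorded to two decimals, so no legitimate score falls between, say, 0.795 and 0.80. The names `ec`, `cutoffs` and `letters_ec` are hypothetical, and cut() is borrowed from the earlier comment.

```r
ec <- c(0.64, 0.72, 0.80) + 0.08   # 0.72 + 0.08 lands just below 0.8

# Each cutoff is moved down by half a percentage point (0.895 instead of 0.90,
# and so on), so a value that is off by a few bits still clears its cutoff.
cutoffs <- c(0.60, 0.70, 0.80, 0.90) - 0.005

letters_ec <- cut(ec,
                  breaks = c(-Inf, cutoffs, Inf),
                  labels = c("F", "D", "C", "B", "A"),
                  right  = FALSE)   # intervals are closed on the left

as.character(letters_ec)   # "C" "B" "B"
```

Note that this also changes the grading rule for any score that really is recorded at, for example, 0.797, so it only suits data with a known precision.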

This problem is a corner of Circle 1 of 'The R Inferno': http://www.burns-stat.com/documents/books/the-r-inferno/

The author beat me to the R-inferno reference. Follow the link and read it - it's an excellent collection of things that make your (after)life a misery in R...

(Some of the examples are more generally applicable, too)

+ + + + +

At the risk of writing something you already know: you can index vectors (arrays etc) conditionally.

grades$letter1[grades$grades1 >= .80 & grades$grades1 < .90] <- "B"

Also, if your data set is sufficiently small (enough so that repeated assignments are not a problem), you only have to worry about one end of the cutoffs:

grades$letter1 <- "F"
grades$letter1[grades$grades1 >= 0.6] <- "D"
grades$letter1[grades$grades1 >= 0.7] <- "C"
grades$letter1[grades$grades1 >= 0.8] <- "B"
grades$letter1[grades$grades1 >= 0.9] <- "A"

I find the above easier to maintain than nested ifelse() - ease of reading, however, is a matter of personal opinion and taste...
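For reuse across several columns, the one-sided assignments could be wrapped in a small function. This is only a sketch: `letter_grade` is a hypothetical helper, not something from the post, and it folds in the round() fix suggested earlier in the thread so that computed values such as 0.72 + 0.08 are classified as intended.

```r
# Hypothetical helper: rounds to two decimals first, then applies the
# one-sided cutoffs from lowest to highest so later assignments win.
letter_grade <- function(score) {
  score <- round(score, digits = 2)
  out <- rep("F", length(score))
  out[score >= 0.6] <- "D"
  out[score >= 0.7] <- "C"
  out[score >= 0.8] <- "B"
  out[score >= 0.9] <- "A"
  out
}

letter_grade(c(0.64, 0.72, 0.80) + 0.08)   # "C" "B" "B"
```

With the data frame from the post, a call such as `grades$letter.ec <- letter_grade(grades$ec)` would replace the nested ifelse().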

(Note: this does not address the question posed in your post. It gets you to the same place by a slightly different route.)