Econometric Sense: Why does IFELSE logic work differently on what appear to be the same values?

Friday, February 22, 2013

Embarrassingly I'm stumped on this...

I have a program in R for looking at grade distributions in my class. I recently found something weird in my 'ifelse' processing: my program seemed to be overcounting Cs and undercounting Bs.

I'm not sure what's going on. It happened in the case where I was adding extra credit, so I guess it has something to do with the addition operation and variable assignment. When I process untransformed data I get the correct number of Bs and Cs. But my goal in a program like this was flexibility: enter the data in Excel and process it. I'd prefer to add extra credit, curves, corrections, etc. in R through processing rather than doing it all in Excel.

A very short modified version (without reading in the data from CSV) follows. Am I actually making a simple error in my math, or does R see these values differently than they may appear? I'm admittedly not a regular user of the 'ifelse' logic in R. But even if it is wrong, it should give me the same wrong answer when applied against what appear to be the same values! If I run the summary function against all the vars, I get matching results for grades$grades2 and grades$ec (the modified grade).

grades2 <- c(0.72,0.56,0.84,0.84,1.04,0.48,0.96,
0.8,0.68,0.92,0.72,0.6,0.92,0.72,0.88,0.88,0.76,
0.96,0.76,0.52,1,0.88,0.88,0.88,0.64)
 
grades1 <-c(0.64,0.48,0.76,0.76,0.96,0.4,0.88,
0.72,0.6,0.84,0.64,0.52,0.84,0.64,0.8,0.8,0.68,
0.88,0.68,0.44,0.92,0.8,0.8,0.8,0.56)
 
grades <- data.frame(cbind(grades1,grades2))
 
# format grades
 
grades$letter1 <- ifelse(grades$grades1 >= .90,"A", ifelse(grades$grades1 >= .80 & grades$grades1 < .90, "B",ifelse (grades$grades1 >= .70 & grades$grades1 < .80,"C",ifelse(grades$grades1 >= .60 & grades$grades1 < .70, "D","F"))))
 
# letter grade distribution
table(grades$letter1) 
 
grades$letter2 <- ifelse(grades$grades2 >= .90,"A", ifelse(grades$grades2 >= .80 & grades$grades2 < .90, "B",ifelse (grades$grades2 >= .70 & grades$grades2 < .80,"C",ifelse(grades$grades2 >= .60 & grades$grades2 < .70, "D","F"))))
 
# letter grade distribution
table(grades$letter2) 
 
# bonus: add .08 extra credit to grades1 so that it equals grades2
 
grades$ec <- grades1 + .08
 
# why is it that this misclassifies an .80 as a C in this case but not in the
# case of all of the previous grade1 and grade2 instances??
 
grades$letter.ec <- ifelse(grades$ec >= .90,"A", ifelse(grades$ec >= .80 & grades$ec < .90, "B",ifelse (grades$ec >= .70 & grades$ec < .80,"C",ifelse(grades$ec >= .60 & grades$ec < .70, "D","F"))))
 
# letter grade distribution
table(grades$letter.ec) 
 
summary(grades)
 
 
# if you print grades you can see that it misclassifies an .80 as a C
# for obs 8 for the calculated ec grade, but correctly classifies
# an .80 as a B for the grades1 and grades2 variables

9 comments:

Anonymous, February 22, 2013 at 12:13 PM
Probably some round-off error, such as 0.72 + 0.08 = 0.79999.
You can use:
grades$ec <- round(grades1 + .08, digits=2)

Did you already check this?

Matt Bogard, February 22, 2013 at 12:20 PM
I've thought about that, but I just don't believe the number is rounded. In the case where I have the problem, I'm just taking a value like .72 and adding .08, giving me a 'clean' .80 that should get assigned a 'B' rather than a 'C'. That's why I entered the grades line by line for this example, just in case reading from Excel gave me some weird, unexpectedly formatted values.

I agree that 'round' would be a good idea for future cases where my data is not so neat. I'll try it just to see if it works in this case. Thanks!

Matt Bogard, February 22, 2013 at 12:35 PM
@anonymous - your correction worked. No experienced R user should be surprised, I suppose. I guess it's always good practice to round!

Thanks again.

Anonymous, February 22, 2013 at 6:35 PM
Rounding has nothing to do with it, and this also has nothing in particular to do with R. It's simply a limitation of a computer's ability to represent floating point numbers. You can't exactly represent every decimal fraction in binary floating point, so values like 0.72 and 0.08 are stored as the nearest representable approximations. Can't be done.

Do this:

x <- 0.72 + 0.08
options(digits = 17)
x

See? Many times a "simple" looking float doesn't have an exact representation in binary. Also note that all.equal() returns TRUE here, as it should, because it compares within a small tolerance rather than exactly:

all.equal(grades$ec[8],0.8)
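A minimal sketch of the point above, using only base R. It compares the computed sum with the literal 0.8 in three ways; the variable names `x` and `y` are just illustrative and not from the original program.

```r
# The sum is computed in binary floating point, so it need not be
# the same stored value as the literal 0.8.
x <- 0.72 + 0.08
y <- 0.8

# Print both with 17 significant digits to expose the difference
# that the default 7-digit display hides.
print(sprintf("%.17g", x))
print(sprintf("%.17g", y))

# Exact comparison: this is the kind of test the nested ifelse() relies on.
print(x >= y)

# Tolerance-based comparison: all.equal() treats values that differ
# by less than a small tolerance (about 1.5e-8 by default) as equal.
print(isTRUE(all.equal(x, y)))
```

Wrapping all.equal() in isTRUE() matters when it is used in a condition, because all.equal() returns a character description rather than FALSE when the values differ.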

Anonymous, February 22, 2013 at 6:44 PM
One methodological change would be to represent all of the values as integers (e.g., if you never have fractions of a point, multiply them all by 100; if you use half points, multiply by 1000). Integer arithmetic is exact, so it doesn't have any floating point inconsistencies.
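A rough sketch of that idea in base R, assuming scores never carry fractions of a point. The three sample scores and the names `points1`, `points_ec`, `cutoffs`, and `letters_ec` are made up for illustration.

```r
# Scores stored as fractions, as in the post (a small sample only).
grades1 <- c(0.64, 0.72, 0.80)

# Convert to whole points once; round() absorbs any tiny binary error
# before as.integer() drops the fractional part.
points1 <- as.integer(round(grades1 * 100))

# Extra credit is now exact integer addition.
points_ec <- points1 + 8L

# Integer cutoffs: findInterval() counts how many cutoffs each score
# has reached (0 to 4), which indexes into the letter vector.
cutoffs <- c(60L, 70L, 80L, 90L)
letters_ec <- c("F", "D", "C", "B", "A")[findInterval(points_ec, cutoffs) + 1]

print(points_ec)
print(letters_ec)
```

Here the second score becomes exactly 80 points and is graded "B", since integer comparison against 80 has no representation error.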

Manolo, February 22, 2013 at 7:39 PM
Have you tried the recode function in the car package (car::recode)? It is a lifesaver most of the time.

Wingfeet, February 23, 2013 at 5:53 AM
I would program this entirely in integers to avoid the rounding problem. And have you heard of cut()?

cut(x=as.integer(round(grades$ec*100)),
breaks=c(0L,60L,70L,80L,90L,1000L),
labels=c('F',LETTERS[4:1]),right=FALSE)

Pat Burns, February 23, 2013 at 6:12 AM
Another approach to solve the problem would be to change the breakpoints. For example from 90 to 89.5.

This problem is a corner of Circle 1 of 'The R Inferno': http://www.burns-stat.com/documents/books/the-r-inferno/
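One way the shifted breakpoints might look on the 0 to 1 scale used in the post, sketched with base R's cut(). The half-point offsets (0.595, 0.695, and so on) and the names `ec`, `breaks`, and `letter_ec` are assumptions for illustration.

```r
# Extra-credit scores computed in floating point (a small sample only).
ec <- c(0.64, 0.72, 0.80) + 0.08

# Cutoffs moved half a point below the nominal 60/70/80/90 boundaries,
# so a score that lands a hair under 0.80 still falls in the "B" bin.
breaks <- c(-Inf, 0.595, 0.695, 0.795, 0.895, Inf)
letter_ec <- cut(ec, breaks = breaks,
                 labels = c("F", "D", "C", "B", "A"), right = FALSE)

print(as.character(letter_ec))
```

Note that this also changes the grading rule itself: a genuine 0.797 would now earn a "B", which is only harmless when scores come in whole points.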

Anonymous, February 24, 2013 at 9:51 PM
The author beat me to the R-inferno reference. Follow the link and read it - it's an excellent collection of things that make your (after)life a misery in R...
(Some of the examples are more generally applicable, too)

+ + + + +

At the risk of writing something you already know: you can index vectors (arrays etc) conditionally.

grades$letter1[grades$grades1 >= .80 & grades$grades1 < .90] <- "B"

Also, if your data set is sufficiently small (enough so that repeated assignments are not a problem), you only have to worry about one end of the cutoffs:

grades$letter1 <- "F"
grades$letter1[grades$grades1 >= 0.6] <- "D"
grades$letter1[grades$grades1 >= 0.7] <- "C"
grades$letter1[grades$grades1 >= 0.8] <- "B"
grades$letter1[grades$grades1 >= 0.9] <- "A"

I find the above easier to maintain than nested ifelse() - ease of reading, however, is a matter of personal opinion and taste...
(Note: this does not address the question posed in your post. It gets you to the same place by a slightly different route.)