Sunday, January 23, 2011

Merging Multiple Data Frames in R

Earlier I had a problem that required merging 3 years of trade data, with about 12 csv files per year. Merging all of these data sets with pairwise left joins using the R merge statement worked (especially after correcting some errors pointed out by Hadley Wickham However, in both my hobby hacking and on the job, I was curious if there might be a better way to do this than countless sets of merge statements (not to mention the multiple lines of code required for reading in the csv files)

So, I sent a tweet to the #rstats followers with a link to where I posted the problem on this blog to see if I could get a hint. (twitter has been a very valuable networking tool for me, I've learned a lot about data mining, machine learning, and R from the tweet-stream. Tweets and blog posts from people like Hadley Wickham, Drew Conway, and J.D. Long have been tremendously helpful to me as I've taken up R.

Back to the topic at hand, below is my new code, based on suggestions from Hadley Wickham and a comment (from Harlan) that lead me to some answers to a similar question on stack overflow. The code below requires he reshape library as well as plyr, which I should mention appears to have been created by Hadley Wickham himself.

# read all Y2000 csv files
 
filenames <- list.files(path="/Users/wkuuser/Desktop/R Data Sets/TRADE_DATA/TempData00", full.names=TRUE)
import.list <- llply(filenames, read.csv)
 
# left join all Y2000 csv files
 
AllData00 <- Reduce(function(x, y) merge(x, y, all=FALSE,by.x="Reporting.Countries.Partner.Countries",by.y="Reporting.Countries.Partner.Countries",all.x =TRUE, all.y =FALSE),import.list,accumulate=F)
 
dim(AllData00) # n = 211 211
 
# rename common key variable to something less awkward and change World to World00
 
AllData00 <- rename(AllData00, c(Reporting.Countries.Partner.Countries="Partner", World = "World00"))
names(AllData00) # list all variable names
 
# keep only the partner name variable and total world trade
 
AllData00 <-AllData00[c("Partner","World00")]
 
dim(AllData00) # data dimensions
names(AllData00) # variable names
fix(AllData00)  # view in data editor
Created by Pretty R at inside-R.org

That  pretty well gives me the data set I need, for year 2000 data. I repeated the process for 2004 and 2008 data sets I had and then merged them with left joins to get the final data set. All I am after at this point is the total world trade for each of the countries/groups listed. This could probably be made even more efficient, but is is a lot less coding than what I initially used for the project. (see below- and this doesn't even include some of the renaming and sub-setting functions I performed above) And, this process would have to be repeated 2 more times for 2004 and 2008 data. To say that the above code is much more efficient is an understatement. (note the code below actually contains some mistakes as Hadley Wickham pointed out. For instance, in the merge statement, I have by.a and by.b, or by.'dataset', while in every case it should be by.x and by.y. I guess x and y are alias's for the data sets being merged, sort of like a and b would be in SQL, if you were to say:

create table newdataset as
select *
from dat1 a left join dat2 b
on a.partner=b.partner

So, I'm not sure why my code even worked to begin with. I do realize that instead of the merge statement in R I could have used the sqldf package in R, but I have had issues with my mac crashing when I try to load the library. Still, I don't think SQL would have made things any better, as I would still be doing a series of left joins vs. the more compact code using the reduce function in R. I've used sqldf in a windows environment before and it worked great by the way.

The code below first reads in each data file individually, and then executes the endless number of left joins to give me the same data set I got above with a fraction of the amount of required code. 

#  a
 a <- read.csv("X_A.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
 names(a)
 
 a <- rename(a, c(Reporting.Countries.Partner.Countries="Partner"))
 names(a)
 dim(a)
 
 #  b
 b <- read.csv("X_B.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
 names(b)
 
 b <- rename(b, c(Reporting.Countries.Partner.Countries="Partner"))
 names(b)
 dim(b)  
 
 #  c
 c <- read.csv("X_C.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
 names(c)
 
 c <- rename(c, c(Reporting.Countries.Partner.Countries="Partner"))
 names(c)
 dim(c)   
 
 #  de
 de <- read.csv("X_DE.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
 names(de)
 
 de <- rename(de, c(Reporting.Countries.Partner.Countries="Partner"))
 names(de)
 dim(de) 
 
 #  fgh
 fgh <- read.csv("X_FGH.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
 names(fgh)
 
 fgh <- rename(fgh, c(Reporting.Countries.Partner.Countries="Partner"))
 names(fgh)
 dim(fgh) 
 
 #  ijk
 ijk <- read.csv("X_IJK.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
 names(c)
 
 ijk <- rename(ijk, c(Reporting.Countries.Partner.Countries="Partner"))
 names(ijk)
 dim(ijk) 
 
 #  lm
 lm <- read.csv("X_LM.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
 names(lm)
 
 lm <- rename(lm, c(Reporting.Countries.Partner.Countries="Partner"))
 names(lm)
 dim(lm)
 
 #  nop
 nop <- read.csv("X_NOP.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
 names(nop)
 
 nop <- rename(nop, c(Reporting.Countries.Partner.Countries="Partner"))
 names(nop)
 dim(nop) 
 
 #  qr
 qr <- read.csv("X_QR.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
 names(qr)
 
 qr <- rename(qr, c(Reporting.Countries.Partner.Countries="Partner"))
 names(qr)
 dim(qr)
 
 
 #  s  odd name changed to 'SaloTomaPrincip' manaully in excel
 s <- read.csv("X_S.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
 names(s)
 
 s <- rename(s, c(Reporting.Countries.Partner.Countries="Partner"))
 names(s)
 dim(s) 
 
 #  tuv
 tuv <- read.csv("EX_TUV.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
 names(tuv)
 
 tuv <- rename(tuv, c(Reporting.Countries.Partner.Countries="Partner"))
 names(tuv)
 dim(tuv)
 
 
 #  wxyz
 wxyz <- read.csv("X_WXYZ.csv", na.strings=c(".", "NA", "", "?"), encoding="UTF-8")
 names(wxyz)
 
 wxyz <- rename(wxyz, c(Reporting.Countries.Partner.Countries="Partner"))
 names(wxyz)
 dim(wxyz)  
 
 
# ------------------------------------------------------------------
#  sequentially left join data sets
# ------------------------------------------------------------------ 
 
 
 # a & b 
 
 ab <- merge(a,b, by.a = Partner, by.b =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 dim(ab)
 names(ab) 
 14 + 18 - 1 # r = 211 c = 31
 
 # abc
 abc <- merge(ab,c, by.ab = Partner, by.c =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 dim(abc)
 names(abc)  
 31 + 24 -1 # n = 54
 
 # a_e
 a_e <- merge(abc,de, by.abc = Partner, by.de =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 dim(a_e)
 names(a_e) 
 54 + 18 -1 # n = 71
 
 # a_h
 a_h <- merge(a_e,fgh, by.a_e = Partner, by.fgh =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 dim(a_h)
 names(a_h) 
 71 + 23 -1 # n = 93
 
 # a_k
 a_k <- merge(a_h,ijk, by.a_h = Partner, by.ijk =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 dim(a_k)
 names(a_k) 
 93 + 16 -1 # n = 108
 
 # a_m
 a_m <- merge(a_k,lm, by.a_k = Partner, by.lm =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 dim(a_m)
 names(a_m) 
 108 + 26 - 1  # n = 133
 
 # a_p
 a_p <- merge(a_m,nop, by.a_m = Partner, by.nop =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 dim(a_p)
 names(a_p) 
 133 + 20 - 1 # n = 152
 
 # a_r
 a_r <- merge(a_p,qr, by.a_p = Partner, by.qr =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 dim(a_r)
 names(a_r) 
 152 + 6 -1 # n = 157
 
 # a_s
 a_s <- merge(a_r,s, by.a_r = Partner, by.s =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 dim(a_s)
 names(a_s)  
 157 + 27 - 1 # n = 183
 
 # a_v
 a_v <- merge(a_s,tuv, by.a_s = Partner, by.tuv =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 dim(a_v)
 names(a_v) 
 183 + 21 - 1 # n = 203
 
 # a_z  (complete data set after this merge)
 a_z <- merge(a_v,wxyz, by.a_v = Partner, by.wxyz =Partner, all = FALSE, all.x = TRUE, all.y = FALSE)
 dim(a_z) # n = 211
 names(a_z)  
Created by Pretty R at inside-R.org

No comments:

Post a Comment