Sunday, November 27, 2011

Basic Econometrics in R and SAS

Regression Basics

$y = b_0 + b_1 X$ (the regression line we want to fit)

The method of least squares chooses $b_0$ and $b_1$ to minimize the sum of squared vertical distances between the fitted line and the individual data observations $y_i$.



That is, minimize $\sum e_i^2 = \sum (y_i - b_0 - b_1 X_i)^2$ with respect to $b_0$ and $b_1$.
This can be accomplished by taking the partial derivative of $\sum e_i^2$ with respect to each coefficient and setting each one equal to zero:
$\partial \sum e_i^2 / \partial b_0 = 2 \sum (y_i - b_0 - b_1 X_i)(-1) = 0$
$\partial \sum e_i^2 / \partial b_1 = 2 \sum (y_i - b_0 - b_1 X_i)(-X_i) = 0$
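Dividing each condition by -2 and separating the sums gives the two "normal equations". This is only a rearrangement of the two conditions above, written out to make the next step easier to follow:

```latex
\begin{aligned}
\sum y_i     &= n b_0 + b_1 \sum X_i \\
\sum X_i y_i &= b_0 \sum X_i + b_1 \sum X_i^2
\end{aligned}
```

Dividing the first equation by $n$ gives $\bar{y} = b_0 + b_1 \bar{X}$, which is where the intercept formula comes from. Substituting that expression for $b_0$ into the second equation leaves a single equation in $b_1$.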
Solving for b0 and  b1 yields the ‘formulas’ for hand calculating the estimates:
$b_0 = \bar{y} - b_1 \bar{X}$
$b_1 = \sum (X_i - \bar{X})(y_i - \bar{y}) / \sum (X_i - \bar{X})^2 = [\sum X_i y_i - n \bar{X}\bar{y}] / [\sum X_i^2 - n \bar{X}^2]$
 $= S(X,y) / SS(X)$
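The deviation form and the "sums" form of the slope formula are the same quantity. A short expansion, using only $\sum X_i = n\bar{X}$ and $\sum y_i = n\bar{y}$, shows why:

```latex
\begin{aligned}
\sum (X_i - \bar{X})(y_i - \bar{y})
  &= \sum X_i y_i - \bar{y}\sum X_i - \bar{X}\sum y_i + n\bar{X}\bar{y}
   = \sum X_i y_i - n\bar{X}\bar{y} \\
\sum (X_i - \bar{X})^2
  &= \sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2
   = \sum X_i^2 - n\bar{X}^2
\end{aligned}
```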
 
Example with Real Data: 

Given real data, we can use the formulas above to derive (by hand, calculator, or Excel) the estimated values for $b_0$ and $b_1$, which give us the line of best fit, minimizing $\sum e_i^2 = \sum (y_i - b_0 - b_1 X_i)^2$.

$n = 5$
$\sum X_i y_i = 146$
$\sum X_i^2 = 55$
$\bar{X} = 3$
$\bar{y} = 8$

$b_1 = [\sum X_i y_i - n\bar{X}\bar{y}] / [\sum X_i^2 - n\bar{X}^2] = (146 - 5 \cdot 3 \cdot 8)/(55 - 5 \cdot 3^2) = 26/10 = 2.6$
$b_0 = \bar{y} - b_1 \bar{X} = 8 - 2.6 \cdot 3 = 0.2$
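The hand calculation can also be reproduced step by step in R. The sketch below is only an illustration: it uses the five (x, y) pairs from the SAS data step in this post, and the variable names (`sum_xy`, `sum_x2`, `b1_dev`) are made up for this example.

```r
# Data: the five (x, y) pairs used throughout this post
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 5, 11, 14)

# Summary quantities that appear in the hand formulas
n      <- length(x)
sum_xy <- sum(x * y)   # sum of X_i * y_i
sum_x2 <- sum(x^2)     # sum of X_i^2
xbar   <- mean(x)
ybar   <- mean(y)

# Slope and intercept from the "sums" form of the formulas
b1 <- (sum_xy - n * xbar * ybar) / (sum_x2 - n * xbar^2)
b0 <- ybar - b1 * xbar

# Slope again from the deviation form, S(X, y) / SS(X)
b1_dev <- sum((x - xbar) * (y - ybar)) / sum((x - xbar)^2)

print(c(b0 = b0, b1 = b1, b1_dev = b1_dev))
```

Both versions of the slope should agree with each other and with the 2.6 obtained by hand.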

You can verify these results in PROC REG in SAS.

/* GENERATE DATA */

DATA REGDAT;
INPUT X Y;
CARDS;
1 3
2 7
3 5
4 11
5 14
;
RUN;

/* BASIC REGRESSION WITH PROC REG */

PROC REG DATA = REGDAT;
MODEL Y = X;
RUN;
QUIT;

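OUTPUT: [output image not reproduced; the PROC REG parameter estimates agree with the hand calculation: Intercept = 0.2, X = 2.6]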



Similarly this can be done in R using the 'lm' function:

#------------------------------------------------------------
#  regression with canned lm routine
#------------------------------------------------------------
 
# read in data manually
 
x <- c(1,2,3,4,5) # read in x -values
 
y <- c(3,7,5,11,14) # read in y-values
 
data1 <- data.frame(x,y) # create data set combining x and y values
 
# analysis
 
plot(data1$x, data1$y) # plot data 
reg1 <- lm(y ~ x, data = data1) # compute regression estimates
summary(reg1)              # print regression output
abline(reg1)               # plot fitted regression line
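As a further check that the fitted line really is the one that minimizes the sum of squared errors, the objective can be minimized numerically and compared with `lm`. This is a rough sketch rather than how `lm` works internally; the function name `sse` and the choice of the BFGS method in `optim` are assumptions made for this illustration.

```r
# Same five observations as above
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 5, 11, 14)

# Sum of squared errors for a candidate line with coefficients b = c(b0, b1)
sse <- function(b) sum((y - b[1] - b[2] * x)^2)

# Closed-form least squares fit
fit_lm <- lm(y ~ x)

# Numerical minimization of the same objective, starting from (0, 0)
fit_num <- optim(par = c(0, 0), fn = sse, method = "BFGS")

print(coef(fit_lm))   # least squares estimates from lm
print(fit_num$par)    # numerical minimizer, close to the lm estimates

# Moving either coefficient away from the least squares value raises the SSE
print(sse(coef(fit_lm)))
print(sse(coef(fit_lm) + c(0.5, 0)))
print(sse(coef(fit_lm) + c(0, 0.5)))
```

The numerical answer matches the closed-form one only up to the optimizer's tolerance, which is one reason the exact formulas are preferred for ordinary least squares.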

 
Regression Matrices

Alternatively, this problem can be represented in matrix form.
We can then formulate the least squares equation as:
 $y = Xb + e$

where the 'errors', or deviations from the fitted line, are collected in the vector:
$e = y - Xb$

The matrix equivalent of $\sum e_i^2$ becomes $e'e = (y - Xb)'(y - Xb)$.

Expanding, and using the fact that $y'Xb$ is a scalar and therefore equals its transpose $b'X'y$:
$e'e = y'y - 2b'X'y + b'X'Xb$

Taking partials, setting them equal to zero, and solving for $b$ gives:

$\partial e'e / \partial b = -2X'y + 2X'Xb = 0$

$2X'Xb = 2X'y$

$X'Xb = X'y$

$b = (X'X)^{-1} X'y$, which is the matrix equivalent of what we had before. The second element of $b$ is the slope:
$b_1 = [\sum X_i y_i - n\bar{X}\bar{y}] / [\sum X_i^2 - n\bar{X}^2] = S(X,y) / SS(X)$
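For the five observations used in this post, the matrix formula can be worked by hand as well. The arithmetic below uses $n = 5$, $\sum X_i = 15$, $\sum y_i = 40$, $\sum X_i^2 = 55$ and $\sum X_i y_i = 146$, with $X$ taken to be a column of ones followed by the column of X values:

```latex
X'X = \begin{pmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{pmatrix}
    = \begin{pmatrix} 5 & 15 \\ 15 & 55 \end{pmatrix},
\qquad
X'y = \begin{pmatrix} \sum y_i \\ \sum X_i y_i \end{pmatrix}
    = \begin{pmatrix} 40 \\ 146 \end{pmatrix}

(X'X)^{-1} = \frac{1}{50}\begin{pmatrix} 55 & -15 \\ -15 & 5 \end{pmatrix},
\qquad
b = (X'X)^{-1} X'y
  = \frac{1}{50}\begin{pmatrix} 10 \\ 130 \end{pmatrix}
  = \begin{pmatrix} 0.2 \\ 2.6 \end{pmatrix}
```

The first element is the intercept and the second is the slope, the same 0.2 and 2.6 obtained from the scalar formulas.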
 These computations can be carried out in SAS via PROC IML commands:

/* MATRIX REGRESSION */

PROC IML;

/* INPUT DATA AS VECTORS */
yt = {3 7 5 11 14} ; /* TRANSPOSED Y VECTOR */
x0t = j(1,5,1); /* ROW VECTOR OF 1'S */
x1t = {1 2 3 4 5}; /* X VALUES */
xt =x0t//x1t; /* COMBINE VECTORS INTO TRANSPOSED X-MATRIX */

PRINT yt x0t x1t;

/* FORMULATE REGRESSION MATRICES */

y= yt`;     /* VECTOR OF DEPENDENT VARIABLES */
x =xt`; /* FULL X OR DESIGN MATRIX */
beta = inv(x`*x)*x`*y;  /* OLS ESTIMATES FROM THE NORMAL EQUATIONS */
TITLE 'REGRESSION MATRICES VIA PROC IML'; /* SET TITLE BEFORE PRINTING */
PRINT beta;
QUIT;

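OUTPUT: [output image not reproduced; beta contains the same estimates as before, 0.2 (intercept) and 2.6 (slope)]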
The same results can be obtained in R as follows:
#------------------------------------------------------------
#   matrix programming based approach
#------------------------------------------------------------
 
# regression matrices require a column of 1's in order to calculate 
# the intercept or constant, create this column of 1's as x0
 
x0 <- c(1,1,1,1,1) # column of 1's
x1 <- c(1,2,3,4,5) # original x-values
 
# create the x- matrix of explanatory variables
 
x <- as.matrix(cbind(x0,x1))
 
# create the y-matrix of dependent variables
 
y <- as.matrix(c(3,7,5,11,14))
 
# estimate  b = (X'X)^-1 X'y
 
b <- solve(t(x)%*%x)%*%t(x)%*%y
 
print(b) # this gives the intercept and slope - matching exactly 
         # the results above
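The explicit inverse mirrors the algebra, but in practice it is usually safer numerically to solve the normal equations directly or to use a QR decomposition, which is what `lm` relies on. The sketch below compares these routes on the same data and also checks the first-order condition X'e = 0. The names `b_solve`, `b_qr` and `b_lm` are made up for this illustration.

```r
# Design matrix with a column of 1's for the intercept, and the response
x <- cbind(x0 = 1, x1 = c(1, 2, 3, 4, 5))
y <- c(3, 7, 5, 11, 14)

# Solve X'X b = X'y without forming the inverse explicitly
b_solve <- solve(crossprod(x), crossprod(x, y))

# Least squares via a QR decomposition of X
b_qr <- qr.solve(x, y)

# Reference estimates from lm()
b_lm <- coef(lm(y ~ x[, "x1"]))

print(as.vector(b_solve))
print(b_qr)
print(unname(b_lm))

# First-order condition: residuals are orthogonal to every column of X
e <- y - x %*% b_solve
print(crossprod(x, e))   # both entries are zero up to rounding error
```

All three sets of estimates should agree up to floating-point rounding.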

3 comments:

  1. This is a good cross-validation, but can you explain why one would choose one method over another? Specifically, why use partial derivatives, which can be exceedingly complex, instead of SAS proc reg, which is exceedingly simple?

    Please respond to intelaqua@ymail.com. Thank you very much.

  2. Thank you for your comments. What you point out brings up a couple of major oversights on my part in writing this post, which I should have addressed in some sort of introduction or conclusion. The partial derivative discussion was simply background on how and why regression works, and every student should work through the math at least for the single-variable case. The matrix algebra was just a generalization of the calculus results to the multivariable case. As you pointed out, I used proc reg to cross-validate the results, that is, to show that when you run proc reg you get the same results as you would by working through the math using calculus or matrix algebra. (Of course, proc reg does not literally work through the calculus. As far as I know it solves the normal equations X'Xb = X'y numerically rather than running an iterative optimizer, although the same estimates can also be reached iteratively with an algorithm such as gradient descent; see http://www.econometricsense.blogspot.com/2011/11/regression-via-gradient-descent-in-r.html .)
    I agree 100% with you that if you are interested in fitting a model, it is far easier to run proc reg, glm, glimmix etc. than to delve into matrix programming, as I agree it can get exceedingly complex. Most of the time, I'm utilizing some such proc in my work.

    However, this blog is not only about how to use software to solve our problems; it is also about the theory and mathematics supporting those results. So from time to time, I'll work through some of the mathematical details along with a few lines of code to provide a little more intuition about a particular topic. Having a firm grasp of matrix algebra as it relates to these details is important if we want to understand topics beyond basic OLS, such as robust standard errors ( http://www.econometricsense.blogspot.com/2011/10/deriving-heteroskedasticity-consistent.html ), mixed models ( http://econometricsense.blogspot.com/2011/01/mixed-fixed-and-random-effects-models.html ), or spatial regression ( http://econometricsense.blogspot.com/2010/09/spatial-econometrics.html ). A command of matrix programming becomes even more important if we want to derive our own estimators when a canned routine just doesn't give us what we want.

  3. Hello Mr. Bogard;

    I am a 30-year-old with an economics degree who now works at a bank. I personally believe that, without a solid mathematical and statistical background, one's economic knowledge is always useless. For that reason, I would like to learn more about econometrics so that I can build, test, and discuss new approaches, and perhaps new theories, with people in an academic way.

    I am studying econometrics in my free time and, if you don't mind, I would like to ask you some questions about econometrics (problems, for example) from time to time. Would that be okay?
