Monday, April 14, 2014

Perceptions of GMO Foods: A Hypothetical Application of SEM

Suppose we were interested in understanding consumer perceptions of GMO foods. People often make decisions and form opinions about issues based on abstract constructs like fears, goals, ambitions, values, and political ideology. These constructs may not be easy to quantify, despite being relevant to behavior and choice. When it comes to perceptions of GMO food, perhaps they are shaped by some degree of skepticism of ‘big ag’ or chemical companies. We could call this ‘monsantophobia.’ Perceptions of biotechnology could also be shaped by the extent of one’s knowledge of basic biology, agricultural science, and genetics. This second construct could be referred to as a ‘science’ factor. A third ‘factor’ that may shape one’s views could be one’s ideals related to the role of government and political ideology. We’ll call this the ‘political’ factor.

So how can we best quantify these ‘latent’ constructs or ‘factors’ that may be related to perceptions of biotechnology, and how do we model these interactions? This will require a combination of techniques involving factor analysis and regression, known as structural equation modeling. We might administer a survey, asking key questions that relate to one’s level of monsantophobia, science knowledge, and political views. To the extent that ‘monsantophobia’ exists and shapes views on biotechnology, it should flavor responses to questions related to fears, skepticism, and mistrust of ‘big ag.’ Actual knowledge of science should influence responses to questions related to science, and so on. We also may want to quantify the actual flavor of perceptions of GMO food. This could be some index quantifying levels of tolerance or preferences related to policies concerning labeling, testing, and regulation, or purchasing decisions and expenditures on related goods. To the extent that perceptions are ‘positive,’ the index would reflect that on some scale related to answers to survey questions about these issues. You could also include a set of questions related to policy preferences and try to model the interaction of the above factors and their impact on the support for some policy or the general policy environment.

Suppose we ask a range of questions related to skepticism of big ag and agrochemical companies and record the responses to each question as values of a set of variables (Xm1…Xmn), and do the same for science knowledge (Xs1…Xsn), political ideology (Xp1…Xpn), overall GMO perception (Yp1…Ypn), and policy environment (Ye1…Yen). Because the values of these variables will be influenced by the latent constructs we are trying to measure, we refer to the X’s and Y’s above as ‘indicators’ of the given factors for monsantophobia, science, politics, GMO perception, and policy environment. They may also be referred to as the observable manifest variables.

Now, this is not a perfect system of measurement. Given the level of subjectivity, among other things, there is likely to be a non-negligible amount of measurement error involved. How can we deal with measurement error and quantify the factors? Factor analysis (FA) attempts to separate common variance (associated with the factors) from unique variance in a data set. Theoretically, the unique variance in FA is correlated with the measurement error we are concerned about, while the factors remain ‘uncontaminated’ (Dunteman, 1989).
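One common way to write down this split between common and unique variance is the standard common factor model. The symbols below (a loading λ, a latent factor ξ, and a unique term δ) are notation assumed here for illustration; they do not appear in the post or in the survey design it describes.

```latex
% Indicator j of the monsantophobia factor \xi_m (notation assumed for illustration)
X_{mj} = \lambda_{mj}\,\xi_m + \delta_{mj}, \qquad j = 1,\dots,n

% If \xi_m and \delta_{mj} are uncorrelated, each indicator's variance splits
% into a common part (driven by the factor) and a unique part:
\operatorname{Var}(X_{mj}) = \lambda_{mj}^{2}\,\operatorname{Var}(\xi_m) + \operatorname{Var}(\delta_{mj})
```

Under this sketch, the first term on the right is the common variance shared through the factor, and the second is the unique variance that absorbs item-specific noise.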

Structural equation modeling (SEM) consists of two models: a measurement model, which derives the latent constructs or factors previously discussed, and a structural model, which relates the factors to one another, and possibly to some outcome. In this case, we are relating the factors for monsantophobia, science, and political preferences to the outcome, which here would be the latent construct or index related to GMO perceptions and policy environment. By using the measured ‘factors’ from FA, we can quantify the latent constructs of monsantophobia, science, politics, and GMO perceptions with less measurement error than if we simply included the numeric responses for the X’s and Y’s in a normal regression. SEM then lets us identify the relative influence of each of these factors on GMO perceptions and perhaps even their impact on the general policy environment for biotechnology. This is done in a way similar to regression, by estimating path coefficients for the paths connecting the latent constructs or factors as depicted below.
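To see why measurement error matters for the regression step, here is a minimal simulation sketch. It is not an SEM estimator: a simple average of several indicators stands in, crudely, for a factor score. All numbers are assumptions made up for illustration (a true path of 0.5, unit-variance factor and errors, four indicators). Under those assumptions, the expected slope on a single noisy indicator is 0.5 × 1/(1 + 1) = 0.25, while the slope on the four-item average is 0.5 × 1/(1 + 1/4) = 0.4, closer to the true path.

```python
import random


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=20000, true_path=0.5, n_indicators=4, seed=1):
    """Compare the slope on one noisy indicator with the slope on a composite."""
    rng = random.Random(seed)
    # Hypothetical latent factor (e.g. 'monsantophobia') and an outcome it drives.
    latent = [rng.gauss(0, 1) for _ in range(n)]
    outcome = [true_path * f + rng.gauss(0, 1) for f in latent]
    # Each indicator is the latent factor plus its own measurement error.
    indicators = [[f + rng.gauss(0, 1) for f in latent] for _ in range(n_indicators)]
    # Averaging the indicators cancels part of the measurement error.
    composite = [sum(values) / n_indicators for values in zip(*indicators)]
    return ols_slope(indicators[0], outcome), ols_slope(composite, outcome)


slope_single, slope_composite = simulate()
print(round(slope_single, 2), round(slope_composite, 2))
```

A full SEM would estimate the loadings and path coefficients jointly rather than averaging items, but the direction of the effect is the same: pooling information across indicators reduces the attenuation caused by measurement error.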



Principal Components Analysis. SAGE Series on Quantitative Applications in the Social Sciences. Dunteman. 1989.

Awareness and Attitudes towards Biotechnology Innovations among Farmers and Rural Population in the European Union

Paper prepared for presentation at the 131st EAAE Seminar ‘Innovation for Agricultural Competitiveness and Sustainability of Rural Areas’, Prague, Czech Republic, September 18-19, 2012

A Structural Equation Model of Farmers Operating within Nitrate Vulnerable Zones (NVZ) in Scotland
Toma, L., Barnes, A., Willock, J., Hall, C.
12th Congress of the European Association of Agricultural Economists – EAAE 2008

PLoS One. 2014; 9(1): e86174.
Published online Jan 29, 2014. doi:  10.1371/journal.pone.0086174
PMCID: PMC3906022
Determinants of Public Attitudes to Genetically Modified Salmon
Latifah Amin, Md. Abul Kalam Azad, Mohd Hanafy Gausmian, and Faizah Zulkifli

Sunday, April 13, 2014

Intuition for Fixed Effects

I've written about fixed effects before in the context of mixed models. But how are FE useful in the context of causal inference? What can we learn from panel data using FE that we can't get from a standard regression with cross sectional data? Let's view this through a sort of parable, based largely on a very good set of notes produced by J. Blumenstock, used in a management statistics course (link).

Suppose we have a restaurant chain and have gathered some cross sectional data on the pricing and consumption of large pizzas for some portion of the day for some period 1 across three cities, as pictured below:

Now, if we are trying to infer the relationship between price and quantity demanded using this data, we notice something odd. The theoretically implied negative relationship does not exist. In fact, if we plot the points, this seems more in line with a supply curve rather than a demand curve:
What's going on that could explain this? One explanation could be specific individual differences across cities related to taste and quality. Perhaps in Chicago, customers' tastes and preferences are for much more expensive and higher quality pizza, and they really like pizza a lot. They may be willing to pay more for more pizzas aligned with their specific tastes and preferences. Perhaps this is also true for San Francisco, but to a lesser extent, and in Atlanta maybe not so much.

What we have is unobserved heterogeneity related to these specific individual effects. How can we account for this? Suppose we instead collected the same data for two periods, essentially creating a panel of data for pizza consumption:
Now, if we look 'within' each city, the data reveals the theoretically implied relationship between price and demand. Take San Francisco for example:
This is essentially what fixed effects estimators using panel data can do. They allow us to exploit the 'within' variation to 'identify' causal relationships. Essentially using a dummy variable in a regression for each city (or group, or type to generalize beyond this example) holds constant or 'fixes' the effects across cities that we can't directly measure or observe. Controlling for these differences removes the 'cross-sectional' variation related to unobserved heterogeneity (like tastes, preferences, other unobserved individual specific effects). The remaining variation, or 'within' variation can then be used to 'identify' the causal relationships we are interested in.
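The parable can be sketched in a few lines of code. The prices and quantities below are hypothetical numbers invented for illustration (the original tables are not reproduced here); they are chosen so that cities with stronger tastes for pizza have both higher prices and higher sales. Subtracting each city's own mean from its observations (the 'within' transformation) gives the same price slope as including a dummy variable for each city.

```python
from collections import defaultdict

# Hypothetical (city, price, quantity) observations over two periods.
panel = [
    ("Chicago", 15.0, 50.0), ("Chicago", 17.0, 46.0),
    ("San Francisco", 10.0, 30.0), ("San Francisco", 12.0, 26.0),
    ("Atlanta", 5.0, 10.0), ("Atlanta", 7.0, 6.0),
]


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def demean_by_group(groups, values):
    """Subtract each group's mean from its own observations."""
    totals, counts = defaultdict(float), defaultdict(int)
    for g, v in zip(groups, values):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for g, v in zip(groups, values)]


cities = [city for city, _, _ in panel]
prices = [price for _, price, _ in panel]
quantities = [qty for _, _, qty in panel]

# Pooled regression mixes cross-city differences in tastes with the price effect.
pooled_slope = ols_slope(prices, quantities)
# Within (fixed effects) regression uses only variation inside each city.
within_slope = ols_slope(demean_by_group(cities, prices),
                         demean_by_group(cities, quantities))
print(pooled_slope, within_slope)
```

With these made-up numbers the pooled slope is positive (the 'supply curve' look), while the within slope is -2, the downward-sloping relationship that holds inside every city.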

See also: Difference-in-Difference models. These are a special case of fixed effects also used in causal inference.

Fixed Effects Models (Very Important Stuff)

Friday, April 11, 2014

Structural Equation Models, Applied Economics, and Biotechnology

Toma gives some nice descriptions of SEM methodology and application:

Awareness and Attitudes towards Biotechnology Innovations among Farmers and Rural Population in the European Union

Paper prepared for presentation at the 131st EAAE Seminar ‘Innovation for Agricultural Competitiveness and Sustainability of Rural Areas’, Prague, Czech Republic, September 18-19, 2012

SEM may consist of two components, namely the measurement model (which states the relationships between the latent variables and their constituent indicators), and the structural model (which designates the causal relationships between the latent variables). The measurement model resembles factor analysis, where latent variables represent ‘shared’ variance, or the degree to which indicators ‘move’ together. The structural model is similar to a system of simultaneous regressions, with the difference that in SEM some variables can be dependent in some equations and independent in others.

A Structural Equation Model of Farmers Operating within Nitrate Vulnerable Zones (NVZ) in Scotland
Toma, L., Barnes, A., Willock, J., Hall, C.

12th Congress of the European Association of Agricultural Economists – EAAE 2008

To identify the factors determining farmers’ nitrate reducing behaviour, we follow the attitude-behaviour framework as used in most literature on agri- environmental issues. To statistically test the relationships within this framework, we use structural equation modelling (SEM) with latent (unobserved) variables. We first identify the latent variables structuring the model and their constituent indicators. Then, we validate the construction of the latent variables by means of factor analysis and finally, we build and test the structural equation model by assigning the relevant relationships between the different latent variables.

See also:

PLoS One. 2014; 9(1): e86174.
Published online Jan 29, 2014. doi:  10.1371/journal.pone.0086174
PMCID: PMC3906022
Determinants of Public Attitudes to Genetically Modified Salmon
Latifah Amin, Md. Abul Kalam Azad, Mohd Hanafy Gausmian, and Faizah Zulkifli

Wednesday, April 9, 2014

Quantile Regression and Healthcare Costs

I thought this was a nice statement that speaks to the utility of quantile regression (which holds for any distribution with these issues, not just cost data):

The quantile regression framework allows us to obtain a more complete picture of the effects of the covariates on the health care cost, and is naturally adapted to the skewness and heterogeneity of the cost data.


Health care cost data are characterized by a high level of skewness and heteroscedastic variances…Most of the existing literature on health care cost data analysis have been focused on modeling the conditional mean (or average) of the health care cost given the covariates such as age, gender, race, marital status and disease status. The conditional mean framework has two important limitations. First, the application of the conditional mean regression model to health care cost data analysis is usually not straightforward. Due to the presence of skewness and nonconstant variances, transformation of the response variable is often required when constructing the mean regression model and retransformation is needed in order to obtain direct inference on the mean cost. Second, the conditional mean model focuses primarily on the marginal effects of the risk factors on the central tendency of the conditional distribution. When the marginal effects vary across the conditional distribution, focusing on the marginal effects at the central tendency may substantially distort the information of interest at the tails. For example, a weak relationship between a risk factor and the mean health care cost does not preclude a stronger relationship at the upper or lower quantiles of the conditional distribution….By considering different quantiles, we are able to obtain a more complete picture of the effects of the covariates on health care cost.


Weighted Quantile Regression for Analyzing Health Care Cost Data with Missing Covariates. Ben Sherwood, Lan Wang and Xiao-Hua Zhou Statistics in Medicine. 2012
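A minimal sketch of the idea, using simulated data rather than real cost data: quantile regression chooses coefficients that minimize the check (pinball) loss. With a single binary risk factor, that problem separates by group, so the coefficient on the risk factor at quantile tau is just the difference between the two groups' tau-th quantiles. The lognormal parameters below are assumptions made up for illustration, chosen so the two groups share a median but differ in the upper tail.

```python
import random


def pinball_loss(q, tau, ys):
    """Check (pinball) loss that quantile regression minimizes."""
    # (y < q) is 1 for negative residuals, so they get weight (tau - 1).
    return sum((tau - (y < q)) * (y - q) for y in ys)


def fit_quantile(ys, tau):
    """Minimize the pinball loss over observed values (a minimizer lies on a data point)."""
    return min(ys, key=lambda q: pinball_loss(q, tau, ys))


def simulate_costs(n, sigma, rng):
    """Hypothetical skewed costs: lognormal with a common location parameter."""
    return [rng.lognormvariate(7.0, sigma) for _ in range(n)]


rng = random.Random(0)
low_risk = simulate_costs(1000, 1.0, rng)    # hypothetical group without the risk factor
high_risk = simulate_costs(1000, 1.5, rng)   # same median, heavier upper tail

# Effect of the risk factor at the median and at the 90th percentile.
effects = {}
for tau in (0.5, 0.9):
    effects[tau] = fit_quantile(high_risk, tau) - fit_quantile(low_risk, tau)
print(effects)
```

In this setup the estimated effect at the median should be small relative to the effect at the 90th percentile, which is the pattern the quoted passage describes: a weak relationship at the center of the distribution does not rule out a strong one in the tail.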

"heavy upper tails may influence the "robustness" with which some parameters are estimated. Indeed, in worlds described by heavy-tailed Pareto or Burr-Singh-Maddala distributions (Mandelbrot, 1963; Singh and Maddala, 1976) some traditionally interesting parameters (means, variances) may not even be finite, a situation never encountered in, e.g., a normal or log-normal world. Such concerns should translate into empirical strategies that target the high-end parameters of particular interest, e.g. models for Prob(y ≥ k | x) or quantile regression models.."
John Mullahy
Univ. of Wisconsin-Madison
January 2009
See also: Quantile Regression with Count Data

Tuesday, April 8, 2014

Ambitious vs. Ambiguous Modeling

Some people believe that a conclusion reached on the basis of statistically sound principles is a gold standard. But we seldom prove anything in applied empirical analysis. This is disappointing to those who desire definitive answers. Rather than seeking proof, or absolute truth, the best we can do is inform:

"Social scientists and policymakers alike seem driven to draw sharp conclusions, even when these can be generated only by imposing much stronger assumptions than can be defended. We need to develop a greater tolerance for ambiguity. We must face up to the fact that we cannot answer all of the questions that we ask." (Manski, 1995) 

Manski, C.F. 1995. Identification Problems in the Social Sciences. Cambridge: Harvard University Press. 

Sunday, March 30, 2014

Ambitious Modeling?

"As Philip Dawid once said "a causal model is just an ambitious associational model". A carefully-considered regression model, with an appropriate set of potential confounders (possibly identified using a causal diagram – see below) measured and included as covariates, is the most appropriate causal model in many simple settings."

To paraphrase Angrist and Pischke:

To the extent that the population CEF that it is estimating is causal, so is linear regression. (And that includes LPMs)
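A small sketch of the purely mechanical part of this claim, using simulated data with made-up probabilities: when the only regressor is a binary indicator, the conditional expectation function (CEF) takes just two values, and a least-squares fit of a binary outcome on that indicator (a linear probability model) reproduces them exactly. Whether that difference is causal depends on the design, not on the regression; the code says nothing about that.

```python
import random


def ols(x, y):
    """Intercept and slope of a least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(42)
# Hypothetical binary regressor and binary outcome (probabilities are assumptions).
treated = [int(rng.random() < 0.5) for _ in range(5000)]
outcome = [int(rng.random() < (0.5 if d else 0.3)) for d in treated]

intercept, slope = ols(treated, outcome)

# Sample CEF: the mean outcome in each group.
group0 = [y for d, y in zip(treated, outcome) if d == 0]
group1 = [y for d, y in zip(treated, outcome) if d == 1]
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

print(intercept, slope, mean0, mean1 - mean0)
```

The intercept matches the mean outcome when the indicator is 0, and the slope matches the difference in group means, so in this saturated case the linear probability model and the sample CEF coincide.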

Tuesday, March 25, 2014

Institutional Research Presentations at SAS Global Forum

I'm not attending #SASGF14, but some of my colleagues in higher ed are. Here is what they are doing. If you are not attending global forum, or can't make their talks, I encourage you to check out their papers via the online proceedings once they are posted.

Tuesday March 25

Paper 1448 - From Providing Support to Driving Decisions: Improving the Value of Institutional Research
For almost two decades, Western Kentucky University's Office of Institutional Research (WKU-IR) has used SAS® to help shape the future of the institution by providing faculty and administrators with information they can use to make a difference in the lives of their students. This presentation provides specific examples of how WKU-IR has shaped the policies and practices of our institution and discusses how WKU-IR moved from a support unit to a key strategic partner. In addition, the presentation covers the following topics: How the WKU Office of Institutional Research developed over time; Why WKU abandoned reactive reporting for a more accurate, convenient system using SAS® Enterprise Intelligence Suite for Education; How WKU shifted from investigating what happened to predicting outcomes using SAS® Enterprise Miner™ and SAS® Text Miner; How the office keeps the system relevant and utilized by key decision makers; What the office has accomplished and key plans for the future.

Paper 1638 - Institutional Research: Serving University Deans and Department Heads
Administrators at Western Kentucky University rely on the Institutional Research department to perform detailed statistical analyses to deepen the understanding of issues associated with enrollment management, student and faculty performance, and overall program operations. This paper presents several instances of analyses performed for the university to help it identify and recruit suitable candidates, uncover root causes in grade and enrollment trends, evaluate faculty effectiveness, and assess the impact of student characteristics, programs, or student activities on retention and graduation rates. The paper briefly discusses the data infrastructure created and used by Institutional Research. For each analysis performed, it reviews the SAS® program and key components of the SAS code involved. The studies presented include the use of SAS® Enterprise Miner™ to create a retention model incorporating dozens of student background variables. It shows an examination of grade trends in the same courses taught by different faculty and subsequent student behavior and success, providing insights into the nuances and subtleties of evaluating faculty performance. Another analysis uncovers the possible influence of fraternities and sororities in freshmen algebra courses. Two investigations explore the impact of programs on student retention and graduation rates. Each example and its findings illustrate how Institutional Research can support the administration of university operations. The target audience is any SAS professional interested in learning more about Institutional Research in higher education and how SAS software is used by an Institutional Research department to serve its organization.

Monday March 24

Paper 1689 - Simple ODS Tips to Get RWI (Really Wonderful Information)
SAS® continues to expand and improve its reporting capability. With new SAS® 9.4 enhancements in ODS (Output Delivery System), the opportunity to create stunning reports has expanded even further. If you are charged with creating relevant, informative, easy-to-read reports for clients or administrators, then the ODS Report Writing Interface, ODS LAYOUT enhancements, and the new ODSTEXT procedure are important tools to use. These tools allow you to create reports in a smart, eye-catching format that can be turned around quite quickly and programmed to provide optimum flexibility. How many times have you worked hours to tweak and fine-tune a report directly in Microsoft Excel, Microsoft Word, Microsoft PowerPoint or some other similar software only to be asked for a “quick update”, which would then take hours to recreate because you are manually transferring data? Do you ever dread receiving the compliment, “This is really wonderful information!!!!” because you know it will be followed by “Can you run this for EVERY region?” Well, dread no more, because when you harness the power of SAS® ODS, you can create first-rate, flexible, fabulous reports! Join me as I share with you two real-world examples of ODS capabilities using (1) a marketing piece I designed to help the president of our university spotlight county- and region-specific data as he recruited across the state and (2) our academic program review form, a multi-page report that outputs to Word so that program coordinators can add personalized commentary to support their program’s effectiveness.