Monday, March 12, 2012

I found the following in a Denver SAS Users Group presentation (link):

I use SAS in our advanced undergraduate statistics courses.

The advantages of using SAS over the common spreadsheet-based statistical packages, such as Minitab or SPSS, are:

1.  The students are exposed to the logic of data management and data manipulations.

2.  It involves creative programming, while other packages are mostly menu driven.

3.  The wide scope of statistical capabilities enables us to work with more elaborate settings.

4.  Knowledge of SAS gives an advantage in the job market.

Yet, it is not too difficult for students to learn it (at the basic level).


I would agree, and this applies just as easily to R as to SAS on all counts above. Most important, I think, is how spreadsheets and menu-driven packages overlook #1, which is just as important (I repeat, just as important) as the actual analysis. In the age of data science, not being equipped to manage and manipulate data (hacking skills) can leave you high and dry. Who cares if you can run a regression? The IT department probably isn't going to have time to get you the data exactly as you need it, and it might take several iterations of requests to get just what you need. And, without hacking skills, you may not even be able to recognize that the data you have is not in fact the data you think it is. And who says IT will always be able to hand you the data you want? Learning statistics with an attitude that just assumes the data will be there and moves on to theory and analysis is OK, as long as you follow it with another course that covers hacking/coding/data management. And guess what: that's not going to be done easily by pointing and clicking, without some scripting language. The mindset of "I'll master stats and the data will just come on its own" might make a great statistician, but a poor data scientist.
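To make the point about recognizing when the data is not what you think it is a bit more concrete, here is a minimal sketch of the kind of sanity check I have in mind. It is written in Python, but the same checks are just as natural in SAS or R. The extract, its column names (customer_id, signup_date, monthly_spend), and the audit function are all made up for illustration; they are assumptions, not anything from a real system.

```python
import csv
import io
from collections import Counter

# Hypothetical extract, standing in for a file handed over by IT.
# The columns and values are invented for this example.
RAW = """customer_id,signup_date,monthly_spend
101,2011-03-04,42.50
102,2011-05-17,18.00
102,2011-05-17,18.00
103,,-5.00
"""

def audit(text, key):
    """Return simple data-quality counts for a CSV extract."""
    rows = list(csv.DictReader(io.StringIO(text)))
    key_counts = Counter(row[key] for row in rows)
    return {
        # How many records did we actually receive?
        "rows": len(rows),
        # Keys that appear more than once (should be unique per customer).
        "duplicate_keys": sorted(k for k, n in key_counts.items() if n > 1),
        # Count of empty values in each column.
        "missing": {col: sum(1 for r in rows if r[col] == "") for col in rows[0]},
        # Spend should never be negative in this made-up extract.
        "negative_spend": sum(1 for r in rows if float(r["monthly_spend"]) < 0),
    }

report = audit(RAW, "customer_id")
print(report)
```

None of this is statistics, and none of it is hard, but a regression run on this extract without these checks would quietly count customer 102 twice and treat a negative spend as real.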

The importance of these skills is illustrated very clearly in a recent O'Reilly Radar piece, Building Data Science Teams:

"Most of the data was available online, but due to its size, the data was in special formats and spread out over many different systems. To make that data useful for my research, I created a system that took over every computer in the department from 1 AM to 8 AM. During that time, it acquired, cleaned, and processed that data. Once done, my final dataset could easily fit in a single computer's RAM. And that's the whole point. The heavy lifting was required before I could start my research. Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation isn't something that gets in the way of solving the problem: it is the problem."

 I think that last statement says it all.

And, besides valuable job skills, learning to code offers students a sense of empowerment. As quoted from a recent piece in Slate (HT Stephen Turner @ Getting Genetics Done):

“Learning to code demystifies tech in a way that empowers and enlightens. When you start coding you realize that every digital tool you have ever used involved lines of code just like the ones you're writing, and that if you want to make an existing app better, you can do just that with the same foreach and if-then statements every coder has ever used.”

This has been kind of a rant. I'm not sure what the solution is. I'm not sure how much of this can actually be taught in the classroom, and the time constraints can be binding. I'm not sure data management and statistical analysis both need to be part of the same course. I learned a lot of both on the job, and still have much to learn from a coding perspective. But making the effort to acquaint yourself with a language that is used industry-wide (like SAS or R, or even SPSS if the scripting language is introduced), as opposed to just any point-and-click interface with little data management capability, seems to me to be at least a start.
