Statistical Environments - R, Stata, Python / Pandas, and more (Topic)

From TaDa Wiki

Understanding the Big 3

Most people in social sciences use one of three environments for their statistical analysis: R, Stata, and Python. These are three very different environments, each with their own strengths and weaknesses, but here's a brief summary of the tradeoffs:

Stata

For detailed information on Stata, see Stata (Level 1)

Stata is one of the most popular tools among applied economists, and for good reason. It is an extremely powerful tool for data manipulation and econometric analysis. Its syntax is simpler than R's, it has lots of built-in statistical tools and good add-ons, it is very good for data cleaning, and it has a strong support community (the Statalist community can solve almost any problem you have).

Basically, Stata was designed to do econometric-type analyses (regression analysis, maximum likelihood, t-tests, non-parametric analysis, time series, etc.) very well, but not much else. That's why it has a simpler syntax than R -- they didn't build it to take on everything. But this does come at a bit of a cost. In particular:

  • It doesn't really work for anything BUT econometrics, so it may not be your best choice if your interest is in spatial analyses, network analysis, or text analysis.
  • The programming language is NOT object-oriented, which is the reason that it's a little easier to learn if you don't have a background in programming (e.g. if you don't know what "object-oriented" means), but also the reason that people with more programming experience and who love object-oriented languages don't like it.

But if you just want to do econometrics, and don't have a lot of programming background, Stata is hard to beat!


R

For detailed information on R, see R (Level 1)

R is by far the dominant environment for statistical analysis in most of the social sciences (with the possible exception of the economics field), as well as in much of the data science community more broadly.

R's biggest asset is the huge library of tools ("packages") that have been written by users and published online over the years to add new functionalities to R. There are packages for econometric analysis, geospatial analysis, network analysis, textual analysis, etc.

But R's flexibility also comes at a cost -- its scope can be overwhelming, and the language is not the most intuitive or accessible for people without a background in programming. Its syntax (the rules of the programming language) is also rather quirky, and has driven many a programmer crazy. Indeed, that's one of the big reasons that some people have been developing an alternative tool in Python that is meant to have a cleaner, more consistent syntax (see notes below).

The creator of R often says that his goal with R was to create a language that was easy for non-programmers to start using, but which would allow people who wanted more sophisticated functionality to dive deeper into the language. As a result, it's fundamentally a compromise language, designed to offer much of the flexibility of a full programming language while being more accessible to non-programmers. This compromise is both R's greatest strength and its greatest weakness -- Stata users often complain that the syntax is more complex than a simple language should be, while those used to full programming languages (like Python) don't like some of the compromises the language has made in favor of accessibility.

This compromise has also led to variability in the quality of packages available for R. Because anyone can easily write and publish a package for R, there is a huge world of good libraries to choose from, but some libraries are poorly written -- making them a little hard to figure out -- and some are just wrong. Make sure to check with someone more familiar with R for their impressions of a given library before using it in your analysis.

So, in conclusion: R is an attempt to strike a balance between power and accessibility.

Python

Python is a full, general-purpose programming language whose popularity among social scientists is on the rise. Many of Python's supporters are people who feel that the compromises R makes for accessibility are too costly, and that in some ways they make R's syntax more confusing.

Python programmers are obsessed with making their code intuitive and clear, and Python was built around this philosophy -- the worst thing you can say to a Python programmer is that their code is "not pythonic", meaning it isn't as clear as it could be.

However, because Python is a full general-purpose language, there are some concepts you have to learn before using it, like the idea of variables as pointers and the idea of different data structures (if that doesn't mean anything to you, don't worry -- you can learn more in Python (Level 1)). But Python users tend to argue that if you plan to do lots of computational work, these are concepts worth learning: they make the language much more powerful and flexible and, once you understand them, more intuitive.
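To make the "variables as pointers" idea concrete, here is a minimal sketch (not from the original article): in Python, two names can point to the same underlying object, so a change made through one name is visible through the other.

```python
# Two names pointing at the SAME object:
a = [1, 2, 3]    # a list, one of Python's basic data structures
b = a            # b points to the same list as a -- this is NOT a copy
b.append(4)      # modifying the list through b...
print(a)         # ...is visible through a as well: [1, 2, 3, 4]

# An explicit copy creates a new, independent object:
c = list(a)
c.append(5)
print(a)         # a is unchanged: [1, 2, 3, 4]
print(c)         # c has the extra element: [1, 2, 3, 4, 5]
```

Languages like Stata hide this distinction from you entirely, which is part of why there is less to learn up front.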

The main reason that Python has been getting increased attention is the creation of pandas, one of the newest tools for manipulating data in Python. Pandas is modeled on R, but was created by someone who was frustrated with R's syntax and memory use. Pandas is relatively young, but it is built on a well-established foundation (numpy and Python), and some people (this author included) really like its programming syntax. It is not as easy to use as, say, Stata, but it can be extremely flexible.

And finally, because Python is a general-purpose language, it's already used for lots of things inside and outside the realm of social science and statistics, and it integrates easily with tools like ArcGIS, other geospatial tools, network analysis tools, webscrapers, text analysis libraries, etc.

Summary of the Big 3

So in quick summary, your choice of languages depends on how you plan to use them. If you only ever plan to do econometrics with tabular data (data in spreadsheets or matrices) and don't want to invest in learning a full programming language, Stata is hard to beat.

If you want to do other things -- like geo-spatial analysis, network analysis, web-scraping, etc. -- then it may make sense to learn R or Python. As to which of these to learn, the answer is relatively personal. R is easier to get into, but as noted above this comes at the cost of some flexibility, so the choice depends in large part on how much you want to invest up front in learning Python. If you kinda like programming, plan to use these skills a lot in your career, or already have some familiarity with Python or another object-oriented programming language like Java, Python may sound appealing; if you're more interested in just getting going, or have far more friends working in R whom you can ask for help, then R may be your best choice!

Additional Considerations

When choosing a software environment / language to use for your statistical work, there are a number of important considerations to take into account. Here are a few:

What do the people around you use?

You will inevitably run into problems working with a new language, and your best resource will likely be the people around you. Before picking a language, ask around to see what your friends and colleagues use. For example, development economists mostly use Stata; political scientists mostly use R; serious econometricians in some departments use Matlab.

What do you need to do?

Different environments have strengths in different areas. For example, if you just want to do data manipulation and standard regressions, Stata really shines. If you also want to do network analysis, you may want to consider R or Python, which have network analysis packages.

Mix and match?

Many people use a mix of environments for their projects. For example, one might do their data cleaning and organization in Stata, then move into R to run a statistical model that isn't supported in Stata. This lets the user pick the program that is strongest for each task, but the strategy requires (a) learning multiple languages, and (b) keeping track of how you're moving data back and forth between programs, both of which introduce a lot of costs. With that in mind, I would recommend trying to limit this behavior.
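If you do move data between programs, the least error-prone route is usually a plain text format like CSV, which every environment on this page can read and write. Here's a hypothetical sketch in Python (the file and column names are made up for illustration):

```python
import csv

# Suppose Python did the cleaning; write the result out as CSV so
# another environment can pick it up. (Columns are illustrative.)
rows = [{"id": "1", "income": "50000"},
        {"id": "2", "income": "62000"}]
with open("cleaned.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "income"])
    writer.writeheader()
    writer.writerows(rows)

# The other environment then loads the file, e.g.:
#   Stata:  import delimited cleaned.csv
#   R:      myData <- read.csv("cleaned.csv")
# Reading it back here just to show the round trip is lossless:
with open("cleaned.csv", newline="") as f:
    back = list(csv.DictReader(f))
print(back[0]["income"])
```

The cost the text mentions is real: every hand-off like this is a place where column names, types, or missing-value codes can silently get mangled, which is why keeping careful track of these files matters.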

How much data do you have?

Most of the environments / languages discussed on this page are for manipulating data that fits comfortably into RAM. If you have a data set that's quite large (more than half the amount of RAM on your computer), please read the page on Big Data.
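As a taste of what "bigger than RAM" workflows look like, pandas can read a CSV in chunks so that only one chunk is in memory at a time. A small self-contained sketch (file and column names are made up for illustration):

```python
import pandas as pd

# Create a small CSV on disk so the example is self-contained.
pd.DataFrame({"income": range(10)}).to_csv("incomes.csv", index=False)

# Stream the file three rows at a time, keeping only a running total
# in memory -- the full data set is never loaded at once.
total = 0
for chunk in pd.read_csv("incomes.csv", chunksize=3):
    total += chunk["income"].sum()

print(total)  # sum of 0 through 9 = 45
```

This works for anything you can compute incrementally (sums, counts, filters); operations that need the whole data set at once (like sorting) are where the dedicated big-data tools come in.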

Other Languages

The Big 3 aren't the only languages out there -- here are a few others you may come across.

Matlab

For detailed information on Matlab, see Matlab (Level 1)

Note that Matlab is proprietary -- there's also an open-source version called Octave, and similar functionality is also available through the Python numpy module.

Matlab is an extremely stripped-down language for statistical analysis. It is very powerful, but not very user-friendly. Basically, it reduces everything to the manipulation of matrices. Econometricians often like Matlab because it lets you create your own estimators, but if you're doing applied work, it likely isn't for you.
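To see what "reducing everything to matrices" looks like in practice, here is a sketch of a custom OLS estimator written directly from the normal equations, beta = (X'X)^-1 X'y. It uses Python's numpy module (mentioned above as offering similar functionality to Matlab); the data are made up for illustration.

```python
import numpy as np

# X: a column of ones (the intercept) plus one regressor.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
# y lies exactly on the line y = 1 + 2*x, so we know the answer.
y = np.array([1.0, 3.0, 5.0])

# Solve the normal equations (X'X) beta = X'y directly --
# no regression command, just matrix algebra.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # approximately [1. 2.]
```

This is the appeal for econometricians: an estimator is just a few lines of linear algebra. It is also the drawback for applied work, where a one-line `regress` command in Stata does the same job with standard errors and diagnostics included.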


SAS

For detailed information on SAS, see SAS (Level 1)

SAS is a statistical tool that is frequently used in industry, as well as academia. It is especially well-suited to working with large data sets: many operations do not require the entire data set to be read into memory, so analysts are limited only by the size of their hard drive.

Although SAS has powerful tools for cleaning data and preparing it for analysis, users may find that other languages, such as Stata and R, make statistical analysis easier. The documentation for SAS is also harder to use than that of some other languages, such as Stata and Matlab.

Example Code from Each Language

You can get a feel for different environments pretty quickly by looking at example code, although as with any language, keep in mind that some things that seem bizarre or foreign at first can become intuitive with a little practice. Here are code snippets for each language that: open a dataset from a file; create a new variable using existing variables; recode a variable; run a basic OLS regression; and run a slightly more complicated OLS regression. Since presumably you are reading this because you don't yet know these languages, don't worry too much if not everything makes sense -- these snippets are just provided to give you a glimpse of what the languages look like and how readable they are. Comments are included to explain a little of each step.

Stata

   * Open a dataset from a file: 
   use myDataFile.dta
   
   * Create new variable called age_squared equal to each person's age squared: 
   generate age_squared = age*age
   
   * Set variable youngMan to 1 if male and under 25 (creating it as 0 first,
   * since replace only works on a variable that already exists): 
   generate youngMan = 0
   replace youngMan = 1 if gender == "male" & age < 25
       
   * Rename a variable from myVariable to myRenamedVariable: 
   rename myVariable myRenamedVariable
       
   * Run an OLS regression of income on age and height:
   regress income age height
    
   * Run an OLS regression of income on age and a categorical variable for level of education, for men
   * (i.education expands education into a set of dummy variables):
   regress income age i.education if gender == "male"

R

A few examples of R commands:

   # Open a dataset from the file myfile.RData (load() restores the objects 
   # saved in the file -- e.g. a data frame named myData -- into the workspace): 
   load("myfile.RData") 
   
   # Create new variable called age_squared equal to each person's age squared: 
   myData$age_squared <- myData$age * myData$age 
   
   # Set variable youngMan values to 1 if male and under 25: 
   myData$youngMan[myData$gender == "male" & myData$age < 25] <- 1 
   
   # Rename a variable from myVariable to myRenamedVariable: 
   names(myData)[names(myData) == "myVariable"] <- "myRenamedVariable" 
   
   # Run an OLS regression of income on age and height:
   olsResults <- lm(income ~ age + height, data = myData) 
   
   # Run an OLS regression of income on age and a categorical variable for level of education, for men:
   myData$educationDummies <- factor(myData$education)
   myData_menOnly <- subset(myData, gender == "male")
   olsResults <- lm(income ~ age + educationDummies, data = myData_menOnly)

Python

A few Python commands (all from the pandas library):

   # Open a dataset from a file (pickle is a file format): 
   import pandas
   myData = pandas.read_pickle('myFile.pkl')
   
   # Create new variable called age_squared equal to each person's age squared: 
   myData['age_squared'] = myData['age'] * myData['age']
   
   # Set variable youngMan values to 1 if male and under 25: 
   myData.loc[(myData['gender'] == "male") & (myData['age'] < 25) ,'youngMan'] = 1
   
   # Rename a variable from myVariable to myRenamedVariable: 
   myData = myData.rename(columns = {'myVariable':'myRenamedVariable'})
   
   # Run an OLS regression of income on age and height (using the 
   # statsmodels library, since pandas's old built-in ols was removed):
   import statsmodels.formula.api as smf
   olsResults = smf.ols('income ~ age + height', data=myData).fit()
   
   # Run an OLS regression of income on age and a categorical variable 
   # for level of education, for men (C() creates the dummies):
   import statsmodels.formula.api as smf
   myData_menOnly = myData.query('gender == "male"')
   olsResults = smf.ols('income ~ age + C(education)', data=myData_menOnly).fit()

Others

There are a number of other tools you may come across, like SPSS. As this author has limited experience with these tools (and has not found them to be used too frequently among colleagues), I leave this for later expansion.

Side-by-Side

Software Choices
                  Handles Dirty Data   Strong Community       Good for Econometrics   Good for Other Uses   Free?
R                 So-So                Yes                    Yes                     Yes                   Yes
Stata             Yes                  Yes                    Yes                     No                    No
Python / Pandas   Yes                  Small but passionate   Yes                     Yes                   Yes
Matlab            No                   No                     Yes                     No                    No (but Octave is free)
SAS               Yes                  No                     So-So                   No                    No