October 16, 2008

PSPP brings an industry-standard statistical tool to Linux

Author: Andrew Choens

Today's information systems give organizations and governments the ability to collect and access metaphorical mountains of information. But this information is useless unless we can find and understand the relationships and trends hidden in these mountains. For projects involving complex research protocols, high-end statistical analysis tools such as SPSS and SAS are useful, but they come with high price tags and proprietary licenses. PSPP is an open-source clone of SPSS, one of the most commonly used proprietary statistical packages.

Major distributions like Fedora and Ubuntu include PSPP in their package repositories, but the version they currently ship is outdated. Upcoming releases of Ubuntu, Fedora, and openSUSE will include 0.6.0. Until they are released, if you want to try PSPP, you can compile the current version from source, or look on PSPP's wiki to see if a volunteer has provided a binary for the distribution you're using.

SPSS: A proprietary standard

Before introducing you to PSPP, I want to introduce you to SPSS. SPSS was originally designed for researchers in the social sciences but is now used in many other fields, and by analysts working for federal and state agencies, large corporations, and academia. SPSS is a remarkable tool because it offers a robust programming language for the analysis of complex data and a user interface that gives less technical users unfettered access to the power of the underlying system.

SPSS's intuitive GUI makes it accessible to users who have little or no programming experience. Other statistical packages such as R (open source) and SAS (proprietary) are used almost exclusively by experienced programmers. SPSS provides users with an interface that is remarkably similar to a spreadsheet, with which analysts can design complex data transforms or build mind-bogglingly detailed cross-tabs.

Although the GUI is one of the package's killer features, SPSS also provides many opportunities for programmers to write scripts. SPSS Syntax is an easy-to-use functional programming language designed specifically for data analysis. As the expectations of programmers have evolved, SPSS has offered additional programmability through plugins and language enhancements.

Newer versions of SPSS work on Linux, thanks to the cross-platform magic of Java, but a fully enabled (non-student) license costs nearly $1,700, and annual maintenance costs an additional $425. Worse, the license is time-limited -- in 2011 my legally purchased license of SPSS 11 will expire. The cost and licensing provisions of SPSS create an opportunity for our community to develop an alternative.

PSPP: An open source alternative

As an open source alternative, PSPP 0.6 is an incomplete yet compelling product that should grab the attention of developers and end users alike. It gives Linux a general purpose data analysis tool with the accessibility advantages of its proprietary cousin. If you're in the market for an open source statistical package, there are two reasons PSPP should be on your short list: its new GUI, psppire, and its high degree of compatibility with SPSS syntax.

For many users, the GUI is probably the most important new feature in the 0.6 series. Earlier versions of PSPP were command-line only, which restricted the software's appeal largely to programmers. The new GUI mimics the familiar dialog boxes found in SPSS's interface, making the transition easier. As in SPSS, psppire's interface gives non-programmers full access to the power of the underlying system. The dialog boxes are clear and easy to use. For repetitive analysis, writing a script will always be easier, but psppire gives users access to the same tools available to the programmers.

I could find only one significant limitation in psppire. PSPP still lacks many statistical tools found in similar products. Naturally, the GUI is affected by this limitation, and users familiar with SPSS will notice that psppire's menus are somewhat empty.

PSPP's compatibility with SPSS syntax is as important as the GUI. SPSS syntax is a widely understood standard in many companies and government agencies. In my job I often help state governments calculate a set of complex outcome measures developed by the federal government as part of the Child and Family Services Review. The Feds publish the thousands of lines of syntax necessary to compute the measures, but only for SPSS users. The code could be ported to another tool, but this task is decidedly non-trivial. More importantly, porting the syntax could easily introduce subtle errors in the calculation of the measures. It is ironic that most states use a proprietary product to run code that is available for free on the Internet.

Usable today

Although the 0.6 series of PSPP is not a finished product, it can perform data transformations and basic statistical analysis. Users can also create tables of univariate statistics, or create complex cross-tabs of multiple variables. The ability to easily weight cases according to a variable works as expected. As PSPP continues to mature, it will help more professionals and students who are comfortable with SPSS convert to open source.
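
To make concrete what a weighted cross-tab computes, here is a minimal Python sketch. It is not PSPP syntax and does not reflect how PSPP is implemented; the records, the field names, and the weighted_crosstab helper are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def weighted_crosstab(rows, row_key, col_key, weight_key=None):
    """Sum case weights per (row value, column value) cell.

    Hypothetical helper: when no weight variable is named,
    every case counts as 1.0, giving a plain frequency table.
    """
    table = defaultdict(float)
    for case in rows:
        weight = case[weight_key] if weight_key else 1.0
        table[(case[row_key], case[col_key])] += weight
    return dict(table)


# Made-up survey records; "wt" is an illustrative case weight.
cases = [
    {"region": "north", "placed": "yes", "wt": 2.0},
    {"region": "north", "placed": "no", "wt": 1.0},
    {"region": "south", "placed": "yes", "wt": 0.5},
    {"region": "south", "placed": "yes", "wt": 2.5},
]

unweighted = weighted_crosstab(cases, "region", "placed")
weighted = weighted_crosstab(cases, "region", "placed", "wt")

for cell in sorted(weighted):
    print(cell, unweighted[cell], weighted[cell])
```

With weighting on, each cell holds the sum of the weights of its cases rather than a simple count, which is why the same two southern cases contribute 2.0 unweighted but 3.0 weighted.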

There are some limitations to PSPP. Many advanced statistical analysis methods, such as MANOVA, are not yet implemented. Tables and charts produced by PSPP are less customizable than the output generated by SPSS or R. Most importantly, version 0.6.0 incorrectly calculates linear regressions. On October 10, the project released version 0.6.1 to fix the regression error. Improving the tables and charts generated by PSPP is a high priority for future releases, and the developers are working hard to expand the suite of tools.
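
A regression error like the one in 0.6.0 is easy to miss, so it can be worth cross-checking a small result independently of any statistical package. The following is a minimal Python sketch of ordinary least squares with a single predictor, using the textbook closed-form formulas. The simple_ols function and the sample data are hypothetical, and the sketch says nothing about how PSPP computes regressions or what the 0.6.0 bug actually was.

```python
def simple_ols(xs, ys):
    """Return (slope, intercept) for the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of squared deviations of x, and of cross-products of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on the line y = 2x + 1.
slope, intercept = simple_ols([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

Running a tiny data set with a known answer through both a sketch like this and the package under test is a quick sanity check, not a substitute for the project's own test suite.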

The manual for PSPP is available at the GNU Web site, with detailed documentation for programmers for each implemented function. The manual also includes a full list of SPSS functions not yet implemented in PSPP. Unfortunately, the current manual is singularly focused on the implementation of the programming language. The PSPP community has not produced a similar manual for the psppire GUI.

PSPP has an active mailing list on which the developers participate. Discussions often focus on compiling PSPP, but other questions are welcome. The project also welcomes help; developers with a strong background in statistics are especially needed.
