October 8, 2010

Weekend Project: Piwik Supercharges Site Analytics with Open Source


The open source Web analytics program Piwik turned 1.0 in August, continuing its well-earned reputation as an easy-to-install, easy-to-use traffic monitoring solution. But if all you do is stick with the default options, you are not getting the full value that Piwik has to offer. This weekend, look into how you can extend Piwik's functionality to benefit your site.

For those who are new to Piwik, it is an open source framework for tracking site visits and automatically generating statistical reports, much in the same vein as proprietary offerings like Google Analytics or WebTrends. The basic methodology is common to most other analytics systems: it inserts a few lines of JavaScript into each page, logs the results in a database, and generates aggregate statistics for page views, referrals, search engine-driven traffic, and other useful metrics.

The main distinction mentioned in most Piwik reviews is that you, the site administrator, are in complete control over the data. It is logged to a local MySQL database, never viewed by Google, Yahoo, or another third-party company with its own ax to grind. Just as important, however, is the fact that you can customize Piwik extensively, by installing feature-adding plugins, modifying the existing configuration, or even by writing your own software to access Piwik's APIs.


If you are just getting started, the Piwik project offers download packages that will run on any recent LAMP server. Version 1.0 requires PHP 5.1 or greater and MySQL. Note that this does not mean that Piwik can only track PHP-driven sites; the requirement is just for the server on which Piwik itself runs.

The archive package provided at the site needs to be unpacked in place in the server's Web root; subsequently visiting the new Piwik URL will allow you to step through the installation process — checking dependencies, setting up databases and tables, and configuring the Piwik administration panel. To begin tracking, Piwik generates a JavaScript tag containing a token keyed to the site being tracked. You insert this tag into the <HEAD> section of each page you wish to track.

For small, static HTML sites, that is a relatively simple task, but if you use a content management system (CMS), it is much easier to install an integrated Piwik plugin that inserts the relevant tag for you automatically. Piwik plugins exist for WordPress, Joomla, Drupal, MediaWiki, TYPO3, Grails, Gallery, and many other open source CMSes.

Extending Piwik's tracking capabilities

JavaScript tags will not catch all site visits, of course. Some users run with JavaScript turned off, and some browsers (particularly in the mobile space) do not support JavaScript at all. A recent addition to the Piwik core provides an alternate tracking method that you may want to look into if you think that JavaScript is undercounting your visitors.

The "Piwik Tracking APIs" (as they are officially known, to distinguish them from the JavaScript Tracking API) provide three new ways to monitor visits. The first is the Simple Image Tracker; rather than a JavaScript tag, it uses a small (preferably invisible or at least non-intrusive) <IMG> tag. The image links back to the Piwik server, and can be used in concert with your existing JavaScript-based tracking. The basic Piwik JavaScript tag identifies the site being tracked with a token, $IDSITE, and the corresponding image tracker tag carries that same token in the URL of the image it requests.
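To make the image tracker concrete, here is a minimal Python sketch that builds such a tag. It is an illustration rather than the project's own snippet: the host name piwik.example.org and the helper name image_tracker_tag are made up for this example, and the query string assumes the idsite and rec parameters of Piwik's tracking endpoint, piwik.php.

```python
from html import escape
from urllib.parse import urlencode


def image_tracker_tag(piwik_base_url: str, site_id: int) -> str:
    """Return an <img> tag that requests piwik.php for one tracked site.

    Hypothetical helper: the base URL is whatever address your own
    Piwik installation answers on.
    """
    # idsite selects the tracked site; rec=1 asks Piwik to record the hit.
    query = urlencode({"idsite": site_id, "rec": 1})
    src = f"{piwik_base_url.rstrip('/')}/piwik.php?{query}"
    # escape() turns "&" into "&amp;" so the URL is valid inside an attribute.
    return f'<img src="{escape(src, quote=True)}" style="border:0" alt="" />'


if __name__ == "__main__":
    print(image_tracker_tag("http://piwik.example.org/", 1))
```

The resulting tag can sit alongside the JavaScript tag or replace it on pages where scripts are not available.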


The second new tracking method is calling the Piwik Tracking API directly from code. The project provides a reference implementation that calls the API from PHP.


The API is callable from any server-side language, including Ruby, Perl, Python, and many more. You can also combine the two approaches for Piwik's third new option, the "Advanced Image Tracker." In this tracking method, PHP code generates an image tag at execution time. There are a few shortcomings to this method, because the PHP execution environment cannot access all of the client-side data that the JavaScript tag can (such as screen resolution, local time, cookies, and so on), but it can catch far more visits than the JavaScript tag can.
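Since the API is callable from other server-side languages, a rough Python equivalent might look like the sketch below. This is not the project's reference implementation: the function names and the piwik.example.org host are hypothetical, and the url and action_name parameters are assumed to be accepted by piwik.php in the version you run.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tracking_url(piwik_base_url, site_id, page_url, page_title):
    """Compose a tracking request for one page view (no network access)."""
    params = {
        "idsite": site_id,          # which tracked site the hit belongs to
        "rec": 1,                   # ask Piwik to record the request
        "url": page_url,            # the page being viewed
        "action_name": page_title,  # the title shown in reports
    }
    return f"{piwik_base_url.rstrip('/')}/piwik.php?{urlencode(params)}"


def record_visit(piwik_base_url, site_id, page_url, page_title, timeout=5):
    """Send the tracking request and return the HTTP status code.

    The response body is not needed here, so it is ignored.
    """
    url = build_tracking_url(piwik_base_url, site_id, page_url, page_title)
    with urlopen(url, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only builds and prints the URL; call record_visit() against a real server.
    print(build_tracking_url(
        "http://piwik.example.org",
        1,
        "http://www.example.com/catalog",
        "Product Catalog",
    ))
```

Keeping URL construction separate from the network call makes the request easy to inspect before anything is sent. Note that a request made this way originates from your server rather than from the visitor's browser, which is one reason the client-side details mentioned above are unavailable.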

These new tracking methods not only allow you to record visits from users without JavaScript, but also let you place tracking tokens on pages hosted on other sites where JavaScript is not permitted, such as eBay auctions or social networking sites. Whether that is acceptable under those sites' terms of service may vary, of course.

There are also special techniques you can use to track non-HTML file requests (such as downloads), to gather statistics on 404 and other error pages, and to record certain pages in more than one category when generating reports.

The download tracking is done by adding CSS class attributes to the links in the page content, so it obviously will not catch direct URL requests, but it is an improvement. Error page tracking is easy to do in a CMS that has error page "templates"; other sites can accomplish the same thing by adding the Piwik JavaScript tag to a custom ErrorDocument defined in Apache's httpd.conf. Recording a page view in two or more categories (such as "ProductCatalog" and "Espanol") requires calling the piwikTracker.setDocumentTitle method in your JavaScript tag.

There are two Piwik plugins to add "click heatmap" tracking — a feature by which you can keep track of the amount of time the mouse pointer spends on various areas of the screen, which when aggregated tell you what portions of your site receive the most user attention. The older plugin is Piwik ClickHeat, which ties in to the third-party ClickHeat service. The HeatMap plugin is newer, still in development, but does not rely on an external party tracking the data for you, which may be of concern to some site administrators.

Collecting more data about visitors

An oft-heard complaint about Piwik is that it does not capture as much information as the older, proprietary analytics products. The code base is always growing, of course, but there are also several excellent community-written plugins that can extend Piwik's functionality.

One of the most requested additions is geographical data. Piwik 0.9 was the first release to include a map widget showing the national origins of site visitors' IP addresses. The map is very basic on its own, but you can extend it by installing the GeoIP plugin, which homes in further on visitors' IPs, resolving them to sub-national regions such as states and cities, and allows sorting by continent. Free software die-hards will want to know that the maps are generated with Flash (as are many other Piwik visuals), but they can be exported to common image formats such as PNG.

The EntryPage plugin adds a straightforward-but-useful metric to Piwik's reports: logging what each unique site visitor's first page hit was. This is particularly useful when building Piwik Goal campaigns, which are concerned with tracking a visitor's progress through the site, rather than simply counting the number of pages he or she visits.

Piwik presents you with two types of analytics: historical statistics generated from archived logs over time, and quasi-real-time stats generated once every ten seconds. Despite this rapid turnover, however, neither view tells you how many visitors are on your site right now; that takes a different approach. This is where the Live Widgets and Whois Live plugins can help. Live Widgets is the older of the two plugins, and thus may be more robust, but it only displays the "live visitor" count broken down by national origin. Whois Live adds more information, serving up live counts of visitors by IP, browser, OS, HTTP Referer, and more.

Piwik already records search keywords if they are accessible via HTTP Referer headers, but the Search Engine Position plugin takes this a step further, calculating the position that the incoming link held on the search results page. Consequently, you can adjust your marketing if you discover that your site is drawing in visitors through a number one ranking on "How to uninstall Example.com software" searches and a #400 ranking on "Example.com rulez." To make this plugin work, you will need to adjust some of Piwik's parameters, namely by increasing the datatable_archiving_maximum_rows_referers and datatable_archiving_maximum_rows_subtable_referers variables to large numbers.

Finally, the Community plugin allows you to track visitors by age, gender, and "user type" (meaning registered user or guest), by adding these columns to Piwik's database. Naturally, making this work requires adding custom glue code to your CMS to extract the relevant data from logged-in users' profiles, and perhaps to ask them for it in the first place. The plugin does track "unknown" users, though, so the privacy-conscious can simply opt out in their profile.

Generating better reports

Piwik's reports are another area where critics feel the application has some room for improvement. New with July's development release was one much-requested feature: the ability to generate PDF versions of the dashboard's reports, and automatically email them to site administrators on a user-defined schedule. There are several limitations to the built-in PDF report generator, though, which led one developer to build a separate PDF Exporter plugin. This plugin includes one notable improvement over the built-in exporter: the ability to generate "trend graphs" of various metrics over time, rather than simple tables that would need to be exported and collated together in a spreadsheet.

A separate plugin called ImageGraph generates PNG graphics from Piwik's data, which can be downloaded for easy inclusion in presentations or other document formats. This is an improvement over Piwik's built-in functionality because, as was mentioned earlier, the application uses Flash to generate its charts and graphics. Those Flash-based graphics look great on the Web dashboard, but cannot easily be exported. Furthermore, ImageGraph supports PNG alpha-channels, which makes its images easier to integrate into any document template.

Finally, the UserSettingsExt plugin extends Piwik's graph visualization in a new way, creating multi-level "pie charts" that group together like data points. The example provided is for browser statistics; the current Piwik code lists all variants of a browser separately (e.g., Firefox 3.6, Firefox 4, Firefox 2.1), which creates a glut of tiny slivers in the resulting pie chart. The plugin groups the browser variants together, showing first-level percentages divided up only by browser (Firefox, IE, Safari), with a second level subdivided for different versions of each. This is new code that appears likely to work its way into other Piwik visualizations, but you can install the plugin now to take advantage of it for browser statistics.
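The two-level grouping is easy to picture with a small Python sketch. This is not the plugin's code; the function names and the visit counts are invented for illustration, and it assumes each label has the form "Browser Version" with the version after the last space.

```python
from collections import defaultdict


def group_by_browser(version_counts):
    """Nest {"Firefox 3.6": n, ...} into {"Firefox": {"3.6": n, ...}, ...}."""
    grouped = defaultdict(dict)
    for label, visits in version_counts.items():
        # Split on the last space: "Firefox 3.6" -> ("Firefox", "3.6").
        browser, _, version = label.rpartition(" ")
        if not browser:
            # No space in the label, so treat the whole label as the browser.
            browser, version = label, "unknown"
        grouped[browser][version] = visits
    return dict(grouped)


def first_level_shares(grouped):
    """Return each browser's percentage of all visits (the inner pie ring)."""
    total = sum(sum(versions.values()) for versions in grouped.values())
    return {
        browser: 100 * sum(versions.values()) / total
        for browser, versions in grouped.items()
    }


if __name__ == "__main__":
    # Made-up visit counts, purely for illustration.
    counts = {
        "Firefox 3.6": 50,
        "Firefox 4": 10,
        "Firefox 2.1": 5,
        "IE 8": 25,
        "Safari 5": 10,
    }
    grouped = group_by_browser(counts)
    print(grouped["Firefox"])           # {'3.6': 50, '4': 10, '2.1': 5}
    print(first_level_shares(grouped))  # {'Firefox': 65.0, 'IE': 25.0, 'Safari': 10.0}
```

With the sample numbers, three Firefox slivers collapse into a single 65 percent slice, while the per-version counts remain available for the second level.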

All Piwik data can be exported in CSV form for easy import into another program; for some, the ability to create a custom spreadsheet out of a particular set of metrics is a critical feature. Blogger Arthur Lee describes a better approach altogether, using the Piwik APIs to directly import data into spreadsheet cells. You can access an online example with this Google Docs spreadsheet (in read-only mode, of course).

The technique uses Google Docs' importXML function to fetch the value of particular cells directly from the Piwik API's XML feed. For example, the base URL of the Piwik site is defined in cell (C,4) of the "Setup" sheet. Using that as a reference, the "Dashboard" sheet pulls the number of direct visits from yesterday with the cell function =importXML(Setup!$C$4&"index.php?module=API&method=Referers.getRefererType&idSite="&Setup!$C$5&"&period="&C2&"&date=yesterday&format=xml&token_auth="&Setup!$C$6,"//nb_visits").

The Setup!$C$4 reference resolves as the base URL, with the remainder of the quoted string resolving as the URL for the API call in question. You can make a copy of the Google Docs spreadsheet and explore it yourself. Don't be alarmed at the use of Google Docs, however: it is only a convenience. The importXML function is specific to Google Docs, but the API call itself is an ordinary HTTP request, so other spreadsheets, including Microsoft Excel and open source offerings like Gnumeric, can use the same feed if they offer a way to import XML from a URL.
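The same request can be made outside a spreadsheet. The Python sketch below rebuilds the URL from the formula and pulls out every nb_visits value, much as the //nb_visits expression does. The function names, the piwik.example.org host, the period value "day", and the sample XML are assumptions for illustration; the real response from your installation may be shaped differently.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen


def build_api_url(base_url, site_id, token_auth, period="day", date="yesterday"):
    """Compose the same API call that the spreadsheet formula assembles."""
    params = {
        "module": "API",
        "method": "Referers.getRefererType",
        "idSite": site_id,
        "period": period,
        "date": date,
        "format": "xml",
        "token_auth": token_auth,
    }
    return f"{base_url.rstrip('/')}/index.php?{urlencode(params)}"


def extract_visit_counts(xml_text):
    """Return every <nb_visits> value in document order, like //nb_visits."""
    root = ET.fromstring(xml_text)
    return [int(node.text) for node in root.iter("nb_visits")]


def fetch_visit_counts(base_url, site_id, token_auth, timeout=10):
    """Fetch the XML feed over HTTP and extract the visit counts."""
    url = build_api_url(base_url, site_id, token_auth)
    with urlopen(url, timeout=timeout) as response:
        return extract_visit_counts(response.read())


if __name__ == "__main__":
    # Hypothetical response body, used here instead of a live server.
    sample = (
        "<result>"
        "<row><label>Direct Entry</label><nb_visits>42</nb_visits></row>"
        "<row><label>Search Engines</label><nb_visits>17</nb_visits></row>"
        "</result>"
    )
    print(build_api_url("http://piwik.example.org/", 1, "TOKEN"))
    print(extract_visit_counts(sample))  # [42, 17]
```

Treat the token_auth value like a password, since anyone holding it can read the same reports.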

You will need to study the Analytics API if you wish to build your own spreadsheet-based reports in the future, but at least you can do so without running the risk of interrupting your Piwik service, as you might when writing and debugging a custom plugin.

Also in the custom-data-view realm, it is worth noting that there are a couple of desktop applications written just to connect to Piwik analytics. Unfortunately, even though they are open source, they both use the Adobe AIR framework, which is available for Linux but is proprietary and binary-only. Adobe AIR is the platform of choice because it allows easy integration with the Flash code already used to generate reports in Piwik's Web interface. For some real extra credit this weekend, see about porting one of them to a freer platform. The hits generated on your site by releasing that code could pay for your hosting for a year, assuming you target your ads for your visitors accurately enough. And with Piwik, if you can't do that, it's your own fault.
