March 10, 2005

Filesystem data visualization using JPGraph

Author: Glenn Mullikin

JPGraph is a set of programs written in PHP that plots data into a
wide range of graphs and formats the results. Licensed under the
Trolltech QPL License, JPGraph is now at Version 1.17. Whatever your
data, JPGraph can help you to view it graphically, letting you see
relations more clearly. Such data visualization may not be important
to a computer, but, to a person, it can make a lot of difference to
analysis.

To see what JPGraph can do, let's look at the executable binary
files in the /usr/bin directory. I'll exclude the symbolic links.
I'll also omit the more than 130 files in the /usr/bin/X11
sub-directory. My purpose isn't to be comprehensive, just to show
what JPGraph can do. Specifically, I'll be using JPGraph to look at
three basic questions:

  • Is there a relation between a binary executable's file size
    and the number of shared libraries that it uses? I used ls -l to get file sizes. I also
    used ldd [filename] to count the number of shared libraries used.

  • When was the last time each binary executable was accessed? To get the last access times,
    I used the command stat -c %X [filename]

  • How many files use the shared library files? I used the ldd command
    to get shared library listings and then counted up all the hits I got on each shared library.
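The collection steps in the list above can be sketched in a few lines. This is a minimal Python sketch, not the script used for this article: the helper names scan_directory and count_shared_libs are hypothetical, and the parser assumes that ldd prints one library per line and reports "not a dynamic executable" (or "statically linked") for files with no shared libraries.

```python
import os
import stat


def scan_directory(directory):
    """Return (name, size, last_access) for each regular file in directory.

    os.lstat does not follow symbolic links, so links are skipped, and
    sub-directories such as X11 are skipped because they are not regular
    files. st_atime is the value that `stat -c %X` prints.
    """
    rows = []
    for name in sorted(os.listdir(directory)):
        info = os.lstat(os.path.join(directory, name))
        if stat.S_ISREG(info.st_mode):
            rows.append((name, info.st_size, int(info.st_atime)))
    return rows


def count_shared_libs(ldd_output):
    """Count the libraries listed in the text printed by `ldd [filename]`."""
    count = 0
    for line in ldd_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumed wording of the messages ldd prints for scripts and
        # static binaries; neither counts as a shared library.
        if "not a dynamic executable" in line or "statically linked" in line:
            continue
        count += 1
    return count
```

Feeding count_shared_libs the captured output of ldd for every name returned by scan_directory("/usr/bin") would give the two columns needed for the first graph.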

I printed the data for the first two questions to plain text files. For the third, I used a
MySQL database for more flexibility.

While looking at the graphs that result, I'll also comment on some
of the formatting features offered by JPGraph. I'll jump around a
bit, but seeing the features in action shows their usefulness better
than talking about them separately.

Graph 1: Is there any relation between file size and the number of shared libraries used?

Figure 1 shows the results of running the ldd
command on the /usr/bin directory. I've also used this graph to
showcase some of the features JPGraph offers.


Figure 1

The blue line represents the filesize, and you can see how
the filesize decreases. To see whether the number of
shared libraries decreases as the file sizes decrease, I created a
second Y-axis on the right-hand side of the graph. Once I collected
the data, I was able to graph those same files in /usr/bin
using the number of shared libraries they used.

But you'll notice that there are green, red and black circles, all
in slightly different sizes. What's going on there? Well, JPGraph
lets you do function callbacks in which you can alter the
color and sizes of your plot points according to the Y-axis value,
or, in this case, the Y2-axis value. The green circles represent
files that use between 10 and 20 shared libraries. Black circles
represent 0-9 links to shared libraries, and red circles over 20.
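The decision such a callback makes can be sketched as follows. In JPGraph the callback is a PHP function attached to the plot's marks; the Python below only illustrates the logic, the function name mark_style is hypothetical, and the pixel sizes are assumptions because the article does not state them.

```python
def mark_style(library_count):
    """Map a Y2 value (number of shared libraries) to a (size, color) pair.

    The thresholds follow the article: 0-9 is black, 10-20 is green and
    anything over 20 is red. The sizes 4, 6 and 8 are illustrative only.
    """
    if library_count < 10:
        return (4, "black")
    if library_count <= 20:
        return (6, "green")
    return (8, "red")
```

Because the style is derived from the plotted value itself, a single function covers every point on the Y2-axis.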

I could have simply used the same color (and size) for all the
Y2-axis data points, but then the results wouldn't be so obvious.
This way, you can immediately see that the green circles are heavily
outnumbered by the red circles. In turn, the red circles are heavily
outnumbered by the black circles.

Any circles in the pink cross-hatched area represent executables that
use 10-20 shared libraries. Also, because of the way I defined the
callback function, any circles lying in the cross-hatched area are
going to be green. Circles lying above the cross-hatched area
represent executables that use more than 20 shared libraries --
although, of course, they don't all use the same ones.

Notice the Y-axis and how it uses a logarithmic scale. That was
necessary because our filesizes range from less than 100 bytes all
the way up to somewhere between one and 10 megabytes. One megabyte is
10 to the sixth power. JPGraph uses 10^6 to represent 1,000,000
because it is easier to read.
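To make the arithmetic behind the logarithmic axis concrete, here is a small Python sketch. It is not JPGraph code, and the function names are hypothetical; it only shows how a size in bytes maps to a position on a base-10 log axis and to a 10^n style label.

```python
import math


def log_axis_position(size_bytes):
    """Position of a file size on a base-10 logarithmic axis (size >= 1)."""
    return math.log10(size_bytes)


def decade_label(size_bytes):
    """Label of the power of ten at or below size_bytes, e.g. 10^6.

    Counting digits avoids floating-point rounding at exact powers of ten.
    """
    exponent = len(str(int(size_bytes))) - 1
    return f"10^{exponent}"
```

On such an axis a 100-byte file and a 10-megabyte file are only five units apart, which is why both fit on the same graph.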

So, after gathering the data and figuring out how to present it,
what did I learn? The first thing I learned is that there are about
1,253 files in my /usr/bin subdirectory -- excluding the /X11
sub-directory -- which are not symbolic links. It turns out that
around 450-500 of these files are not dynamic executable files, but,
presumably, text scripts that call other executable files. These
files are represented by the black circles on the Y2-axis zero value
line.

Perhaps I should have excluded such files from consideration. However,
they do not affect the main idea that I get from looking at the
graph. Although files linked to over 20 shared libraries (the red
circles) are slightly more numerous for the first 600 files than they
are for the next 600, the pattern is not nearly consistent enough for
us to say that smaller files are generally linked to fewer libraries.
However, I can conclude that the majority of the binary executable
files lie under the pink cross-hatched area, which means that they use
fewer than ten shared libraries.

Before moving on, note that the graph in Figure 1 also showcases
many of JPGraph's features, including:

  • Using True Type fonts. I used one named Bazooka for my X- and
    Y-axis titles.

  • Shading under the line graph from one x value to another x
    value. I shaded under the graph in a subdued yellow color to
    highlight the files that lie between 10^4 bytes and 10^5 bytes in
    size.

  • Shading an entire vertical strip from one x value to another
    x value. I shaded in very light brown all the files that lay
    between 10^3 and 10^4 bytes and had the computer figure out where to
    start and stop.

  • Using a gradient color scheme for the margins, while leaving
    the plot area in a solid color. The blend colors I used were red and
    black, but you can specify other colors.

  • Using alpha blending to specify a transparency value
    between zero and one. A typical value that I might use is .5 or so.
    In Figure 1, you can see how I used it to allow the circles to show
    even though the areas had vertical fills in two different sections.
    If the vertical fills simply covered the circles up, that would
    defeat the purpose of the graph.
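The effect of the transparency value in the last item above can be illustrated with the usual blending formula. This is a generic Python sketch, not JPGraph's internals; it assumes that 0 means a fully opaque fill and 1 a fully transparent one, and the function name blend is hypothetical.

```python
def blend(foreground, background, transparency):
    """Mix two RGB colors channel by channel.

    transparency 0 shows only the foreground (the fill); transparency 1
    shows only the background (whatever lies underneath the fill).
    """
    if not 0.0 <= transparency <= 1.0:
        raise ValueError("transparency must be between 0 and 1")
    return tuple(
        round(f * (1 - transparency) + b * transparency)
        for f, b in zip(foreground, background)
    )
```

With a transparency of .5, each pixel of a fill drawn over a circle ends up halfway between the fill color and the circle color, so the circle stays visible.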

Graph 2: When were binary executables last accessed?

To answer this question, I decided that a scatter plot would
help us see when files were last accessed. I also decided to check
file sizes, since a multi-megabyte file that hasn't been accessed in
two years might be more of a candidate for deletion than one that
only uses 100 kilobytes. To plot this information, I put the last
access time of each file (in Unix timestamp format, seconds since the
epoch) on the X-axis and its size in bytes on the Y-axis. To enhance
the graph, I added the Tux logo after tweaking it slightly in the
GIMP.


Figure 2

The graph in Figure 2 is the result. It shows a large mass of
red squares on the right-hand side that are stacked on or about the x
value of January 6, 2005. This means that all of those files were
last accessed on that date. The majority of the files in /usr/bin
fall into this category. However, as we move to the left, another
small cluster of red squares centers on September 12, 2004. The next
large mass of red squares doesn't appear until the interval
between Dec 15, 2002 and Feb 11, 2003.

How many such files are there in that last cluster? Our Y2-axis
was designed to answer that question. The blue triangles with the
numbers above them show that 293 files are represented by all the red
squares stacked between Dec 15, 2002 and Feb 11, 2003.

More generally, one can see that the cumulative file count --
the orange area -- grows relatively slowly until the very far end of
the graph at January 6, 2005. The graph shows that 477 files were
last accessed before November 9, 2004. The remainder of the 1,253
files were accessed after that time.
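A cumulative count like the one plotted on the Y2-axis can be derived directly from the list of last access times. The following is a minimal Python sketch under that assumption; it is not the code behind Figure 2, and the function names are hypothetical.

```python
from bisect import bisect_left


def files_accessed_before(access_times, cutoff):
    """Number of files whose last access time is strictly before cutoff."""
    return bisect_left(sorted(access_times), cutoff)


def cumulative_counts(access_times):
    """Return sorted (timestamp, running_total) pairs for a cumulative plot.

    Files that share a timestamp are merged into a single point that
    carries the running total after all of them have been counted.
    """
    points = []
    for total, timestamp in enumerate(sorted(access_times), start=1):
        if points and points[-1][0] == timestamp:
            points[-1] = (timestamp, total)
        else:
            points.append((timestamp, total))
    return points
```

Passing the Unix timestamp for a given date as the cutoff gives the kind of "files last accessed before this date" figure quoted above.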

What conclusions can be drawn from this graph? What strikes me is
how long those vertical strips of squares are. It doesn't appear to
make much of a difference what the filesize is (the y axis value).
Files of all sizes have the same distribution of last access times.
I am not exactly sure why no files have a last access time earlier
than Dec 15, 2002, but that may have been when I installed the system.

This particular graph allowed me to experiment with function
callbacks for formatting text labels on the X-axis. When doing this
graph, the plots on the y and y2 axes were required to have the same
x values so that they could be overlaid. However, I couldn't supply
the dates in the form December 13, 2002 because JPGraph
couldn't figure out how to order them by time. I had to use the Unix
timestamp as the time value, and then use a callback function to
reformat them into a human-understandable formatted date.
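The job of such a label callback is simply to turn a number of seconds into a readable date. Here is a minimal Python sketch of that conversion; it is not the PHP callback used for Figure 2, the function name is hypothetical, and it assumes dates are shown in UTC.

```python
from datetime import datetime, timezone

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def format_timestamp(seconds):
    """Format a Unix timestamp as a date such as 'January 6, 2005' (UTC)."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return f"{MONTHS[moment.month - 1]} {moment.day}, {moment.year}"
```

The plot keeps the raw timestamps, which sort correctly as numbers, and only the axis labels pass through the formatter.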

Figure 2 also allowed a few other niceties, such as:

  • Using the text feature of JPGraph to place text at an
    arbitrary location on the graph while specifying color and
    transparency. (I used a transparency setting of 0.4 to allow any red
    squares to show through. Cumulative File Count is the
    text that I placed in white using a custom TrueType font.)

  • Printing only the lines on the major divisions of the Y-axis
    and making them dotted lines. This formatting was useful to maintain
    a semblance of the Y-value at each power of 10 since the graph was
    logarithmic on the Y-scale.

  • Using the tab title feature to display the title of the
    graph. The text color, background color and frame color can all be
    set -- I used magenta, black and green.

Graph 3: How many files use the shared library files?

This graph highlights more of JPGraph's abilities. The previous
graphs were 640 by 480, but for this graph I needed more vertical
space, so I opted to make it 480 by 740. Even so, I had to confine
myself to using only the top 50 shared libraries.


Figure 3

Among the top 50 are two shared libraries that are each used by
832 files in /usr/bin. That is probably all the binary executables.
Then we see libm.so.6 with 414 dependencies, libdl.so.2 with 250, and
then the rest.

I designed the graph so that it would be readable and
understandable without the need for X- and Y-axis titles. I chose a
rotated type graph, a horizontal bar graph. The blended bars going
from green to blue make the consecutive bars stand out from one
another. I decided to put the value inside each bar, instead of to
the side of it, because the farther a bar is from the top, the
harder it is to tell its exact X-value. I almost tilted the text
names of the shared libraries at a five degree angle, but I decided
that hurt readability slightly. The SetLabelAngle method takes one
argument, the number of degrees, positive or negative.

If you look at the graph in Figure 3, you'll notice that, because
the bars are decreasing in size, there's empty space on the right
side. Rather than leave it blank, I placed a legend there. Red on a
yellow background is what the JPGraph documentation and examples use
and I saw no reason to change that.

One last thing: For the other two graphs, I used simple text files
for my data storage. For this one, I used MySQL. If I wanted to
graph more than the top 50 libraries, all I would need to do is
change my query. That's the type of power and
flexibility that MySQL can provide when working with JPGraph. With
text files, varying the display would be much harder.
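To show why the query is the only thing that has to change, here is a minimal sketch of a top-N query. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the table name dependencies, its columns and the sample rows are all hypothetical, since the article does not describe the actual schema.

```python
import sqlite3


def top_libraries(connection, limit):
    """Return (library, number_of_files) rows for the most used libraries."""
    query = (
        "SELECT library, COUNT(*) AS users "
        "FROM dependencies "
        "GROUP BY library "
        "ORDER BY users DESC, library ASC "
        "LIMIT ?"
    )
    return connection.execute(query, (limit,)).fetchall()


def demo_connection():
    """Build an in-memory database with a few made-up rows for illustration."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE dependencies (executable TEXT, library TEXT)"
    )
    rows = [
        ("a", "libc.so.6"), ("b", "libc.so.6"), ("c", "libc.so.6"),
        ("a", "libm.so.6"), ("b", "libm.so.6"),
        ("c", "libdl.so.2"),
    ]
    connection.executemany("INSERT INTO dependencies VALUES (?, ?)", rows)
    return connection
```

Calling top_libraries(demo_connection(), 2) returns the two most used libraries in the sample data; graphing a different number of libraries only means passing a different limit.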

Conclusion

Of course, JPGraph doesn't have everything. For example, I would
like the ability to do three-dimensional graphs. A three-dimensional
bar graph can show relationships that might be impossible to observe
in two-dimensional graphs, such as last access time versus filesize
versus number of shared libraries. Another feature that needs
enhancing is the callback function. In Figure 1, I used the callback
function to set the color and size of the filled circles. The problem
is that I was only able to use one number to determine both the size
and the color. It would be nice if those could be determined by other
arrays. Similarly, I would like to be able to use different colors
on the individual bars of a single bar plot instead of having to use
multiple bar plots. Overall, I am not disappointed with JPGraph's
functions, but a few could offer more fine-grained control.

You may not be interested in filesystems and how many shared
libraries your system has, but you don't need my interests to
appreciate JPGraph. No matter what your data, you might like to take
advantage of JPGraph to perform your own data visualization. And with the comprehensive
documentation and the hundreds of sample graphs that you can modify
with your own data (located in http://localhost/jpgraph/src/Examples),
JPGraph can have you up and running in no time.

Download JPGraph and see for yourself.

Glenn Mullikin is a professional Linux journalist.

Category:

  • PHP