Linux.com

Feature

Fedora's metrics have ripple effect

By Lisa Hoover on January 29, 2007 (8:00:00 AM)


Fedora announced this month that by using a tracking tool to monitor unique IP addresses, it was able to determine that Fedora Core 6 now has more than one million users. What does all this metric gathering mean for future Fedora releases? Moreover, what does it mean for the Linux community at large? The answer on both counts: plenty.

Fedora decided to track metrics with the release of Fedora Core 6 (FC6) because the lack of data from previous releases made it difficult to be sure what users value in Fedora Core packages. Focus groups like November's Fedora Summit help the team plan future releases but don't reveal whether a release has hit the mark once it ships. According to Fedora Project Leader Max Spevack, the best way to serve the Fedora community is to understand what it is looking for and then deliver. Metrics help the team determine where it has succeeded and where it could do better.

The method

As the release of Fedora Core 6 drew near, team members knew they wanted to be able to track statistical information to better understand how people use Fedora. The team turned to the user community for ideas on what types of data to collect and what methods to use. Suggestions on the Fedora Metrics wiki range from user surveys and registration to embedding a file within the package that would send user data to a central server. The overriding concern from the team and user community alike is privacy, so invasive and sly data collection methods are not being considered.

"The different methods discussed ranged all the way to very intrusive registration with UUID," says newly appointed Fedora Infrastructure Leader Mike McGrath. "We like to avoid being evil so we're going to do whatever we can to make sure people can not participate and to make sure stuff is submitted anonymously. The only way to track it back to a machine or user is for the user to actually give us the identifier and say, 'My sound doesn't work, here's its profile: 3333-44-2-22322223424.'"

Cacti, an open source data collection and graphing tool, was already monitoring other pieces of Fedora's infrastructure, so using it with FC6's release was a natural extension. Setting up and implementing Cacti was a group effort among several members of the infrastructure team, who worked diligently to get it ready for FC6's release date. Most of the information about what they did "is in the heads of the infrastructure guys," Spevack says, but anyone who would like to discuss how to implement something similar for their project is welcome to contact the team.

Cacti tracks the number of unique IP addresses that connect to yum with a new installation of FC6 in search of updates. Determining the number of unique IP addresses is the main focus of this metric, but McGrath says several other pieces of information, as yet to be determined, will be collected following the release of FC7.
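
The counting idea described here can be illustrated with a minimal Python sketch. It is not Fedora's actual tooling: the log format (client IP as the first whitespace-separated field of each update request) and the function name count_unique_clients are assumptions made purely for illustration.

```python
import ipaddress


def count_unique_clients(log_lines):
    """Count distinct client IP addresses in an iterable of log lines.

    Assumption: the first whitespace-separated field of each line is the
    client IP. Lines that do not start with a valid IP are skipped.
    """
    seen = set()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # blank line
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address
        seen.add(addr)
    return len(seen)


# Hypothetical sample of update requests (documentation-range addresses).
sample = [
    "192.0.2.10 GET /updates/repodata/repomd.xml",
    "192.0.2.10 GET /updates/repodata/primary.xml.gz",
    "198.51.100.7 GET /updates/repodata/repomd.xml",
    "not-an-ip GET /",
]
print(count_unique_clients(sample))  # prints 2
```

A count like this treats every machine behind one NAT router as a single user and may count one machine several times if its address changes, so it is an estimate rather than an exact install base.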

According to Spevack, it's not enough to simply count how many times the distribution has been downloaded; it's also important to gather data that will help developers determine what to focus on for future releases. Spevack says knowing what packages are getting the most bugs filed, which are being installed most often, and so on gives the team a clearer understanding of what the user community likes and dislikes about Fedora.

While only minimal information was collected during the release of FC6, the team hopes to cull much more data with the release of Fedora Core 7 later this year. Spevack says that, as with the metrics collected from FC6, "We're going to put the results out regardless of what they show. If the numbers are good, that's nice. If not, well, then we have a benchmark and it tells us where to improve.

"Statistics is just one way of understanding how people are using our software. Any insight we can get into how folks are using Fedora helps us to make better decisions."

The metrics gleaned from Fedora's data collection amount to more than just a chance for developers to pat themselves on the back, however. They also provide the opportunity to show the growing number of Linux users within the computing community which, in turn, may goose hardware vendors into offering more Linux-friendly goods and services.

"This provides objective data that helps prove Linux is growing and it helps build a case for Linux in general," says Spevack. "Also, we always say we wish hardware vendors had more [Linux-capable] drivers. Well, if you can go to them and say, 'Hey, there's millions of people using this,' then maybe they will listen. In the real world, you need data to prove your case. Well, here it is."

Although neither Red Hat nor Fedora has approached any vendors with the results of Fedora's metrics, Spevack says Red Hat remains committed to urging vendors to continue to be Linux-friendly, and "if Fedora's numbers can add another arrow to the quiver, then excellent."

Better software through better metrics

A final decision on what metrics will be collected and what methods will be used is still weeks away, but McGrath says end-user participation will not be mandatory. "Users who are highly concerned about security can simply not participate, though I'd like to note that while in the minority, they are a very vocal group," says McGrath. He goes on to say most people are in favor of gathering metrics as long as there is a purpose and a goal behind it, not just random, meaningless data collection.

Like Spevack, McGrath says that thorough data collection will ultimately lead to better Fedora packages. In addition to gathering metrics on user hardware, he would "also love to get a proper survey engine so we can flat out ask people what they're using our software for. We'd also be interested in getting a package list though that's down the road. This would be useful to see what packages are popular and which ones are just duds."

The team would like to collect more elaborate data for the release of Fedora Core 7, so discussions are under way to expand on the data collection method now in place. Though there is still work to be done on the metrics-gathering tools for that release, team members say they will be ready in plenty of time for its April release. Spevack says he is looking forward to sharing the results and getting community feedback. McGrath agrees and says he wishes more software vendors would collect and share similar information. "To my knowledge no one else is actually showing users the math and methods to estimate install base, popular architectures, etc. It's a shame."


Comments

on Fedora's metrics have ripple effect

Note: Comments are owned by the poster. We are not responsible for their content.

much more than 1 million

Posted by: Anonymous Coward on January 30, 2007 06:31 AM
I and my friends got our copies of Fedora core from someone who downloaded and shared it over my college's LAN.

i think this happens in all LANs with a large user-base.

so, count it to be much more than 10^6

#

Re:much more than 1 million

Posted by: Anonymous Coward on January 30, 2007 07:09 AM
I think they are tracking usage of the Updates service not downloads to get around the problem you highlight.

#

Re:much more than 1 million

Posted by: Anonymous Coward on January 30, 2007 07:15 AM
Yes, my download has until now been installed at least 15 times. And yet another bunch is still sitting on FC5 and Debian.

Although I agree that it is important to get metrics of this kind, I do feel that the "community" of most distros at large has been able to provide a good balance and stability for a long time. It is doubtful whether inventory type analysis will actually provide the data you are looking for. Direct feedback has been working in favor of the developers for a long time already. Maybe slow, but steadily better.

#

Re:much more than 1 million

Posted by: Anonymous Coward on January 30, 2007 07:28 AM
I think you missed some specifics about their statistics gathering, since the article quite clearly states that they gather their statistics from unique IP update hits (yum) etc... Hence your sharing of ISOs has little influence.

But I still see a flaw in their statistics gathering (I'm sure there are many more).

For instance I have 5 machines all running FC6 sitting behind a firewall/router that NATs all traffic. So my machines all count for 1 in their statistics (as far as I can see anyway, I may have missed some finer detail).

Cheers..

#

Re:much more than 1 million

Posted by: Anonymous Coward on January 30, 2007 07:57 AM
If you follow the link in the article that says "metrics collected from FC6" and points to http://fedoraproject.org/wiki/Statistics you will see that they are aware of this case (the section "Accuracy of metrics"). It somewhat balances out the group using dynamic IP, but they believe the group with NAT is (significantly) bigger than the group with dynamic IP.

#

Anonymous

Posted by: Anonymous Coward on January 30, 2007 07:12 AM
I don't mind them doing this as long as:
1. they fully disclose exactly what they are collecting, and how they are using the data.
2. It is totally anonymous, i.e. they do not store my IP address along with my system details (preferably any data which uniquely identifies my machine is immediately deleted).

#

Re:Anonymous

Posted by: Anonymous Coward on January 30, 2007 09:59 AM
"totally anonymous" is relative. If they transmit any kind of GUID without encryption on a frequent basis, they are de-anonymizing your traffic to anyone listening for stuff that can be correlated between sessions (hi NSA dataminers!).

(IIRC, RHN sends identifying information encrypted, so it doesn't cause that particular problem.)

#

Comparison with Debian

Posted by: Anonymous Coward on January 30, 2007 03:02 PM
Debian has a package called popularity-contest for measuring this kind of thing. It currently shows about 26,700 submissions, which seems quite low.

Some reasons why the number is not higher include:
- popularity-contest has not yet been included in the default install of a stable release; and
- The default option when prompted whether to use popularity-contest is to disable it.

Popularity-contest's purpose seems to be more about knowing which packages are popular, rather than how many users/installations there are. In that case, if only 10% of users enable the package the data are still meaningful. For a total number of installations, all it provides is a minimum bound with no clear indication of how many installations there really are.

Stats available from http://popcon.debian.org/

#

I am fairly certain this is not right

Posted by: Anonymous Coward on January 30, 2007 09:32 PM
The fact is most people use dynamic IPs. So if someone else using my ISP installs Fedora, they may have the same IP address I had when I installed it. Many people also use NAT routers. In fact there are some ISPs that use NATs, so they share IPs among customers. Also, some people tend to go to a friend's house to install a new operating system.

However, people also tend to reinstall or sometimes even install multiple versions (say 32 bit and 64 bit) on the same computer. If they are using dynamic IP addresses, each of these installs may show up multiple times.

I have also seen test labs, where the test machines are all fresh install images, restored for each test cycle. Each boot to a clean image might count as a separate install, or might not depending on the network configuration used...

It would probably be better to count the number of times a common update rpm is fully downloaded as a semi-reliable count.

Bill

#

Re:I am fairly certain this is not right

Posted by: Anonymous Coward on January 30, 2007 11:04 PM
It's like web statistics: interesting but too vague to do any real research on.

A bit of fun, some nice figures to chuck at clients but only a nodding acquaintance with reality.

#

Fedora seems to agree with you

Posted by: Anonymous Coward on January 31, 2007 12:40 AM
"Cacti tracks the number of unique IP addresses that connect to yum with a new installation of FC6 in search of updates"

They are using update hits plus a unique identifier to measure how many Fedora installations are being updated rather than how many times the install .iso is downloaded. The end result being a count of how many installs since each one is expected to update at least once.

I may be reading this wrong but either way, I'm on a different distro so it's just news in passing.

#

(non) unique differentiating identifier

Posted by: Administrator on January 31, 2007 02:33 PM
One way to differentiate between machines behind a NAT would be to use a non-unique identifier to separate machines -- e.g. using the last 1, 2 or 3 bytes of the first Ethernet card.

That way you could tell if there are 7 different machines behind a NAT, but you wouldn't be able to use the number to uniquely identify a user. And if you have a dynamic IP, the next time your ISP gives you a new IP address, you disappear into the morass.
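
A rough Python sketch of this idea, under stated assumptions: each update check-in is reported as an (IP, MAC) pair, MAC addresses are colon-separated strings, and the function name count_installs and the default of 2 suffix bytes are hypothetical choices for illustration only.

```python
def count_installs(reports, suffix_bytes=2):
    """Count distinct (IP, MAC-suffix) pairs.

    reports: iterable of (ip, mac) tuples, mac like 'aa:bb:cc:dd:ee:ff'.
    Only the last `suffix_bytes` octets of the MAC are kept, so the value
    separates machines behind one NAT address without being a globally
    unique identifier.
    """
    seen = set()
    for ip, mac in reports:
        octets = mac.lower().split(":")
        suffix = ":".join(octets[-suffix_bytes:])
        seen.add((ip, suffix))
    return len(seen)


# Hypothetical check-ins (documentation-range IPs, made-up MACs).
reports = [
    ("203.0.113.5", "00:16:3e:aa:01:02"),
    ("203.0.113.5", "00:16:3e:bb:03:04"),   # second machine behind the same NAT
    ("203.0.113.5", "00:16:3e:aa:01:02"),   # first machine checking in again
    ("198.51.100.9", "00:16:3e:cc:01:02"),  # same suffix, different IP
]
print(count_installs(reports))  # prints 3
```

Two machines behind the same NAT whose cards happen to share a suffix would still be merged, and a machine whose IP changes is counted again, which is the trade-off the comment accepts for anonymity.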

#

much less than a million?

Posted by: Administrator on January 30, 2007 11:17 AM
is dynamic ip addressing presumed and factored into this figure? if a major portion of internet access is dialup, this would drastically compromise any claim not accounting for multiple addresses pinged from single sources. how about distributed loading of hundreds of customers across a few dozen dynamic numbers? that warps it in the other direction. i seriously doubt any tally based on "hits" has any veracity whatsoever.

#




 