The Case for Data-Driven Open Source Development


The lack of standardized metrics, datasets, methodologies and tools for extracting insights from Open Source projects is real.

Open Source Metrics That Actually Matter

Let’s take a look at the first part of the problem: the metrics. OSS Project stakeholders simply don’t have the data to make informed decisions, identify trends and forecast issues before they arise. Providing standardized and language agnostic data to all stakeholders is essential for the success of the Open Source industry as a whole. Everyone can query the API of large code repositories such as GitHub and GitLab and find interesting metrics but this approach has limitations. The data points you can pull are not always available, complete or structured properly. There are also some public datasets such as GH Archive but these datasets are not optimized for exhaustive querying of commits, issues, PRs, reviews, comments, across a large set of distributed git repositories.

Retrieving source code from a mono-repository is an easier task, but code retrieval at scale is a pain point for researchers, Open Source maintainers or managers who want to track individual or team contributions. 

Read more at The New Stack