CryptOSS

Frequently Asked Questions

What is this site about?

This site reports historical open source development activity of popular cryptocurrency projects on GitHub. The current data runs from Jan 21, 2018 to Feb 4, 2019. The data was used in an econometric study of cryptocurrencies on GitHub. A CSV copy of the data, which also includes cryptocurrency prices over time (not shown on this site), is available on Zenodo. The dataset is currently archival and not actively updated, though it may expand if data collection resumes.

Why is there no data for cryptocurrency X?

We tracked roughly the top 200 cryptocurrencies starting on Jan 21, 2018. The list of cryptocurrencies lives in a database file. We are not actively updating and tracking data for additional currencies at this time, but you may be able to do so yourself by updating the DB and using the tooling.

Why is data missing for some days?

Collecting large amounts of data over time is generally hard. Various issues and limited resources mean we lost data for some days. The GHTorrent project also collects data from GitHub, and may be one way to recover missing values.
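If you work with a copy of the data, a quick way to see which days are missing is to compare the dates you have against the full date range. The following is a minimal sketch, not part of the site's tooling: the function name `missing_days` and the sample dates are hypothetical, and it assumes you have already extracted one `date` per collected day.

```python
from datetime import date, timedelta
from typing import Iterable, List


def missing_days(collected: Iterable[date], start: date, end: date) -> List[date]:
    """Return every day in [start, end] that does not appear in `collected`."""
    have = set(collected)
    total_days = (end - start).days + 1  # inclusive range
    all_days = (start + timedelta(days=i) for i in range(total_days))
    return [day for day in all_days if day not in have]


# Hypothetical sample: two days were lost in the first week of collection.
sample = [date(2018, 1, 21), date(2018, 1, 22), date(2018, 1, 25)]
gaps = missing_days(sample, date(2018, 1, 21), date(2018, 1, 25))
print([d.isoformat() for d in gaps])
```

The resulting list of gaps could then be used to look up the corresponding days in another source such as GHTorrent.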

What about cryptocurrencies that don't release open source software?

We recognize that open source software may not make sense for some cryptocurrency models. Our aim is to make information available for projects that do embrace open source software development.

What do the symbols mean?

Stars generally denote the popularity of software repositories. GitHub users star projects of personal interest. When users star a repository, it can act as a personal bookmark. A GitHub user may star any particular repository only once.

A fork is a copy of another repository of code, managed by a different user or organization than the original repository. Forks are typically created when a user or organization wishes to change an existing codebase. Sometimes the changes are incorporated (merged) back into the original repository. In other cases, the forked code becomes an entity of its own. An example of the latter case is Litecoin, which is a fork of Bitcoin. The number of forks can indicate how many developers interact with a repository's code, either for personal use or for extending the original codebase.

Subscribers watch repositories to receive real-time notifications about new activity in a codebase. Subscribers tend to be actively interested in new developments in the codebase.

How do you calculate aggregate metrics? Aren't forks problematic?

The main page shows aggregate metrics over multiple repositories managed by a single user or organization. The aggregates include: (a) code additions and deletions, (b) commit activity, (c) contributors, (d) stars, (e) forks, (f) subscribers, and (g) the last updated time. For contributors (c), GitHub provides up to 100 contributors per repository. When we see a repository with 100 or more developers, we add 100 developers to the total count and append a '+' to indicate 'possibly more'. Since a single contributor may contribute to multiple repositories, we remove duplicates from the aggregate count. The reported number is therefore conservative: there are at least that many unique developers. For (g), we use the most recently updated repository among all repositories of a particular organization, including forks.
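The contributor count described above can be sketched as follows. This is an illustration under stated assumptions rather than the site's actual code: the function name `aggregate_contributors`, the input shape (a mapping from repository name to contributor logins), and the sample logins are all hypothetical.

```python
from typing import Dict, List

# Per-repository contributor limit described in the text above.
CONTRIBUTOR_CAP = 100


def aggregate_contributors(repos: Dict[str, List[str]]) -> str:
    """Count unique contributors across repositories.

    A trailing '+' marks a lower bound: at least one repository hit the
    cap, so it may have more contributors than we can see.
    """
    unique = set()
    capped = False
    for logins in repos.values():
        visible = logins[:CONTRIBUTOR_CAP]  # never trust more than the cap
        unique.update(visible)              # set union removes duplicates
        if len(logins) >= CONTRIBUTOR_CAP:
            capped = True
    return f"{len(unique)}{'+' if capped else ''}"


# Hypothetical organization with two small repositories sharing one developer.
small_org = {"core": ["alice", "bob"], "wallet": ["bob", "carol"]}
print(aggregate_contributors(small_org))
```

Here `bob` contributes to both repositories but is counted once. An organization with any repository at the cap would be reported with a trailing '+'.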

Whether forks are included in the aggregate data generally depends on the time window that we aggregate over. Recent aggregates (i.e., the last 24 hours or 7 days) include forks, while older historical data excludes them. This ensures that, in general, we do not attribute inherited activity (e.g., a year's worth of Bitcoin activity) to recent cryptocurrencies that fork Bitcoin. At the same time, we want to include new, independent developments in the forked repositories of derivative cryptocurrencies. For example, it would be unfair to exclude code additions or stars that are specific to Litecoin's forked Bitcoin repositories. While imperfect, selective inclusion gives a more balanced view of development activity on an individual codebase than including either all or none of the fork data in the aggregate.
Restricting some metrics to a recent time window minimizes the effect of historical, unrelated activity, while still attributing fresh activity on a forked repository to the appropriate project.

Commits (b): By default, we include commits to forks in the aggregate only for the last 24 hours and 7 days, and exclude commits to forks from the aggregate for the last year.

Contributors (c): As with commits, we include contributor activity on forks only for the last 7 days, and exclude it from the all-time count of developers.

Code changes (a), stars (d), and subscribers (f) include forks. Code additions and deletions (a) include forks since these concern only the most recent 7 days of development; unless a repository fork is fresh (i.e., within a given week) it does not contribute to the aggregate counts. Stars (d) and subscriber counts (f) of forks are included, since these reset to 0 when a repository is forked. The current counts, as reported on GitHub, are unique to the owner.

Forks of forks (e) are excluded: the number of forks of a fork is not reset to 0 when the repository is forked. Instead, the number of forks is carried over from the original forked project. Since GitHub does not expose the number of forks for a project outside the original one, we do not include it in the aggregate.
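The fork rules above can be summarized as a lookup from metric and time window to an include-or-exclude decision. The sketch below is one possible encoding, not the site's implementation: the `Repo` class, the window labels (`"24h"`, `"7d"`, `"1y"`, `"all"`, `"current"`), the repository names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (metric, window)

# True means forked repositories count toward the aggregate for that key.
INCLUDE_FORKS: Dict[Key, bool] = {
    ("commits", "24h"): True,
    ("commits", "7d"): True,
    ("commits", "1y"): False,
    ("contributors", "7d"): True,
    ("contributors", "all"): False,
    ("additions", "7d"): True,
    ("deletions", "7d"): True,
    ("stars", "current"): True,
    ("subscribers", "current"): True,
    ("forks", "current"): False,  # fork counts carry over from the original
}


@dataclass
class Repo:
    name: str
    is_fork: bool
    metrics: Dict[Key, int] = field(default_factory=dict)


def aggregate(repos: List[Repo], metric: str, window: str) -> int:
    """Sum one metric over an organization's repositories, applying the fork rule."""
    include_forks = INCLUDE_FORKS[(metric, window)]
    return sum(
        repo.metrics.get((metric, window), 0)
        for repo in repos
        if include_forks or not repo.is_fork
    )


# Hypothetical organization: one forked repository and one original repository.
org = [
    Repo("forked-node", True, {("commits", "7d"): 5, ("commits", "1y"): 9000}),
    Repo("own-tools", False, {("commits", "7d"): 2, ("commits", "1y"): 40}),
]
print(aggregate(org, "commits", "7d"), aggregate(org, "commits", "1y"))
```

In this example the 7-day commit total counts both repositories, while the yearly total ignores the fork and its inherited history. Contributor counts would additionally need de-duplication rather than a plain sum.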

Do you include commit data, changes, etc., in feature branches?

Currently, we do not: we only consider the default branch of a repository. It is currently prohibitively expensive to process all branches of all repos, but this may change in the future.

I see some repositories are updated less than a day ago, but no commits or changes were made. What gives?

An update can be things like a change in a repository's description or wiki pages. We count such activities as updates.

Do these metrics reveal anything about software quality?

Many commits a good project does not (necessarily) make. Research in evaluating software quality is a longstanding and open problem [1, 2, 3, 4]. Metrics such as lines of code and commit history may impact qualitative attributes such as software maintainability [5], but do not speak directly to quality attributes. Quantitative measures (such as those on this site) also miss language-specific attributes (e.g., one line of Haskell may be more expressive than one line of JavaScript). However, research shows that quantitative metrics can be useful as covariates for software quality predictors [6]. At the least, our metrics are a good measure of developer activity for budding open source cryptocurrency developments.

Can't these code metrics be artificially influenced?

We use independent metrics to give a high-level view of a project's activity. This helps to avoid data skew and makes it harder to influence the numbers artificially. For example, a single commit could be broken up into 10 smaller commits, but the total number of lines changed would stay the same. Taking it up a level, using a variety of metrics (including number of developers and stars) gives greater confidence that the numbers reflect organic activity.
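The commit-splitting example can be made concrete with a small sketch. The numbers and the `summarize` helper are hypothetical and only illustrate why the commit count and the line count behave as independent signals.

```python
from typing import Dict, List

Commit = Dict[str, int]  # {"additions": ..., "deletions": ...}


def summarize(commits: List[Commit]) -> Dict[str, int]:
    """Report the commit count alongside the total number of lines changed."""
    lines = sum(c["additions"] + c["deletions"] for c in commits)
    return {"commits": len(commits), "lines_changed": lines}


# One commit that adds 120 lines and deletes 30 ...
single = [{"additions": 120, "deletions": 30}]
# ... versus the same change split evenly into 10 smaller commits.
split = [{"additions": 12, "deletions": 3} for _ in range(10)]

print(summarize(single))
print(summarize(split))
```

Splitting inflates the commit count tenfold, but the number of lines changed is identical in both cases, so the two metrics together make the manipulation visible.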

Some cryptocurrencies, like Bitcoin Cash, don't have a reference implementation. What do you do about that?

It is challenging to collect data on distributed projects that implement a protocol. For example, Bitcoin Cash has at least four (1, 2, 3, 4) related client implementations. It is also difficult to decide whether to include projects that relate to a protocol but are not strictly part of the reference organization (e.g., should we include the MetaMask project for Ethereum?). Due to these challenges, we (currently) only include projects with a dedicated reference organization or user on GitHub, and are investigating improvements.