First Monday


The Orbiten Free Software Survey

Free software is supposedly developed by a loosely organised "community" of programmers. Until now, however, it has been largely unknown who belongs to this community, apart from a few well-known celebrities, and, more importantly, how contributions are distributed among its members. The authors present a first survey of free software authorship, with the emphasis not on building a census or even a "hall of fame", but on identifying patterns of concentration and distribution of contribution. The sample of code is not necessarily representative, and the automated nature and vast scale of the task of identifying and crediting authors introduce several errors. Nevertheless, comprehensive data is collated for the first time, and can be scrutinised in detail on the survey Web site.

Contents

Introduction
Findings
Data
Scope and Method

Introduction

The free software (or open source) "community" is much talked about, though little hard data on this community and its activities is available. Here, for the first time, Orbiten Research provides a body of empirical data and analysis to explain and describe this community.

Simple facts, such as the number of developers contributing to free software projects, the number of such projects, and their size, have been until now unknown. The Orbiten Free Software Survey answers some of these questions, and aims to provide a foundation for empirical research on the free software community.

Building on the release of CODD over a year ago, the Survey measures and tracks over time several aspects of the free software economy. Factors analyzed include the concentration (or diversity) of contributions and contributors; the degree of intersection between projects and sharing of code; the participation of developers in different projects; and the volatility of changes to the code base and the developer base.

Important basic statistics and data were gained during the survey process, such as total size of free software available, amount of free software being released and/or modified each month, and a compendium of developers.

Over time the survey will become more comprehensive, providing an important source of information for academic researchers, free software users, and developers.

Findings

The primary findings of OFSS01 were basic: the number of developers authoring projects included in the survey (12,706), the size of the free software code base (1.04 Gigabytes, or roughly 25 million lines), and the number of identifiable free software projects (3,149). Given the total lack of data on the free software economy, rough indicators of its size (limited by the initial scope of the survey) are, we believe, a good start.

Secondary findings relate to the degree of contribution to the code base by individual authors. An author is defined for the purposes of this survey as the smallest identifiable grouping claiming credit for development of a software project. Unsurprisingly, the Free Software Foundation came out far ahead of anyone else, credited with 11% (124 Mb) of the entire surveyed code base and involved in 17% (546) of all identifiable projects. However, as with some other well-known Unix authors that rank highly in the survey, such as Sun Microsystems and the Regents of the University of California, the FSF's position in our charts stems largely from the lack of credit given to individual programmers. A list of the top few contributors sorted by code and involvement in projects is given below (see Data).

Further findings relate to the distribution of authors among projects, and code base contribution. The top 1,271 authors, 10% of the total, accounted for 72.3% of the total code base. The top 10 authors alone (0.08% of the total) are credited with 19.8% of the code base. Free software development may be distributed, but it is most certainly very top-heavy.

What goes for lines of code written goes for involvement in projects too. Only the top 25 authors (0.19% of the total) were credited with participation in more than 25 projects. The top 250 authors were credited with participation in over five projects, and the vast majority (over 77%) of authors were only involved in a single project. Our conclusion: Free software development is less a bazaar of several developers involved in several projects, more a collation of projects developed single-mindedly by a large number of authors.

Data

Number of identifiable authors:      12706
Uncredited/unidentifiable authors:   790
% of code base uncredited:           8.37%
Size of code base:                   1116500467 Bytes or 1067 Mb
Number of identifiable projects:     3149

Table 1: Top 10 authors ranked by contribution of code
Author                                      % of total
free software foundation                    11.231
sun microsystems                             1.848
regents of the university of california      1.359
gordon matzigkeit                            1.216
paul houle                                   1.042
thomas g. lane                               0.782
massachusetts institute of technology        0.762
ulrich drepper                               0.559
lyle johnson                                 0.528
peter miller                                 0.525


Table 2: Author contribution by decile
Authors                % of total
top 10 authors         19.854
top decile (1271)      72.320
2nd decile              8.928
3rd decile              4.062
4th decile              2.384
5th decile              1.515
6th decile              1.008
7th decile              0.672
8th decile              0.440
9th decile              0.239
10th decile             0.060
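
The decile figures in Table 2 can be turned into cumulative shares and checked against the uncredited share reported in the Data summary. The short sketch below is only an illustration built from the numbers printed above; the variable and function names are our own and are not part of the survey's tooling.

```python
# Table 2 figures as printed: share of the total code base (%) per author decile,
# from the top decile down to the 10th.
DECILE_SHARES = [72.320, 8.928, 4.062, 2.384, 1.515,
                 1.008, 0.672, 0.440, 0.239, 0.060]

# "% of code base uncredited" as printed in the Data summary.
UNCREDITED_SHARE = 8.37


def cumulative(shares):
    """Return the running total of shares, from the top decile downwards."""
    totals = []
    running = 0.0
    for share in shares:
        running += share
        totals.append(round(running, 3))
    return totals


cumulative_shares = cumulative(DECILE_SHARES)
credited_total = cumulative_shares[-1]

for rank, total in enumerate(cumulative_shares, start=1):
    print(f"top {rank * 10:3d}% of authors: {total:7.3f}% of code base")

print(f"credited + uncredited: {credited_total + UNCREDITED_SHARE:.3f}%")
```

The ten deciles sum to 91.628% of the code base. Adding the 8.37% that could not be credited gives 99.998%, so the two tables agree to within rounding.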





Table 3: Top 10 authors ranked by participation in projects
Author                                      Projects
free software foundation                    546
gordon matzigkeit                           267
regents of the university of california     156
ulrich drepper                              142
roland mcgrath                               99
sun microsystems                             66
rsa data security                            59
martijn pieterse                             50
eric young                                   48
login-vern                                   47


Table 4: Author participation in projects
Projects     Authors
> 25         25
6 - 24       211
3 - 5        928
Only 2       1924
Only 1       9617


Note: 211 authors participated in 6 to 24 projects, etc.
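
As a rough illustration, the counts in Table 4 can be converted into shares of all authors. The sketch below uses only the figures printed in the table; the names are hypothetical, and the denominator (the sum of the table rows) is our assumption, since the text does not state which total was used.

```python
# Table 4 as printed: (projects per author, number of authors).
PARTICIPATION = [
    ("> 25", 25),
    ("6 - 24", 211),
    ("3 - 5", 928),
    ("Only 2", 1924),
    ("Only 1", 9617),
]

# Assumption: shares are taken relative to the sum of the table rows.
total_authors = sum(count for _, count in PARTICIPATION)

# Percentage of authors falling into each participation band.
shares = {label: 100.0 * count / total_authors for label, count in PARTICIPATION}

# Authors credited with more than five projects (the first two bands).
more_than_five = PARTICIPATION[0][1] + PARTICIPATION[1][1]

print(f"authors in table: {total_authors}")
for label, share in shares.items():
    print(f"{label:>7}: {share:6.2f}% of authors")
print(f"authors with more than five projects: {more_than_five}")
```

The rows sum to 12,705 authors, one fewer than the 12,706 identifiable authors in the Data summary. On this denominator, single-project authors make up roughly 75.7% of the total, somewhat below the "over 77%" quoted in Findings; the text does not say how that figure was derived. Similarly, 236 authors fall in the two bands above five projects.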



Scope and Method

The first Orbiten Free Software Survey has been prepared based on over 18 months of work in identifying, tracking, and modeling interaction in the free software economy. Clearly this was not enough time, and the scope and methodology of the first survey are far from ideal.

The technical task of identifying credits in poorly documented source code was complex, especially given the vast and changing nature of the code base. Credits are often not available, and when they are, they rarely follow a set format. Various heuristics have therefore been applied, and "policy" decisions made on, for example, how to divide credit among multiple listed authors. Details can be found in the documentation for CODD.

The code base itself was limited. Although it is far from being a complete set of all code ever released without payment on the Internet - our ideal, eventual goal - we believe we have used a fairly representative sample of software projects (released under the GNU General Public Licence and its variants) developed in recent years.

The source code base for OFSS01 is:

For each module or package analysed, source code is broken into projects identified according to the package distribution. Source code and some documentation files are scanned for authorship, credit or copyright information, from which author names are identified. Data collected for each identified author includes the number of bytes of code authored and the number and names of projects authored. From this the degree of contribution, in terms of bytes of code, can be calculated for any given project. Project data is collated so that it can be examined at several levels.
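
The scanning and crediting steps described above can be sketched roughly as follows. This is not CODD's actual method, which is described in its own documentation. The regular expression, the equal split of a file's bytes among its listed authors, the "(uncredited)" bucket, and all function and file names are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical pattern for credit lines such as
#   "Copyright (C) 1998, 1999 Free Software Foundation, Inc."
# Real credits rarely follow a set format, so this will miss many of them
# and will also produce false positives.
COPYRIGHT_RE = re.compile(
    r"copyright\s+(?:\(c\)\s*)?(?:\d{4}(?:\s*[-,]\s*\d{4})*,?\s+)?(?:by\s+)?(.+)",
    re.IGNORECASE,
)


def extract_authors(text):
    """Return lower-cased author names found in copyright lines (may be empty)."""
    authors = []
    for line in text.splitlines():
        match = COPYRIGHT_RE.search(line)
        if match:
            # Trim comment delimiters and punctuation, then normalise the name.
            name = match.group(1).strip(" .*/").lower()
            name = re.sub(r",?\s+inc$", "", name)
            if name and name not in authors:
                authors.append(name)
    return authors


def credit_files(files):
    """Map each author to the bytes credited to them.

    files: mapping of file path -> source text.
    Assumed policy: a file's size is split equally among its listed authors;
    files without any recognisable credit go to "(uncredited)".
    """
    credit = defaultdict(float)
    for path, text in files.items():
        size = len(text.encode("utf-8"))
        authors = extract_authors(text) or ["(uncredited)"]
        for author in authors:
            credit[author] += size / len(authors)
    return dict(credit)


# Hypothetical miniature "project" used only to exercise the functions.
sample_files = {
    "a.c": "/* Copyright (C) 1998, 1999 Free Software Foundation, Inc. */\nint x;\n",
    "b.c": "Copyright 1997 Ulrich Drepper\nCopyright 1998 Roland McGrath\n",
    "c.c": "int y;\n",
}

for author, size in sorted(credit_files(sample_files).items()):
    print(f"{author}: {size:.1f} bytes")
```

Summing such per-file credits over all files of a project gives each author's contribution in bytes, and dividing by the total size of the code base gives the percentage shares of the kind reported in the Data section.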

In this survey, very basic analysis has been performed. The next survey will broaden the scope of analysis to include features such as the degree of cross-participation between projects and groups of authors.

The next survey - planned for June, 2000 - will also use a bigger code base. At the very least the code base will expand to include Sourceforge [http://sourceforge.net], OpenBSD [http://openbsd.org] and Perl CPAN libraries [http://cpan.org].

As the survey continues and becomes more frequent, we plan to track changes in the code base over time (including historical perspectives using older versions of, say, the Linux kernel) and monitor movement between projects and groups.

About the Authors

Rishab Aiyer Ghosh is Co-Programme Leader of the e-Basics Research Unit at the International Institute of Infonomics, a venture of Maastricht University supported by the European Commission. He has been programme leader at the Institute since its founding in January 2000, though he is currently still working out of New Delhi, India.

In 1994 Rishab developed the "Cooking-Pot Market" model of Internet Economics, a system of non-formal, transaction-less barter. He was invited by the European Commission to speak on information economics at the first Information Society Conference under their IST programme in 1998. Since January 1999, as an advisor to the EC Brussels headquarters for the Universal Information Ecosystems/Future Emerging Technologies programme, he has provided inputs on their policy for promoting research in European companies.
E-mail: rishab@dxm.org

Vipul Ved Prakash is a freelance hacker and network consultant. He has several years of experience developing and deploying mission-critical networked applications on the Unix platform. Vipul developed remote administration software for Silicon Graphics' failsafe servers, a high-availability Internet server solution. In 1996, Vipul developed Sense/NET, a proxy sockets meta-ISP, to provide non-censorable TCP/IP access over unclean telnet links. Presently, Vipul is heading the development of an e-commerce engine at sixcones.com, an online hypermall startup funded by Vinod Khosla.

Vipul is a network security expert and cryptography enthusiast. He has implemented various cryptographic primitives and algorithms as open-source Perl software. He publishes and maintains munitions, a comprehensive archive of cryptography software for the Linux operating system, which is redundantly distributed over more than 10 servers around the world.

In 1998, Vipul developed ricochet, an automated agent for tracking and reporting email spam. Ricochet is used by thousands of people and organizations to combat spam throughout the net. Working towards a proactive solution against spam, Vipul has created Razor, a distributed, collaborative, spam detection and filtering network designed to exploit the broadcast characteristic of spam distribution to throttle its propagation. Razor will be released in May 2000.

Apart from fighting spam, Vipul is interested in more general issues of network ostracism, collaborative filtering, and content rating. He is fascinated by emergent value systems and self-organizing dynamics of networked communities.
E-mail: mail@vipul.net

References

1. Orbiten. CODD documentation.

2. Rishab Aiyer Ghosh, 1998. "Cooking Pot Markets: An Economic Model for the Trade in Free Goods and Services on the Internet," First Monday, volume 3, number 3 (March), at http://www.firstmonday.org/issues/issue3_3/ghosh/index.html

3. Rishab Aiyer Ghosh. "Identifying, tracking and measuring activity in cooking-pot networks."


Editorial history

Paper received 1 May 2000; accepted 10 May 2000.



Copyright ©2000, First Monday

The Orbiten Free Software Survey by Rishab Aiyer Ghosh and Vipul Ved Prakash
First Monday, volume 5, number 7 (July 2000),
URL: http://firstmonday.org/issues/issue5_7/ghosh/index.html