Strategic Open Sourcing: Observations on Open Source Development of Machine Learning Toolkits on GitHub

Abstract

This study represents the first part of a multi-stage research project investigating the consequences of releasing important intellectual property as open source software. In this stage, we constructed an extensive data set and conducted an initial round of social network analysis on six of the most prominent machine learning (ML) projects, all of which are hosted in the GitHub software ecosystem. We expect our results to inform subsequent stages of the project, which will focus on understanding the relationships between these projects and their corporate sponsors. More generally, the process used to conduct these analyses should be of interest to scholars looking to extract insights from GitHub and other software development platforms.

Theoretical Background

The study of open source software communities has provided insight into the motivation of individual contributors (Lakhani and Wolf, 2005; Shah, 2006; Von Krogh et al., 2012) and the process of community development (Bagozzi and Dholakia, 2006; Bonaccorsi and Rossi, 2003). Scholars have established that firms practice “selective revealing” and that doing so rarely results in adverse consequences (Henkel, 2006; Henkel et al., 2014).

In recent years, this practice has become both widespread and increasingly consequential, as some of the largest technology firms have released some of their most strategically important projects to the wider world. In this study we examine the consequences of such behavior by performing social network analysis on six of the most prominent machine learning platforms hosted on GitHub. Three of these platforms were initially developed by private corporations, one by an academic institution, and two have remained in the open source domain since their inception. Machine learning, a core branch of artificial intelligence, is widely regarded as an important arena for computer science and for the technology firms looking to use it for commercial ends. The decision by these firms to release such important intellectual property on GitHub, and to look in part to the community for its continued development, is strategically significant.

Empirical Setting

In the first phase of this project, we performed social network analysis on six machine learning projects using source code repository data from GitHub. GitHub is a website that allows software developers to host and share their own coding projects as well as contribute to the projects of others. This system of sharing and contributing code creates a social network of project owners, project contributors, and followers. As a result, many large open source projects have risen through or had primary development moved to the GitHub platform, such as NASA’s Mission Control technology, the Bootstrap web development framework, and software platforms such as Node.js and Ruby on Rails. These projects can have dozens to hundreds of independent contributors.

Although GitHub is popular among independent hobbyists and developers who contribute to community-driven projects, corporate giants Google and Microsoft each chose GitHub within the past year to host their leading machine learning projects, TensorFlow and CNTK respectively. For this study, we analyzed the relationships among contributors to these projects and four other widely used machine learning toolkits, with the aim of better understanding the development communities surrounding them.

Our six focal projects are the aforementioned TensorFlow and CNTK, as well as Theano, Torch7, Caffe, and Deeplearning4j. These span a range of development periods and corporate sponsors. TensorFlow is Google’s ML toolkit, which the company brought to GitHub in November 2015. CNTK was released on an open source academic development platform in April 2014, but truly came into its own in the international developer community when Microsoft moved it to GitHub in January 2016. The others are smaller, independent projects that have been hosted on GitHub for years longer.

Data Extraction and Analysis

The process of data extraction, analysis and the ultimate creation of network and ego graphs required the team to develop novel methods of “big data” computation.

We constructed a dataset of the communities associated with our focal projects, using the GitHub application programming interface (API) to pull data about projects and their contributors. This dataset includes information about all of the contributors to the six machine learning projects we are studying (a total of 675 individuals), as well as their contributions to other projects on GitHub (almost 23,000 different projects). To create this dataset, we wrote a Python script to retrieve all of the contributors to the six focal projects, thus creating a two-mode affiliation network of projects and contributors. We then identified the full set of projects to which these individuals contributed, and retrieved information about each of these projects. This required many tens of thousands of calls to the GitHub API. This was a challenge because calling the API is resource-intensive and GitHub limits the number of API calls a single user can make per hour. To facilitate this process, we created a Python script that could switch to another user’s API key when one user had run out of calls. Using multiple keys on a high-performance server with 96 cores and 1 terabyte of RAM, constructing our dataset took several days.
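
To make the two-step collection concrete, the following is a minimal sketch and not the script used in the study. It assumes a hypothetical `KeyPool` class with an illustrative per-key quota (`calls_per_key` is not GitHub's actual limit), and it replaces real API requests with two stand-in fetch functions over made-up data, so that only the key-switching logic and the construction of the two-mode affiliation network are shown.

```python
from collections import defaultdict


class KeyPool:
    """Hand out API keys, moving to the next key once one runs out of calls."""

    def __init__(self, keys, calls_per_key):
        # calls_per_key is an illustrative quota, not GitHub's real rate limit.
        self.remaining = {key: calls_per_key for key in keys}

    def acquire(self):
        """Return a key that still has quota and charge one call against it."""
        for key, left in self.remaining.items():
            if left > 0:
                self.remaining[key] = left - 1
                return key
        raise RuntimeError("all keys exhausted; wait for the quota to reset")


def build_affiliation(focal, fetch_contributors, fetch_projects, pool):
    """Return {project: set of contributors} for the focal projects and for
    every other project their contributors have worked on."""
    network = defaultdict(set)
    people = set()
    # Step 1: contributors of each focal project.
    for project in focal:
        for user in fetch_contributors(project, pool.acquire()):
            network[project].add(user)
            people.add(user)
    # Step 2: every project each of those contributors has touched.
    for user in sorted(people):
        for project in fetch_projects(user, pool.acquire()):
            network[project].add(user)
    return dict(network)


# Stand-ins for real API requests (hypothetical data, no network access).
FAKE_CONTRIBUTORS = {"tensorflow": ["ann", "bob"], "caffe": ["bob", "cy"]}
FAKE_PROJECTS = {
    "ann": ["tensorflow", "numpy"],
    "bob": ["tensorflow", "caffe"],
    "cy": ["caffe", "numpy"],
}


def fake_contributors(project, key):
    return FAKE_CONTRIBUTORS[project]


def fake_projects(user, key):
    return FAKE_PROJECTS[user]


pool = KeyPool(["key-a", "key-b"], calls_per_key=3)
network = build_affiliation(
    ["tensorflow", "caffe"], fake_contributors, fake_projects, pool
)
for project in sorted(network):
    print(project, sorted(network[project]))
```

In a real run the two stand-in functions would issue authenticated requests with the key they are given, and the pool would refresh its counts when the hourly window resets; both details are left out of this sketch.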

Using Mathematica, we then derived six additional datasets corresponding to the ego networks of the six focal projects. To generate the ego networks, we deleted all project nodes that did not directly link with the “ego” project, as well as the edges connected to those nodes. Finally, we computed a variety of social network metrics on each ego network, including the number of edges and vertices, degree distributions, clustering coefficients, and graph assortativity. (This work is ongoing, and we expect to be able to present a full set of results at the conference.)
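
The ego-network step can be illustrated with a short Python sketch; the study itself used Mathematica, so this is an analogue under stated assumptions rather than the authors' code. It assumes that a project “directly links” with the ego when the two share at least one contributor, stores the two-mode network as a dictionary from project to contributor set, and uses made-up project and contributor names. Only the simplest metrics (vertex count, edge count, degree distribution) are computed.

```python
from collections import Counter


def ego_network(network, ego):
    """Keep only projects that share at least one contributor with `ego`
    (an assumed reading of "directly link"). Dropping a project also
    drops its edges, since edges are stored with the project."""
    ego_people = network[ego]
    return {
        project: set(members)
        for project, members in network.items()
        if members & ego_people
    }


def summarize(net):
    """Count vertices and edges and tabulate degrees in a two-mode network.
    Contributors left with no projects do not appear in this representation."""
    people = set().union(*net.values())
    # Each (project, contributor) membership is one edge.
    edges = sum(len(members) for members in net.values())
    # Project degree = number of contributors; contributor degree = number of projects.
    degrees = [len(members) for members in net.values()]
    degrees += [sum(person in m for m in net.values()) for person in people]
    return {
        "vertices": len(net) + len(people),
        "edges": edges,
        "degree_distribution": dict(Counter(degrees)),
    }


# Hypothetical toy data: project -> set of contributors.
toy = {
    "tensorflow": {"ann", "bob"},
    "caffe": {"bob", "cy"},
    "numpy": {"ann", "cy"},
    "torch7": {"dee"},
}

ego = ego_network(toy, "tensorflow")
stats = summarize(ego)
print(sorted(ego))
print(stats)
```

Here `torch7` is removed because it shares no contributor with the ego project. Clustering coefficients and assortativity would normally come from a graph library rather than hand-written code.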

Next Steps

The companies that shared their machine learning platforms on GitHub were able to tap into a wider network of volunteers to advance and promote their work. In the next phase of this project, we will examine these social networks to explore how characteristics of community and collaboration are related to both the growth and performance of these platforms. Ultimately, our aim is to clarify the consequences and potential benefits of strategic open-sourcing.

Authors

Jared Briskman (Olin College)
Serena Chen (Olin College)
Anne Ku (Olin College)
Ian Paul (Olin College)
Jonathan Sims (Babson College)
Jason Woodard (Olin College)

Topic Area

Communities: User Innovation and Open Source

Session

TATr2B » Communities: User Innovation & Open Source (Papers & Posters) (15:45 - Tuesday, 2nd August, Room 112, Aldrich Hall)

Paper

OUI_2016_Submission.pdf
