Open Science
I have worked on a variety of open-source and open-science efforts since my days as an undergraduate at MIT, when Richard Stallman founded the Free Software Foundation and wrote the GNU General Public License. I opened patentable code in my PhD dissertation (1995, a non-interactive image stitcher called Jiggle), and I participated in the development of the first versions of Red Hat Linux as a bug hunter and reporter, contributing sufficiently to be invited to participate in their IPO, which was (perhaps) the first significant return on the massive volunteer effort of free and open-source software developers and testers.
I installed one of the first Linux workstations at NASA’s Goddard Space Flight Center in 1996, demonstrating to the researchers in my building that old hardware could get new life with Linux, and that scientific workstations could be had for much less than commercial Unix workstations. I built desktops and Beowulf supercomputers running Linux as a consultant in the late 1990s and early 2000s.
At this point, I began to look at the feasibility of moving all of my research to open-source tools, and discovered that we lacked an open-source competitor to the Interactive Data Language and Matlab. NOAO’s Image Reduction and Analysis Facility was open, but dated. So, I convened, with Paul Barrett, a panel at one of the very early Astronomical Data Analysis Software and Systems meetings, with a representative from each of the free and commercial platforms, and Python emerged as the likely future platform. However, what was then called Numeric lacked certain necessary features for astronomers and there was a dispute that led to a schism. Numarray was written for astronomers, statisticians stuck with Numeric, and the field stagnated.
Travis Oliphant reunited the two worlds, creating SciPy and abstracting its core routines into a package for priority development, NumPy. In Fall 2007, I taught AST 4762 Astronomical Data Analysis and AST 5765, its advanced graduate version. I decided not to pursue funding for an IDL license and instead taught the class in Python, using NumPy and SciPy. It was a disaster! For the 2,000+ routines in NumPy there were just 8,000 words of documentation – an average of four words per function. Students were lost.
So, in summer 2008, I founded the SciPy Documentation Project. I convened a team of developers, who wrote an interface that would let contributors write docs, comment on them, put them through a peer-review system I had defined, enter them directly into the SciPy and NumPy source code, and track progress. We saw what routines were in what status (untouched, being written, ready for review, reviewed, etc.) and who had written how much each week. I defined the standard documentation template, still in use today, that requires code authors to define every input and every output, state their defaults, explain the methodology, and even provide references to literature.
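As a rough illustration of what such a template asks of a code author, here is a minimal sketch of a docstring laid out with the section headings used in NumPy and SciPy docstrings (Parameters, Returns, Notes, References, Examples). The function `sigma_clip_mean`, its arguments, and the reference entry are hypothetical and invented for this example. They are not part of NumPy or SciPy.

```python
import numpy as np


def sigma_clip_mean(data, nsigma=3.0, maxiter=5):
    """
    Compute the mean of an array after iteratively rejecting outliers.

    Hypothetical example routine, written only to show the docstring layout.

    Parameters
    ----------
    data : array_like
        Input values. The array is flattened before processing.
    nsigma : float, optional
        Rejection threshold in standard deviations. Default is 3.0.
        Values of at least 1 are assumed.
    maxiter : int, optional
        Maximum number of rejection passes. Default is 5.

    Returns
    -------
    mean : float
        Mean of the values that survive clipping.
    nkept : int
        Number of values that survive clipping.

    Notes
    -----
    On each pass, values farther than ``nsigma`` standard deviations
    from the current mean are discarded. Iteration stops when a pass
    rejects nothing or after ``maxiter`` passes.

    References
    ----------
    .. [1] Placeholder entry: a real docstring would cite the paper or
       textbook that describes the method.

    Examples
    --------
    >>> sigma_clip_mean([1.0, 1.0, 1.0, 1.0, 100.0], nsigma=1.5)
    (1.0, 4)
    """
    values = np.asarray(data, dtype=float).ravel()
    for _ in range(maxiter):
        mean = values.mean()
        std = values.std()
        # Keep only the points within nsigma standard deviations of the mean.
        keep = np.abs(values - mean) <= nsigma * std
        if keep.all():
            break  # nothing was rejected on this pass, so stop iterating
        values = values[keep]
    return float(values.mean()), int(values.size)
```

Because the text lives in the function's docstring, `help(sigma_clip_mean)` displays it in an interactive session, and documentation tools can read the same source.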
I hired a documentation editor whose job it was not just to write docs but to convene weekly meetings of contributors, keep the mailing list active so that people would join the effort and write docs, and perform the integration of docs into the code. I wrote docs and met with the team weekly once things were going, presented status reports at the annual SciPy Conference, and wrote proceedings articles describing progress.
By the Fall of 2008, we had documented the most frequently used routines and my class ran without a hitch. By the time I handed the project off in 2011, over 75 PhD-level volunteers from around the world had written over 100,000 words, and all the external routines (those used by normal users) were well documented. These docs were automatically formatted as help pages in NumPy and SciPy, as PDF manuals, and as web pages, all from the same sources within the code of each routine. Ultimately, over 150 authors contributed. The only compensation received by anyone other than the two editors was a T-shirt reading “SciPy Documentation Marathon 2008: I WTFM”, which we gave to anyone who wrote 1,000 words or more. Thus, we crowd-sourced the NumPy and SciPy documentation before that term was in use. The two documentation editors and I are co-authors on the seminal Virtanen et al. (2020) paper on SciPy, which has garnered over 4,000 citations in two years.
I have taught my NumPy/SciPy/Matplotlib-based Astronomical Data Analysis classes nearly every year since 2007, save two. However, many students lacked the basic programming knowledge required. So, I developed PHZ 3150 Introduction to Numerical Computing, a first-exposure-to-programming class taught entirely in Python, with a substantial Linux command-line component. UCF is the third-largest university in the US, with over 70,000 students. It is a Hispanic-serving institution: over 25% of students are Hispanic and nearly half are non-White. The State University System, and UCF especially, places a strong value on providing educational access through the second-lowest tuition in the nation (after Wyoming, which is smaller than Orlando) and low course costs. This is a zero-cost course, using only open textbooks and open-source software. Several faculty are now trained or in training to teach the course, which is in high demand and is being integrated into the core courses of the physics BS degree.
I am an astronomer known for the early characterization of exoplanet atmospheres. I was on one of the two teams that detected the first light from an exoplanet and I first separately detected day and night on one. In 2013, I led a team in writing a retrieval model, a code that infers the chemicals and thermal structure in an atmosphere from measurements of its spectrum. Even though we had open-source languages, I had been unhappy with the small number of open-source data analysis tools. I was also confounded by my inability to reproduce work I found in papers. The models were too complex to reimplement, and even the settings were too many to describe in a paper. Yet, if we wrote our code in the open, surely some enterprising researcher would poach it before we had written our user manual and would publish with it ahead of us. They could even write a description paper and claim citations!
So, I wrote the Reproducible Research Software License (RRSL), which is a permissive open-source license wrapped in two conditions. First, you cannot publish a paper using the code until the authors have. Second, when you do publish in a reviewed journal, you must publish everything needed to run the code, any modifications you made, and the outputs, all in machine-readable form. It was first applied to a component code, Thermochemical Equilibrium Abundances, in 2015. Journals have enforced the license, requiring authors to publish a reproducible-research compendium with all the code settings, data, the code itself, and outputs. It also applies to the retrieval code itself, Bayesian Atmospheric Radiative Transfer (BART; https://github.com/ExOSPORTS/BART/).
I was aware from the beginning that this would be a controversial step, and I knew that coercion is not the best way to change practice. So, while the RRSL provides instructions for applying it to other codes, I do not expect many will. Rather, the point, carried out through my advocacy at conferences, was to start the reproducible-research (RR) conversation in the exoplanet community, and to some degree in the wider planetary science and astrophysics communities. We now see some researchers voluntarily making RR compendia for non-licensed codes, and some reviewers now requesting them for analyses that are particularly challenging to describe in a paper. A few authors are starting to sell grant proposals with the promise that results will be RR, and, even before NASA began to require it in some programs (see below), some used the promise of open-sourcing software developed under a grant to sell the proposal.
Similarly, when I applied to the Spitzer Space Telescope to run a target-of-opportunity program for newly discovered exoplanets, I knew my competition would put in a similar proposal. Since he was at a much better-known institution, I proposed to waive my one-year proprietary period, and I won the program. Short or no proprietary periods are now becoming common. This makes science inherently more open.
In 2017, NASA and NSF began thinking about moving toward open source and open science, under mandate from Congress and from some in the community. They contracted a National Academies study to provide open-source options for grant programs.
I was on the panel, which discussed a wide range of topics. Would open-sourcing the software of community centers of excellence in Earth modeling destroy the resident communities that made all the advances? Fields with younger practitioners were much more eager for open-source software than established fields with older people. How do we overcome inertia among the older population? Is it safe to take expert-level software, not crafted by engineers but cobbled together by scientists, and put it in the hands of anyone with a CPU but no expertise? Would reviewers and readers be able to tell the difference between an expertly used code and something used crudely? What about vital community codes with so many contributions over the years that copyright is unknown for much of them, making them unshareable? How about proprietary codes of consultants used in NASA projects? Should we continue to fund software work at NASA centers, which typically make OSS publication very hard?
The 2018 report, Open-Source Software Policy Options for NASA Earth and Space Sciences, discusses multiple paths, many of which have been implemented. We put two OSS in-jokes on the cover. Can you find them?
Isaac Newton’s famous quote about standing on the shoulders of giants only works if we all share. We all stand on giants’ shoulders; it’s the only reason science has ever worked. It is selfish to put spikes on our own shoulders to prevent others from standing on them, and it is unethical to do it with public money.
Paper
Authors | Title | Published Date | Journal |
---|---|---|---|
Virtanen, Pauli; Gommers, Ralf; Oliphant, Travis E.; and 32 coauthors | SciPy 1.0: fundamental algorithms for scientific computing in Python | Feb 2020 | Nature Methods 17, 261 |