Prof. Joseph Harrington
Planetary Sciences Group
Department of Physics
University of Central Florida
Orlando, Florida 32816-2385
USA
Version 0.3
8 March 2016
(C) Copyright 2016 Joseph Harrington
The goal of this license is to restore reproducibility and synergy to
science by enforcing the publication of the computer codes behind
published scientific work.
Preamble
“If it’s not reproducible, it’s not science.”
With the advent of computers, calculations such as data analyses and
models have become so complex that written descriptions in research
articles cannot adequately describe them. There are too many
calculations in even an average code to describe in a normal-length
journal article. Most of the computational decisions to be documented
would interest just a few of the hundreds or thousands of readers.
Researchers describe what is important to them, omitting minor but
crucial details requires to reproduce the calculation. It is likely
that the paper would describe what the researcher thought was in the
code, rather than what was really there. That is, there are bugs in
any large code that may alter the calculation from what was intended.
The ultimate documentation is the source code itself, though this is
no excuse for not writing good documentation, both as comments in the
code and in separate documents.
There are now numerous calculations that, when others attempted to
implement them from the descriptions in papers, turned out to be
irreproducible, despite being from respected researchers writing in
top, peer-reviewed journals. Often, one cannot attempt to reproduce a
calculation simply because it would take years to write a similar code
from scratch.
Further, the progress of science has been slowed dramatically because
of the time required to re-implement complex calculations already
coded by earlier researchers. In science, we stand on the shoulders
of giants. In the past, if a researcher invented a model for a
physical situation that was instantiated in a sequence of equations,
those equations had to be published for the researcher to get credit
for their discovery. Then, anyone could use them.
This is no longer true if the discovery is instantiated in code: the
model cannot be evaluated by peers, nor can anyone build upon it
without spending months or even many years writing a similar code.
Thus, the current practice of not publishing codes allows researchers
to get credit for their discoveries while simultaneously preventing
serious peer review of methods and use of the discoveries by potential
competitors.
In an ideal world, journals and the peer-review system would require
the disclosure of codes used in research, as they have traditionally
required the disclosure of the equations in a model. Until this comes
to pass, the research community can enforce the publication of some
codes through licensing.
One might ask, why not use an existing free/open-source software
license? License proliferation is a big problem already. We have
sought such a solution, but existing licenses ignore the use case of
science. Those licenses all refer to the propagation of code. The
GNU licenses, for example, enforce the propagation of source code when
object code is propagated, or when a server containing the software is
offered online as a service. In the case of science, we are dealing
with the case where there is no propagation. For example, consider a
scientist who downloads a code, modifies it in a creative way, and
publishes a series of results, receiving great credit. This
researcher has now built a moat: others cannot see the code to check
its correctness, nor can they carry out calculations for themselves
that benefit from the researcher’s value-added work. The giant stands
on the shoulders of another giant, but the top giant’s shoulders are
now covered in spikes!
There is one existing license that does attempt to force this example
researcher to share: the Community Research and Academic Programming
License (http://matt.might.net/articles/crapl/). This rather
humorously written license served as an inspiration to the current
license, some of its definitions are included, and we encourage others
to study it. However, some of its terms do not match our needs. We
feel that it applies much better to casually written research codes
and less well to codes designed from the outset to be community
efforts (i.e., true open-source codes). The present license is
designed for both kinds of codes.
Licensing is at best an unfriendly solution, as it ties the user’s
hands legally. Some choose not to use codes with restrictive
licenses, and some licenses have conflicting terms, preventing the
fruitful combination and remixing of codes licensed under each. It is
my hope that, someday, publishing codes will become our common
expectation, enforced socially and required by journals. Then,
licenses like this one will be unnecessary, and we can return to
standard licenses that do not mention how a code or its output are
used.
In the mean time, the terms below are designed to ensure that all may
both examine and reap the maximum benefit from the codes that produce
research results. We recognize that there will be legitimate use
cases where this license will be a hindrance. We remind the reader
that the owners of the code’s rights may always make independent
agreements on a case-by-case basis that allow use under different
licensing arrangements from those offered to the general public.
Please contact a code’s owners to discuss such arrangements.
Finally, this license is a work in progress, and future versions with
updated terms will no doubt be released. If you wish to discuss this
license or the topic of reproducible research, please join the
discussion group by visiting:
https://physics.ucf.edu/mailman/listinfo/reproducible
–jh–
Prof. Joseph Harrington
University of Central Florida
Orlando, Florida
USA
I. DEFINITIONS
A. Location Identifier
A small data item in a common, standard format that enables
appropriate software to retrieve information from computer networks.
B. Permanent
Intended and reasonably expected to remain available and unchanged
indefinitely.
C. Easily Discoverable Public Archive (EDPA)
A service that provides permanent storage and free retrieval of data.
The service must issue Permanent Location Identifiers to the data sets
it stores. It must be broadly available on popular public computer
networks. It must make available summary information about the data
sets it stores that popular search services use to help people
identify and locate data sets relevant to their interests.
D. Online Revision Control System (ORCS)
A service that provides permanent storage and free retrieval of
software, including tracking of changes, attribution of changes to
their authors, and labeling and retrieval of prior versions. The
service must issue Permanent Location Identifiers to the software it
stores. It must be broadly available on popular public computer
networks.
E. Program
The main set(s) of computer instructions to which this license
applies, and all source code, executable code, build files, scripts,
configuration files, data, documentation, licenses, and other files
associated with those instructions and normally distributed with them.
The Program does not include data supplied originally by the user or
third parties, even if such data are required for the program to
operate.
F. Unmodified Program
Any Program version that has appeared in an EDPA or an ORCS.
G. Modified Program
Any Program version that does not meet the criteria for an Unmodified
Program.
H. Third-Party Data
Information not included in the Program and not originating with the
user that is required by the Program.
I. Scientific Claim
Any statement purported to be supported by evidence.
J. Peer
One who has made similar Scientific Claims to those being made by an
individual.
K. Peer-Reviewed Literature
The set of publication media normally used for making Scientific
Claims for the first time, in which Peers evaluate and comment on the
quality of new claims as a precondition for those claims appearing in
that medium.
L. Published Scientific Description
One or more items in the Peer-Reviewed Literature describing any
portion of the Program and/or justifying any of its algorithms.
M. Top Level
The folder or directory of the Program containing all the Program
files, when listed recursively.
N. You
Anyone who does any of the following with the Program: uses it,
modifies it, copies it, downloads it, uploads it, shares it, assigns
employees or advisees to work with it, publishes it.
O. Reproducible-Research Compendium (RRC)
A data collection under one Top Level directory or folder containing:
1. The exact version of the Program (including modifications) used to
support a Scientific Claim, or each version if more than one was used.
It must be made clear which version applies to which claim. For any
version that is an Unmodified Program, an easily navigated reference
to the item in an EDPA or an ORCS may be substituted for the actual
Program. The Program inclusion or referenced archive must contain the
full source code, documentation, and all other relevant files.
Removal, obscuration, or obfuscation of the source code,
documentation, or other files is not permitted.
2. The input data and all configurations (whether stored in files,
command-line arguments, settings made through a graphical user
interface, or otherwise) used to produce the program behavior and
output offered in support of a Scientific Claim. If random processes
are used by the Program and it is possible to determine and set the
random number seed(s) used in support of a Scientific Claim, all such
seeds must be reported as well. Otherwise, the supplied configuration
must produce results that are statistically convergent to the Program
run used to produce the support or data. Standard, publicly available
Third-Party Data may be omitted if the RRC provides a clear reference,
easily followed by Peers without cost, to a permanent and unchanging
archive of the exact version of the Third-Party Data used to support a
claim. If You used proprietary Third-Party Data for which a
good-faith effort failed to obtain permission to place it in an RRC,
You must identify those data and their source(s), and include
information by which others can obtain them.
3. All the output produced by the Program in the Program runs used to
support a Scientific Claim, including status and error messages, in
their original forms.
4. Any spreadsheets, codes, or other calculations implemented by You
and used to process the data from the Program into display or tabular
form used to support a Scientific Claim, and any processed forms of
the program output produced by the same, whether published or not.
The intention of this requirement is that aggregation, statistical
analysis, and plotting codes written or configured by investigators be
included in the RRC. This requirement explicitly does NOT apply to
codes that take Program output as input to further calculations of the
general type under study, PROVIDED that: a. The data passing from the
Program to such codes is archived in human- and machine-readable
format in the RRC, and that b. You identify the code(s) and their
source(s), describe how they used the data, and include information by
which others can obtain the code(s).
5. Machine- and human-readable form of information deriving from the
Program and presented in the peer-reviewed publication as evidence
supporting a Scientific Claim, including without limitation values on
or in plots, graphs, videos, other data visualizations, tables,
auditory tones, or any other forms presented, even if they are from
code(s) falling under the exception in the prior paragraph.
6. A clear set of instructions for how to run the Program with the
supplied and referenced settings and data to support each Scientific
Claim covered by the RRC.
II. GENERAL REQUIREMENTS
A. Any publication of the Program, a Modified version of the Program,
information produced by the Program, or Scientific Claims supported
by the Program must acknowledge the Author(s) and, if appropriate
to the medium of publication, must reference the Published
Scientific Description of the Program in the usual manner for that
medium. If instructions for doing so are included in a file in the
Top Level of the Program, they must be followed and may not be
altered.
B. If You Modify the program, You must clearly identify Yourself,
including postal and electronic contact information, and You must
briefly summarize Your modifications to the Program, in a Top-Level
file recording changes. If You fail to do so, You forfeit all
rights with respect to Your Modifications.
C. This license, as applied to the Program, may not be altered.
Modified versions of the Program must also use this license, except
as noted below, and must include this license file as it is
included in the Unmodified Program.
D. If any Modifications include pre-existing, modular software
licensed under another license and for which a good-faith effort
does not obtain permission to include that software under this
license, then that portion of the Modified Program may be excluded
from this license. If the terms of its license allow its free
redistribution, it must be included in RRCs produced for this
Program. If the terms of its license do not allow free
redistribution, then You must identify that software and its
source, and include information by which others can obtain it.
E. You must not make false, misleading, or exaggerated statements
regarding the Program, its authors and contributors (including any
contact information), Your or others’ Modifications to the Program,
or the credit due to You or others for the Program and its
Modifications.
III. PERMITTED USES AND THEIR RESTRICTIONS
Subject to the General Requirements above:
A. You are permitted to use, modify, distribute, and make derivative
works from the Published Program or any portion of it for any
purpose unrelated to producing or validating Scientific Claims.
B. You are permitted to use and/or modify the Published Program to
validate Scientific Claims.
C. If You publish, in the peer-reviewed literature, any Scientific
Claims or data that were supported or generated, in whole or in
part, by the Program, Modified or not, You must publish an RRC
documenting those data or the support for those claims.
D. If the version of the Program You received includes Unpublished
Modifications made by others whose Unpublished Scientific Claims
You are evaluating, You are permitted to use the Program as
Modified, and to make further Modifications, to validate those
Claims, under the condition that You keep all Modifications
including Yours confidential until those Claims have been Published
in the Peer-Reviewed Literature or the Modifications You received
have been Published. You may not use the Modified Program for
other purposes without written permission from the authors of any
Unpublished Modifications. Once all Modifications have been
Published, this paragraph no longer applies. If You can apply only
Your Modifications to any Published version of the Program, this
restriction does not apply to that Modified Program.
E. If the version of the Program You received includes Unpublished
Modifications, You may only publish Claims or Data in the
Peer-Reviewed Literature supported by or derived from that version
of the Program if You legally may, and do, publish those
Modifications as specified in III.C. above.
F. If You, or a service that You own or operate, make results from the
program available to other users, You must first obtain the
agreement of those other users to be party to this license. You
must tell them where You received the Program and what version You
received. You must describe to them any Modifications You made to
the Program. If there are Modifications, You must advise them
before You give them the results that they may not publish the
results You have given them in the Peer-Reviewed Literature until
those Modifications have been published.
IV. Disclaimer of Warranty
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING, THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM “AS IS”, WITHOUT
WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND
PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE
DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR
CORRECTION.
V. Limitation of Liability
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR
CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES
ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING, BUT
NOT LIMITED TO, LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR
LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM
TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER
PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.