Repositories


In which repositories should research software be stored?

First, the purpose of “storing” should be clarified:

  • Is it about permanent “archiving” to document the state of the software at a specific point in time (e.g., the publication of a paper)? In that case, research data repositories such as Edmond or Zenodo are suitable, as is the Software Heritage service.
  • Or is it about storing the source code in a suitable software repository for the purpose of version control and further development? In that case, dedicated software repositories are suitable, including github.com.

Both variants can be combined so that selected snapshots (“Tags”, “Releases”) are automatically archived on Zenodo or Edmond and made referenceable under a Persistent Identifier (DOI) (see also the FAQs on PIDs and citations).

Next, a choice must be made between a globally open but commercially operated service and one that is only open to a certain group of users. Both have advantages and disadvantages. Globally open services usually have a larger user base, which makes it easier for that community to comment on, extend, and maintain the code. On the other hand, the commercial interest of the service providers must be considered: they benefit from the published content, particularly where their data protection policies are laxer. In contrast, institutionally hosted services offer a greater degree of data protection and protection of copyrights, but are only open to a limited circle of users, which makes it harder to disseminate the code and involve colleagues. Often, it can be sensible to start development on an institutionally hosted platform to reduce the risk of premature dissemination before publication. After successful publication, the code can then be moved to a commercial platform to take advantage of the increased visibility and community involvement. Common platforms like github.com support importing existing repositories via a simple import form.

Software repositories

1. Gitlab

Members of the MPG have access to the Gitlab instance operated by the GWDG, and users of the MPCDF also have access to https://gitlab.mpcdf.mpg.de. Instances operated by individual institutes are usually only open to the respective institute members.

GWDG Gitlab
  • Website: https://gitlab.gwdg.de
  • Access: Via the AcademicCloud, i.e., logging in with the uniform username and password used for all GWDG services (e.g., also ownCloud, Overleaf, Matrix chat, BigBlueButton video conferencing, etc.). In most cases, the username is identical to the primary email address.
  • Version control: GitLab uses the Git version control system. Users can synchronize their local repositories with GitLab and edit the source code together with colleagues. In addition, content stored in GitLab can also be edited entirely online.
  • Data types: Primarily intended for software (source code), but supports all text-based file formats, such as LaTeX manuscripts, Markdown texts, or tables in CSV format. Less suitable for binary data like compiled software, PDF files, image files, Excel, etc.
  • Further information: https://docs.gwdg.de/doku.php?id=de:services:email_collaboration:gitlab:start
MPCDF Gitlab
  • Website: https://gitlab.mpcdf.mpg.de
  • Operator: Max Planck Computing and Data Facility MPCDF
  • Access: Initially, the use of MPCDF services must be requested, usually via the IT service unit of the respective institute. Gitlab can then be activated in the self-service area.
gitlab.com
  • Website: https://gitlab.com
  • Operator: GitLab Inc.
  • Access: Globally open. Login with Google, GitHub, and other accounts via OAuth is possible.

2. Github

“GitHub” usually refers to the web service https://github.com. However, like GitLab, GitHub can also be hosted by a single institution.

Github of the MPI for Molecular Genetics
github.com
  • Website: https://github.com
  • Operator: GitHub, Inc., a subsidiary of Microsoft
  • Access: Globally open. Login with Google and other accounts via OAuth is possible.

3. Software Heritage

The web service https://www.softwareheritage.org is not a repository like GitHub or GitLab, where software can be actively developed and maintained, but rather sees itself as an archive. As such, it automatically crawls common services like github.com and gitlab.com and archives the software projects found there as snapshots. Software projects hosted on platforms that are not crawled automatically, such as institutionally hosted GitLab instances, can be archived manually via https://archive.softwareheritage.org/save/. In the process, a PID is created: a SoftWare Hash IDentifier (SWHID), an intrinsic identifier computed from the archived content itself. This PID can then be cited in a scientific publication to reference the software. One advantage of SWHIDs is that even fragments of the source code can be referenced. More about SWHIDs at https://www.softwareheritage.org/save-and-reference-research-software/.
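To illustrate what “intrinsic” means here, the following minimal sketch computes the SWHID of a single file's content and adds a `lines` qualifier to point at a fragment. It assumes the construction used for content objects, which matches the hash Git computes for a blob; the function names are hypothetical and not part of any Software Heritage tool. Identifiers for directories, revisions, or releases are built differently and are not covered.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Return the SWHID of one file's content (object type "cnt")."""
    # Same construction as a Git blob hash: SHA-1 over the header
    # "blob <length>\0" followed by the raw bytes of the file.
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def with_lines(swhid: str, first: int, last: int) -> str:
    """Append a "lines" qualifier to reference a fragment of the file."""
    return f"{swhid};lines={first}-{last}"


if __name__ == "__main__":
    core = swhid_for_content(b"print('hello')\n")
    print(core)
    print(with_lines(core, 1, 1))
```

Because the identifier is derived only from the bytes of the file, anyone holding a copy of the file can recompute it and check that it matches the one cited in a publication.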

Generic repositories

  1. Edmond

  2. Zenodo
  • Website: https://zenodo.org
  • Operator: European Organization for Nuclear Research (CERN, headquartered in Switzerland)
  • Data types: Datasets, presentations, manuscripts, video or audio material, software
  • Limit: 50 GB per upload, unlimited number of uploads (as of 11/2023)
  • Registration: Email address or federated via ORCID (https://orcid.org), GitHub, OpenAIRE
  • PID: Each upload receives a DOI. Every updated version gets its own versioned DOI, and an additional DOI always resolves to the current version.
  • Collections: So-called “Communities” can be set up to collect datasets, e.g., a research collaboration, a network, or an institute.
  • API: REST API at https://zenodo.org/api; see the documentation at https://developers.zenodo.org/.
  • GitHub integration: A GitHub software repository can be configured so that each release automatically creates a new version of an upload on Zenodo. More on this at https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content.
  • GitLab integration: gitlab2zenodo is a Python package to automatically publish snapshots of a software repository on Zenodo.
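
As a rough illustration of the REST API mentioned above, the following sketch builds a search request against the public `records` endpoint and extracts DOI and title from the JSON response. The helper names are hypothetical, and the assumed response layout (`hits.hits`, `doi`, `metadata.title`) should be checked against the documentation at https://developers.zenodo.org/ before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ZENODO_API = "https://zenodo.org/api"


def build_search_url(query, size=5):
    """Build the URL for a full-text search over published records."""
    params = urlencode({"q": query, "size": size})
    return f"{ZENODO_API}/records?{params}"


def summarize(response):
    """Extract (DOI, title) pairs; the response layout is an assumption."""
    hits = response.get("hits", {}).get("hits", [])
    return [
        (hit.get("doi", ""), hit.get("metadata", {}).get("title", ""))
        for hit in hits
    ]


def search(query, size=5):
    """Run the search over the network and return (DOI, title) pairs."""
    with urlopen(build_search_url(query, size), timeout=30) as resp:
        return summarize(json.load(resp))
```

Calling `search("research software")` would return a short list of DOI and title pairs. Searching published records needs no account, whereas creating uploads through the API requires a personal access token.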

Other research institutions, such as the Helmholtz Association, universities, but also individual federal states, now offer their own research data repositories that also support research software.

Under what circumstances should research software (source code) not be made publicly accessible?

(see also the FAQs on Security and Ethics and the FAQs on Legal Aspects.)

Hard criteria

The following criteria clearly speak against publishing research software:

  • The supervisor has not agreed to the publication. How the supervisor’s consent is obtained should be clearly regulated. One possibility is that the supervisor is a member of the software repository and (possibly as the only one) has the right to make a release.
  • The software has dual-use or military application possibilities (see OHB MPG VII.02.01 “Foreign Trade Law – Establishing MPG-wide Export Control”).
  • The software is not under an open-source license.
  • The code contains personal or otherwise sensitive data (e.g., usernames, passwords, file paths).
  • The software is created within a collaboration whose contract contradicts the publication of the software. This is often the case with industrial collaborations. Even with other collaborations (e.g., special research areas), the publication of software should be contractually regulated.
  • The software uses, links, or further develops code that is not under a license compatible with publication.
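
For the criterion on personal data, a simple automated check before publication can catch obvious cases. The sketch below scans source text for hard-coded credentials, home-directory paths, and email addresses. The patterns and names are illustrative assumptions, not a complete or authoritative rule set; a clean result does not prove that the code is free of personal data, and the Git history needs to be checked as well.

```python
import re

# Illustrative patterns only; extend them for the project at hand.
PATTERNS = {
    "password or token assignment": re.compile(
        r"(?i)(password|passwd|pwd|secret|token|api_key)\w*\s*[:=]\s*['\"][^'\"]+['\"]"
    ),
    "home directory path": re.compile(
        r"(/home/|/Users/|[A-Za-z]:\\Users\\)[\w.-]+"
    ),
    "email address": re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
}


def scan_text(text):
    """Return (line number, finding label) pairs for suspicious lines."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


if __name__ == "__main__":
    sample = 'db_password = "hunter2"\ndata = "/home/alice/run1.csv"\n'
    for lineno, label in scan_text(sample):
        print(f"line {lineno}: {label}")
```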

Soft criteria

Apart from the above hard criteria, it is up to the developers to decide whether and when to make the research software publicly accessible. Some developers prefer to make the code publicly available from the beginning to give the community the opportunity to participate. Others want to achieve a certain level of product maturity or fear that their ideas will be copied prematurely and they will lose their development lead. Another reason against publishing the software is the sometimes expressed concern that the code could be used improperly or incorrectly, produce false results, and thus damage the reputation of the authors.

The journal “Demographic Research” has created a checklist (pdf) for this purpose.

Where can I make my software publicly accessible?

The FAQ “In which repositories should research software be stored?” lists various specific and generic repositories where software source code can be stored. All services mentioned there allow either restricting access (“Private Repository”) or making the repository publicly accessible.

Which development platforms promote communicative, joint software development that allows for respectful mutual criticism?

The software repositories GitLab and GitHub mentioned above include various features that let interested parties comment, report bugs and issues, suggest extensions, and even participate directly in the project. Comments, bug reports, etc. are usually recorded in a so-called “Issue”, where they are discussed and eventually resolved. Using Merge Requests (GitLab) or Pull Requests (GitHub), developers can propose their own extensions, and the repository owner can comment on, reject, or accept them. A “good” repository should provide clear guidelines on how feedback, bug reports, or Merge/Pull Requests should be submitted (e.g., using templates and examples) and emphasize respectful interaction. These guidelines are usually documented in a file such as “Contributing.md” or “Community.md” in the root directory of the repository (example: the file https://github.com/Bios4Biol/intronSeeker/blob/master/CONTRIBUTING.md in the repository https://github.com/Bios4Biol/intronSeeker).

What infrastructure must be provided by an institute and/or by the Max Planck Society to develop research software in accordance with Good Scientific Practice?

Technical infrastructure

  • Repository: A software repository (Gitlab, Github) is always beneficial but does not necessarily have to be provided by the institute on-site, as various MPG-wide solutions are available (see Software repositories).

Personnel infrastructure

  • More important is that the institute has employees with in-depth knowledge of research software development, a good overview of the institute’s research focuses, and who are familiar with the rules of Good Scientific Practice.

Culture

However, it is crucial that there is an open culture towards open-source software development. Only the open-source availability of software ensures verifiability and reproducibility of results and is thus a basis for Good Scientific Practice.

How are the different roles in the software development process distributed and recorded? Who takes on project ownership, quality assurance, testing?

Whenever possible, different people should perform different tasks:

  • Creating the requirements for the program
  • Sketching the components and their relationships (which classes, functions, etc.)
  • Writing the program code and the associated tests
  • Reviewing new and changed program parts
  • Authorizing Merge/Pull Requests into the Main/Master Branch
  • Authorizing releases

In the scientific context, such a distribution of roles is usually not feasible. Where possible, the tasks should then be divided between the PI and the employees according to their qualifications in software development. See also the FAQ “Under what circumstances should research software (source code) not be made publicly accessible?”.

Where can I search for possibly existing software for my specific problem?

Generic (research) software collections

Humanities

Medical and Life Sciences

Physics

Mathematics

NIST Guide to available mathematical software (GAMS): https://gams.nist.gov/cgi-bin/serve.cgi/Packages (last update in 2010)

Machine Learning

Nanotech

The above list is partly taken from the “Awesome Research Software Registries”.

Is it necessary to develop new software (“from scratch”), or can I adapt and specifically extend other code projects for my needs?

  • Both are legitimate, but adapting existing code is more resource-efficient where it is possible. The principle of economic efficiency applies very strongly to the MPG (§ 1 BHO).

How do I archive software in the Max Planck Archive?

Not at all. Archiving software is not part of the tasks of the Max Planck Archive (see https://www.archiv-berlin.mpg.de/41320/ueberlieferung).

How to archive software, especially beyond the 10-year retention period?

The following services are available for archiving software, especially beyond the 10-year retention period:

Does the runtime environment (hardware/software) need to be archived as well?

Practicality, long-term readability, and usability must be considered here. The following aspects can argue for comprehensive archiving (including all dependencies and hardware):

  • Results depend heavily on the versions of the integrated software and/or hardware used.
  • Software or dependencies may only run efficiently on a specific type of hardware that may not be available in the foreseeable future.

Even if there are arguments for archiving the complete runtime environment of a piece of software, this does not mean that a complete PC or laptop needs to be archived. Runtime environments can be stored efficiently together with the software in containers, such as Docker or Singularity. These containers include all components of the software to be archived, the libraries it depends on, and the components of the operating system or programming environment required to run the program. The prerequisite, of course, is that the containers themselves can still be read and executed more than 10 years from now. A more recent development is FAIR Digital Objects (FDOs), which include not only the software and operating system but also the research data. More information can be found at https://fairdo.org/.
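
Independently of containers, it can help to record which versions were actually in use, since results may depend on them. The following sketch, assuming a Python-based project, writes the interpreter version, the platform, and the installed packages to a JSON manifest that can be archived next to the source code. The function names and the manifest layout are hypothetical; such a manifest documents the environment but does not replace a container.

```python
import json
import platform
import sys
from importlib import metadata


def environment_manifest():
    """Collect interpreter, platform, and installed package versions."""
    packages = sorted(
        (
            (dist.metadata["Name"], dist.version)
            for dist in metadata.distributions()
            if dist.metadata["Name"]
        ),
        key=lambda item: item[0].lower(),
    )
    return {
        "python": sys.version,
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": [{"name": n, "version": v} for n, v in packages],
    }


def write_manifest(path):
    """Write the manifest as plain-text JSON so it stays readable."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(environment_manifest(), fh, indent=2)


if __name__ == "__main__":
    write_manifest("environment-manifest.json")
```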

How do I ensure that my source code remains readable for 10 years?

At present, there is no reason to expect that today’s common file formats will become unreadable within 10 years. In particular, source code is stored in simple text formats (ASCII), which can be read with any simple text editor. Additionally, the source code could be printed and stored appropriately.
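
Whether everything in a project really is plain text can be checked before archiving. The sketch below flags files that do not decode as UTF-8 (of which ASCII is a subset) or that contain NUL bytes, which usually indicates binary data. The function names are hypothetical, and the NUL-byte test is a common heuristic rather than a formal definition of a text file.

```python
from pathlib import Path


def is_plain_text(data: bytes) -> bool:
    """Heuristic: no NUL bytes and valid UTF-8 (ASCII is a subset)."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def non_text_files(root):
    """List files under root that are probably not plain text."""
    flagged = []
    for path in sorted(Path(root).rglob("*")):
        # Skip Git's internal object store, which is binary by design.
        if ".git" in path.parts or not path.is_file():
            continue
        if not is_plain_text(path.read_bytes()):
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    for path in non_text_files("."):
        print(path)
```

Files flagged this way (compiled binaries, images, spreadsheets) are the ones whose long-term readability depends on specific software and deserves separate consideration.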