CBIO (CSCI) 4835/6835: Introduction to Computational Biology
"Open Science" is something of an umbrella term encompassing everything related to reproducible, transparent science.
There have been some complaints that Open Science is amorphous and ambiguous, that its prescription for reproducibility is not, in itself, reproducible.
However, Open Science is broadly defined and meant to appeal to every area of science, from life sciences to computational sciences to theoretical sciences. What Open Science looks like is field-specific, but there are general principles that cut across all fields of science.
By the end of this lecture, you should be able to
Simply put, Open Science is the movement to make all scientific data, methods, and materials accessible to all levels of society.
Why is this a good thing?
Nonetheless, there are some downsides to making everything openly available.
My opinion--as you've probably guessed by the title of the lecture--is that the benefits of good Open Science practices outweigh the drawbacks, for the following reason:
I learn best when I can dig in and get my hands dirty.
Reading a paper or even a blog post that vaguely describes a method is one thing. Actually seeing the code, changing it, and re-running it to observe the results is something else entirely and, so I believe, is vastly superior in educational terms.
The scientific deluge is legitimate, though it was already happening even without the addition of open data, open access, and open source. And while we absolutely need to exercise caution in our research and not pursue results by any means necessary, science is the pursuit of knowledge for its own sake, and that should also be respected to the highest degree.
To that end, there are six main themes that comprise the Open Science guidelines.
If you had to pick the "core" of Open Science, this would probably be it. All of the data used in your study and experiments are published online.
This is definitely a shift from prior precedent; most raw data from scientific experiments remain cloistered.
The situation is further complicated by Terms of Service agreements that prohibit the sharing of data collected.
Repositories and online data banks have sprung up around this idea. Many research institutions host their own open data repositories, as do some large tech companies.
This is probably the part you're most familiar with. Any (and all) code that's used in your project is published somewhere publicly for download.
There are certainly conditions where code can't be fully open sourced--proprietary corporate secrets, pending patents, etc--but to fully adhere to Open Science, the code has to be made completely available for anyone.
Like with open data, there are numerous repositories across the web that specialize in providing publicly-available versioning systems for both maintaining and publishing your code.
This is probably the trickiest item. How does one make methods reproducible?
Open source code is part of it, but even more important is the effort put into making the methods in the code understandable. This takes several forms:
The cornerstone of the scientific process is that of peer review: your peers, your colleagues, your fellow researchers should vet your work before it's officially included as part of the scientific literature.
However, this process is fraught with ambiguity and opacity. Conflicts of interest can potentially lead to biased reviews (if you're reviewing the paper of a competitor, it isn't exactly in your best interest to go easy on them), and it can be difficult to assess a published paper in the public sphere without a trail of edits from which to begin.
Online open review journals such as The Journal of Open Source Software (JOSS) have begun proliferating to address this shortcoming.
Reviews essentially take the form of GitHub tickets, and researchers in the field can discuss and debate the merits of the work in an open forum.
There's also a site called "OpenReview", where conferences can elect to have their papers openly and publicly reviewed, threaded-comment style.
Once a project is published, the paper should be made publicly available for anyone, anywhere to download and read for themselves.
Easily the most popular open access paper repository is arXiv (pronounced "archive"...geddit?). Other repositories modeled directly after its success, such as bioRxiv, have already started springing up.
arXiv is already hugely popular.
In fact, so many papers are archived there on a regular basis that someone created an open source "aggregator" service: http://www.arxiv-sanity.com , which collates the papers you want to read and helps filter out all the others.
Preprints have become so popular that versions of arXiv have popped up for almost every area!
This and Open Methods are closely related: any course materials that come out of the research are made available for others.
MIT OpenCourseWare is probably the best example of open education at work, but these types of sites are proliferating.
Note: This does not by default include MOOCs. If MOOCs make their materials freely available online, then and only then would they fall under this section.
So that's all great! But...
What can I do? Where do I start?
In the previous section we went over the absolute essentials for an Open Science project or research endeavor. It included some examples of real-world services and tools to help expedite the process.
Here, we'll go into a bit more detail as to exactly how you should tweak your projects to be truly Open Science. It's not really something you can just "tack on" at the end; rather, you'll need to design your projects from the outset to be part of the Open Science initiative.
When you invent the next TensorFlow, we'll all be thanking you for following these best practices!
This is a good starting point for any project.
Version control keeps track of changes within your code base.
It's a lot like Google Docs, but for code.
You all know Dropbox? Has it ever saved your life? If so, the next time you start a coding project, use some form of version control.
There are several different version control systems out there for code, depending on what you like.
I personally prefer git, but I've used all of these before.
There are some common files and folder hierarchies to implement at the very start of a new project. The shablona GitHub repository has all this documented very well, but here are the highlights:
- `src` folder, containing all your source code. Surprise!
- `tests` for your code. Always always always always always write tests!
- `README` file of some kind. This is your introduction that summarizes your project, gives a basic overview of how to use it, what the prerequisites are, links to further (more detailed) documentation, and any problems that are known.
- `doc` folder containing detailed documentation about every aspect of your project, including any examples.
- `LICENSE` file. This is often overlooked (like clicking through a EULA), but in the age of open source and Open Science it is absolutely essential to have a license of some sort.

This is drilled into computer science students from day 1, and yet one constant across every software project is that the documentation sucks. To meet basic adequacy standards, documentation should include
- Comments: not redundant comments like `# loop through the array`, but useful comments like `# Because the array is already sorted, we only loop through parts of it to search for the value we want`.

What does this code do?
def do(a, b):
    c = a + " " + b
    return c
What if I'd written it this way:
def full_name(first_name, last_name):
    name = first_name + " " + last_name
    return name
Or even this way:
def full_name(first_name, last_name):
    """
    Helper function, to concatenate the first name and the last name
    into a single full name.
    """
    # Concatenates the two names with a space in between.
    name = first_name + " " + last_name
    # Return the full name.
    return name
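Following the "always write tests" advice above, a minimal pytest-style test file for this helper might look like the sketch below. The function is repeated so the file stands alone; pytest discovers files named `test_*.py` and runs the `test_*` functions in them.

```python
# test_full_name.py -- a minimal unit test for the full_name helper.
def full_name(first_name, last_name):
    """Concatenate the first name and the last name into a full name."""
    return first_name + " " + last_name

def test_full_name():
    assert full_name("Ada", "Lovelace") == "Ada Lovelace"

def test_full_name_keeps_spacing():
    # Exactly one space should separate the two names.
    assert full_name("Grace", "Hopper").count(" ") == 1

# pytest would normally collect and run these automatically;
# they're called directly here so the sketch is self-contained.
test_full_name()
test_full_name_keeps_spacing()
```

Tests like these double as documentation: they show exactly what inputs the function expects and what outputs it promises.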
Python (like most programming languages) has a "style" of documenting code whereby other utilities can take those comments and turn them into rendered documentation.
Java pioneered this approach in the 1990s with "Javadocs" and every language since has emulated that in some way.
Python's version is "Sphinx" http://www.sphinx-doc.org/en/master/
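As a sketch, here is a small function (hypothetical, for illustration) documented in the reStructuredText field-list style that Sphinx can render into HTML documentation:

```python
def gc_content(sequence):
    """Compute the GC content of a DNA sequence.

    :param sequence: a DNA string over the alphabet ``ACGT``
    :type sequence: str
    :returns: the fraction of bases that are G or C
    :rtype: float
    """
    # Normalize to uppercase so lowercase input is counted correctly.
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

print(gc_content("ACGT"))  # 0.5
```

The `:param:` and `:returns:` fields aren't just decoration: Sphinx parses them into a formatted parameter table, and `help(gc_content)` shows them in the interpreter too.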
Right after the `README`, this is the next place the user will go: examples for how to run the code. Is it command line? What options are there? What kind of data can I run through it? What should I expect for output? These sorts of questions can be addressed with concrete, runnable examples.
If you're an R or Python user, you've probably heard of "scientific notebooks".
They're interactive, usually web-based applications that blur the lines between code, documentation, and output by interleaving all three within the same medium.
**These very slides were built entirely in Jupyter, a multi-language scientific notebook platform!**
While these slides were built in Jupyter, they were exported into HTML and PDF, so the code isn't technically live anymore.
However, through platforms like mybinder, you can turn any GitHub repository into a collection of active notebooks! You can do this yourself with these slides--all you need is a web browser!
WARNING: This is super-advanced.
Has anyone ever tried to download and run someone else's code? Did it run?
Here's a great example: Python 2 versus Python 3.
Python 2 was the de facto Python version for over a decade (it's kind of the Windows XP of Python). When Python 3 was introduced in the late 2000s, nobody upgraded.
What should this code output?
division = 5 / 2
print(division)
2.5
This notebook is running Python 3, which prints `2.5`. Run the same code in Python 2, and you'd get `2`: Python 2's `/` performs integer division when both operands are integers.
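Python 3 separates the two behaviors explicitly: `/` is always true division, while `//` is floor division (what Python 2's `/` did for integers):

```python
# Python 3 makes the two kinds of division explicit operators.
true_div = 5 / 2     # true division: 2.5
floor_div = 5 // 2   # floor division: 2, what Python 2's "5 / 2" produced
print(true_div, floor_div)  # 2.5 2
```

This is exactly the kind of silent environment-dependent behavior that makes recording your exact software versions so important for reproducibility.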
Containers, such as Docker, allow you to create bash-script-like recipes that build the precise environment you want.
This has huge implications in reproducibility: you can provide others with an exact set of instructions--even a pre-built image!--of the environment in which you conducted your work.
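As a sketch, a minimal Dockerfile recipe might look like the following. Everything here (the base image tag, the `requirements.txt` file, the `src/analysis.py` entry point) is a hypothetical placeholder to be adapted to your own project.

```dockerfile
# Start from an official Python base image (a hypothetical pinned version).
FROM python:3.10-slim

WORKDIR /app

# Install pinned dependencies first, so the environment is reproducible.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy in the rest of the project.
COPY . .

# Default command: run the (hypothetical) analysis script.
CMD ["python", "src/analysis.py"]
```

Anyone with Docker installed can then build and run the identical environment, regardless of what's on their own machine.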
PyPI is the Python Package Index, and is where the vast majority of external Python packages are hosted (over 85,000 as of this lecture!). In addition to publishing your code on GitHub, you can also package your code to be automatically installed via a package manager like `pip` or `easy_install`.
Provided you're already adhering to the anatomy of an open source project, there are only a few other steps needed to get your code ready for publishing on PyPI: in particular, writing the `setup.py` script that tells package managers exactly how to install your package.
Then you just upload the code to PyPI! http://peterdowns.com/posts/first-time-with-pypi.html
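As a sketch, a minimal `setup.py` might look like this. All the metadata below (name, version, author, dependencies) are hypothetical placeholders; the `__main__` guard is only there so the file can be inspected harmlessly, and a real `setup.py` typically calls `setup()` unconditionally.

```python
# setup.py -- a minimal, hypothetical packaging script.
import sys

METADATA = dict(
    name="mytool",               # placeholder package name
    version="0.1.0",
    description="A one-line summary of what the package does",
    author="Your Name",
    packages=["mytool"],         # package directories to install
    install_requires=["numpy"],  # runtime dependencies, if any
)

# Only invoke setuptools when an actual command (e.g. "sdist") is given.
if __name__ == "__main__" and len(sys.argv) > 1:
    from setuptools import setup
    setup(**METADATA)
```

With this in place, `python setup.py sdist` builds a source distribution you can upload to PyPI, after which anyone can `pip install mytool`.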
Pre-registration means publishing your methods publicly BEFORE any work has been done.
Why is this important?
Sites like Protocols.io https://www.protocols.io/ and OSF https://osf.io/prereg/ have pre-registration modules, sometimes linked with certain journals that agree to publish the results of the methods, regardless of whether they're positive or negative.
This makes sense for wet lab / life sciences or even theory-driven statistics, but what about empirical or computational studies like applied machine learning?
Turns out, there are a number of pre-registration sites for those too, including OpenML.
Yes, this is super boring, I agree. Fortunately, the wonderful folks at GitHub have created an awesome resource for you to use.
Choose a License: It provides a handy choose-your-adventure flowchart for picking the perfect open software license for your project. I highly recommend it.
However, if you're just looking for a tl;dr version and want the basic lay of the land, GitHub's default licenses for new projects include
In my view, Open Science and its practices will only become more important.
Six major areas of Open Science
Already lots of great open source tools and repositories to encourage Open Science practices