CSCI 1360E: Foundations for Informatics and Analytics
At this point, you're ready to dive into data science headfirst. One of the best ways to learn more about specific problems is to see how those problems have been addressed so far; reading through existing code can help immensely. But this touches on an important and emerging area of science called "Open Science." By the end of this lecture, you should be able to
Simply put, Open Science is the movement to make all scientific data, methods, and materials accessible to all levels of society.
Why is this a good thing?
Nonetheless, there are some downsides to making everything openly available.
My opinion--as you've probably guessed by the title of the lecture--is that the benefits of good Open Science practices outweigh the drawbacks, for the following reason:
I learn best when I can dig in and get my hands dirty.
Reading a paper or even a blog post that vaguely describes a method is one thing. Actually seeing the code, changing it, and re-running it to observe the results is something else entirely and, so I believe, is vastly superior in educational terms.
The scientific deluge is legitimate, though this was already happening even without the addition of open data, open access, and open source. And it would seem that, while we do absolutely need to exercise caution in our research and not pursue the ends by any means necessary, science is the pursuit of knowledge for its own sake and that should also be respected to the highest degree.
To that end, there are six main themes that comprise the Open Science guidelines.
If you had to pick the "core" of Open Science, this would probably be it. All of the data used in your study and experiments are published online.
This is definitely a shift from prior precedent; most raw data from scientific experiments remain cloistered.
The situation is further complicated by Terms of Service agreements that prohibit the sharing of data collected. For example: if you hooked up a Python client to listen to and capture public Twitter posts, you are forbidden from sharing the Twitter data publicly. Which seems odd, given that the data are public anyway, but there you go.
Repositories and online data banks have sprung up around this idea. Many research institutions host their own open data repositories, as do some large tech companies.
This is probably the part you're most familiar with. Any (and all) code that's used in your project is published somewhere publicly for download.
There are certainly conditions where code can't be fully open sourced--proprietary corporate secrets, pending patents, etc.--but to fully adhere to Open Science, the code has to be made completely available to anyone.
As with open data, there are numerous repositories across the web that specialize in providing publicly-available versioning systems for both maintaining and publishing your code.
This is probably the trickiest item. How does one make methods reproducible?
Open source code is part of it, but even more important is the effort put into making the methods in the code understandable. This takes several forms:
One good example of making sure your code documentation is up to par is Continuum IO's new automated README generator, kapsel.
The cornerstone of the scientific process is that of peer review: your peers, your colleagues, your fellow researchers should vet your work before it's officially included as part of the scientific literature.
However, this process is fraught with ambiguity and opacity. Conflicts of interest can potentially lead to biased reviews (if you're reviewing the paper of a competitor, it isn't exactly in your best interest to go easy on them), and it can be difficult to assess a published paper in the public sphere without a trail of edits from which to begin.
Online open review journals such as The Journal of Open Source Software (JOSS) have begun proliferating to address this shortcoming.
Reviews essentially take the form of GitHub tickets, and researchers in the field can discuss and debate the merits of the work in an open forum.
Once a project is published, the paper should be made publicly available for anyone, anywhere to download and read for themselves.
Easily the most popular open access paper repository is arXiv (pronounced "archive"...geddit?). Other repositories modeled directly after its success, such as bioRxiv, have already started springing up.
arXiv is already hugely popular.
In fact, so many papers are archived here on a regular basis that someone created their own open source "aggregator" service, http://www.arxiv-sanity.com, which collates the papers you want to read and helps filter out all the others.
This and Open Methods are closely related. Here, any course materials that come out of the research are made available to others.
MIT OpenCourseWare is probably the best example of open education at work, but these types of sites are proliferating.
Note: This does not by default include MOOCs. If MOOCs make their materials freely available online, then and only then would it fall under this section.
In the previous section we went over the absolute essentials for an Open Science project or research endeavor. It included some examples of real-world services and tools to help expedite the process.
Here, we'll go into a bit more detail as to exactly how you should tweak your projects to be truly Open Science. It's not really something you can just "tack on" at the end; rather, you'll need to design your projects from the outset to be part of the Open Science initiative.
When you invent the next TensorFlow, we'll all be thanking you for following these best practices!
There are some common files and folder hierarchies to implement at the very start of a new project. The shablona GitHub repository has all this documented very well, but here are the highlights:
src folder, containing all your source code. Surprise!
tests for your code. Always always always always always write tests!
README file of some kind. This is your introduction that summarizes your project, gives a basic overview of how to use it, what the prerequisites are, links to further (more detailed) documentation, and any problems that are known.
doc folder containing detailed documentation about every aspect of your project, including any examples.
LICENSE file. This is often overlooked (like clicking through a EULA), but in the age of open source and Open Science it is absolutely essential to have a license of some sort.
This is drilled into computer science students from day 1, and yet one constant across every software project is that the documentation sucks. To meet basic adequacy standards, documentation should include
Comments in the code: not ones like # loop through the array, but useful comments like # Because the array is already sorted, we only loop through parts of it to search for the value we want.
Right after the README, this is the next place the user will go: examples for how to run the code. Is it command line? What options are there? What kind of data can I run through it? What should I expect for output? These sorts of questions can be addressed with
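To make the distinction between filler comments and useful comments concrete, here is a minimal sketch in Python. The function name find_index and its arguments are made up for illustration; the point is only that the comment explains why the loop is shaped the way it is, and that the docstring doubles as a small usage example.

```python
def find_index(sorted_values, target):
    """Return the index of `target` in `sorted_values`, or -1 if it is absent.

    `sorted_values` must already be sorted in ascending order.

    Example:
    >>> find_index([2, 3, 5, 7, 11], 7)
    3
    """
    low, high = 0, len(sorted_values) - 1
    # Because the list is already sorted, each comparison lets us discard
    # half of the remaining range instead of scanning every element.
    while low <= high:
        mid = (low + high) // 2
        if sorted_values[mid] == target:
            return mid
        if sorted_values[mid] < target:
            low = mid + 1   # target can only be in the upper half
        else:
            high = mid - 1  # target can only be in the lower half
    return -1
```

If this lived in a file of its own, the example in the docstring could be checked automatically with Python's built-in doctest module (python -m doctest yourfile.py), which keeps the documentation from drifting out of sync with the code.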
PyPI is the Python Package Index, and is where the vast majority of external Python packages are hosted (over 85,000 as of this lecture!). In addition to publishing your code on GitHub, you can also package your code to be automatically installed via a package manager like pip or easy_install.
Provided you're already adhering to the anatomy of an open source project, there are only a few other steps needed to get your code ready for publishing on PyPI: in particular, writing the setup.py script, which tells the package manager exactly how to install your package.
Then you just upload the code to PyPI! For a walkthrough, see http://peterdowns.com/posts/first-time-with-pypi.html
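As a rough sketch of what such a setup.py might contain, assuming the setuptools library and source code kept in a src folder: the package name, version, author, and dependency below are all placeholders made up for illustration, not anything prescribed by PyPI.

```python
# setup.py -- a minimal sketch; every value below is a placeholder.
from setuptools import find_packages, setup

setup(
    name="mypackage",                     # hypothetical name shown on PyPI
    version="0.1.0",
    description="One-line summary of what the package does",
    author="Your Name",
    license="MIT",                        # should match your LICENSE file
    package_dir={"": "src"},              # assumes packages live under src/
    packages=find_packages(where="src"),  # discover every package in src/
    install_requires=["numpy"],           # hypothetical runtime dependency
    python_requires=">=3.8",
)
```

With a file like this in the project root, running pip install . from that directory installs the package locally, which is a useful sanity check before uploading anything.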
Yes, this is super boring, I agree. Fortunately, the wonderful folks at GitHub have created an awesome resource for you to use.
Choose a License: It provides a handy choose-your-adventure flowchart for picking the perfect open software license for your project. I highly recommend it.
However, if you're just looking for a tl;dr version and want the basic lay of the land, GitHub's default licenses for new projects include
Some questions to discuss and consider:
1: What are your thoughts on Open Science?
2: Have you ever developed open source code before? If so, what projects? If not, are there any projects you'd be interested in working on?
3: Have you ever dealt with open source licenses? Have you heard of any before this lecture?
4: Have you ever written unit tests before? If so, what did you test? If not, what do you think the "unit" refers to?