Welcome to The Multilingual Quantitative Biologist!

“It is hard for me to say confidently that, after fifty more years of explosive growth of computer science, there will still be a lot of fascinating unsolved problems at peoples’ fingertips, that it won’t be pretty much working on refinements of well-explored things. Maybe all of the simple stuff and the really great stuff has been discovered. It may not be true, but I can’t predict an unending growth. I can’t be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on, it’s at that level.”

—Donald Knuth


These notes have emerged from the development of content for modules on Biological Computing taught in various past and present courses at the Department of Life Sciences, Imperial College London. These courses include Year 1 & 2 Computational Biostatistics modules at the South Kensington Campus, the Computational Methods in Ecology and Evolution (CMEE) Masters program at the Silwood Park Campus, the Quantitative Methods in Ecology and Evolution Centre for Doctoral Training (QMEE CDT), and the training workshops of the VectorBiTE RCN.

Different subsets of these notes will be covered in different courses. Please look up your respective course guidebooks/handbooks to determine when the modules covered in these notes are scheduled in your course. You will be given instructions about which sections are covered in your course.

All the chapters of these notes are written as Jupyter notebooks. Each chapter/notebook is accompanied by data and code on which you can practice your skills in your own time and during practical sessions. These materials are available (and will be updated regularly) in a git repository. We use git for hosting this course’s materials because we want to version-control this course’s content, which is constantly evolving to keep up with changing programming/computing technologies. That is, we are treating this course like any computing project that needs to be regularly updated and improved. Changes to the notes and content will also be made based upon student feedback. Blackboard is just not set up to handle dynamic updating and version control of this sort!

If you do not use git, you may download the code, data, these notes, and other course materials from the repository in one go: look for the green “Clone or Download” button and then click on the “Download repository” link. You can then unzip the downloaded .zip file and grab the files you need.

xkcd on programming

Fig. 1 Logical workflows are important, but don’t get married to yours!

(Source: xkcd)

It is important that you work through the exercises and problems in each chapter. This document does not tell you every single thing you need to know to perform the exercises in it. In programming and computing, you learn faster by trying to solve problems (including computer crashes!) on your own, often by liberally googling the problem!

Learning goals

The goal of these notes is to teach you to become (or at least show you the path towards becoming) a competent quantitative biologist. A large part of this involves learning computer programming. Why do biologists need to write computer programs? Here are some (hopefully compelling!) reasons:

  • Short of fieldwork, programs can do anything (that can be specified). In fact, they could even do fieldwork, if you could one day program a robot to do it for you 1!

  • Typically, no existing software performs exactly the analysis you are planning. You should be unhappy if you are trying to shoehorn your data into methods that don’t quite seem right.

  • Biological problems and datasets are some of the most complicated imaginable. Programming permits success despite complexity through precise specification and modularization of complicated analyses.

  • Modularity – programming allows you to break up your complex analysis in smaller pieces, yet keep all the pieces in a single, functional analysis.

  • Reproducibility – you (or someone else) can just re-run the code to reproduce your analysis. This is also the key to maintaining scientific accountability, integrity, and accuracy.

  • Organized thinking – writing code requires you to do this!

  • Career prospects – good, scientific coders are in short supply in all fields, but most definitely in biology!

Why Multilingual?

There are several hundred programming languages currently available – which ones should a biologist choose? These notes are built on the philosophy that quantitative biologists can significantly benefit from being multilingual programmers, knowing:

  1. A modern, easy-to-write, versatile, interpreted (or semi-compiled) language that is “reasonably” fast, like Python

  2. Mathematical and statistical software with programming and graphing capabilities, like R

  3. A compiled (or semi-compiled) ‘procedural’ language, like C

All of these, because one language doesn’t fit all purposes. Something like C, the last item in the list above, is a “procedural” language that forces you to deal with the real “under the hood” workings of your computer (especially memory management). Without an understanding of these “low-level” aspects of computer programming, you will be limited in your ability to develop applications that either intrinsically require you to optimize performance, or need to be run in a memory- or performance-constrained environment (a combination of computer hardware and operating system). Languages like Python and R intentionally obscure a lot of details of the underlying computer science, trading off performance in favor of ease of programming and running code. However, they are sufficient for most research and industry programming requirements.

You will learn Python and R (along with the bash language) on this course. These two are among the most popular languages currently (also see this), and with good reason. We will not learn any procedural languages here, but it may be necessary for some of you to learn something like C in certain lines of research or industry jobs. Just be aware of this, and keep your mind open to the possibility of learning yet another language!

R vs. Python

We will use R mainly for data analysis and visualization because it is a great one-stop solution for these purposes. If you are keen on trying data analyses in Python, see this Appendix. In general, R will do the job for most of your purposes. There is not much between these two languages for data science. Python is somewhat more computationally efficient, and is a multi-purpose programming language with a very clean and easy-to-learn syntax. It is generally used by data scientists in industry for exploratory data analysis and machine learning in team-driven production environments. R, on the other hand, has been built mainly by academic researchers and statisticians, and has a wider range of inbuilt (not requiring additional packages) statistical analysis capabilities. Learn more about R vs Python for data science here and here.

Some guidelines, conventions and rules

Our goal is to teach you not just programming, but also good computing practices. In this course, you will write plenty of code, deal with different data files, and produce text and graphic outputs. You will learn to keep your project and coursework organized in logical, efficient, error-free, and reproducible workflows (that’s a mouthful, but an important mouthful).

Beware the dark forces

You will NOT be using spreadsheet software (e.g., Excel) on this course. There are times when you will feel the pull of the dark side (ahem!), and imagine a more “comfortable” world where you are mouse-clicking your way happily through Excel-based data manipulations and analyses. NO! You will be doing yourself a disservice. In the long-ish run you will be much better off visualizing and manipulating data on your computer using a programming language like R. This is something you will learn, young padawan!

Keep your workflow organized

In the following chapters, you will practice many examples where you are required to write large blocks of code. Please get into the habit of writing code into text files with an appropriate extension (e.g., .R for R code, .py for Python code, etc.). Furthermore, please keep all your code files organized in one or more directories (e.g., named Code). Similarly, some of these scripts will take data files as inputs, and output some results in the form of text or graphics. Please keep these inputs and outputs organized as well, in separate directories (e.g., named Data and Results, respectively). Your instructor(s) will help you get set up and abide by this “workflow”.
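As a rough illustration of the Code/Data/Results layout described above, here is a minimal Python sketch. It is not part of the course materials: the function names (make_project, summarise), the file names (masses.csv, mean_mass.csv), and the column names (species, mass) are all made up for this example. The only idea it is meant to show is that a script reads its inputs from Data and writes its outputs to Results, and never mixes the two.

```python
import csv
import tempfile
from pathlib import Path


def make_project(root):
    """Create the Code, Data and Results directories under root (hypothetical helper)."""
    root = Path(root)
    for name in ("Code", "Data", "Results"):
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def summarise(data_file, results_file):
    """Read a CSV with 'species' and 'mass' columns (assumed names) and
    write the mean mass per species to results_file."""
    masses = {}
    with open(data_file, newline="") as f:
        for row in csv.DictReader(f):
            masses.setdefault(row["species"], []).append(float(row["mass"]))
    means = {s: sum(m) / len(m) for s, m in masses.items()}
    with open(results_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["species", "mean_mass"])
        for species in sorted(means):
            writer.writerow([species, means[species]])
    return means


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        project = make_project(tmp)
        # Made-up input data, stored under Data.
        (project / "Data" / "masses.csv").write_text(
            "species,mass\nA,1.0\nA,3.0\nB,5.0\n"
        )
        # Inputs come from Data, outputs go to Results.
        means = summarise(
            project / "Data" / "masses.csv",
            project / "Results" / "mean_mass.csv",
        )
        print(means)
```

In a real project the script itself would be saved under Code (for example as a .py file) and run from there, so that anyone re-running it finds the inputs and outputs in the same predictable places.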

xkcd on workflows

Fig. 2 Logical workflows are important, but don’t get married to yours!

(Source: xkcd)