Data analyses with Python & Jupyter

Introduction

You can do complex biological data manipulation and analyses using the pandas Python package (or, by switching kernels, using R!).

We will look at pandas here, which provides R-like functions for data manipulation and analyses. pandas is built on top of NumPy. Most importantly, it offers an R-like DataFrame object: a two-dimensional table with explicit row and column labels that can hold heterogeneous types of data as well as missing values, which plain NumPy arrays do not handle conveniently.
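To make the contrast with NumPy arrays concrete, here is a minimal sketch. The column names and values are made up for illustration and are not from the course data.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixed input is coerced to strings.
arr = np.array([["Grass", 1], ["Fox", 2]])
print(arr.dtype.kind)  # 'U' means a unicode string dtype

# A DataFrame keeps a separate dtype per column, carries explicit
# row and column labels, and allows missing values (NaN).
df = pd.DataFrame(
    {"species": ["Grass", "Fox", "Wolf"], "abundance": [1.0, np.nan, 4.0]},
    index=["a", "b", "c"],  # hypothetical row labels
)
print(df.dtypes)
print(df.loc["b", "abundance"])  # the missing value, looked up by labels
```

Note that the numbers in the NumPy array have become strings, whereas the `abundance` column of the dataframe stays numeric.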

pandas also implements a number of powerful data operations for filtering, grouping and reshaping data, similar to those found in R or spreadsheet programs.
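As a rough sketch of what these three kinds of operation look like, the example below uses a few rows taken from the lemur table that appears in the CSV example later in this chapter. The shortened column name `BodyMassKg` and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# A few rows from the lemur table used in this chapter's CSV example.
lemurs = pd.DataFrame({
    "Species": ["Avahi_occidentalis", "Cheirogaleus_major",
                "Cheirogaleus_medius", "Daubentonia_madagascariensis"],
    "Family": ["Indridae", "Cheirogaleidae",
               "Cheirogaleidae", "Daubentoniidae"],
    "BodyMassKg": [0.814, 0.450, 0.217, 2.700],
})

# Filtering: keep only the rows that satisfy a boolean condition.
small = lemurs[lemurs["BodyMassKg"] < 1.0]

# Grouping: mean body mass per family.
mean_mass = lemurs.groupby("Family")["BodyMassKg"].mean()

# Reshaping: wide to long format, one row per (Species, variable) pair.
long_form = lemurs.melt(id_vars="Species", var_name="Variable",
                        value_name="Value")

print(small)
print(mean_mass)
print(long_form)
```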

Installing Pandas

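pandas requires NumPy; see the pandas documentation for full installation instructions. If you installed Anaconda, you already have pandas installed. Otherwise, you can install it with pip (pip install pandas) or, on Debian/Ubuntu, with sudo apt install python3-pandas.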

Assuming pandas is installed, you can import it and check the version:

import pandas as pd
pd.__version__
'0.17.1'

Also import scipy:

import scipy as sc

Reminder about tabbing and help!

As you read through these chapters, don’t forget that Jupyter lets you quickly explore the contents of a package, or the methods applicable to an object, by using the tab-completion feature. The documentation of a function or object can also be accessed using the ? character. For example, to display all the contents of the pandas namespace, you can type

In [1]: pd.<TAB>

And to display the built-in pandas documentation, you can use this:

In [2]: pd?

Pandas dataframes

The dataframe is the main data object in pandas.

Importing data

Dataframes can be created from multiple sources, e.g., CSV files, Excel files, and JSON.

MyDF = pd.read_csv('../data/testcsv.csv', sep=',')
MyDF
   Species                       Infraorder      Family          Distribution  Body mass male (Kg)
0  Daubentonia_madagascariensis  Chiromyiformes  Daubentoniidae  Madagascar    2.700
1  Allocebus_trichotis           Lemuriformes    Cheirogaleidae  Madagascar    0.100
2  Avahi_laniger                 Lemuriformes    Indridae        America       1.030
3  Avahi_occidentalis            Lemuriformes    Indridae        Madagascar    0.814
4  Avahi_unicolor                Lemuriformes    Indridae        America       0.830
5  Cheirogaleus_adipicaudatus    Lemuriformes    Cheirogaleidae  Madagascar    0.200
6  Cheirogaleus_crossleyi        Lemuriformes    Cheirogaleidae  Madagascar    0.400
7  Cheirogaleus_major            Lemuriformes    Cheirogaleidae  Madagascar    0.450
8  Cheirogaleus_medius           Lemuriformes    Cheirogaleidae  Madagascar    0.217
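The other sources work in much the same way. The sketch below reads CSV and JSON from in-memory text so that it runs without any files on disk; the two-column data and the variable names are made up for illustration. For Excel files there is an analogous pd.read_excel function, which needs an extra engine package such as openpyxl and is therefore not run here.

```python
import io
import pandas as pd

# CSV text held in memory; io.StringIO makes it behave like an open file.
csv_text = (
    "Species,Distribution\n"
    "Avahi_occidentalis,Madagascar\n"
    "Cheirogaleus_major,Madagascar\n"
)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON text as a list of records (one object per row).
json_text = '[{"Species": "Avahi_occidentalis", "Distribution": "Madagascar"}]'
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```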

Creating dataframes

You can also create dataframes using a Python dictionary-like syntax:

import numpy as np  # np.nan marks a missing value (recent SciPy versions no longer provide sc.nan)

MyDF = pd.DataFrame({
   'col1': ['Var1', 'Var2', 'Var3', 'Var4'],
   'col2': ['Grass', 'Rabbit', 'Fox', 'Wolf'],
   'col3': [1, 2, np.nan, 4]
})

MyDF
   col1  col2    col3
0  Var1  Grass   1
1  Var2  Rabbit  2
2  Var3  Fox     NaN
3  Var4  Wolf    4
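Since this dataframe contains a missing value, it is a convenient place to sketch how missing values can be counted, dropped, or replaced. This assumes a reasonably recent pandas release (on older versions, isnull is the equivalent of isna); the variable names and the fill value of 0 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col2": ["Grass", "Rabbit", "Fox", "Wolf"],
    "col3": [1, 2, np.nan, 4],
})

# Count the missing values in a column.
n_missing = df["col3"].isna().sum()

# Drop the rows where col3 is missing (returns a new dataframe).
complete = df.dropna(subset=["col3"])

# Replace missing values in col3 with a chosen value (here 0).
filled = df.fillna({"col3": 0})

print(n_missing)
print(complete)
print(filled)
```

Both dropna and fillna leave the original dataframe unchanged unless you assign the result back to it.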

Examining your data

# Displays the top 5 rows. Accepts an optional int parameter - num. of rows to show
MyDF.head()
   col1  col2    col3
0  Var1  Grass   1
1  Var2  Rabbit  2
2  Var3  Fox     NaN
3  Var4  Wolf    4
# Similar to head, but displays the last rows
MyDF.tail()
   col1  col2    col3
0  Var1  Grass   1
1  Var2  Rabbit  2
2  Var3  Fox     NaN
3  Var4  Wolf    4
# The dimensions of the dataframe as a (rows, cols) tuple
MyDF.shape
(4, 3)
# The number of rows. Equal to MyDF.shape[0]
len(MyDF) 
4
# An Index object holding the column names
MyDF.columns 
Index(['col1', 'col2', 'col3'], dtype='object')
# Columns and their types
MyDF.dtypes
col1     object
col2     object
col3    float64
dtype: object
# Converts the frame to a two-dimensional NumPy array
MyDF.values 
array([['Var1', 'Grass', 1.0],
       ['Var2', 'Rabbit', 2.0],
       ['Var3', 'Fox', nan],
       ['Var4', 'Wolf', 4.0]], dtype=object)
# Displays descriptive stats for all numeric columns
MyDF.describe()
       col3
count  3.000000
mean   2.333333
std    1.527525
min    1.000000
25%    1.500000
50%    2.000000
75%    3.000000
max    4.000000
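By default, describe summarises only the numeric columns, which is why only col3 appears above. As a small sketch, passing include="all" also summarises the text columns (count, number of unique values, most frequent value). The variable names here are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Var1", "Var2", "Var3", "Var4"],
    "col2": ["Grass", "Rabbit", "Fox", "Wolf"],
    "col3": [1, 2, np.nan, 4],
})

# Default: numeric columns only.
numeric_summary = df.describe()

# include="all": every column; statistics that do not apply are NaN.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```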

OK, I am going to stop this brief intro to Jupyter with pandas here! I think you can already see the potential value of Jupyter for data analyses and visualization. As I mentioned above, you can also use R (e.g., using tidyr + ggplot) for this.