CSCI 1360E: Foundations for Informatics and Analytics
We've previously covered the basics of exploring data. In this lecture, we'll go into a bit more detail on some of the slightly more formal strategies of "data munging," including introducing the pandas
DataFrame for organizing your data. By the end of this lecture, you should be able to explore and summarize a dataset, rescale data so your conclusions don't depend on units, and perform basic operations with pandas Series and DataFrames, including handling missing values.
One particularly important skill that all data scientists must have is the ability to explore their data.
If I told you to go build me a Facebook friend recommendation system, you would (rightfully) look at me like I'd gone crazy, not least because I hadn't given you any of the data you'd be using to actually make recommendations to users.
It may seem trite, but it's incredibly important: you have to understand your data before you can ever begin to start thinking about how to put it to use.
What kinds of patterns exist in the data that you can take advantage of? What unexpected properties do the data have that you'll have to account for? What assumptions can you make? What assumptions can't you make?
These are all points that require you to explore your data--doing some basic poking and prodding to get a feel for your data.
This is about as simple as it gets: your data consist of a list of numbers. We saw in previous lectures that you can compute statistics (mean, median, variance, etc) on these numbers. You can also visualize them using histograms. We'll reiterate that point here, using a particular example.
import numpy as np
np.random.seed(3908544)
# Generate two random datasets.
data1 = np.random.normal(loc = 0, scale = 58, size = 1000)
data2 = 200 * np.random.random(1000) - 100
# What are their means and variances?
print("Dataset 1 average: {:.2f} (+/- {:.2f})".format(data1.mean(), data1.std()))
print("Dataset 2 average: {:.2f} (+/- {:.2f})".format(data2.mean(), data2.std()))
Dataset 1 average: 1.60 (+/- 57.68)
Dataset 2 average: 1.88 (+/- 57.92)
Both datasets contain 1000 random numbers. Both datasets have very nearly the same mean and same standard deviation.
But the two datasets look very different!
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure().set_figwidth(12)
plt.subplot(121)
plt.title("Dataset 1")
_ = plt.hist(data1, bins = 20, range = (-100, 100))
plt.subplot(122)
plt.title("Dataset 2")
_ = plt.hist(data2, bins = 20, range = (-100, 100))
Behold: the importance of viewing your data! Dataset 1 is drawn from a Gaussian / Normal distribution (our good friend, the bell curve), while Dataset 2 is uniform--meaning the data are spread evenly between two values (-100 and 100, in this case), rather than clustered around the middle like the bell curve.
Two (and even three) dimensions? Scatter plots are your friend. Consider the following fake datasets.
np.random.seed(8493248)
X = np.random.normal(size = 1000)
Y1 = (X + np.random.normal(size = 1000) / 2)
Y2 = (-X + np.random.normal(size = 1000) / 2)
If you plotted Y1 and Y2 using the histograms from the previous strategy, you'd get two datasets that looked pretty much identical.
plt.figure().set_figwidth(12)
plt.subplot(121)
plt.title("Dataset Y1")
_ = plt.hist(Y1, bins = 50, range = (-4, 4))
plt.subplot(122)
plt.title("Dataset Y2")
_ = plt.hist(Y2, bins = 50, range = (-4, 4))
Maybe slightly different shapes, but qualitatively (and statistically) identical.
But what if we visualized the data in 2D using a scatter plot?
plt.scatter(X, Y1, marker = ".", color = "black", label = "Dataset 1")
plt.scatter(X, Y2, marker = ".", color = "gray", label = "Dataset 2")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend(loc = 0)
plt.title("Joint Distribution")
DIFFERENT, again! But it required a different visualization / summarization technique to discover.
These two datasets are anticorrelated. To see what this means, we can compute the correlation coefficient of each dataset with X:
print(np.corrcoef(X, Y1)[0, 1])
print(np.corrcoef(X, Y2)[0, 1])
0.896816214735
-0.895177590207
"Correlation" means as we change one variable (X), another variable changes by a similar amount (Y). Positive correlation means as we increase one variable, the other increases; negative correlation means as we increase one variable, the other decreases.
Anticorrelation, then, is the presence of both positive and negative correlation, which is what we see in this dataset: one has a correlation coefficient of 0.9 (1.0 is perfect positive correlation), while the other is -0.9 (-1.0 is perfect negative correlation).
This is something we'd only know from either visualizing the data or examining how the data are correlated.
Simpler strategies--means, medians, modes, standard deviations, and histograms--are all very useful data exploration strategies, and you should definitely keep them handy!
But they have their limits, as we've already seen. Exploring correlation and using scatter plots, in combination with the simpler strategies, will help you get a firmer handle on how your data behave.
If you have 3D data, matplotlib is capable of displaying that. But beyond three dimensions, it can get tricky. A good starting point is to make a correlation matrix, where the $i^{th}$ row and $j^{th}$ column of the matrix is the correlation coefficient between the $i^{th}$ and $j^{th}$ dimensions of the data.
Another strategy is to create 2D scatter plots of every pairwise combination of dimensions: for every pair of dimensions $i$ and $j$ in the data, create a 2D scatter plot like we did in the last slide. This way, you can visualize each dimension relative to every other dimension and easily spot any correlations (a quick sketch follows below).
These are pretty advanced techniques that we won't explicitly cover here (though possibly incidentally in later lectures). The upshot here is to find a way to visualize your data.
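To make those two strategies a bit more concrete, here's a quick sketch using a made-up 5-dimensional dataset (the data, seed, and figure layout are all illustrative assumptions, not part of this lecture's datasets):
# A made-up 5-dimensional dataset: 500 samples, 5 dimensions.
np.random.seed(1234)
D = np.random.normal(size = (500, 5))
D[:, 1] += 0.8 * D[:, 0]  # Plant a correlation between dimensions 0 and 1.
# Strategy 1: the correlation matrix. Row i, column j holds the correlation
# coefficient between dimensions i and j of the data.
print(np.corrcoef(D, rowvar = False))
# Strategy 2: a grid of 2D scatter plots, one panel per pair of dimensions.
ndims = D.shape[1]
fig, axes = plt.subplots(ndims, ndims, figsize = (10, 10))
for i in range(ndims):
    for j in range(ndims):
        axes[i, j].scatter(D[:, j], D[:, i], marker = ".", s = 5)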
There's an awesome article about data visualization that demonstrates precisely why summary statistics, on their own, can be viciously misleading.
I particularly like the "Age" and "Family Size in Household" plots, because the averages shown (the single dots) aren't even the widest parts of the full plot.
Moral of the story: summary statistics are great and absolutely essential, but they almost always require further visualization of the details!
Many data science analysis techniques can be sensitive to the scale of your data. This is where normalization or scaling your data can help immensely.
Let's say you're interested in grouping together your friends based on height and weight. You collect the following data points:
personA = np.array([63, 150]) # 63 inches, 150 pounds
personB = np.array([67, 160]) # 67 inches, 160 pounds
personC = np.array([70, 171]) # 70 inches, 171 pounds
plt.scatter(personA[0], personA[1])
plt.scatter(personB[0], personB[1])
plt.scatter(personC[0], personC[1])
And you compute the "distance" between each point (we'll just use standard Euclidean distance):
import numpy.linalg as nla
print("A to B: {:.2f}".format( nla.norm(personA - personB) ))
print("A to C: {:.2f}".format( nla.norm(personA - personC) ))
print("B to C: {:.2f}".format( nla.norm(personB - personC) ))
A to B: 10.77
A to C: 22.14
B to C: 11.40
As you can see, the two closest data points are person A and person B (the distance of 10.77 is the smallest).
But now your UK friend comes to you with the same dataset but a totally different conclusion! It turns out this friend measured everyone's height in centimeters, rather than inches, giving the following dataset:
personA = np.array([160.0, 150]) # 160 cm, 150 pounds
personB = np.array([170.2, 160]) # 170.2 cm, 160 pounds
personC = np.array([177.8, 171]) # 177.8 cm, 171 pounds
plt.scatter(personA[0], personA[1])
plt.scatter(personB[0], personB[1])
plt.scatter(personC[0], personC[1])
print("A to B: {:.2f}".format( nla.norm(personA - personB) ))
print("A to C: {:.2f}".format( nla.norm(personA - personC) ))
print("B to C: {:.2f}".format( nla.norm(personB - personC) ))
A to B: 14.28
A to C: 27.53
B to C: 13.37
Check it out--according to these measurements, the smallest distance (i.e. most similar pair) is 13.37, which is persons B and C! Oops...?
It can be very problematic if a simple change of units completely alters the conclusions you draw from the data. One way to deal with this is through scaling--we've actually done this before in a homework assignment.
By rescaling the data, we eliminate any and all units. We remove the mean (subtract it off) and divide by the standard deviation, so if you had to attach a unit, it would essentially be "standard deviations away from the mean" (which is now 0).
# We'll write a function to help!
def rescale(data):
    # Note: this operates on the array in place, so pass in floating-point data.
    # First: subtract off the mean of each column.
    data -= data.mean(axis = 0)
    # Second: divide by the standard deviation of each column.
    data /= data.std(axis = 0)
    return data
Now we'll generate some sample data, and then scale it.
np.random.seed(3248)
X = np.random.random((5, 3)) # Five rows with three dimensions.
print("=== BEFORE ===")
print("Means: {}\nStds: {}".format(X.mean(axis = 0), X.std(axis = 0)))
=== BEFORE ===
Means: [ 0.4303258   0.53938706  0.52770194]
Stds: [ 0.17285592  0.27353295  0.23789391]
Xs = rescale(X)
print("=== AFTER ===")
print("Means: {}\nStds: {}".format(Xs.mean(axis = 0), Xs.std(axis = 0)))
=== AFTER ===
Means: [ -2.66453526e-16  -6.66133815e-17  -4.44089210e-17]
Stds: [ 1.  1.  1.]
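As a quick sanity check (just a sketch, reusing the rescale() function and the height/weight numbers from earlier), rescaling makes the inch-based and centimeter-based versions of the friends dataset produce the same pairwise distances, so a change of units can no longer flip our conclusion:
friends_in = np.array([[63.0, 150], [67.0, 160], [70.0, 171]])     # Inches, pounds.
friends_cm = np.array([[160.0, 150], [170.2, 160], [177.8, 171]])  # Centimeters, pounds.
for units, friends in [("inches", friends_in), ("cm", friends_cm)]:
    scaled = rescale(friends)  # rescale() works in place, but these are fresh copies.
    print("Using {}:".format(units))
    print("  A to B: {:.2f}".format( nla.norm(scaled[0] - scaled[1]) ))
    print("  A to C: {:.2f}".format( nla.norm(scaled[0] - scaled[2]) ))
    print("  B to C: {:.2f}".format( nla.norm(scaled[1] - scaled[2]) ))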
Of course, like anything (everything?), there are still caveats: rescaling relies on the mean and standard deviation, for instance, so a handful of extreme outliers can badly skew the rescaled values.
DataFrames are a relatively new data structure on the data science scene. Equal parts spreadsheet, database, and array, they are capable of handling rich data formats as well as having built-in methods for dealing with the idiosyncrasies of unstructured datasets.
As Jake VanderPlas wrote in his book, the Python Data Science Handbook:
"NumPy's ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (such as attaching labels to data, working with missing data, etc.) and when attempting operations which do not map well to element-wise broadcasting (such as groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us. Pandas [...] builds on the NumPy array structure and provides efficient access to these sorts of 'data munging' tasks that occupy most of a data scientist's time."
What exactly is a DataFrame, then?
Well, it's a collection of Series! </unhelpful>
import pandas as pd # "pd" is the import convention, like "np" is for NumPy
data = pd.Series([0.25, 0.5, 0.75, 1])
print(data)
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
Think of a Series as a super-fancy 1D NumPy array. It's so fancy, in fact, that you can give a Series completely custom indices, sort of like a dictionary.
data = pd.Series({2:'a', 1:'b', 3:'c'})
print(data)
1    b
2    a
3    c
dtype: object
If a Series is essentially a fancy 1D NumPy array, then a DataFrame is a fancy 2D array. Here's an example.
# Standard Python dictionary, nothing new and exciting.
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict) # Oh right: you can feed dicts to Series!
area_dict = {'California': 423967,
'Texas': 695662,
'New York': 141297,
'Florida': 170312,
'Illinois': 149995}
area = pd.Series(area_dict)
# Build the DataFrame!
states = pd.DataFrame({'population': population,
'area': area})
print(states)
              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193
DataFrames are really nice--you can directly access all the extra information they contain.
print(states.index) # Our row names
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
print(states.columns) # Our Series / column names
Index(['area', 'population'], dtype='object')
You can also directly access the property you're interested in, rather than having to memorize the index number as with NumPy arrays:
print(states['population'])
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
Name: population, dtype: int64
But you can also access the same information almost as you would with a NumPy array:
print(states.iloc[:, 1])
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
Name: population, dtype: int64
Note the use of the .iloc attribute of the DataFrame. This is to handle the fact that you can assign entirely customized integer indices to a DataFrame, resulting in potentially confusing behavior when you slice it: if you slice with 1:3, do you mean the rows at positions 1 and 2 (counting from 0, as with a NumPy array), or the rows you explicitly labeled 1 through 3? With DataFrames, these can be two different things!
Use .iloc if you want implicit ordering, meaning the automatic positional ordering (a la NumPy arrays).
Use .loc if you want explicit ordering, i.e. the index labels you set when you built the DataFrame.
Use .ix if you want a hybrid of the two (this is super-confusing, and newer versions of pandas have deprecated it).
If you just want the whipper-snappers to get off your lawn, don't worry about this distinction. As long as you don't explicitly set the indices yourself when you build a DataFrame, just use .iloc. A quick sketch of the difference follows below.
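Here's a quick sketch of the difference using the states DataFrame from above. Note that .loc slices include both endpoints, while .iloc slices behave like NumPy slicing and exclude the end:
# Explicit (label-based) indexing: rows labeled 'Florida' through 'Illinois', inclusive.
print(states.loc['Florida':'Illinois'])
# Implicit (positional) indexing: rows at positions 1 and 2, just like a NumPy array.
print(states.iloc[1:3])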
So what do DataFrames have to do with data exploration?...besides making it really easy, of course.
pandas has some phenomenal missing-data capabilities built into Series and DataFrames. As a point of comparison, let's see what happens if we have a None or NaN in our NumPy array when we try to do arithmetic.
x = np.array([0, 1, None, 2]) # Trust me when I say: this happens a LOT.
print(x.sum())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-9533350d1013> in <module>()
      1 x = np.array([0, 1, None, 2]) # Trust me when I say: this happens a LOT.
----> 2 print(x.sum())

/opt/python/lib/python3.5/site-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims)
     30
     31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
---> 32     return umr_sum(a, axis, dtype, out, keepdims)
     33
     34 def _prod(a, axis=None, dtype=None, out=None, keepdims=False):

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
Welp, that crashed and burned. What about using NaN instead?
x = np.array([0, 1, np.nan, 2])
print(x.sum())
nan
Well, it didn't crash. But since "NaN" specifically stands for "Not A Number," it makes arithmetic difficult: any operation involving a NaN will itself return NaN.
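One workaround worth knowing about: NumPy provides NaN-aware aggregations such as np.nansum, np.nanmean, and np.nanstd, which simply skip over any NaN values.
x = np.array([0, 1, np.nan, 2])
print(np.nansum(x))   # Ignores the NaN entry: 3.0
print(np.nanmean(x))  # Mean of just the non-NaN values: 1.0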
A Series has a bunch of tools available to us for sniffing out missing values and handling them gracefully:
isnull(): generate a boolean mask which indicates where there are missing values.
notnull(): the opposite of isnull().
dropna(): return a version of the data that drops all NaN values.
fillna(): return a copy of the data with NaN values filled in with something else or otherwise imputed.
data = pd.Series([1, np.nan, 'hello', None])
print(data)
0        1
1      NaN
2    hello
3     None
dtype: object
print(data.isnull()) # Where are the null indices?
0    False
1     True
2    False
3     True
dtype: bool
new_data = data[data.notnull()] # Use the boolean mask to pull out non-null indices.
print(new_data)
0        1
2    hello
dtype: object
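The other two tools from the list above work just as advertised; a quick sketch:
print(data.dropna())   # Drop the missing values entirely.
print(data.fillna(0))  # Or fill them in with a value of our choosing (0, in this case).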
This is but a tiny taste of the majesty that is the pandas package. I highly recommend checking it out further.
Some questions to discuss and consider:
1: What are the advantages and disadvantages of using pandas DataFrames instead of NumPy arrays?
2: Name three strategies for visualizing and exploring 5-dimensional data. What are the pros and cons of each?
3: You're putting your data science skills to work and writing a program that automatically classifies web articles into semantic categories (e.g. sports, politics, food, etc). You start by counting words, resulting in a model with 100,000 dimensions (words are dimensions!). Can you come up with any kind of strategy for exploring these data?