
Exploring data with Python & Pandas

After obtaining a new set of data, it is important to apply a few techniques that will help you identify key variables, discover anomalies and test any assumptions that you might have.

Cue the Indiana Jones theme tune

This collection of techniques is known as exploratory data analysis, or EDA for short.

For this example, we'll be exploring the Titanic data set on Kaggle. Specifically, we'll be exploring the train.csv file (the other files are for machine learning, a topic for later). Simply sign up for a Kaggle account and download the dataset to follow along.

Importing the data into Python

The first step is to load the data into Python. Luckily for us, the Pandas module has several built-in tools that make this easy. One of these is the read_csv() function, which reads a comma-separated values (CSV) file into a Pandas dataframe.

import pandas as pd
raw_data = pd.read_csv('./Data/train.csv')

Now if we query raw_data it will look like this:

[Image: titanic_dataframe]

Here we can see the column headers (features) along the top and the rows of data (observations) underneath.

You can also view the first few rows by running raw_data.head(), and the last few rows by running raw_data.tail().
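If you don't have train.csv to hand yet, here is a minimal sketch of the same steps on a tiny stand-in. The rows in SAMPLE_CSV are made up for illustration (they are not real Titanic records), and the names SAMPLE_CSV and sample are my own; the only assumption is that read_csv() will happily take a file-like object in place of a path.

```python
import io

import pandas as pd

# Made-up rows shaped like a few train.csv columns (NOT real Titanic data).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Age,Fare
1,0,3,22,7.25
2,1,1,38,71.28
3,1,3,,7.92
4,1,1,35,53.10
5,0,3,35,8.05
6,0,3,,8.46
"""

# read_csv() accepts a path or a file-like object, so StringIO stands in for the file.
sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(sample.shape)    # (number of rows, number of columns)
print(sample.head(2))  # first two rows
print(sample.tail(2))  # last two rows
```

Both head() and tail() return five rows when called without an argument; passing a number changes how many you get back.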

Statistical descriptions

You may also want summary details for each column, such as the number of observations and the minimum and maximum values.

These stats can easily be seen by typing raw_data.describe(), which by default summarises the numeric columns. Note: you can get each column's data type and non-null count by typing raw_data.info().

[Image: Dataframe_Describe]

As you can see, the number of observations across most of the columns is 891. However, the Age column is missing some values.
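To see how a lower count points to missing values, here is a small sketch on made-up numbers (again, not real Titanic data; the sample dataframe is purely illustrative). It compares the count row from describe() with a direct tally from isna().sum(), which is one common way to check for gaps.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (NOT real Titanic data).
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# describe() counts only the non-missing values in each column.
summary = sample.describe()
print(summary.loc["count"])

# isna().sum() tallies the missing values per column directly.
missing = sample.isna().sum()
print(missing)
```

Running the same two calls against raw_data should show which columns of the real dataset have gaps.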

Interestingly, we can see that the highest fare paid to sail on the Titanic was a whopping £512 (the fares are generally understood to be in British pounds rather than dollars)!

I wonder if they survived?

Filtering

To filter rows, we first select a column and test it against a condition, which gives a true/false value for every row. Passing that result to the .loc feature of Pandas returns only the rows where the condition is true.

Run the code below to see if our high payer survived.

raw_data.loc[(raw_data['Fare'] == 512.329200)]

[Image: Titantic_Filtering]

Interesting! A total of three people paid top dollar for a seat aboard the doomed ocean voyage. Luckily though, they all survived, as noted by the 1 in the Survived column (a 0 would indicate that they did not).
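Typing out the exact fare works, but comparing floats against a hand-copied number can be fragile. As a hedged alternative, this sketch lets Pandas find the maximum for us and builds the filter from it. The dataframe here is made up for illustration (not real Titanic data), and the names mask and top_payers are my own.

```python
import pandas as pd

# Made-up rows for illustration only (NOT real Titanic data).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Fare": [7.25, 99.5, 99.5, 8.05],
})

# The comparison yields one True/False per row (a Boolean mask).
mask = sample["Fare"] == sample["Fare"].max()

# .loc keeps only the rows where the mask is True.
top_payers = sample.loc[mask]
print(top_payers)

# True only if every top payer has Survived == 1.
print(top_payers["Survived"].all())
```

On the real data, the equivalent would be raw_data.loc[raw_data['Fare'] == raw_data['Fare'].max()].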

There are many other things you can do with Pandas dataframes, such as advanced filtering, renaming columns, checking for missing data, using matplotlib for visualisations and many other cool things.

I highly recommend checking out the official documentation for more details.

© Kevin Tuck. Built using Pelican.