After obtaining a new data set, it helps to apply a few techniques that identify key variables, reveal anomalies and test any assumptions you might have.
Cue the Indiana Jones theme tune.
This collection of techniques is known as exploratory data analysis, or EDA for short.
For this example, we'll be exploring the Titanic data set on Kaggle. Specifically, we'll be exploring the
train.csv file (the other files are for machine learning, a topic for later). Simply sign up for an account and download the dataset to follow along.
Importing the data into Python
The first step is to load the data into Python. Luckily for us, the Pandas module has several built-in tools that make this easy. One of these is the
read_csv() function, which reads any comma-separated values (CSV) file into a Pandas dataframe.
import pandas as pd
raw_data = pd.read_csv('./Data/train.csv')
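If you want to try read_csv() before downloading anything, here is a minimal sketch that reads a few made-up rows from an in-memory string instead of a file. The rows and the name `sample` are invented for illustration; only the column names mirror train.csv.

```python
import io

import pandas as pd

# Made-up rows that mimic a few train.csv columns (not real passenger data).
sample_csv = io.StringIO(
    "PassengerId,Survived,Age,Fare\n"
    "1,0,22,7.25\n"
    "2,1,,71.2833\n"
    "3,1,26,7.925\n"
)

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(sample_csv)

print(sample.shape)             # (number of rows, number of columns)
print(sample.columns.tolist())  # column headers
```

Note that the empty Age field in the second row is read as a missing value (NaN) rather than raising an error.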
Now if we query
raw_data it will look like this:
Here we can see the column headers (features) along the top and the rows of data (observations) underneath.
You can also view the first few rows by running
raw_data.head(), and the last few rows by running raw_data.tail().
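Both head() and its counterpart tail() take an optional row count and return five rows when called without one. A small sketch on a hypothetical ten-row dataframe (`df` stands in for raw_data here):

```python
import pandas as pd

# Hypothetical stand-in for raw_data: ten rows with a single column.
df = pd.DataFrame({"PassengerId": range(1, 11)})

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
default_head = df.head()   # 5 rows when no count is given

print(first_three)
print(last_two)
```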
You may also want summary details for each column, such as the number of observations and the min and max values.
These stats can easily be seen by typing
raw_data.describe(). Note: you can get information related to the field type and other information by typing raw_data.info().
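To make the summary output concrete, here is a rough sketch on a tiny made-up dataframe. describe() returns a dataframe of statistics that you can index like any other, and by default it skips non-numeric columns when numeric ones are present. info() is one Pandas method that prints column types and non-null counts.

```python
import pandas as pd

# Made-up values for illustration; column names follow train.csv.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Sex": ["male", "female", "female"],
})

stats = df.describe()  # summary statistics for the numeric columns
print(stats.loc[["count", "min", "max"]])

df.info()  # prints dtypes and non-null counts per column
```

The count row only includes non-missing values, which is why a column with gaps shows a lower count than the others.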
As you can see, the number of observations across most of the columns is 891. However, the Age column is missing some values.
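One way to count the missing values directly, rather than reading them off the summary, is to combine isna() with sum(). A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up values; two of the four ages are missing.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# isna() marks each missing cell as True; sum() counts the Trues per column.
missing = df.isna().sum()
print(missing)
```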
Interestingly, we can see that the maximum fare paid to sail on the Titanic was a whopping £512!
I wonder if they survived?
To filter rows, we pass a condition to the dataframe, and only the rows where the condition is true are kept. We can do this by using the
.loc indexer of Pandas.
Run the code below to see if our high payer survived.
raw_data.loc[(raw_data['Fare'] == 512.329200)]
Interesting! A total of three people paid top dollar for a seat aboard the doomed ocean voyage. Luckily though, they all survived, as noted by the
1 in the
Survived column (a
0 would indicate that they did not).
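Typing the fare out by hand relies on an exact floating-point match, which can be fragile. A slightly more robust sketch compares against the column's own maximum and picks out only the columns of interest. The names and rows below are made up for illustration.

```python
import pandas as pd

# Made-up passengers; two of them share the highest fare.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Fare": [7.25, 512.3292, 30.0, 512.3292],
    "Survived": [0, 1, 1, 1],
})

top_fare = df["Fare"].max()

# First argument to .loc: the row condition. Second: the columns to keep.
top_payers = df.loc[df["Fare"] == top_fare, ["Name", "Fare", "Survived"]]

# True only if every top payer has Survived == 1.
all_survived = (top_payers["Survived"] == 1).all()

print(top_payers)
print(all_survived)
```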
There are many other things you can do with Pandas dataframes, such as advanced filtering, renaming columns, checking for missing data and using matplotlib for visualisations.
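As a small taste of two of these, here is a sketch that renames a column and then filters on two conditions at once. The data and the new column name `TicketClass` are made up for illustration.

```python
import pandas as pd

# Made-up rows; column names follow train.csv.
df = pd.DataFrame({
    "Pclass": [1, 3, 1, 2],
    "Survived": [1, 0, 0, 1],
})

# rename() returns a new dataframe; the original is left unchanged.
renamed = df.rename(columns={"Pclass": "TicketClass"})

# Combine conditions with & (and) or | (or); each one needs its own parentheses.
first_class_survivors = renamed.loc[
    (renamed["TicketClass"] == 1) & (renamed["Survived"] == 1)
]

print(first_class_survivors)
```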
I highly recommend checking out the official documentation for more details.