Pandas, the clutch type library

While working on my script from my last post, I discovered the Pandas library for Data science in python. I ran acrosss this while trying to handle the data that my script was pulling in from both mysql and dynamodb so that I could collate the data better. Turns out that pandas is actually a pretty clutch library.

To explain why I love pandas so much, we’re going to follow an example from Towards Data Science

In their post, they have this dataset Pandas Sample Dataset from Towardsdatascience

Thankfully neither they nor I are going to make you recreate that by yourself, as they provided the CSV file for us.

Using that CSV file, we’re able to load it into a pandas dataframe using read_csv(), one of the several IO tools available. I’ve taken the liberty of renaming the file to property_data.csv just to remove the space.

1
2
3
4
#!/usr/bin/python
import pandas

df = pandas.read_csv("property_data.csv")

Now let’s say that we wanted to see the first 7 or the last 7 rows in the data frame, we would accomplish that with

1
df.head(7)
         PID      ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS  NUM_BATH SQ_FT
0  100001000.0   104.0      PUTNAM            Y            3        1  1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5    --
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4       NaN      203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
1
df.tail(7)
       PID      ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS  NUM_BATH SQ_FT
2  100003000.0     NaN   LEXINGTON            N          NaN        1   850
3  100004000.0   201.0    BERKELEY           12            1      NaN   700
4          NaN   203.0    BERKELEY            Y            3        2  1600
5  100006000.0   207.0    BERKELEY            Y          NaN        1   800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY   950
7  100008000.0   213.0     TREMONT            Y            1        1   NaN
8  100009000.0   215.0     TREMONT            Y           na        2  1800

One thing to note is that both head and tail default to 5 rows. Of course, we may not always know what exactly our data looks like so to get an idea of it, we can use

1
df.shape
(9, 7)

Which tells us that we have 9 rows and 7 columns.

“But Lance!” I hear you say, “You’ve only shown us how to some rows, what about getting other pieces of information?” Fear not friends! Pandas also has the ability to understand and manipulate text.

To explain that, look back at the original dataset, There are a number of NaNs in it which in pandas is a Null value or Not available. Suppose for a moment that you needed to send that CSV to a production machine but because the CSV you were given had those values, sending it would cause massive issues. Forunately, there is a solution to this, isnull()

1
df.isnull()
     PID  ST_NUM  ST_NAME  OWN_OCCUPIED  NUM_BEDROOMS  NUM_BATH  SQ_FT
0  False   False    False         False         False     False  False
1  False   False    False         False         False     False  False
2  False    True    False         False          True     False  False
3  False   False    False         False         False      True  False
4   True   False    False         False         False     False  False
5  False   False    False         False          True     False  False
6  False    True    False          True         False     False  False
7  False   False    False         False         False     False   True
8  False   False    False         False         False     False  False

That’s pretty nice looking for sure, but wouldn’t it be better if we could just have a single value?

1
df.isnull().values.any()
True

Now that’s a bit more like it! Before we kick it up a notch emeril style, let’s take a step back and examine some other ways of looking at data with pandas. In particular, I want to show the loc() and iloc() methods off.

loc() and iloc() are two methods for pulling out specific pieces of data from a panda dataframe. loc() gets based on labels allowing one to specify a column name such as ST_NAME and a row number to get that piece of data.

1
df.loc[1,'ST_NAME']
'LEXINGTON'

iloc() on the other hand works based on intergers.

1
df.iloc[1,2]
'LEXINGTON'

Now back to our scenario, not every row would break our production system, just the ones with Null chracters, so using pandas, it’s entirely possible to check through the row to make sure it has the informaiton we need.

1
df.loc[1].isnull().values.any()
True

And to add that extra niceness, let’s say we knew at least one column could be empty, we can account for that using pandas.

1
df.loc[3, df.columns != 'NUM_BATH'].isnull().values.any()
False

As you can see, pandas is a pretty powerful library and while it’s intended for Data Science and Big Data, that doesn’t mean that you can’t take advantage of it in your own programs. What I’ve shown here is only a drop in the bucket for the capabilities this library has. Hopefully this has given you an idea of how useful it can be for your projects!