Home » Machine Learning » Pandas » Grouping

Grouping

We can group Pandas dataframes by one or more columns. Rather than a DataFrame, this returns a DataFrameGroupBy object.

You can then call aggregate methods on this to summarise the data; for example, mean(), sum() and count(). You can also retrieve particular columns via square brackets and run aggregate functions on them.

Here we load the iris dataset, add a species column by mapping the target IDs to species names with Numpy choose, then we group on species.

Finally, we calculate mean petal lengths for each species.

from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

iris = load_iris(as_frame=True)

df = iris['data']

"""
Map the target IDs to the species names
and assign to a new column.
"""
df['species'] = np.choose(iris['target'], iris['target_names'])

# Select just two columns, petal length and species.
df = df[['petal length (cm)', 'species']]

# Group by species.
grouped = df.groupby(by='species')

print(grouped.group())
            petal length (cm)
species                      
setosa                  1.462
versicolor              4.260
virginica               5.552

Grouping On Multiple Columns

For this example we’ll use the Mall Customers dataset. This surveys customers of a shopping mall. Let’s take a look.

import pandas as pd

df = pd.read_csv('mall_customers.csv')

print(df.head(10))
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
5           6  Female   22                  17                      76
6           7  Female   35                  18                       6
7           8  Female   23                  18                      94
8           9    Male   64                  19                       3
9          10  Female   30                  19                      72

Notice we’re only displaying the first ten rows here. There are actually 200 rows.

Let’s group by sex and age. In many cases there are multiple people that have the same age in the dataset. Then we’ll calculate average annual income (in thousands of dollars) for each group.

import pandas as pd

df = pd.read_csv('mall_customers.csv')

grouped = df.groupby(by=['Gender', 'Age'])

print(grouped.mean())
            CustomerID  Annual Income (k$)  Spending Score (1-100)
Gender Age                                                        
Female 18   115.000000               65.00                   48.00
       19   114.000000               64.00                   52.00
       20    21.500000               26.50                   40.50
       21    64.750000               44.75                   63.25
       22    47.000000               37.00                   65.50
...                ...                 ...                     ...
Male   66   110.000000               63.00                   48.00
       67    65.666667               45.00                   38.00
       68   109.000000               63.00                   43.00
       69    58.000000               44.00                   46.00
       70    66.000000               47.50                   55.50

[87 rows x 3 columns]

Composite Keys

If we group on multiple keys, the values are then indexed via a composite key, which is a tuple.

In this example we group on sex and age again, retrieve the income column and then retrieve one specific income value using a composite key.

import pandas as pd

df = pd.read_csv('mall_customers.csv')

grouped = df.groupby(by=['Gender', 'Age'])

# Retrieve only the income column.
mean_incomes = grouped.mean()['Annual Income (k$)']

print(mean_incomes[('Male', 66)])
63.0

Leave a Reply

Blog at WordPress.com.

%d