We can group Pandas dataframes by one or more columns. Rather than a DataFrame, this returns a DataFrameGroupBy object.
You can then call aggregate methods on this to summarise the data; for example, mean(), sum() and count(). You can also retrieve particular columns via square brackets and run aggregate functions on them.
Here we load the iris dataset, add a species column by mapping the target IDs to species names with Numpy choose, then we group on species.
Finally, we calculate mean petal lengths for each species.
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = load_iris(as_frame=True)
df = iris['data']
"""
Map the target IDs to the species names
and assign to a new column.
"""
df['species'] = np.choose(iris['target'], iris['target_names'])
# Select just two columns, petal length and species.
df = df[['petal length (cm)', 'species']]
# Group by species.
grouped = df.groupby(by='species')
print(grouped.group())
petal length (cm)
species
setosa 1.462
versicolor 4.260
virginica 5.552
Grouping On Multiple Columns
For this example we’ll use the Mall Customers dataset. This surveys customers of a shopping mall. Let’s take a look.
import pandas as pd
df = pd.read_csv('mall_customers.csv')
print(df.head(10))
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
5 6 Female 22 17 76
6 7 Female 35 18 6
7 8 Female 23 18 94
8 9 Male 64 19 3
9 10 Female 30 19 72
Notice we’re only displaying the first ten rows here. There are actually 200 rows.
Let’s group by sex and age. In many cases there are multiple people that have the same age in the dataset. Then we’ll calculate average annual income (in thousands of dollars) for each group.
import pandas as pd
df = pd.read_csv('mall_customers.csv')
grouped = df.groupby(by=['Gender', 'Age'])
print(grouped.mean())
CustomerID Annual Income (k$) Spending Score (1-100)
Gender Age
Female 18 115.000000 65.00 48.00
19 114.000000 64.00 52.00
20 21.500000 26.50 40.50
21 64.750000 44.75 63.25
22 47.000000 37.00 65.50
... ... ... ...
Male 66 110.000000 63.00 48.00
67 65.666667 45.00 38.00
68 109.000000 63.00 43.00
69 58.000000 44.00 46.00
70 66.000000 47.50 55.50
[87 rows x 3 columns]
Composite Keys
If we group on multiple keys, the values are then indexed via a composite key, which is a tuple.
In this example we group on sex and age again, retrieve the income column and then retrieve one specific income value using a composite key.
import pandas as pd
df = pd.read_csv('mall_customers.csv')
grouped = df.groupby(by=['Gender', 'Age'])
# Retrieve only the income column.
mean_incomes = grouped.mean()['Annual Income (k$)']
print(mean_incomes[('Male', 66)])
63.0
Leave a Reply