Adding totals and subtotals rows with pandas or the tidyverse

An easy problem without a simple solution

Today, we are definitely not talking about complex Data Science. We have a dataset like this one:

	Group	Subgroup	Value
0	X	a	6
1	X	a	19
2	X	a	14
3	X	z	10
4	Y	a	7
5	Y	a	6
6	Y	a	18
7	Y	z	10
8	Y	z	10
9	Y	z	3

And we would like to programmatically generate a summary, like this one:

	Group	Subgroup	Mean of Value
0	X	a	13
1	X	z	10
2	X	Total	12.25
3	Y	a	10.3333
4	Y	z	7.66667
5	Y	Total	9
6	Total	Total	10.3

It may seem like something very easy to achieve. Well, I was the first to be surprised when I realized that there is currently no trivial way to do this with both pandas and tidyverse libraries. Even the almighty Stack Overflow does not give a clear answer about how to do this, as far as I know. Let us see how to proceed.

HOW TO

The first idea that comes to mind is to compute the “Total” row on the side, and to bind it to the original dataset. Unfortunately, this works if the total is needed only for the whole population and not for subgroups. Indeed, this method is not appropriate for subtotals because reordering the rows after the concatenation is a real pain.
To get around this problem, we concatenate the original dataset to itself as much as needeed, after having made “Total” a new level for the features we want to include in the aggregate data.

Pandas

We will work on the following dataset, presented above:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


import pandas as pd
import numpy as np

np.random.seed(42)

df = pd.DataFrame({
"Group": 4*['X']+6*['Y'],
"Subgroup": 3*['a']+['z']+3*['a']+3*['z'],
"Value": np.random.randint(20, size=10)
})

Concatenate

We build the two following datasets:

1

df.assign(Subgroup=lambda x: "Total")

	Group	Subgroup	Value
0	X	Total	6
1	X	Total	19
2	X	Total	14
3	X	Total	10
4	Y	Total	7
5	Y	Total	6
6	Y	Total	18
7	Y	Total	10
8	Y	Total	10
9	Y	Total	3

and

1

df.assign(Group=lambda x: "Total", Subgroup=lambda x: "Total")

	Group	Subgroup	Value
0	Total	Total	6
1	Total	Total	19
2	Total	Total	14
3	Total	Total	10
4	Total	Total	7
5	Total	Total	6
6	Total	Total	18
7	Total	Total	10
8	Total	Total	10
9	Total	Total	3

Then we concatenate these dataframes to the original dataset with pandas.concat():

1
2
3
4
5


df_subtotal = pd.concat([
    df,
    df.assign(Subgroup=lambda x: "Total"),
    df.assign(Group=lambda x: "Total", Subgroup=lambda x: "Total")
])

Let us compute the mean for each group/subgroup:

1
2
3
4
5


pd.concat([
    df,
    df.assign(Subgroup=lambda x: "Total"),
    df.assign(Group=lambda x: "Total", Subgroup=lambda x: "Total")
]).groupby(by=['Group', 'Subgroup'], observed=True).mean()

	Group	Subgroup	Value
0	Total	Total	10.3
1	X	Total	12.25
2	X	a	13
3	X	z	10
4	Y	Total	9
5	Y	a	10.3333
6	Y	z	7.66667

The results are there, but the order of the rows is clearly not satisfying.

Order levels

To circumvent this issue, we cast the grouping variables as dtype category. Calling pandas.Series.astype("category") is not good enough because by default, categories are unordered. Yet, we want the category “Total” to be the last. We use instances of CategoricalDtype for this purpose. Get more information here.

1
2
3
4
5


cat_Group = CategoricalDtype(categories=list(df['Group'].unique()) + ['Total'], ordered=True)
cat_Subgroup = CategoricalDtype(categories=list(df['Subgroup'].unique()) + ['Total'], ordered=True)

df_subtotal['Group'] = df_subtotal['Group'].astype(cat_Group)
df_subtotal['Subgroup'] = df_subtotal['Subgroup'].astype(cat_Subgroup)

Aggregate

Now the groupby() command returns the expected result:

1

df_subtotal.groupby(by=['Group', 'Subgroup'], observed=True).mean()

	Group	Subgroup	Value
0	X	a	13
1	X	z	10
2	X	Total	12.25
3	Y	a	10.3333
4	Y	z	7.66667
5	Y	Total	9
6	Total	Total	10.3

Hooray !

R’s Tidyverse

In this case, we use dplyr for the concatenation and forcats to reorder the levels:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37


# Import necessary packages
library(tibble)
library(dplyr)
library(forcats)

# Generate the data
df <- tibble(
  Group = c(rep("X", 4), rep("Y", 6)),
  Subgroup = c(rep("a", 3), "z", rep("a", 3), rep("z", 3)),
  Value = sample(20, 10, replace = TRUE)
)

df %>%
  # Concatenate
  bind_rows(
    df %>% mutate(Subgroup = "Total"),
    df %>% mutate(across(c(Group, Subgroup), ~ "Total"))
  ) %>%
  # Aggregate
  group_by(
    Group,
    Subgroup
  ) %>%
  summarise(
    Mean = mean(Value)
  ) %>%
  ungroup() %>%
  # Reorder factor levels
  mutate(
    Group = Group %>% as.factor() %>% fct_relevel("Total", after = Inf),
    Subgroup = Subgroup %>% as.factor() %>% fct_relevel("Total", after = Inf)
  ) %>%
  # Arrange
  arrange(
    Group,
    Subgroup
  )

Plots

This technique is also useful for data visualization. Below is an example based on the mtcars dataset:

totals_histogram

As usual, the full scripts are available on my GitHub.

python r pandas tidyverse total row subtotals aggregate groupby