An easy problem without a simple solution
Today, we are definitely not talking about complex Data Science. We have a dataset like this one:
Group | Subgroup | Value | |
---|---|---|---|
0 | X | a | 6 |
1 | X | a | 19 |
2 | X | a | 14 |
3 | X | z | 10 |
4 | Y | a | 7 |
5 | Y | a | 6 |
6 | Y | a | 18 |
7 | Y | z | 10 |
8 | Y | z | 10 |
9 | Y | z | 3 |
And we would like to programmatically generate a summary, like this one:
Group | Subgroup | Mean of Value | |
---|---|---|---|
0 | X | a | 13 |
1 | X | z | 10 |
2 | X | Total | 12.25 |
3 | Y | a | 10.3333 |
4 | Y | z | 7.66667 |
5 | Y | Total | 9 |
6 | Total | Total | 10.3 |
It may seem like something very easy to achieve. Well, I was the first to be surprised when I realized that there is currently no trivial way to do this with both pandas and tidyverse libraries. Even the almighty Stack Overflow does not give a clear answer about how to do this, as far as I know. Let us see how to proceed.
HOW TO
The first idea that comes to mind is to compute the “Total” row on the side, and to bind it to the original dataset. Unfortunately, this works if the total is needed only for the whole population and not for subgroups. Indeed, this method is not appropriate for subtotals because reordering the rows after the concatenation is a real pain.
To get around this problem, we concatenate the original dataset to itself as much as needeed, after having made “Total” a new level for the features we want to include in the aggregate data.
Pandas
We will work on the following dataset, presented above:
|
|
Concatenate
We build the two following datasets:
|
|
Group | Subgroup | Value | |
---|---|---|---|
0 | X | Total | 6 |
1 | X | Total | 19 |
2 | X | Total | 14 |
3 | X | Total | 10 |
4 | Y | Total | 7 |
5 | Y | Total | 6 |
6 | Y | Total | 18 |
7 | Y | Total | 10 |
8 | Y | Total | 10 |
9 | Y | Total | 3 |
and
|
|
Group | Subgroup | Value | |
---|---|---|---|
0 | Total | Total | 6 |
1 | Total | Total | 19 |
2 | Total | Total | 14 |
3 | Total | Total | 10 |
4 | Total | Total | 7 |
5 | Total | Total | 6 |
6 | Total | Total | 18 |
7 | Total | Total | 10 |
8 | Total | Total | 10 |
9 | Total | Total | 3 |
Then we concatenate these dataframes to the original dataset with pandas.concat()
:
|
|
Let us compute the mean for each group/subgroup:
|
|
Group | Subgroup | Value | |
---|---|---|---|
0 | Total | Total | 10.3 |
1 | X | Total | 12.25 |
2 | X | a | 13 |
3 | X | z | 10 |
4 | Y | Total | 9 |
5 | Y | a | 10.3333 |
6 | Y | z | 7.66667 |
The results are there, but the order of the rows is clearly not satisfying.
Order levels
To circumvent this issue, we cast the grouping variables as dtype category. Calling pandas.Series.astype("category")
is not good enough because by default, categories are unordered. Yet, we want the category “Total” to be the last. We use instances of CategoricalDtype
for this purpose. Get more information here.
|
|
Aggregate
Now the groupby()
command returns the expected result:
|
|
Group | Subgroup | Value | |
---|---|---|---|
0 | X | a | 13 |
1 | X | z | 10 |
2 | X | Total | 12.25 |
3 | Y | a | 10.3333 |
4 | Y | z | 7.66667 |
5 | Y | Total | 9 |
6 | Total | Total | 10.3 |
Hooray !
R’s Tidyverse
In this case, we use dplyr for the concatenation and forcats to reorder the levels:
|
|
Plots
This technique is also useful for data visualization. Below is an example based on the mtcars dataset:
As usual, the full scripts are available on my GitHub.