Python Pandas – descriptive stats from a frequency table

Suppose you are a teacher, and want to keep track of and do some simple descriptive stats on your student’s scores in a number of subjects. So you have recorded scores in a frequency table as below:

Left-most column has the scores, and the next four columns show the frequency of those scores for the four subjects.

By summing the columns we can find the number of students having obtained these scores in each subject:

subject
math         11
physics      14
chemistry    11
biology      15

Now, suppose you’re interested in the average score (median & mean) of the students, in each subject.

How would you go about obtaining those descriptive stats…? You can’t simply apply a mean() or median() function on each column, because that would give you the average frequency for the subject, not the average score.

The problem is that we have weights (the frequencies) on each score.

Pandas makes the problem a breeze. Below code with just a handful lines to set up the dataframe for the table, and to massage that dataframe to produce the descriptive stats we want:

import numpy as np
import pandas as pd

np.random.seed(666)

vals = [0,1,2,3,4,5]
nr_categories = 5

df_freq = pd.DataFrame({'math' : np.random.choice(vals,nr_categories,replace=False),
                   'physics' : np.random.choice(vals,nr_categories,replace=False),
                   'chemistry' : np.random.choice(vals,nr_categories,replace=False),
                   'biology' : np.random.choice(vals,nr_categories,replace=False)},
                        index=range(1,nr_categories + 1))

df_freq.index.name='points'
df_freq.columns.name='subject'
df_freq.sort_index(inplace=True)

df_freq_subgrps = pd.DataFrame(df_freq.unstack()) 
df_freq_subgrps.columns=['freq']
df_freq_subgrps['points'] = df_freq_subgrps.index.get_level_values(1)
df_freq_subgrps = df_freq_subgrps[['points','freq']]
df_freq_subgrps = df_freq_subgrps.droplevel(1)

serialized_points = df_freq_subgrps.groupby('subject',group_keys=False).apply(
         lambda row: np.repeat(row['points'],row['freq'])) 

stats = serialized_points.groupby('subject').agg(['median','mean','count','min','max'])
stats

The output:

About swdevperestroika

High tech industry veteran, avid hacker reluctantly transformed to mgmt consultant.
This entry was posted in Data Analytics and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s