## Python Pandas – descriptive stats from a frequency table

Suppose you are a teacher, and want to keep track of and do some simple descriptive stats on your student’s scores in a number of subjects. So you have recorded scores in a frequency table as below:

Left-most column has the scores, and the next four columns show the frequency of those scores for the four subjects.

By summing the columns we can find the number of students having obtained these scores in each subject:

```subject
math         11
physics      14
chemistry    11
biology      15```

Now, suppose you’re interested in the average score (median & mean) of the students, in each subject.

How would you go about obtaining those descriptive stats…? You can’t simply apply a mean() or median() function on each column, because that would give you the average frequency for the subject, not the average score.

The problem is that we have weights (the frequencies) on each score.

Pandas makes the problem a breeze. Below code with just a handful lines to set up the dataframe for the table, and to massage that dataframe to produce the descriptive stats we want:

```import numpy as np
import pandas as pd

np.random.seed(666)

vals = [0,1,2,3,4,5]
nr_categories = 5

df_freq = pd.DataFrame({'math' : np.random.choice(vals,nr_categories,replace=False),
'physics' : np.random.choice(vals,nr_categories,replace=False),
'chemistry' : np.random.choice(vals,nr_categories,replace=False),
'biology' : np.random.choice(vals,nr_categories,replace=False)},
index=range(1,nr_categories + 1))

df_freq.index.name='points'
df_freq.columns.name='subject'
df_freq.sort_index(inplace=True)

df_freq_subgrps = pd.DataFrame(df_freq.unstack())
df_freq_subgrps.columns=['freq']
df_freq_subgrps['points'] = df_freq_subgrps.index.get_level_values(1)
df_freq_subgrps = df_freq_subgrps[['points','freq']]
df_freq_subgrps = df_freq_subgrps.droplevel(1)

serialized_points = df_freq_subgrps.groupby('subject',group_keys=False).apply(
lambda row: np.repeat(row['points'],row['freq']))

stats = serialized_points.groupby('subject').agg(['median','mean','count','min','max'])
stats
```

The output: 