GroupBy Operations in Pandas
GroupBy operations are one of the most powerful features in pandas, allowing you to split your data into groups, apply functions to each group, and combine the results. If you're familiar with SQL's GROUP BY or Stata's collapse/bysort commands, pandas groupby provides similar functionality with much more flexibility.
What is GroupBy?
The GroupBy operation follows a split-apply-combine pattern (sketched in code below):
- Split: Divide data into groups based on one or more keys
- Apply: Perform a calculation or transformation on each group independently
- Combine: Merge the results back into a single data structure
Think of it as answering questions like:
- "What's the average salary by department?"
- "What's the total sales for each region and product?"
- "How many customers are in each age group?"
Basic GroupBy Syntax
import pandas as pd
import numpy as np
# Create sample employee data
data = {
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
'department': ['Sales', 'Engineering', 'Sales', 'Marketing', 'Engineering', 'Sales', 'Marketing', 'Engineering'],
'salary': [50000, 75000, 55000, 60000, 80000, 52000, 58000, 70000],
'years_experience': [2, 5, 3, 4, 7, 2, 3, 6],
'performance_score': [85, 92, 78, 88, 95, 80, 90, 87]
}
df = pd.DataFrame(data)
print(df)
Simple GroupBy Operation
# Group by department and calculate mean salary
dept_salary = df.groupby('department')['salary'].mean()
print("\nAverage Salary by Department:")
print(dept_salary)
Output:
department
Engineering    75000.000000
Marketing      59000.000000
Sales          52333.333333
Name: salary, dtype: float64
Comparing GroupBy with Stata Commands
For Stata users, here's how pandas groupby relates to Stata commands:
# Stata: collapse (mean) salary, by(department)
df.groupby('department')['salary'].mean()
# Stata: collapse (sum) salary (mean) experience, by(department)
df.groupby('department').agg({'salary': 'sum', 'years_experience': 'mean'})
# Stata: bysort department: egen mean_salary = mean(salary)
df['mean_salary'] = df.groupby('department')['salary'].transform('mean')
# Stata: table department, c(mean salary n salary)
df.groupby('department')['salary'].agg(['mean', 'count'])
Common Aggregation Functions
Single Aggregation
# Different aggregation functions
print("Count by department:")
print(df.groupby('department').size())
print("\nSum of salaries:")
print(df.groupby('department')['salary'].sum())
print("\nMedian salary:")
print(df.groupby('department')['salary'].median())
print("\nMin and max salary:")
print(df.groupby('department')['salary'].min())
print(df.groupby('department')['salary'].max())
print("\nStandard deviation:")
print(df.groupby('department')['salary'].std())
Multiple Aggregations with agg()
# Apply multiple aggregation functions at once
salary_stats = df.groupby('department')['salary'].agg([
'count',
'mean',
'median',
'std',
'min',
'max'
])
print("\nSalary Statistics by Department:")
print(salary_stats)
Output:
             count          mean   median          std    min    max
department
Engineering      3  75000.000000  75000.0  5000.000000  70000  80000
Marketing        2  59000.000000  59000.0  1414.213562  58000  60000
Sales            3  52333.333333  52000.0  2516.611478  50000  55000
Different Aggregations for Different Columns
# Apply different functions to different columns
summary = df.groupby('department').agg({
'salary': ['mean', 'median', 'std'],
'years_experience': ['mean', 'max'],
'performance_score': ['mean', 'min']
})
print("\nMulti-column Aggregation:")
print(summary)
Grouping by Multiple Columns
# Create more detailed sample data (seeded so results are reproducible)
np.random.seed(42)
sales_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'product': np.random.choice(['A', 'B', 'C'], 100),
    'sales': np.random.randint(100, 1000, 100),
    'units': np.random.randint(1, 20, 100)
})
# Group by multiple columns
regional_product_sales = sales_data.groupby(['region', 'product'])['sales'].agg([
'sum',
'mean',
'count'
])
print("\nSales by Region and Product:")
print(regional_product_sales)
# Access multi-level index
print("\nNorth region sales:")
print(regional_product_sales.loc['North'])
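Two more common idioms for slicing a multi-level result (a quick sketch; both assume the regional_product_sales frame built above):
# Select one (region, product) pair with a tuple key
print(regional_product_sales.loc[('North', 'A')])
# Cross-section: all regions for product 'A'
print(regional_product_sales.xs('A', level='product'))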
Transform: Keep Original DataFrame Size
The transform() method applies a function but returns a result with the same shape as the input:
# Add group statistics as new columns
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')
df['dept_max_salary'] = df.groupby('department')['salary'].transform('max')
df['salary_vs_dept_avg'] = df['salary'] - df['dept_avg_salary']
print("\nDataFrame with Department Averages:")
print(df[['name', 'department', 'salary', 'dept_avg_salary', 'salary_vs_dept_avg']])
Custom Transform Functions
# Custom transformation: standardize within groups
def standardize(x):
    return (x - x.mean()) / x.std()
df['salary_zscore_dept'] = df.groupby('department')['salary'].transform(standardize)
print("\nStandardized Salaries within Departments:")
print(df[['name', 'department', 'salary', 'salary_zscore_dept']])
Filter: Select Entire Groups
The filter() method selects entire groups based on a condition:
# Keep only departments with more than 2 employees
large_depts = df.groupby('department').filter(lambda x: len(x) > 2)
print("\nEmployees in Large Departments:")
print(large_depts)
# Keep only groups where average salary > 60000
high_salary_depts = df.groupby('department').filter(
lambda x: x['salary'].mean() > 60000
)
print("\nEmployees in High-Paying Departments:")
print(high_salary_depts[['name', 'department', 'salary']])
Apply: Custom Group Operations
The apply() method provides maximum flexibility for custom operations:
# Rank employees within each department by salary
def rank_within_group(group):
    group['dept_rank'] = group['salary'].rank(ascending=False)
    return group

# group_keys=False keeps the original flat index on the result
df_ranked = df.groupby('department', group_keys=False).apply(rank_within_group)
print("\nEmployees Ranked Within Departments:")
print(df_ranked[['name', 'department', 'salary', 'dept_rank']])
More Complex Apply Examples
# Get top 2 earners per department
def top_n_earners(group, n=2):
    return group.nlargest(n, 'salary')
top_earners = df.groupby('department').apply(top_n_earners, n=2)
print("\nTop 2 Earners per Department:")
print(top_earners[['name', 'department', 'salary']])
# Calculate custom statistics
def custom_stats(group):
    return pd.Series({
        'count': len(group),
        'salary_mean': group['salary'].mean(),
        'salary_range': group['salary'].max() - group['salary'].min(),
        'high_performers': (group['performance_score'] >= 90).sum()
    })
dept_stats = df.groupby('department').apply(custom_stats)
print("\nCustom Department Statistics:")
print(dept_stats)
Named Aggregations (Pandas 0.25+)
Create more readable aggregations with custom column names:
# Named aggregations for cleaner output
summary = df.groupby('department').agg(
total_employees=('name', 'count'),
avg_salary=('salary', 'mean'),
median_salary=('salary', 'median'),
total_payroll=('salary', 'sum'),
avg_experience=('years_experience', 'mean'),
avg_performance=('performance_score', 'mean')
).round(2)
print("\nNamed Aggregations:")
print(summary)
Practical Examples
Example 1: Sales Analysis
# Create comprehensive sales dataset
np.random.seed(42)
sales = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=365, freq='D'),
'store_id': np.random.choice(['S001', 'S002', 'S003', 'S004'], 365),
'product_category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Home'], 365),
'sales_amount': np.random.uniform(100, 5000, 365),
'units_sold': np.random.randint(1, 50, 365),
'customer_count': np.random.randint(1, 100, 365)
})
sales['month'] = sales['date'].dt.month
sales['quarter'] = sales['date'].dt.quarter
# Comprehensive sales analysis
monthly_sales = sales.groupby(['month', 'store_id']).agg({
'sales_amount': ['sum', 'mean'],
'units_sold': 'sum',
'customer_count': 'sum'
}).round(2)
print("\nMonthly Sales by Store:")
print(monthly_sales.head(10))
# Product performance by quarter
quarterly_products = sales.groupby(['quarter', 'product_category']).agg(
total_sales=('sales_amount', 'sum'),
avg_transaction=('sales_amount', 'mean'),
total_units=('units_sold', 'sum'),
total_customers=('customer_count', 'sum')
).round(2)
print("\nQuarterly Product Performance:")
print(quarterly_products)
Example 2: Customer Segmentation
# Customer data
customers = pd.DataFrame({
'customer_id': range(1, 101),
'age': np.random.randint(18, 75, 100),
'annual_income': np.random.randint(20000, 150000, 100),
'total_purchases': np.random.randint(1, 50, 100),
'loyalty_years': np.random.randint(0, 10, 100),
'region': np.random.choice(['North', 'South', 'East', 'West'], 100)
})
# Create age groups
customers['age_group'] = pd.cut(customers['age'],
bins=[0, 25, 35, 50, 100],
labels=['18-25', '26-35', '36-50', '50+'])
# Segment analysis
segment_analysis = customers.groupby(['age_group', 'region']).agg(
customer_count=('customer_id', 'count'),
avg_income=('annual_income', 'mean'),
avg_purchases=('total_purchases', 'mean'),
avg_loyalty=('loyalty_years', 'mean'),
total_revenue=('annual_income', 'sum')
).round(2)
print("\nCustomer Segment Analysis:")
print(segment_analysis)
# Identify high-value segments
high_value = segment_analysis[segment_analysis['avg_income'] > 75000]
print("\nHigh-Value Segments:")
print(high_value)
Example 3: Time-Based Grouping
# Time series data
dates = pd.date_range('2024-01-01', periods=365, freq='D')
time_data = pd.DataFrame({
'date': dates,
'temperature': np.random.normal(70, 15, 365),
'rainfall': np.random.exponential(0.5, 365),
'visitors': np.random.poisson(1000, 365)
})
time_data.set_index('date', inplace=True)
# Group by different time periods
weekly = time_data.resample('W').agg({
'temperature': 'mean',
'rainfall': 'sum',
'visitors': 'sum'
})
print("\nWeekly Aggregation:")
print(weekly.head())
# Group by month ('M' = month-end; pandas >= 2.2 prefers the alias 'ME')
monthly = time_data.resample('M').agg(
    avg_temp=('temperature', 'mean'),
    total_rain=('rainfall', 'sum'),
    total_visitors=('visitors', 'sum'),
    max_temp=('temperature', 'max'),
    # exponential draws are never exactly 0, so count days with negligible rain instead
    low_rain_days=('rainfall', lambda x: (x < 0.1).sum())
)
print("\nMonthly Summary:")
print(monthly)
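If you need a calendar split inside an ordinary groupby (for example, combined with another key), pd.Grouper gives the same effect as resample. A minimal sketch using the time_data frame above:
# Monthly visitor totals via pd.Grouper instead of resample
print(time_data.groupby(pd.Grouper(freq='M'))['visitors'].sum().head())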
Advanced GroupBy Techniques
Iterating Through Groups
# Iterate through each group
for name, group in df.groupby('department'):
print(f"\n{name} Department:")
print(group[['name', 'salary', 'performance_score']])
print(f"Average Salary: ${group['salary'].mean():,.2f}")
Multiple GroupBy Operations
# Chain multiple groupby operations
result = (df.groupby('department')
.agg({'salary': 'mean', 'performance_score': 'mean'})
.round(2)
.sort_values('salary', ascending=False))
print("\nDepartments Sorted by Average Salary:")
print(result)
Handling Missing Values in Groups
# Add some missing values
df_with_na = df.copy()
df_with_na.loc[1, 'performance_score'] = np.nan
df_with_na.loc[4, 'salary'] = np.nan
# GroupBy automatically excludes NaN
print("\nGroupBy with Missing Values:")
print(df_with_na.groupby('department')['salary'].mean())
# size() counts all rows per group, NaN values included (count() would exclude them)
print("\nRow count per group (size, NaN rows included):")
print(df_with_na.groupby('department').size())
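A related detail: if the grouping key itself contains NaN, those rows are silently dropped by default. A small sketch (pandas >= 1.1) showing how dropna=False keeps them as their own group:
# Put a NaN in the key column, then keep the NaN group
df_na_key = df_with_na.copy()
df_na_key.loc[0, 'department'] = np.nan
print(df_na_key.groupby('department', dropna=False)['salary'].mean())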
Performance Tips
Use Categorical Data Types
# Convert to categorical for better performance with many groups
df_cat = df.copy()
df_cat['department'] = df_cat['department'].astype('category')
# This is faster with large datasets
result = df_cat.groupby('department')['salary'].mean()
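With categorical keys you may also want observed=True, which reports only the categories actually present in the data (and avoids the pandas 2.x FutureWarning about this default changing):
result = df_cat.groupby('department', observed=True)['salary'].mean()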
Use Appropriate Aggregation Methods
# These are equivalent; the direct method skips a small dispatch overhead
df.groupby('department')['salary'].sum()        # direct method call
df.groupby('department')['salary'].agg('sum')   # routed through agg
# For multiple operations, agg() is more efficient
df.groupby('department')['salary'].agg(['sum', 'mean', 'count'])
Common Patterns and Idioms
Percentage of Total
# Calculate percentage of total sales by category
sales_by_cat = sales.groupby('product_category')['sales_amount'].sum()
sales_pct = (sales_by_cat / sales_by_cat.sum() * 100).round(2)
print("\nSales Percentage by Category:")
print(sales_pct)
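A transform-based variant computes each row's share of its group total while keeping the original shape (a sketch, using the sales frame from Example 1):
# Each transaction as a percentage of its category's total sales
sales['pct_of_category'] = (sales['sales_amount']
                            / sales.groupby('product_category')['sales_amount'].transform('sum')
                            * 100).round(2)
print(sales[['product_category', 'sales_amount', 'pct_of_category']].head())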
Cumulative Statistics Within Groups
# Cumulative sum within each department
df_sorted = df.sort_values(['department', 'years_experience'])
df_sorted['cumulative_salary'] = df_sorted.groupby('department')['salary'].cumsum()
print("\nCumulative Salary by Department:")
print(df_sorted[['name', 'department', 'years_experience', 'salary', 'cumulative_salary']])
Rank Within Groups
# Rank employees by performance within department
df['perf_rank'] = df.groupby('department')['performance_score'].rank(
ascending=False,
method='min'
)
print("\nPerformance Ranking Within Departments:")
print(df[['name', 'department', 'performance_score', 'perf_rank']].sort_values(
['department', 'perf_rank']
))
Practice Exercises
Exercise 1: Product Analysis
Given sales data, calculate:
- Total revenue by product and region
- Average transaction size by product
- Top 3 products by revenue in each region
Exercise 2: Customer Lifetime Value
Create customer segments based on:
- Purchase frequency groups
- Average order value groups
- Calculate metrics for each segment
Exercise 3: Time Series Aggregation
With daily transaction data:
- Calculate weekly moving averages by store
- Identify peak sales days by category
- Compare month-over-month growth
Sample Solutions
Exercise 1:
# Product analysis solution
product_analysis = sales.groupby(['region', 'product_category']).agg(
total_revenue=('sales_amount', 'sum'),
avg_transaction=('sales_amount', 'mean'),
transaction_count=('sales_amount', 'count')
).round(2)
# Top 3 products per region
top_products = (sales.groupby(['region', 'product_category'])['sales_amount']
.sum()
.groupby(level=0)
.nlargest(3))
print("Top 3 Products by Region:")
print(top_products)
Exercise 2:
# Customer segmentation solution
customers['purchase_freq_group'] = pd.qcut(customers['total_purchases'],
q=3,
labels=['Low', 'Medium', 'High'])
customers['order_value_group'] = pd.qcut(customers['annual_income'],
q=3,
labels=['Low', 'Medium', 'High'])
segment_metrics = customers.groupby(['purchase_freq_group', 'order_value_group']).agg(
customer_count=('customer_id', 'count'),
avg_purchases=('total_purchases', 'mean'),
avg_income=('annual_income', 'mean'),
lifetime_value=('annual_income', 'sum')
).round(2)
print("Customer Segment Metrics:")
print(segment_metrics)
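Exercise 3 (a sketch of one possible approach, reusing the sales frame from Example 1):
# Weekly moving average of sales by store (7-day rolling window over daily rows)
sales_sorted = sales.sort_values('date')
sales_sorted['weekly_ma'] = (sales_sorted.groupby('store_id')['sales_amount']
                             .transform(lambda x: x.rolling(7, min_periods=1).mean()))
# Peak sales day per category
peak_days = sales.loc[sales.groupby('product_category')['sales_amount'].idxmax(),
                      ['product_category', 'date', 'sales_amount']]
print(peak_days)
# Month-over-month growth of total sales
monthly_totals = sales.groupby('month')['sales_amount'].sum()
print((monthly_totals.pct_change() * 100).round(2))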
Key Takeaways
- GroupBy follows split-apply-combine pattern for powerful data aggregation
- Use agg() to run multiple aggregation functions on groups
- transform() keeps the original DataFrame size, ideal for adding group statistics as columns
- filter() selects entire groups based on conditions
- apply() provides maximum flexibility for custom group operations
- Named aggregations make results more readable
- Categorical data types improve groupby performance on large datasets
- GroupBy is essential for data analysis, reporting, and feature engineering
What's Next?
Now that you understand GroupBy operations, explore:
- Merging and Joining DataFrames - Combine data from multiple sources
- Reshaping Data (Pivot, Melt) - Transform data structure
- Time Series Analysis - Work with temporal data
Resources
- Pandas GroupBy Documentation: pandas.pydata.org/docs/user_guide/groupby.html
- GroupBy Tutorial: Step-by-step guide with real examples
- Performance Tips: Optimizing groupby operations for large datasets
GroupBy operations are the foundation of data analysis in pandas. Master these techniques and you'll be able to extract insights from any dataset efficiently.
