GroupBy Operations in Pandas
GroupBy operations are one of the most powerful features in pandas, allowing you to split your data into groups, apply functions to each group, and combine the results. If you're familiar with SQL's GROUP BY or Stata's collapse/bysort commands, pandas groupby provides similar functionality with much more flexibility.
What is GroupBy?
The GroupBy operation follows a split-apply-combine pattern (sketched in code below):
- Split: Divide data into groups based on one or more keys
- Apply: Perform a calculation or transformation on each group independently
- Combine: Merge the results back into a single data structure
Think of it as answering questions like:
- "What's the average salary by department?"
- "What's the total sales for each region and product?"
- "How many customers are in each age group?"
Basic GroupBy Syntax
import pandas as pd
import numpy as np
# Create sample employee data
data = {
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
'department': ['Sales', 'Engineering', 'Sales', 'Marketing', 'Engineering', 'Sales', 'Marketing', 'Engineering'],
'salary': [50000, 75000, 55000, 60000, 80000, 52000, 58000, 70000],
'years_experience': [2, 5, 3, 4, 7, 2, 3, 6],
'performance_score': [85, 92, 78, 88, 95, 80, 90, 87]
}
df = pd.DataFrame(data)
print(df)
Simple GroupBy Operation
# Group by department and calculate mean salary
dept_salary = df.groupby('department')['salary'].mean()
print("\nAverage Salary by Department:")
print(dept_salary)
Output:
department
Engineering    75000.000000
Marketing      59000.000000
Sales          52333.333333
Name: salary, dtype: float64
Comparing GroupBy with Stata Commands
For Stata users, here's how pandas groupby relates to Stata commands:
# Stata: collapse (mean) salary, by(department)
df.groupby('department')['salary'].mean()
# Stata: collapse (sum) salary (mean) experience, by(department)
df.groupby('department').agg({'salary': 'sum', 'years_experience': 'mean'})
# Stata: bysort department: egen mean_salary = mean(salary)
df['mean_salary'] = df.groupby('department')['salary'].transform('mean')
# Stata: table department, c(mean salary n salary)
df.groupby('department')['salary'].agg(['mean', 'count'])
Common Aggregation Functions
Single Aggregation
# Different aggregation functions
print("Count by department:")
print(df.groupby('department').size())
print("\nSum of salaries:")
print(df.groupby('department')['salary'].sum())
print("\nMedian salary:")
print(df.groupby('department')['salary'].median())
print("\nMin and max salary:")
print(df.groupby('department')['salary'].min())
print(df.groupby('department')['salary'].max())
print("\nStandard deviation:")
print(df.groupby('department')['salary'].std())
Multiple Aggregations with agg()
# Apply multiple aggregation functions at once
salary_stats = df.groupby('department')['salary'].agg([
'count',
'mean',
'median',
'std',
'min',
'max'
])
print("\nSalary Statistics by Department:")
print(salary_stats)
Output:
             count          mean   median          std    min    max
department
Engineering      3  75000.000000  75000.0  5000.000000  70000  80000
Marketing        2  59000.000000  59000.0  1414.213562  58000  60000
Sales            3  52333.333333  52000.0  2516.611478  50000  55000
Different Aggregations for Different Columns
# Apply different functions to different columns
summary = df.groupby('department').agg({
'salary': ['mean', 'median', 'std'],
'years_experience': ['mean', 'max'],
'performance_score': ['mean', 'min']
})
print("\nMulti-column Aggregation:")
print(summary)
Grouping by Multiple Columns
# Create more detailed sample data (seeded so results are reproducible)
np.random.seed(42)
sales_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'product': np.random.choice(['A', 'B', 'C'], 100),
    'sales': np.random.randint(100, 1000, 100),
    'units': np.random.randint(1, 20, 100)
})
# Group by multiple columns
regional_product_sales = sales_data.groupby(['region', 'product'])['sales'].agg([
'sum',
'mean',
'count'
])
print("\nSales by Region and Product:")
print(regional_product_sales)
# Access multi-level index
print("\nNorth region sales:")
print(regional_product_sales.loc['North'])
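Two more common idioms for slicing a multi-level result (a quick sketch; both assume the regional_product_sales frame built above):
# Select one (region, product) pair with a tuple key
print(regional_product_sales.loc[('North', 'A')])
# Cross-section: all regions for product 'A'
print(regional_product_sales.xs('A', level='product'))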
Transform: Keep Original DataFrame Size
The transform() method applies a function but returns a result with the same shape as the input:
# Add group statistics as new columns
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')
df['dept_max_salary'] = df.groupby('department')['salary'].transform('max')
df['salary_vs_dept_avg'] = df['salary'] - df['dept_avg_salary']
print("\nDataFrame with Department Averages:")
print(df[['name', 'department', 'salary', 'dept_avg_salary', 'salary_vs_dept_avg']])
Custom Transform Functions
# Custom transformation: standardize within groups
def standardize(x):
    return (x - x.mean()) / x.std()
df['salary_zscore_dept'] = df.groupby('department')['salary'].transform(standardize)
print("\nStandardized Salaries within Departments:")
print(df[['name', 'department', 'salary', 'salary_zscore_dept']])
Filter: Select Entire Groups
The filter() method selects entire groups based on a condition:
# Keep only departments with more than 2 employees
large_depts = df.groupby('department').filter(lambda x: len(x) > 2)
print("\nEmployees in Large Departments:")
print(large_depts)
# Keep only groups where average salary > 60000
high_salary_depts = df.groupby('department').filter(
lambda x: x['salary'].mean() > 60000
)
print("\nEmployees in High-Paying Departments:")
print(high_salary_depts[['name', 'department', 'salary']])
Apply: Custom Group Operations
The apply() method provides maximum flexibility for custom operations:
# Rank employees within each department by salary
def rank_within_group(group):
    group['dept_rank'] = group['salary'].rank(ascending=False)
    return group

# group_keys=False keeps the original flat index on the result
df_ranked = df.groupby('department', group_keys=False).apply(rank_within_group)
print("\nEmployees Ranked Within Departments:")
print(df_ranked[['name', 'department', 'salary', 'dept_rank']])
More Complex Apply Examples
# Get top 2 earners per department
def top_n_earners(group, n=2):
    return group.nlargest(n, 'salary')
top_earners = df.groupby('department').apply(top_n_earners, n=2)
print("\nTop 2 Earners per Department:")
print(top_earners[['name', 'department', 'salary']])
# Calculate custom statistics
def custom_stats(group):
    return pd.Series({
        'count': len(group),
        'salary_mean': group['salary'].mean(),
        'salary_range': group['salary'].max() - group['salary'].min(),
        'high_performers': (group['performance_score'] >= 90).sum()
    })
dept_stats = df.groupby('department').apply(custom_stats)
print("\nCustom Department Statistics:")
print(dept_stats)
Named Aggregations (Pandas 0.25+)
Create more readable aggregations with custom column names:
# Named aggregations for cleaner output
summary = df.groupby('department').agg(
total_employees=('name', 'count'),
avg_salary=('salary', 'mean'),
median_salary=('salary', 'median'),
total_payroll=('salary', 'sum'),
avg_experience=('years_experience', 'mean'),
avg_performance=('performance_score', 'mean')
).round(2)
print("\nNamed Aggregations:")
print(summary)
Practical Examples
Example 1: Sales Analysis
# Create comprehensive sales dataset
np.random.seed(42)
sales = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=365, freq='D'),
'store_id': np.random.choice(['S001', 'S002', 'S003', 'S004'], 365),
'product_category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Home'], 365),
'sales_amount': np.random.uniform(100, 5000, 365),
'units_sold': np.random.randint(1, 50, 365),
'customer_count': np.random.randint(1, 100, 365)
})
sales['month'] = sales['date'].dt.month
sales['quarter'] = sales['date'].dt.quarter
# Comprehensive sales analysis
monthly_sales = sales.groupby(['month', 'store_id']).agg({
'sales_amount': ['sum', 'mean'],
'units_sold': 'sum',
'customer_count': 'sum'
}).round(2)
print("\nMonthly Sales by Store:")
print(monthly_sales.head(10))
# Product performance by quarter
quarterly_products = sales.groupby(['quarter', 'product_category']).agg(
total_sales=('sales_amount', 'sum'),
avg_transaction=('sales_amount', 'mean'),
total_units=('units_sold', 'sum'),
total_customers=('customer_count', 'sum')
).round(2)
print("\nQuarterly Product Performance:")
print(quarterly_products)
Example 2: Customer Segmentation
# Customer data
customers = pd.DataFrame({
'customer_id': range(1, 101),
'age': np.random.randint(18, 75, 100),
'annual_income': np.random.randint(20000, 150000, 100),
'total_purchases': np.random.randint(1, 50, 100),
'loyalty_years': np.random.randint(0, 10, 100),
'region': np.random.choice(['North', 'South', 'East', 'West'], 100)
})
# Create age groups
customers['age_group'] = pd.cut(customers['age'],
bins=[0, 25, 35, 50, 100],
labels=['18-25', '26-35', '36-50', '50+'])
# Segment analysis
segment_analysis = customers.groupby(['age_group', 'region']).agg(
customer_count=('customer_id', 'count'),
avg_income=('annual_income', 'mean'),
avg_purchases=('total_purchases', 'mean'),
avg_loyalty=('loyalty_years', 'mean'),
total_revenue=('annual_income', 'sum')
).round(2)
print("\nCustomer Segment Analysis:")
print(segment_analysis)
# Identify high-value segments
high_value = segment_analysis[segment_analysis['avg_income'] > 75000]
print("\nHigh-Value Segments:")
print(high_value)
Example 3: Time-Based Grouping
# Time series data
dates = pd.date_range('2024-01-01', periods=365, freq='D')
time_data = pd.DataFrame({
'date': dates,
'temperature': np.random.normal(70, 15, 365),
'rainfall': np.random.exponential(0.5, 365),
'visitors': np.random.poisson(1000, 365)
})
time_data.set_index('date', inplace=True)
# Group by different time periods
weekly = time_data.resample('W').agg({
'temperature': 'mean',
'rainfall': 'sum',
'visitors': 'sum'
})
print("\nWeekly Aggregation:")
print(weekly.head())
# Group by month ('M' = month-end; pandas >= 2.2 prefers the alias 'ME')
monthly = time_data.resample('M').agg(
    avg_temp=('temperature', 'mean'),
    total_rain=('rainfall', 'sum'),
    total_visitors=('visitors', 'sum'),
    max_temp=('temperature', 'max'),
    # exponential draws are never exactly 0, so count days with negligible rain instead
    low_rain_days=('rainfall', lambda x: (x < 0.1).sum())
)
print("\nMonthly Summary:")
print(monthly)
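If you need a calendar split inside an ordinary groupby (for example, combined with another key), pd.Grouper gives the same effect as resample. A minimal sketch using the time_data frame above:
# Monthly visitor totals via pd.Grouper instead of resample
print(time_data.groupby(pd.Grouper(freq='M'))['visitors'].sum().head())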
Advanced GroupBy Techniques
Iterating Through Groups
# Iterate through each group
for name, group in df.groupby('department'):
print(f"\n{name} Department:")
print(group[['name', 'salary', 'performance_score']])
print(f"Average Salary: ${group['salary'].mean():,.2f}")
Multiple GroupBy Operations
# Chain multiple groupby operations
result = (df.groupby('department')
.agg({'salary': 'mean', 'performance_score': 'mean'})
.round(2)
.sort_values('salary', ascending=False))
print("\nDepartments Sorted by Average Salary:")
print(result)
Handling Missing Values in Groups
# Add some missing values
df_with_na = df.copy()
df_with_na.loc[1, 'performance_score'] = np.nan
df_with_na.loc[4, 'salary'] = np.nan
# GroupBy automatically excludes NaN
print("\nGroupBy with Missing Values:")
print(df_with_na.groupby('department')['salary'].mean())
# size() counts all rows per group, NaN values included (count() would exclude them)
print("\nRow count per group (size, NaN rows included):")
print(df_with_na.groupby('department').size())
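A related detail: if the grouping key itself contains NaN, those rows are silently dropped by default. A small sketch (pandas >= 1.1) showing how dropna=False keeps them as their own group:
# Put a NaN in the key column, then keep the NaN group
df_na_key = df_with_na.copy()
df_na_key.loc[0, 'department'] = np.nan
print(df_na_key.groupby('department', dropna=False)['salary'].mean())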
Performance Tips
Use Categorical Data Types
# Convert to categorical for better performance with many groups
df_cat = df.copy()
df_cat['department'] = df_cat['department'].astype('category')
# This is faster with large datasets
result = df_cat.groupby('department')['salary'].mean()
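With categorical keys you may also want observed=True, which reports only the categories actually present in the data (and avoids the pandas 2.x FutureWarning about this default changing):
result = df_cat.groupby('department', observed=True)['salary'].mean()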
Use Appropriate Aggregation Methods
# These are equivalent; the direct method skips a small dispatch overhead
df.groupby('department')['salary'].sum()        # direct method call
df.groupby('department')['salary'].agg('sum')   # routed through agg
# For multiple operations, agg() is more efficient
df.groupby('department')['salary'].agg(['sum', 'mean', 'count'])
Common Patterns and Idioms
Percentage of Total
# Calculate percentage of total sales by category
sales_by_cat = sales.groupby('product_category')['sales_amount'].sum()
sales_pct = (sales_by_cat / sales_by_cat.sum() * 100).round(2)
print("\nSales Percentage by Category:")
print(sales_pct)
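A transform-based variant computes each row's share of its group total while keeping the original shape (a sketch, using the sales frame from Example 1):
# Each transaction as a percentage of its category's total sales
sales['pct_of_category'] = (sales['sales_amount']
                            / sales.groupby('product_category')['sales_amount'].transform('sum')
                            * 100).round(2)
print(sales[['product_category', 'sales_amount', 'pct_of_category']].head())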
Cumulative Statistics Within Groups
# Cumulative sum within each department
df_sorted = df.sort_values(['department', 'years_experience'])
df_sorted['cumulative_salary'] = df_sorted.groupby('department')['salary'].cumsum()
print("\nCumulative Salary by Department:")
print(df_sorted[['name', 'department', 'years_experience', 'salary', 'cumulative_salary']])
Rank Within Groups
# Rank employees by performance within department
df['perf_rank'] = df.groupby('department')['performance_score'].rank(
ascending=False,
method='min'
)
print("\nPerformance Ranking Within Departments:")
print(df[['name', 'department', 'performance_score', 'perf_rank']].sort_values(
['department', 'perf_rank']
))
Practice Exercises
Exercise 1: Product Analysis
Given sales data, calculate:
- Total revenue by product and region
- Average transaction size by product
- Top 3 products by revenue in each region
Exercise 2: Customer Lifetime Value
Create customer segments based on:
- Purchase frequency groups
- Average order value groups
- Calculate metrics for each segment
Exercise 3: Time Series Aggregation
With daily transaction data:
- Calculate weekly moving averages by store
- Identify peak sales days by category
- Compare month-over-month growth
Sample Solutions
Exercise 1:
# Product analysis solution
product_analysis = sales.groupby(['region', 'product_category']).agg(
total_revenue=('sales_amount', 'sum'),
avg_transaction=('sales_amount', 'mean'),
transaction_count=('sales_amount', 'count')
).round(2)
# Top 3 products per region
top_products = (sales.groupby(['region', 'product_category'])['sales_amount']
.sum()
.groupby(level=0)
.nlargest(3))
print("Top 3 Products by Region:")
print(top_products)
Exercise 2:
# Customer segmentation solution
customers['purchase_freq_group'] = pd.qcut(customers['total_purchases'],
q=3,
labels=['Low', 'Medium', 'High'])
customers['order_value_group'] = pd.qcut(customers['annual_income'],
q=3,
labels=['Low', 'Medium', 'High'])
segment_metrics = customers.groupby(['purchase_freq_group', 'order_value_group']).agg(
customer_count=('customer_id', 'count'),
avg_purchases=('total_purchases', 'mean'),
avg_income=('annual_income', 'mean'),
lifetime_value=('annual_income', 'sum')
).round(2)
print("Customer Segment Metrics:")
print(segment_metrics)
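Exercise 3 (a sketch of one possible approach, reusing the sales frame from Example 1):
# Weekly moving average of sales by store (7-day rolling window over daily rows)
sales_sorted = sales.sort_values('date')
sales_sorted['weekly_ma'] = (sales_sorted.groupby('store_id')['sales_amount']
                             .transform(lambda x: x.rolling(7, min_periods=1).mean()))
# Peak sales day per category
peak_days = sales.loc[sales.groupby('product_category')['sales_amount'].idxmax(),
                      ['product_category', 'date', 'sales_amount']]
print(peak_days)
# Month-over-month growth of total sales
monthly_totals = sales.groupby('month')['sales_amount'].sum()
print((monthly_totals.pct_change() * 100).round(2))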
Key Takeaways
- GroupBy follows split-apply-combine pattern for powerful data aggregation
- Use agg() to run multiple aggregation functions on groups
- transform() keeps the original DataFrame size, ideal for adding group statistics as columns
- filter() selects entire groups based on conditions
- apply() provides maximum flexibility for custom group operations
- Named aggregations make results more readable
- Categorical data types improve groupby performance on large datasets
- GroupBy is essential for data analysis, reporting, and feature engineering
What's Next?
Now that you understand GroupBy operations, explore:
- Merging and Joining DataFrames - Combine data from multiple sources
- Reshaping Data (Pivot, Melt) - Transform data structure
- Time Series Analysis - Work with temporal data
Resources
- Pandas GroupBy Documentation: pandas.pydata.org/docs/user_guide/groupby.html
- GroupBy Tutorial: Step-by-step guide with real examples
- Performance Tips: Optimizing groupby operations for large datasets
GroupBy operations are the foundation of data analysis in pandas. Master these techniques and you'll be able to extract insights from any dataset efficiently.
