
Merging and Joining DataFrames in Pandas

Combining data from multiple sources is a fundamental task in data analysis. Pandas provides powerful, flexible methods for merging, joining, and concatenating DataFrames. Whether you're combining datasets from different files, linking related tables, or appending new records, pandas has you covered.

Why Merge and Join?

In real-world scenarios, data often lives in multiple places:

  • Customer information in one file, transactions in another
  • Multiple CSV files that need to be combined
  • Database tables that need to be linked
  • Time series data from different sources

Pandas merge operations let you:

  • Combine related datasets based on common keys
  • Add new columns from another DataFrame
  • Stack DataFrames vertically or horizontally
  • Perform SQL-like JOIN operations

Types of Combining Operations

1. Merge: SQL-style joins based on common columns

2. Join: Merge on index

3. Concat: Stack DataFrames vertically or horizontally

4. Combine: Element-wise combination with custom logic
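The first three get detailed treatment below; `combine` is less commonly used, so here's a minimal sketch of element-wise combination using toy data (not the employee example that follows):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, None, 3]})
df2 = pd.DataFrame({'a': [10, 2, 30]})

# combine_first: keep df1's values, filling gaps from df2
print(df1.combine_first(df2))

# combine: element-wise with custom logic (take the larger value)
print(df1.combine(df2, lambda s1, s2: s1.where(s1 > s2, s2)))
```

`combine_first` is the go-to for patching missing values from a second source; `combine` takes any function of two column Series.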

Basic Merge Operation

import pandas as pd
import numpy as np

# Create sample employee data
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'department_id': [101, 102, 101, 103, 102]
})

departments = pd.DataFrame({
    'dept_id': [101, 102, 103, 104],
    'dept_name': ['Sales', 'Engineering', 'Marketing', 'HR'],
    'location': ['New York', 'San Francisco', 'Boston', 'Chicago']
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)

# Basic merge (inner join by default)
merged = pd.merge(employees, departments, 
                  left_on='department_id', 
                  right_on='dept_id')

print("\nMerged Data:")
print(merged)

Merge Types (SQL JOIN Equivalents)

Inner Join (Default)

Returns only rows with matching keys in both DataFrames:

# Inner join - only matching records
inner = pd.merge(employees, departments,
                 left_on='department_id',
                 right_on='dept_id',
                 how='inner')

print("Inner Join:")
print(inner)
# Result: Only employees with valid department IDs

Left Join

Returns all rows from left DataFrame, matching rows from right:

# Left join - all employees, matching departments
left = pd.merge(employees, departments,
                left_on='department_id',
                right_on='dept_id',
                how='left')

print("Left Join:")
print(left)
# Result: All employees, NaN for departments without matches

Right Join

Returns all rows from right DataFrame, matching rows from left:

# Right join - all departments, matching employees
right = pd.merge(employees, departments,
                 left_on='department_id',
                 right_on='dept_id',
                 how='right')

print("Right Join:")
print(right)
# Result: All departments, NaN for employees without matches

Outer Join (Full Outer Join)

Returns all rows from both DataFrames:

# Outer join - all records from both
outer = pd.merge(employees, departments,
                 left_on='department_id',
                 right_on='dept_id',
                 how='outer')

print("Outer Join:")
print(outer)
# Result: All employees and all departments, NaN where no match
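To see all four join types at a glance, here's a quick sketch with two hypothetical frames (not the employee data above) comparing how many rows each produces:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'l': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'r': ['x', 'y', 'z']})

# Keys 2 and 3 match; key 1 exists only on the left, key 4 only on the right
for how in ['inner', 'left', 'right', 'outer']:
    n = len(pd.merge(left, right, on='key', how=how))
    print(f"{how}: {n} rows")
# inner: 2, left: 3, right: 3, outer: 4
```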

Stata Commands Comparison

For Stata users, here's how pandas merge relates to Stata commands:

# Stata: merge 1:1 id using otherfile
pd.merge(df1, df2, on='id', how='inner')

# Stata: merge m:1 employee_id using employees
pd.merge(transactions, employees, on='employee_id', how='left')

# Stata: merge 1:m department_id using employees
pd.merge(departments, employees, on='department_id', how='left')

# Stata: merge 1:1 id using otherfile, keep(master matched)
pd.merge(df1, df2, on='id', how='left')

# Stata: merge m:m (caution: pandas forms all pairs per key; Stata pairs rows sequentially)
pd.merge(df1, df2, on='key', how='outer')

# Stata: append using otherfile
pd.concat([df1, df2], ignore_index=True)
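One caveat for Stata users: pandas has no true m:m equivalent. Where Stata's m:m pairs rows sequentially within each key, a pandas many-to-many merge forms the full Cartesian product per key, which can multiply rows quickly:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'a'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'a'], 'y': [10, 20]})

# Two 'a' rows on each side -> 2 x 2 = 4 rows in the result
print(pd.merge(left, right, on='key'))  # 4 rows
```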

Merging on Multiple Keys

# Create sales data with multiple keys
sales = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': [1, 2, 1, 2],
    'region': ['North', 'North', 'North', 'North'],
    'revenue': [100000, 120000, 110000, 130000]
})

targets = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': [1, 2, 1, 2],
    'region': ['North', 'North', 'North', 'North'],
    'target': [95000, 115000, 105000, 125000]
})

# Merge on multiple columns
performance = pd.merge(sales, targets, 
                      on=['year', 'quarter', 'region'])

performance['vs_target'] = performance['revenue'] - performance['target']
performance['pct_of_target'] = (performance['revenue'] / performance['target'] * 100).round(2)

print("Performance vs Target:")
print(performance)

Merge with Index

Join on Index

# Set index before merging
employees_idx = employees.set_index('emp_id')
salaries = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'salary': [50000, 75000, 55000, 60000, 80000]
}).set_index('emp_id')

# Join on index (shorthand for merge on index)
emp_with_salary = employees_idx.join(salaries)

print("Join on Index:")
print(emp_with_salary)

# Equivalent merge operation
emp_with_salary2 = pd.merge(employees_idx, salaries,
                            left_index=True,
                            right_index=True)

Merge Index with Column

# Merge using the left DataFrame's index and a column on the right
# (left_on and left_index are mutually exclusive -- pick one per side)
bonuses = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'bonus': [5000, 7500, 5500, 6000, 8000]
})

result = pd.merge(employees_idx, bonuses,
                  left_index=True,
                  right_on='emp_id')

print("Merge Index with Column:")
print(result)

Concatenation (Stacking DataFrames)

Vertical Concatenation (Stacking Rows)

# Stack DataFrames vertically
q1_sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'product': np.random.choice(['A', 'B', 'C'], 10),
    'sales': np.random.randint(100, 1000, 10)
})

q2_sales = pd.DataFrame({
    'date': pd.date_range('2024-04-01', periods=10, freq='D'),
    'product': np.random.choice(['A', 'B', 'C'], 10),
    'sales': np.random.randint(100, 1000, 10)
})

# Concatenate vertically
full_sales = pd.concat([q1_sales, q2_sales], ignore_index=True)

print("Concatenated Sales Data:")
print(full_sales)
print(f"\nTotal rows: {len(full_sales)}")

Horizontal Concatenation (Adding Columns)

# Stack DataFrames horizontally
basic_info = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

contact_info = pd.DataFrame({
    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com'],
    'phone': ['555-0001', '555-0002', '555-0003']
})

# Concatenate horizontally
full_info = pd.concat([basic_info, contact_info], axis=1)

print("Horizontally Concatenated:")
print(full_info)
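Note that axis=1 concatenation aligns on the index, not on row position. If the indexes don't line up, you get NaN rather than an error; a quick sketch with toy frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'y': [10, 20]}, index=[1, 2])

# Only index label 1 overlaps; labels 0 and 2 get NaN in the other column
print(pd.concat([a, b], axis=1))
```

To stack purely by position instead, call reset_index(drop=True) on each frame first.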

Concatenation with Keys

# Add hierarchical index to identify source
northeast_sales = pd.DataFrame({
    'store': ['A', 'B'],
    'sales': [1000, 1200]
})

southeast_sales = pd.DataFrame({
    'store': ['C', 'D'],
    'sales': [900, 1100]
})

# Concatenate with keys to identify regions
regional_sales = pd.concat(
    [northeast_sales, southeast_sales],
    keys=['Northeast', 'Southeast'],
    names=['Region', 'Index']
)

print("Sales with Region Keys:")
print(regional_sales)

# Access specific region
print("\nNortheast Sales:")
print(regional_sales.loc['Northeast'])

Handling Duplicate Column Names

# DataFrames with overlapping column names
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [10, 20, 30]
})

df2 = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [100, 200, 300]
})

# Merge with suffixes to distinguish columns
merged_with_suffix = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))

print("Merge with Suffixes:")
print(merged_with_suffix)

Indicator Column (Merge Validation)

Track which DataFrame each row came from:

# Add indicator to show merge source
employees_subset = employees[employees['emp_id'] <= 3]

merged_indicator = pd.merge(employees_subset, departments,
                           left_on='department_id',
                           right_on='dept_id',
                           how='outer',
                           indicator=True)

print("Merge with Indicator:")
print(merged_indicator)

# Filter based on merge source
print("\nOnly in left DataFrame:")
print(merged_indicator[merged_indicator['_merge'] == 'left_only'])

print("\nOnly in right DataFrame:")
print(merged_indicator[merged_indicator['_merge'] == 'right_only'])

Practical Examples

Example 1: Customer Order Analysis

# Customer information
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'customer_name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'Alice Brown', 'Charlie Davis'],
    'segment': ['Premium', 'Standard', 'Premium', 'Standard', 'Premium'],
    'join_date': pd.date_range('2023-01-01', periods=5, freq='M')
})

# Order history
orders = pd.DataFrame({
    'order_id': range(1, 11),
    'customer_id': [101, 102, 101, 103, 104, 101, 102, 105, 103, 104],
    'order_date': pd.date_range('2024-01-01', periods=10, freq='W'),
    'order_amount': [250, 180, 320, 450, 200, 300, 220, 500, 380, 190]
})

# Product details
products = pd.DataFrame({
    'order_id': range(1, 11),
    'product_category': ['Electronics', 'Clothing', 'Electronics', 'Home', 
                        'Clothing', 'Electronics', 'Books', 'Home', 'Electronics', 'Books'],
    'product_name': ['Laptop', 'Shirt', 'Mouse', 'Lamp', 'Jeans', 
                    'Keyboard', 'Novel', 'Chair', 'Monitor', 'Magazine']
})

# Merge all three DataFrames
# Step 1: Merge orders with customers
orders_with_customers = pd.merge(orders, customers, on='customer_id', how='left')

# Step 2: Merge with products
complete_data = pd.merge(orders_with_customers, products, on='order_id', how='left')

print("Complete Order Data:")
print(complete_data[['order_id', 'customer_name', 'segment', 'order_amount', 
                     'product_category', 'product_name']].head(10))

# Analysis: Sales by customer segment and category
segment_category_sales = complete_data.groupby(['segment', 'product_category'])['order_amount'].agg([
    'sum', 'mean', 'count'
]).round(2)

print("\nSales by Segment and Category:")
print(segment_category_sales)

Example 2: Time Series Data Integration

# Weather data
weather = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=30, freq='D'),
    'temperature': np.random.normal(70, 10, 30),
    'rainfall': np.random.exponential(0.3, 30)
})

# Sales data (some missing dates)
sales_dates = pd.date_range('2024-01-01', periods=30, freq='D')
sales_dates = np.random.choice(sales_dates, size=25, replace=False)

daily_sales = pd.DataFrame({
    'date': sorted(sales_dates),
    'daily_revenue': np.random.uniform(5000, 15000, 25)
})

# Merge to see impact of weather on sales
weather_sales = pd.merge(weather, daily_sales, on='date', how='left')

print("Weather and Sales Data:")
print(weather_sales.head(10))

# Fill missing sales with 0 (store was closed)
weather_sales['daily_revenue'] = weather_sales['daily_revenue'].fillna(0)

# Calculate correlation
print(f"\nCorrelation between temperature and sales: {weather_sales['temperature'].corr(weather_sales['daily_revenue']):.3f}")
print(f"Correlation between rainfall and sales: {weather_sales['rainfall'].corr(weather_sales['daily_revenue']):.3f}")

Example 3: Employee-Manager Hierarchy

# Self-join example: Employees with their managers
employees_full = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5, 6],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'manager_id': [None, 1, 1, 2, 2, 3]
})

# Self-merge to get manager names
emp_with_managers = pd.merge(
    employees_full,
    employees_full[['emp_id', 'name']],
    left_on='manager_id',
    right_on='emp_id',
    how='left',
    suffixes=('', '_manager')
)

emp_with_managers = emp_with_managers[['emp_id', 'name', 'name_manager']]
emp_with_managers.columns = ['emp_id', 'employee', 'manager']

print("Employee-Manager Relationships:")
print(emp_with_managers)

Advanced Merging Techniques

Fuzzy Matching (Non-Exact Matches)

# Merge with non-exact string matching using difflib (standard library);
# third-party libraries like thefuzz offer more robust scoring
from difflib import get_close_matches

def fuzzy_merge(df1, df2, key1, key2, threshold=80):
    """
    Merge DataFrames on the closest string match.
    threshold is a 0-100 similarity cutoff.
    """
    df1 = df1.copy()  # avoid mutating the caller's DataFrame
    matches = []
    for val1 in df1[key1]:
        match = get_close_matches(val1, df2[key2], n=1, cutoff=threshold / 100)
        matches.append(match[0] if match else None)

    df1['matched_key'] = matches
    return pd.merge(df1, df2, left_on='matched_key', right_on=key2, how='left')
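The helper above rests on difflib.get_close_matches, whose cutoff is a 0-1 similarity ratio. Here's its behavior in isolation, with made-up names:

```python
from difflib import get_close_matches

names = ['Jon Smith', 'Jane Doe', 'Bob Jones']

# A near-match clears the 0.8 cutoff; an unrelated string returns nothing
print(get_close_matches('John Smith', names, n=1, cutoff=0.8))  # ['Jon Smith']
print(get_close_matches('Zelda', names, n=1, cutoff=0.8))       # []
```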

Merge Asof (Nearest Key Match)

# Merge based on nearest time (useful for time series)
quotes = pd.DataFrame({
    'time': pd.date_range('2024-01-01 09:00', periods=5, freq='30min'),
    'ticker': ['AAPL'] * 5,
    'price': [150, 151, 149, 152, 150]
})

trades = pd.DataFrame({
    'time': pd.date_range('2024-01-01 09:15', periods=3, freq='45min'),
    'ticker': ['AAPL'] * 3,
    'quantity': [100, 200, 150]
})

# Merge asof: Match each trade to the most recent quote
merged_asof = pd.merge_asof(trades, quotes, on='time', by='ticker')

print("Trades with Nearest Quote Price:")
print(merged_asof)

Cross Join (Cartesian Product)

# Create all possible combinations
sizes = pd.DataFrame({'size': ['Small', 'Medium', 'Large']})
colors = pd.DataFrame({'color': ['Red', 'Blue', 'Green']})

# Cross join (all combinations)
products_matrix = pd.merge(sizes, colors, how='cross')

print("All Product Combinations:")
print(products_matrix)

Validation and Troubleshooting

Validate Merge Results

# Check for one-to-one merge
try:
    result = pd.merge(df1, df2, on='key', validate='one_to_one')
except pd.errors.MergeError as e:
    print(f"Merge validation failed: {e}")

# Valid options: 'one_to_one', 'one_to_many', 'many_to_one', 'many_to_many'

Check for Duplicates Before Merging

# Identify duplicate keys
print("Duplicate employee IDs:")
print(employees[employees['emp_id'].duplicated()])

# Remove duplicates before merging
employees_clean = employees.drop_duplicates(subset='emp_id')

Handle Missing Keys

# Fill missing keys before merge
departments_filled = departments.copy()
departments_filled['dept_id'] = departments_filled['dept_id'].fillna(0)

# Or filter out rows with missing keys
departments_clean = departments.dropna(subset=['dept_id'])

Performance Optimization

Merge Large DataFrames

# For large datasets, consider:
# 1. Sorting both DataFrames by the merge key (required for merge_asof,
#    and can speed up downstream sorted operations)
df1_sorted = df1.sort_values('key')
df2_sorted = df2.sort_values('key')
result = pd.merge(df1_sorted, df2_sorted, on='key')

# 2. Use appropriate dtypes (categories for repeated strings)
df1['category_col'] = df1['category_col'].astype('category')

# 3. Merge in chunks for very large datasets
chunk_size = 10000
merged_chunks = []
for i in range(0, len(df1), chunk_size):
    chunk = df1[i:i+chunk_size]
    merged_chunk = pd.merge(chunk, df2, on='key')
    merged_chunks.append(merged_chunk)

result = pd.concat(merged_chunks, ignore_index=True)

Practice Exercises

Exercise 1: Sales Analysis

Given three DataFrames (customers, orders, order_items):

  • Merge to get complete order details
  • Calculate total revenue per customer
  • Find customers with orders but no items

Exercise 2: Student Grades

Merge student info, course enrollment, and grades:

  • Calculate GPA per student
  • Find students enrolled but missing grades
  • Compare grades across different semesters

Exercise 3: Inventory Management

Combine product catalog, inventory, and sales:

  • Identify out-of-stock items
  • Calculate inventory turnover
  • Find products never ordered

Sample Solutions

Exercise 1:

# Sample solution
full_orders = pd.merge(orders, customers, on='customer_id', how='left')
complete = pd.merge(full_orders, order_items, on='order_id', how='left')

# Total revenue per customer
customer_revenue = complete.groupby('customer_name')['amount'].sum().sort_values(ascending=False)
print("Top Customers:")
print(customer_revenue.head())

# Orders with no items (data quality check)
orders_no_items = pd.merge(orders, order_items, on='order_id', how='left', indicator=True)
orphan_orders = orders_no_items[orders_no_items['_merge'] == 'left_only']
print(f"\nOrders without items: {len(orphan_orders)}")

Key Takeaways

  • Use pd.merge() for SQL-style joins on columns
  • Use .join() for merging on index (shorthand)
  • Use pd.concat() to stack DataFrames vertically or horizontally
  • Choose appropriate join type: inner, left, right, or outer
  • Use indicator=True to track merge sources
  • Validate merge results to catch data quality issues
  • Consider performance for large datasets
  • Handle duplicates and missing values before merging

What's Next?

Now that you can combine DataFrames from multiple sources, you're ready to tackle richer analyses on linked data.

Mastering merge and join operations is essential for working with real-world data. Practice with different join types and datasets to build your confidence.
