
Statistical Operations in Pandas

Statistical analysis is at the heart of data science, and pandas provides a comprehensive suite of statistical functions and methods. This guide covers everything from basic descriptive statistics to advanced statistical operations, making it easy to extract meaningful insights from your data.

Why Statistical Operations Matter

Statistical analysis helps you:

  • Understand your data: Discover patterns, trends, and relationships
  • Make informed decisions: Base conclusions on quantitative evidence
  • Identify anomalies: Spot outliers and unusual patterns
  • Compare groups: Test hypotheses and measure differences
  • Predict outcomes: Build models based on historical patterns
  • Communicate insights: Present findings with statistical backing

Setting Up for Statistical Analysis

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 3)

# Create a comprehensive dataset for examples
np.random.seed(42)
n_samples = 1000

data = {
    'customer_id': range(1001, 1001 + n_samples),
    'age': np.random.normal(35, 12, n_samples).astype(int),
    'income': np.random.lognormal(10.5, 0.8, n_samples),
    'education_years': np.random.normal(14, 3, n_samples),
    'experience': np.random.exponential(8, n_samples),
    'purchase_amount': np.random.gamma(2, 50, n_samples),
    'satisfaction_score': np.random.normal(7, 1.5, n_samples),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_samples, p=[0.3, 0.25, 0.25, 0.2]),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_samples, p=[0.4, 0.3, 0.2, 0.1]),
    'is_premium': np.random.choice([True, False], n_samples, p=[0.3, 0.7]),
    'months_since_signup': np.random.poisson(24, n_samples)
}

# Create DataFrame
df = pd.DataFrame(data)

# Clean up some values to be more realistic
df['age'] = np.clip(df['age'], 18, 80)
df['education_years'] = np.clip(df['education_years'], 8, 22).astype(int)
df['experience'] = np.clip(df['experience'], 0, 40).round(1)
df['satisfaction_score'] = np.clip(df['satisfaction_score'], 1, 10).round(1)
df['income'] = df['income'].round(2)
df['purchase_amount'] = df['purchase_amount'].round(2)

print("Dataset created for statistical analysis:")
print(df.head())
print(f"Dataset shape: {df.shape}")

Basic Descriptive Statistics

Central Tendency Measures

print("=== CENTRAL TENDENCY MEASURES ===")

# Mean, median, and mode for numerical columns
numerical_cols = ['age', 'income', 'education_years', 'experience', 'purchase_amount', 'satisfaction_score']

print("Central tendency summary:")
central_tendency = pd.DataFrame({
    'Mean': df[numerical_cols].mean(),
    'Median': df[numerical_cols].median(),
    'Mode': df[numerical_cols].mode().iloc[0] if len(df) > 0 else np.nan
}).round(2)

print(central_tendency)

# Detailed analysis for a specific variable
print(f"\n--- INCOME ANALYSIS ---")
income_stats = {
    'Mean': df['income'].mean(),
    'Median': df['income'].median(),
    'Trimmed Mean (5%)': stats.trim_mean(df['income'], 0.05),
    'Harmonic Mean': stats.hmean(df['income']),
    'Geometric Mean': stats.gmean(df['income'])
}

for stat_name, value in income_stats.items():
    print(f"{stat_name}: ${value:,.2f}")

# Mode analysis for categorical variables
print(f"\n--- CATEGORICAL MODE ANALYSIS ---")
categorical_cols = ['region', 'product_category', 'is_premium']
for col in categorical_cols:
    mode_value = df[col].mode().iloc[0]
    mode_count = df[col].value_counts().iloc[0]
    mode_percentage = (mode_count / len(df)) * 100
    print(f"{col}: {mode_value} ({mode_count} occurrences, {mode_percentage:.1f}%)")

Measures of Dispersion

print("=== MEASURES OF DISPERSION ===")

# Variance and Standard Deviation
dispersion_stats = pd.DataFrame({
    'Variance': df[numerical_cols].var(),
    'Std Deviation': df[numerical_cols].std(),
    'Mean Abs Deviation': df[numerical_cols].apply(lambda x: np.mean(np.abs(x - x.mean()))),
    'Coefficient of Variation': (df[numerical_cols].std() / df[numerical_cols].mean()) * 100
}).round(3)

print("Dispersion measures:")
print(dispersion_stats)

# Range and Interquartile Range
print(f"\n--- RANGE ANALYSIS ---")
range_stats = pd.DataFrame({
    'Min': df[numerical_cols].min(),
    'Max': df[numerical_cols].max(),
    'Range': df[numerical_cols].max() - df[numerical_cols].min(),
    'IQR': df[numerical_cols].quantile(0.75) - df[numerical_cols].quantile(0.25)
}).round(2)

print(range_stats)

# Detailed percentile analysis
print(f"\n--- PERCENTILE ANALYSIS (Income) ---")
income_percentiles = df['income'].quantile([0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99])
print("Income percentiles:")
for percentile, value in income_percentiles.items():
    print(f"{percentile*100:4.0f}th percentile: ${value:8,.2f}")

Distribution Shape Analysis

print("=== DISTRIBUTION SHAPE ANALYSIS ===")

# Skewness and Kurtosis
shape_stats = pd.DataFrame({
    'Skewness': df[numerical_cols].skew(),
    'Kurtosis': df[numerical_cols].kurtosis(),
    'Jarque-Bera Test (p-value)': df[numerical_cols].apply(
        lambda x: stats.jarque_bera(x.dropna())[1] if len(x.dropna()) > 0 else np.nan
    )
}).round(4)

print("Distribution shape measures:")
print(shape_stats)

# Interpretation of skewness and kurtosis
print(f"\n--- INTERPRETATION ---")
for col in numerical_cols:
    skewness = df[col].skew()
    kurtosis = df[col].kurtosis()
    
    # Skewness interpretation
    if abs(skewness) < 0.5:
        skew_desc = "approximately symmetric"
    elif skewness < -0.5:
        skew_desc = "left-skewed (negative)"
    else:
        skew_desc = "right-skewed (positive)"
    
    # Kurtosis interpretation
    if abs(kurtosis) < 0.5:
        kurt_desc = "normal kurtosis (mesokurtic)"
    elif kurtosis < -0.5:
        kurt_desc = "light tails (platykurtic)"
    else:
        kurt_desc = "heavy tails (leptokurtic)"
    
    print(f"{col}: {skew_desc}, {kurt_desc}")

# Normality tests
print(f"\n--- NORMALITY TESTS ---")
for col in numerical_cols[:4]:  # Test first 4 columns
    data_col = df[col].dropna()
    
    # Shapiro-Wilk test (appropriate for samples up to ~5000 observations)
    if len(data_col) <= 5000:
        shapiro_stat, shapiro_p = stats.shapiro(data_col)
    else:
        shapiro_stat, shapiro_p = np.nan, np.nan
    
    # Anderson-Darling test
    anderson_result = stats.anderson(data_col, dist='norm')
    
    # Kolmogorov-Smirnov test
    ks_stat, ks_p = stats.kstest(data_col, 'norm', args=(data_col.mean(), data_col.std()))
    
    print(f"\n{col}:")
    if not np.isnan(shapiro_p):
        print(f"  Shapiro-Wilk: p = {shapiro_p:.4f} ({'Normal' if shapiro_p > 0.05 else 'Non-normal'})")
    print(f"  Anderson-Darling: statistic = {anderson_result.statistic:.4f}")
    print(f"  Kolmogorov-Smirnov: p = {ks_p:.4f} ({'Normal' if ks_p > 0.05 else 'Non-normal'})")

Correlation Analysis

print("=== CORRELATION ANALYSIS ===")

# Pearson correlation matrix
correlation_matrix = df[numerical_cols].corr()
print("Pearson correlation matrix:")
print(correlation_matrix.round(3))

# Find strong correlations
def find_strong_correlations(corr_matrix, threshold=0.7):
    """Find pairs of variables with strong correlations"""
    strong_corr = []
    
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            corr_value = corr_matrix.iloc[i, j]
            if abs(corr_value) >= threshold:
                strong_corr.append({
                    'Variable 1': corr_matrix.columns[i],
                    'Variable 2': corr_matrix.columns[j],
                    'Correlation': corr_value,
                    'Strength': 'Strong Positive' if corr_value > 0 else 'Strong Negative'
                })
    
    return pd.DataFrame(strong_corr)

strong_correlations = find_strong_correlations(correlation_matrix, threshold=0.3)
if not strong_correlations.empty:
    print(f"\nStrong correlations (|r| >= 0.3):")
    print(strong_correlations)
else:
    print(f"\nNo strong correlations found (threshold = 0.3)")

# Spearman rank correlation (for non-linear relationships)
spearman_corr = df[numerical_cols].corr(method='spearman')
print(f"\nSpearman rank correlations (top 3 pairs):")
spearman_pairs = []
for i in range(len(spearman_corr.columns)):
    for j in range(i+1, len(spearman_corr.columns)):
        spearman_pairs.append({
            'Pair': f"{spearman_corr.columns[i]} - {spearman_corr.columns[j]}",
            'Spearman': spearman_corr.iloc[i, j],
            'Pearson': correlation_matrix.iloc[i, j]
        })

spearman_df = pd.DataFrame(spearman_pairs).sort_values('Spearman', key=abs, ascending=False)
print(spearman_df.head(3).round(3))

# Correlation significance testing
def correlation_significance_test(x, y, alpha=0.05):
    """Test statistical significance of correlation"""
    corr_coef, p_value = stats.pearsonr(x, y)
    is_significant = p_value < alpha
    
    return {
        'correlation': corr_coef,
        'p_value': p_value,
        'significant': is_significant,
        'confidence_level': (1 - alpha) * 100
    }

# Test significance for a few key pairs
print(f"\n--- CORRELATION SIGNIFICANCE TESTS ---")
test_pairs = [
    ('age', 'experience'),
    ('income', 'education_years'),
    ('purchase_amount', 'satisfaction_score')
]

for var1, var2 in test_pairs:
    result = correlation_significance_test(df[var1], df[var2])
    print(f"{var1} - {var2}:")
    print(f"  Correlation: {result['correlation']:.4f}")
    print(f"  P-value: {result['p_value']:.4f}")
    print(f"  Significant at 95% level: {result['significant']}")

Group Statistics and Comparisons

print("=== GROUP STATISTICS AND COMPARISONS ===")

# Group by categorical variables
print("--- GROUP STATISTICS BY REGION ---")
region_stats = df.groupby('region')[numerical_cols].agg(['count', 'mean', 'std', 'median']).round(2)
print("Income statistics by region:")
print(region_stats['income'])

# Compare groups using descriptive statistics
print(f"\n--- PRODUCT CATEGORY COMPARISON ---")
category_comparison = df.groupby('product_category')['purchase_amount'].agg([
    'count', 'mean', 'std', 'median', 'min', 'max'
]).round(2)
print(category_comparison)

# Statistical tests for group differences
print(f"\n--- STATISTICAL TESTS FOR GROUP DIFFERENCES ---")

# T-test: Compare premium vs non-premium customers
premium_income = df[df['is_premium'] == True]['income']
regular_income = df[df['is_premium'] == False]['income']

# Two-sample t-test
t_stat, t_p_value = stats.ttest_ind(premium_income, regular_income)
print(f"T-test (Premium vs Regular Income):")
print(f"  Premium mean: ${premium_income.mean():,.2f}")
print(f"  Regular mean: ${regular_income.mean():,.2f}")
print(f"  T-statistic: {t_stat:.4f}")
print(f"  P-value: {t_p_value:.4f}")
print(f"  Significant difference: {t_p_value < 0.05}")

# Mann-Whitney U test (non-parametric alternative)
u_stat, u_p_value = stats.mannwhitneyu(premium_income, regular_income, alternative='two-sided')
print(f"\nMann-Whitney U test:")
print(f"  U-statistic: {u_stat}")
print(f"  P-value: {u_p_value:.4f}")

# ANOVA: Compare satisfaction across multiple regions
print(f"\n--- ONE-WAY ANOVA ---")
region_groups = [group['satisfaction_score'].values for name, group in df.groupby('region')]
f_stat, anova_p_value = stats.f_oneway(*region_groups)

print(f"ANOVA - Satisfaction Score by Region:")
print(f"  F-statistic: {f_stat:.4f}")
print(f"  P-value: {anova_p_value:.4f}")
print(f"  Significant difference between regions: {anova_p_value < 0.05}")

# Post-hoc analysis if ANOVA is significant
if anova_p_value < 0.05:
    print(f"\nPost-hoc analysis (Tukey HSD):")
    from scipy.stats import tukey_hsd
    
    # Note: tukey_hsd might not be available in all scipy versions
    # Alternative approach using pairwise t-tests with Bonferroni correction
    regions = df['region'].unique()
    n_comparisons = len(regions) * (len(regions) - 1) // 2
    alpha_corrected = 0.05 / n_comparisons
    
    print(f"Pairwise comparisons (Bonferroni corrected α = {alpha_corrected:.4f}):")
    for i, region1 in enumerate(regions):
        for region2 in regions[i+1:]:
            group1 = df[df['region'] == region1]['satisfaction_score']
            group2 = df[df['region'] == region2]['satisfaction_score']
            t_stat, p_val = stats.ttest_ind(group1, group2)
            significant = p_val < alpha_corrected
            print(f"  {region1} vs {region2}: p = {p_val:.4f} {'*' if significant else ''}")

Time Series and Trend Analysis

print("=== TIME SERIES AND TREND ANALYSIS ===")

# Create a time series component
np.random.seed(42)
dates = pd.date_range('2020-01-01', '2023-12-31', freq='D')
trend = np.linspace(100, 200, len(dates))
seasonal = 20 * np.sin(2 * np.pi * np.arange(len(dates)) / 365.25)
noise = np.random.normal(0, 10, len(dates))
ts_values = trend + seasonal + noise

ts_df = pd.DataFrame({
    'date': dates,
    'value': ts_values,
    'month': dates.month,
    'year': dates.year,
    'day_of_week': dates.dayofweek
})

print("Time series data created:")
print(ts_df.head())

# Basic time series statistics
print(f"\n--- TIME SERIES DESCRIPTIVE STATISTICS ---")
print(f"Time period: {ts_df['date'].min()} to {ts_df['date'].max()}")
print(f"Number of observations: {len(ts_df)}")
print(f"Value statistics:")
print(ts_df['value'].describe().round(2))

# Trend analysis
print(f"\n--- TREND ANALYSIS ---")
# Linear trend
slope, intercept, r_value, p_value, std_err = stats.linregress(
    range(len(ts_df)), ts_df['value']
)
print(f"Linear trend:")
print(f"  Slope: {slope:.4f} units per day")
print(f"  R-squared: {r_value**2:.4f}")
print(f"  P-value: {p_value:.4f}")
print(f"  Trend is {'significant' if p_value < 0.05 else 'not significant'}")

# Seasonal analysis
monthly_stats = ts_df.groupby('month')['value'].agg(['mean', 'std']).round(2)
print(f"\nMonthly statistics:")
print(monthly_stats)

# Autocorrelation analysis
def calculate_autocorrelation(series, max_lags=30):
    """Calculate autocorrelation for different lags"""
    autocorr = []
    for lag in range(1, max_lags + 1):
        corr = series.corr(series.shift(lag))
        autocorr.append({'lag': lag, 'autocorr': corr})
    return pd.DataFrame(autocorr)

autocorr_results = calculate_autocorrelation(ts_df['value'])
print(f"\nAutocorrelation analysis (top 5 lags):")
print(autocorr_results.sort_values('autocorr', key=abs, ascending=False).head())

# Moving averages and smoothing
ts_df['ma_7'] = ts_df['value'].rolling(window=7).mean()
ts_df['ma_30'] = ts_df['value'].rolling(window=30).mean()
ts_df['ewm_alpha_0_1'] = ts_df['value'].ewm(alpha=0.1).mean()

print(f"\nSmoothing comparison (last 5 values):")
print(ts_df[['date', 'value', 'ma_7', 'ma_30', 'ewm_alpha_0_1']].tail())

Advanced Statistical Operations

print("=== ADVANCED STATISTICAL OPERATIONS ===")

# Regression analysis
print("--- LINEAR REGRESSION ANALYSIS ---")
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Predict income based on age, education, and experience
X = df[['age', 'education_years', 'experience']]
y = df['income']

# Fit regression model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Regression statistics
r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)

print(f"Multiple Linear Regression Results:")
print(f"  R-squared: {r2:.4f}")
print(f"  RMSE: ${rmse:,.2f}")
print(f"  Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"    {feature}: {coef:.2f}")
print(f"  Intercept: {model.intercept_:.2f}")

# Statistical significance of predictors (using scipy)
from scipy.stats import t
n = len(X)
k = len(X.columns)
dof = n - k - 1

# Calculate t-statistics and p-values for the slope coefficients
# (add an intercept column to the design matrix so the standard errors are estimated correctly)
X_design = np.column_stack([np.ones(n), X.values])
residuals = y - y_pred
mse_res = np.sum(residuals**2) / dof
var_coef = mse_res * np.linalg.inv(X_design.T @ X_design).diagonal()[1:]  # drop the intercept term
se_coef = np.sqrt(var_coef)
t_stats = model.coef_ / se_coef
p_values = 2 * (1 - t.cdf(np.abs(t_stats), dof))

print(f"\nCoefficient significance:")
for feature, coef, se, t_stat, p_val in zip(X.columns, model.coef_, se_coef, t_stats, p_values):
    significance = "***" if p_val < 0.001 else "**" if p_val < 0.01 else "*" if p_val < 0.05 else ""
    print(f"  {feature}: {coef:.4f} (SE: {se:.4f}, t: {t_stat:.2f}, p: {p_val:.4f}) {significance}")

# Chi-square test for independence
print(f"\n--- CHI-SQUARE TEST OF INDEPENDENCE ---")
contingency_table = pd.crosstab(df['region'], df['product_category'])
print("Contingency table (Region vs Product Category):")
print(contingency_table)

chi2, chi2_p, dof, expected = stats.chi2_contingency(contingency_table)
print(f"\nChi-square test results:")
print(f"  Chi-square statistic: {chi2:.4f}")
print(f"  Degrees of freedom: {dof}")
print(f"  P-value: {chi2_p:.4f}")
print(f"  Significant association: {chi2_p < 0.05}")

# Cramér's V (effect size for chi-square)
n = contingency_table.sum().sum()
cramers_v = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))
print(f"  Cramér's V (effect size): {cramers_v:.4f}")

Statistical Hypothesis Testing

print("=== STATISTICAL HYPOTHESIS TESTING ===")

# One-sample t-test
print("--- ONE-SAMPLE T-TEST ---")
# Test if average satisfaction score differs from 7.0
hypothesized_mean = 7.0
satisfaction_data = df['satisfaction_score'].dropna()

t_stat, p_value = stats.ttest_1samp(satisfaction_data, hypothesized_mean)
sample_mean = satisfaction_data.mean()
sample_std = satisfaction_data.std()
n = len(satisfaction_data)
se = sample_std / np.sqrt(n)

# Confidence interval
confidence_level = 0.95
alpha = 1 - confidence_level
t_critical = stats.t.ppf(1 - alpha/2, n-1)
margin_error = t_critical * se
ci_lower = sample_mean - margin_error
ci_upper = sample_mean + margin_error

print(f"One-sample t-test (H0: ÎĽ = {hypothesized_mean}):")
print(f"  Sample mean: {sample_mean:.3f}")
print(f"  Sample std: {sample_std:.3f}")
print(f"  T-statistic: {t_stat:.4f}")
print(f"  P-value: {p_value:.4f}")
print(f"  {confidence_level*100}% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
print(f"  Reject H0: {p_value < 0.05}")

# Paired t-test example (create paired data)
print(f"\n--- PAIRED T-TEST ---")
# Simulate before/after scenario
np.random.seed(42)
before_scores = df['satisfaction_score'].sample(100).values
after_scores = before_scores + np.random.normal(0.5, 1, 100)  # Small improvement

paired_t_stat, paired_p_value = stats.ttest_rel(after_scores, before_scores)
mean_difference = np.mean(after_scores - before_scores)

print(f"Paired t-test (Before vs After satisfaction):")
print(f"  Before mean: {np.mean(before_scores):.3f}")
print(f"  After mean: {np.mean(after_scores):.3f}")
print(f"  Mean difference: {mean_difference:.3f}")
print(f"  T-statistic: {paired_t_stat:.4f}")
print(f"  P-value: {paired_p_value:.4f}")
print(f"  Significant improvement: {paired_p_value < 0.05}")

# Power analysis
print(f"\n--- STATISTICAL POWER ANALYSIS ---")
def calculate_power(effect_size, sample_size, alpha=0.05):
    """Calculate statistical power for a two-tailed one-sample t-test"""
    # Critical value of the central t distribution
    t_critical = stats.t.ppf(1 - alpha/2, sample_size - 1)
    
    # Non-centrality parameter
    ncp = effect_size * np.sqrt(sample_size)
    
    # Power from the noncentral t distribution
    power = (1 - stats.nct.cdf(t_critical, sample_size - 1, ncp)
             + stats.nct.cdf(-t_critical, sample_size - 1, ncp))
    return power

# Calculate power for different scenarios
effect_sizes = [0.2, 0.5, 0.8]  # Small, medium, large effects
sample_sizes = [20, 50, 100, 200]

print("Statistical power analysis:")
print("Effect Size | Sample Size | Power")
print("-" * 35)
for effect in effect_sizes:
    for n in sample_sizes:
        power = calculate_power(effect, n)
        print(f"    {effect:4.1f}   |     {n:3d}     | {power:5.3f}")

Robust Statistics

print("=== ROBUST STATISTICS ===")

# Robust measures of central tendency and spread
print("--- ROBUST vs TRADITIONAL STATISTICS ---")

# Add some outliers to demonstrate robustness
income_with_outliers = df['income'].copy()
income_with_outliers.iloc[:10] = income_with_outliers.iloc[:10] * 10  # Create outliers

comparison_stats = pd.DataFrame({
    'Original': [
        df['income'].mean(),
        df['income'].median(),
        stats.trim_mean(df['income'], 0.1),
        df['income'].std(),
        stats.median_abs_deviation(df['income']),
        stats.iqr(df['income'])
    ],
    'With Outliers': [
        income_with_outliers.mean(),
        income_with_outliers.median(),
        stats.trim_mean(income_with_outliers, 0.1),
        income_with_outliers.std(),
        stats.median_abs_deviation(income_with_outliers),
        stats.iqr(income_with_outliers)
    ]
}, index=['Mean', 'Median', 'Trimmed Mean (10%)', 'Std Dev', 'MAD', 'IQR'])

print("Comparison of statistics with and without outliers:")
print(comparison_stats.round(2))

# Robust correlation
print(f"\n--- ROBUST CORRELATION ---")
# Spearman correlation is more robust to outliers
pearson_corr = df['income'].corr(df['age'])
spearman_corr = df['income'].corr(df['age'], method='spearman')
kendall_corr = df['income'].corr(df['age'], method='kendall')

print(f"Income-Age correlations:")
print(f"  Pearson: {pearson_corr:.4f}")
print(f"  Spearman: {spearman_corr:.4f}")
print(f"  Kendall: {kendall_corr:.4f}")

# Bootstrap confidence intervals
print(f"\n--- BOOTSTRAP CONFIDENCE INTERVALS ---")

def bootstrap_statistic(data, statistic_func, n_bootstrap=1000, confidence_level=0.95):
    """Calculate bootstrap confidence interval for a statistic"""
    # Work with an ndarray and resample row indices so this also handles 2D (paired) data
    data = np.asarray(data)
    bootstrap_stats = []
    n = len(data)
    
    for _ in range(n_bootstrap):
        idx = np.random.randint(0, n, size=n)
        bootstrap_stats.append(statistic_func(data[idx]))
    
    alpha = 1 - confidence_level
    lower_percentile = (alpha / 2) * 100
    upper_percentile = (1 - alpha / 2) * 100
    
    ci_lower = np.percentile(bootstrap_stats, lower_percentile)
    ci_upper = np.percentile(bootstrap_stats, upper_percentile)
    
    return ci_lower, ci_upper, bootstrap_stats

# Bootstrap confidence interval for median income
ci_lower, ci_upper, bootstrap_medians = bootstrap_statistic(
    df['income'].values, 
    np.median, 
    n_bootstrap=1000
)

print(f"Bootstrap 95% CI for median income: [${ci_lower:,.2f}, ${ci_upper:,.2f}]")
print(f"Original median: ${df['income'].median():,.2f}")

# Bootstrap for correlation coefficient
def correlation_statistic(data):
    """Calculate correlation for bootstrap (data should be 2D array)"""
    return np.corrcoef(data[:, 0], data[:, 1])[0, 1]

income_age_data = df[['income', 'age']].values
ci_lower_corr, ci_upper_corr, bootstrap_corrs = bootstrap_statistic(
    income_age_data,
    correlation_statistic,
    n_bootstrap=1000
)

print(f"Bootstrap 95% CI for income-age correlation: [{ci_lower_corr:.4f}, {ci_upper_corr:.4f}]")

Practical Statistical Applications

print("=== PRACTICAL STATISTICAL APPLICATIONS ===")

# A/B Testing Framework
print("--- A/B TESTING FRAMEWORK ---")

def ab_test_analysis(control_group, treatment_group, metric_name="metric", alpha=0.05):
    """
    Comprehensive A/B test analysis
    """
    # Basic statistics
    control_mean = np.mean(control_group)
    treatment_mean = np.mean(treatment_group)
    control_std = np.std(control_group, ddof=1)
    treatment_std = np.std(treatment_group, ddof=1)
    
    # Effect size (Cohen's d)
    pooled_std = np.sqrt(((len(control_group) - 1) * control_std**2 + 
                         (len(treatment_group) - 1) * treatment_std**2) / 
                        (len(control_group) + len(treatment_group) - 2))
    cohens_d = (treatment_mean - control_mean) / pooled_std
    
    # Statistical test
    t_stat, p_value = stats.ttest_ind(treatment_group, control_group)
    
    # Confidence interval for the difference (pooled SE, consistent with the equal-variance t-test)
    se_diff = pooled_std * np.sqrt(1/len(control_group) + 1/len(treatment_group))
    dof = len(control_group) + len(treatment_group) - 2
    t_critical = stats.t.ppf(1 - alpha/2, dof)
    diff = treatment_mean - control_mean
    ci_lower = diff - t_critical * se_diff
    ci_upper = diff + t_critical * se_diff
    
    # Results
    results = {
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'absolute_difference': diff,
        'relative_difference': (diff / control_mean) * 100,
        'cohens_d': cohens_d,
        't_statistic': t_stat,
        'p_value': p_value,
        'significant': p_value < alpha,
        'confidence_interval': (ci_lower, ci_upper),
        'control_size': len(control_group),
        'treatment_size': len(treatment_group)
    }
    
    return results

# Simulate A/B test data
np.random.seed(42)
control_purchases = np.random.normal(120, 30, 500)  # Control group
treatment_purchases = np.random.normal(135, 30, 480)  # Treatment group (higher mean)

ab_results = ab_test_analysis(control_purchases, treatment_purchases, "Purchase Amount")

print("A/B Test Results:")
print(f"  Control mean: ${ab_results['control_mean']:.2f} (n={ab_results['control_size']})")
print(f"  Treatment mean: ${ab_results['treatment_mean']:.2f} (n={ab_results['treatment_size']})")
print(f"  Absolute difference: ${ab_results['absolute_difference']:.2f}")
print(f"  Relative difference: {ab_results['relative_difference']:.2f}%")
print(f"  Effect size (Cohen's d): {ab_results['cohens_d']:.3f}")
print(f"  P-value: {ab_results['p_value']:.4f}")
print(f"  Statistically significant: {ab_results['significant']}")
print(f"  95% CI for difference: [${ab_results['confidence_interval'][0]:.2f}, ${ab_results['confidence_interval'][1]:.2f}]")

# Customer Segmentation Analysis
print(f"\n--- CUSTOMER SEGMENTATION ANALYSIS ---")

# Create customer segments based on RFM (Recency, Frequency, Monetary)
# Simulate additional data for RFM analysis
df['recency_days'] = np.random.exponential(30, len(df))
df['frequency_purchases'] = np.random.poisson(5, len(df))
df['monetary_total'] = df['purchase_amount'] * df['frequency_purchases'] + np.random.normal(0, 50, len(df))

# Calculate RFM scores (quintiles)
df['R_score'] = pd.qcut(df['recency_days'], q=5, labels=[5,4,3,2,1])  # Reverse for recency
df['F_score'] = pd.qcut(df['frequency_purchases'], q=5, labels=[1,2,3,4,5])
df['M_score'] = pd.qcut(df['monetary_total'], q=5, labels=[1,2,3,4,5])

# Convert to numeric for calculation
df['R_score'] = df['R_score'].astype(int)
df['F_score'] = df['F_score'].astype(int)
df['M_score'] = df['M_score'].astype(int)

# Create overall RFM score
df['RFM_score'] = df['R_score'] + df['F_score'] + df['M_score']

# Segment customers
def assign_segment(rfm_score):
    if rfm_score >= 12:
        return 'Champions'
    elif rfm_score >= 10:
        return 'Loyal Customers'
    elif rfm_score >= 8:
        return 'Potential Loyalists'
    elif rfm_score >= 6:
        return 'New Customers'
    else:
        return 'At Risk'

df['Customer_Segment'] = df['RFM_score'].apply(assign_segment)

# Segment analysis
segment_stats = df.groupby('Customer_Segment').agg({
    'monetary_total': ['count', 'mean', 'std'],
    'frequency_purchases': 'mean',
    'recency_days': 'mean',
    'satisfaction_score': 'mean'
}).round(2)

print("Customer segment analysis:")
print(segment_stats)

# Statistical tests between segments
print(f"\n--- SEGMENT COMPARISON TESTS ---")
segments = df['Customer_Segment'].unique()
champions = df[df['Customer_Segment'] == 'Champions']['satisfaction_score']
at_risk = df[df['Customer_Segment'] == 'At Risk']['satisfaction_score']

if len(champions) > 0 and len(at_risk) > 0:
    t_stat, p_value = stats.ttest_ind(champions, at_risk)
    print(f"Satisfaction comparison (Champions vs At Risk):")
    print(f"  Champions mean: {champions.mean():.2f}")
    print(f"  At Risk mean: {at_risk.mean():.2f}")
    print(f"  T-test p-value: {p_value:.4f}")
    print(f"  Significant difference: {p_value < 0.05}")

Statistical Reporting and Visualization

print("=== STATISTICAL REPORTING ===")

def create_statistical_summary_report(df, numerical_cols, categorical_cols):
    """
    Create comprehensive statistical summary report
    """
    report = {
        'dataset_info': {
            'name': 'Customer Analysis Dataset',
            'observations': len(df),
            'variables': len(df.columns),
            'numerical_variables': len(numerical_cols),
            'categorical_variables': len(categorical_cols),
            'missing_values': df.isnull().sum().sum(),
            'duplicate_rows': df.duplicated().sum()
        },
        'numerical_summary': {},
        'categorical_summary': {},
        'correlation_summary': {},
        'statistical_tests': {}
    }
    
    # Numerical variable summary
    for col in numerical_cols:
        if col in df.columns:
            col_data = df[col].dropna()
            report['numerical_summary'][col] = {
                'count': len(col_data),
                'mean': col_data.mean(),
                'median': col_data.median(),
                'std': col_data.std(),
                'min': col_data.min(),
                'max': col_data.max(),
                'skewness': col_data.skew(),
                'kurtosis': col_data.kurtosis(),
                'q25': col_data.quantile(0.25),
                'q75': col_data.quantile(0.75),
                'iqr': col_data.quantile(0.75) - col_data.quantile(0.25)
            }
    
    # Categorical variable summary
    for col in categorical_cols:
        if col in df.columns:
            report['categorical_summary'][col] = {
                'unique_values': df[col].nunique(),
                'most_frequent': df[col].mode().iloc[0] if len(df[col].mode()) > 0 else None,
                'most_frequent_count': df[col].value_counts().iloc[0] if len(df[col]) > 0 else 0,
                'distribution': df[col].value_counts().to_dict()
            }
    
    # Correlation analysis
    numeric_df = df[numerical_cols].select_dtypes(include=[np.number])
    if len(numeric_df.columns) > 1:
        corr_matrix = numeric_df.corr()
        # Find highest correlations
        corr_pairs = []
        for i in range(len(corr_matrix.columns)):
            for j in range(i+1, len(corr_matrix.columns)):
                corr_pairs.append({
                    'var1': corr_matrix.columns[i],
                    'var2': corr_matrix.columns[j],
                    'correlation': corr_matrix.iloc[i, j]
                })
        
        corr_pairs_df = pd.DataFrame(corr_pairs)
        strongest_corr = corr_pairs_df.loc[corr_pairs_df['correlation'].abs().idxmax()]
        report['correlation_summary'] = {
            'strongest_positive': corr_pairs_df.loc[corr_pairs_df['correlation'].idxmax()].to_dict(),
            'strongest_negative': corr_pairs_df.loc[corr_pairs_df['correlation'].idxmin()].to_dict(),
            'average_correlation': corr_pairs_df['correlation'].abs().mean()
        }
    
    return report

# Generate comprehensive report
numerical_vars = ['age', 'income', 'education_years', 'experience', 'purchase_amount', 'satisfaction_score']
categorical_vars = ['region', 'product_category', 'is_premium']

statistical_report = create_statistical_summary_report(df, numerical_vars, categorical_vars)

print("STATISTICAL ANALYSIS REPORT")
print("=" * 50)
print(f"Dataset: {statistical_report['dataset_info']['name']}")
print(f"Observations: {statistical_report['dataset_info']['observations']:,}")
print(f"Variables: {statistical_report['dataset_info']['variables']}")
print(f"Missing values: {statistical_report['dataset_info']['missing_values']}")

print(f"\nKEY FINDINGS:")
print("-" * 20)

# Highlight interesting findings
for var, var_stats in statistical_report['numerical_summary'].items():
    if abs(var_stats['skewness']) > 1:
        skew_direction = "right" if var_stats['skewness'] > 0 else "left"
        print(f"• {var} is highly {skew_direction}-skewed (skewness: {var_stats['skewness']:.2f})")

if 'correlation_summary' in statistical_report and statistical_report['correlation_summary']:
    strongest_pos = statistical_report['correlation_summary']['strongest_positive']
    print(f"• Strongest positive correlation: {strongest_pos['var1']} & {strongest_pos['var2']} (r = {strongest_pos['correlation']:.3f})")
    
    strongest_neg = statistical_report['correlation_summary']['strongest_negative'] 
    print(f"• Strongest negative correlation: {strongest_neg['var1']} & {strongest_neg['var2']} (r = {strongest_neg['correlation']:.3f})")

# Top categories
for var, cat_stats in statistical_report['categorical_summary'].items():
    most_common = cat_stats['most_frequent']
    percentage = (cat_stats['most_frequent_count'] / len(df)) * 100
    print(f"• Most common {var}: {most_common} ({percentage:.1f}%)")

print(f"\nRECOMMENDations:")
print("-" * 20)
print("• Consider log transformation for right-skewed variables")
print("• Investigate outliers in variables with high kurtosis")
print("• Use non-parametric tests for non-normal distributions")
print("• Consider stratified analysis by key categorical variables")

# Save report to file
import json
with open('statistical_analysis_report.json', 'w') as f:
    # Convert numpy types for JSON serialization
    def convert_numpy(obj):
        if isinstance(obj, np.bool_):
            return bool(obj)
        elif isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return None if np.isnan(obj) else float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")
    
    json.dump(statistical_report, f, indent=2, default=convert_numpy)

print(f"\nâś… Detailed report saved to 'statistical_analysis_report.json'")

Key Takeaways

Essential Statistical Concepts:

Descriptive Statistics:

  • Central Tendency: Mean, median, mode
  • Dispersion: Variance, standard deviation, IQR
  • Distribution Shape: Skewness, kurtosis
  • Position: Percentiles, quartiles
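
Each of the measures above maps onto a one-line pandas call. A minimal recap, assuming the df built in the setup section is still in scope:

# Descriptive statistics recap on a single column
s = df['income']

print(s.mean(), s.median(), s.mode().iloc[0])                  # central tendency
print(s.var(), s.std(), s.quantile(0.75) - s.quantile(0.25))   # dispersion: variance, std, IQR
print(s.skew(), s.kurtosis())                                  # distribution shape
print(s.quantile([0.25, 0.50, 0.75, 0.90]))                    # position: percentiles and quartiles
print(s.describe())                                            # most of the above in one call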

Inferential Statistics:

  • Hypothesis Testing: t-tests, ANOVA, chi-square
  • Confidence Intervals: Bootstrap, parametric
  • Correlation: Pearson, Spearman, Kendall
  • Regression: Linear, multiple regression
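
These tests all live in scipy.stats and appear throughout this guide. A compact quick-reference sketch, again assuming the df from the setup section:

# Inferential statistics quick reference (scipy.stats)
premium = df[df['is_premium']]['income']
regular = df[~df['is_premium']]['income']

t_stat, p_t = stats.ttest_ind(premium, regular)                    # two-sample t-test
f_stat, p_anova = stats.f_oneway(
    *[g['satisfaction_score'] for _, g in df.groupby('region')])   # one-way ANOVA
chi2, p_chi2, dof, _ = stats.chi2_contingency(
    pd.crosstab(df['region'], df['product_category']))             # chi-square independence test
r, p_r = stats.pearsonr(df['age'], df['experience'])               # Pearson correlation + p-value
rho, p_rho = stats.spearmanr(df['age'], df['experience'])          # Spearman rank correlation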

Practical Applications:

  • A/B Testing: Compare groups statistically
  • Segmentation: RFM analysis, clustering validation
  • Quality Control: Outlier detection, process monitoring
  • Forecasting: Trend analysis, time series

Best Practices:

  1. Always explore your data before applying statistical tests
  2. Check assumptions (normality, independence, equal variance), as sketched after this list
  3. Use appropriate tests for your data type and distribution
  4. Consider effect size along with statistical significance
  5. Apply multiple comparison corrections when needed
  6. Use robust statistics when dealing with outliers
  7. Validate results with different methods when possible
  8. Document your methodology and assumptions
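
Points 2 and 3 above can be automated in one place. A minimal sketch, assuming two groups taken from the df built earlier, that checks normality (Shapiro-Wilk) and equal variances (Levene) before choosing between a t-test and a non-parametric alternative:

# Assumption checks before picking a test (illustrative thresholds, alpha = 0.05)
group_a = df[df['is_premium']]['satisfaction_score']
group_b = df[~df['is_premium']]['satisfaction_score']

normal_a = stats.shapiro(group_a.sample(min(len(group_a), 5000)))[1] > 0.05
normal_b = stats.shapiro(group_b.sample(min(len(group_b), 5000)))[1] > 0.05
equal_var = stats.levene(group_a, group_b)[1] > 0.05

if normal_a and normal_b:
    # Parametric: Student's t if variances look equal, otherwise Welch's t
    stat, p = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
else:
    # Non-parametric fallback
    stat, p = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

print(f"Chosen test p-value: {p:.4f} (equal variances assumed: {equal_var})")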

Common Pitfalls to Avoid:

  • P-hacking: Testing multiple hypotheses without correction (a correction sketch follows this list)
  • Assuming causation from correlation
  • Ignoring outliers without investigation
  • Using inappropriate tests for non-normal data
  • Over-interpreting small effect sizes
  • Not checking statistical assumptions
  • Conflating statistical and practical significance
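
For the first pitfall above, a correction takes only a few lines. A minimal sketch of a Bonferroni adjustment applied to the pairwise region comparisons from earlier (Holm or FDR corrections, for example via statsmodels' multipletests, are less conservative alternatives):

# Bonferroni adjustment for a family of pairwise tests (illustrative)
from itertools import combinations

pairs = list(combinations(df['region'].unique(), 2))
raw_p = [
    stats.ttest_ind(df.loc[df['region'] == a, 'satisfaction_score'],
                    df.loc[df['region'] == b, 'satisfaction_score'])[1]
    for a, b in pairs
]
adjusted_p = [min(p * len(raw_p), 1.0) for p in raw_p]  # multiply by the number of tests, cap at 1

for (a, b), p_raw, p_adj in zip(pairs, raw_p, adjusted_p):
    flag = '*' if p_adj < 0.05 else ''
    print(f"{a} vs {b}: raw p = {p_raw:.4f}, adjusted p = {p_adj:.4f} {flag}")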

What's Next?

Now that you've mastered statistical operations in pandas, you're ready to apply these techniques to your own datasets.


Statistical analysis is the foundation of data-driven decision making. Master these techniques to extract meaningful insights and make confident conclusions from your data.
