Statistical Operations in Pandas
Statistical analysis is at the heart of data science, and pandas, together with NumPy and SciPy, gives you a comprehensive toolkit for it. This guide covers everything from basic descriptive statistics to advanced statistical operations so you can extract meaningful insights from your data.
Why Statistical Operations Matter
Statistical analysis helps you:
- Understand your data: Discover patterns, trends, and relationships
- Make informed decisions: Base conclusions on quantitative evidence
- Identify anomalies: Spot outliers and unusual patterns
- Compare groups: Test hypotheses and measure differences
- Predict outcomes: Build models based on historical patterns
- Communicate insights: Present findings with statistical backing
Setting Up for Statistical Analysis
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 3)
# Create a comprehensive dataset for examples
np.random.seed(42)
n_samples = 1000
data = {
'customer_id': range(1001, 1001 + n_samples),
'age': np.random.normal(35, 12, n_samples).astype(int),
'income': np.random.lognormal(10.5, 0.8, n_samples),
'education_years': np.random.normal(14, 3, n_samples),
'experience': np.random.exponential(8, n_samples),
'purchase_amount': np.random.gamma(2, 50, n_samples),
'satisfaction_score': np.random.normal(7, 1.5, n_samples),
'region': np.random.choice(['North', 'South', 'East', 'West'], n_samples, p=[0.3, 0.25, 0.25, 0.2]),
'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_samples, p=[0.4, 0.3, 0.2, 0.1]),
'is_premium': np.random.choice([True, False], n_samples, p=[0.3, 0.7]),
'months_since_signup': np.random.poisson(24, n_samples)
}
# Create DataFrame
df = pd.DataFrame(data)
# Clean up some values to be more realistic
df['age'] = np.clip(df['age'], 18, 80)
df['education_years'] = np.clip(df['education_years'], 8, 22).astype(int)
df['experience'] = np.clip(df['experience'], 0, 40).round(1)
df['satisfaction_score'] = np.clip(df['satisfaction_score'], 1, 10).round(1)
df['income'] = df['income'].round(2)
df['purchase_amount'] = df['purchase_amount'].round(2)
print("Dataset created for statistical analysis:")
print(df.head())
print(f"Dataset shape: {df.shape}")
Basic Descriptive Statistics
Central Tendency Measures
print("=== CENTRAL TENDENCY MEASURES ===")
# Mean, median, and mode for numerical columns
numerical_cols = ['age', 'income', 'education_years', 'experience', 'purchase_amount', 'satisfaction_score']
print("Central tendency summary:")
central_tendency = pd.DataFrame({
'Mean': df[numerical_cols].mean(),
'Median': df[numerical_cols].median(),
'Mode': df[numerical_cols].mode().iloc[0] if len(df) > 0 else np.nan
}).round(2)
print(central_tendency)
# Detailed analysis for a specific variable
print(f"\n--- INCOME ANALYSIS ---")
income_stats = {
'Mean': df['income'].mean(),
'Median': df['income'].median(),
'Trimmed Mean (5%)': stats.trim_mean(df['income'], 0.05),
'Harmonic Mean': stats.hmean(df['income']),
'Geometric Mean': stats.gmean(df['income'])
}
for stat_name, value in income_stats.items():
print(f"{stat_name}: ${value:,.2f}")
# Mode analysis for categorical variables
print(f"\n--- CATEGORICAL MODE ANALYSIS ---")
categorical_cols = ['region', 'product_category', 'is_premium']
for col in categorical_cols:
mode_value = df[col].mode().iloc[0]
mode_count = df[col].value_counts().iloc[0]
mode_percentage = (mode_count / len(df)) * 100
print(f"{col}: {mode_value} ({mode_count} occurrences, {mode_percentage:.1f}%)")
Measures of Dispersion
print("=== MEASURES OF DISPERSION ===")
# Variance and Standard Deviation
dispersion_stats = pd.DataFrame({
'Variance': df[numerical_cols].var(),
'Std Deviation': df[numerical_cols].std(),
'Mean Abs Deviation': df[numerical_cols].apply(lambda x: np.mean(np.abs(x - x.mean()))),
'Coefficient of Variation': (df[numerical_cols].std() / df[numerical_cols].mean()) * 100
}).round(3)
print("Dispersion measures:")
print(dispersion_stats)
# Range and Interquartile Range
print(f"\n--- RANGE ANALYSIS ---")
range_stats = pd.DataFrame({
'Min': df[numerical_cols].min(),
'Max': df[numerical_cols].max(),
'Range': df[numerical_cols].max() - df[numerical_cols].min(),
'IQR': df[numerical_cols].quantile(0.75) - df[numerical_cols].quantile(0.25)
}).round(2)
print(range_stats)
# Detailed percentile analysis
print(f"\n--- PERCENTILE ANALYSIS (Income) ---")
income_percentiles = df['income'].quantile([0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99])
print("Income percentiles:")
for percentile, value in income_percentiles.items():
print(f"{percentile*100:4.0f}th percentile: ${value:8,.2f}")
Distribution Shape Analysis
print("=== DISTRIBUTION SHAPE ANALYSIS ===")
# Skewness and Kurtosis
shape_stats = pd.DataFrame({
'Skewness': df[numerical_cols].skew(),
'Kurtosis': df[numerical_cols].kurtosis(),
'Jarque-Bera Test (p-value)': df[numerical_cols].apply(
lambda x: stats.jarque_bera(x.dropna())[1] if len(x.dropna()) > 0 else np.nan
)
}).round(4)
print("Distribution shape measures:")
print(shape_stats)
# Interpretation of skewness and kurtosis
print(f"\n--- INTERPRETATION ---")
for col in numerical_cols:
skewness = df[col].skew()
kurtosis = df[col].kurtosis()
# Skewness interpretation
if abs(skewness) < 0.5:
skew_desc = "approximately symmetric"
elif skewness < -0.5:
skew_desc = "left-skewed (negative)"
else:
skew_desc = "right-skewed (positive)"
# Kurtosis interpretation
if abs(kurtosis) < 0.5:
kurt_desc = "normal kurtosis (mesokurtic)"
elif kurtosis < -0.5:
kurt_desc = "light tails (platykurtic)"
else:
kurt_desc = "heavy tails (leptokurtic)"
print(f"{col}: {skew_desc}, {kurt_desc}")
# Normality tests
print(f"\n--- NORMALITY TESTS ---")
for col in numerical_cols[:4]: # Test first 4 columns
data_col = df[col].dropna()
    # Shapiro-Wilk test (only reliable up to about 5,000 observations)
    if len(data_col) <= 5000:
        shapiro_stat, shapiro_p = stats.shapiro(data_col)
    else:
        shapiro_stat, shapiro_p = np.nan, np.nan
# Anderson-Darling test
anderson_result = stats.anderson(data_col, dist='norm')
# Kolmogorov-Smirnov test
ks_stat, ks_p = stats.kstest(data_col, 'norm', args=(data_col.mean(), data_col.std()))
print(f"\n{col}:")
if not np.isnan(shapiro_p):
print(f" Shapiro-Wilk: p = {shapiro_p:.4f} ({'Normal' if shapiro_p > 0.05 else 'Non-normal'})")
print(f" Anderson-Darling: statistic = {anderson_result.statistic:.4f}")
print(f" Kolmogorov-Smirnov: p = {ks_p:.4f} ({'Normal' if ks_p > 0.05 else 'Non-normal'})")
Correlation Analysis
print("=== CORRELATION ANALYSIS ===")
# Pearson correlation matrix
correlation_matrix = df[numerical_cols].corr()
print("Pearson correlation matrix:")
print(correlation_matrix.round(3))
# Find strong correlations
def find_strong_correlations(corr_matrix, threshold=0.7):
"""Find pairs of variables with strong correlations"""
strong_corr = []
for i in range(len(corr_matrix.columns)):
for j in range(i+1, len(corr_matrix.columns)):
corr_value = corr_matrix.iloc[i, j]
if abs(corr_value) >= threshold:
strong_corr.append({
'Variable 1': corr_matrix.columns[i],
'Variable 2': corr_matrix.columns[j],
'Correlation': corr_value,
'Strength': 'Strong Positive' if corr_value > 0 else 'Strong Negative'
})
return pd.DataFrame(strong_corr)
strong_correlations = find_strong_correlations(correlation_matrix, threshold=0.3)
if not strong_correlations.empty:
    print(f"\nNotable correlations (|r| >= 0.3):")
    print(strong_correlations)
else:
    print(f"\nNo correlations found above the 0.3 threshold")
# Spearman rank correlation (for non-linear relationships)
spearman_corr = df[numerical_cols].corr(method='spearman')
print(f"\nSpearman rank correlations (top 3 pairs):")
spearman_pairs = []
for i in range(len(spearman_corr.columns)):
for j in range(i+1, len(spearman_corr.columns)):
spearman_pairs.append({
'Pair': f"{spearman_corr.columns[i]} - {spearman_corr.columns[j]}",
'Spearman': spearman_corr.iloc[i, j],
'Pearson': correlation_matrix.iloc[i, j]
})
spearman_df = pd.DataFrame(spearman_pairs).sort_values('Spearman', key=abs, ascending=False)
print(spearman_df.head(3).round(3))
# Correlation significance testing
def correlation_significance_test(x, y, alpha=0.05):
"""Test statistical significance of correlation"""
corr_coef, p_value = stats.pearsonr(x, y)
is_significant = p_value < alpha
return {
'correlation': corr_coef,
'p_value': p_value,
'significant': is_significant,
'confidence_level': (1 - alpha) * 100
}
# Test significance for a few key pairs
print(f"\n--- CORRELATION SIGNIFICANCE TESTS ---")
test_pairs = [
('age', 'experience'),
('income', 'education_years'),
('purchase_amount', 'satisfaction_score')
]
for var1, var2 in test_pairs:
result = correlation_significance_test(df[var1], df[var2])
print(f"{var1} - {var2}:")
print(f" Correlation: {result['correlation']:.4f}")
print(f" P-value: {result['p_value']:.4f}")
print(f" Significant at 95% level: {result['significant']}")
Group Statistics and Comparisons
print("=== GROUP STATISTICS AND COMPARISONS ===")
# Group by categorical variables
print("--- GROUP STATISTICS BY REGION ---")
region_stats = df.groupby('region')[numerical_cols].agg(['count', 'mean', 'std', 'median']).round(2)
print("Income statistics by region:")
print(region_stats['income'])
# Compare groups using descriptive statistics
print(f"\n--- PRODUCT CATEGORY COMPARISON ---")
category_comparison = df.groupby('product_category')['purchase_amount'].agg([
'count', 'mean', 'std', 'median', 'min', 'max'
]).round(2)
print(category_comparison)
# Statistical tests for group differences
print(f"\n--- STATISTICAL TESTS FOR GROUP DIFFERENCES ---")
# T-test: Compare premium vs non-premium customers
premium_income = df[df['is_premium'] == True]['income']
regular_income = df[df['is_premium'] == False]['income']
# Two-sample t-test
t_stat, t_p_value = stats.ttest_ind(premium_income, regular_income)
print(f"T-test (Premium vs Regular Income):")
print(f" Premium mean: ${premium_income.mean():,.2f}")
print(f" Regular mean: ${regular_income.mean():,.2f}")
print(f" T-statistic: {t_stat:.4f}")
print(f" P-value: {t_p_value:.4f}")
print(f" Significant difference: {t_p_value < 0.05}")
# Mann-Whitney U test (non-parametric alternative)
u_stat, u_p_value = stats.mannwhitneyu(premium_income, regular_income, alternative='two-sided')
print(f"\nMann-Whitney U test:")
print(f" U-statistic: {u_stat}")
print(f" P-value: {u_p_value:.4f}")
# ANOVA: Compare satisfaction across multiple regions
print(f"\n--- ONE-WAY ANOVA ---")
region_groups = [group['satisfaction_score'].values for name, group in df.groupby('region')]
f_stat, anova_p_value = stats.f_oneway(*region_groups)
print(f"ANOVA - Satisfaction Score by Region:")
print(f" F-statistic: {f_stat:.4f}")
print(f" P-value: {anova_p_value:.4f}")
print(f" Significant difference between regions: {anova_p_value < 0.05}")
# Post-hoc analysis if ANOVA is significant
if anova_p_value < 0.05:
print(f"\nPost-hoc analysis (Tukey HSD):")
from scipy.stats import tukey_hsd
# Note: tukey_hsd might not be available in all scipy versions
# Alternative approach using pairwise t-tests with Bonferroni correction
regions = df['region'].unique()
n_comparisons = len(regions) * (len(regions) - 1) // 2
alpha_corrected = 0.05 / n_comparisons
print(f"Pairwise comparisons (Bonferroni corrected α = {alpha_corrected:.4f}):")
for i, region1 in enumerate(regions):
for region2 in regions[i+1:]:
group1 = df[df['region'] == region1]['satisfaction_score']
group2 = df[df['region'] == region2]['satisfaction_score']
t_stat, p_val = stats.ttest_ind(group1, group2)
significant = p_val < alpha_corrected
print(f" {region1} vs {region2}: p = {p_val:.4f} {'*' if significant else ''}")
Time Series and Trend Analysis
print("=== TIME SERIES AND TREND ANALYSIS ===")
# Create a time series component
np.random.seed(42)
dates = pd.date_range('2020-01-01', '2023-12-31', freq='D')
trend = np.linspace(100, 200, len(dates))
seasonal = 20 * np.sin(2 * np.pi * np.arange(len(dates)) / 365.25)
noise = np.random.normal(0, 10, len(dates))
ts_values = trend + seasonal + noise
ts_df = pd.DataFrame({
'date': dates,
'value': ts_values,
'month': dates.month,
'year': dates.year,
'day_of_week': dates.dayofweek
})
print("Time series data created:")
print(ts_df.head())
# Basic time series statistics
print(f"\n--- TIME SERIES DESCRIPTIVE STATISTICS ---")
print(f"Time period: {ts_df['date'].min()} to {ts_df['date'].max()}")
print(f"Number of observations: {len(ts_df)}")
print(f"Value statistics:")
print(ts_df['value'].describe().round(2))
# Trend analysis
print(f"\n--- TREND ANALYSIS ---")
# Linear trend
slope, intercept, r_value, p_value, std_err = stats.linregress(
range(len(ts_df)), ts_df['value']
)
print(f"Linear trend:")
print(f" Slope: {slope:.4f} units per day")
print(f" R-squared: {r_value**2:.4f}")
print(f" P-value: {p_value:.4f}")
print(f" Trend is {'significant' if p_value < 0.05 else 'not significant'}")
# Seasonal analysis
monthly_stats = ts_df.groupby('month')['value'].agg(['mean', 'std']).round(2)
print(f"\nMonthly statistics:")
print(monthly_stats)
# Autocorrelation analysis
def calculate_autocorrelation(series, max_lags=30):
"""Calculate autocorrelation for different lags"""
autocorr = []
for lag in range(1, max_lags + 1):
corr = series.corr(series.shift(lag))
autocorr.append({'lag': lag, 'autocorr': corr})
return pd.DataFrame(autocorr)
autocorr_results = calculate_autocorrelation(ts_df['value'])
print(f"\nAutocorrelation analysis (top 5 lags):")
print(autocorr_results.sort_values('autocorr', key=abs, ascending=False).head())
# Moving averages and smoothing
ts_df['ma_7'] = ts_df['value'].rolling(window=7).mean()
ts_df['ma_30'] = ts_df['value'].rolling(window=30).mean()
ts_df['ewm_alpha_0_1'] = ts_df['value'].ewm(alpha=0.1).mean()
print(f"\nSmoothing comparison (last 5 values):")
print(ts_df[['date', 'value', 'ma_7', 'ma_30', 'ewm_alpha_0_1']].tail())
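Daily noise often hides the underlying pattern, and pandas' resample() aggregates a datetime-indexed series to a coarser frequency; a minimal sketch of monthly means (the 'MS' alias marks month-start periods):
# Aggregate the daily series to monthly means with resample()
monthly_means = ts_df.set_index('date')['value'].resample('MS').mean()
print(f"\nMonthly mean values (first 6 months):")
print(monthly_means.head(6).round(2))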
Advanced Statistical Operations
print("=== ADVANCED STATISTICAL OPERATIONS ===")
# Regression analysis
print("--- LINEAR REGRESSION ANALYSIS ---")
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Predict income based on age, education, and experience
X = df[['age', 'education_years', 'experience']]
y = df['income']
# Fit regression model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
# Regression statistics
r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
print(f"Multiple Linear Regression Results:")
print(f" R-squared: {r2:.4f}")
print(f" RMSE: ${rmse:,.2f}")
print(f" Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
print(f" {feature}: {coef:.2f}")
print(f" Intercept: {model.intercept_:.2f}")
# Statistical significance of predictors (using scipy)
from scipy.stats import t
n = len(X)
k = len(X.columns)
dof = n - k - 1
# Calculate t-statistics and p-values for the slope coefficients.
# The design matrix must include a column of ones for the intercept,
# otherwise the standard errors are wrong.
X_design = np.column_stack([np.ones(n), X.values])
residuals = y - y_pred
mse_res = np.sum(residuals**2) / dof
var_coef = mse_res * np.linalg.inv(X_design.T @ X_design).diagonal()
se_coef = np.sqrt(var_coef[1:])  # drop the intercept's standard error
t_stats = model.coef_ / se_coef
p_values = 2 * (1 - t.cdf(np.abs(t_stats), dof))
print(f"\nCoefficient significance:")
for feature, coef, se, t_stat, p_val in zip(X.columns, model.coef_, se_coef, t_stats, p_values):
significance = "***" if p_val < 0.001 else "**" if p_val < 0.01 else "*" if p_val < 0.05 else ""
print(f" {feature}: {coef:.4f} (SE: {se:.4f}, t: {t_stat:.2f}, p: {p_val:.4f}) {significance}")
# Chi-square test for independence
print(f"\n--- CHI-SQUARE TEST OF INDEPENDENCE ---")
contingency_table = pd.crosstab(df['region'], df['product_category'])
print("Contingency table (Region vs Product Category):")
print(contingency_table)
chi2, chi2_p, dof, expected = stats.chi2_contingency(contingency_table)
print(f"\nChi-square test results:")
print(f" Chi-square statistic: {chi2:.4f}")
print(f" Degrees of freedom: {dof}")
print(f" P-value: {chi2_p:.4f}")
print(f" Significant association: {chi2_p < 0.05}")
# Cramér's V (effect size for chi-square)
n = contingency_table.sum().sum()
cramers_v = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))
print(f" Cramér's V (effect size): {cramers_v:.4f}")
Statistical Hypothesis Testing
print("=== STATISTICAL HYPOTHESIS TESTING ===")
# One-sample t-test
print("--- ONE-SAMPLE T-TEST ---")
# Test if average satisfaction score differs from 7.0
hypothesized_mean = 7.0
satisfaction_data = df['satisfaction_score'].dropna()
t_stat, p_value = stats.ttest_1samp(satisfaction_data, hypothesized_mean)
sample_mean = satisfaction_data.mean()
sample_std = satisfaction_data.std()
n = len(satisfaction_data)
se = sample_std / np.sqrt(n)
# Confidence interval
confidence_level = 0.95
alpha = 1 - confidence_level
t_critical = stats.t.ppf(1 - alpha/2, n-1)
margin_error = t_critical * se
ci_lower = sample_mean - margin_error
ci_upper = sample_mean + margin_error
print(f"One-sample t-test (H0: ÎĽ = {hypothesized_mean}):")
print(f" Sample mean: {sample_mean:.3f}")
print(f" Sample std: {sample_std:.3f}")
print(f" T-statistic: {t_stat:.4f}")
print(f" P-value: {p_value:.4f}")
print(f" {confidence_level*100}% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
print(f" Reject H0: {p_value < 0.05}")
# Paired t-test example (create paired data)
print(f"\n--- PAIRED T-TEST ---")
# Simulate before/after scenario
np.random.seed(42)
before_scores = df['satisfaction_score'].sample(100).values
after_scores = before_scores + np.random.normal(0.5, 1, 100) # Small improvement
paired_t_stat, paired_p_value = stats.ttest_rel(after_scores, before_scores)
mean_difference = np.mean(after_scores - before_scores)
print(f"Paired t-test (Before vs After satisfaction):")
print(f" Before mean: {np.mean(before_scores):.3f}")
print(f" After mean: {np.mean(after_scores):.3f}")
print(f" Mean difference: {mean_difference:.3f}")
print(f" T-statistic: {paired_t_stat:.4f}")
print(f" P-value: {paired_p_value:.4f}")
print(f" Significant improvement: {paired_p_value < 0.05}")
# Power analysis
print(f"\n--- STATISTICAL POWER ANALYSIS ---")
def calculate_power(effect_size, sample_size, alpha=0.05):
    """Calculate statistical power for a two-tailed one-sample t-test"""
    # Critical value under the null hypothesis
    t_critical = stats.t.ppf(1 - alpha/2, sample_size - 1)
    # Non-centrality parameter
    ncp = effect_size * np.sqrt(sample_size)
    # Power: probability of exceeding the critical value under the noncentral t distribution
    power = 1 - stats.nct.cdf(t_critical, sample_size - 1, ncp) + stats.nct.cdf(-t_critical, sample_size - 1, ncp)
    return power
# Calculate power for different scenarios
effect_sizes = [0.2, 0.5, 0.8] # Small, medium, large effects
sample_sizes = [20, 50, 100, 200]
print("Statistical power analysis:")
print("Effect Size | Sample Size | Power")
print("-" * 35)
for effect in effect_sizes:
for n in sample_sizes:
power = calculate_power(effect, n)
print(f" {effect:4.1f} | {n:3d} | {power:5.3f}")
Robust Statistics
print("=== ROBUST STATISTICS ===")
# Robust measures of central tendency and spread
print("--- ROBUST vs TRADITIONAL STATISTICS ---")
# Add some outliers to demonstrate robustness
income_with_outliers = df['income'].copy()
income_with_outliers.iloc[:10] = income_with_outliers.iloc[:10] * 10 # Create outliers
comparison_stats = pd.DataFrame({
'Original': [
df['income'].mean(),
df['income'].median(),
stats.trim_mean(df['income'], 0.1),
df['income'].std(),
stats.median_abs_deviation(df['income']),
stats.iqr(df['income'])
],
'With Outliers': [
income_with_outliers.mean(),
income_with_outliers.median(),
stats.trim_mean(income_with_outliers, 0.1),
income_with_outliers.std(),
stats.median_abs_deviation(income_with_outliers),
stats.iqr(income_with_outliers)
]
}, index=['Mean', 'Median', 'Trimmed Mean (10%)', 'Std Dev', 'MAD', 'IQR'])
print("Comparison of statistics with and without outliers:")
print(comparison_stats.round(2))
# Robust correlation
print(f"\n--- ROBUST CORRELATION ---")
# Spearman correlation is more robust to outliers
pearson_corr = df['income'].corr(df['age'])
spearman_corr = df['income'].corr(df['age'], method='spearman')
kendall_corr = df['income'].corr(df['age'], method='kendall')
print(f"Income-Age correlations:")
print(f" Pearson: {pearson_corr:.4f}")
print(f" Spearman: {spearman_corr:.4f}")
print(f" Kendall: {kendall_corr:.4f}")
# Bootstrap confidence intervals
print(f"\n--- BOOTSTRAP CONFIDENCE INTERVALS ---")
def bootstrap_statistic(data, statistic_func, n_bootstrap=1000, confidence_level=0.95):
    """Calculate a bootstrap confidence interval for a statistic.
    Resamples row indices so it works for 1D arrays and 2D (row-wise) arrays alike."""
    data = np.asarray(data)
    bootstrap_stats = []
    n = len(data)
    for _ in range(n_bootstrap):
        idx = np.random.randint(0, n, size=n)  # sample rows with replacement
        bootstrap_stats.append(statistic_func(data[idx]))
    alpha = 1 - confidence_level
    lower_percentile = (alpha / 2) * 100
    upper_percentile = (1 - alpha / 2) * 100
    ci_lower = np.percentile(bootstrap_stats, lower_percentile)
    ci_upper = np.percentile(bootstrap_stats, upper_percentile)
    return ci_lower, ci_upper, bootstrap_stats
# Bootstrap confidence interval for median income
ci_lower, ci_upper, bootstrap_medians = bootstrap_statistic(
df['income'].values,
np.median,
n_bootstrap=1000
)
print(f"Bootstrap 95% CI for median income: [${ci_lower:,.2f}, ${ci_upper:,.2f}]")
print(f"Original median: ${df['income'].median():,.2f}")
# Bootstrap for correlation coefficient
def correlation_statistic(data):
"""Calculate correlation for bootstrap (data should be 2D array)"""
return np.corrcoef(data[:, 0], data[:, 1])[0, 1]
income_age_data = df[['income', 'age']].values
ci_lower_corr, ci_upper_corr, bootstrap_corrs = bootstrap_statistic(
income_age_data,
correlation_statistic,
n_bootstrap=1000
)
print(f"Bootstrap 95% CI for income-age correlation: [{ci_lower_corr:.4f}, {ci_upper_corr:.4f}]")
Practical Statistical Applications
print("=== PRACTICAL STATISTICAL APPLICATIONS ===")
# A/B Testing Framework
print("--- A/B TESTING FRAMEWORK ---")
def ab_test_analysis(control_group, treatment_group, metric_name="metric", alpha=0.05):
"""
Comprehensive A/B test analysis
"""
# Basic statistics
control_mean = np.mean(control_group)
treatment_mean = np.mean(treatment_group)
control_std = np.std(control_group, ddof=1)
treatment_std = np.std(treatment_group, ddof=1)
# Effect size (Cohen's d)
pooled_std = np.sqrt(((len(control_group) - 1) * control_std**2 +
(len(treatment_group) - 1) * treatment_std**2) /
(len(control_group) + len(treatment_group) - 2))
cohens_d = (treatment_mean - control_mean) / pooled_std
# Statistical test
t_stat, p_value = stats.ttest_ind(treatment_group, control_group)
# Confidence interval for difference
se_diff = np.sqrt(control_std**2/len(control_group) + treatment_std**2/len(treatment_group))
dof = len(control_group) + len(treatment_group) - 2
t_critical = stats.t.ppf(1 - alpha/2, dof)
diff = treatment_mean - control_mean
ci_lower = diff - t_critical * se_diff
ci_upper = diff + t_critical * se_diff
# Results
results = {
'control_mean': control_mean,
'treatment_mean': treatment_mean,
'absolute_difference': diff,
'relative_difference': (diff / control_mean) * 100,
'cohens_d': cohens_d,
't_statistic': t_stat,
'p_value': p_value,
'significant': p_value < alpha,
'confidence_interval': (ci_lower, ci_upper),
'control_size': len(control_group),
'treatment_size': len(treatment_group)
}
return results
# Simulate A/B test data
np.random.seed(42)
control_purchases = np.random.normal(120, 30, 500) # Control group
treatment_purchases = np.random.normal(135, 30, 480) # Treatment group (higher mean)
ab_results = ab_test_analysis(control_purchases, treatment_purchases, "Purchase Amount")
print("A/B Test Results:")
print(f" Control mean: ${ab_results['control_mean']:.2f} (n={ab_results['control_size']})")
print(f" Treatment mean: ${ab_results['treatment_mean']:.2f} (n={ab_results['treatment_size']})")
print(f" Absolute difference: ${ab_results['absolute_difference']:.2f}")
print(f" Relative difference: {ab_results['relative_difference']:.2f}%")
print(f" Effect size (Cohen's d): {ab_results['cohens_d']:.3f}")
print(f" P-value: {ab_results['p_value']:.4f}")
print(f" Statistically significant: {ab_results['significant']}")
print(f" 95% CI for difference: [${ab_results['confidence_interval'][0]:.2f}, ${ab_results['confidence_interval'][1]:.2f}]")
# Customer Segmentation Analysis
print(f"\n--- CUSTOMER SEGMENTATION ANALYSIS ---")
# Create customer segments based on RFM (Recency, Frequency, Monetary)
# Simulate additional data for RFM analysis
df['recency_days'] = np.random.exponential(30, len(df))
df['frequency_purchases'] = np.random.poisson(5, len(df))
df['monetary_total'] = df['purchase_amount'] * df['frequency_purchases'] + np.random.normal(0, 50, len(df))
# Calculate RFM scores (quintiles)
df['R_score'] = pd.qcut(df['recency_days'], q=5, labels=[5,4,3,2,1]) # Reverse for recency
df['F_score'] = pd.qcut(df['frequency_purchases'], q=5, labels=[1,2,3,4,5])
df['M_score'] = pd.qcut(df['monetary_total'], q=5, labels=[1,2,3,4,5])
# Convert to numeric for calculation
df['R_score'] = df['R_score'].astype(int)
df['F_score'] = df['F_score'].astype(int)
df['M_score'] = df['M_score'].astype(int)
# Create overall RFM score
df['RFM_score'] = df['R_score'] + df['F_score'] + df['M_score']
# Segment customers
def assign_segment(rfm_score):
if rfm_score >= 12:
return 'Champions'
elif rfm_score >= 10:
return 'Loyal Customers'
elif rfm_score >= 8:
return 'Potential Loyalists'
elif rfm_score >= 6:
return 'New Customers'
else:
return 'At Risk'
df['Customer_Segment'] = df['RFM_score'].apply(assign_segment)
# Segment analysis
segment_stats = df.groupby('Customer_Segment').agg({
'monetary_total': ['count', 'mean', 'std'],
'frequency_purchases': 'mean',
'recency_days': 'mean',
'satisfaction_score': 'mean'
}).round(2)
print("Customer segment analysis:")
print(segment_stats)
# Statistical tests between segments
print(f"\n--- SEGMENT COMPARISON TESTS ---")
segments = df['Customer_Segment'].unique()
champions = df[df['Customer_Segment'] == 'Champions']['satisfaction_score']
at_risk = df[df['Customer_Segment'] == 'At Risk']['satisfaction_score']
if len(champions) > 0 and len(at_risk) > 0:
t_stat, p_value = stats.ttest_ind(champions, at_risk)
print(f"Satisfaction comparison (Champions vs At Risk):")
print(f" Champions mean: {champions.mean():.2f}")
print(f" At Risk mean: {at_risk.mean():.2f}")
print(f" T-test p-value: {p_value:.4f}")
print(f" Significant difference: {p_value < 0.05}")
Statistical Reporting and Visualization
print("=== STATISTICAL REPORTING ===")
def create_statistical_summary_report(df, numerical_cols, categorical_cols):
"""
Create comprehensive statistical summary report
"""
report = {
'dataset_info': {
'name': 'Customer Analysis Dataset',
'observations': len(df),
'variables': len(df.columns),
'numerical_variables': len(numerical_cols),
'categorical_variables': len(categorical_cols),
'missing_values': df.isnull().sum().sum(),
'duplicate_rows': df.duplicated().sum()
},
'numerical_summary': {},
'categorical_summary': {},
'correlation_summary': {},
'statistical_tests': {}
}
# Numerical variable summary
for col in numerical_cols:
if col in df.columns:
col_data = df[col].dropna()
report['numerical_summary'][col] = {
'count': len(col_data),
'mean': col_data.mean(),
'median': col_data.median(),
'std': col_data.std(),
'min': col_data.min(),
'max': col_data.max(),
'skewness': col_data.skew(),
'kurtosis': col_data.kurtosis(),
'q25': col_data.quantile(0.25),
'q75': col_data.quantile(0.75),
'iqr': col_data.quantile(0.75) - col_data.quantile(0.25)
}
# Categorical variable summary
for col in categorical_cols:
if col in df.columns:
            report['categorical_summary'][col] = {
                'unique_values': int(df[col].nunique()),
                'most_frequent': df[col].mode().iloc[0] if len(df[col].mode()) > 0 else None,
                'most_frequent_count': int(df[col].value_counts().iloc[0]) if len(df[col]) > 0 else 0,
                # Stringify keys so boolean/categorical levels serialize cleanly to JSON
                'distribution': {str(k): int(v) for k, v in df[col].value_counts().items()}
            }
# Correlation analysis
numeric_df = df[numerical_cols].select_dtypes(include=[np.number])
if len(numeric_df.columns) > 1:
corr_matrix = numeric_df.corr()
# Find highest correlations
corr_pairs = []
for i in range(len(corr_matrix.columns)):
for j in range(i+1, len(corr_matrix.columns)):
corr_pairs.append({
'var1': corr_matrix.columns[i],
'var2': corr_matrix.columns[j],
'correlation': corr_matrix.iloc[i, j]
})
corr_pairs_df = pd.DataFrame(corr_pairs)
strongest_corr = corr_pairs_df.loc[corr_pairs_df['correlation'].abs().idxmax()]
report['correlation_summary'] = {
'strongest_positive': corr_pairs_df.loc[corr_pairs_df['correlation'].idxmax()].to_dict(),
'strongest_negative': corr_pairs_df.loc[corr_pairs_df['correlation'].idxmin()].to_dict(),
'average_correlation': corr_pairs_df['correlation'].abs().mean()
}
return report
# Generate comprehensive report
numerical_vars = ['age', 'income', 'education_years', 'experience', 'purchase_amount', 'satisfaction_score']
categorical_vars = ['region', 'product_category', 'is_premium']
statistical_report = create_statistical_summary_report(df, numerical_vars, categorical_vars)
print("STATISTICAL ANALYSIS REPORT")
print("=" * 50)
print(f"Dataset: {statistical_report['dataset_info']['name']}")
print(f"Observations: {statistical_report['dataset_info']['observations']:,}")
print(f"Variables: {statistical_report['dataset_info']['variables']}")
print(f"Missing values: {statistical_report['dataset_info']['missing_values']}")
print(f"\nKEY FINDINGS:")
print("-" * 20)
# Highlight interesting findings
# Avoid shadowing the scipy.stats module with the loop variable name
for var, var_stats in statistical_report['numerical_summary'].items():
    if abs(var_stats['skewness']) > 1:
        skew_direction = "right" if var_stats['skewness'] > 0 else "left"
        print(f"• {var} is highly {skew_direction}-skewed (skewness: {var_stats['skewness']:.2f})")
if 'correlation_summary' in statistical_report and statistical_report['correlation_summary']:
strongest_pos = statistical_report['correlation_summary']['strongest_positive']
print(f"• Strongest positive correlation: {strongest_pos['var1']} & {strongest_pos['var2']} (r = {strongest_pos['correlation']:.3f})")
strongest_neg = statistical_report['correlation_summary']['strongest_negative']
print(f"• Strongest negative correlation: {strongest_neg['var1']} & {strongest_neg['var2']} (r = {strongest_neg['correlation']:.3f})")
# Top categories
for var, cat_stats in statistical_report['categorical_summary'].items():
    most_common = cat_stats['most_frequent']
    percentage = (cat_stats['most_frequent_count'] / len(df)) * 100
    print(f"• Most common {var}: {most_common} ({percentage:.1f}%)")
print(f"\nRECOMMENDations:")
print("-" * 20)
print("• Consider log transformation for right-skewed variables")
print("• Investigate outliers in variables with high kurtosis")
print("• Use non-parametric tests for non-normal distributions")
print("• Consider stratified analysis by key categorical variables")
# Save report to file
import json
def convert_numpy(obj):
    """Convert NumPy/pandas scalar types to plain Python types for JSON serialization."""
    if isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.bool_):
        return bool(obj)
    elif pd.isna(obj):
        return None
    return str(obj)  # fall back to a string for anything else json cannot handle
with open('statistical_analysis_report.json', 'w') as f:
    json.dump(statistical_report, f, indent=2, default=convert_numpy)
print(f"\nâś… Detailed report saved to 'statistical_analysis_report.json'")
Key Takeaways
Essential Statistical Concepts:
Descriptive Statistics:
- Central Tendency: Mean, median, mode
- Dispersion: Variance, standard deviation, IQR
- Distribution Shape: Skewness, kurtosis
- Position: Percentiles, quartiles
Inferential Statistics:
- Hypothesis Testing: t-tests, ANOVA, chi-square
- Confidence Intervals: Bootstrap, parametric
- Correlation: Pearson, Spearman, Kendall
- Regression: Linear, multiple regression
Practical Applications:
- A/B Testing: Compare groups statistically
- Segmentation: RFM analysis, clustering validation
- Quality Control: Outlier detection, process monitoring
- Forecasting: Trend analysis, time series
Best Practices:
- Always explore your data before applying statistical tests
- Check assumptions (normality, independence, equal variance); see the equal-variance sketch after this list
- Use appropriate tests for your data type and distribution
- Consider effect size along with statistical significance
- Apply multiple comparison corrections when needed
- Use robust statistics when dealing with outliers
- Validate results with different methods when possible
- Document your methodology and assumptions
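As an example of the assumption-checking advice above, Levene's test can decide between the pooled and Welch's t-test; a small sketch reusing the premium/regular income series defined earlier (it assumes those series are still in scope):
# Check the equal-variance assumption before choosing a t-test variant
levene_stat, levene_p = stats.levene(premium_income, regular_income)
equal_var = levene_p > 0.05
t_stat, p_val = stats.ttest_ind(premium_income, regular_income, equal_var=equal_var)
test_used = 'pooled' if equal_var else "Welch's"
print(f"Levene's test p = {levene_p:.4f} -> {test_used} t-test, p = {p_val:.4f}")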
Common Pitfalls to Avoid:
- P-hacking: Testing multiple hypotheses without correction (a correction sketch follows this list)
- Assuming causation from correlation
- Ignoring outliers without investigation
- Using inappropriate tests for non-normal data
- Over-interpreting small effect sizes
- Not checking statistical assumptions
- Conflating statistical and practical significance
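To illustrate the multiple-testing pitfall above, here is a minimal sketch of the Holm-Bonferroni step-down correction in plain NumPy; holm_correction is a hypothetical helper and the p-values are illustrative:
# Holm-Bonferroni correction for a family of hypothetical p-values
def holm_correction(p_values, alpha=0.05):
    """Return a boolean array marking which hypotheses survive Holm's step-down procedure."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    reject = np.zeros(len(p), dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (len(p) - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject
raw_p = [0.001, 0.012, 0.030, 0.045, 0.210]  # illustrative p-values
print("Holm-corrected rejections:", holm_correction(raw_p))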
What's Next?
Now that you've mastered statistical operations in pandas, explore:
- Data Visualization with Pandas
- Machine Learning with Pandas and Scikit-learn
- Advanced Time Series Analysis
- Bayesian Statistics with Python
Statistical analysis is the foundation of data-driven decision making. Master these techniques to extract meaningful insights and make confident conclusions from your data.
