Time Series Analysis with Pandas
Time series analysis is crucial for working with temporal data - from stock prices and weather patterns to sales trends and sensor readings. Pandas provides powerful tools specifically designed for time series data, making it easy to manipulate, analyze, and visualize data that changes over time.
What is Time Series Data?
Time series data consists of observations recorded at specific time intervals:
- Stock prices recorded every minute
- Daily temperature readings
- Monthly sales figures
- Yearly economic indicators
- Sensor data collected every millisecond
Pandas excels at handling time series through:
- Flexible date/time handling
- Resampling and frequency conversion
- Moving window statistics
- Time-based indexing and slicing
- Timezone support
Date and Time Data Types
Creating Timestamps
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Various ways to create timestamps
timestamp1 = pd.Timestamp('2024-01-15')
timestamp2 = pd.Timestamp('2024-01-15 14:30:00')
timestamp3 = pd.Timestamp(2024, 1, 15, 14, 30)
timestamp4 = pd.Timestamp.now()
print("Timestamps:")
print(f"Date only: {timestamp1}")
print(f"Date and time: {timestamp2}")
print(f"From components: {timestamp3}")
print(f"Current time: {timestamp4}")
# Timestamp attributes
print(f"\nYear: {timestamp2.year}")
print(f"Month: {timestamp2.month}")
print(f"Day: {timestamp2.day}")
print(f"Hour: {timestamp2.hour}")
print(f"Day of week: {timestamp2.day_name()}")
print(f"Quarter: {timestamp2.quarter}")
Creating Date Ranges
# Create date ranges
daily = pd.date_range(start='2024-01-01', end='2024-01-31', freq='D')
print(f"Daily dates: {len(daily)} days")
# Business days only
business_days = pd.date_range(start='2024-01-01', end='2024-01-31', freq='B')
print(f"Business days: {len(business_days)} days")
# By period count
monthly = pd.date_range(start='2024-01-01', periods=12, freq='MS')
print(f"Monthly (start of month): {len(monthly)} months")
# Common frequency strings
print("\nCommon Frequencies:")
print("D - Calendar day")
print("B - Business day")
print("W - Weekly")
print("MS - Month start")
print("ME - Month end ('M' before pandas 2.2)")
print("QE - Quarter end ('Q' before pandas 2.2)")
print("YE - Year end ('A' before pandas 2.2)")
print("h - Hourly ('H' before pandas 2.2)")
print("min - Minutely ('T' before pandas 2.2)")
print("s - Secondly ('S' before pandas 2.2)")
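Frequency strings can also be anchored (for example, weeks ending on a particular weekday) or combined with a multiplier. A small sketch, using dates chosen here purely for illustration:

```python
import pandas as pd

# Anchored weekly frequency: weeks ending on Monday
mondays = pd.date_range('2024-01-01', periods=4, freq='W-MON')

# Multiplied frequency: every 90 minutes
every_90min = pd.date_range('2024-01-01 09:00', periods=4, freq='90min')

# Multiples of a base frequency: every 2 business days
alt_bdays = pd.date_range('2024-01-01', periods=4, freq='2B')

print(mondays)
print(every_90min)
print(alt_bdays)
```

Since 2024-01-01 is itself a Monday, it is "on-anchor" and appears as the first element of the `W-MON` range.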
Parsing Dates from Strings
# Parse various date formats
dates_list = ['2024-01-15', '01/15/2024', 'Jan 15, 2024', '15-Jan-2024']
for date_str in dates_list:
    parsed = pd.to_datetime(date_str)
    print(f"{date_str:20} -> {parsed}")
# Parse with specific format (faster for large datasets)
dates = pd.to_datetime(['20240115', '20240116', '20240117'], format='%Y%m%d')
print(f"\nParsed dates: {dates}")
# Handle parsing errors
invalid_dates = ['2024-01-15', 'not a date', '2024-02-30']
dates_coerced = pd.to_datetime(invalid_dates, errors='coerce')
print(f"\nWith invalid dates (coerced): {dates_coerced}")
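pd.to_datetime() also understands numeric Unix epochs when you pass the unit parameter, which is common when ingesting data from APIs or logs. A quick sketch:

```python
import pandas as pd

# Unix epoch seconds -> timestamps (naive, interpreted as UTC wall time)
epochs = pd.to_datetime([1704067200, 1704153600], unit='s')
print(epochs)  # 2024-01-01 and 2024-01-02
```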
Time Series DataFrames
Creating Time Series Data
# Create sample stock price data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=100, freq='D')
stock_data = pd.DataFrame({
    'date': dates,
    'price': 100 + np.cumsum(np.random.randn(100)),
    'volume': np.random.randint(1000000, 5000000, 100)
})
stock_data['returns'] = stock_data['price'].pct_change()
print("Stock Data:")
print(stock_data.head())
# Set date as index (recommended for time series)
stock_data.set_index('date', inplace=True)
print("\nWith Date Index:")
print(stock_data.head())
Time-Based Indexing and Slicing
# Select specific date
jan_15 = stock_data.loc['2024-01-15']
print("Data for Jan 15:")
print(jan_15)
# Select date range
january = stock_data.loc['2024-01']  # partial-string indexing needs .loc in modern pandas
print(f"\nJanuary data: {len(january)} days")
# Select between dates
q1 = stock_data['2024-01-01':'2024-03-31']
print(f"Q1 data: {len(q1)} days")
# Select recent data (DataFrame.last() is deprecated in pandas 2.1+)
cutoff = stock_data.index.max() - pd.Timedelta(days=7)
last_week = stock_data.loc[stock_data.index > cutoff]
print(f"\nLast 7 days: {len(last_week)} days")
# Select by year
year_2024 = stock_data.loc['2024']
print(f"Year 2024: {len(year_2024)} days")
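For intraday data, pandas can also filter by clock time regardless of the date, via between_time() and at_time(). A small sketch on hypothetical 30-minute readings:

```python
import numpy as np
import pandas as pd

# One day of readings every 30 minutes (made-up data)
idx = pd.date_range('2024-01-15', periods=48, freq='30min')
intraday = pd.DataFrame({'value': np.arange(48)}, index=idx)

# Rows whose clock time falls between 09:00 and 17:00 (inclusive by default)
work_hours = intraday.between_time('09:00', '17:00')

# Rows at exactly 12:00
noon = intraday.at_time('12:00')
print(len(work_hours), len(noon))
```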
Resampling: Changing Frequency
Resampling converts time series from one frequency to another.
Downsampling (Higher to Lower Frequency)
# Create hourly data
hourly_data = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=24*7, freq='h'),
    'temperature': np.random.normal(70, 5, 24*7),
    'humidity': np.random.normal(60, 10, 24*7)
}).set_index('timestamp')
print("Hourly Data (first 5 rows):")
print(hourly_data.head())
# Resample to daily (average)
daily_avg = hourly_data.resample('D').mean()
print("\nDaily Averages:")
print(daily_avg.head())
# Multiple aggregations
daily_stats = hourly_data.resample('D').agg({
    'temperature': ['mean', 'min', 'max'],
    'humidity': ['mean', 'std']
})
print("\nDaily Statistics:")
print(daily_stats.head())
# Weekly summary
weekly = hourly_data.resample('W').agg({
    'temperature': 'mean',
    'humidity': 'mean'
})
print(f"\nWeekly data: {len(weekly)} weeks")
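For price-like series, resamplers also offer ohlc(), which computes the open, high, low, and close of each bin in one call. A sketch on a made-up hourly price series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.Series(
    100 + np.cumsum(rng.standard_normal(24 * 5)),  # 5 days of hourly prices
    index=pd.date_range('2024-01-01', periods=24 * 5, freq='h'),
)

# Open/high/low/close summary per day
daily_ohlc = prices.resample('D').ohlc()
print(daily_ohlc)
```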
Upsampling (Lower to Higher Frequency)
# Create monthly data
monthly_data = pd.DataFrame({
    'month': pd.date_range('2024-01-01', periods=12, freq='MS'),
    'sales': np.random.randint(10000, 50000, 12)
}).set_index('month')
print("Monthly Data:")
print(monthly_data)
# Upsample to daily (forward fill)
daily_sales = monthly_data.resample('D').ffill()
print(f"\nUpsampled to daily: {len(daily_sales)} days")
print(daily_sales.head(10))
# Upsample with interpolation
daily_interpolated = monthly_data.resample('D').interpolate(method='linear')
print("\nDaily with Interpolation:")
print(daily_interpolated.head(10))
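A related method is asfreq(), which is worth contrasting with resample(): asfreq() purely reindexes to the new frequency without aggregating, leaving NaN at new positions unless you supply a fill method. A minimal sketch with made-up monthly values:

```python
import pandas as pd

monthly = pd.Series(
    [100, 130, 90],
    index=pd.date_range('2024-01-01', periods=3, freq='MS'),
)

# asfreq: pure reindexing to the new frequency -- no aggregation,
# new positions are NaN unless a fill method is given
daily_nan = monthly.asfreq('D')
daily_ffill = monthly.asfreq('D', method='ffill')
print(daily_nan.head(3))
print(daily_ffill.head(3))
```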
Rolling Windows (Moving Statistics)
Simple Moving Average
# Calculate moving averages
stock_data['SMA_7'] = stock_data['price'].rolling(window=7).mean()
stock_data['SMA_30'] = stock_data['price'].rolling(window=30).mean()
print("Stock Prices with Moving Averages:")
print(stock_data[['price', 'SMA_7', 'SMA_30']].tail(10))
# Plot (conceptual)
# stock_data[['price', 'SMA_7', 'SMA_30']].plot(figsize=(12, 6))
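Alongside simple moving averages, pandas provides exponentially weighted windows via ewm(), which weight recent observations more heavily and produce no leading NaN window. A sketch on a made-up price series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
price = pd.Series(
    100 + np.cumsum(rng.standard_normal(60)),
    index=pd.date_range('2024-01-01', periods=60, freq='D'),
)

# Exponentially weighted moving average; with adjust=False the
# first EMA value equals the first observation
ema_12 = price.ewm(span=12, adjust=False).mean()
print(ema_12.tail())
```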
Rolling Statistics
# Various rolling calculations
stock_data['rolling_std'] = stock_data['price'].rolling(window=7).std()
stock_data['rolling_min'] = stock_data['price'].rolling(window=7).min()
stock_data['rolling_max'] = stock_data['price'].rolling(window=7).max()
# Bollinger Bands
window = 20
stock_data['middle_band'] = stock_data['price'].rolling(window=window).mean()
stock_data['std'] = stock_data['price'].rolling(window=window).std()
stock_data['upper_band'] = stock_data['middle_band'] + (stock_data['std'] * 2)
stock_data['lower_band'] = stock_data['middle_band'] - (stock_data['std'] * 2)
print("Bollinger Bands:")
print(stock_data[['price', 'upper_band', 'middle_band', 'lower_band']].tail())
Expanding Windows
# Expanding window (cumulative statistics)
stock_data['expanding_mean'] = stock_data['price'].expanding().mean()
stock_data['expanding_max'] = stock_data['price'].expanding().max()
print("Expanding Statistics:")
print(stock_data[['price', 'expanding_mean', 'expanding_max']].tail())
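When the built-in window statistics are not enough, rolling(...).apply() accepts a custom function. A small sketch computing the range (max minus min) of each 3-point window; window_range is a name invented here:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0])

# Custom rolling statistic: the range (max - min) of each 3-point window.
# raw=True passes a plain NumPy array to the function, which is faster.
def window_range(x):
    return x.max() - x.min()

rolling_range = s.rolling(window=3).apply(window_range, raw=True)
print(rolling_range)
```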
Shifting and Lagging
# Shift data forward (lag)
stock_data['price_yesterday'] = stock_data['price'].shift(1)
stock_data['price_last_week'] = stock_data['price'].shift(7)
# Calculate changes
stock_data['daily_change'] = stock_data['price'] - stock_data['price_yesterday']
stock_data['weekly_change'] = stock_data['price'] - stock_data['price_last_week']
print("Shifted Data:")
print(stock_data[['price', 'price_yesterday', 'daily_change']].head(10))
# Shift backwards (lead)
stock_data['price_tomorrow'] = stock_data['price'].shift(-1)
# Percentage change
stock_data['pct_change'] = stock_data['price'].pct_change()
stock_data['pct_change_7d'] = stock_data['price'].pct_change(periods=7)
print("\nPercentage Changes:")
print(stock_data[['price', 'pct_change', 'pct_change_7d']].tail())
Time Differences and Deltas
# Calculate time differences
dates_df = pd.DataFrame({
    'start': pd.date_range('2024-01-01', periods=5, freq='D'),
    'end': pd.date_range('2024-01-10', periods=5, freq='D')
})
dates_df['duration'] = dates_df['end'] - dates_df['start']
dates_df['duration_days'] = dates_df['duration'].dt.days
print("Time Differences:")
print(dates_df)
# Add time deltas
dates_df['one_week_later'] = dates_df['start'] + pd.Timedelta(days=7)
dates_df['one_month_later'] = dates_df['start'] + pd.DateOffset(months=1)
print("\nWith Time Deltas:")
print(dates_df[['start', 'one_week_later', 'one_month_later']])
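Beyond fixed Timedeltas and DateOffsets, pandas also ships calendar-aware offsets such as business days. A quick sketch of the difference (2024-01-05 is a Friday):

```python
import pandas as pd

ts = pd.Timestamp('2024-01-05')  # a Friday

# Calendar arithmetic vs business-day arithmetic
plus_two_days = ts + pd.Timedelta(days=2)   # lands on Sunday
plus_two_bdays = ts + pd.offsets.BDay(2)    # skips the weekend
print(plus_two_days, plus_two_bdays)
```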
Practical Examples
Example 1: Sales Trend Analysis
# Generate sales data
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2024-12-31', freq='D')
# Simulate sales with trend and seasonality
trend = np.linspace(5000, 8000, len(dates))
seasonal = 1000 * np.sin(np.arange(len(dates)) * 2 * np.pi / 365)
noise = np.random.normal(0, 500, len(dates))
sales_values = trend + seasonal + noise
sales_df = pd.DataFrame({
    'date': dates,
    'daily_sales': sales_values
}).set_index('date')
# Calculate various metrics
sales_df['7_day_avg'] = sales_df['daily_sales'].rolling(window=7).mean()
sales_df['30_day_avg'] = sales_df['daily_sales'].rolling(window=30).mean()
sales_df['mtd_sales'] = sales_df.groupby(sales_df.index.to_period('M'))['daily_sales'].cumsum()
# Monthly aggregation
# 'ME' = month end ('M' before pandas 2.2)
monthly_sales = sales_df['daily_sales'].resample('ME').agg([
    ('total_sales', 'sum'),
    ('avg_daily_sales', 'mean'),
    ('max_daily_sales', 'max'),
    ('days_over_7000', lambda x: (x > 7000).sum())
])
print("Monthly Sales Summary:")
print(monthly_sales.head())
# Year-over-year comparison
sales_df['year'] = sales_df.index.year
sales_df['day_of_year'] = sales_df.index.dayofyear
# Compare same day across years
comparison = sales_df.pivot_table(
    values='daily_sales',
    index='day_of_year',
    columns='year',
    aggfunc='mean'
)
print("\nYear-over-Year Comparison (first 10 days):")
print(comparison.head(10))
Example 2: Website Traffic Analysis
# Website traffic data
traffic = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=24*30, freq='h'),
    'visitors': np.random.poisson(100, 24*30) + 50 * np.sin(np.arange(24*30) * 2 * np.pi / 24),
    'page_views': np.random.poisson(500, 24*30)
}).set_index('timestamp')
traffic['pages_per_visitor'] = traffic['page_views'] / traffic['visitors']
# Extract time components
traffic['hour'] = traffic.index.hour
traffic['day_of_week'] = traffic.index.day_name()
traffic['is_weekend'] = traffic.index.dayofweek >= 5
# Hourly patterns
hourly_avg = traffic.groupby('hour')['visitors'].mean()
print("Average Visitors by Hour:")
print(hourly_avg)
# Weekend vs Weekday comparison
weekend_comparison = traffic.groupby('is_weekend')[['visitors', 'page_views']].mean()
weekend_comparison.index = ['Weekday', 'Weekend']
print("\nWeekend vs Weekday:")
print(weekend_comparison)
# Peak traffic hours
daily_peak = traffic['visitors'].resample('D').agg(lambda day: day.idxmax())
print("\nPeak Traffic Hour (first 5 days):")
print(daily_peak.head())
Example 3: Sensor Data Analysis
# Simulate sensor data
sensor_data = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=1000, freq='min'),
    'temperature': 70 + np.cumsum(np.random.randn(1000) * 0.1),
    'pressure': 1013 + np.cumsum(np.random.randn(1000) * 0.5),
    'humidity': 60 + 10 * np.sin(np.arange(1000) * 2 * np.pi / 1440)
}).set_index('timestamp')
# Detect anomalies (values beyond 3 standard deviations)
for col in ['temperature', 'pressure', 'humidity']:
    mean = sensor_data[col].mean()
    std = sensor_data[col].std()
    sensor_data[f'{col}_anomaly'] = (
        (sensor_data[col] < mean - 3*std) |
        (sensor_data[col] > mean + 3*std)
    )
anomalies = sensor_data[
    sensor_data['temperature_anomaly'] |
    sensor_data['pressure_anomaly'] |
    sensor_data['humidity_anomaly']
]
print(f"Detected {len(anomalies)} anomalies")
print("\nFirst few anomalies:")
print(anomalies[['temperature', 'pressure', 'humidity']].head())
# Calculate rate of change
sensor_data['temp_rate'] = sensor_data['temperature'].diff() / sensor_data.index.to_series().diff().dt.total_seconds()
print("\nTemperature Rate of Change (°C per second):")
print(sensor_data[['temperature', 'temp_rate']].head(10))
Time Zones
Working with Time Zones
# Create timezone-aware datetime
utc_time = pd.Timestamp('2024-01-15 12:00:00', tz='UTC')
print(f"UTC time: {utc_time}")
# Convert to different timezone
ny_time = utc_time.tz_convert('America/New_York')
tokyo_time = utc_time.tz_convert('Asia/Tokyo')
print(f"New York time: {ny_time}")
print(f"Tokyo time: {tokyo_time}")
# Localize naive datetime
naive_time = pd.Timestamp('2024-01-15 12:00:00')
localized = naive_time.tz_localize('US/Pacific')
print(f"Localized to Pacific: {localized}")
# DataFrame with timezone
tz_df = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=5, freq='h', tz='UTC'),
    'value': range(5)
})
print("\nTimezone-aware DataFrame:")
print(tz_df)
# Convert entire index
tz_df['timestamp_ny'] = tz_df['timestamp'].dt.tz_convert('America/New_York')
print("\nWith New York timezone:")
print(tz_df)
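Localization gets tricky around daylight saving transitions: some wall-clock times do not exist (spring forward) and some occur twice (fall back). tz_localize() exposes the nonexistent and ambiguous parameters for exactly these cases. A sketch using the 2024 US transitions:

```python
import pandas as pd

# In US/Eastern, clocks jump from 02:00 to 03:00 on 2024-03-10,
# so 02:30 does not exist on the wall clock
naive = pd.Timestamp('2024-03-10 02:30')
shifted = naive.tz_localize('US/Eastern', nonexistent='shift_forward')
print(shifted)

# On 2024-11-03 clocks fall back, so 01:30 occurs twice;
# ambiguous=True picks the DST (first) occurrence, False the standard one
dup = pd.Timestamp('2024-11-03 01:30')
first = dup.tz_localize('US/Eastern', ambiguous=True)
second = dup.tz_localize('US/Eastern', ambiguous=False)
print(first, second)
```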
Advanced Time Series Operations
Seasonal Decomposition (Conceptual)
# Decompose time series into trend, seasonal, and residual components
# This would typically use statsmodels.tsa.seasonal.seasonal_decompose
# Manual seasonal pattern identification
sales_df['month'] = sales_df.index.month
monthly_pattern = sales_df.groupby('month')['daily_sales'].mean()
print("Average Sales by Month (Seasonal Pattern):")
print(monthly_pattern)
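The same idea can be pushed one step further into a rough manual decomposition: estimate the trend with a centered rolling mean, subtract it, and average the detrended values by period. A self-contained sketch on a synthetic series (the weekly pattern here is invented):

```python
import numpy as np
import pandas as pd

# Synthetic series: linear trend plus a repeating weekly pattern
idx = pd.date_range('2024-01-01', periods=28 * 4, freq='D')
y = pd.Series(np.linspace(0, 10, len(idx)) + np.tile([5, 3, 1, 0, 0, 2, 4], 16),
              index=idx)

# Trend: centered 7-day rolling mean; detrended = series - trend
trend = y.rolling(window=7, center=True).mean()
detrended = y - trend

# Seasonal component: average detrended value per day of week
seasonal = detrended.groupby(detrended.index.dayofweek).mean()
print(seasonal.round(2))
```

For real work, statsmodels' seasonal_decompose automates this (and handles multiplicative models too).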
Autocorrelation
# Calculate autocorrelation at different lags
def autocorrelation(series, lag):
    return series.corr(series.shift(lag))

print("Autocorrelation Analysis:")
for lag in [1, 7, 30]:
    corr = autocorrelation(stock_data['price'], lag)
    print(f"Lag {lag}: {corr:.4f}")
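In fact, pandas ships this helper built in as Series.autocorr(lag). A quick sketch on a sine wave with period 10, where the lags make the result easy to predict:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.sin(np.arange(100) * 2 * np.pi / 10))  # period = 10

print(f"lag 1:  {s.autocorr(lag=1):.3f}")
print(f"lag 5:  {s.autocorr(lag=5):.3f}")   # half period -> strongly negative
print(f"lag 10: {s.autocorr(lag=10):.3f}")  # full period -> strongly positive
```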
Forward and Backward Fill
# Create data with missing values
incomplete_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'value': [10, np.nan, np.nan, 20, np.nan, 30, np.nan, np.nan, 40, 50]
}).set_index('date')
print("Original Data with NaN:")
print(incomplete_data)
# Forward fill (fillna(method='ffill') is deprecated; use ffill()/bfill())
ffill = incomplete_data.ffill()
print("\nForward Fill:")
print(ffill)
# Backward fill
bfill = incomplete_data.bfill()
print("\nBackward Fill:")
print(bfill)
# Interpolate
interpolated = incomplete_data.interpolate(method='linear')
print("\nLinear Interpolation:")
print(interpolated)
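On an irregularly spaced index, method='linear' can mislead: it interpolates by position, splitting a gap evenly regardless of how much time passed. method='time' weights by actual elapsed time instead. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Irregularly spaced observations: a 1-day gap then a 3-day gap
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
s = pd.Series([10.0, np.nan, 40.0], index=idx)

# method='linear' ignores the index and splits the gap evenly;
# method='time' weights by the actual time elapsed
print(s.interpolate(method='linear'))  # Jan 2 -> 25.0
print(s.interpolate(method='time'))    # Jan 2 -> 17.5
```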
Performance Tips
# For large time series, use appropriate data types.
# (df here stands for any DataFrame with a 'date' column.)
# Convert to datetime64 if not already
df['date'] = pd.to_datetime(df['date'])
# Use categorical for repeated values
df['day_of_week'] = df['date'].dt.day_name().astype('category')
# For filtering a datetime-indexed frame, boolean masks on the index
# are faster than string-based slicing on large datasets
df = df.set_index('date')
mask = (df.index >= '2024-01-01') & (df.index <= '2024-12-31')
filtered = df[mask]
Practice Exercises
Exercise 1: Temperature Analysis
Given hourly temperature data:
- Calculate daily min, max, and mean temperatures
- Find the hottest and coldest weeks
- Identify heat waves (3+ consecutive days above 90°F)
Exercise 2: Stock Portfolio
With multiple stock prices:
- Calculate daily returns for each stock
- Compute 50-day and 200-day moving averages
- Identify golden crosses (50-day crosses above 200-day)
Exercise 3: E-commerce Metrics
Using order timestamp data:
- Calculate orders per hour, day, and month
- Find peak shopping hours and days
- Compare year-over-year growth
Sample Solutions
Exercise 1:
# Temperature analysis solution
daily_temp = hourly_data['temperature'].resample('D').agg(['min', 'max', 'mean'])
# Weekly analysis
weekly_temp = hourly_data['temperature'].resample('W').agg(['min', 'max', 'mean'])
hottest_week = weekly_temp['mean'].idxmax()
coldest_week = weekly_temp['mean'].idxmin()
print(f"Hottest week: {hottest_week}")
print(f"Coldest week: {coldest_week}")
# Heat wave detection: flags the third day of each run of
# 3+ consecutive days whose daily maximum exceeded 90°F
heat_wave = (daily_temp['max'] > 90).rolling(window=3).sum() >= 3
heat_wave_days = daily_temp[heat_wave]
print(f"\nHeat wave days: {len(heat_wave_days)}")
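Exercise 2 can be sketched the same way. This uses a made-up single-stock price series; the golden-cross test compares today's SMA ordering against yesterday's:

```python
import numpy as np
import pandas as pd

# Hypothetical single-stock price series
rng = np.random.default_rng(7)
price = pd.Series(100 + np.cumsum(rng.standard_normal(400)),
                  index=pd.date_range('2023-01-01', periods=400, freq='D'))

sma_50 = price.rolling(50).mean()
sma_200 = price.rolling(200).mean()

# Golden cross: the 50-day SMA crosses from below to above the 200-day SMA
above = sma_50 > sma_200
golden_cross = above & ~above.shift(1, fill_value=False)
print(price.index[golden_cross & sma_200.notna()])
```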
Key Takeaways
- Use pd.Timestamp for individual time points and pd.date_range() for sequences
- Set datetime as index for time-based operations
- Use resample() to change frequency (downsample or upsample)
- Apply rolling windows for moving averages and statistics
- Use shift() for lagging and leading values
- Handle timezones with tz_localize() and tz_convert()
- Extract time components with the .dt accessor
- Use appropriate filling methods for missing time series data
What's Next?
Now that you understand time series in pandas, explore:
- Statistical Operations - Advanced statistical analysis
- Data Visualization - Plot time series data
- GroupBy Operations - Group time series data
Resources
- Pandas Time Series Documentation: pandas.pydata.org/docs/user_guide/timeseries.html
- Time Series Analysis: Advanced techniques with statsmodels
- Forecasting: Introduction to time series forecasting
Time series analysis is essential for understanding temporal patterns. Practice with different frequencies and techniques to master pandas time series capabilities.
