Time Series Analysis with Pandas
Time series analysis is crucial for working with temporal data - from stock prices and weather patterns to sales trends and sensor readings. Pandas provides powerful tools specifically designed for time series data, making it easy to manipulate, analyze, and visualize data that changes over time.
What is Time Series Data?
Time series data consists of observations recorded at specific time intervals:
- Stock prices recorded every minute
- Daily temperature readings
- Monthly sales figures
- Yearly economic indicators
- Sensor data collected every millisecond
Pandas excels at handling time series through:
- Flexible date/time handling
- Resampling and frequency conversion
- Moving window statistics
- Time-based indexing and slicing
- Timezone support
Date and Time Data Types
Creating Timestamps
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Various ways to create timestamps
timestamp1 = pd.Timestamp('2024-01-15')
timestamp2 = pd.Timestamp('2024-01-15 14:30:00')
timestamp3 = pd.Timestamp(2024, 1, 15, 14, 30)
timestamp4 = pd.Timestamp.now()
print("Timestamps:")
print(f"Date only: {timestamp1}")
print(f"Date and time: {timestamp2}")
print(f"From components: {timestamp3}")
print(f"Current time: {timestamp4}")
# Timestamp attributes
print(f"\nYear: {timestamp2.year}")
print(f"Month: {timestamp2.month}")
print(f"Day: {timestamp2.day}")
print(f"Hour: {timestamp2.hour}")
print(f"Day of week: {timestamp2.day_name()}")
print(f"Quarter: {timestamp2.quarter}")
Creating Date Ranges
# Create date ranges
daily = pd.date_range(start='2024-01-01', end='2024-01-31', freq='D')
print(f"Daily dates: {len(daily)} days")
# Business days only
business_days = pd.date_range(start='2024-01-01', end='2024-01-31', freq='B')
print(f"Business days: {len(business_days)} days")
# By period count
monthly = pd.date_range(start='2024-01-01', periods=12, freq='MS')
print(f"Monthly (start of month): {len(monthly)} months")
# Common frequency strings
print("\nCommon Frequencies:")
print("D - Calendar day")
print("B - Business day")
print("W - Weekly")
print("MS - Month start")
print("ME - Month end ('M' before pandas 2.2)")
print("QE - Quarter end ('Q' before pandas 2.2)")
print("YE - Year end ('A' before pandas 2.2)")
print("h - Hourly ('H' before pandas 2.2)")
print("min - Minutely ('T' before pandas 2.2)")
print("s - Secondly ('S' before pandas 2.2)")
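Frequency strings can also be anchored (for example, weeks ending on a particular weekday) or combined with a multiplier. A small sketch, using dates chosen here purely for illustration:

```python
import pandas as pd

# Anchored weekly frequency: weeks ending on Monday
mondays = pd.date_range('2024-01-01', periods=4, freq='W-MON')

# Multiplied frequency: every 90 minutes
every_90min = pd.date_range('2024-01-01 09:00', periods=4, freq='90min')

# Multiples of a base frequency: every 2 business days
alt_bdays = pd.date_range('2024-01-01', periods=4, freq='2B')

print(mondays)
print(every_90min)
print(alt_bdays)
```

Since 2024-01-01 is itself a Monday, it is "on-anchor" and appears as the first element of the `W-MON` range.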
Parsing Dates from Strings
# Parse various date formats
dates_list = ['2024-01-15', '01/15/2024', 'Jan 15, 2024', '15-Jan-2024']
for date_str in dates_list:
    parsed = pd.to_datetime(date_str)
    print(f"{date_str:20} -> {parsed}")
# Parse with specific format (faster for large datasets)
dates = pd.to_datetime(['20240115', '20240116', '20240117'], format='%Y%m%d')
print(f"\nParsed dates: {dates}")
# Handle parsing errors
invalid_dates = ['2024-01-15', 'not a date', '2024-02-30']
dates_coerced = pd.to_datetime(invalid_dates, errors='coerce')
print(f"\nWith invalid dates (coerced): {dates_coerced}")
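pd.to_datetime() also understands numeric Unix epochs when you pass the unit parameter, which is common when ingesting data from APIs or logs. A quick sketch:

```python
import pandas as pd

# Unix epoch seconds -> timestamps (naive, interpreted as UTC wall time)
epochs = pd.to_datetime([1704067200, 1704153600], unit='s')
print(epochs)  # 2024-01-01 and 2024-01-02
```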
Time Series DataFrames
Creating Time Series Data
# Create sample stock price data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=100, freq='D')
stock_data = pd.DataFrame({
    'date': dates,
    'price': 100 + np.cumsum(np.random.randn(100)),
    'volume': np.random.randint(1000000, 5000000, 100)
})
stock_data['returns'] = stock_data['price'].pct_change()
print("Stock Data:")
print(stock_data.head())
# Set date as index (recommended for time series)
stock_data.set_index('date', inplace=True)
print("\nWith Date Index:")
print(stock_data.head())
Time-Based Indexing and Slicing
# Select specific date
jan_15 = stock_data.loc['2024-01-15']
print("Data for Jan 15:")
print(jan_15)
# Select date range
january = stock_data.loc['2024-01']  # partial-string indexing needs .loc in modern pandas
print(f"\nJanuary data: {len(january)} days")
# Select between dates
q1 = stock_data['2024-01-01':'2024-03-31']
print(f"Q1 data: {len(q1)} days")
# Select recent data (DataFrame.last() is deprecated in pandas 2.1+)
cutoff = stock_data.index.max() - pd.Timedelta(days=7)
last_week = stock_data.loc[stock_data.index > cutoff]
print(f"\nLast 7 days: {len(last_week)} days")
# Select by year
year_2024 = stock_data.loc['2024']
print(f"Year 2024: {len(year_2024)} days")
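For intraday data, pandas can also filter by clock time regardless of the date, via between_time() and at_time(). A small sketch on hypothetical 30-minute readings:

```python
import numpy as np
import pandas as pd

# One day of readings every 30 minutes (made-up data)
idx = pd.date_range('2024-01-15', periods=48, freq='30min')
intraday = pd.DataFrame({'value': np.arange(48)}, index=idx)

# Rows whose clock time falls between 09:00 and 17:00 (inclusive by default)
work_hours = intraday.between_time('09:00', '17:00')

# Rows at exactly 12:00
noon = intraday.at_time('12:00')
print(len(work_hours), len(noon))
```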
Resampling: Changing Frequency
Resampling converts time series from one frequency to another.
Downsampling (Higher to Lower Frequency)
# Create hourly data
hourly_data = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=24*7, freq='h'),
    'temperature': np.random.normal(70, 5, 24*7),
    'humidity': np.random.normal(60, 10, 24*7)
}).set_index('timestamp')
print("Hourly Data (first 5 rows):")
print(hourly_data.head())
# Resample to daily (average)
daily_avg = hourly_data.resample('D').mean()
print("\nDaily Averages:")
print(daily_avg.head())
# Multiple aggregations
daily_stats = hourly_data.resample('D').agg({
    'temperature': ['mean', 'min', 'max'],
    'humidity': ['mean', 'std']
})
print("\nDaily Statistics:")
print(daily_stats.head())
# Weekly summary
weekly = hourly_data.resample('W').agg({
    'temperature': 'mean',
    'humidity': 'mean'
})
print(f"\nWeekly data: {len(weekly)} weeks")
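For price-like series, resamplers also offer ohlc(), which computes the open, high, low, and close of each bin in one call. A sketch on a made-up hourly price series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.Series(
    100 + np.cumsum(rng.standard_normal(24 * 5)),  # 5 days of hourly prices
    index=pd.date_range('2024-01-01', periods=24 * 5, freq='h'),
)

# Open/high/low/close summary per day
daily_ohlc = prices.resample('D').ohlc()
print(daily_ohlc)
```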
Upsampling (Lower to Higher Frequency)
# Create monthly data
monthly_data = pd.DataFrame({
    'month': pd.date_range('2024-01-01', periods=12, freq='MS'),
    'sales': np.random.randint(10000, 50000, 12)
}).set_index('month')
print("Monthly Data:")
print(monthly_data)
# Upsample to daily (forward fill)
daily_sales = monthly_data.resample('D').ffill()
print(f"\nUpsampled to daily: {len(daily_sales)} days")
print(daily_sales.head(10))
# Upsample with interpolation
daily_interpolated = monthly_data.resample('D').interpolate(method='linear')
print("\nDaily with Interpolation:")
print(daily_interpolated.head(10))
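A related method is asfreq(), which is worth contrasting with resample(): asfreq() purely reindexes to the new frequency without aggregating, leaving NaN at new positions unless you supply a fill method. A minimal sketch with made-up monthly values:

```python
import pandas as pd

monthly = pd.Series(
    [100, 130, 90],
    index=pd.date_range('2024-01-01', periods=3, freq='MS'),
)

# asfreq: pure reindexing to the new frequency -- no aggregation,
# new positions are NaN unless a fill method is given
daily_nan = monthly.asfreq('D')
daily_ffill = monthly.asfreq('D', method='ffill')
print(daily_nan.head(3))
print(daily_ffill.head(3))
```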
Rolling Windows (Moving Statistics)
Simple Moving Average
# Calculate moving averages
stock_data['SMA_7'] = stock_data['price'].rolling(window=7).mean()
stock_data['SMA_30'] = stock_data['price'].rolling(window=30).mean()
print("Stock Prices with Moving Averages:")
print(stock_data[['price', 'SMA_7', 'SMA_30']].tail(10))
# Plot (conceptual)
# stock_data[['price', 'SMA_7', 'SMA_30']].plot(figsize=(12, 6))
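Alongside simple moving averages, pandas provides exponentially weighted windows via ewm(), which weight recent observations more heavily and produce no leading NaN window. A sketch on a made-up price series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
price = pd.Series(
    100 + np.cumsum(rng.standard_normal(60)),
    index=pd.date_range('2024-01-01', periods=60, freq='D'),
)

# Exponentially weighted moving average; with adjust=False the
# first EMA value equals the first observation
ema_12 = price.ewm(span=12, adjust=False).mean()
print(ema_12.tail())
```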
Rolling Statistics
# Various rolling calculations
stock_data['rolling_std'] = stock_data['price'].rolling(window=7).std()
stock_data['rolling_min'] = stock_data['price'].rolling(window=7).min()
stock_data['rolling_max'] = stock_data['price'].rolling(window=7).max()
# Bollinger Bands
window = 20
stock_data['middle_band'] = stock_data['price'].rolling(window=window).mean()
stock_data['std'] = stock_data['price'].rolling(window=window).std()
stock_data['upper_band'] = stock_data['middle_band'] + (stock_data['std'] * 2)
stock_data['lower_band'] = stock_data['middle_band'] - (stock_data['std'] * 2)
print("Bollinger Bands:")
print(stock_data[['price', 'upper_band', 'middle_band', 'lower_band']].tail())
Expanding Windows
# Expanding window (cumulative statistics)
stock_data['expanding_mean'] = stock_data['price'].expanding().mean()
stock_data['expanding_max'] = stock_data['price'].expanding().max()
print("Expanding Statistics:")
print(stock_data[['price', 'expanding_mean', 'expanding_max']].tail())
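When the built-in window statistics are not enough, rolling(...).apply() accepts a custom function. A small sketch computing the range (max minus min) of each 3-point window; window_range is a name invented here:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0])

# Custom rolling statistic: the range (max - min) of each 3-point window.
# raw=True passes a plain NumPy array to the function, which is faster.
def window_range(x):
    return x.max() - x.min()

rolling_range = s.rolling(window=3).apply(window_range, raw=True)
print(rolling_range)
```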
Shifting and Lagging
# Shift data forward (lag)
stock_data['price_yesterday'] = stock_data['price'].shift(1)
stock_data['price_last_week'] = stock_data['price'].shift(7)
# Calculate changes
stock_data['daily_change'] = stock_data['price'] - stock_data['price_yesterday']
stock_data['weekly_change'] = stock_data['price'] - stock_data['price_last_week']
print("Shifted Data:")
print(stock_data[['price', 'price_yesterday', 'daily_change']].head(10))
# Shift backwards (lead)
stock_data['price_tomorrow'] = stock_data['price'].shift(-1)
# Percentage change
stock_data['pct_change'] = stock_data['price'].pct_change()
stock_data['pct_change_7d'] = stock_data['price'].pct_change(periods=7)
print("\nPercentage Changes:")
print(stock_data[['price', 'pct_change', 'pct_change_7d']].tail())
Time Differences and Deltas
# Calculate time differences
dates_df = pd.DataFrame({
    'start': pd.date_range('2024-01-01', periods=5, freq='D'),
    'end': pd.date_range('2024-01-10', periods=5, freq='D')
})
dates_df['duration'] = dates_df['end'] - dates_df['start']
dates_df['duration_days'] = dates_df['duration'].dt.days
print("Time Differences:")
print(dates_df)
# Add time deltas
dates_df['one_week_later'] = dates_df['start'] + pd.Timedelta(days=7)
dates_df['one_month_later'] = dates_df['start'] + pd.DateOffset(months=1)
print("\nWith Time Deltas:")
print(dates_df[['start', 'one_week_later', 'one_month_later']])
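Beyond fixed Timedeltas and DateOffsets, pandas also ships calendar-aware offsets such as business days. A quick sketch of the difference (2024-01-05 is a Friday):

```python
import pandas as pd

ts = pd.Timestamp('2024-01-05')  # a Friday

# Calendar arithmetic vs business-day arithmetic
plus_two_days = ts + pd.Timedelta(days=2)   # lands on Sunday
plus_two_bdays = ts + pd.offsets.BDay(2)    # skips the weekend
print(plus_two_days, plus_two_bdays)
```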
Practical Examples
Example 1: Sales Trend Analysis
# Generate sales data
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2024-12-31', freq='D')
# Simulate sales with trend and seasonality
trend = np.linspace(5000, 8000, len(dates))
seasonal = 1000 * np.sin(np.arange(len(dates)) * 2 * np.pi / 365)
noise = np.random.normal(0, 500, len(dates))
sales_values = trend + seasonal + noise
sales_df = pd.DataFrame({
    'date': dates,
    'daily_sales': sales_values
}).set_index('date')
# Calculate various metrics
sales_df['7_day_avg'] = sales_df['daily_sales'].rolling(window=7).mean()
sales_df['30_day_avg'] = sales_df['daily_sales'].rolling(window=30).mean()
sales_df['mtd_sales'] = sales_df.groupby(sales_df.index.to_period('M'))['daily_sales'].cumsum()
# Monthly aggregation
# 'ME' = month end ('M' before pandas 2.2)
monthly_sales = sales_df['daily_sales'].resample('ME').agg([
    ('total_sales', 'sum'),
    ('avg_daily_sales', 'mean'),
    ('max_daily_sales', 'max'),
    ('days_over_7000', lambda x: (x > 7000).sum())
])
print("Monthly Sales Summary:")
print(monthly_sales.head())
# Year-over-year comparison
sales_df['year'] = sales_df.index.year
sales_df['day_of_year'] = sales_df.index.dayofyear
# Compare same day across years
comparison = sales_df.pivot_table(
    values='daily_sales',
    index='day_of_year',
    columns='year',
    aggfunc='mean'
)
print("\nYear-over-Year Comparison (first 10 days):")
print(comparison.head(10))
Example 2: Website Traffic Analysis
# Website traffic data
traffic = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=24*30, freq='h'),
    'visitors': np.random.poisson(100, 24*30) + 50 * np.sin(np.arange(24*30) * 2 * np.pi / 24),
    'page_views': np.random.poisson(500, 24*30)
}).set_index('timestamp')
traffic['pages_per_visitor'] = traffic['page_views'] / traffic['visitors']
# Extract time components
traffic['hour'] = traffic.index.hour
traffic['day_of_week'] = traffic.index.day_name()
traffic['is_weekend'] = traffic.index.dayofweek >= 5
# Hourly patterns
hourly_avg = traffic.groupby('hour')['visitors'].mean()
print("Average Visitors by Hour:")
print(hourly_avg)
# Weekend vs Weekday comparison
weekend_comparison = traffic.groupby('is_weekend')[['visitors', 'page_views']].mean()
weekend_comparison.index = ['Weekday', 'Weekend']
print("\nWeekend vs Weekday:")
print(weekend_comparison)
# Peak traffic hours
daily_peak = traffic['visitors'].resample('D').agg(lambda day: day.idxmax())
print("\nPeak Traffic Hour (first 5 days):")
print(daily_peak.head())
Example 3: Sensor Data Analysis
# Simulate sensor data
sensor_data = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=1000, freq='min'),
    'temperature': 70 + np.cumsum(np.random.randn(1000) * 0.1),
    'pressure': 1013 + np.cumsum(np.random.randn(1000) * 0.5),
    'humidity': 60 + 10 * np.sin(np.arange(1000) * 2 * np.pi / 1440)
}).set_index('timestamp')
# Detect anomalies (values beyond 3 standard deviations)
for col in ['temperature', 'pressure', 'humidity']:
    mean = sensor_data[col].mean()
    std = sensor_data[col].std()
    sensor_data[f'{col}_anomaly'] = (
        (sensor_data[col] < mean - 3*std) |
        (sensor_data[col] > mean + 3*std)
    )
anomalies = sensor_data[
    sensor_data['temperature_anomaly'] |
    sensor_data['pressure_anomaly'] |
    sensor_data['humidity_anomaly']
]
print(f"Detected {len(anomalies)} anomalies")
print("\nFirst few anomalies:")
print(anomalies[['temperature', 'pressure', 'humidity']].head())
# Calculate rate of change
sensor_data['temp_rate'] = sensor_data['temperature'].diff() / sensor_data.index.to_series().diff().dt.total_seconds()
print("\nTemperature Rate of Change (°C per second):")
print(sensor_data[['temperature', 'temp_rate']].head(10))
Time Zones
Working with Time Zones
# Create timezone-aware datetime
utc_time = pd.Timestamp('2024-01-15 12:00:00', tz='UTC')
print(f"UTC time: {utc_time}")
# Convert to different timezone
ny_time = utc_time.tz_convert('America/New_York')
tokyo_time = utc_time.tz_convert('Asia/Tokyo')
print(f"New York time: {ny_time}")
print(f"Tokyo time: {tokyo_time}")
# Localize naive datetime
naive_time = pd.Timestamp('2024-01-15 12:00:00')
localized = naive_time.tz_localize('US/Pacific')
print(f"Localized to Pacific: {localized}")
# DataFrame with timezone
tz_df = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=5, freq='h', tz='UTC'),
    'value': range(5)
})
print("\nTimezone-aware DataFrame:")
print(tz_df)
# Convert entire index
tz_df['timestamp_ny'] = tz_df['timestamp'].dt.tz_convert('America/New_York')
print("\nWith New York timezone:")
print(tz_df)
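Localization gets tricky around daylight saving transitions: some wall-clock times do not exist (spring forward) and some occur twice (fall back). tz_localize() exposes the nonexistent and ambiguous parameters for exactly these cases. A sketch using the 2024 US transitions:

```python
import pandas as pd

# In US/Eastern, clocks jump from 02:00 to 03:00 on 2024-03-10,
# so 02:30 does not exist on the wall clock
naive = pd.Timestamp('2024-03-10 02:30')
shifted = naive.tz_localize('US/Eastern', nonexistent='shift_forward')
print(shifted)

# On 2024-11-03 clocks fall back, so 01:30 occurs twice;
# ambiguous=True picks the DST (first) occurrence, False the standard one
dup = pd.Timestamp('2024-11-03 01:30')
first = dup.tz_localize('US/Eastern', ambiguous=True)
second = dup.tz_localize('US/Eastern', ambiguous=False)
print(first, second)
```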
Advanced Time Series Operations
Seasonal Decomposition (Conceptual)
# Decompose time series into trend, seasonal, and residual components
# This would typically use statsmodels.tsa.seasonal.seasonal_decompose
# Manual seasonal pattern identification
sales_df['month'] = sales_df.index.month
monthly_pattern = sales_df.groupby('month')['daily_sales'].mean()
print("Average Sales by Month (Seasonal Pattern):")
print(monthly_pattern)
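The same idea can be pushed one step further into a rough manual decomposition: estimate the trend with a centered rolling mean, subtract it, and average the detrended values by period. A self-contained sketch on a synthetic series (the weekly pattern here is invented):

```python
import numpy as np
import pandas as pd

# Synthetic series: linear trend plus a repeating weekly pattern
idx = pd.date_range('2024-01-01', periods=28 * 4, freq='D')
y = pd.Series(np.linspace(0, 10, len(idx)) + np.tile([5, 3, 1, 0, 0, 2, 4], 16),
              index=idx)

# Trend: centered 7-day rolling mean; detrended = series - trend
trend = y.rolling(window=7, center=True).mean()
detrended = y - trend

# Seasonal component: average detrended value per day of week
seasonal = detrended.groupby(detrended.index.dayofweek).mean()
print(seasonal.round(2))
```

For real work, statsmodels' seasonal_decompose automates this (and handles multiplicative models too).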
Autocorrelation
# Calculate autocorrelation at different lags
def autocorrelation(series, lag):
    return series.corr(series.shift(lag))

print("Autocorrelation Analysis:")
for lag in [1, 7, 30]:
    corr = autocorrelation(stock_data['price'], lag)
    print(f"Lag {lag}: {corr:.4f}")
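In fact, pandas ships this helper built in as Series.autocorr(lag). A quick sketch on a sine wave with period 10, where the lags make the result easy to predict:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.sin(np.arange(100) * 2 * np.pi / 10))  # period = 10

print(f"lag 1:  {s.autocorr(lag=1):.3f}")
print(f"lag 5:  {s.autocorr(lag=5):.3f}")   # half period -> strongly negative
print(f"lag 10: {s.autocorr(lag=10):.3f}")  # full period -> strongly positive
```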
Forward and Backward Fill
# Create data with missing values
incomplete_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'value': [10, np.nan, np.nan, 20, np.nan, 30, np.nan, np.nan, 40, 50]
}).set_index('date')
print("Original Data with NaN:")
print(incomplete_data)
# Forward fill (fillna(method='ffill') is deprecated; use ffill()/bfill())
ffill = incomplete_data.ffill()
print("\nForward Fill:")
print(ffill)
# Backward fill
bfill = incomplete_data.bfill()
print("\nBackward Fill:")
print(bfill)
# Interpolate
interpolated = incomplete_data.interpolate(method='linear')
print("\nLinear Interpolation:")
print(interpolated)
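On an irregularly spaced index, method='linear' can mislead: it interpolates by position, splitting a gap evenly regardless of how much time passed. method='time' weights by actual elapsed time instead. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Irregularly spaced observations: a 1-day gap then a 3-day gap
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
s = pd.Series([10.0, np.nan, 40.0], index=idx)

# method='linear' ignores the index and splits the gap evenly;
# method='time' weights by the actual time elapsed
print(s.interpolate(method='linear'))  # Jan 2 -> 25.0
print(s.interpolate(method='time'))    # Jan 2 -> 17.5
```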
Performance Tips
# For large time series, use appropriate data types.
# (df here stands for any DataFrame with a 'date' column.)
# Convert to datetime64 if not already
df['date'] = pd.to_datetime(df['date'])
# Use categorical for repeated values
df['day_of_week'] = df['date'].dt.day_name().astype('category')
# For filtering a datetime-indexed frame, boolean masks on the index
# are faster than string-based slicing on large datasets
df = df.set_index('date')
mask = (df.index >= '2024-01-01') & (df.index <= '2024-12-31')
filtered = df[mask]
Practice Exercises
Exercise 1: Temperature Analysis
Given hourly temperature data:
- Calculate daily min, max, and mean temperatures
- Find the hottest and coldest weeks
- Identify heat waves (3+ consecutive days above 90°F)
Exercise 2: Stock Portfolio
With multiple stock prices:
- Calculate daily returns for each stock
- Compute 50-day and 200-day moving averages
- Identify golden crosses (50-day crosses above 200-day)
Exercise 3: E-commerce Metrics
Using order timestamp data:
- Calculate orders per hour, day, and month
- Find peak shopping hours and days
- Compare year-over-year growth
Sample Solutions
Exercise 1:
# Temperature analysis solution
daily_temp = hourly_data['temperature'].resample('D').agg(['min', 'max', 'mean'])
# Weekly analysis
weekly_temp = hourly_data['temperature'].resample('W').agg(['min', 'max', 'mean'])
hottest_week = weekly_temp['mean'].idxmax()
coldest_week = weekly_temp['mean'].idxmin()
print(f"Hottest week: {hottest_week}")
print(f"Coldest week: {coldest_week}")
# Heat wave detection: flags the third day of each run of
# 3+ consecutive days whose daily maximum exceeded 90°F
heat_wave = (daily_temp['max'] > 90).rolling(window=3).sum() >= 3
heat_wave_days = daily_temp[heat_wave]
print(f"\nHeat wave days: {len(heat_wave_days)}")
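Exercise 2 can be sketched the same way. This uses a made-up single-stock price series; the golden-cross test compares today's SMA ordering against yesterday's:

```python
import numpy as np
import pandas as pd

# Hypothetical single-stock price series
rng = np.random.default_rng(7)
price = pd.Series(100 + np.cumsum(rng.standard_normal(400)),
                  index=pd.date_range('2023-01-01', periods=400, freq='D'))

sma_50 = price.rolling(50).mean()
sma_200 = price.rolling(200).mean()

# Golden cross: the 50-day SMA crosses from below to above the 200-day SMA
above = sma_50 > sma_200
golden_cross = above & ~above.shift(1, fill_value=False)
print(price.index[golden_cross & sma_200.notna()])
```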
Key Takeaways
- Use pd.Timestamp for individual time points and pd.date_range() for sequences
- Set datetime as index for time-based operations
- Use resample() to change frequency (downsample or upsample)
- Apply rolling windows for moving averages and statistics
- Use shift() for lagging and leading values
- Handle timezones with tz_localize() and tz_convert()
- Extract time components with the .dt accessor
- Use appropriate filling methods for missing time series data
What's Next?
Now that you understand time series in pandas, explore:
- Statistical Operations - Advanced statistical analysis
- Data Visualization - Plot time series data
- GroupBy Operations - Group time series data
Resources
- Pandas Time Series Documentation: pandas.pydata.org/docs/user_guide/timeseries.html
- Time Series Analysis: Advanced techniques with statsmodels
- Forecasting: Introduction to time series forecasting
Time series analysis is essential for understanding temporal patterns. Practice with different frequencies and techniques to master pandas time series capabilities.
