Introduction to Pandas
Pandas is one of the most widely used Python libraries for data analysis and manipulation. It provides flexible data structures and analysis tools that make working with structured data fast and expressive. Whether you're cleaning messy data, analyzing large datasets, or preparing data for machine learning, pandas is an essential tool in your Python toolkit.
What is Pandas?
Pandas (Python Data Analysis Library) is an open-source library built on top of NumPy that provides:
- Data structures: DataFrame and Series for handling structured data
- Data manipulation: Filtering, grouping, merging, and reshaping data
- Data cleaning: Handling missing values, duplicates, and data type conversions
- File I/O: Reading and writing data from various formats (CSV, Excel, JSON, SQL, Stata, etc.)
- Data analysis: Statistical operations, aggregations, and time series analysis
Why Use Pandas?
1. Powerful Data Structures
import pandas as pd
# Series - 1D labeled array
temperatures = pd.Series([72, 75, 68, 80],
                         index=['Mon', 'Tue', 'Wed', 'Thu'])
print(temperatures)
# DataFrame - 2D labeled data structure
weather_data = pd.DataFrame({
    'temperature': [72, 75, 68, 80],
    'humidity': [65, 70, 80, 60],
    'wind_speed': [10, 8, 12, 15]
}, index=['Mon', 'Tue', 'Wed', 'Thu'])
print(weather_data)
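One practical consequence of labeled data is that arithmetic lines up values by label rather than by position. The sketch below uses two small made-up Series (the names `readings` and `corrections` are illustrative only) to show that labels missing from either side produce NaN instead of a silent mismatch.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative values)
readings = pd.Series([72, 75], index=['Mon', 'Tue'])
corrections = pd.Series([1, 2], index=['Tue', 'Wed'])

# Addition aligns on the index labels, not on position
total = readings + corrections
print(total)  # only 'Tue' exists in both, so it is the only non-missing result
```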
2. Easy Data Import/Export
# Read various file formats
df_csv = pd.read_csv('data.csv')
df_excel = pd.read_excel('data.xlsx')
df_stata = pd.read_stata('data.dta') # Stata files
df_json = pd.read_json('data.json')
# Write to different formats (df_csv is the DataFrame loaded above)
df_csv.to_csv('output.csv')
df_csv.to_excel('output.xlsx') # Needs an Excel engine such as openpyxl
df_csv.to_stata('output.dta') # Save as Stata format
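The read and write calls above assume files that already exist on disk. As a self-contained sketch, the round trip below writes a small made-up DataFrame to an in-memory text buffer and reads it back, so it runs without any files; the column names and values are placeholders.

```python
import io

import pandas as pd

df_out = pd.DataFrame({'city': ['Paris', 'Tokyo'], 'temp': [18.5, 22.0]})

# Write CSV text into an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)  # index=False leaves the row labels out of the file

# Rewind the buffer and read the same text back into a new DataFrame
buffer.seek(0)
df_in = pd.read_csv(buffer)
print(df_in)
```

Without `index=False`, the row labels are written as an extra unnamed column, which then shows up as a regular column when the file is read back.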
3. Intuitive Data Manipulation
# Filter data
high_temp = weather_data[weather_data['temperature'] > 70]
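# Add new columns (the temperature readings are assumed to be in Fahrenheit)
weather_data['temp_fahrenheit'] = weather_data['temperature']
weather_data['temp_celsius'] = (weather_data['temperature'] - 32) * 5/9
# Group and aggregate (assumes a DataFrame named sales_data with 'month' and 'sales' columns)
monthly_avg = sales_data.groupby('month')['sales'].mean()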
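The group-and-aggregate line needs a table with one row per sale. A minimal sketch, assuming a hypothetical `sales_log` DataFrame with `month` and `sales` columns (neither the name nor the numbers come from a real dataset):

```python
import pandas as pd

# Hypothetical sales records: one row per sale
sales_log = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Feb'],
    'sales': [100, 200, 150, 250, 200],
})

# Split the rows by month, then average the sales within each group
monthly_avg = sales_log.groupby('month')['sales'].mean()
print(monthly_avg)
```

The result is a Series indexed by the group key, so a single month can be looked up with `monthly_avg['Jan']`.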
Core Pandas Concepts
DataFrame vs Series
Series: A one-dimensional labeled array
# Creating a Series
prices = pd.Series([100, 150, 200, 175],
                   index=['Product A', 'Product B', 'Product C', 'Product D'])
DataFrame: A two-dimensional labeled data structure (like a spreadsheet)
# Creating a DataFrame
sales_data = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [100, 150, 200, 175],
    'quantity': [50, 30, 25, 40]
})
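The two structures are closely related: each column of a DataFrame is itself a Series that shares the DataFrame's index. The sketch below reuses the sales table from this section; the `revenue` name is introduced here only for illustration.

```python
import pandas as pd

sales_data = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [100, 150, 200, 175],
    'quantity': [50, 30, 25, 40],
})

# Selecting one column returns a Series with the same row labels
price = sales_data['price']
print(type(price).__name__)

# Operations between columns are element-wise and return another Series
revenue = sales_data['price'] * sales_data['quantity']
print(revenue)
```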
Index and Columns
# DataFrame structure (using sales_data from above)
print("Columns:", sales_data.columns.tolist())
print("Index:", sales_data.index.tolist())
print("Shape:", sales_data.shape) # (rows, columns)
print("Data types:\n", sales_data.dtypes)
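By default the index is just the row position (0, 1, 2, ...). As a sketch of why the index matters, the example below promotes the `product` column to the index so that rows can be looked up by label; the `by_product` name is illustrative.

```python
import pandas as pd

sales_data = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [100, 150, 200, 175],
    'quantity': [50, 30, 25, 40],
})

# Use the product code as the row label instead of 0..3
by_product = sales_data.set_index('product')

print(by_product.index.tolist())     # row labels are now the product codes
print(by_product.loc['C', 'price'])  # look up one value by row label and column name
print(by_product.shape)              # 'product' is no longer counted as a column
```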
Getting Started with Pandas
Installation
# Install pandas
pip install pandas
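# For all optional dependencies (pandas 2.0 and later; the quotes keep the shell from expanding the brackets)
pip install "pandas[all]"
# Reading Stata files needs no extra package: pd.read_stata is built into pandas
# Optional: pyreadstat is a separate library for SPSS, SAS, and Stata files (pandas uses it for pd.read_spss)
pip install pyreadstat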
Basic DataFrame Operations
import pandas as pd
# Create sample data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'London', 'Tokyo', 'Paris'],
    'salary': [50000, 60000, 75000, 55000]
}
df = pd.DataFrame(data)
# Basic information
print(df.head()) # First 5 rows
df.info() # Data types and non-null counts (prints directly and returns None)
print(df.describe()) # Statistical summary
print(df.shape) # Dimensions (rows, columns)
Working with Stata Data
Pandas has excellent support for Stata files (.dta), making it easy to transition from Stata to Python:
Reading Stata Files
# Basic read
df = pd.read_stata('dataset.dta')
# With additional options
df = pd.read_stata('dataset.dta',
                   convert_dates=True, # Convert Stata dates to pandas datetime
                   convert_categoricals=True, # Convert Stata categories
                   preserve_dtypes=True) # Preserve original data types
# Read specific columns
df = pd.read_stata('dataset.dta', columns=['var1', 'var2', 'var3'])
# Keep the raw numeric codes instead of applying Stata value labels
df = pd.read_stata('dataset.dta', convert_categoricals=False)
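To see what `convert_categoricals` changes without needing an existing .dta file, the sketch below writes a tiny made-up DataFrame to a temporary Stata file and reads it back both ways. The file name `example.dta` and the column names are placeholders, not part of any real dataset.

```python
import os
import tempfile

import pandas as pd

# A categorical column is stored in the .dta file as integer codes plus value labels
survey = pd.DataFrame({
    'region': pd.Categorical(['north', 'south', 'north']),
    'income': [41000.0, 52000.0, 38000.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.dta')
    survey.to_stata(path, write_index=False)

    # Default: value labels are applied and the column comes back as a categorical
    labeled = pd.read_stata(path)
    # convert_categoricals=False: the underlying integer codes are returned instead
    coded = pd.read_stata(path, convert_categoricals=False)

print(labeled['region'].tolist())
print(coded['region'].tolist())
```

Reading the raw codes can be useful when the numeric values themselves carry meaning, or when duplicate value labels prevent conversion to a categorical.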
Stata-like Operations in Pandas
# Stata: describe (variable names and storage types)
df.info()
df.dtypes
# Stata: summarize
df.describe()
df.describe(include='all') # Also covers non-numeric columns
# Stata: list in 1/10
df.head(10)
# Stata: count
len(df)
df.shape[0]
# Stata: tabulate
df['category'].value_counts()
pd.crosstab(df['var1'], df['var2'])
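As a rough illustration of the two tabulate equivalents, the sketch below runs them on a tiny made-up DataFrame that reuses the placeholder column names `category`, `var1`, and `var2` from the snippet above.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['a', 'b', 'a', 'a'],
    'var1': ['x', 'x', 'y', 'y'],
    'var2': ['u', 'v', 'u', 'u'],
})

# One-way table: frequency of each value, most frequent first
counts = df['category'].value_counts()
print(counts)

# Two-way table: rows are var1 values, columns are var2 values, cells are counts
table = pd.crosstab(df['var1'], df['var2'])
print(table)
```

Combinations that never occur (here `var1 == 'y'` with `var2 == 'v'`) appear as 0 in the cross-tabulation.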
Essential DataFrame Methods
Data Inspection
# Quick overview
df.head(10) # First 10 rows
df.tail(5) # Last 5 rows
df.sample(3) # Random 3 rows
df.info() # Data types and memory usage
df.describe() # Statistical summary
df.isnull().sum() # Missing values per column
Data Selection
# Select columns
df['column_name'] # Single column (Series)
df[['col1', 'col2']] # Multiple columns (DataFrame)
# Select rows
df.iloc[0] # First row by position
df.iloc[0:5] # First 5 rows
df.loc[df['age'] > 30] # Rows where age > 30
# Boolean indexing
df[df['salary'] > 55000] # High salary employees
df[(df['age'] > 25) & (df['city'] == 'New York')] # Multiple conditions
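The difference between `iloc` (position) and `loc` (label) is easiest to see when the index is not the default 0, 1, 2, ... sequence. The sketch below gives the sample table an arbitrary index of 10, 20, 30, 40, chosen purely for illustration, and also combines two conditions with `|` (or).

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'London', 'Tokyo', 'Paris'],
}, index=[10, 20, 30, 40])  # non-default labels, for illustration only

first_by_position = df.iloc[0]  # row at position 0
first_by_label = df.loc[10]     # row whose label is 10 (the same row here)

two_rows = df.iloc[0:2]         # positional slice: end position is excluded
three_rows = df.loc[10:30]      # label slice: end label is included

# Each condition needs its own parentheses; use & for "and", | for "or"
either = df[(df['age'] > 30) | (df['city'] == 'Paris')]
print(either['name'].tolist())
```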
Data Manipulation
# Add new columns
df['salary_k'] = df['salary'] / 1000
df['age_group'] = df['age'].apply(lambda x: 'Young' if x < 30 else 'Older')
# Modify existing columns
df['name'] = df['name'].str.upper()
df['salary'] = df['salary'] * 1.1 # 10% raise
# Drop columns/rows
df.drop('column_name', axis=1, inplace=True) # Drop column
df.drop([0, 1], axis=0, inplace=True) # Drop rows by index label
df = df.dropna() # Drop rows with missing values (dropna returns a new DataFrame)
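Most DataFrame methods, `dropna` included, return a new object and leave the original untouched unless the result is assigned back. A minimal sketch with made-up data, which also shows filling a gap as an alternative to dropping the row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [50000.0, np.nan, 75000.0],
})

# dropna returns a new DataFrame; df itself still has all three rows
cleaned = df.dropna()
print(len(df), len(cleaned))

# Alternative: keep the row and fill the gap, here with the column median
filled = df.fillna({'salary': df['salary'].median()})
print(filled)
```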
Learning Path Overview
This pandas series will take you from beginner to advanced user:
1. Fundamentals
2. Data Import/Export
3. Data Manipulation
- Selecting and Filtering Data
- Data Cleaning and Transformation
- Merging and Joining DataFrames
- Reshaping Data (Pivot, Melt)
4. Advanced Operations
Quick Example: Analyzing Survey Data
Here's a taste of what you can do with pandas:
import pandas as pd
import numpy as np
# Load survey data from Stata file
df = pd.read_stata('survey_data.dta')
# Quick data overview
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
# Basic analysis
print("\nAge statistics:")
print(df['age'].describe())
print("\nEducation level distribution:")
print(df['education'].value_counts())
# Create age groups (bins are right-inclusive; assumes respondents are 18 or older)
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 50, np.inf],
                         labels=['18-25', '26-35', '36-50', '51+'])
# Cross-tabulation
print("\nEducation by Age Group:")
print(pd.crosstab(df['age_group'], df['education'], normalize='index'))
# Group analysis
print("\nAverage income by education:")
income_by_edu = df.groupby('education')['income'].agg(['mean', 'median', 'count'])
print(income_by_edu)
# Save results
df.to_csv('processed_survey.csv', index=False)
income_by_edu.to_excel('income_analysis.xlsx') # Needs an Excel engine such as openpyxl
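The example above depends on a `survey_data.dta` file. To try the same steps without that file, the sketch below substitutes a small synthetic table; the ages, education codes, and incomes are invented, and only the column names match the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the survey file (values are made up)
df = pd.DataFrame({
    'age': [22, 31, 45, 67, 29, 52],
    'education': ['BA', 'MA', 'BA', 'HS', 'HS', 'MA'],
    'income': [30000, 52000, 61000, 28000, 26000, 80000],
})

# Bin ages into labeled groups; each bin includes its right edge
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 50, np.inf],
                         labels=['18-25', '26-35', '36-50', '51+'])

# Share of each education level within every age group (each row sums to 1)
shares = pd.crosstab(df['age_group'], df['education'], normalize='index')
print(shares)

# Several aggregates per education level in one call
income_by_edu = df.groupby('education')['income'].agg(['mean', 'median', 'count'])
print(income_by_edu)
```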
Pandas vs Other Tools
Pandas vs Excel
- Pandas: Better for large datasets, reproducible analysis, automation
- Excel: Better for quick visual inspection, business users, simple calculations
Pandas vs R
- Pandas: Part of Python ecosystem, better integration with machine learning
- R: More statistical functions out-of-the-box, better for pure statistics
Pandas vs Stata
- Pandas: Free and open source, with the flexibility of a general-purpose programming language
- Stata: Specialized for econometrics, built-in statistical tests, simpler syntax
Key Advantages of Pandas
- Performance: Built on NumPy, so vectorized operations on whole columns are fast
- Flexibility: Reads and writes many file formats and handles mixed-type, labeled data
- Integration: Works seamlessly with other Python libraries
- Memory control: Compact dtypes such as category and smaller numeric types can reduce memory use
- Expressive: Intuitive syntax for complex operations
- Ecosystem: Part of the rich Python data science ecosystem
Prerequisites
To get the most out of this pandas series, you should have:
- Basic Python knowledge (variables, functions, loops)
- Understanding of data types and data structures
- Familiarity with basic statistical concepts
- Optional: Experience with Excel or other spreadsheet software
If you need to brush up on Python basics, check out our Python fundamentals series.
What's Next?
Ready to start your pandas journey? Begin with the Fundamentals topic in the learning path above, or jump straight to a specific topic such as data import/export or data manipulation.
Resources
- Official Documentation: pandas.pydata.org
- 10 Minutes to Pandas: pandas.pydata.org/docs/user_guide/10min.html
- Pandas Cookbook: Real-world examples and recipes
- Python for Data Analysis: Book by Wes McKinney (pandas creator)
Pandas transforms the way you work with data in Python. Master these fundamentals, and you'll be well prepared for most everyday data analysis tasks.
