Introduction to Pandas

Pandas is one of the most widely used Python libraries for data analysis and manipulation. It provides flexible data structures and analysis tools that make working with structured data fast and expressive. Whether you're cleaning messy data, analyzing large datasets, or preparing data for machine learning, pandas is an essential tool in your Python toolkit.

What is Pandas?

Pandas (Python Data Analysis Library) is an open-source library built on top of NumPy that provides:

Data structures: DataFrame and Series for handling structured data
Data manipulation: Filtering, grouping, merging, and reshaping data
Data cleaning: Handling missing values, duplicates, and data type conversions
File I/O: Reading and writing data from various formats (CSV, Excel, JSON, SQL, Stata, etc.)
Data analysis: Statistical operations, aggregations, and time series analysis
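
As a rough illustration of the data cleaning item in the list above, the following sketch removes a duplicate row, converts a text column to numbers, and fills a missing value. The column names and records are made up for the example.

```python
import pandas as pd

# Hypothetical raw records: one duplicate row, one missing score,
# and numbers stored as text
raw = pd.DataFrame({
    "name": ["Ann", "Ben", "Ben", "Cara"],
    "score": ["88", "92", "92", None],
})

# drop_duplicates removes the repeated "Ben" row;
# pd.to_numeric converts the text to floats and the missing entry to NaN
cleaned = raw.drop_duplicates().assign(
    score=lambda d: pd.to_numeric(d["score"])
)

# Fill the remaining missing score with the column mean
cleaned["score"] = cleaned["score"].fillna(cleaned["score"].mean())
print(cleaned)
```

Filling with the mean is only one possible choice here; dropping the row or using the median may suit other datasets better.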

Why Use Pandas?

1. Powerful Data Structures

import pandas as pd

# Series - 1D labeled array
temperatures = pd.Series([72, 75, 68, 80], 
                        index=['Mon', 'Tue', 'Wed', 'Thu'])
print(temperatures)

# DataFrame - 2D labeled data structure
weather_data = pd.DataFrame({
    'temperature': [72, 75, 68, 80],
    'humidity': [65, 70, 80, 60],
    'wind_speed': [10, 8, 12, 15]
}, index=['Mon', 'Tue', 'Wed', 'Thu'])
print(weather_data)

2. Easy Data Import/Export

# Read various file formats
df_csv = pd.read_csv('data.csv')
df_excel = pd.read_excel('data.xlsx')
df_stata = pd.read_stata('data.dta')  # Stata files
df_json = pd.read_json('data.json')

# Write to different formats (df_csv is the DataFrame loaded above)
df_csv.to_csv('output.csv')
df_csv.to_excel('output.xlsx')  # Requires an Excel engine such as openpyxl
df_csv.to_stata('output.dta')  # Save as Stata format
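
The snippets above assume that files such as data.csv already exist. A self-contained way to try the same read/write round trip is to use an in-memory buffer, as in this sketch (the city and temp_c columns are invented for illustration):

```python
import io

import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp_c": [4.5, 19.0]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False leaves the row labels out
buffer.seek(0)
restored = pd.read_csv(buffer)

# Without index=False the row labels are written as an unnamed first column,
# which read_csv then loads as "Unnamed: 0"
buffer_with_index = io.StringIO()
df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(restored.columns.tolist())
print(with_index.columns.tolist())
```

Passing index=False when writing, or index_col=0 when reading, avoids the extra column.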

3. Intuitive Data Manipulation

# Filter data
high_temp = weather_data[weather_data['temperature'] > 70]

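# Add new columns (the temperature readings are in degrees Fahrenheit)
weather_data['temp_fahrenheit'] = weather_data['temperature']
weather_data['temp_celsius'] = (weather_data['temperature'] - 32) * 5/9

# Group and aggregate (assumes a sales_data DataFrame with 'month' and 'sales' columns)
monthly_avg = sales_data.groupby('month')['sales'].mean()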
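
The group-and-aggregate line needs a table with month and sales columns. A minimal runnable sketch, using made-up figures, could look like this:

```python
import pandas as pd

# Hypothetical sales records; the values are invented for illustration
sales_data = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "sales": [100, 200, 150, 250, 80],
})

# Split the rows by month, then average the sales within each group
monthly_avg = sales_data.groupby("month")["sales"].mean()
print(monthly_avg)
```

The result is a Series indexed by the group keys. By default the keys are sorted, so "Feb" comes before "Jan" here; a real dataset would usually use a date or ordered category to keep calendar order.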

Core Pandas Concepts

DataFrame vs Series

Series: A one-dimensional labeled array

# Creating a Series
prices = pd.Series([100, 150, 200, 175], 
                  index=['Product A', 'Product B', 'Product C', 'Product D'])

DataFrame: A two-dimensional labeled data structure (like a spreadsheet)

# Creating a DataFrame
sales_data = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': [100, 150, 200, 175],
    'quantity': [50, 30, 25, 40]
})
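
The two structures are closely related: each column of a DataFrame is a Series. The sketch below reuses the sales_data table from above to show the difference between selecting a Series and selecting a one-column DataFrame.

```python
import pandas as pd

sales_data = pd.DataFrame({
    "product": ["A", "B", "C", "D"],
    "price": [100, 150, 200, 175],
    "quantity": [50, 30, 25, 40],
})

price = sales_data["price"]        # single brackets return a Series
price_df = sales_data[["price"]]   # double brackets return a one-column DataFrame

# Series align on their shared index, so column arithmetic is element-wise
revenue = sales_data["price"] * sales_data["quantity"]

print(type(price).__name__, type(price_df).__name__)
print(revenue.tolist())
```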

Index and Columns

# DataFrame structure (using the sales_data DataFrame created above)
print("Columns:", sales_data.columns.tolist())
print("Index:", sales_data.index.tolist())
print("Shape:", sales_data.shape)  # (rows, columns)
print("Data types:", sales_data.dtypes)
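
By default the index is a simple 0, 1, 2, ... range. As a small sketch of how the index and columns interact, the example below promotes the product column to the index so that rows can be looked up by label.

```python
import pandas as pd

sales_data = pd.DataFrame({
    "product": ["A", "B", "C", "D"],
    "price": [100, 150, 200, 175],
    "quantity": [50, 30, 25, 40],
})

# set_index returns a new DataFrame whose row labels are the product codes
by_product = sales_data.set_index("product")

print(by_product.index.tolist())     # row labels
print(by_product.columns.tolist())   # remaining columns
print(by_product.loc["C", "price"])  # look up one cell by row and column label
```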

Getting Started with Pandas

Installation

# Install pandas
pip install pandas

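# With all optional dependencies (the "all" extra is available in pandas 2.0 and later;
# the quotes stop shells such as zsh from interpreting the brackets)
pip install "pandas[all]"

# Stata (.dta) files need no extra package: pd.read_stata is built into pandas.
# pyreadstat is an optional third-party reader and is not required by pd.read_stata.
pip install pyreadstat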

Basic DataFrame Operations

import pandas as pd

# Create sample data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'London', 'Tokyo', 'Paris'],
    'salary': [50000, 60000, 75000, 55000]
}

df = pd.DataFrame(data)

# Basic information
print(df.head())        # First 5 rows
df.info()               # Data types and non-null counts (prints directly, returns None)
print(df.describe())    # Statistical summary
print(df.shape)         # Dimensions (rows, columns)

Working with Stata Data

Pandas has built-in support for Stata files (.dta) through pd.read_stata and DataFrame.to_stata, which makes it easier to transition from Stata to Python:

Reading Stata Files

# Basic read
df = pd.read_stata('dataset.dta')

# With additional options (these are the default values, written out explicitly)
df = pd.read_stata('dataset.dta', 
                   convert_dates=True,           # Convert Stata dates to pandas datetime
                   convert_categoricals=True,    # Convert Stata categories
                   preserve_dtypes=True)         # Preserve original data types

# Read specific columns
df = pd.read_stata('dataset.dta', columns=['var1', 'var2', 'var3'])

# Keep the raw numeric codes instead of converting value labels to categoricals
df = pd.read_stata('dataset.dta', convert_categoricals=False)
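
Since dataset.dta is only a placeholder name, here is a self-contained sketch that writes a small .dta file to a temporary directory and reads it back with different options. The income and level columns are invented for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: a numeric column plus a labelled (categorical) column
df = pd.DataFrame({
    "income": [52000.0, 61000.5, 48000.0],
    "level": pd.Categorical(["low", "high", "low"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta")   # throwaway file name
    df.to_stata(path, write_index=False)      # categoricals become Stata value labels

    labelled = pd.read_stata(path)                         # value labels applied
    raw = pd.read_stata(path, convert_categoricals=False)  # underlying codes only
    subset = pd.read_stata(path, columns=["income"])       # a single column

print(labelled["level"].dtype)   # a categorical dtype
print(raw["level"].dtype)        # an integer dtype
print(subset.columns.tolist())
```

Reading with convert_categoricals=False can be useful when the numeric codes themselves matter, for example to match a codebook.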

Stata-like Operations in Pandas

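# Stata: describe (variable names and storage types)
df.info()
df.dtypes

# Stata: summarize (counts, means, standard deviations, min/max)
df.describe()
df.describe(include='all')   # Also covers non-numeric columns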

# Stata: list in 1/10
df.head(10)

# Stata: count
len(df)
df.shape[0]

# Stata: tabulate
df['category'].value_counts()
pd.crosstab(df['var1'], df['var2'])
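
To make the tabulate equivalents concrete, this sketch builds a tiny table with hypothetical region and employed columns and produces one-way and two-way frequency tables.

```python
import pandas as pd

# Hypothetical survey responses
df = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "employed": ["yes", "no", "yes", "yes", "no"],
})

# One-way table, similar in spirit to Stata's: tabulate region
counts = df["region"].value_counts()
shares = df["region"].value_counts(normalize=True)   # proportions instead of counts

# Two-way table with row and column totals, similar to: tabulate region employed
table = pd.crosstab(df["region"], df["employed"], margins=True)

print(counts)
print(shares)
print(table)
```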

Essential DataFrame Methods

Data Inspection

# Quick overview
df.head(10)          # First 10 rows
df.tail(5)           # Last 5 rows
df.sample(3)         # Random 3 rows
df.info()            # Data types and memory usage
df.describe()        # Statistical summary
df.isnull().sum()    # Missing values per column

Data Selection

# Select columns
df['column_name']              # Single column (Series)
df[['col1', 'col2']]          # Multiple columns (DataFrame)

# Select rows
df.iloc[0]                     # First row by position
df.iloc[0:5]                   # First 5 rows
df.loc[df['age'] > 30]         # Rows where age > 30

# Boolean indexing
df[df['salary'] > 55000]       # High salary employees
df[(df['age'] > 25) & (df['city'] == 'New York')]  # Multiple conditions
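
The difference between position-based and label-based selection is easiest to see when the index is not the default 0, 1, 2, ... range. The following sketch uses names as row labels (an assumption made for this example; the DataFrame above uses the default index).

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 30, 35, 28], "salary": [50000, 60000, 75000, 55000]},
    index=["alice", "bob", "charlie", "diana"],
)

first_by_position = df.iloc[0]    # position 0, whatever its label
bob_by_label = df.loc["bob"]      # the row labelled "bob"

# Label slices include the end label; position slices exclude the end position
label_slice = df.loc["alice":"charlie"]
position_slice = df.iloc[0:2]

# Combine conditions with & (and) or | (or); each condition needs parentheses
either = df[(df["age"] > 30) | (df["salary"] < 52000)]

print(len(label_slice), len(position_slice))
print(either.index.tolist())
```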

Data Manipulation

# Add new columns
df['salary_k'] = df['salary'] / 1000
df['age_group'] = df['age'].apply(lambda x: 'Young' if x < 30 else 'Older')

# Modify existing columns
df['name'] = df['name'].str.upper()
df['salary'] = df['salary'] * 1.1  # 10% raise

# Drop columns/rows
df.drop('column_name', axis=1, inplace=True)  # Drop column
df.drop([0, 1], axis=0, inplace=True)         # Drop rows by index label
df = df.dropna()                              # Drop rows with missing values (returns a new DataFrame)
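
Most DataFrame methods return a new object rather than changing the original, which is a common source of confusion. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "salary": [50000.0, np.nan, 75000.0],
})

# dropna returns a new DataFrame; df itself is left untouched
dropped = df.dropna()
rows_in_original = len(df)

# To keep the result, assign it back (and optionally renumber the rows)
df = df.dropna().reset_index(drop=True)

print(rows_in_original, len(dropped), len(df))
print(df.index.tolist())
```

Assigning the result back is generally easier to follow than inplace=True, and it allows methods to be chained.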

Learning Path Overview

This pandas series will take you from beginner to advanced user:

1. Fundamentals

2. Data Import/Export

3. Data Manipulation

4. Advanced Operations

GroupBy Operations and Aggregation
Time Series Analysis
Statistical Operations

Quick Example: Analyzing Survey Data

Here's a taste of what you can do with pandas:

import pandas as pd
import numpy as np

# Load survey data from Stata file
df = pd.read_stata('survey_data.dta')

# Quick data overview
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())

print("\nData types:")
print(df.dtypes)

print("\nMissing values:")
print(df.isnull().sum())

# Basic analysis
print("\nAge statistics:")
print(df['age'].describe())

print("\nEducation level distribution:")
print(df['education'].value_counts())

# Create age groups (bins are right-inclusive, so age 50 falls in '36-50')
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 50, 100],
                        labels=['18-25', '26-35', '36-50', '51+'])

# Cross-tabulation
print("\nEducation by Age Group:")
print(pd.crosstab(df['age_group'], df['education'], normalize='index'))

# Group analysis
print("\nAverage income by education:")
income_by_edu = df.groupby('education')['income'].agg(['mean', 'median', 'count'])
print(income_by_edu)

# Save results
df.to_csv('processed_survey.csv', index=False)
income_by_edu.to_excel('income_analysis.xlsx')  # Requires an Excel engine such as openpyxl
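
The example above depends on a survey_data.dta file. To try the same steps without one, the sketch below substitutes a small synthetic table; the ages, education codes, and incomes are invented and only serve to show how the binning and grouping behave.

```python
import pandas as pd

# Synthetic stand-in for the survey file (all values are made up)
survey = pd.DataFrame({
    "age": [22, 25, 31, 47, 50, 63],
    "education": ["BA", "HS", "BA", "MA", "HS", "MA"],
    "income": [30000, 28000, 52000, 80000, 41000, 90000],
})

# pd.cut bins are right-inclusive by default: (0, 25], (25, 35], (35, 50], (50, 100]
survey["age_group"] = pd.cut(
    survey["age"],
    bins=[0, 25, 35, 50, 100],
    labels=["18-25", "26-35", "36-50", "51+"],
)

# Row-normalized cross-tabulation: each row sums to 1
edu_by_age = pd.crosstab(survey["age_group"], survey["education"], normalize="index")

# Several aggregates per education level in one call
income_by_edu = survey.groupby("education")["income"].agg(["mean", "median", "count"])

print(survey[["age", "age_group"]])
print(edu_by_age)
print(income_by_edu)
```

Note that ages 25 and 50 land in the lower of the two neighbouring bins because the right edge of each bin is included.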

Pandas vs Other Tools

Pandas vs Excel

Pandas: Better for large datasets, reproducible analysis, automation
Excel: Better for quick visual inspection, business users, simple calculations

Pandas vs R

Pandas: Part of Python ecosystem, better integration with machine learning
R: More statistical functions out-of-the-box, better for pure statistics

Pandas vs Stata

Pandas: Free and open source, backed by a general-purpose programming language, flexible data structures
Stata: Specialized for econometrics, built-in statistical tests, simpler syntax

Key Advantages of Pandas

Performance: Built on NumPy for fast operations on large datasets
Flexibility: Handles many common data formats and structures
Integration: Works seamlessly with other Python libraries
Memory Efficient: Optimized data storage and operations
Expressive: Intuitive syntax for complex operations
Ecosystem: Part of the rich Python data science ecosystem
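
The performance point above mostly applies to vectorized operations, where the work happens on whole columns at once rather than in a Python-level loop. As a hedged sketch, the two approaches below produce the same labels; the vectorized version is typically faster on large columns, although the difference is negligible on a table this small.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 35, 28]})

# Row-by-row: calls a Python function once per value
via_apply = df["age"].apply(lambda x: "Young" if x < 30 else "Older")

# Vectorized: evaluates the condition on the whole column at once
via_where = pd.Series(np.where(df["age"] < 30, "Young", "Older"), index=df.index)

print(via_apply.tolist())
print(via_where.tolist())
```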

Prerequisites

To get the most out of this pandas series, you should have:

Basic Python knowledge (variables, functions, loops)
Understanding of data types and data structures
Familiarity with basic statistical concepts
Optional: Experience with Excel or other spreadsheet software

If you need to brush up on Python basics, check out our Python fundamentals series.

Resources

Official Documentation: pandas.pydata.org
10 Minutes to Pandas: pandas.pydata.org/docs/user_guide/10min.html
Pandas Cookbook: Real-world examples and recipes
Python for Data Analysis: Book by Wes McKinney (pandas creator)

Pandas transforms the way you work with data in Python. Master these fundamentals, and you'll be able to tackle any data analysis challenge with confidence.