Pandas in Practice

Published on December 26, 2024 · Filed under Technology · 7 min read

I. Introduction

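Pandas is an extension library for the Python language, built specifically for data analysis. It provides high-performance, easy-to-use data structures and data analysis tools, and it can import data from a variety of sources such as CSV, JSON, SQL databases, and Microsoft Excel.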
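
As a minimal sketch of the import step, the snippet below parses CSV and JSON text held in memory. The sample data and variable names are made up for illustration, and `io.StringIO` stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "name,score\nAlice,90\nBob,85\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical JSON text: a list of records, one per row
json_text = '[{"name": "Alice", "score": 90}, {"name": "Bob", "score": 85}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

With real files, the same calls would take a path instead, for example `pd.read_csv('data.csv')`.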

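Pandas is widely used in academia, finance, statistics, and other data analysis fields. It offers a simple, efficient framework for data processing that makes cleaning, transforming, analyzing, and visualizing data easier.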

II. Study Notes

2. Data Preview and Inspection

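# Basic information
df.head()     # first 5 rows
df.tail()     # last 5 rows
df.info()     # summary of the DataFrame (index, columns, dtypes, memory)
df.describe() # summary statistics for numeric columns
df.shape      # dimensions as (rows, columns)
df.columns    # column names
df.dtypes     # data type of each column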
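
To make the inspection calls concrete, here is a small sketch on a made-up DataFrame (the column names and values are assumptions for illustration). Note that `shape`, `columns`, and `dtypes` are attributes, so they take no parentheses.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
    "city": ["Beijing", "Shanghai", "Shenzhen"],
})

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names as a plain list
print(df.head(2))        # head() accepts the number of rows to show
summary = df.describe()  # by default only numeric columns are summarized
print(summary)
```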

3. Data Cleaning

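# Handling missing values
df.isnull().sum()                 # number of missing values per column
df.fillna(0)                      # fill missing values with 0
df.dropna()                       # drop rows that contain missing values
df.dropna(subset=['column_name']) # drop rows where the given column is missing

# Renaming columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)

# Dropping rows or columns
df.drop(columns=['col1', 'col2'], inplace=True)  # drop columns
df.drop(index=[0, 1], inplace=True)              # drop rows by index label
df.reset_index(drop=True, inplace=True)          # reset the index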
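
One detail worth illustrating: without `inplace=True`, methods such as `fillna` and `dropna` return a new DataFrame and leave the original untouched. The sketch below uses made-up data to show this; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

missing_per_column = df.isnull().sum()

# fillna returns a new DataFrame; a dict fills only the listed columns
filled = df.fillna({"a": 0})

# Drop rows where 'b' is missing, then renumber the index from 0
dropped = df.dropna(subset=["b"]).reset_index(drop=True)

print(missing_per_column)
print(filled)
print(dropped)
```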

4. Data Selection and Filtering

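# Selecting columns
df['column_name']      # single column
df[['col1', 'col2']]   # multiple columns
# Selecting rows
df.iloc[0]                       # by position
df.loc[df['column_name'] > 10]   # by condition
# Conditional filtering
df[(df['col1'] > 10) & (df['col2'] < 50)]  # multiple conditions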
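
The difference between position-based and label-based selection is easier to see on a DataFrame whose index is not 0, 1, 2. This is a minimal sketch with made-up values; each condition needs its own parentheses because `&` binds more tightly than the comparisons.

```python
import pandas as pd

# Hypothetical data with string labels as the index
df = pd.DataFrame(
    {"col1": [5, 15, 25, 35], "col2": [60, 40, 70, 10]},
    index=["a", "b", "c", "d"],
)

first_row = df.iloc[0]   # by position: the first row, whatever its label
row_b = df.loc["b"]      # by label: the row whose index is "b"

# Boolean mask combining two conditions
mask = (df["col1"] > 10) & (df["col2"] < 50)
filtered = df[mask]

print(filtered)
```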

5. Data Transformation

# Creating a new column
df['new_col'] = df['col1'] + df['col2']
# Converting data types
df['col'] = df['col'].astype('int')
# String operations
df['col'].str.upper()                  # convert to uppercase
df['col'].str.contains('keyword')      # test whether each value contains the keyword
# Date operations
df['date_col'] = pd.to_datetime(df['date_col'])  # convert to datetime
df['year']     = df['date_col'].dt.year          # extract the year
df['month']    = df['date_col'].dt.month         # extract the month
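
The transformations above can be sketched end to end on a small made-up DataFrame. The column names `amount` and `name` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw data in which every column starts out as text
df = pd.DataFrame({
    "date_col": ["2024-01-15", "2024-12-26"],
    "amount": ["3", "7"],
    "name": ["apple pie", "banana"],
})

df["amount"] = df["amount"].astype("int")          # text to integer
df["date_col"] = pd.to_datetime(df["date_col"])    # text to datetime
df["year"] = df["date_col"].dt.year
df["month"] = df["date_col"].dt.month
df["has_apple"] = df["name"].str.contains("apple")  # boolean column

print(df)
```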

6. Grouping and Aggregation

# Grouped statistics
df.groupby('col')['value_col'].mean()   # mean of value_col per group
df.groupby(['col1', 'col2']).sum()      # sum, grouped by multiple columns
# Pivot table
df.pivot_table(values='value_col', index='col1', columns='col2', aggfunc='sum')
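
A small worked example may help show what grouping and pivoting produce. The data below is made up; the column names follow the placeholders used above.

```python
import pandas as pd

# Hypothetical data: two categories in col1, two in col2
df = pd.DataFrame({
    "col1": ["A", "A", "B", "B"],
    "col2": ["x", "y", "x", "y"],
    "value_col": [10, 20, 30, 40],
})

# One mean per distinct value of col1
means = df.groupby("col1")["value_col"].mean()

# Rows come from col1, columns from col2, cells hold the summed values
table = df.pivot_table(values="value_col", index="col1", columns="col2", aggfunc="sum")

print(means)
print(table)
```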

7. Visualization

import matplotlib.pyplot as plt
import seaborn as sns
# Simple plotting
df['col'].plot(kind='hist')           # histogram
plt.show()
# Seaborn examples
sns.barplot(x='col1', y='col2', data=df)   # bar plot
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')  # correlation heatmap of numeric columns
plt.show()

8. Advanced Analysis

# Pivot table
pd.pivot_table(df, values='value', index='col1', columns='col2', aggfunc='sum')
# Merging data
pd.merge(df1, df2, on='key')          # merge on a key column
# Reshaping with a multi-level index
df.set_index(['col1', 'col2']).sort_index()
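
By default `pd.merge` performs an inner join, keeping only keys present in both tables. The sketch below contrasts that with an outer join; the two DataFrames and their column names are made up for illustration.

```python
import pandas as pd

# Hypothetical tables that share a 'key' column
df1 = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
df2 = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

inner = pd.merge(df1, df2, on="key")               # keys in both tables
outer = pd.merge(df1, df2, on="key", how="outer")  # keys in either table

print(inner)
print(outer)  # rows without a match are filled with NaN
```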

9. Time Series Analysis

# Set a date index
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Plot the time series
df['value'].plot()
plt.show()
# Rolling mean
df['rolling_mean'] = df['value'].rolling(window=7).mean()  # 7-row moving average
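
A rolling mean leaves the first `window - 1` rows empty, because those rows do not yet have a full window behind them. This sketch uses a made-up five-day series and a window of 3 to keep the output short.

```python
import pandas as pd

# Hypothetical daily series
dates = pd.date_range("2024-01-01", periods=5, freq="D")
df = pd.DataFrame({"value": [1, 2, 3, 4, 5]}, index=dates)

# Each result is the mean of the current row and the two rows before it
df["rolling_mean"] = df["value"].rolling(window=3).mean()

print(df)
```

If partial windows are acceptable, `rolling(window=3, min_periods=1)` fills the leading rows as well.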

10. Data Export and Saving

# Save the data
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)
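
To check what `index=False` does without touching the disk, the sketch below writes to an in-memory buffer and reads the result back. The data and variable names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data to export
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False omits the row-index column
csv_text = buffer.getvalue()

# Read the text back to confirm the round trip
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

Writing `.xlsx` files with `to_excel` additionally requires an Excel engine such as openpyxl to be installed.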

III. Case Study

(1) Read an Excel file and extract the date of birth from ID card numbers

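import pandas as pd

# Read the Excel file into a DataFrame; load the ID column as text so
# that 18-digit numbers are not parsed as integers
df = pd.read_excel('pandas.xlsx', sheet_name='Sheet1', dtype={'身份证号码': str})

# Helper: extract the date of birth (YYYY-MM-DD) from an 18-digit ID number,
# in which characters 7 to 14 encode the birth date as YYYYMMDD
def get_birth(id_number):
    return id_number[6:10] + "-" + id_number[10:12] + "-" + id_number[12:14]

# '身份证号码' is the ID card number column; '出生日期' is the new date-of-birth column
df['出生日期'] = df['身份证号码'].apply(get_birth)

# Optionally save the result to a new Excel file
df.to_excel('verified_data.xlsx', index=False)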
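
As an alternative sketch, the same extraction can be done without a Python-level function by slicing the strings with `.str` and parsing the result as dates. The ID numbers below are made up for illustration and do not belong to real people; `errors="coerce"` is an assumption about how malformed values should be handled (they become `NaT` instead of raising).

```python
import pandas as pd

# Made-up 18-character ID numbers, for illustration only
df = pd.DataFrame({"身份证号码": ["110101199003070000", "31010419851224000X"]})

# Characters 7 to 14 hold the birth date as YYYYMMDD
birth_digits = df["身份证号码"].astype(str).str[6:14]

# Parse into real datetime values; invalid dates become NaT
df["出生日期"] = pd.to_datetime(birth_digits, format="%Y%m%d", errors="coerce")

print(df)
```

Storing the result as a datetime column, not as text, makes later steps such as computing ages or filtering by year straightforward.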