A few Pandas tips for more efficient data processing

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Normalize column names

Lowercase the names, remove parentheses, and replace spaces with underscores.

In [2]:

dates = pd.date_range('20130101', periods=3)
dates

data = np.arange(6).reshape(3, 2)
df = pd.DataFrame(data, index=dates, columns=[
                  "Sales (dolloars)", "COUNT (piecies)"])
df

Out[2]:

	Sales (dolloars)	COUNT (piecies)
2013-01-01	0	1
2013-01-02	2	3
2013-01-03	4	5

In [3]:

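# regex=False makes '(' and ')' literal characters on every pandas version
df.columns = (df.columns.str.strip().str.lower()
              .str.replace(' ', '_', regex=False)
              .str.replace('(', '', regex=False).str.replace(')', '', regex=False))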
df

Out[3]:

	sales_dolloars	count_piecies
2013-01-01	0	1
2013-01-02	2	3
2013-01-03	4	5
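The same cleanup can be wrapped in a small helper so it is easy to reuse across DataFrames. This is a minimal sketch: the function name `normalize_columns` and the demo frame `raw` are hypothetical, and it uses the standard `re` module instead of chained `.str` calls.

```python
import re
import pandas as pd


def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercase, underscore-separated column names."""
    def clean(name) -> str:
        name = str(name).strip().lower()
        name = re.sub(r"[()]", "", name)   # drop parentheses
        return re.sub(r"\s+", "_", name)   # each run of whitespace -> one underscore
    return frame.rename(columns=clean)


# hypothetical demo data, mirroring the column names used above
raw = pd.DataFrame([[0, 1]], columns=["Sales (dolloars)", " COUNT (piecies) "])
cleaned = normalize_columns(raw)
print(list(cleaned.columns))
```

`rename` returns a new DataFrame, so the original column names stay untouched unless the result is assigned back.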

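Filter outliers

Drop every row whose value falls outside a chosen percentile range.

For example, keep only the rows whose count_piecies value lies strictly between the 10th and 90th percentiles of that column.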

In [4]:

# Filter outliers by percentile
mask1 = df['count_piecies'] < np.percentile(df['count_piecies'], 90)
mask2 = df['count_piecies'] > np.percentile(df['count_piecies'], 10)

df.loc[mask1 & mask2]

Out[4]:

	sales_dolloars	count_piecies
2013-01-02	2	3
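The two masks can also be folded into one reusable function built on `Series.quantile`. This is a sketch under assumptions: the helper name `keep_between_quantiles` and its default bounds are made up for illustration, and it keeps the strict inequalities used above.

```python
import pandas as pd


def keep_between_quantiles(frame, column, low=0.10, high=0.90):
    """Keep rows whose `column` value lies strictly between two quantiles."""
    lo, hi = frame[column].quantile([low, high])
    mask = (frame[column] > lo) & (frame[column] < hi)
    return frame.loc[mask]


# hypothetical demo data with the same values as the example above
demo = pd.DataFrame({"count_piecies": [1, 3, 5]})
kept = keep_between_quantiles(demo, "count_piecies")
print(kept)
```

With only three rows the bounds are interpolated (1.4 and 4.6 here), so only the middle value survives. On real data it is worth checking how many rows the filter removes before trusting it.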

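Preview a CSV file before loading it

Before loading a CSV file, check whether the top of the file contains comment lines or anything else that should be skipped.

Preview the file with the following command: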

! head -10 'file_name.csv'

In [5]:

! head -n 5 'data/sample.csv'

# This is a comment line
# This is a comment line
,sales_dolloars,count_piecies
2013-01-01,0,1
2013-01-02,2,3

In [6]:

df = pd.read_csv('data/sample.csv', skiprows=2, names=None,
                 index_col=0, encoding="utf-8")
df

Out[6]:

	sales_dolloars	count_piecies
2013-01-01	0	1
2013-01-02	2	3
2013-01-03	4	5
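When the lines to ignore all start with the same marker, `read_csv` can skip them through its `comment` parameter instead of a hard-coded `skiprows` count. The sketch below is self-contained: it reads from an in-memory string (`csv_text`, a stand-in for the sample file) rather than from disk.

```python
import io
import pandas as pd

# in-memory stand-in for data/sample.csv
csv_text = """# a comment line
# another comment line
,sales_dolloars,count_piecies
2013-01-01,0,1
2013-01-02,2,3
2013-01-03,4,5
"""

# comment='#' ignores everything from '#' to the end of a line,
# so fully commented lines are skipped before the header is read
sample = pd.read_csv(io.StringIO(csv_text), comment="#",
                     index_col=0, parse_dates=True)
print(sample)
```

One caveat: `comment` also truncates data lines that happen to contain the marker character, so `skiprows` remains the safer choice when `#` can appear inside values.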

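Compute the percentage change

pct_change returns the change of each value relative to the previous row as a fraction (2.0 means +200%). The first row has no predecessor, so it is NaN.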

In [7]:

df["perc_change"] = df["count_piecies"].pct_change()
df

Out[7]:

	sales_dolloars	count_piecies	perc_change
2013-01-01	0	1	NaN
2013-01-02	2	3	2.000000
2013-01-03	4	5	0.666667
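To make the definition concrete, the same numbers can be reproduced by hand with `shift`. This is an illustrative sketch with a standalone Series named `counts`, not part of the original notebook.

```python
import numpy as np
import pandas as pd

counts = pd.Series([1, 3, 5], name="count_piecies")

# (current - previous) / previous, written as current / previous - 1
manual = counts / counts.shift(1) - 1
builtin = counts.pct_change()

print(manual.round(6).tolist())
print((builtin * 100).round(2).tolist())  # the same change expressed in percent
```

Multiplying by 100 is only needed for display. Keeping the fraction is usually more convenient for further arithmetic.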

Date range

Fill in missing dates.

In [8]:

dates = pd.date_range('2015-02-14', periods=15, freq='W')
df = pd.DataFrame({'date': dates, 'val': np.random.randn(len(dates))})
df.head()

Out[8]:

	date	val
0	2015-02-15	-0.628081
1	2015-02-22	0.800372
2	2015-03-01	1.216360
3	2015-03-08	-0.479696
4	2015-03-15	-0.210752

In [9]:

idx = pd.date_range(df.date.min(), df.date.max())  # build a finer-grained (daily) index
df = df.set_index('date')
df.head()

Out[9]:

	val
date
2015-02-15	-0.628081
2015-02-22	0.800372
2015-03-01	1.216360
2015-03-08	-0.479696
2015-03-15	-0.210752

In [10]:

df_fill = df.reindex(idx, fill_value=0)
df_fill.head()

Out[10]:

	val
2015-02-15	-0.628081
2015-02-16	0.000000
2015-02-17	0.000000
2015-02-18	0.000000
2015-02-19	0.000000
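When the index is already a DatetimeIndex, `asfreq` offers a shorter route to the same result, and it can also carry the last known value forward instead of inserting zeros. The frame `weekly` below is a small made-up example, not the random data used above.

```python
import pandas as pd

# hypothetical weekly data: three consecutive Sundays
weekly = pd.DataFrame(
    {"val": [1.0, 2.0, 3.0]},
    index=pd.date_range("2015-02-15", periods=3, freq="W"),
)

# upsample to daily frequency, filling the newly created rows with 0
daily_zero = weekly.asfreq("D", fill_value=0)

# alternative: repeat the most recent weekly value on the new days
daily_ffill = weekly.asfreq("D", method="ffill")

# the reindex approach used above, for comparison
full_range = pd.date_range(weekly.index.min(), weekly.index.max())
reindexed = weekly.reindex(full_range, fill_value=0)

print(daily_zero.head(8))
```

Whether zeros or forward-filled values are appropriate depends on what a missing date means in the data: no events that day, or no new measurement.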

In [11]:

plt.figure()
plt.xlabel('Dates')
plt.ylabel('val')
plt.plot(df.index, df.val, label='unfilled')
plt.plot(df_fill.index, df_fill.val, label='filled')
plt.legend()
plt.grid()
plt.show()

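merge details

When using merge, two parameters deserve attention:

suffixes: a (str, str) tuple, default ('_x', '_y'), appended to column names that exist in both DataFrames so they can be told apart
indicator: boolean, default False. If True, a _merge column is added that shows whether each row comes from the left DataFrame only, the right DataFrame only, or both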
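A short example makes both parameters visible. This is a sketch under assumptions: the frames `df1` and `df2`, the suffix names, and the choice of `how="right"` for the second merge are all illustrative.

```python
import pandas as pd

# hypothetical inputs: both frames share the columns "key" and "value"
df1 = pd.DataFrame({"key": ["foo", "foo"], "value": [1, 1], "lval": [1, 2]})
df2 = pd.DataFrame({"key": ["foo", "bar"], "value": [2, 3], "rval": [4, 5]})

# default inner join: only keys present in both frames survive
inner = pd.merge(df1, df2, on="key",
                 suffixes=("_left", "_right"), indicator=True)

# right join (an assumption for this sketch): unmatched right-hand rows are kept
right = pd.merge(df1, df2, on="key", how="right",
                 suffixes=("_left", "_right"), indicator=True)

print(inner)
print(right)
```

The overlapping `value` column is split into `value_left` and `value_right`, and `_merge` marks the `bar` row as `right_only` because it has no match in `df1`. The left-hand columns of that row are NaN, which is why they become floats.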

Store a DataFrame in compressed form

This saves disk space when the DataFrame is large.

In [15]:

df.to_csv('data/dataset.csv')

In [16]:

! head -n 3 data/dataset.csv

In [17]:

df.to_csv('data/dataset.gz', compression='gzip')

Saving to a compressed file does not change how the CSV is read back: Pandas reads compressed files directly.

In [18]:

pd.read_csv('data/dataset.gz')

Out[18]:

	date	val
0	2015-02-15	-0.628081
1	2015-02-22	0.800372
2	2015-03-01	1.216360
3	2015-03-08	-0.479696
4	2015-03-15	-0.210752
5	2015-03-22	-0.951160
6	2015-03-29	0.953215
7	2015-04-05	0.552696
8	2015-04-12	0.684407
9	2015-04-19	0.278450
10	2015-04-26	0.322685
11	2015-05-03	0.499823
12	2015-05-10	-0.094185
13	2015-05-17	0.321307
14	2015-05-24	-0.176026
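To see how much space compression actually saves, the two file sizes can be compared directly. The sketch below writes to a temporary directory so it leaves nothing behind. The frame `frame` and the file names are made up, and the all-zero data is deliberately repetitive, so real savings will usually be smaller.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# hypothetical data: highly repetitive, so it compresses very well
frame = pd.DataFrame({"val": np.zeros(10_000)})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "dataset.csv")
    gz_path = os.path.join(tmp, "dataset.csv.gz")

    frame.to_csv(plain_path)
    frame.to_csv(gz_path, compression="gzip")

    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)

    # compression is inferred from the ".gz" extension when reading
    restored = pd.read_csv(gz_path, index_col=0)

print(plain_size, gz_size)
```

Naming the file with a `.csv.gz` ending keeps both the format and the compression visible in the file name.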

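Example tables for the merge section: the left input, the right input, and two merged results that use suffixes ('_left', '_right') and indicator=True.

	key	value	lval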
0	foo	1	1
1	foo	1	2

	key	value	rval
0	foo	2	4
1	bar	3	5

	key	value_left	lval	value_right	rval	_merge
0	foo	1	1	2	4	both
1	foo	1	2	2	4	both

	key	value_left	lval	value_right	rval	_merge
0	foo	1.0	1.0	2	4	both
1	foo	1.0	2.0	2	4	both
2	bar	NaN	NaN	3	5	right_only
