几个 Pandas 小技巧,提升数据处理效率
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
统一 columns names 格式¶
小写,去括号,空格替换为下划线
In [2]:
dates = pd.date_range('20130101', periods=3)
dates
data = np.arange(6).reshape(3, 2)
df = pd.DataFrame(data, index=dates, columns=[
"Sales (dolloars)", "COUNT (piecies)"])
df
Out[2]:
In [3]:
df.columns = df.columns.str.strip().str.lower().str.replace(
' ', '_').str.replace('(', '').str.replace(')', '')
df
Out[3]:
In [4]:
# 按比例过滤 outliers
mask1 = df['count_piecies'] < np.percentile(df['count_piecies'], 90)
mask2 = df['count_piecies'] > np.percentile(df['count_piecies'], 10)
df.loc[mask1 & mask2]
Out[4]:
In [5]:
! head - 5 'data/sample.csv'
In [6]:
df = pd.read_csv('data/sample.csv', skiprows=2, names=None,
index_col=0, encoding="utf-8")
df
Out[6]:
计算变化百分比¶
当前值相对之前的值的变化百分比
In [7]:
df["perc_change"] = df["count_piecies"].pct_change()
df
Out[7]:
Date range¶
补全空缺日期
In [8]:
dates = pd.date_range('2015-02-14', periods=15, freq='W')
df = pd.DataFrame({'date': dates, 'val': np.random.randn(len(dates))})
df.head()
Out[8]:
In [9]:
idx = pd.date_range(df.date.min(), df.date.max()) # 生成更细分的 index
df = df.set_index('date')
df.head()
Out[9]:
In [10]:
df_fill = df.reindex(idx, fill_value=0)
df_fill.head()
Out[10]:
In [11]:
plt.figure()
plt.xlabel('Dates')
plt.ylabel('Number of tweets')
plt.plot(df.index, df.val, label='unfilled')
plt.plot(df_fill.index, df_fill.val, label='filled')
plt.legend()
plt.grid()
plt.show()
merge 细节¶
使用 merge 时,需要注意两个参数:
suffixes
(str, str) tuple, 默认 (‘_x’, ‘_y’),用于区分两个 DataFrame 中均存在的 columnsindicator
布尔值,默认 False。若为 True 则添加一列 _merge,展示该行来自于哪个 DataFrame
相关的函数还有 concat
In [12]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'value': [1, 1], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'value': [2, 3], 'rval': [4, 5]})
display(left)
display(right)
In [13]:
pd.merge(left, right, on='key', how="inner",
suffixes=("_left", "_right"), indicator=True)
Out[13]:
In [14]:
pd.merge(left, right, on='key', how="outer",
suffixes=("_left", "_right"), indicator=True)
Out[14]:
压缩存储 DataFrame¶
DataFrame 较大时可节省磁盘空间
In [15]:
df.to_csv('data/dataset.csv')
In [16]:
! head - 3 dataset.csv
In [17]:
df.to_csv('data/dataset.gz', compression='gzip')
保存为压缩文件不影响 csv 的读取,Pandas 可直接读取压缩文件
In [18]:
pd.read_csv('data/dataset.gz')
Out[18]: