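All data is valuable. Most of the time we would rather fill in missing values than discard the original data.
fillna is the method for filling in missing values. In the simplest case you pass it a single constant. (The sessions below use the name NA without showing where it comes from; it is presumably an alias for np.nan, e.g. from numpy import nan as NA.)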
In [31]: df = pd.DataFrame(np.random.randn(7,3))

In [32]: df
Out[32]:
          0         1         2
0 -0.229682 -0.483246 -0.063835
1  0.716649  1.593639 -1.364550
2 -1.362614  1.628310 -1.617992
3  1.128828 -1.120265 -0.657313
4  1.078143  1.136835 -0.427125
5  0.441696  0.219477  0.695700
6 -0.501183  1.453678 -2.734985

In [33]: df.iloc[:4, 1] = NA

In [35]: df.iloc[:2, 2] = NA

In [36]: df
Out[36]:
          0         1         2
0 -0.229682       NaN       NaN
1  0.716649       NaN       NaN
2 -1.362614       NaN -1.617992
3  1.128828       NaN -0.657313
4  1.078143  1.136835 -0.427125
5  0.441696  0.219477  0.695700
6 -0.501183  1.453678 -2.734985

In [37]: df.fillna(0)
Out[37]:
          0         1         2
0 -0.229682  0.000000  0.000000
1  0.716649  0.000000  0.000000
2 -1.362614  0.000000 -1.617992
3  1.128828  0.000000 -0.657313
4  1.078143  1.136835 -0.427125
5  0.441696  0.219477  0.695700
6 -0.501183  1.453678 -2.734985
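You can also pass a dict keyed by column label to use a different fill value for each column. Columns that are not in the dict are left unchanged.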
In [38]: df.fillna({1:1, 2:2})
Out[38]:
          0         1         2
0 -0.229682  1.000000  2.000000
1  0.716649  1.000000  2.000000
2 -1.362614  1.000000 -1.617992
3  1.128828  1.000000 -0.657313
4  1.078143  1.136835 -0.427125
5  0.441696  0.219477  0.695700
6 -0.501183  1.453678 -2.734985
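By default, fillna likewise returns a new object and does not modify the data in place. If you do want to modify the existing object, pass inplace=True: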
In [39]: _ = df.fillna(0, inplace=True)

In [40]: df
Out[40]:
          0         1         2
0 -0.229682  0.000000  0.000000
1  0.716649  0.000000  0.000000
2 -1.362614  0.000000 -1.617992
3  1.128828  0.000000 -0.657313
4  1.078143  1.136835 -0.427125
5  0.441696  0.219477  0.695700
6 -0.501183  1.453678 -2.734985
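To make the difference between the default behaviour and inplace=True concrete, here is a minimal sketch on a small hand-made frame. The column names and variable names (raw, filled, modified) are made up for illustration and do not come from the session above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with two missing values (hypothetical data).
raw = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

# Default: fillna returns a new object, and raw keeps its NaN values.
filled = raw.fillna(0)
nan_left_in_raw = int(raw.isna().sum().sum())

# inplace=True: the object the method is called on is modified.
modified = raw.copy()
modified.fillna(0, inplace=True)
nan_left_in_modified = int(modified.isna().sum().sum())

print(nan_left_in_raw, nan_left_in_modified)
```

Reassigning the result, as in df = df.fillna(0), has the same visible effect as inplace=True and is often easier to read in a chain of operations.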
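Missing values can also be filled by propagating neighbouring values: ffill (forward fill) copies the previous valid value, and bfill (backward fill) copies the next one. Note that the method argument of fillna used below is deprecated as of pandas 2.1 in favour of the dedicated ffill and bfill methods: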
In [41]: df = pd.DataFrame(np.random.randn(6,3))

In [42]: df.iloc[2:, 1]=NA

In [43]: df.iloc[4:, 2]=NA

In [44]: df
Out[44]:
          0         1         2
0 -0.858762  0.083342 -0.315598
1 -0.211846  0.076648  1.188298
2 -0.513364       NaN  0.079216
3  0.398399       NaN -0.290225
4 -1.375898       NaN       NaN
5  0.932812       NaN       NaN

In [45]: df.fillna(method='ffill')  # fill with the previous valid value
Out[45]:
          0         1         2
0 -0.858762  0.083342 -0.315598
1 -0.211846  0.076648  1.188298
2 -0.513364  0.076648  0.079216
3  0.398399  0.076648 -0.290225
4 -1.375898  0.076648 -0.290225
5  0.932812  0.076648 -0.290225

In [46]: df.fillna(method='ffill',limit=2)  # fill at most 2 consecutive NaN values
Out[46]:
          0         1         2
0 -0.858762  0.083342 -0.315598
1 -0.211846  0.076648  1.188298
2 -0.513364  0.076648  0.079216
3  0.398399  0.076648 -0.290225
4 -1.375898       NaN -0.290225
5  0.932812       NaN -0.290225

In [47]: df.fillna(method='bfill')  # backward fill has no effect here: no valid value follows the trailing NaN values
Out[47]:
          0         1         2
0 -0.858762  0.083342 -0.315598
1 -0.211846  0.076648  1.188298
2 -0.513364       NaN  0.079216
3  0.398399       NaN -0.290225
4 -1.375898       NaN       NaN
5  0.932812       NaN       NaN
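On newer pandas versions the same forward and backward fills can be written with the DataFrame.ffill and DataFrame.bfill methods. The sketch below uses a small hand-made frame (the column names x and y are illustrative, not from the session above) so the effect of limit, and of a leading versus trailing gap, is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical data: x has a trailing gap, y has a leading and a trailing NaN.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, np.nan, np.nan],
    "y": [np.nan, 1.5, np.nan, 4.0, np.nan],
})

# Forward fill: carry the last valid value downward.
forward = df.ffill()

# Forward fill, but fill at most 2 consecutive NaN values per gap.
limited = df.ffill(limit=2)

# Backward fill: pull the next valid value upward.
# Trailing NaN values stay NaN because nothing valid follows them.
backward = df.bfill()

print(forward, limited, backward, sep="\n\n")
```

A leading NaN (row 0 of y) survives a forward fill and a trailing NaN survives a backward fill, which is the same reason In [47] above leaves the frame unchanged.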
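There are many more useful tricks with fillna, and it pays to collect and try them out as you go. For example, you can fill missing values with the mean (mean skips NaN by default, so here it is computed from the three valid values):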
In [48]: s = pd.Series([1, NA, 3.5, NA, 7])

In [49]: s.fillna(s.mean())
Out[49]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
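The same idea extends to a whole DataFrame. As a minimal sketch, passing the Series returned by df.mean() (or df.median()) to fillna fills each column with its own statistic, because the fill values are matched by column label. The frame and its column names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 8.0],
    "b": [np.nan, 4.0, 8.0, 9.0],
})

# df.mean() is a Series indexed by column label; NaN values are skipped.
by_mean = df.fillna(df.mean())

# The median is a common alternative when a column contains outliers.
by_median = df.fillna(df.median())

print(by_mean)
print(by_median)
```

Which statistic is appropriate depends on the data. Either way the filled values are estimates, so it can be worth recording which cells were originally missing (for example with df.isna()) before filling.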