The concat method glues or stacks objects together along an axis.
In [55]: s1 = pd.Series([0, 1], index=['a', 'b'])

In [56]: s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

In [57]: s3 = pd.Series([5, 6], index=['f', 'g'])

In [58]: pd.concat([s1, s2, s3])  # the objects must be passed as a list
Out[58]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

In [59]: pd.concat([s1, s2, s3], axis=1)  # stack horizontally, but a warning is emitted
C:\ProgramData\Anaconda3\Scripts\ipython:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default.
......
Out[59]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0

In [60]: pd.concat([s1, s2, s3], axis=1, sort=True)  # pass sort explicitly, as the warning asks
Out[60]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
For DataFrame objects, concat stacks rows downward by default, but you can also set the axis parameter:
In [66]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
    ...:                    columns=['one', 'two'])
    ...:

In [67]: df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
    ...:                    columns=['three', 'four'])
    ...:

In [68]: df1
Out[68]:
   one  two
a    0    1
b    2    3
c    4    5

In [69]: df2
Out[69]:
   three  four
a      5     6
c      7     8

In [71]: pd.concat([df1, df2], sort=True)
Out[71]:
   four  one  three  two
a   NaN  0.0    NaN  1.0
b   NaN  2.0    NaN  3.0
c   NaN  4.0    NaN  5.0
a   6.0  NaN    5.0  NaN
c   8.0  NaN    7.0  NaN

In [72]: pd.concat([df1, df2], axis=1, sort=True)
Out[72]:
   one  two  three  four
a    0    1    5.0   6.0
b    2    3    NaN   NaN
c    4    5    7.0   8.0
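The two stacking directions can be condensed into a short standalone script. This is only a sketch that reuses the data from the sessions above; the variable names `stacked`, `side_by_side`, `rows`, and `cols` are made up for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"])

# axis=0 (the default): the Series are appended end to end, labels are kept.
stacked = pd.concat([s1, s2])

# axis=1: each Series becomes a column; the rows are the union of all labels,
# and positions with no data are filled with NaN.
side_by_side = pd.concat([s1, s2], axis=1, sort=True)

df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=["a", "b", "c"], columns=["one", "two"])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=["a", "c"], columns=["three", "four"])

# Default axis=0: rows of df2 go below df1, columns are the union of both.
rows = pd.concat([df1, df2], sort=True)

# axis=1: columns of df2 go to the right of df1, rows are aligned by label.
cols = pd.concat([df1, df2], axis=1, sort=True)

print(stacked.shape, side_by_side.shape, rows.shape, cols.shape)
```

In both directions the non-concatenation axis is the union of the labels of the inputs, which is why NaN appears wherever one input has no matching label.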
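Consider this scenario: the missing values in one object should be filled with the values at the corresponding positions in another object. At the NumPy level, you can do it like this: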
In [74]: a = pd.Series([np.nan, 2.5, 0, 3.5, 4.5, np.nan],
    ...:               index=['f', 'e', 'd', 'c', 'b', 'a'])

In [75]: b = pd.Series([0, np.nan, 2.1, np.nan, np.nan, 5], index=list('abcdef'))

In [76]: a
Out[76]:
f    NaN
e    2.5
d    0.0
c    3.5
b    4.5
a    NaN
dtype: float64

In [77]: b
Out[77]:
a    0.0
b    NaN
c    2.1
d    NaN
e    NaN
f    5.0
dtype: float64

In [78]: np.where(pd.isnull(a), b, a)
Out[78]: array([0. , 2.5, 0. , 3.5, 4.5, 5. ])
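In the expression np.where(pd.isnull(a), b, a), each element of pd.isnull(a) is tested first: where it is True the value is taken from b, otherwise it is taken from a, which gives the final result. Note that the elements are matched by position, not by index label, and the result is a plain NumPy array.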
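Because a and b carry their labels in opposite orders, it may help to see the positional behaviour next to a label-based one. The sketch below is illustrative only; `positional` and `label_based` are made-up names, and reordering b with `reindex` is one possible way to align the labels before calling np.where, not something the text above prescribes.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0, 3.5, 4.5, np.nan], index=list("fedcba"))
b = pd.Series([0, np.nan, 2.1, np.nan, np.nan, 5], index=list("abcdef"))

# np.where pairs elements by position: a's first element (label 'f')
# is patched with b's first element (label 'a').
positional = np.where(pd.isnull(a), b, a)

# Reordering b to follow a's labels first makes the fill label-based.
label_based = np.where(pd.isnull(a), b.reindex(a.index), a)

print(positional)
print(label_based)
```

The two arrays differ in their first and last entries, which are exactly the positions where a is missing and the labels of a and b do not line up.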
In fact, pandas provides a dedicated combine_first method for this scenario:
In [80]: b.combine_first(a)
Out[80]:
a    0.0
b    4.5
c    2.1
d    0.0
e    2.5
f    5.0
dtype: float64
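For DataFrame objects, combine_first does the same thing column by column, so you can think of it as "patching" the missing values of the calling object with data from the object you pass in.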
In [81]: df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
    ...:                     'b': [np.nan, 2., np.nan, 6.],
    ...:                     'c': range(2, 18, 4)})
    ...:

In [82]: df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
    ...:                     'b': [np.nan, 3., 4., 6., 8.]})
    ...:

In [83]: df1
Out[83]:
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14

In [84]: df2
Out[84]:
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0

In [85]: df1.combine_first(df2)
Out[85]:
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
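np.where requires the two objects to have shapes that are compatible under NumPy's broadcasting rules; combine_first has no such requirement, because it aligns the two objects on their row and column labels.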
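The difference can be checked with the two DataFrames from the session above, whose shapes are (4, 3) and (5, 2). This is a minimal sketch; `patched` and `where_failed` are made-up names, and the assumption that the shape mismatch surfaces as a ValueError reflects how NumPy normally reports broadcasting failures.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, np.nan, 5.0, np.nan],
                    "b": [np.nan, 2.0, np.nan, 6.0],
                    "c": range(2, 18, 4)})
df2 = pd.DataFrame({"a": [5.0, 4.0, np.nan, 3.0, 7.0],
                    "b": [np.nan, 3.0, 4.0, 6.0, 8.0]})

# combine_first aligns on labels, so the differing shapes are not a problem:
# the result covers the union of both indexes and both sets of columns.
patched = df1.combine_first(df2)

# np.where works on the raw arrays; (4, 3) and (5, 2) cannot be broadcast.
try:
    np.where(pd.isnull(df1), df2, df1)
    where_failed = False
except ValueError:
    where_failed = True

print(patched.shape, where_failed)
```

Positions that are missing in both inputs, or that exist in only one of them without a value, stay NaN in the combined result.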
Horizontal stacking