Concatenating and Stacking

1. Concatenating Along an Axis

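The concat function glues or stacks objects together along an axis. The examples below assume that numpy and pandas have been imported as np and pd.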

In [55]: s1 = pd.Series([0, 1], index=['a', 'b'])

In [56]: s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

In [57]: s3 = pd.Series([5, 6], index=['f', 'g'])

In [58]: pd.concat([s1, s2, s3]) # pass the objects in a single list, not as separate arguments
Out[58]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

In [59]: pd.concat([s1, s2, s3], axis=1) # stack side by side; this pandas version emits a warning
C:\ProgramData\Anaconda3\Scripts\ipython:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
......

Out[59]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0

In [60]: pd.concat([s1, s2, s3], axis=1, sort=True) # pass sort explicitly, as the warning asks
Out[60]:
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
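
The NaN values above arise because, with axis=1, concat aligns the Series on the union of their index labels (the default is join='outer'). The following minimal sketch, written as a standalone script rather than an IPython session, contrasts this with join='inner', which keeps only the labels shared by every input. The Series s4 is a hypothetical addition that does not appear in the original example.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
# s4 is hypothetical: it shares the labels 'a' and 'b' with s1.
s4 = pd.Series([10, 11, 12], index=['a', 'b', 'z'])

# Default outer join: every label from either input gets a row,
# and positions with no data are filled with NaN.
outer = pd.concat([s1, s2], axis=1, sort=True)

# Inner join: only labels present in all inputs survive, so no NaN appears.
inner = pd.concat([s1, s4], axis=1, join='inner')

print(outer)
print(inner)
```

Because s1 and s2 share no labels, every row of the outer result has one missing value; the inner result keeps only rows 'a' and 'b'.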

For DataFrame objects, concat likewise stacks the rows downward by default (axis=0); set the axis parameter to change this:

In [66]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
    ...:                    columns=['one', 'two'])
    ...:

In [67]: df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
    ...:                    columns=['three', 'four'])
    ...:

In [68]: df1
Out[68]:
   one  two
a    0    1
b    2    3
c    4    5

In [69]: df2
Out[69]:
   three  four
a      5     6
c      7     8


In [71]: pd.concat([df1, df2], sort=True)

Out[71]:
   four  one  three  two
a   NaN  0.0    NaN  1.0
b   NaN  2.0    NaN  3.0
c   NaN  4.0    NaN  5.0
a   6.0  NaN    5.0  NaN
c   8.0  NaN    7.0  NaN

In [72]: pd.concat([df1, df2], axis=1, sort=True)
Out[72]:
   one  two  three  four
a    0    1    5.0   6.0
b    2    3    NaN   NaN
c    4    5    7.0   8.0
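
In Out[71] the labels 'a' and 'c' each appear twice, and nothing records which input a row came from. One possible way to keep track, sketched here under the assumption that a hierarchical index is acceptable, is the keys parameter of concat. The key names 'first' and 'second' are arbitrary labels chosen for this illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

# keys adds an outer index level naming the source of each row.
# 'first' and 'second' are illustrative names, not part of the original data.
stacked = pd.concat([df1, df2], keys=['first', 'second'], sort=False)

print(stacked)
# Select only the rows that came from df2.
print(stacked.loc['second'])
```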

2. Combining Overlapping Data

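Sometimes the missing values in one object should be filled with the values at the corresponding positions in another object. At the NumPy level, this can be done as follows: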

In [74]: a = pd.Series([np.nan, 2.5, 0, 3.5, 4.5, np.nan],
    ...:               index=['f', 'e', 'd', 'c', 'b', 'a'])

In [75]: b = pd.Series([0, np.nan, 2.1, np.nan, np.nan, 5], index=list('abcdef'))

In [76]: a
Out[76]:
f    NaN
e    2.5
d    0.0
c    3.5
b    4.5
a    NaN
dtype: float64

In [77]: b
Out[77]:
a    0.0
b    NaN
c    2.1
d    NaN
e    NaN
f    5.0
dtype: float64

In [78]: np.where(pd.isnull(a), b, a)
Out[78]: array([0. , 2.5, 0. , 3.5, 4.5, 5. ])

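In np.where(pd.isnull(a), b, a), each element of pd.isnull(a) is tested: where it is True the value is taken from b, otherwise from a. Note that np.where pairs elements by position and ignores the index labels, which are in opposite orders in a and b.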
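
Since a and b above carry the same labels in opposite orders, it is worth checking which values np.where actually pairs up. The sketch below reattaches a's index to the result; the variable names by_position and patched are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series([0, np.nan, 2.1, np.nan, np.nan, 5], index=list('abcdef'))

# np.where works on the underlying arrays: element i of a is paired
# with element i of b, regardless of their labels.
by_position = np.where(pd.isnull(a), b, a)

# Reattach a's index to see which label each value ended up under.
patched = pd.Series(by_position, index=a.index)
print(patched)

# a['f'] was missing and was filled from b's first element, b['a'],
# not from b['f'].
print(patched['f'], b['a'], b['f'])
```

When the two objects are not in the same order, a label-aware method is usually the safer choice.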

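pandas provides a dedicated method for this situation, combine_first. In the call below b is the calling object, so its missing values are patched from a, with values matched by index label: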

In [80]: b.combine_first(a)
Out[80]:
a    0.0
b    4.5
c    2.1
d    0.0
e    2.5
f    5.0
dtype: float64
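
The direction of the call matters: the calling object's values take priority, and the argument only supplies values where the caller has none. A minimal sketch comparing both directions follows; the comparison with fillna is an observation added here, not something stated in the original text.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series([0, np.nan, 2.1, np.nan, np.nan, 5], index=list('abcdef'))

b_first = b.combine_first(a)  # keep b's values, fill its gaps from a
a_first = a.combine_first(b)  # keep a's values, fill its gaps from b

# Label 'c' is present in both objects, so the caller wins each time.
print(b_first['c'], a_first['c'])

# For two Series with the same set of labels, fillna with a Series
# argument also aligns by label and gives the same values.
same_as_fillna = b.fillna(a)
print(same_as_fillna)
```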

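For DataFrame objects, combine_first does the same thing column by column, so you can think of it as patching the missing values of the calling object with data from the object you pass in. The result covers the union of both objects' row and column labels.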

In [81]: df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
    ...:                     'b': [np.nan, 2., np.nan, 6.],
    ...:                     'c': range(2, 18, 4)})
    ...:

In [82]: df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
    ...:                     'b': [np.nan, 3., 4., 6., 8.]})
    ...:

In [83]: df1
Out[83]:
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14

In [84]: df2
Out[84]:
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0

In [85]: df1.combine_first(df2)
Out[85]:
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
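
Two details of Out[85] are easy to miss: row 4 exists only in df2, and column c changes from integers to floats because an integer column cannot hold the NaN that appears in the new row. The following standalone sketch checks both points; the variable name patched is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})

patched = df1.combine_first(df2)

# The result spans the union of both indexes and both column sets.
print(patched.shape)

# Column c was integer in df1 but is float in the result,
# because row 4 has no value for c.
print(df1['c'].dtype, patched['c'].dtype)

# A cell stays NaN only when both inputs are missing at that label.
print(patched.loc[0, 'b'])
```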

Comments: 2

np.where requires the two objects to be broadcastable against each other; combine_first has no such requirement.

By 用户5888865032 on 2019-08-22 14:12

Horizontal stacking

By 用户1382844313 on 2019-04-05 17:53