合并连接

阅读: 4726     评论:2

可以通过多种方式将Pandas对象联合到一起:

  • pandas.merge: 根据一个或多个键进行连接。类似SQL的连接操作
  • pandas.concat:使对象在轴向上进行粘合或者‘堆叠’
  • combine_first:将重叠的数据拼接在一起,使用一个对象中的值填充另一个对象中的缺失值

merge连接

merge方法将两个pandas对象连接在一起,类似SQL的连接操作。默认情况下,它执行的是内连接,也就是两个对象的交集。通过参数how,还可以指定外连接、左连接和右连接。参数on指定在哪个键上连接,参数left_onright_on分别指定左右对象的连接键。

  • 外连接:并集
  • 内连接:交集
  • 左连接:左边对象全部保留
  • 右连接:右边对象全部保留
In [23]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
    ...:                     'data1': range(7)})
    ...:

In [24]: df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
    ...:                     'data2': range(3)})
    ...:

In [25]: df1
Out[25]:
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6

In [26]: df2
Out[26]:
  key  data2
0   a      0
1   b      1
2   d      2

In [27]: pd.merge(df1,df2) # 默认内链接,并智能地查找连接的键
Out[27]:
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0

In [28]: pd.merge(df1,df2,on='key') # 最好是显式地指定连接的键
Out[28]:
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0

In [30]: pd.merge(df1, df2, how='outer') # 外连接
Out[30]:
  key  data1  data2
0   b    0.0    1.0
1   b    1.0    1.0
2   b    6.0    1.0
3   a    2.0    0.0
4   a    4.0    0.0
5   a    5.0    0.0
6   c    3.0    NaN
7   d    NaN    2.0

In [31]: pd.merge(df1, df2, how='left')  # 左连接
Out[31]:
  key  data1  data2
0   b      0    1.0
1   b      1    1.0
2   a      2    0.0
3   c      3    NaN
4   a      4    0.0
5   a      5    0.0
6   b      6    1.0

In [32]: pd.merge(df1, df2, how='right') #右连接
Out[32]:
  key  data1  data2
0   b    0.0      1
1   b    1.0      1
2   b    6.0      1
3   a    2.0      0
4   a    4.0      0
5   a    5.0      0
6   d    NaN      2

In [33]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
    ...:                     'data1': range(7)})
    ...:

In [34]: df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
    ...:                     'data2': range(3)})
    ...:

In [35]: pd.merge(df3, df4, left_on='lkey', right_on='rkey') # 指定两边的键
Out[35]:
  lkey  data1 rkey  data2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0

多对多的merge连接是行的笛卡儿积。比如左边如果有3个‘b’行,右边有2个‘b’行,那么结果是3x2,6个‘b’行。

也可以同时指定多个键进行连接:

In [36]: left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
    ...:                      'key2': ['one', 'two', 'one'],
    ...:                      'lval': [1, 2, 3]})

In [37]: right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
    ...:                       'key2': ['one', 'one', 'one', 'two'],
    ...:                       'rval': [4, 5, 6, 7]})

In [38]: pd.merge(left, right, on=['key1', 'key2'], how='outer')
Out[38]:
  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0

merge操作中还有一个重叠列名的问题,比如上面的left和right两个数据,为此,我们可以使用suffixes参数,手动指定为重复的列名添加后缀:

In [41]: pd.merge(left, right, on='key1')
Out[41]:
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7

In [42]: pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
Out[42]:
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7

有时候,用于merge合并的键可能是某个对象的行索引:

In [43]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
    ...:                       'value': range(6)})
    ...:

In [44]: right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

In [45]: left1
Out[45]:
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

In [46]: right1
Out[46]:
   group_val
a        3.5
b        7.0

In [47]: pd.merge(left1, right1, left_on='key', right_index=True)
Out[47]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0

In [48]: pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
Out[48]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN

使用right_index=True参数显式地指出,右边的对象right1使用它的行索引作为连接的键。

事实上Pandas有一个join方法,可以帮助我们直接用行索引进行连接:

In [49]: left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
    ...:                      index=['a', 'c', 'e'],
    ...:                      columns=['Ohio', 'Nevada'])
    ...:

In [50]: right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
    ...:                       index=['b', 'c', 'd', 'e'],
    ...:                       columns=['Missouri', 'Alabama'])
    ...:

In [51]: left2
Out[51]:
   Ohio  Nevada
a   1.0     2.0
c   3.0     4.0
e   5.0     6.0

In [52]: right2
Out[52]:
   Missouri  Alabama
b       7.0      8.0
c       9.0     10.0
d      11.0     12.0
e      13.0     14.0

In [53]: pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Out[53]:
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0

In [54]: left2.join(right2, how='outer') # 与上面的操作效果一样
Out[54]:
   Ohio  Nevada  Missouri  Alabama
a   1.0     2.0       NaN      NaN
b   NaN     NaN       7.0      8.0
c   3.0     4.0       9.0     10.0
d   NaN     NaN      11.0     12.0
e   5.0     6.0      13.0     14.0

 分层索引进阶 粘合与堆叠 

评论总数: 2


点击登录后方可评论

Series对象name属性的值必须不为None才能使用pd.merge连接,且必须设置参数left_index=True,right_index=True



垂直合并