
5 Pandas data merging skills that Alibaba's data analyst with an annual salary of 700,000 must know

Python当打之年 (reposted): 2023-08-10 15:20:52, 1256 views

Not long ago, a friend in our technical exchange group mentioned that he had recently interviewed for a data position at Alibaba with a total annual package of 700,000. The interviewer asked him about five pandas data merging functions, but he could name only two.

So, which five are they? Today, we will take you to find out!

Directory:

1. concat
2. append
3. merge
4. join
5. combine
Summary


1. concat

concat is the pandas function dedicated to concatenating data. It is very powerful and supports both vertical and horizontal merging. Vertical merging is the default; the direction is set through the axis parameter.

pd.concat(
    objs: 'Iterable[NDFrame] | Mapping[Hashable, NDFrame]',
    axis=0,
    join='outer',
    ignore_index: 'bool' = False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity: 'bool' = False,
    sort: 'bool' = False,
    copy: 'bool' = True,
) -> 'FrameOrSeriesUnion'

The parameters have the following meanings:

objs: The data to concatenate, given as a list of DataFrame or Series objects (a mapping also works)

axis=0: The concatenation direction. The default 0 stacks vertically; 1 joins horizontally

join='outer': How the other axis is handled. The default outer takes the union; inner takes the intersection

ignore_index: Whether to discard the original index. If True, the result is renumbered 0, 1, 2, ...

keys=None: Labels for the inputs; the values passed become the first (outermost) index level

levels=None: Used to construct the multi-level index

names=None: Names for the levels of the resulting index

verify_integrity: Whether to check for duplicate index values. If True, a duplicate index raises an error

sort: Whether to sort the columns when merging as a union

copy: Whether to make a deep copy

Next, let's demonstrate what the function can do.

Basic concatenation

In [1]: import pandas as pd

In [2]: s1 = pd.Series(['a', 'b'])

In [3]: s2 = pd.Series(['c', 'd'])

In [4]: s1
Out[4]: 
0    a
1    b
dtype: object

In [5]: s2
Out[5]: 
0    c
1    d
dtype: object

In [6]: pd.concat([s1, s2])
Out[6]: 
0    a
1    b
0    c
1    d
dtype: object

In [7]: df1 = pd.DataFrame([['a', 1], ['b', 2]],
   ...:                     columns=['letter', 'number'])

In [8]: df2 = pd.DataFrame([['c', 3], ['d', 4]],
   ...:                     columns=['letter', 'number'])

In [9]: pd.concat([df1, df2])
Out[9]: 
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

Horizontal concatenation

In [10]: pd.concat([df1, df2], axis=1)
Out[10]: 
  letter  number letter  number
0      a       1      c       3
1      b       2      d       4

By default, concat takes the union. If one of the inputs has no corresponding row or column, the missing positions are filled with NaN.
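The row case of this rule is easy to miss, so here is a minimal sketch of it. The frames `left` and `right` and their index labels are made up for illustration and are not part of the article's data.

```python
import pandas as pd

# Hypothetical inputs whose row labels only partly overlap.
left = pd.DataFrame({"number": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"animal": ["cat", "dog"]}, index=[1, 2])

# axis=1 aligns on the row index; with the default join="outer" every
# label is kept and positions without data become NaN.
wide = pd.concat([left, right], axis=1)
print(wide)

# join="inner" keeps only the labels present in both inputs.
common = pd.concat([left, right], axis=1, join="inner")
print(common)
```

Only label 1 exists on both sides, so it is the only row that survives the inner variant.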

Merging on the intersection

In [11]: df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
    ...:                     columns=['letter', 'number', 'animal'])

In [12]: df1
Out[12]: 
  letter  number
0      a       1
1      b       2

In [13]: df3
Out[13]: 
  letter  number animal
0      c       3    cat
1      d       4    dog

In [14]: pd.concat([df1, df3], join='inner')
Out[14]: 
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

Resetting the index (discarding the original index)

In [15]: pd.concat([df1, df3], join='inner', ignore_index=True)
Out[15]: 
  letter  number
0      a       1
1      b       2
2      c       3
3      d       4
# The following is equivalent to the output above
In [16]: pd.concat([df1, df3], join='inner').reset_index(drop=True)
Out[16]: 
  letter  number
0      a       1
1      b       2
2      c       3
3      d       4

Specifying index keys

In [17]: pd.concat([df1, df3], keys=['df1','df3'])
Out[17]: 
      letter  number animal
df1 0      a       1    NaN
    1      b       2    NaN
df3 0      c       3    cat
    1      d       4    dog

In [18]: pd.concat([df1, df3], keys=['df1','df3'], names=['df名称','行ID'])
Out[18]: 
         letter  number animal
df名称 行ID                      
df1  0        a       1    NaN
     1        b       2    NaN
df3  0        c       3    cat
     1        d       4    dog
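Since objs also accepts a mapping, the same labelled result can be sketched by passing a dict. The dict keys then play the role of keys. The names `df_a`, `df_b`, "first", "second", "source" and "row" are illustrative choices, not names from the article.

```python
import pandas as pd

df_a = pd.DataFrame({"letter": ["a", "b"], "number": [1, 2]})
df_b = pd.DataFrame({"letter": ["c", "d"], "number": [3, 4]})

# The dict keys become the outer index level, as if passed via keys=.
stacked = pd.concat({"first": df_a, "second": df_b}, names=["source", "row"])
print(stacked)

# The outer level makes it easy to pull one input back out.
second_part = stacked.loc["second"]
print(second_part)
```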

Checking for duplicates

If the index contains duplicates, the check fails and an error is raised

In [19]: pd.concat([df1, df3], verify_integrity=True)
Traceback (most recent call last):
...
ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
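One way to handle this error in a script is sketched below: catch the ValueError, then renumber the rows with ignore_index=True so the overlap disappears. The frames `df_a` and `df_b` are small stand-ins invented for the example.

```python
import pandas as pd

df_a = pd.DataFrame({"letter": ["a", "b"]})
df_b = pd.DataFrame({"letter": ["c", "d"]})

# Both frames use the default labels 0 and 1, so the check fails.
try:
    pd.concat([df_a, df_b], verify_integrity=True)
    overlap_detected = False
except ValueError as exc:
    overlap_detected = True
    print(f"concat refused: {exc}")

# Discarding the original labels yields a fresh, unique 0..n-1 index.
clean = pd.concat([df_a, df_b], ignore_index=True)
print(clean.index.is_unique)
```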

Sorting columns in a union merge

In [21]: pd.concat([df1, df3], sort=True)
Out[21]: 
  animal letter  number
0    NaN      a       1
1    NaN      b       2
0    cat      c       3
1    dog      d       4

Concatenating a DataFrame with a Series

In [22]: pd.concat([df1, s1])
Out[22]: 
  letter  number    0
0      a     1.0  NaN
1      b     2.0  NaN
0    NaN     NaN    a
1    NaN     NaN    b

In [23]: pd.concat([df1, s1], axis=1)
Out[23]: 
  letter  number  0
0      a       1  a
1      b       2  b
# To add a new column, one of the following two approaches is usually preferred
In [24]: df1.assign(新增列=s1)
Out[24]: 
  letter  number 新增列
0      a       1   a
1      b       2   b

In [25]: df1['新增列'] = s1

In [26]: df1
Out[26]: 
  letter  number 新增列
0      a       1   a
1      b       2   b

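Those are some of the things concat can do. By comparison, another function, append, can also be used to add data (vertical merging).

2. append

append is mainly used to add rows and is a simple, direct way to merge data. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the examples in this section need an older pandas release.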

df.append(
    other,
    ignore_index: 'bool' = False,
    verify_integrity: 'bool' = False,
    sort: 'bool' = False,
) -> 'DataFrame'

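The parameters have the following meanings:

other: The data to append; a DataFrame, a Series, or a list of them

ignore_index: Whether to discard the original index. If True, the result is renumbered 0, 1, 2, ...

verify_integrity: Whether to check for duplicate index values. If True, a duplicate index raises an error

sort: Whether to sort the columns when merging as a union

Next, let's demonstrate what the function can do.

Basic append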

In [41]: df1.append(df2)
Out[41]: 
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

In [42]: df1.append([df1,df2,df3])
Out[42]: 
  letter  number animal
0      a       1    NaN
1      b       2    NaN
0      a       1    NaN
1      b       2    NaN
0      c       3    NaN
1      d       4    NaN
0      c       3    cat
1      d       4    dog

Resetting the index (discarding the original index)

In [43]: df1.append([df1,df2,df3], ignore_index=True)
Out[43]: 
  letter  number animal
0      a       1    NaN
1      b       2    NaN
2      a       1    NaN
3      b       2    NaN
4      c       3    NaN
5      d       4    NaN
6      c       3    cat
7      d       4    dog

Checking for duplicates

If the index contains duplicates, the check fails and an error is raised

In [44]: df1.append([df1,df2], verify_integrity=True)
Traceback (most recent call last):
...
ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')

Sorting columns

In [46]: df1.append([df1,df2,df3], sort=True)
Out[46]: 
  animal letter  number
0    NaN      a       1
1    NaN      b       2
0    NaN      a       1
1    NaN      b       2
0    NaN      c       3
1    NaN      d       4
0    cat      c       3
1    dog      d       4

Appending a Series

In [49]: s = pd.Series({'letter':'s1','number':9})

In [50]: s
Out[50]: 
letter    s1
number     9
dtype: object

In [51]: df1.append(s)
Traceback (most recent call last):
...
TypeError: Can only append a Series if ignore_index=True or if the Series has a name

In [53]: df1.append(s, ignore_index=True)
Out[53]: 
  letter  number
0      a       1
1      b       2
2     s1       9

Appending a dict

This is handy when scraping: each time a record is scraped, it is merged into a DataFrame for storage

In [54]: dic = {'letter':'s1','number':9}

In [55]: df1.append(dic, ignore_index=True)
Out[55]: 
  letter  number
0      a       1
1      b       2
2     s1       9
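The same scraping pattern can also be written with pd.concat, which does not depend on DataFrame.append. The sketch below collects records in a plain list and builds the result once at the end. The `records` list is made-up data standing in for scraped rows.

```python
import pandas as pd

df = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])

# Hypothetical scraped records, gathered in an ordinary list first.
records = [{"letter": "s1", "number": 9}, {"letter": "s2", "number": 10}]

# One concat at the end instead of one append call per record.
result = pd.concat([df, pd.DataFrame(records)], ignore_index=True)
print(result)
```

Building the frame once also avoids copying the accumulated data on every added row, which is what repeated appends do.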

3. merge

merge works like JOIN in SQL. It can be called as pd.merge or df.merge; the difference is that with the latter, the calling DataFrame is the left-hand data.

pd.merge(
    left: 'DataFrame | Series',
    right: 'DataFrame | Series',
    how: 'str' = 'inner',
    on: 'IndexLabel | None' = None,
    left_on: 'IndexLabel | None' = None,
    right_on: 'IndexLabel | None' = None,
    left_index: 'bool' = False,
    right_index: 'bool' = False,
    sort: 'bool' = False,
    suffixes: 'Suffixes' = ('_x', '_y'),
    copy: 'bool' = True,
    indicator: 'bool' = False,
    validate: 'str | None' = None,
) -> 'DataFrame'

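The key parameters have the following meanings:

left: The left-hand data to join

right: The right-hand data to join

how: How to join. The default is inner; outer, left, right and cross are also available

on: The key column(s) to join on. They must exist in both the left and right data; otherwise use left_on and right_on

left_on: The key column(s) of the left-hand data

right_on: The key column(s) of the right-hand data

left_index: True means the index of the left-hand data is the join key

right_index: True means the index of the right-hand data is the join key

suffixes: Suffixes appended to column names that appear on both sides after the merge. The default is ('_x', '_y') and can be set freely

indicator: Whether to add a column showing which side each merged row came from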
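The signature also lists validate, which the list above does not cover. As a rough sketch, it asks merge to check the key relationship before joining and to raise pandas.errors.MergeError when the check fails. The frames `orders` and `prices` are invented for this illustration.

```python
import pandas as pd
from pandas.errors import MergeError

orders = pd.DataFrame({"key": ["foo", "bar"], "qty": [1, 2]})
# Hypothetical right-hand data with a duplicated key.
prices = pd.DataFrame({"key": ["foo", "foo"], "price": [5, 6]})

# "one_to_one" requires unique keys on both sides, so this is rejected.
try:
    orders.merge(prices, on="key", validate="one_to_one")
    keys_unique = True
except MergeError as exc:
    keys_unique = False
    print(f"merge refused: {exc}")

# "one_to_many" only requires unique keys on the left, which holds here.
checked = orders.merge(prices, on="key", validate="one_to_many")
print(checked)
```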

Next, let's demonstrate what the function can do.

Basic merge

In [55]: df1 = pd.DataFrame({'key': ['foo', 'bar', 'bal'],
    ...:                     'value2': [1, 2, 3]})

In [56]: df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
    ...:                     'value1': [5, 6, 7]})

In [57]: df1.merge(df2)
Out[57]: 
   key  value2  value1
0  foo       1       5
1  bar       2       6

Other join types

In [58]: df1.merge(df2, how='left')
Out[58]: 
   key  value2  value1
0  foo       1     5.0
1  bar       2     6.0
2  bal       3     NaN

In [59]: df1.merge(df2, how='right')
Out[59]: 
   key  value2  value1
0  foo     1.0       5
1  bar     2.0       6
2  baz     NaN       7

In [60]: df1.merge(df2, how='outer')
Out[60]: 
   key  value2  value1
0  foo     1.0     5.0
1  bar     2.0     6.0
2  bal     3.0     NaN
3  baz     NaN     7.0

In [61]: df1.merge(df2, how='cross')
Out[61]: 
  key_x  value2 key_y  value1
0   foo       1   foo       5
1   foo       1   bar       6
2   foo       1   baz       7
3   bar       2   foo       5
4   bar       2   bar       6
5   bar       2   baz       7
6   bal       3   foo       5
7   bal       3   bar       6
8   bal       3   baz       7

Specifying join keys

You can specify a single join key or several join keys

In [62]: df1 = pd.DataFrame({'lkey1': ['foo', 'bar', 'bal'],
    ...:                     'lkey2': ['a', 'b', 'c'],
    ...:                     'value2': [1, 2, 3]})

In [63]: df2 = pd.DataFrame({'rkey1': ['foo', 'bar', 'baz'],
    ...:                     'rkey2': ['a', 'b', 'c'],
    ...:                     'value2': [5, 6, 7]})
    
In [64]: df1
Out[64]: 
  lkey1 lkey2  value2
0   foo     a       1
1   bar     b       2
2   bal     c       3

In [65]: df2
Out[65]: 
  rkey1 rkey2  value2
0   foo     a       5
1   bar     b       6
2   baz     c       7

In [66]: df1.merge(df2, left_on='lkey1', right_on='rkey1')
Out[66]: 
  lkey1 lkey2  value2_x rkey1 rkey2  value2_y
0   foo     a         1   foo     a         5
1   bar     b         2   bar     b         6

In [67]: df1.merge(df2, left_on=['lkey1','lkey2'], right_on=['rkey1','rkey2'])
Out[67]: 
  lkey1 lkey2  value2_x rkey1 rkey2  value2_y
0   foo     a         1   foo     a         5
1   bar     b         2   bar     b         6

Using the index as the key

In [68]: df1.merge(df2, left_index=True, right_index=True)
Out[68]: 
  lkey1 lkey2  value2_x rkey1 rkey2  value2_y
0   foo     a         1   foo     a         5
1   bar     b         2   bar     b         6
2   bal     c         3   baz     c         7

Setting suffixes for duplicate column names

In [69]: df1.merge(df2, left_on='lkey1', right_on='rkey1', suffixes=['左','右'])
Out[69]: 
  lkey1 lkey2  value2左 rkey1 rkey2  value2右
0   foo     a        1   foo     a        5
1   bar     b        2   bar     b        6

Merge indicator

Adds a column showing where each row came from

In [70]: df1.merge(df2, left_on='lkey1', right_on='rkey1', suffixes=['左','右'], how='outer',
    ...:           indicator=True
    ...:       )
Out[70]: 
  lkey1 lkey2  value2左 rkey1 rkey2  value2右      _merge
0   foo     a      1.0   foo     a      5.0        both
1   bar     b      2.0   bar     b      6.0        both
2   bal     c      3.0   NaN   NaN      NaN   left_only
3   NaN   NaN      NaN   baz     c      7.0  right_only
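A common use of the _merge column is filtering for rows that exist on only one side (sometimes called an anti-join). The following is a small sketch of that idea. The frames reuse the key values from the basic merge example, and the names `left`, `right` and `marked` are chosen only for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar", "bal"], "value2": [1, 2, 3]})
right = pd.DataFrame({"key": ["foo", "bar", "baz"], "value1": [5, 6, 7]})

# An outer merge keeps every key; indicator=True records its origin.
marked = left.merge(right, on="key", how="outer", indicator=True)

# Keep only the rows whose key was found in `left` but not in `right`.
left_only = marked.loc[marked["_merge"] == "left_only", ["key", "value2"]]
print(left_only)
```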

4. join

join is to merge roughly what append is to concat: a simpler method for merging data

df.join(
    other: 'FrameOrSeriesUnion',
    on: 'IndexLabel | None' = None,
    how: 'str' = 'left',
    lsuffix: 'str' = '',
    rsuffix: 'str' = '',
    sort: 'bool' = False,
) -> 'DataFrame'

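The key parameters have the following meanings:

other: The right-hand data to merge

on: Column(s) of the calling DataFrame to match against the index of other; if omitted, the join is index on index

how: How to join. The default is left; right, outer and inner are also available

lsuffix: Suffix for same-named columns from the left side

rsuffix: Suffix for same-named columns from the right side

Next, let's demonstrate what the function can do.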

In [71]: df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
    ...:                     'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

In [72]: other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
    ...:                        'B': ['B0', 'B1', 'B2']})

In [73]: df
Out[73]: 
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5

In [74]: other
Out[74]: 
  key   B
0  K0  B0
1  K1  B1
2  K2  B2

In [75]: df.join(other, on='key')
Traceback (most recent call last):
...
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

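The call fails because join matches the on column of df against the index of other, which here is the default integer index. To join on key, key must be the index of other.

Specifying the key

In [76]: df.set_index('key').join(other.set_index('key'))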
Out[76]: 
      A    B
key         
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

In [77]: df.join(other.set_index('key'), on='key')
Out[77]: 
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN
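For comparison, the same left join can be sketched with merge, which matches column to column and so needs no set_index step. The frames below are shortened versions of df and other, and `via_join` and `via_merge` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                   "A": ["A0", "A1", "A2", "A3"]})
other = pd.DataFrame({"key": ["K0", "K1"], "B": ["B0", "B1"]})

# join: the "key" column of df is matched against the index of other.
via_join = df.join(other.set_index("key"), on="key")

# merge: the "key" columns are matched directly; how="left" mirrors
# the default join type of join.
via_merge = df.merge(other, on="key", how="left")

print(via_join)
print(via_merge)
```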

Specifying suffixes for duplicate column names

In [78]: df.join(other, lsuffix='_左', rsuffix='右')
Out[78]: 
  key_左   A key右    B
0    K0  A0   K0   B0
1    K1  A1   K1   B1
2    K2  A2   K2   B2
3    K3  A3  NaN  NaN
4    K4  A4  NaN  NaN
5    K5  A5  NaN  NaN

The remaining parameters are not covered in detail here; they work essentially the same way as in merge.

5. combine

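When merging data, we may need to compute something from the values at corresponding positions. pandas provides the combine and combine_first methods for this kind of merge.

df.combine(
    other: 'DataFrame',
    func,
    fill_value=None,
    overwrite: 'bool' = True,
) -> 'DataFrame'

For example, keeping the smaller data when merging. Note that take_smaller below keeps, for each column, whichever of the two columns has the smaller sum, while np.minimum compares cell by cell

In [79]: df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})

In [80]: df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})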

In [81]: df1
Out[81]: 
   A  B
0  0  4
1  0  4

In [82]: df2
Out[82]: 
   A  B
0  1  3
1  1  3

In [83]: take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2

In [84]: df1.combine(df2, take_smaller)
Out[84]: 
   A  B
0  0  3
1  0  3

# A numpy function can be used as well
In [85]: import numpy as np

In [86]: df1.combine(df2, np.minimum)
Out[86]: 
   A  B
0  0  3
1  0  3
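The two calls above happen to give the same result, but the functions follow different rules. A small sketch with made-up single-column frames shows where they diverge. The values in `df1` and `df2` here are chosen only to expose the difference.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [0, 5]})
df2 = pd.DataFrame({"A": [3, 1]})

# Column-level rule: keep whichever whole column has the smaller sum.
take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2
by_column = df1.combine(df2, take_smaller)

# Element-level rule: np.minimum compares the columns value by value.
by_cell = df1.combine(df2, np.minimum)

print(by_column["A"].tolist())  # [3, 1]: df2's column wins (sum 4 < 5)
print(by_cell["A"].tolist())    # [0, 1]: the smaller value in each row
```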

Filling missing values with fill_value

In [87]: df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})

In [87]: df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})

In [88]: df1
Out[88]: 
   A    B
0  0  NaN
1  0  4.0

In [89]: df2
Out[89]: 
   A  B
0  1  3
1  1  3

In [90]: df1.combine(df2, take_smaller, fill_value=-88)
Out[90]: 
   A     B
0  0 -88.0
1  0   4.0

Keeping existing values with overwrite=False

In [91]: df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})

In [92]: df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2])

In [93]: df1
Out[93]: 
   A  B
0  0  4
1  0  4

In [94]: df2
Out[94]: 
   B   C
1  3 -10
2  3   1

In [95]: df1.combine(df2, take_smaller)
Out[95]: 
    A    B     C
0 NaN  NaN   NaN
1 NaN  3.0 -10.0
2 NaN  3.0   1.0
# Keep the existing values of column A
In [96]: df1.combine(df2, take_smaller, overwrite=False)
Out[96]: 
     A    B     C
0  0.0  NaN   NaN
1  0.0  3.0 -10.0
2  NaN  3.0   1.0

The other method: combine_first

df.combine_first(other: 'DataFrame') -> 'DataFrame'

Where an element of df is null, it is replaced by the value from other; the result covers the union of rows and columns

In [97]: df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})

In [98]: df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})

In [99]: df1
Out[99]: 
     A    B
0  NaN  NaN
1  0.0  4.0

In [100]: df2
Out[100]: 
   A  B
0  1  3
1  1  3

In [101]: df1.combine_first(df2)
Out[101]: 
     A    B
0  1.0  3.0
1  0.0  4.0

In [102]: df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})

In [103]: df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])

In [104]: df1
Out[104]: 
     A    B
0  NaN  4.0
1  0.0  NaN

In [105]: df2
Out[105]: 
   B  C
1  3  1
2  3  1

In [106]: df1.combine_first(df2)
Out[106]: 
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
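Because the caller's non-null values always win, the order of the two frames matters. The sketch below swaps them to show the effect. The frame names `primary` and `backup` are illustrative, and the values mirror the first combine_first example.

```python
import pandas as pd

primary = pd.DataFrame({"A": [None, 0], "B": [None, 4]})
backup = pd.DataFrame({"A": [1, 1], "B": [3, 3]})

# primary first: only its gaps (row 0) are filled from backup.
patched = primary.combine_first(backup)
print(patched)

# backup first: it has no gaps, so nothing from primary is used.
swapped = backup.combine_first(primary)
print(swapped)
```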

Summary

That covers everything in this introduction to merging data with pandas. Comparing the functions, we can see that:

append is mainly used to append data vertically and is relatively simple and direct;
concat is the most powerful: it can merge data both vertically and horizontally and supports many other options;
merge is mainly used to merge data horizontally, similar to JOIN in SQL;
join is relatively simple; it merges data horizontally, and its conditions are relatively strict;
combine works more like an element-level merge, combining data according to a rule (a function).


Statement:

This article is reproduced from Python当打之年. If there is any infringement, please contact admin@php.cn to have it deleted.
