Pandas Notes

Combining Datasets - 01 The pandas.concat() Function

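In practice, combining datasets is one of the most common operations: intuitively, it means joining two or more datasets end to end, either vertically or horizontally. Pandas offers several ways to combine datasets, and pandas.concat() plays a central role among them. It concatenates pandas.Series or pandas.DataFrame objects along one axis and, for pandas.DataFrame objects, applies optional set logic on the other axis, such as a union (join set to 'outer') or an intersection (join set to 'inner').

pandas.concat() combines pandas.Series and pandas.DataFrame objects according to a given axis and its other parameter settings. This section goes through all of its parameters, with examples, to make the function easier to understand.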

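1. When all the objects to be concatenated are pandas.Series and they are concatenated along the index (axis = 0), pandas.concat() returns a pandas.Series object.

2. When at least one of the objects to be concatenated is a pandas.DataFrame, pandas.concat() returns a pandas.DataFrame object.

3. When the objects are concatenated along the columns (axis = 1), pandas.concat() also returns a pandas.DataFrame object.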
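
The three rules above can be checked directly. The following is a minimal sketch; the names s1, s2 and df are illustrative and do not come from the rest of these notes.

```python
import pandas as pd

# Illustrative objects (hypothetical names, used only in this sketch)
s1 = pd.Series(['a', 'b'], index=[1, 2])
s2 = pd.Series(['c', 'd'], index=[3, 4])
df = pd.DataFrame({'x': ['e', 'f']}, index=[5, 6])

# Rule 1: only Series, concatenated along the index -> Series
series_rows = pd.concat([s1, s2])

# Rule 2: at least one DataFrame among the inputs -> DataFrame
mixed = pd.concat([s1, df])

# Rule 3: concatenated along the columns -> DataFrame, even for Series only
series_cols = pd.concat([s1, s2], axis=1)

print(type(series_rows).__name__)   # Series
print(type(mixed).__name__)         # DataFrame
print(type(series_cols).__name__)   # DataFrame
```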

Let us first look at the parameters that pandas.concat() accepts.

pandas.concat(objs,  
              axis = 0,  
              join = 'outer',  
              ignore_index = False,  
              keys = None,  
              levels = None,  
              names = None,  
              verify_integrity = False,  
              sort = False,  
              copy = True)

For more details on pandas.concat(), see the official Pandas documentation at https://pandas.pydata.org/docs/reference/api/pandas.concat.html?highlight=pandas%20concat#pandas.concat .

In this section we use pandas.Series and pandas.DataFrame objects in turn to illustrate pandas.concat().

In [1]:

import pandas as pd
pd.__version__

Out[1]:

'1.0.3'

Concatenating pandas.Series Objects

Among the parameters of pandas.concat(), only objs is required; all the others are optional.
objs is a sequence (for example, a list) of pandas.Series or pandas.DataFrame objects.

In [2]:

ser1 = pd.Series(['Albert', 'Bell', 'Cherry'],
                 index = [1, 2, 3])
ser1

Out[2]:

1    Albert
2      Bell
3    Cherry
dtype: object

In [3]:

ser2 = pd.Series(['David', 'Ellen', 'Flora'],
                 index = [4, 5, 6])
ser2

Out[3]:

4    David
5    Ellen
6    Flora
dtype: object

In [4]:

pd.concat([ser1, ser2])

Out[4]:

1    Albert
2      Bell
3    Cherry
4     David
5     Ellen
6     Flora
dtype: object

Concatenating pandas.Series - Duplicate Index Labels

In [5]:

ser3 = pd.Series(['Glen', 'Hedy', 'Ian'],
                 index = [5, 6, 7])
ser3

Out[5]:

5    Glen
6    Hedy
7     Ian
dtype: object

In [6]:

pd.concat([ser1, ser2, ser3])

Out[6]:

1    Albert
2      Bell
3    Cherry
4     David
5     Ellen
6     Flora
5      Glen
6      Hedy
7       Ian
dtype: object

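Note that pandas.concat() preserves the original index labels by default. In the example above, both ser2 and ser3 contain the labels 5 and 6, so each of these labels appears twice in the resulting pandas.Series.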
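
Duplicate labels affect later lookups, so it can help to detect them after concatenating. The sketch below rebuilds ser2 and ser3 with the same values as above and uses Index.is_unique and Index.duplicated() to inspect the result; the variable names combined and dup_labels are illustrative.

```python
import pandas as pd

ser2 = pd.Series(['David', 'Ellen', 'Flora'], index=[4, 5, 6])
ser3 = pd.Series(['Glen', 'Hedy', 'Ian'], index=[5, 6, 7])
combined = pd.concat([ser2, ser3])

# The index is no longer unique after concatenation
print(combined.index.is_unique)        # False

# Labels that occur more than once
dup_labels = combined.index[combined.index.duplicated()].tolist()
print(dup_labels)                      # [5, 6]

# Selecting a duplicated label returns every matching row, not a scalar
print(combined.loc[5].tolist())        # ['Ellen', 'Glen']
```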

Concatenating pandas.DataFrame Objects

To illustrate the pandas.DataFrame case, we borrow the make_df() function from 'Python Data Science Handbook', which makes it easy to build different pandas.DataFrame objects.

In [7]:

def make_df(columns, index):
    """ DataFrame creator """
    data_dict = {c: [str(c) + '-' + str(i) for i in index]
                for c in columns}
    return pd.DataFrame(data_dict, index)

In [8]:

df1 = make_df(['Albert', 'Bell', 'Cherry'],
              [1, 2, 3])
df1

Out[8]:

	Albert	Bell	Cherry
1	Albert-1	Bell-1	Cherry-1
2	Albert-2	Bell-2	Cherry-2
3	Albert-3	Bell-3	Cherry-3

In [9]:

df2 = make_df(['Albert', 'Bell', 'Cherry'],
              [4, 5, 6])
df2

Out[9]:

	Albert	Bell	Cherry
4	Albert-4	Bell-4	Cherry-4
5	Albert-5	Bell-5	Cherry-5
6	Albert-6	Bell-6	Cherry-6

In [10]:

pd.concat([df1, df2])

Out[10]:

	Albert	Bell	Cherry
1	Albert-1	Bell-1	Cherry-1
2	Albert-2	Bell-2	Cherry-2
3	Albert-3	Bell-3	Cherry-3
4	Albert-4	Bell-4	Cherry-4
5	Albert-5	Bell-5	Cherry-5
6	Albert-6	Bell-6	Cherry-6

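By default, pandas.DataFrame objects are concatenated along the rows (axis = 0). A later section shows how to set a parameter so that they are concatenated along the columns (axis = 1) instead.

Concatenating pandas.DataFrame Objects - Duplicate Index Labels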

In [11]:

df3 = make_df(['Albert', 'Bell', 'Cherry'],
              [5, 6, 7])
df3

Out[11]:

	Albert	Bell	Cherry
5	Albert-5	Bell-5	Cherry-5
6	Albert-6	Bell-6	Cherry-6
7	Albert-7	Bell-7	Cherry-7

In [12]:

pd.concat([df1, df2, df3])

Out[12]:

	Albert	Bell	Cherry
1	Albert-1	Bell-1	Cherry-1
2	Albert-2	Bell-2	Cherry-2
3	Albert-3	Bell-3	Cherry-3
4	Albert-4	Bell-4	Cherry-4
5	Albert-5	Bell-5	Cherry-5
6	Albert-6	Bell-6	Cherry-6
5	Albert-5	Bell-5	Cherry-5
6	Albert-6	Bell-6	Cherry-6
7	Albert-7	Bell-7	Cherry-7

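As with pandas.Series objects, pandas.concat() preserves the original index labels by default. In the example above, both df2 and df3 contain the labels 5 and 6, so each of these labels appears twice in the resulting pandas.DataFrame.

The verify_integrity Parameter

verify_integrity is a Boolean that defaults to False. It determines whether to check the new concatenated axis for duplicates: duplicate index labels when concatenating along the rows, or duplicate column names when concatenating along the columns.

As the previous example shows, duplicate index labels are allowed by default. With large datasets it is impractical to check by hand whether the datasets being combined overlap. If you still want to detect duplicate labels, set verify_integrity to True; pandas.concat() then raises a ValueError when duplicates are found.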

In [13]:

try:
    pd.concat([df2, df3], verify_integrity = True)
except ValueError as err:
    print(f"Exception: {err}")

Exception: Indexes have overlapping values: Int64Index([5, 6], dtype='int64')
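
The same check applies to column names when concatenating along the columns. The following sketch uses two small illustrative frames (left and right are hypothetical names) that share the column 'Bell'.

```python
import pandas as pd

left = pd.DataFrame({'Albert': ['Albert-1'], 'Bell': ['Bell-1']}, index=[1])
right = pd.DataFrame({'Bell': ['Bell-1'], 'Cherry': ['Cherry-1']}, index=[1])

# Without the check, the duplicated column name 'Bell' is silently kept
wide = pd.concat([left, right], axis=1)
print(wide.columns.tolist())    # ['Albert', 'Bell', 'Bell', 'Cherry']

# With the check, the overlapping column name raises ValueError
try:
    pd.concat([left, right], axis=1, verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print(f"Exception: {err}")
```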

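The axis Parameter

axis specifies the axis along which to concatenate. A value of 0 (or 'index') concatenates along the rows, and a value of 1 (or 'columns') concatenates along the columns. The default is 0 (or 'index').

As in the previous examples, pandas.DataFrame objects are concatenated along the rows (axis = 0 or axis = 'index') by default; the axis parameter lets us choose the axis. The following example concatenates along the columns (axis = 1 or axis = 'columns').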

In [14]:

df4 = make_df(['Albert', 'Bell', 'Cherry'],
              [1, 2, 3])
df4

Out[14]:

	Albert	Bell	Cherry
1	Albert-1	Bell-1	Cherry-1
2	Albert-2	Bell-2	Cherry-2
3	Albert-3	Bell-3	Cherry-3

In [15]:

df5 = make_df(['David', 'Ellen', 'Flora'],
              [1, 2, 3])
df5

Out[15]:

	David	Ellen	Flora
1	David-1	Ellen-1	Flora-1
2	David-2	Ellen-2	Flora-2
3	David-3	Ellen-3	Flora-3

In [16]:

pd.concat([df4, df5], axis = 1)

Out[16]:

	Albert	Bell	Cherry	David	Ellen	Flora
1	Albert-1	Bell-1	Cherry-1	David-1	Ellen-1	Flora-1
2	Albert-2	Bell-2	Cherry-2	David-2	Ellen-2	Flora-2
3	Albert-3	Bell-3	Cherry-3	David-3	Ellen-3	Flora-3
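
As a quick check that the numeric and string spellings of axis are interchangeable, the sketch below concatenates two small illustrative frames (left and right are hypothetical names) both ways and compares the results.

```python
import pandas as pd

left = pd.DataFrame({'Albert': ['Albert-1', 'Albert-2']}, index=[1, 2])
right = pd.DataFrame({'David': ['David-1', 'David-2']}, index=[1, 2])

by_number = pd.concat([left, right], axis=1)
by_name = pd.concat([left, right], axis='columns')

# Both spellings select the same axis and give the same result
print(by_number.equals(by_name))   # True
print(by_name.shape)               # (2, 2)
```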

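The ignore_index Parameter

ignore_index is a Boolean that defaults to False. When it is set to True, the result does not use the original labels along the concatenation axis (the index labels, or the column names when axis = 1); they are replaced by the integers 0, 1, 2, ....

In some applications the original index is not meaningful, and whether the labels overlap between datasets does not matter. In that case ignore_index can be used to discard the original index, and the combined dataset is given a new one.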

In [17]:

pd.concat([df2, df3], ignore_index = True)

Out[17]:

	Albert	Bell	Cherry
0	Albert-4	Bell-4	Cherry-4
1	Albert-5	Bell-5	Cherry-5
2	Albert-6	Bell-6	Cherry-6
3	Albert-5	Bell-5	Cherry-5
4	Albert-6	Bell-6	Cherry-6
5	Albert-7	Bell-7	Cherry-7

In [18]:

pd.concat([df4, df5], axis = 1, ignore_index = True)

Out[18]:

	0	1	2	3	4	5
1	Albert-1	Bell-1	Cherry-1	David-1	Ellen-1	Flora-1
2	Albert-2	Bell-2	Cherry-2	David-2	Ellen-2	Flora-2
3	Albert-3	Bell-3	Cherry-3	David-3	Ellen-3	Flora-3
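
For row-wise concatenation, ignore_index = True gives the same result as concatenating first and then calling reset_index(drop = True). The sketch below compares the two forms on small illustrative frames; the names renumbered and reset are hypothetical.

```python
import pandas as pd

df2 = pd.DataFrame({'Albert': ['Albert-4', 'Albert-5']}, index=[4, 5])
df3 = pd.DataFrame({'Albert': ['Albert-5', 'Albert-6']}, index=[5, 6])

renumbered = pd.concat([df2, df3], ignore_index=True)

# Equivalent two-step form: concatenate, then drop the old index
reset = pd.concat([df2, df3]).reset_index(drop=True)

print(renumbered.index.tolist())   # [0, 1, 2, 3]
print(renumbered.equals(reset))    # True
```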

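The join Parameter

join specifies how to handle the labels on the other axis (the one not being concatenated) when they are not identical across the datasets. Its value is either 'inner' or 'outer', and the default is 'outer'.

When concatenating along the rows (axis = 0) and the datasets do not have exactly the same columns, pandas.concat() by default expands the result to include every column (join = 'outer'), which corresponds to the union of the original column sets. Cells for which no data exists are filled with NaN.

To keep only the columns that every dataset has and drop the rest, use join = 'inner', which corresponds to the intersection of the original column sets.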

In [19]:

df6 = make_df(['Albert', 'Bell', 'Cherry'],
              [1, 2, 3])
df6

Out[19]:

	Albert	Bell	Cherry
1	Albert-1	Bell-1	Cherry-1
2	Albert-2	Bell-2	Cherry-2
3	Albert-3	Bell-3	Cherry-3

In [20]:

df7 = make_df(['Bell', 'Cherry', 'David'],
              [4, 5, 6])
df7

Out[20]:

	Bell	Cherry	David
4	Bell-4	Cherry-4	David-4
5	Bell-5	Cherry-5	David-5
6	Bell-6	Cherry-6	David-6

In [21]:

pd.concat([df6, df7])

Out[21]:

	Albert	Bell	Cherry	David
1	Albert-1	Bell-1	Cherry-1	NaN
2	Albert-2	Bell-2	Cherry-2	NaN
3	Albert-3	Bell-3	Cherry-3	NaN
4	NaN	Bell-4	Cherry-4	David-4
5	NaN	Bell-5	Cherry-5	David-5
6	NaN	Bell-6	Cherry-6	David-6

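pd.concat([df6, df7]) is equivalent to pd.concat([df6, df7], join = 'outer'): the result includes every column, including the cells that have no data. In contrast, pd.concat([df6, df7], join = 'inner') drops the columns that are not present in both datasets.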

In [22]:

pd.concat([df6, df7], join = 'inner')

Out[22]:

	Bell	Cherry
1	Bell-1	Cherry-1
2	Bell-2	Cherry-2
3	Bell-3	Cherry-3
4	Bell-4	Cherry-4
5	Bell-5	Cherry-5
6	Bell-6	Cherry-6
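
The examples above apply join to the columns because the rows are being concatenated. When concatenating along the columns (axis = 1), join applies to the row labels instead. A minimal sketch with two illustrative frames (left and right are hypothetical names) whose indexes only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'Albert': ['Albert-1', 'Albert-2', 'Albert-3']},
                    index=[1, 2, 3])
right = pd.DataFrame({'David': ['David-2', 'David-3', 'David-4']},
                     index=[2, 3, 4])

# Union of the row labels; missing cells are filled with NaN
outer = pd.concat([left, right], axis=1, join='outer')

# Intersection of the row labels; only rows present in both remain
inner = pd.concat([left, right], axis=1, join='inner')

print(sorted(outer.index.tolist()))      # [1, 2, 3, 4]
print(inner.index.tolist())              # [2, 3]
print(int(outer.isna().sum().sum()))     # 2
```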

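The keys Parameter

keys is a sequence that defaults to None. When given, its values are used as the outermost level of the hierarchical index that is built.

When combining datasets, we sometimes want to keep track of which original dataset each part came from. The keys parameter does this by building a pandas.Series or pandas.DataFrame with a hierarchical index (MultiIndex) whose level 0 holds the given labels.

In the following example, which concatenates along the rows, 'df2' and 'df3' are the level 0 labels of the hierarchical index, while level 1 reuses the index labels of the original datasets.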

In [23]:

df = pd.concat([df2, df3], keys = ['df2', 'df3'])
df

Out[23]:

		Albert	Bell	Cherry
df2	4	Albert-4	Bell-4	Cherry-4
	5	Albert-5	Bell-5	Cherry-5
	6	Albert-6	Bell-6	Cherry-6
df3	5	Albert-5	Bell-5	Cherry-5
	6	Albert-6	Bell-6	Cherry-6
	7	Albert-7	Bell-7	Cherry-7

In [24]:

df.index.levels

Out[24]:

FrozenList([['df2', 'df3'], [4, 5, 6, 7]])

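In this example the combined dataset has a MultiIndex on the rows: its level 0 labels are 'df2' and 'df3', and its level 1 labels are 4, 5, 6 and 7.

keys can also be used when concatenating along the columns (axis = 1 or axis = 'columns'). In that case keys supplies the level 0 labels of the column MultiIndex, while level 1 reuses the column names of the original datasets.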

In [25]:

df = pd.concat([df4, df5], axis = 1, keys = ['df4', 'df5'])
df

Out[25]:

	df4			df5
	Albert	Bell	Cherry	David	Ellen	Flora
1	Albert-1	Bell-1	Cherry-1	David-1	Ellen-1	Flora-1
2	Albert-2	Bell-2	Cherry-2	David-2	Ellen-2	Flora-2
3	Albert-3	Bell-3	Cherry-3	David-3	Ellen-3	Flora-3

In [26]:

df.columns.levels

Out[26]:

FrozenList([['df4', 'df5'], ['Albert', 'Bell', 'Cherry', 'David', 'Ellen', 'Flora']])

In this example the combined dataset has a MultiIndex on the columns: its level 0 labels are 'df4' and 'df5', and its level 1 labels are 'Albert', 'Bell', 'Cherry', 'David', 'Ellen' and 'Flora'.
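
One practical benefit of keys is that the level 0 labels can be used to pull the original pieces back out, and to tell apart rows whose original labels collide. The sketch below uses small illustrative frames with a hypothetical column called name.

```python
import pandas as pd

df2 = pd.DataFrame({'name': ['David', 'Ellen']}, index=[4, 5])
df3 = pd.DataFrame({'name': ['Glen', 'Hedy']}, index=[5, 6])

combined = pd.concat([df2, df3], keys=['df2', 'df3'])

# Selecting a level 0 label returns the rows that came from that frame
from_df3 = combined.loc['df3']
print(from_df3.index.tolist())               # [5, 6]

# A (level 0, level 1) tuple addresses a single row unambiguously,
# even though the label 5 appears in both original frames
print(combined.loc[('df2', 5), 'name'])      # Ellen
print(combined.loc[('df3', 5), 'name'])      # Glen
```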

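The levels Parameter

levels is a list of sequences that defaults to None.

The keys parameter described above turns the combined dataset into a hierarchically indexed object (Hierarchical Indexing) with two or more levels of row labels or column names. The labels of each level can be listed through the pandas.DataFrame.index.levels or pandas.DataFrame.columns.levels attribute.

To declare additional labels for a level, beyond those that actually occur in the constructed index, pass them through the levels parameter.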

In [27]:

df = pd.concat([df2, df3], keys = ['df2', 'df3'],
               levels = [['df1', 'df2', 'df3', 'df6']])
df

Out[27]:

		Albert	Bell	Cherry
df2	4	Albert-4	Bell-4	Cherry-4
	5	Albert-5	Bell-5	Cherry-5
	6	Albert-6	Bell-6	Cherry-6
df3	5	Albert-5	Bell-5	Cherry-5
	6	Albert-6	Bell-6	Cherry-6
	7	Albert-7	Bell-7	Cherry-7

In [28]:

df.index.levels

Out[28]:

FrozenList([['df1', 'df2', 'df3', 'df6'], [4, 5, 6, 7]])

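Note that 'df1' and 'df6' are the additional labels. They do not appear in the combined dataset itself, but the levels attribute of its index already contains them.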

In [29]:

df = pd.concat([df4, df5], axis = 1, keys = ['df4', 'df5'],
              levels = [['df0', 'df4', 'df5', 'df7']])
df

Out[29]:

	df4			df5
	Albert	Bell	Cherry	David	Ellen	Flora
1	Albert-1	Bell-1	Cherry-1	David-1	Ellen-1	Flora-1
2	Albert-2	Bell-2	Cherry-2	David-2	Ellen-2	Flora-2
3	Albert-3	Bell-3	Cherry-3	David-3	Ellen-3	Flora-3

In [30]:

df.columns.levels

Out[30]:

FrozenList([['df0', 'df4', 'df5', 'df7'], ['Albert', 'Bell', 'Cherry', 'David', 'Ellen', 'Flora']])

levels works for a column MultiIndex as well. In this example 'df0' and 'df7' are the additional column labels, and they too appear only in the levels attribute.
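
To see the difference between the declared level values and the values that actually occur, the sketch below compares index.levels with get_level_values(), and uses MultiIndex.remove_unused_levels() to drop the unused ones. The frames are small illustrative stand-ins, and the names combined and trimmed are hypothetical.

```python
import pandas as pd

df2 = pd.DataFrame({'Albert': ['Albert-4', 'Albert-5']}, index=[4, 5])
df3 = pd.DataFrame({'Albert': ['Albert-5', 'Albert-6']}, index=[5, 6])

combined = pd.concat([df2, df3], keys=['df2', 'df3'],
                     levels=[['df1', 'df2', 'df3', 'df6']])

# The declared level values, including the ones that are not used
print(list(combined.index.levels[0]))       # ['df1', 'df2', 'df3', 'df6']

# The values that actually occur in the data
used = list(combined.index.get_level_values(0).unique())
print(used)                                 # ['df2', 'df3']

# remove_unused_levels() returns an index with the unused values dropped
trimmed = combined.index.remove_unused_levels()
print(list(trimmed.levels[0]))              # ['df2', 'df3']
```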

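The names Parameter

names is a list that defaults to None.

names specifies the names of the levels in the resulting hierarchical index, for either a row or a column MultiIndex.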

In [31]:

df = pd.concat([df4, df5], axis = 1, keys = ['df4', 'df5'],
              levels = [['df0', 'df4', 'df5', 'df7']],
              names = ['upper', 'lower'])
df

Out[31]:

upper	df4			df5
lower	Albert	Bell	Cherry	David	Ellen	Flora
1	Albert-1	Bell-1	Cherry-1	David-1	Ellen-1	Flora-1
2	Albert-2	Bell-2	Cherry-2	David-2	Ellen-2	Flora-2
3	Albert-3	Bell-3	Cherry-3	David-3	Ellen-3	Flora-3

In this example 'upper' is the name of level 0 and 'lower' is the name of level 1.
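
Named levels can then be referred to by name rather than by number. The following sketch applies names to a row MultiIndex; the level names 'source' and 'row' and the column name are illustrative choices, not part of the examples above.

```python
import pandas as pd

df2 = pd.DataFrame({'name': ['David', 'Ellen']}, index=[4, 5])
df3 = pd.DataFrame({'name': ['Glen', 'Hedy']}, index=[5, 6])

combined = pd.concat([df2, df3], keys=['df2', 'df3'],
                     names=['source', 'row'])

print(list(combined.index.names))    # ['source', 'row']

# Level names can be used instead of level numbers
print(combined.xs('df3', level='source')['name'].tolist())   # ['Glen', 'Hedy']
print(combined.index.get_level_values('row').tolist())       # [4, 5, 5, 6]

# reset_index turns a named level into an ordinary column
flat = combined.reset_index(level='source')
print(flat.columns.tolist())         # ['source', 'name']
```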

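The sort Parameter

sort is a Boolean. It was added in version 0.23.0 with a default of None, and the default was changed to False in version 1.0.0.

When join is 'outer', setting sort to True sorts the non-concatenation axis of the combined dataset.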

In [32]:

df8 = make_df(['Albert', 'Cherry', 'David', 'Bell'],
              [1, 2, 3, 4])
df8

Out[32]:

	Albert	Cherry	David	Bell
1	Albert-1	Cherry-1	David-1	Bell-1
2	Albert-2	Cherry-2	David-2	Bell-2
3	Albert-3	Cherry-3	David-3	Bell-3
4	Albert-4	Cherry-4	David-4	Bell-4

In [33]:

df9 = make_df(['Bell', 'Flora', 'Ellen', 'David'],
              [2, 3, 5, 6])
df9

Out[33]:

	Bell	Flora	Ellen	David
2	Bell-2	Flora-2	Ellen-2	David-2
3	Bell-3	Flora-3	Ellen-3	David-3
5	Bell-5	Flora-5	Ellen-5	David-5
6	Bell-6	Flora-6	Ellen-6	David-6

In [34]:

pd.concat([df8, df9], sort = False)

Out[34]:

	Albert	Cherry	David	Bell	Flora	Ellen
1	Albert-1	Cherry-1	David-1	Bell-1	NaN	NaN
2	Albert-2	Cherry-2	David-2	Bell-2	NaN	NaN
3	Albert-3	Cherry-3	David-3	Bell-3	NaN	NaN
4	Albert-4	Cherry-4	David-4	Bell-4	NaN	NaN
2	NaN	NaN	David-2	Bell-2	Flora-2	Ellen-2
3	NaN	NaN	David-3	Bell-3	Flora-3	Ellen-3
5	NaN	NaN	David-5	Bell-5	Flora-5	Ellen-5
6	NaN	NaN	David-6	Bell-6	Flora-6	Ellen-6

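In the example above the datasets are concatenated along the rows (axis = 0). Because sort = False, the columns are not sorted.

In the following example sort = True, so the columns are sorted after the datasets are combined.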

In [35]:

pd.concat([df8, df9], sort = True)

Out[35]:

	Albert	Bell	Cherry	David	Ellen	Flora
1	Albert-1	Bell-1	Cherry-1	David-1	NaN	NaN
2	Albert-2	Bell-2	Cherry-2	David-2	NaN	NaN
3	Albert-3	Bell-3	Cherry-3	David-3	NaN	NaN
4	Albert-4	Bell-4	Cherry-4	David-4	NaN	NaN
2	NaN	Bell-2	NaN	David-2	Ellen-2	Flora-2
3	NaN	Bell-3	NaN	David-3	Ellen-3	Flora-3
5	NaN	Bell-5	NaN	David-5	Ellen-5	Flora-5
6	NaN	Bell-6	NaN	David-6	Ellen-6	Flora-6

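According to the official documentation, sort has no effect when join = 'inner', because the order of the non-concatenation axis is already preserved. (Pandas documentation: This has no effect when join='inner', which already preserves the order of the non-concatenation axis.)

However, the following examples, run with pandas 1.0.3, appear to behave differently from that description.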

In [36]:

pd.concat([df8, df9], join = 'inner')

Out[36]:

	David	Bell
1	David-1	Bell-1
2	David-2	Bell-2
3	David-3	Bell-3
4	David-4	Bell-4
2	David-2	Bell-2
3	David-3	Bell-3
5	David-5	Bell-5
6	David-6	Bell-6

With join = 'inner' and no sort argument, sort takes its default value of False, and the columns of the result are not sorted.

In [37]:

pd.concat([df8, df9], join = 'inner', sort = False)

Out[37]:

	David	Bell
1	David-1	Bell-1
2	David-2	Bell-2
3	David-3	Bell-3
4	David-4	Bell-4
2	David-2	Bell-2
3	David-3	Bell-3
5	David-5	Bell-5
6	David-6	Bell-6

With join = 'inner' and sort = False, the result is the same as with the default: the columns are not sorted.

In [38]:

pd.concat([df8, df9], join = 'inner', sort = True)

Out[38]:

	Bell	David
1	Bell-1	David-1
2	Bell-2	David-2
3	Bell-3	David-3
4	Bell-4	David-4
2	Bell-2	David-2
3	Bell-3	David-3
5	Bell-5	David-5
6	Bell-6	David-6

But with join = 'inner' and sort = True, the result differs from the default: the columns are sorted.
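
Since the documented and the observed behavior differ here, one option is not to rely on sort at all and to order the columns explicitly after concatenating. The sketch below shows two ways of doing so on reduced, illustrative versions of df8 and df9 (one row each); the names common, ordered and like_first are hypothetical.

```python
import pandas as pd

df8 = pd.DataFrame({'Albert': ['Albert-1'], 'Cherry': ['Cherry-1'],
                    'David': ['David-1'], 'Bell': ['Bell-1']}, index=[1])
df9 = pd.DataFrame({'Bell': ['Bell-2'], 'Flora': ['Flora-2'],
                    'Ellen': ['Ellen-2'], 'David': ['David-2']}, index=[2])

common = pd.concat([df8, df9], join='inner')

# Only the shared columns survive; their order is left to pandas here
print(sorted(common.columns))                 # ['Bell', 'David']

# Option 1: sort the column labels explicitly
ordered = common.sort_index(axis=1)
print(ordered.columns.tolist())               # ['Bell', 'David']

# Option 2: impose the column order of the first frame
like_first = common[[c for c in df8.columns if c in common.columns]]
print(like_first.columns.tolist())            # ['David', 'Bell']
```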

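The copy Parameter

copy is a Boolean that defaults to True.

According to the official documentation, copy = False means that data is not copied unnecessarily. (If False, do not copy data unnecessarily.) I searched official and unofficial sources, and traced through the source code, but found no documentation that clearly explains what the copy parameter is for.

I leave the explanation of the copy parameter to other Pandas experts.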

References:

pandas official website (https://pandas.pydata.org/)
'Python Data Science Handbook', Jake VanderPlas

老驥, 2020/5/7


phd.chi

maximaChi's blog
