Pandas data analysis – detailed explanation of merge data mosaic

Time:2020-1-14

WeChat public address: “Python reads money”
If there are any questions or suggestions, please leave a message for the public.

In order to facilitate maintenance, the data of general companies are stored in separate tables in the database, such as using a table to store basic information of all users, and a table to store user consumption. Therefore, in daily data processing, two tables are often neededSplice upUse. This operation corresponds to SQLjoinIn pandasmergeTo achieve. Let’s talk about this articlemergeThe main principle of.

The introduction above saysmergeIt is used to splice two tables, so the user information is naturally needed during splicingOne-to-one correspondenceTherefore, the two tables to be spliced need to have a commonKey (key)。 In summary, the wholemergeThe process is to send informationOne to one matchingThe process ofmergeThe four types of are'inner''left''right'and'outer'

I. inner

mergeOf'inner'The type of is calledInternal connection, it will takeKey intersection of two tablesSplicing. What do you mean? The next step is to disassemble in a graphic way.

First of all, we have the following data. The data on the left and right represent the user’sBasic informationandConsumption information, the key to connect the two tables isuserid

Pandas data analysis - detailed explanation of merge data mosaic

Now use'inner'Bymerge

In [6]: df_1.merge(df_2,how='inner',on='userid')
Out[6]:
  userid  age  payment
0      a   23     2000
1      c   32     3500

Process diagram:

① Take the key of two tablesintersectionHeredf_1anddf_2OfuseridThe intersection is{a,c}

Pandas data analysis - detailed explanation of merge data mosaic

② Corresponding match

Pandas data analysis - detailed explanation of merge data mosaic

Results

Pandas data analysis - detailed explanation of merge data mosaic

Process summary:

Pandas data analysis - detailed explanation of merge data mosaic

It is believed that the whole process is not difficult to understand. What is shown above is that under the same key, two tables correspond to only one data (one user corresponds to one consumption record). Then, ifOne user corresponds to multiple consumption recordsThen, how is it spliced?

Let’s assume that the current data is as follows:df_2There are two andaCorresponding data:

Pandas data analysis - detailed explanation of merge data mosaic

Same useinnerBymerge

In [12]: df_1.merge(df_2,how='inner',on='userid')
Out[12]:
  userid  age  payment
0      a   23     2000
1      a   23      500
2      b   46     1000
3      c   32     3500

The whole process exceptCorresponding matching stage, the others are basically the same as above.

Process diagram:

① Take the key of two tablesintersectionHeredf_1anddf_2OfuseridThe intersection is{a,b,c}

Pandas data analysis - detailed explanation of merge data mosaic

② When matching, due to theaThere are two corresponding consumption records, so when splicing, the user’s basic information tableaCorresponding dataCopy one more line to match the right

Pandas data analysis - detailed explanation of merge data mosaic

Results

Pandas data analysis - detailed explanation of merge data mosaic

2、 Left and right

'left'and'right'OfmergeIn fact, the ways are similar. They are calledLeft joinandRight join。 These two methods can be converted to each other, so they are introduced here.

'left'mergeWhenThe key of the left table is the benchmarkPair with the missing value if the key in the left table does not exist on the rightNaNFill.

'right'mergeWhenThe key of the table on the right is the benchmarkPair. If the key in the table on the right does not exist on the left, use the missing valueNaNFill.

What do you mean? Let’s use an example to explain. This is the data of the demonstration

Pandas data analysis - detailed explanation of merge data mosaic

Now use'left'Bymerge

In [21]: df_1.merge(df_2,how='left',on='userid')
Out[21]:
  userid  age  payment
0      a   23   2000.0
1      b   46      NaN
2      c   32   3500.0
3      d   19      NaN

Process diagram:

① Pair based on all keys in the table on the left. In the figure, because theeIt is not in the left table, so it will not be paired.

Pandas data analysis - detailed explanation of merge data mosaic

② If thepaymentColumns are merged into the left table with missing values for those without matching valuesNaNFill

Pandas data analysis - detailed explanation of merge data mosaic

Process summary:

Pandas data analysis - detailed explanation of merge data mosaic

about'right'Typemergeand'left'In fact, it is almost the same. As long as you change the positions of the two tables, the results returned by the two methods are the same (), as follows:

In [22]: df_2.merge(df_1,how='right',on='userid')
Out[22]:
  userid  payment  age
0      a   2000.0   23
1      c   3500.0   32
2      b      NaN   46
3      d      NaN   19

As for'left'and'right'As for the following'outer')The connected key is one to many case, principle and upper'inner'It’s similar, so I won’t repeat it here.

Three, outer

'outer'yesExternal connectionIn the process of splicing, it will takeUnion of keys of two tablesSplicing. Look at the text is not intuitive, or the example!

Use the demo data used above

Pandas data analysis - detailed explanation of merge data mosaic

This use'outer'Conductmerge

In [24]: df_1.merge(df_2,how='outer',on='userid')
Out[24]:
  userid   age  payment
0      a  23.0   2000.0
1      b  46.0      NaN
2      c  32.0   3500.0
3      d  19.0      NaN
4      e   NaN    600.0

The diagram is as follows:

① Take the union of two table keys, here is{a,b,c,d,e}

Pandas data analysis - detailed explanation of merge data mosaic

② Put the data columns of the two tables together, and use the missing values where there is no matchingNaNFill in

Pandas data analysis - detailed explanation of merge data mosaic

I think the little friends who can read here have basically understoodmergeThe whole process, in summary,mergeThe difference is that when splicing,The key sets of the two selected tables are different。 About pandasmergeThat’s it!

Originality is not easy. If you think it’s a little bit useful, I hope you can give me some praise. Thank you old fellow.

Scan code is concerned about the public number “Python reading money”, the first time to get dry goods, you can also add Python learning exchange group!
Pandas data analysis - detailed explanation of merge data mosaic