Pandas data analysis: a detailed explanation of the easy-to-use groupby

Time:2021-2-26

WeChat official account: “Python reads money”



If you have any questions or suggestions, please leave a message on the official account.

In day-to-day data analysis, it is often necessary to divide data into groups according to one or more fields. In e-commerce, for example, national sales are broken down by province and the sales trend of each province is analyzed. In social products, users are segmented by profile (gender, age) to study their usage and preferences. In pandas, this kind of processing is mainly done with groupby. This article introduces the basic principle of groupby and the corresponding agg, transform, and apply operations.

For the illustrations below, a small set of simulated sample data is used. The values are drawn at random, so rerunning the code gives different numbers. The code and data are as follows:

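import numpy as np
import pandas as pd

company = ["A", "B", "C"]

# Values are drawn at random, so each run produces different data
data = pd.DataFrame({
    "company": [company[x] for x in np.random.randint(0, len(company), 10)],
    "salary": np.random.randint(5, 50, 10),
    "age": np.random.randint(15, 50, 10)
})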
  company  salary  age
0       C      43   35
1       C      17   25
2       C       8   30
3       A      20   22
4       B      10   17
5       B      21   40
6       A      23   33
7       C      49   19
8       B       8   30
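Because the data is random, running the generation code will not reproduce the table above. As a minimal sketch for following along, the printed rows can be typed in by hand (this hand-built frame is an addition for reproducibility, not part of the original code):

```python
import pandas as pd

# The nine rows printed above, entered by hand so later results are reproducible
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

print(data)
```

With this frame, the group means and medians shown in the later sections can be checked directly.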

1、 Basic principles of groupby

In pandas, grouping takes only one line of code. Here, the data set above is grouped by the company field:

In [5]: group = data.groupby("company")

After entering the code above in ipython, you get a DataFrameGroupBy object:

In [6]: group
Out[6]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002B7E2650240>

So what is this DataFrameGroupBy object? What happens to data after groupby? What ipython returns is only its memory address, which does not help intuition. To see what is inside group, convert it to a list:

In [8]: list(group)
Out[8]:
[('A',   company  salary  age
  3       A      20   22
  6       A      23   33), 
 ('B',   company  salary  age
  4       B      10   17
  5       B      21   40
  8       B       8   30), 
 ('C',   company  salary  age
  0       C      43   35
  1       C      17   25
  2       C       8   30
  7       C      49   19)]

After conversion to a list, you can see that it consists of three tuples. In each tuple, the first element is the group key (grouping here is by company, so the keys are A, B, and C), and the second element is the DataFrame of the corresponding group. The whole process can be illustrated as follows:

[Figure: groupby splits the original DataFrame into one sub-DataFrame per company]
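The same (key, sub-DataFrame) pairs can also be inspected without building a list. The following is a small sketch, assuming the sample rows from the table above are entered by hand; it uses iteration, get_group, and ngroups on the grouped object:

```python
import pandas as pd

# Sample rows from the table above, entered by hand
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

group = data.groupby("company")

# Iterating yields (group key, sub-DataFrame) pairs, like list(group)
for name, sub_df in group:
    print(name, sub_df.shape)

# get_group fetches the sub-DataFrame for a single key
company_a = group.get_group("A")
print(company_a)

# ngroups is the number of distinct keys
n_groups = group.ngroups
print(n_groups)
```

Each sub-DataFrame keeps the row labels of the original data, which is what later allows transform to put results back in the original order.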

In summary, groupby splits the original DataFrame into several grouped DataFrames according to the groupby field (company in this case). There are as many grouped DataFrames as there are groups. Every operation that follows groupby (such as agg, apply, and so on) is therefore an operation on these sub-DataFrames. Once this is understood, the main principle of groupby in pandas is basically clear. The following sections cover the common operations after groupby.

2、 agg aggregation operation

Aggregation is a very common operation after groupby, and anyone who writes SQL should be very familiar with it. Aggregation can compute sums, means, maximums, minimums, and so on. The following table lists the common aggregation functions in pandas.

function    purpose
min         minimum
max         maximum
sum         sum
mean        mean
median      median
std         standard deviation
var         variance
count       count
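Several of the functions in this table can be applied in one call by passing a list of their names to agg. A minimal sketch, assuming the sample rows shown earlier are entered by hand:

```python
import pandas as pd

# Sample rows from the table above, entered by hand
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

# One output column per function name, one row per company
summary = data.groupby("company")["salary"].agg(["min", "max", "mean"])
print(summary)
```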

For the sample data set, the average age and average salary of employees in each company can be found with the following code:

In [12]: data.groupby("company").agg('mean')
Out[12]:
         salary    age
company
A         21.50  27.50
B         13.00  29.00
C         29.25  27.25

To apply different functions to different columns, for example the mean age and the median salary of employees in each company, specify the aggregation with a dictionary:

In [17]: data.groupby('company').agg({'salary':'median','age':'mean'})
Out[17]:
         salary    age
company
A          21.5  27.50
B          10.0  29.00
C          30.0  27.25

The agg aggregation process can be illustrated as follows (using the second example):

[Figure: agg computes one value per company for each specified column]
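A variation on the dictionary form is pandas named aggregation, which also lets you choose the output column names. The sketch below assumes the sample rows shown earlier; the names median_salary and avg_age are illustrative choices, not from the original article:

```python
import pandas as pd

# Sample rows from the table above, entered by hand
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

# new_column_name=(source_column, function) for each output column
result = data.groupby("company").agg(
    median_salary=("salary", "median"),
    avg_age=("age", "mean"),
)
print(result)
```

The values match the dictionary example above; only the column labels differ.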

3、 Transform

What kind of operation is transform, and how does it differ from agg? To make the difference easier to understand, the following compares the two in a practical scenario.

The agg section above showed how to calculate the average salary of employees in each company. Suppose we now need to add a new column avg_salary to the original data set, holding the average salary of the company each employee works for (employees of the same company share the same value). How can this be done? Following the usual steps, you first compute the average salary of each company and then fill it in according to the mapping between employees and companies. Without transform, the code is as follows:

In [21]: avg_salary_dict = data.groupby('company')['salary'].mean().to_dict()

In [22]: data['avg_salary'] = data['company'].map(avg_salary_dict)

In [23]: data
Out[23]:
  company  salary  age  avg_salary
0       C      43   35       29.25
1       C      17   25       29.25
2       C       8   30       29.25
3       A      20   22       21.50
4       B      10   17       13.00
5       B      21   40       13.00
6       A      23   33       21.50
7       C      49   19       29.25
8       B       8   30       13.00

With transform, only one line of code is required:

In [24]: data['avg_salary'] = data.groupby('company')['salary'].transform('mean')

In [25]: data
Out[25]:
  company  salary  age  avg_salary
0       C      43   35       29.25
1       C      17   25       29.25
2       C       8   30       29.25
3       A      20   22       21.50
4       B      10   17       13.00
5       B      21   40       13.00
6       A      23   33       21.50
7       C      49   19       29.25
8       B       8   30       13.00

Let's look at how transform works after groupby in a diagram. To make it more intuitive, the company column is included in the figure; the code above actually operates only on the salary column:

[Figure: transform broadcasts each company's mean salary back to every row of that company]

The large box in the figure marks where transform and agg differ. agg computes the mean for companies A, B, and C and returns those three values directly. transform instead produces a result for every row, with rows in the same group sharing the same value: after computing the mean within each group, it returns the results in the order of the original index. If this is unclear, compare this figure with the one for agg.
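The difference in output shape can be checked directly. The following sketch assumes the sample rows shown earlier; the column name salary_vs_avg is a hypothetical addition used only to show why a row-aligned result is convenient:

```python
import pandas as pd

# Sample rows from the table above, entered by hand
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

# agg: one value per company
per_company = data.groupby("company")["salary"].agg("mean")

# transform: one value per original row, aligned to the original index
per_row = data.groupby("company")["salary"].transform("mean")

print(len(per_company), len(per_row))

# Because per_row shares the index of data, it can be used in row-wise arithmetic
data["salary_vs_avg"] = data["salary"] - per_row
print(data)
```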

4、 Apply

apply should be an old friend of everyone. It is more flexible than agg and transform, and it can take any custom function to perform complex data operations. An earlier article, "Three axes of pandas data processing", introduced how to use apply. So how does apply after groupby differ from what was described there?

There are some differences, but the underlying principle is basically the same. The difference is that for apply after groupby, each grouped sub-DataFrame is passed to the specified function as the argument, so the basic unit of operation is a DataFrame, whereas for the apply described before, the basic unit of operation is a Series. Again, an example illustrates the use of apply after groupby.

Suppose we need the record of the oldest employee in each company. How can this be done? It can be implemented with the following code:

In [38]: def get_oldest_staff(x):
    ...:     df = x.sort_values(by = 'age',ascending=True)
    ...:     return df.iloc[-1,:]
    ...:

In [39]: oldest_staff = data.groupby('company',as_index=False).apply(get_oldest_staff)

In [40]: oldest_staff
Out[40]:
  company  salary  age
0       A      23   33
1       B      21   40
2       C      43   35

This gives the record of the oldest employee in each company. The whole process is illustrated as follows:

[Figure: apply passes each company's sub-DataFrame to get_oldest_staff and combines the returned rows]

As you can see, the principle of apply here is basically the same as in the previous article, except that the argument passed to the function changes from a Series to a grouped DataFrame.
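The Series-versus-DataFrame difference can be observed by having the applied function report the type of its argument. This is a small sketch assuming the sample rows shown earlier; the salary and age columns are selected explicitly so that the function sees the same columns in both calls:

```python
import pandas as pd

# Sample rows from the table above, entered by hand
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

cols = ["salary", "age"]

# DataFrame.apply hands each column to the function as a Series
plain = data[cols].apply(lambda s: type(s).__name__)

# groupby(...).apply hands each group's rows to the function as a DataFrame
grouped = data.groupby("company")[cols].apply(lambda g: type(g).__name__)

print(plain)
print(grouped)
```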

Finally, a small suggestion about apply: although apply is more flexible, it runs slower than agg and transform. So when a problem after groupby can be solved with agg or transform, prefer those two methods, and consider apply only when they cannot solve it.
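As one illustration of this advice, the oldest-employee lookup can be written without apply. The sketch below is an alternative technique, not the article's method: it uses the built-in idxmax reduction to find the row label of the maximum age in each company and then selects those rows, assuming the sample rows shown earlier:

```python
import pandas as pd

# Sample rows from the table above, entered by hand
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary": [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age": [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

# Row label of the largest age within each company
oldest_idx = data.groupby("company")["age"].idxmax()

# Select those rows from the original data
oldest_staff = data.loc[oldest_idx]
print(oldest_staff)
```

Unlike the apply version, the result keeps the original row labels. When two employees in a company share the maximum age, idxmax picks the first such row, which may differ from the row the sort-based function returns.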


This work is licensed under a CC agreement. Reprints must credit the author and link to this article.