Common operations with the Python pandas library

Time:2020-10-23

This article describes common operations with the Python pandas library. The details are as follows, for your reference:

1. Overview

Pandas is the core data analysis support library of Python. It provides fast, flexible and expressive data structures, and aims to make working with relational and labeled data simple and intuitive. Pandas is commonly used to process matrix data with row and column labels, and tabular data similar to SQL or Excel tables. It is used for data sorting and cleaning, data analysis and modeling, and data visualization and reporting in fields such as finance, statistics, social science and engineering.

Data types: Pandas does not change the original input data; it copies the data to generate new objects. A one-dimensional array of objects becomes a Series. A two-dimensional table composed of Series is called a DataFrame; its row labels form the index and its column labels the columns.

Installation: if you use the Anaconda integrated environment, data science packages such as NumPy, SciPy and pandas are installed automatically. You can also install pandas through the Python package management tool:


pip install pandas
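A quick way to check that the installation worked is to import the packages and print their versions (the version strings will of course vary by environment):

```python
import numpy as np
import pandas as pd

# Print the installed versions to confirm the imports work
print(pd.__version__)
print(np.__version__)
```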

2. Creation of data objects

A Series object can be created by wrapping a one-dimensional array with the Series() constructor; the elements of the array can be of various types.

A DataFrame object can be created by wrapping a two-dimensional array with the DataFrame() constructor. Row labels and column labels can be specified with the index and columns parameters. You can also initialize a DataFrame from a Python dictionary, whose keys become the column labels by default.

import pandas as pd
import numpy as np
 
#Initialize a Series from a one-dimensional array
s = pd.Series([1, 2.0, np.nan, 'test'])
print(s)
 
#Initialize a DataFrame from a two-dimensional array
arr = np.random.randn(6, 4)
arr_df = pd.DataFrame(arr, index=np.arange(1, 7), columns=list('ABCD'))
print(arr_df)
#Initialize a DataFrame from a dictionary
dic = {'A': 1.,
       'B': pd.Timestamp('20130102'),
       'C': pd.Series(1, index=list(range(4)), dtype='float32'),
       'D': np.array([3] * 4, dtype='int32'),
       'E': pd.Categorical(["test", "train", "test", "train"])
       }
dic_df = pd.DataFrame(dic)
print(dic_df)

The operation results are as follows:

#Series data
0    1
1    2
2   NaN
3  test
dtype: object
#DataFrame from a two-dimensional array
     A     B     C     D
1 -0.085417 -0.816502 1.495134 -0.277742
2 1.657144 -0.203346 0.631930 -1.182239
3 -2.303923 -0.535696 1.315379 0.129682
4 0.133198 -0.239664 -2.004494 0.119965
5 -1.454717 2.114255 -0.538678 -0.580361
6 -0.759183 0.141554 -0.243270 2.840325
#DataFrame from a dictionary
   A     B  C D   E
0 1.0 2013-01-02 1.0 3  test
1 1.0 2013-01-02 1.0 3 train
2 1.0 2013-01-02 1.0 3  test
3 1.0 2013-01-02 1.0 3 train
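As the mixed output above suggests, each column of a DataFrame keeps its own data type. A small sketch of inspecting them, reusing the dictionary example:

```python
import numpy as np
import pandas as pd

dic = {'A': 1.,
       'B': pd.Timestamp('20130102'),
       'C': pd.Series(1, index=list(range(4)), dtype='float32'),
       'D': np.array([3] * 4, dtype='int32'),
       'E': pd.Categorical(["test", "train", "test", "train"])}
dic_df = pd.DataFrame(dic)
# dtypes lists the data type stored in each column
print(dic_df.dtypes)
```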

3. View data

The function head(n) shows the first n rows of a DataFrame, and tail(n) shows the last n rows.

The index attribute shows the row labels of a DataFrame, and columns shows the column labels.

describe() displays per-column statistics of the data, including the count, mean, standard deviation, minimum, maximum, etc.

The function mean() returns the mean of each column, and mean(1) (i.e. axis=1) the mean of each row.

sum() computes the sum of each column, and sum(1) the sum of each row.

A DataFrame has an empty property to determine whether it is empty; it returns True when the frame contains no elements.


arr = np.random.randn(6, 4)
df = pd.DataFrame(arr, index=np.arange(1, 7), columns=list('ABCD'))
print(df.head(3))
print(df.index)
print(df.describe())

The result is as follows:

#View the first three rows of data
A     B     C     D
1 3.260449 -0.619396 0.070877 1.586914
2 -0.529708 0.071917 -1.919316 1.845727
3 -1.005765 2.176579 -0.323483 -1.295067
#View row labels
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
#View statistics
       A     B     C     D
count 6.000000 6.000000 6.000000 6.000000
mean -0.184606 -0.487184 0.079433 0.855810
std  1.721394 1.800460 1.379498 1.128764
min  -1.443635 -3.091446 -1.919316 -1.295067
25%  -0.967105 -1.430192 -0.281188 0.778729
50%  -0.694488 -0.273739 -0.041713 1.150944
75%  -0.531744 0.197755 0.355731 1.508475
max  3.260449 2.176579 2.352142 1.845727
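The remaining helpers described above (mean(), sum() and the empty property) can be sketched as follows; the actual numbers depend on the random input:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4), index=np.arange(1, 7), columns=list('ABCD'))
print(df.mean())             # per-column means
print(df.mean(axis=1))       # per-row means; same as df.mean(1)
print(df.sum())              # per-column sums
print(df.sum(axis=1))        # per-row sums
print(df.empty)              # False: this frame has data
print(pd.DataFrame().empty)  # True for a frame with no elements
```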

4. Data selection

You can select columns or rows directly through the DataFrame object.

#Select column A; equivalent to df['A']
print(df.A)
#Select rows at positions 1 and 2; positions start at 0 and the end of the slice is excluded
print(df[1:3])
 
'''
#The column labeled 'A'
1  0.644427
2  0.643149
3  1.374668
4  -0.154465
5  -0.338085
6  -1.989284
Name: A, dtype: float64
#Rows at positions 1-2 (labels 2 and 3)
     A     B     C     D
2 0.643149 1.769001 -0.166028 -0.036854
3 1.374668 -0.118593 -0.202222 0.308353
'''
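Passing a list of labels selects several columns at once; a small sketch using the same kind of random frame as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4), index=np.arange(1, 7), columns=list('ABCD'))
# A list of column labels returns a DataFrame holding just those columns
sub = df[['A', 'C']]
print(sub)
```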

With the loc[] indexer you can select a row, a column, several rows or columns, or a single value of the DataFrame by label.

#Select the row labeled 2
print(df.loc[2])
#Select the rows labeled 1 to 3 and the columns labeled 'A' and 'B'
print(df.loc[1:3, ['A', 'B']])
#Get the single value at row label 1, column label 'A'; equivalent to df.at[1, 'A']
print(df.loc[1, 'A'])
 
'''
#A line labeled 2
A  0.681469
B  -0.053046
C  -1.384877
D  -0.447700
Name: 2, dtype: float64
#Rows labeled 1 to 3, columns labeled 'A' and 'B'
     A     B
1 0.710907 -0.950896
2 0.681469 -0.053046
3 0.781981 0.123072
#The single value at row label 1, column label 'A'
0.7109074858947351
'''

In addition to selecting by row and column labels, values can also be selected by integer position with iloc[].

#Select the first row; positions start at 0
print(df.iloc[0])
#Show columns 0 and 2 of rows 1, 2 and 4
print(df.iloc[[1, 2, 4], [0, 2]])
#Show the single value at row 1, column 1; equivalent to df.iat[1, 1]
print(df.iloc[1, 1])

You can also filter the data with a boolean condition at selection time:

#Output all rows with column a greater than 0
print(df[df.A > 0])
df['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
#Output all rows whose E column contains 'two' or 'four'
print(df[df['E'].isin(['two', 'four'])])
 
'''
     A     B     C     D
3 0.168998 -0.732362 -0.098542 0.413128
5 0.513677 -0.163231 -0.098037 -0.606693

     A     B     C     D   E
3 0.168998 -0.732362 -0.098542 0.413128  two
5 0.513677 -0.163231 -0.098037 -0.606693 four
'''
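Boolean conditions can also be combined with & and |, wrapping each condition in parentheses; a sketch using the same kind of random frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4), index=np.arange(1, 7), columns=list('ABCD'))
# Rows where both A and B are positive; the parentheses are required
both = df[(df.A > 0) & (df.B > 0)]
# Rows where A or B is positive
either = df[(df.A > 0) | (df.B > 0)]
print(both)
print(either)
```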

5. Manipulating data

The insert() method inserts a column at a specified position. You can also assign an array directly to the DataFrame; the new column is added as the last column by default.

You can locate a row or column with the selection methods loc and iloc described above, and then assign to it directly: if data already exists at that location it is modified, otherwise it is added.

Data can be deleted with the drop() method: the index parameter specifies rows to delete, columns specifies columns to delete, and inplace determines whether the operation modifies the original DataFrame. The default is False, in which case a variable is needed to receive the result.

df = pd.DataFrame(data=[['lisa', 'f', 22], ['joy', 'f', 22], ['tom', 'm', 21]],
                  index=[1, 2, 3], columns=['name', 'sex', 'age'])
citys = ['ny','zz','xy']
#Insert a column named 'city' with the values citys at position 0
df.insert(0, 'city', citys)
jobs = ['student', 'AI', 'teacher']
#Add a column named 'job' with the values jobs; by default it becomes the last column
df['job'] = jobs
#If no row with index 4 exists in df, it is added; otherwise it is modified
df.loc[4] = ['zz', 'mason', 'm', 24, 'engineer']
print(df)
#Delete row with row label 1
dp=df.drop(index=1)
print(dp)
#Delete columns with column label sex on the original dataset
df.drop(columns=['sex'],inplace=True)
print(df)

The results are as follows:

#Added data
 city  name sex age    job
1  ny  lisa  f 22  student
2  zz  joy  f 22    AI
3  xy  tom  m 21  teacher
4  zz mason  m 24 engineer
#Delete first line
 city  name sex age    job
2  zz  joy  f 22    AI
3  xy  tom  m 21  teacher
4  zz mason  m 24 engineer
#Delete sex column
 city  name age    job
1  ny  lisa 22  student
2  zz  joy 22    AI
3  xy  tom 21  teacher
4  zz mason 24 engineer

The transpose of a DataFrame is obtained with .T.

sort_index(axis=1, ascending=False) sorts by label; axis=0 sorts by row labels and axis=1 by column labels.

sort_values(by='A') sorts the data by the values of the column labeled 'A'.

The apply() function applies a function along each column of the DataFrame (or each row with axis=1).

print(df.T)
si=df.sort_index(axis=1, ascending=False)
print(si)
sv=df.sort_values(by='A')
print(sv)
#Use an anonymous function to subtract each column's minimum from its maximum
print(df.apply(lambda x: x.max() - x.min()))
'''
#Data transposition
     1     2     3     4     5     6
A -1.176180 -1.301768 0.907088 -1.528101 1.098978 -1.280193
B -0.461954 -0.749642 1.169118 -0.297765 0.531088 -0.999842
C -1.715094 -0.512856 0.511861 -0.247240 1.696772 -0.902995
D 1.336999 0.209091 2.254337 0.649625 -0.049886 -1.514815
#Reverse column labels
    D     C     B     A
1 1.336999 -1.715094 -0.461954 -1.176180
2 0.209091 -0.512856 -0.749642 -1.301768
3 2.254337 0.511861 1.169118 0.907088
4 0.649625 -0.247240 -0.297765 -1.528101
5 -0.049886 1.696772 0.531088 1.098978
6 -1.514815 -0.902995 -0.999842 -1.280193
#Sort rows incrementally by the value of column a
    A     B     C     D
4 -1.528101 -0.297765 -0.247240 0.649625
2 -1.301768 -0.749642 -0.512856 0.209091
6 -1.280193 -0.999842 -0.902995 -1.514815
1 -1.176180 -0.461954 -1.715094 1.336999
3 0.907088 1.169118 0.511861 2.254337
5 1.098978 0.531088 1.696772 -0.049886
#Application of function 
A  2.073961
B  2.671590
C  1.785291
D  0.000000
F  4.000000
dtype: float64
'''

Pandas' concat() function can concatenate two DataFrames of the same type along the row axis.

The merge() function joins different DataFrames on common columns (an SQL-style join).

The append() function appends rows at the end of a DataFrame.

#Concatenate the first row and the second-to-last row
print(pd.concat([df[:1], df[-2:-1]]))
#Append the fourth row (position 3) to the end
print(df.append(df.iloc[3]))
#Merge two DataFrames on their common column
df1 = pd.DataFrame({'row1': ['foo', 'bar'], 'row2': [1, 2]})
df2 = pd.DataFrame({'row1': ['foo', 'bar'], 'row3': [4, 5]})
print(pd.merge(df1, df2))
 
'''
#Stitching by row
     A     B     C     D
1 -0.527221 -0.754650 -2.385270 -2.569586
5 0.054059 1.443911 -0.240856 -1.501045
#Appended
     A     B     C     D
1 -0.527221 -0.754650 -2.385270 -2.569586
2 2.123332 -0.013431 -0.574359 -0.548838
3 -0.244057 -0.267805 1.089026 -0.022174
4 -0.789228 1.171906 0.526318 0.046655
5 0.054059 1.443911 -0.240856 -1.501045
6 0.756844 0.623305 -0.597299 0.034326
4 -0.789228 1.171906 0.526318 0.046655
#Merged on the common column
 row1 row2 row3
0 foo   1   4
1 bar   2   5
'''
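Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0; the same row append can be written with concat(). A sketch, using the same random-frame setup as the earlier examples:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4), index=np.arange(1, 7), columns=list('ABCD'))
# df.iloc[[3]] returns the fourth row as a one-row DataFrame,
# so it can be concatenated back onto the end of df
appended = pd.concat([df, df.iloc[[3]]])
print(appended)
```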

The groupby() function groups the data by one or more columns, and the grouped result can be iterated with a for loop. Each item in the iteration is a (key, DataFrame) tuple, and the DataFrame can be operated on further.

stack() compresses multi-column data into a two-column display: it pivots the column labels into an inner row index, producing a Series.

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
          'B': ['one', 'two', 'one', 'three'],
          'C': np.random.randn(4),
          'D': np.random.randn(4)})
#Group by columns A and B
dg = df.groupby(['A', 'B'])
for (key, group) in dg:
    print(group)
#Compress the columns into rows
print(df.stack())
 
'''
#Group by column
   A   B     C     D
3 bar three 0.802027 1.338614
   A  B     C     D
1 bar two -0.567295 0.608978
   A  B    C     D
0 foo one -0.17592 -0.191991
2 foo one -0.72258 0.711312
#Compressed into two columns
0 A     foo
  B     one
  C   0.622471
  D   0.10633
1 A     bar
  B     two
  C   0.065516
  D  -0.844223
2 A     foo
  B     one
  C  0.0013226
  D   -1.3328
3 A     bar
  B    three
  C  -0.678077
  D   0.785117
dtype: object
'''
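Instead of iterating, the grouped object can also be aggregated directly; a sketch computing per-group sums of the numeric columns with the same example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'two', 'one', 'three'],
                   'C': np.random.randn(4),
                   'D': np.random.randn(4)})
# Sum columns C and D within each (A, B) group
sums = df.groupby(['A', 'B'])[['C', 'D']].sum()
print(sums)
```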

Pandas mainly uses the value np.nan to represent missing data. The dropna(how='any') method deletes all rows containing null values, and dropna(axis=1) deletes columns containing null values. fillna(value=x) fills all null values with the specified value x.
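A minimal sketch of these missing-data helpers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})
# Drop every row that contains at least one NaN
dropped_rows = df.dropna(how='any')
# Drop every column that contains a NaN
dropped_cols = df.dropna(axis=1)
# Replace every NaN with 0
filled = df.fillna(value=0)
print(dropped_rows)
print(dropped_cols)
print(filled)
```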

6. Others

Pandas can easily read and write files in other formats:

#Write dataframe to CSV file
df.to_csv('foo.csv')
#Reading data from a CSV file
df = pd.read_csv('foo.csv')
#Reading and writing an Excel file
df = pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
df.to_excel('foo.xlsx', sheet_name='Sheet1')

Pandas provides a convenient time-series generation function, date_range(): the first parameter is the start time, periods is the number of values to generate, and freq is the time interval, which defaults to days.

#Generate five timestamps starting from January 1, 2019, at one-second intervals
rng = pd.date_range('1/1/2019', periods=5, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
print(ts)
 
'''
2019-01-01 00:00:01  161
2019-01-01 00:00:02  214
2019-01-01 00:00:03  110
2019-01-01 00:00:04  265
Freq: S, dtype: int32
'''

Pandas combined with matplotlib makes it easy to plot data:

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
#Take the cumulative sum so the plot shows a trend
ts = ts.cumsum()
#Call matplotlib to draw the plot
ts.plot()
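In a plain script (outside a notebook), the figure also needs to be shown or saved explicitly. A sketch using matplotlib directly; the 'Agg' backend and the file name ts.png are illustrative choices here:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2019', periods=100, freq='s')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng).cumsum()
ts.plot()
plt.savefig('ts.png')  # or plt.show() in an interactive session
```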


I hope this article is helpful for your Python programming.