Pre machine learning (4): master the usage of pandas in one article

Time: 2021-9-18

Pandas provides fast, flexible, and expressive data structures and is a powerful Python library for data analysis.

This article is included in the Machine learning pre tutorial series.

1、 Series and DataFrame

Pandas is built on NumPy. For more NumPy knowledge, please refer to my earlier article Pre machine learning (3): master common NumPy usage in 30 minutes.
Pandas is especially suitable for processing tabular data such as SQL tables and Excel spreadsheets, ordered or unordered time series, and arbitrary matrix data with row and column labels.

Open Jupyter Notebook and import numpy and pandas to start our tutorial:

import numpy as np
import pandas as pd

1. pandas.Series

A Series is a one-dimensional ndarray with an index. The index values need not be unique, but they must be hashable.

pd.Series([1, 3, 5, np.nan, 6, 8])

Output:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

We can see that the default index values are the numbers 0, 1, 2, 3, 4, and 5. Let's add the index parameter, specifying 'c', 'a', 'i', 'yong', 'j', 'i'.

pd.Series([1, 3, 5, np.nan, 6, 8], index=['c','a','i','yong','j','i'])

The output is as follows. We can see that the index is repeatable.

c       1.0
a       3.0
i       5.0
yong    NaN
j       6.0
i       8.0
dtype: float64
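Because labels can repeat, selecting a duplicated label returns all matching rows. A minimal sketch, assuming the Series above is saved as s:

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['c','a','i','yong','j','i'])
s['i']  # returns a Series with both rows labeled 'i': 5.0 and 8.0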

2. pandas.DataFrame

A DataFrame is a table structure with rows and columns. It can be understood as a dictionary of multiple Series objects that share the same index.

pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index=['i','ii','iii'], columns=['A', 'B', 'C'])

The output table is as follows, where index corresponds to the rows and columns corresponds to the columns.

A B C
i 1 2 3
ii 4 5 6
iii 7 8 9
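To illustrate the "dictionary of Series" view mentioned above, here is a minimal sketch building the same table from a dict of Series: the dict keys become column names, and the Series indexes are aligned into the row index.

# Each Series becomes one column; indexes are aligned automatically
col_a = pd.Series([1, 4, 7], index=['i', 'ii', 'iii'])
col_b = pd.Series([2, 5, 8], index=['i', 'ii', 'iii'])
col_c = pd.Series([3, 6, 9], index=['i', 'ii', 'iii'])
pd.DataFrame({'A': col_a, 'B': col_b, 'C': col_c})  # same table as above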

2、 Common usage of pandas

1. Access data

Prepare the data: randomly generate a 6-row, 4-column two-dimensional array. The row labels are the dates 2021-01-01 through 2021-01-06, and the column labels are A, B, C, and D.

import numpy as np
import pandas as pd
np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range('20210101', periods=6), columns=list('ABCD'))
df

The display table is as follows:

A B C D
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-03 0.325415 -0.602236 -0.134508 1.28121
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804
2021-01-05 0.348708 1.27175 0.626011 -0.253845
2021-01-06 -0.816064 1.30197 0.656281 -1.2718

1.1 head() and tail()

To view the first few rows of the table:

df.head(2)

The display table is as follows:

A B C D
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-02 0.696541 0.136352 -1.64592 -0.69841

View the last few rows of the table:

df.tail(3)

The display table is as follows:

A B C D
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804
2021-01-05 0.348708 1.27175 0.626011 -0.253845
2021-01-06 -0.816064 1.30197 0.656281 -1.2718

1.2 describe()

The describe method generates descriptive statistics for the DataFrame, making it easy to view the distribution of the data set. Note that the statistics here exclude NaN values.

df.describe()

The display is as follows:

A B C D
count 6 6 6 6
mean 0.0825402 0.0497552 -0.181309 0.22896
std 0.551412 1.07834 0.933155 1.13114
min -0.816064 -1.40384 -1.64592 -1.2718
25% -0.18 -0.553043 -0.737194 -0.587269
50% 0.298188 -0.134555 0.106933 0.287363
75% 0.342885 0.987901 0.556601 1.16805
max 0.696541 1.30197 0.656281 1.48804

Let's first review the relevant mathematical formulas.

Average (mean)

$$\bar x = \frac{\sum_{i=1}^{n}{x_i}}{n}$$

Variance:

$$s^2 = \frac{\sum_{i=1}^{n}{(x_i -\bar x)^2}}{n}$$

Standard deviation (STD):

$$s = \sqrt{\frac{\sum_{i=1}^{n}{(x_i -\bar x)^2}}{n}}$$
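One caveat: pandas' std() (and the std row of describe()) divides by n-1 rather than n, i.e. it computes the sample standard deviation (ddof=1). A quick sketch verifying this against column A:

a = df['A']
manual = np.sqrt(((a - a.mean()) ** 2).sum() / (len(a) - 1))  # denominator n-1
print(np.isclose(manual, a.std()))  # True: pandas defaults to ddof=1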

Let's explain the meaning of each describe statistic, taking column A as an example.

  • count indicates the count: all 6 values in column A are non-null.
  • mean represents the average value. The average of all non-null data in column A is 0.0825402.
  • std represents the standard deviation. For column A it is 0.551412.
  • min represents the minimum value. For column A it is -0.816064; no value is smaller than -0.816064.
  • 25% represents the first quartile. For column A it is -0.18; that is, 25% of the data is smaller than -0.18.
  • 50% represents the median. For column A it is 0.298188; that is, 50% of the data is smaller than 0.298188.
  • 75% represents the third quartile. For column A it is 0.342885; that is, 75% of the data is smaller than 0.342885.
  • max represents the maximum value. For column A it is 0.696541; no value is larger than 0.696541.

1.3 T

T is an abbreviation of transpose: it converts rows to columns and vice versa.

df.T

The display table is as follows:

2021-01-01 2021-01-02 2021-01-03 2021-01-04 2021-01-05 2021-01-06
A 0.270961 0.696541 0.325415 -0.33032 0.348708 -0.816064
B -0.405463 0.136352 -0.602236 -1.40384 1.27175 1.30197
C 0.348373 -1.64592 -0.134508 -0.93809 0.626011 0.656281
D 0.828572 -0.69841 1.28121 1.48804 -0.253845 -1.2718

1.4 sort_values()

Specify a column to sort by. The following code sorts by column C in ascending order.

df.sort_values(by='C')

The display table is as follows:

A B C D
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804
2021-01-03 0.325415 -0.602236 -0.134508 1.28121
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-05 0.348708 1.27175 0.626011 -0.253845
2021-01-06 -0.816064 1.30197 0.656281 -1.2718
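sort_values also supports descending order and sorting by several columns at once. A small sketch (the parameter choices here are illustrative):

# Sort by C descending; ties would be broken by A descending
df.sort_values(by=['C', 'A'], ascending=False)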

1.5 nlargest()

Select the n rows with the largest values in a column. For example, df.nlargest(2,'A') returns the 2 rows with the largest values in column A.

df.nlargest(2,'A')

The display table is as follows:

A B C D
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-05 0.348708 1.27175 0.626011 -0.253845

1.6 sample()

The sample method returns a random sample of the data.

df.sample(5) returns 5 random rows.

df.sample(5)

The frac parameter means fraction; frac=0.01 returns a random 1% of the rows. (On this 6-row df that rounds down to zero rows, so frac is mainly useful on large data sets.)

df.sample(frac=0.01)
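For reproducible samples, sample also accepts a random_state seed. A small sketch:

df.sample(3, random_state=42)  # the same 3 rows are returned on every run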

2. Select data

2.1 Selecting by label

We enter the command df['A'] to select column A.

df['A']

The output of column A's data is itself a Series object:

2021-01-01    0.270961
2021-01-02    0.696541
2021-01-03    0.325415
2021-01-04   -0.330320
2021-01-05    0.348708
2021-01-06   -0.816064
Name: A, dtype: float64

df[0:3] behaves the same as df.head(3); however, df[0:3] uses NumPy's array-slicing syntax, which shows pandas' good support for NumPy conventions.

df[0:3]

The display table is as follows:

A B C D
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-03 0.325415 -0.602236 -0.134508 1.28121

Specify row and column labels through the loc method.

df.loc['2021-01-01':'2021-01-02', ['A', 'B']]

The display table is as follows:

A B
2021-01-01 0.270961 -0.405463
2021-01-02 0.696541 0.136352
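loc can also fetch a single scalar; for one value, the at accessor is a faster alternative. A minimal sketch on the same df:

df.loc['2021-01-02', 'B']               # scalar selection via loc
df.at[pd.Timestamp('2021-01-02'), 'B']  # at requires the exact label type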

2.2 Selecting by position

iloc differs from loc: loc specifies labels, while iloc specifies positional indexes. df.iloc[3:5, 0:3] selects the rows with index 3 and 4 and the columns with index 0, 1, and 2; that is, rows 4 and 5 and columns 1, 2, and 3.
Note that index numbers start at 0. The colon indicates an interval, with the start on the left and the end on the right. For example, 3:5 indicates the half-open interval [3, 5), which does not include 5 itself.

df.iloc[3:5, 0:3]

A B C
2021-01-04 -0.33032 -1.40384 -0.93809
2021-01-05 0.348708 1.27175 0.626011

df.iloc[:, 1:3]

B C
2021-01-01 -0.405463 0.348373
2021-01-02 0.136352 -1.64592
2021-01-03 -0.602236 -0.134508
2021-01-04 -1.40384 -0.93809
2021-01-05 1.27175 0.626011
2021-01-06 1.30197 0.656281

2.3 Boolean index

A DataFrame can be filtered by condition. Rows where the condition evaluates to True are returned; rows where it evaluates to False are filtered out.

We set up a filter to judge whether column A is greater than 0.

filter = df['A'] > 0
filter

The output shows that the rows for 2021-01-04 and 2021-01-06 are False.

2021-01-01     True
2021-01-02     True
2021-01-03     True
2021-01-04    False
2021-01-05     True
2021-01-06    False
Name: A, dtype: bool

We view the dataset through a filter.

df[filter]
# df[df['A'] > 0]

Looking at the table, we can see that the 2021-01-04 and 2021-01-06 rows have been filtered out.

A B C D
2021-01-01 0.270961 -0.405463 0.348373 0.828572
2021-01-02 0.696541 0.136352 -1.64592 -0.69841
2021-01-03 0.325415 -0.602236 -0.134508 1.28121
2021-01-05 0.348708 1.27175 0.626011 -0.253845
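Conditions can be combined with & (and) and | (or); each condition must be wrapped in parentheses. A small sketch:

# Rows where A is positive AND D is negative
df[(df['A'] > 0) & (df['D'] < 0)]  # keeps 2021-01-02 and 2021-01-05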

3. Processing missing values

Prepare data.

df2 = df.copy()
# Column E: 1.0 for the first three dates, NaN for the rest
df2.loc[df2.index[:3], 'E'] = 1.0
# Column F: values 1-5 starting from 2021-01-02, so the first date is NaN
df2['F'] = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=pd.date_range('2021-01-02', periods=5))
df2

The display table is as follows:

A B C D E F
2021-01-01 0.270961 -0.405463 0.348373 0.828572 1.0 NaN
2021-01-02 0.696541 0.136352 -1.64592 -0.69841 1.0 1.0
2021-01-03 0.325415 -0.602236 -0.134508 1.28121 1.0 2.0
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804 NaN 3.0
2021-01-05 0.348708 1.27175 0.626011 -0.253845 NaN 4.0
2021-01-06 -0.816064 1.30197 0.656281 -1.2718 NaN 5.0

3.1 dropna()

Use the dropna method to drop rows containing NaN values. Note: dropna returns a new DataFrame and does not change the original one.

df2.dropna(how='any')

The above code means that a row is deleted when any of its values is missing.

A B C D E F
2021-01-02 0.696541 0.136352 -1.64592 -0.69841 1.0 1.0
2021-01-03 0.325415 -0.602236 -0.134508 1.28121 1.0 2.0
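Before deciding how to handle missing data, it helps to count the NaN values per column. A minimal sketch on df2:

df2.isna().sum()  # E has 3 missing values and F has 1; A-D have none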

3.2 fillna()

Use the fillna method to fill in NaN values.

df2.fillna(df2.mean())

The above code fills each vacancy with the average value of its column. Like dropna, fillna does not update the original DataFrame; to update it, use df2 = df2.fillna(df2.mean()).

The display table is as follows:

A B C D E F
2021-01-01 0.270961 -0.405463 0.348373 0.828572 1.0 3.0
2021-01-02 0.696541 0.136352 -1.64592 -0.69841 1.0 1.0
2021-01-03 0.325415 -0.602236 -0.134508 1.28121 1.0 2.0
2021-01-04 -0.33032 -1.40384 -0.93809 1.48804 1.0 3.0
2021-01-05 0.348708 1.27175 0.626011 -0.253845 1.0 4.0
2021-01-06 -0.816064 1.30197 0.656281 -1.2718 1.0 5.0
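Besides a fill value, pandas also offers directional fills, such as forward-fill, which propagates the last valid observation downward. A small sketch:

df2.ffill()  # E's NaN rows take 1.0 from above; F's first row stays NaN (nothing above it)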

4. Operations

4.1 agg()

agg is short for aggregate.

Common aggregation methods are as follows:

  • mean(): Compute mean of groups
  • sum(): Compute sum of group values
  • size(): Compute group sizes
  • count(): Compute count of group
  • std(): Standard deviation of groups
  • var(): Compute variance of groups
  • sem(): Standard error of the mean of groups
  • describe(): Generates descriptive statistics
  • first(): Compute first of group values
  • last(): Compute last of group values
  • nth() : Take nth value, or a subset if n is a list
  • min(): Compute min of group values
  • max(): Compute max of group values

df.mean()

Returns the average value of each column

A    0.082540
B    0.049755
C   -0.181309
D    0.228960
dtype: float64

You can view the row average value by adding the axis parameter.

df.mean(axis=1)

Output:

2021-01-01    0.260611
2021-01-02   -0.377860
2021-01-03    0.217470
2021-01-04   -0.296053
2021-01-05    0.498156
2021-01-06   -0.032404
dtype: float64

What if we want to view multiple aggregate statistics for a column at once?
Then we can call the agg method:

df.agg(['std','mean'])['A']

The return result shows the standard deviation STD and mean:


std     0.551412
mean    0.082540
Name: A, dtype: float64

Apply different aggregate functions to different columns:

df.agg({'A':['max','mean'],'B':['mean','std','var']})

The returned results are as follows:

A B
max 0.696541 nan
mean 0.0825402 0.0497552
std nan 1.07834
var nan 1.16281

4.2 apply()

apply() applies a function to the data.
For example, df.apply(np.sum) calls np.sum on each column and returns each column's sum.

df.apply(np.sum)

The output result is:

A    0.495241
B    0.298531
C   -1.087857
D    1.373762
dtype: float64

The apply method supports lambda expressions.

df.apply(lambda n: n*2)

A B C D
2021-01-01 0.541923 -0.810925 0.696747 1.65714
2021-01-02 1.39308 0.272704 -3.29185 -1.39682
2021-01-03 0.65083 -1.20447 -0.269016 2.56242
2021-01-04 -0.66064 -2.80768 -1.87618 2.97607
2021-01-05 0.697417 2.5435 1.25202 -0.50769
2021-01-06 -1.63213 2.60393 1.31256 -2.5436
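apply can also operate row-wise via axis=1. A small sketch computing the range of each row:

# max minus min within each row
df.apply(lambda row: row.max() - row.min(), axis=1)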

4.3 value_counts()

The value_counts method counts how many times each value appears in a column.
We regenerate some integer data to ensure a certain amount of duplication.

np.random.seed(101)
df3 = pd.DataFrame(np.random.randint(0,9,size = (6,4)),columns=list('ABCD'))
df3

A B C D
0 1 6 7 8
1 4 8 5 0
2 5 8 1 3
3 8 3 3 2
4 8 3 7 0
5 7 8 4 3

Call the value_counts() method.

df3['A'].value_counts()

Looking at the output, we can see that the number 8 appears twice in column A, and every other value appears once.

8    2
7    1
5    1
4    1
1    1
Name: A, dtype: int64
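value_counts can also return proportions instead of raw counts. A small sketch:

df3['A'].value_counts(normalize=True)  # 8 appears in 2 of 6 rows, i.e. about 0.333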

4.4 str

Pandas has built-in string processing methods, accessed through the .str accessor.

names = pd.Series(['andrew','bobo','claire','david','4'])
names.str.upper()

The above code converts all strings in the Series to uppercase.

0    ANDREW
1      BOBO
2    CLAIRE
3     DAVID
4         4
dtype: object

Capitalize the first letter:

names.str.capitalize()

Output is:

0    Andrew
1      Bobo
2    Claire
3     David
4         4
dtype: object

Check whether each string is numeric:

names.str.isdigit()

Output is:

0    False
1    False
2    False
3    False
4     True
dtype: bool

String split:

tech_finance = ['GOOG,APPL,AMZN','JPM,BAC,GS']
tickers = pd.Series(tech_finance)
tickers.str.split(',').str[0:2]

Each string is split on commas, and the result is:

0    [GOOG, APPL]
1      [JPM, BAC]
dtype: object
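Two other frequently used str methods are contains, for substring filtering, and replace. A small sketch on the names Series:

names[names.str.contains('a')]  # keeps andrew, claire, david
names.str.replace('4', 'four')  # plain-text replacement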

5. Combining data sets

5.1 concat()

Concat is used to concatenate data sets. Let’s prepare the data first.

data_one = {'Col1': ['A0', 'A1', 'A2', 'A3'],'Col2': ['B0', 'B1', 'B2', 'B3']}
data_two = {'Col1': ['C0', 'C1', 'C2', 'C3'], 'Col2': ['D0', 'D1', 'D2', 'D3']}
one = pd.DataFrame(data_one)
two = pd.DataFrame(data_two)

Concatenate the two data sets using the concat method.

pd.concat([one, two])

Get the table:

Col1 Col2
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
0 C0 D0
1 C1 D1
2 C2 D2
3 C3 D3
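Note the repeated index values 0-3 above. Passing ignore_index=True rebuilds a fresh index; a small sketch:

pd.concat([one, two], ignore_index=True)  # index becomes 0 through 7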

5.2 merge()

merge is the equivalent of SQL's join operation; it connects two data sets by some shared key.

registrations = pd.DataFrame({'reg_id':[1,2,3,4],'name':['Andrew','Bobo','Claire','David']})
logins = pd.DataFrame({'log_id':[1,2,3,4],'name':['Xavier','Andrew','Yolanda','Bobo']})

We join the two tables on name, using the outer join mode.

pd.merge(left=registrations, right=logins, how='outer',on='name')

The returned result is:

reg_id name log_id
0 1 Andrew 2
1 2 Bobo 4
2 3 Claire nan
3 4 David nan
4 nan Xavier 1
5 nan Yolanda 3

Note that how accepts four join modes: {'left', 'right', 'outer', 'inner'}. The mode determines which keys are kept and where NaN values appear. For example, left keeps every row of the left table; where the right table has no matching row, its columns are filled with NaN.
Simply put, think of the left table and the right table as two sets (a code sketch follows the list below).

  • left: keep all keys from the left table
  • right: keep all keys from the right table
  • outer: keep the union of both tables' keys
  • inner: keep only the intersection of both tables' keys
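A quick sketch contrasting two of these modes on the same tables:

pd.merge(registrations, logins, how='inner', on='name')  # only Andrew and Bobo (the intersection)
pd.merge(registrations, logins, how='left', on='name')   # all 4 registrations; Claire and David get NaN log_id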

6. Group by

The grouping function in pandas is very similar to the SQL statement SELECT Column1, Column2, mean(Column3), sum(Column4) FROM SomeTable GROUP BY Column1, Column2. Even if you have never used SQL, it doesn't matter: grouping amounts to splitting table data by some column, computing statistics, and combining the results.

Prepare data.

np.random.seed(20201212)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

As you can see, columns A and B contain a lot of duplicate values. We can group by foo/bar or by one/two.

A B C D
0 foo one 0.270961 0.325415
1 bar one -0.405463 -0.602236
2 foo two 0.348373 -0.134508
3 bar three 0.828572 1.28121
4 foo two 0.696541 -0.33032
5 bar two 0.136352 -1.40384
6 foo one -1.64592 -0.93809
7 foo three -0.69841 1.48804

6.1 single column grouping

We apply the groupby method to group the data in the table above.

df.groupby('A')

Executing the above code, you can see that the groupby method returns a DataFrameGroupBy object. We can't view it directly; we need to apply an aggregate function to it (see Section 4.1 of this article).

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000014C6742E248>

Let’s try the aggregate function sum.

df.groupby('A').sum()

The display table is as follows:

A C D
bar 0.559461 -0.724868
foo -1.02846 0.410533

6.2 multi column grouping

The groupby method supports passing multiple columns as parameters.

df.groupby(['A', 'B']).sum()

The results displayed after grouping are as follows:

A B C D
bar one -0.405463 -0.602236
three 0.828572 1.28121
two 0.136352 -1.40384
foo one -1.37496 -0.612675
three -0.69841 1.48804
two 1.04491 -0.464828

6.3 Applying multiple aggregation methods

We apply agg() and pass an array of aggregation methods as its parameter. The code below groups by A and computes statistics only for the C column.

df.groupby('A')['C'].agg([np.sum, np.mean, np.std])

You can see the results of aggregation functions of bar group and foo group as follows:

A sum mean std
bar 0.559461 0.186487 0.618543
foo -1.02846 -0.205692 0.957242

6.4 aggregate statistics for different columns

The following code performs different aggregation statistics for columns C and D, sums column C, and performs standard deviation statistics for column D.

df.groupby('A').agg({'C': 'sum', 'D': lambda x: np.std(x, ddof=1)})

The output is as follows:

A C D
bar 0.559461 1.37837
foo -1.02846 0.907422

6.5 more

For more about pandas groupby, please refer to the official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

3、 Pandas advanced usage

1. reshape

reshape means reshaping a table. Complex tables often need to be transformed into a form that is easier to understand, such as separate statistics after grouping by certain attributes.

1.1 stack() and unstack()

The stack method splits the table into index and data parts: the columns are compressed into the row index, and the data is kept in stacked form.

Prepare data.

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

According to the code above, we create a composite index.

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

We create a dataframe with a composite index.

np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df

The output is as follows:

A B
bar one 0.270961 -0.405463
two 0.348373 0.828572
baz one 0.696541 0.136352
two -1.64592 -0.69841
foo one 0.325415 -0.602236
two -0.134508 1.28121
qux one -0.33032 -1.40384
two -0.93809 1.48804

We execute the stack method.

stacked = df.stack()
stacked

The stacked (compressed) output is shown below. Note: the output in Jupyter Notebook/Lab may look different from the following; the output below has been adjusted for display in markdown.

first  second   
bar    one     A    0.270961
bar    one     B   -0.405463
bar    two     A    0.348373
bar    two     B    0.828572
baz    one     A    0.696541
baz    one     B    0.136352
baz    two     A   -1.645920
baz    two     B   -0.698410
foo    one     A    0.325415
foo    one     B   -0.602236
foo    two     A   -0.134508
foo    two     B    1.281210
qux    one     A   -0.330320
qux    one     B   -1.403840
qux    two     A   -0.938090
qux    two     B    1.488040
dtype: float64

We execute unstack to expand the data.

stacked.unstack()

Output the original table.

A B
bar one 0.270961 -0.405463
two 0.348373 0.828572
baz one 0.696541 0.136352
two -1.64592 -0.69841
foo one 0.325415 -0.602236
two -0.134508 1.28121
qux one -0.33032 -1.40384
two -0.93809 1.48804

We add the level parameter.

stacked.unstack(level=0)
#stacked.unstack(level=1)

The output when level=0 is shown below. You can try level=1 to see what it outputs.

first bar baz foo qux
second
one A 0.270961 0.696541 0.325415 -0.33032
one B -0.405463 0.136352 -0.602236 -1.40384
two A 0.348373 -1.64592 -0.134508 -0.93809
two B 0.828572 -0.69841 1.28121 1.48804

1.2 pivot_table()

pivot_table means pivot table: a table format used to dynamically arrange and subtotal data.

We generate a DataFrame without index columns.

np.random.seed(99)
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                    'B': ['A', 'B', 'C'] * 4,
                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                    'D': np.random.randn(12),
                    'E': np.random.randn(12)})
df

The display table is as follows:

A B C D E
0 one A foo -0.142359 0.0235001
1 one B foo 2.05722 0.456201
2 two C foo 0.283262 0.270493
3 three A bar 1.32981 -1.43501
4 one B bar -0.154622 0.882817
5 one C bar -0.0690309 -0.580082
6 two A foo 0.75518 -0.501565
7 three B foo 0.825647 0.590953
8 one C foo -0.113069 -0.731616
9 one A bar -2.36784 0.261755
10 two B bar -0.167049 -0.855796
11 three C bar 0.685398 -0.187526

Observing the data, we can see that columns A, B, and C carry category-like meaning. We execute the pivot_table method.

pd.pivot_table(df, values=['D','E'], index=['A', 'B'], columns=['C'])

The code above uses columns D and E as data columns and A and B as a composite row index, with the values of column C as the column index.

D E
C bar foo bar foo
A B
one A -2.36784 -0.142359 0.261755 0.0235001
one B -0.154622 2.05722 0.882817 0.456201
one C -0.0690309 -0.113069 -0.580082 -0.731616
three A 1.32981 nan -1.43501 nan
three B nan 0.825647 nan 0.590953
three C 0.685398 nan -0.187526 nan
two A nan 0.75518 nan -0.501565
two B -0.167049 nan -0.855796 nan
two C nan 0.283262 nan 0.270493
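pivot_table aggregates with the mean by default; the aggfunc parameter selects a different aggregation. A small sketch (the choice of 'sum' here is illustrative):

pd.pivot_table(df, values='D', index='A', columns='C', aggfunc='sum')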

2. Time series

date_range is the method pandas provides for generating date ranges. We execute the following code:

rng = pd.date_range('1/1/2021', periods=100, freq='S')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

The date_range method starts from 2021-01-01 00:00:00 and generates 100 periods at 1-second intervals. The output is as follows:

2021-01-01 00:00:00    475
2021-01-01 00:00:01    145
2021-01-01 00:00:02     13
2021-01-01 00:00:03    240
2021-01-01 00:00:04    183
                      ... 
2021-01-01 00:01:35    413
2021-01-01 00:01:36    330
2021-01-01 00:01:37    272
2021-01-01 00:01:38    304
2021-01-01 00:01:39    151
Freq: S, Length: 100, dtype: int32

Let's change the freq parameter from S (second) to M (month end).

rng = pd.date_range('1/1/2021', periods=100, freq='M')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

Output:

2021-01-31    311
2021-02-28    256
2021-03-31    327
2021-04-30    151
2021-05-31    484
             ... 
2028-12-31    170
2029-01-31    492
2029-02-28    205
2029-03-31     90
2029-04-30    446
Freq: M, Length: 100, dtype: int32

Dates can also be generated at a quarterly frequency.

prng = pd.period_range('2018Q1', '2020Q4', freq='Q-NOV')
pd.Series(np.random.randn(len(prng)), prng)

Output all quarters from the first quarter of 2018 to the fourth quarter of 2020.

2018Q1    0.833025
2018Q2   -0.509514
2018Q3   -0.735542
2018Q4   -0.224403
2019Q1   -0.119709
2019Q2   -1.379413
2019Q3    0.871741
2019Q4    0.877493
2020Q1    0.577611
2020Q2   -0.365737
2020Q3   -0.473404
2020Q4    0.529800
Freq: Q-NOV, dtype: float64
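A common companion to date ranges is resample, which regroups time-indexed data to a new frequency. A small sketch rebuilding the per-second series from above and downsampling it to one-minute sums:

rng = pd.date_range('1/1/2021', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('1Min').sum()  # two rows: the 00:00 minute (60 points) and 00:01 (40 points)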

3. Categorical data

Pandas has a special data type called "category", i.e. dtype="category". We can categorize data by setting certain columns to this type.

Prepare data.

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df
id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e

Let's add a new column grade and set its data type to category.

df["grade"] = df["raw_grade"].astype("category")
df["grade"]

We can see that the grade column has only three categories: a, b, and e.

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

We rename the categories a, b, and e to very good, good, and very bad, in that order.

df["grade"].cat.categories = ["very good", "good", "very bad"]

The table at this time is:

id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 e very bad

We sort the table (note that sorting follows the category order, not lexical order):

df.sort_values(by="grade", ascending=False)
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good

To view the quantity of each category:

df.groupby("grade").size()

The above code output is:

grade
very good    3
good         2
very bad     1
dtype: int64
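Categories can also carry an explicit order, which enables comparisons. A sketch using pd.CategoricalDtype; the ranking very bad < good < very good here is our assumption about the intended meaning:

# Recast grade with an explicitly ordered category list (assumed ranking)
order = pd.CategoricalDtype(["very bad", "good", "very good"], ordered=True)
ordered_grade = df["grade"].astype(order)
ordered_grade > "good"  # True only where grade is "very good"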

4. IO

Pandas supports reading and writing data directly in file formats such as CSV, JSON, and Excel. The supported formats are listed below.

Format Type Data Description Reader Writer
text CSV read_csv to_csv
text Fixed-Width Text File read_fwf
text JSON read_json to_json
text HTML read_html to_html
text Local clipboard read_clipboard to_clipboard
binary MS Excel read_excel to_excel
binary OpenDocument read_excel
binary HDF5 Format read_hdf to_hdf
binary Feather Format read_feather to_feather
binary Parquet Format read_parquet to_parquet
binary ORC Format read_orc
binary Msgpack read_msgpack to_msgpack
binary Stata read_stata to_stata
binary SAS read_sas
binary SPSS read_spss
binary Python Pickle Format read_pickle to_pickle
SQL SQL read_sql to_sql
SQL Google BigQuery read_gbq to_gbq

We will only take the CSV file as an example. Please refer to the table above for other formats.

We import data from a CSV file. (You don't need to pay special attention to the domain name in the URL below.)

df = pd.read_csv("http://blog.caiyongji.com/assets/housing.csv")

View the first 5 rows of data:

df.head(5)

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41 880 129 322 126 8.3252 452600 NEAR BAY
1 -122.22 37.86 21 7099 1106 2401 1138 8.3014 358500 NEAR BAY
2 -122.24 37.85 52 1467 190 496 177 7.2574 352100 NEAR BAY
3 -122.25 37.85 52 1274 235 558 219 5.6431 341300 NEAR BAY
4 -122.25 37.85 52 1627 280 565 259 3.8462 342200 NEAR BAY
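Writing is symmetric to reading: to_csv saves a DataFrame to disk. A small sketch (the file name here is arbitrary):

df.to_csv('housing_copy.csv', index=False)  # write without the index column
pd.read_csv('housing_copy.csv').head(2)     # round-trip check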

5. Drawing

Pandas supports Matplotlib, a powerful Python visualization library. This section only briefly introduces the drawing methods pandas supports; we will introduce Matplotlib in detail in the next article. Follow me so you don't miss the update.

np.random.seed(999)
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])

We call the plot method directly.
There are two things to note here:

  1. The plot method is called through pandas here, not through Matplotlib directly.
  2. Python statements do not need semicolons. The trailing semicolon suppresses the textual representation of the returned object in Jupyter, so only the rendered chart is displayed.

df.plot();

[Figure: line chart of the df columns]

df.plot.bar();

[Figure: bar chart of the df columns]

df.plot.bar(stacked=True);

[Figure: stacked bar chart of the df columns]
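Outside Jupyter (for example, in a plain .py script), the chart must be shown explicitly through Matplotlib. A minimal sketch:

import matplotlib.pyplot as plt

df.plot()   # pandas draws onto a Matplotlib Axes
plt.show()  # required in scripts; Jupyter renders inline automatically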

4、 More

Our next article will explain the relevant knowledge points of Matplotlib. Welcome to follow the Machine learning pre tutorial series, or my personal blog http://blog.caiyongji.com/ for synchronized updates.