Pandas from introduction to mastery (1) – Basics

Time:2021-12-5

Python owes much of its place in data science to the three workhorses of data analysis: numpy, pandas and Matplotlib. Of the three, I consider pandas the core and the most heavily used. Whether you are processing data or entering competitions, you need to be fluent in pandas. With that in mind, I joined the pandas study group organized by the Datawhale open-source community. The goal is to work flat out for one month and become proficient in pandas!

Starting with this issue, we will systematically study and organize pandas, in 10 installments that follow the outline below. After 10 installments of case-based study, using pandas fluently should come naturally.

  • Pandas basics
  • Indexing
  • Grouping
  • Reshaping
  • Joining
  • Missing data
  • Text data
  • Categorical data
  • Time series data
  • Comprehensive exercises

This first issue covers some basics in preparation for what follows: a few commonly used Python features and some operations from the numpy library.

1.1 List comprehensions

List comprehensions are a signature feature of Python; they create lists quickly and concisely.

1.1.1 Basic format

[expr for i in k]: expr is any expression, usually one involving the loop variable i (though it may be independent of i), and k is an iterable such as a list.
Applications:

  1. One line of code that outputs the cubes of 1 to 5
  2. One line of code that creates a list of 10 random integers between 60 and 100
# One line of code that outputs the cubes of 1 to 5
[i**3 for i in range(1,6)]
>>>[1, 8, 27, 64, 125]
# One line of code that creates a list of 10 random integers between 60 and 100 (simulating student grades)
import random
[random.randint(60,100) for _ in range(10)]
>>> [76, 89, 62, 83, 61, 80, 89, 99, 76, 78]
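
To make the shorthand concrete, here is a minimal sketch that builds the same list twice, once with an explicit loop and once with a comprehension. The names `cubes_loop` and `cubes_comp` are illustrative only and do not come from the text above.

```python
# Build the cubes of 1..5 with an explicit loop and append()
cubes_loop = []
for i in range(1, 6):
    cubes_loop.append(i ** 3)

# The same list written as a list comprehension
cubes_comp = [i ** 3 for i in range(1, 6)]

print(cubes_loop == cubes_comp)  # True
```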

1.1.2 Nested for loops

The for clauses in a list comprehension can be nested.
For example, given three lists holding customer names, clothing colors and sizes, one line of code outputs every combination of customer, color and size:

names = ['zhangsan', 'lisi', 'wangba']
color = ['red', 'yellow']
size = ['S', 'M', 'L']
[name + '-' + c + '-' + s for name in names for c in color for s in size]
>>>
['zhangsan-red-S',
 'zhangsan-red-M',
 'zhangsan-red-L',
 'zhangsan-yellow-S',
 'zhangsan-yellow-M',
 'zhangsan-yellow-L',
 'lisi-red-S',
 'lisi-red-M',
 'lisi-red-L',
 'lisi-yellow-S',
 'lisi-yellow-M',
 'lisi-yellow-L',
 'wangba-red-S',
 'wangba-red-M',
 'wangba-red-L',
 'wangba-yellow-S',
 'wangba-yellow-M',
 'wangba-yellow-L']

The code above is equivalent to the following nested loops, except that the loops print each item instead of collecting the items in a list:

for name in names:
    for c in color:
        for s in size:
            print(name + '-' + c + '-' + s)
>>>
zhangsan-red-S
zhangsan-red-M
zhangsan-red-L
zhangsan-yellow-S
zhangsan-yellow-M
zhangsan-yellow-L
lisi-red-S
lisi-red-M
lisi-red-L
lisi-yellow-S
lisi-yellow-M
lisi-yellow-L
wangba-red-S
wangba-red-M
wangba-red-L
wangba-yellow-S
wangba-yellow-M
wangba-yellow-L

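1.1.3 Filtering

An if clause can be added after the for clause of a list comprehension to filter elements. (An if ... else ... conditional expression is also allowed, but it goes before the for and chooses a value for each element rather than filtering.)
For example, one line of code outputs the integers from 1 to 100 that are divisible by 7:

# Output the numbers from 1 to 100 that are divisible by 7: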
[i for i in range(1,101) if i%7 == 0]
>>>
[7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84, 91, 98]
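
The two uses of `if` inside a comprehension are easy to mix up, so here is a small sketch contrasting them. The variable names and the even/odd labelling are assumptions chosen for illustration.

```python
# Filtering: the `if` clause comes after the `for` and drops elements
multiples_of_7 = [i for i in range(1, 30) if i % 7 == 0]
print(multiples_of_7)  # [7, 14, 21, 28]

# Conditional expression: `if ... else ...` comes before the `for`
# and keeps every element, choosing one of two values for each
labels = ['even' if i % 2 == 0 else 'odd' for i in range(1, 6)]
print(labels)  # ['odd', 'even', 'odd', 'even', 'odd']
```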

The examples above show how concise and elegant list comprehensions are, and they reflect Python's expressive power.

1.2 Anonymous functions with lambda

Functions are first-class citizens in Python: they can be assigned to variables, passed as arguments and returned like any other object. Code that is reused often is usually best written as a function. But when the function is trivial, or is called only once or twice, naming it and writing a full def block is overkill. This is where anonymous lambda functions come in.
Format: lambda [arg1 [, arg2, ... argn]]: expression
Here lambda is a reserved keyword and [arg1 [, arg2, ... argn]] is the parameter list, with the same structure as the parameter list of an ordinary Python function. expression is an expression over the parameters; any parameter it uses must appear in the parameter list, and it must be a single expression.
Example: define a function that returns a string with all of its letters in uppercase

def str_capital(s):
    return str.upper(s)

str_capital('datawhale')
>>>
'DATAWHALE'

With an anonymous function instead:

upper = lambda x: str.upper(x)
upper('datawhale')
>>>
'DATAWHALE'

The comparison shows that anonymous functions have the following advantages:

  • They can be defined right where they are used, so a later change can be made on the spot, which eases maintenance
  • The syntax is simple: instead of def function_name(parameters): ..., you write lambda parameters: return_value

Note, however, that a lambda makes a program more concise but not more efficient. This is one reason many programmers object to using lambda.
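
A typical place where a lambda is defined right where it is used is the `key` argument of `sorted`. The sketch below assumes a made-up list of (name, score) pairs; `scores`, `ranked` and `add` are illustrative names only.

```python
# Hypothetical data: (name, score) pairs
scores = [('zhangsan', 76), ('lisi', 89), ('wangba', 62)]

# Sort by score, highest first; the lambda is defined right where it is used
ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
print(ranked)  # [('lisi', 89), ('zhangsan', 76), ('wangba', 62)]

# A lambda can take several parameters, like an ordinary function
add = lambda x, y: x + y
print(add(2, 3))  # 5
```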

1.3 The map() function

In Python, lambda is often used together with map(), reduce() and filter(), three functions that operate on sequences to transform, cumulatively combine and filter their elements. map() and filter() are built in; in Python 3, reduce() lives in the functools module. The most commonly used of the three is map(). map() is essentially a mapping: it applies a given function to every element of the iterable (for example, a list) passed to it. For example, suppose we write a function that converts a string to uppercase and use it to convert several strings:

def str_capital(s):
    return str.upper(s)
L1 = ['I', 'like', 'Datawhale']
L2 = []
for s in L1:
    L2.append(str_capital(s))
L2
>>> 
['I', 'LIKE', 'DATAWHALE']

Replacing the for loop with map():

L3 = map(str_capital, L1)
list(L3)
>>>
['I', 'LIKE', 'DATAWHALE']

Much more concise! Note that map() returns a map object, which is an iterator; use list() to see the elements in it. As mentioned above, map() is often combined with a lambda:

L4 = map(lambda x: str.upper(x), L1)
list(L4)
>>>
['I', 'LIKE', 'DATAWHALE']

Elegant!
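
Since filter() and reduce() are mentioned alongside map() but not shown, here is a short sketch of all three on the same list. The list `nums` and the lambdas are assumptions for illustration; note that in Python 3 `reduce` has to be imported from `functools`.

```python
from functools import reduce  # reduce is not a built-in in Python 3

nums = [1, 2, 3, 4, 5, 6]

# map: apply a function to every element
squares = list(map(lambda x: x ** 2, nums))
print(squares)  # [1, 4, 9, 16, 25, 36]

# filter: keep only the elements for which the function returns True
evens = list(filter(lambda x: x % 2 == 0, nums))
print(evens)  # [2, 4, 6]

# reduce: fold the sequence into a single value, here ((1 + 2) + 3) + ...
total = reduce(lambda acc, x: acc + x, nums)
print(total)  # 21

# map also accepts several iterables and passes one element from each
sums = list(map(lambda x, y: x + y, [1, 2, 3], [10, 20, 30]))
print(sums)  # [11, 22, 33]
```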

1.4 The zip() function

Most of us know zip as a file compression format. Python's zip() function has a loosely similar flavor of packing things together: pass in one or more lists or other iterables, and it takes one element from each in turn and packs them into a tuple. For example:

a = [3,4,5,6]
b = ['a', 'b', 'c']
s1 = {'zhangsan': 20, 'lisi': 25}
print(zip(a))
print('*' * 10)
print(list(zip(a)))
print(list(zip(b)))
print(list(zip(s1)))
>>>
<zip object at 0x000001A7D4FF7940>
**********
[(3,), (4,), (5,), (6,)]
[('a',), ('b',), ('c',)]
[('zhangsan',), ('lisi',)]

As you can see, zip() also returns a zip object; use list() to view the elements in it.
With two arguments, as in zip(a, b), it takes one element from a and one from b to form each tuple and yields the tuples from a new iterator. It stops when the shorter input is exhausted, which is why the 6 in a does not appear below. For example:

print(list(zip(a,b)))
>>>
[(3, 'a'), (4, 'b'), (5, 'c')]
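
Two common follow-ups to pairing are undoing the pairing and building a dict. The sketch below reuses the lists `a` and `b` from above; `pairs`, `nums`, `letters` and `mapping` are names introduced here for illustration.

```python
a = [3, 4, 5, 6]
b = ['a', 'b', 'c']

# zip stops at the shorter input, so the trailing 6 is dropped
pairs = list(zip(a, b))
print(pairs)  # [(3, 'a'), (4, 'b'), (5, 'c')]

# zip(*pairs) reverses the operation and gives back the columns as tuples
nums, letters = zip(*pairs)
print(nums)     # (3, 4, 5)
print(letters)  # ('a', 'b', 'c')

# Pairing keys with values is a quick way to build a dict
mapping = dict(zip(b, a))
print(mapping)  # {'a': 3, 'b': 4, 'c': 5}
```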

This behavior is handy for element-wise addition, subtraction and multiplication of matrices (two-dimensional arrays). For example:

import numpy as np
m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
n = [[2, 2, 2], [3, 3, 3], [4, 4, 4]]
# Element-wise matrix multiplication
print('=*' * 10 + "Element-wise matrix multiplication" + '=*' * 10)
print(np.array([x*y for a, b in zip(m, n) for x, y in zip(a, b)]).reshape(3,3))
# Matrix addition and subtraction work the same way
print('=*' * 10 + "Matrix addition and subtraction" + '=*' * 10)
print(np.array([x+y for a, b in zip(m, n) for x, y in zip(a, b)]).reshape(3,3))
>>>
=*=*=*=*=*=*=*=*=*=*Element-wise matrix multiplication=*=*=*=*=*=*=*=*=*=*
[[ 2  4  6]
 [12 15 18]
 [28 32 36]]
=*=*=*=*=*=*=*=*=*=*Matrix addition and subtraction=*=*=*=*=*=*=*=*=*=*
[[ 3  4  5]
 [ 7  8  9]
 [11 12 13]]

Note: element-wise matrix multiplication
Element-wise multiplication multiplies corresponding elements, so the two matrices must have the same shape. It should not be confused with the matrix product, which multiplies rows by columns.
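
With numpy arrays the distinction between element-wise ("point") multiplication and the matrix product is a matter of which operator is used. A minimal sketch with the same values as `m` and `n` above, converted to arrays; `elementwise` and `product` are illustrative names.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
n = np.array([[2, 2, 2], [3, 3, 3], [4, 4, 4]])

elementwise = m * n  # multiplies entries at matching positions
product = m @ n      # matrix product: rows of m times columns of n

print(elementwise)
print(product)
```

The first result matches the zip-based version above, while the second is a different matrix.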

2. Numpy review

Pandas builds on numpy for efficient computation, so it is worth reviewing numpy before studying pandas. Here are some commonly used pieces of numpy.

2.1 np.array

The most basic data structure in numpy is the array, and constructing one is simple: just call np.array. Several special arrays are summarized below.

  1. Arithmetic sequences
  • np.linspace(start, stop (inclusive), number of samples): for when you know in advance how many samples to create
  • np.arange(start, stop (exclusive), step): for when you know the spacing between adjacent values in advance
    Be careful not to confuse np.arange with Python's built-in range: range can only generate integer sequences, while np.arange can also generate fractional ones
import numpy as np
a = np.linspace(1,100,10)
b = np.arange(1,10,1.5)
print(a)
print(b)
>>>
[  1.  12.  23.  34.  45.  56.  67.  78.  89. 100.]
[1.  2.5 4.  5.5 7.  8.5]
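
As a small sketch of the points above, the following builds arrays from Python lists with `np.array` and contrasts how `np.linspace` and `np.arange` treat the end point. The values are chosen only for illustration.

```python
import numpy as np

# Build arrays from (nested) Python lists
v = np.array([1, 2, 3])
M = np.array([[1, 2, 3], [4, 5, 6]])
print(v.shape, M.shape)  # (3,) (2, 3)

# linspace takes the number of samples and includes the end point:
# 5 values 0, 0.25, 0.5, 0.75, 1
lin = np.linspace(0, 1, 5)
print(lin)

# arange takes the step size and excludes the end point:
# 4 values 0, 0.25, 0.5, 0.75
ar = np.arange(0, 1, 0.25)
print(ar)
```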
  2. Special matrices, including zeros / ones / eye / full, etc.
    See the code directly:
print('All-zeros matrix with 3 rows and 4 columns')
print(np.zeros((3,4)))
print('*' * 10)
print('All-ones matrix with 3 rows and 3 columns')
print(np.ones((3, 3)))
print('*' * 10)
print('Identity matrix with 3 rows and 3 columns')
print(np.eye(3))
print('*' * 10)
print('Matrix of a specified shape filled with a given value')
print(np.full((2,3), 6))
>>>
All-zeros matrix with 3 rows and 4 columns
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
**********
All-ones matrix with 3 rows and 3 columns
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
**********
Identity matrix with 3 rows and 3 columns
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
**********
Matrix of a specified shape filled with a given value
[[6 6 6]
 [6 6 6]]
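  3. Random matrices
  • np.random.rand(): uniform distribution on [0, 1); do not pass a tuple here, just give the size of each dimension directly
  • np.random.randn(): standard normal distribution (mean 0, variance 1)
  • np.random.randint(low, high, size): specify the minimum, the maximum (exclusive) and the shape of the random integers to generate
  • np.random.choice(): draws from a given list with specified probabilities and sampling mode; sampling is uniform when no probabilities are given, and is with replacement by default
  • np.random.seed(0): sets the seed, which fixes the random sequence so that every run produces the same values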
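
The functions listed above can be sketched as follows. The shapes, bounds and probabilities are arbitrary values picked for illustration, and no specific random output is shown because it depends on the numpy version and seed.

```python
import numpy as np

np.random.seed(0)  # fix the seed so the run is reproducible

u = np.random.rand(2, 3)               # uniform on [0, 1); dimensions passed as separate ints
z = np.random.randn(2, 3)              # standard normal, same calling convention
k = np.random.randint(5, 15, (2, 2))   # integers in [5, 15); 15 itself is excluded
# 2 draws without replacement, with explicit (assumed) probabilities
c = np.random.choice(['a', 'b', 'c'], 2, replace=False, p=[0.1, 0.7, 0.2])

print(u.shape, z.shape, k.shape, c.shape)  # (2, 3) (2, 3) (2, 2) (2,)

# Resetting the seed replays the same sequence
np.random.seed(0)
print(np.allclose(u, np.random.rand(2, 3)))  # True
```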

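3. Exercises

  1. Matrix multiplication with a list comprehension:
    Definition of matrix multiplication: for $M_1$ of shape $m \times n$ and $M_2$ of shape $n \times p$, the product has shape $m \times p$ and entries

    $$[M_1 M_2]_{ij} = \sum_{k=1}^{n} [M_1]_{ik} [M_2]_{kj}$$

    Following this formula, ordinary matrix multiplication can be written as a triple loop:

    [Image: triple-loop implementation]

Use a list comprehension in place of the for loops: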

# First define two matrices
M1 = np.random.randint(1,10,10).reshape(2,5)
M2 = np.random.randint(1,10,10).reshape(5,2)
print(M1)
print('-' * 5)
print(M2)
M1@M2 # matrix multiplication
>>>
[[6 1 2 8 5]
 [6 1 7 9 4]]
-----
[[6 2]
 [7 7]
 [1 4]
 [7 1]
 [8 3]]
array([[141,  50],
       [145,  68]])
# Using a list comprehension
[[sum([M1[i][k] * M2[k][j] for k in range(M1.shape[1])]) for j in range(M2.shape[1])] for i in range(M1.shape[0])]
>>>
[[141, 50], [145, 68]]
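
For comparison, the triple loop that the comprehension replaces might look like the sketch below. This is an illustrative reconstruction, not the original code: the seed and the names `res` and `comp` are assumptions, and both results are checked against `M1 @ M2`.

```python
import numpy as np

np.random.seed(0)  # assumption: seed chosen only so the sketch is reproducible
M1 = np.random.randint(1, 10, (2, 5))
M2 = np.random.randint(1, 10, (5, 2))

# Triple loop: res[i][j] accumulates M1[i][k] * M2[k][j] over k
res = np.zeros((M1.shape[0], M2.shape[1]), dtype=int)
for i in range(M1.shape[0]):
    for j in range(M2.shape[1]):
        for k in range(M1.shape[1]):
            res[i][j] += M1[i][k] * M2[k][j]

# The same computation as a nested list comprehension
comp = [[sum(M1[i][k] * M2[k][j] for k in range(M1.shape[1]))
         for j in range(M2.shape[1])]
        for i in range(M1.shape[0])]

print(np.array_equal(res, M1 @ M2))   # True
print(np.array_equal(comp, M1 @ M2))  # True
```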
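  2. Updating a matrix
    Let $A$ be an $m \times n$ matrix. Update every element of $A$ to produce a matrix $B$, where the update rule is

    $$B_{ij} = A_{ij} \sum_{k=1}^{n} \frac{1}{A_{ik}}$$

For example, if $A$ is the matrix below, then $B_{2,2} = 5 \times (1/4 + 1/5 + 1/6) = 37/12$. Implement this efficiently with Numpy.

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$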
Solution:

A = np.arange(1,10).reshape(3,3)
B = A*(1/A).sum(1).reshape(-1,1)
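
To see why the one-liner works, the sketch below splits it into steps and checks the entry in the second row and second column against the value 37/12 from the example. The intermediate name `row_sums` is introduced here for illustration.

```python
import numpy as np

A = np.arange(1, 10).reshape(3, 3)

# Sum of reciprocals along each row, shape (3,)
row_sums = (1 / A).sum(axis=1)

# Reshape to a (3, 1) column so it broadcasts across the columns of A
B = A * row_sums.reshape(-1, 1)

# Row 2, column 2 in 1-based terms is index [1, 1]
print(np.isclose(B[1, 1], 37 / 12))  # True
```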
[Image: image.png]

Using built-in functions:

# B[i][j] = (sum of column j) * (sum of row i) / (sum of all elements of A)
B = A.sum(0) * A.sum(1).reshape(-1,1) / A.sum()
print(B)
# Sum over all elements of (A - B)^2 / B
res = ((A-B) ** 2 / B).sum()
print(res)

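Reference: the open-source Joyful Pandas, by Geng Yuanhao of DataWhale
For more content, search WeChat for the public account "Python数据科学家之路" and follow it. I look forward to hearing from you.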