[pandas learning notes 01] powerful tool set for analyzing structured data

Time:2021-12-2

Author: Huan Hao

Source:Hang Seng light cloud community

Background introduction

In quantitative analysis, we constantly work with large amounts of data, mining the relationships within it to find the information we need. Doing this with plain Python alone is cumbersome. Is there a simpler tool that lets us analyze data quickly and efficiently?

Today we will introduce pandas, a powerful tool set for analyzing structured data.

This article is aimed at readers who already know basic Python syntax. Readers who need to learn Python first can find tutorials in the community (https://developer.hs.net/cour…).

Basic concepts

pandas is a free, open-source third-party Python library and one of the indispensable tools for data analysis in Python. It provides high-performance, easy-to-use data structures for data analysis, namely Series and DataFrame.

pandas is built on NumPy, which provides high-performance array and matrix operations. It is used for data mining and data analysis, and also provides data-cleaning functions.

Because it is built on NumPy, pandas works well with Python's other scientific computing libraries.

Since its release, pandas has been applied in many fields, such as finance, statistics, social science, and construction engineering.

With that introduction, you should have a basic idea of what pandas does. pandas is roughly the Python equivalent of Excel: it works with tables (that is, DataFrames) and can apply all kinds of transformations to data, but it also offers much more.

Data structures

DataFrame

A DataFrame is a tabular data structure. It contains an ordered set of columns, and each column can hold a different value type (numeric, string, Boolean, and so on). A DataFrame has both a row index and a column index. It can be regarded as a dictionary of Series that share a common index.

The DataFrame constructor is as follows:

pandas.DataFrame(data, index, columns, dtype, copy)

Parameter Description:

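  • data: the input data (ndarray, Series, map, list, dict, etc.).
  • index: the index values, also called row labels. The default is a RangeIndex (0, 1, 2, …, n) when no labels are given.
  • columns: the column labels. The default is a RangeIndex (0, 1, 2, …, n) when no labels are given.
  • dtype: the data type to force. If omitted, it is inferred from the data.
  • copy: whether to copy the input data. Older releases default to False; since pandas 1.3 the default is None, which behaves like True for dict input and like False for DataFrame or 2-D ndarray input.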
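The parameters above can be tried out with a minimal sketch like the one below. The prices, dates, and column names are made-up sample values, not data from the original article.

```python
import pandas as pd

# Hypothetical sample data: one [close, volume] pair per trading day.
rows = [[10.5, 1200], [10.8, 1500], [10.2, 900]]

df = pd.DataFrame(
    data=rows,
    index=["2021-11-29", "2021-11-30", "2021-12-01"],  # row labels
    columns=["close", "volume"],                       # column labels
    dtype="float64",                                   # force one dtype for every column
)

print(df)
print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # dtype of each column
```

Passing dtype here converts the integer volumes to floats as well, because a single dtype applies to the whole frame.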

Series

A Series is similar to a single column of a table: a one-dimensional labeled array that can hold any data type.

A Series consists of an index and a column of values. Its constructor is as follows:

pandas.Series(data, index, dtype, name, copy)

Parameter Description:

  • data: the input data (ndarray, list, dict, scalar value, etc.).
  • index: the index labels for the data. If not specified, they start from 0 by default.
  • dtype: the data type. By default it is inferred from the data.
  • name: the name of the Series.
  • copy: whether to copy the input data. The documented default is False.
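As a small illustration of these parameters, the sketch below builds one Series with explicit labels and one with the default index. The values and labels are invented for the example.

```python
import pandas as pd

# A Series with explicit labels, dtype, and name (sample values are made up).
prices = pd.Series(
    data=[10.5, 10.8, 10.2],
    index=["mon", "tue", "wed"],
    dtype="float64",
    name="close",
)
print(prices)
print(prices["tue"])  # look up a value by its index label

# Without an index argument, labels start from 0.
default_idx = pd.Series([7, 8, 9])
print(list(default_idx.index))
```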

Get started quickly

Import the library

Import pandas in your code:

import pandas as pd

If the import fails, either the environment is misconfigured or pandas is not installed. Install it as follows:

pip install pandas

Series object operations

Create a Series object with the Series() function. The resulting object exposes the corresponding methods and properties:

import pandas as pd
import numpy as np
# Build a Series from a NumPy array; the index defaults to 0, 1, 2, 3
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
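To make "methods and properties" concrete, here is a brief sketch that inspects a Series built the same way. The particular attributes shown (index, values, size, iloc, str.upper) are just a few common ones chosen for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array(["a", "b", "c", "d"]))

# Properties describe the Series.
print(s.index)    # the index labels
print(s.values)   # the underlying array of values
print(s.size)     # the number of elements

# Positional access and a vectorized string method.
first = s.iloc[0]
upper = s.str.upper()
print(first)
print(upper)
```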

DataFrame object operations

Create a DataFrame object with DataFrame() as follows:

import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)
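The list above produces a single unnamed column. A dictionary of Series that share one index is another common way to build a DataFrame, and it shows that each column can hold a different type. The column names and values below are hypothetical.

```python
import pandas as pd

index = ["a", "b", "c"]

# Each dictionary key becomes a column; all Series share the same index.
df = pd.DataFrame(
    {
        "price": pd.Series([1.0, 2.0, 3.0], index=index),
        "name": pd.Series(["x", "y", "z"], index=index),
        "active": pd.Series([True, False, True], index=index),
    }
)
print(df)
print(df.dtypes)  # one dtype per column

# Selecting a column returns a Series; .loc selects a row by label.
price_col = df["price"]
print(type(price_col))
print(df.loc["b"])
```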

Read file data

Use the read_csv() function to read a local .csv file:

data = pd.read_csv('file.csv')
data = pd.read_csv('file.csv', nrows=1000, skiprows=[1,5], encoding='gbk')

Parameter meaning:

  • 'file.csv': the name of the file to read. A full path can be given to read from another location.
  • nrows: the number of rows to read, counted from the start of the file.
  • skiprows: the line numbers to skip (0-indexed), or the number of lines to skip at the start of the file.
  • encoding: the encoding of the file being read.
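The sketch below shows how these parameters interact. So that it runs without any existing file, it first writes a small GBK-encoded CSV to a temporary directory. The file contents are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up sample file: a header line followed by eight data lines.
lines = ["code,name,close"] + [f"{600000 + i},股票{i},{10 + i}" for i in range(1, 9)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.csv")
    with open(path, "w", encoding="gbk") as f:
        f.write("\n".join(lines) + "\n")

    # skiprows=[1, 5] drops file lines 1 and 5 (0-based; line 0 is the header).
    # nrows=4 then keeps only the first four remaining data rows.
    data = pd.read_csv(path, nrows=4, skiprows=[1, 5], encoding="gbk")

print(data)
```

Reading a GBK file that contains non-ASCII text without the matching encoding argument would typically raise a decoding error or produce garbled text, which is why the parameter matters for files exported from Chinese-language tools.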

Like read_csv, there is a similar function, read_excel, for reading Excel files.

Write file data

pandas provides the to_csv() function to convert a DataFrame to CSV data. To write the CSV data to a file, pass a file path or file object to the function. Otherwise, the CSV data is returned as a string.

data.to_csv('my_new_file.csv', index=False)

Parameter meaning:

  • index: whether to write the row index to the output. The index is written by default.
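The effect of the index parameter, and the string return value when no file is given, can be seen in a small sketch like this one. The DataFrame contents are sample values only.

```python
import io

import pandas as pd

df = pd.DataFrame({"code": [1, 2], "close": [10.5, 10.8]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_with_index = df.to_csv()             # row index written as the first column
csv_no_index = df.to_csv(index=False)    # row index omitted

print(csv_with_index)
print(csv_no_index)

# The text without the index can be read straight back into a DataFrame.
restored = pd.read_csv(io.StringIO(csv_no_index))
print(restored)
```

Omitting the index is usually what you want when the index is just the default 0, 1, 2, … numbering, since otherwise reading the file back adds an extra unnamed column.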

Like to_csv, there is a similar method, to_excel, for writing Excel files.

Summary

This article introduced the basics of the pandas toolset. Learning pandas helps us process and analyze data quickly. Hands-on examples will follow in later articles, so stay tuned.