Pandas operation excel learning notes (2) — reading files

Time:2021-11-26

Through the pandas library, you can read data from various formats of data files, or write the processed data to these files. Pd.read in pandas Library_ Excel () is simpler to read tables than the previous xlrd library. Pandas library is a good choice for general batch data processing.

1、 Functions for reading Excel files

  • The read function is pd.read_ Some parameters of excel() function are:
pd.read_excel(io, sheet_name=0,header=0,skiprows=None,index_col=None,names=None,
                arse_cols=None,date_parser=None,na_values=None,thousands=None, 
                convert_float=True,has_index_names=None,converters=None,dtype=None,
                true_values=None,false_values=None,engine=None,squeeze=False,**kwds)

2、 Read the relevant parameters of the function

io

  • IO refers to the storage path of Excel file to be read
  • It can be an absolute path or a relative path.

sheet_name

  • sheet_ name:sheet_ The default value of name is 0, that is, the first worksheet is returned, and multiple sheets are returned_ Name = [0,1] or sheet_ Name = [“worksheet 1”, “worksheet 2”], if sheet_ Name = none returns all worksheets.
  • Note: int / String returns dataframe, while none and list return Dict Of dataframe.
  • Case:
import pandas as pd
pd.read_ Excel ("excel file to open. Xlsx",
                sheet_ Name = 1) # read the second worksheet in Excel Workbook
pd.read_ Excel ("excel file to open. Xlsx",
                sheet_ Name = "February") # read the worksheet named 'February' in Excel Workbook
pd.read_ Excel ("excel file to open. Xlsx",
                sheet_ Name = none) # reads all worksheets in Excel workbook, which is a dictionary of tables.
df_dict = pd.read_ Excel ("excel file to open. Xlsx",
                sheet_ Name = [1, "February") # read the second worksheet in Excel workbook and the worksheet named 'February'
print(df_dict[1])
Print (df_dict ["February"])

header

  • Header: Specifies the row as the column index (i.e. the header row in general). The default is 0, that is, the first row is taken, and the data is the data below the column name row; If the data does not contain a header row, set header = none;

index_col

  • index_ Col: Specifies the column that is the row index.

usecols

  • Usecols: specifies which columns are used only
    –None: all (default)
    –STR: for example: usecols = ‘a, C: F’ # here a, C, f refer to column a in Excel table.
    –Int list: for example: usecols = [0,2]
    –STR list: for example: usecols = [‘name’, ‘grade’]
    –Lambda function: for example: lambda x: x = = “name”
  • The str list method is recommended. Because when the excel table structure changes, it is easy to maintain the code.

skiprows

  • Skirows: omit data for the specified number of rows

    Pandas operation excel learning notes (2) -- reading files

    Sample picture
import pandas as pd
pd.read_ Excel ("excel file to open. Xlsx",
                Skirows = [0,2]) # skips the two rows with indexes 0 and 2, expressed as a list.

names

Names: Specifies the list of header row and column names to be used. If the table does not contain header rows, header = none should be explicitly passed

dtype

Dtype: is a dictionary representing the method {‘column name’: ‘type’}, which is used to set the data type of the column.

  • Int8 / int16 / int32 / Int64 (default): integer
  • Float16 / float32 / float64 (default): floating point
  • STR / String: String
  • Bool: Boolean
  • Category: Category
  • Datetime64 [ns]: timestamp (nanosecond)
  • Period [y / M / D]: time period (year / month / day)
  • Object: Python object mixed type
#Type view
import pandas as pd
df = pd.read_ Excel ("excel file to open. Xlsx")
Df.dtypes # view the data type of the read data

It will be found here that character data is recognized as object type. It should be noted that Int64 type cannot be directly converted to STR type. In this case, the following code can be used for conversion:

DF ['column name'] = DF ['column name']. Astype ('string ')

parse_dates

parse_ Dates: converts the date data in the table to date type, for example:

Pandas operation excel learning notes (2) -- reading files

Data type recognized before conversion

Pandas operation excel learning notes (2) -- reading files

Data type recognized after conversion

A conversion is used here. The code is as follows:

parse_ Dates = [0,1,2,3,4,5,6] # converts the data in columns with column indexes of 0, 1, 2, 3, 4, 5, 6 to date type

Splice multiple columns of data into one date format type, for example:

Pandas operation excel learning notes (2) -- reading files

Separate data sheet for mm / DD / yy

Conversion code:

import pandas as pd
df = pd.read_ Excel ("excel file to open. Xlsx", parse_dates = [[0, 1, 2]])

In the above code, you can use column names instead of indexes, that is:

import pandas as pd
df = pd.read_ Excel ("excel file to open. Xlsx", parse_dates = [['year', 'month', 'Day']])

You can also specify the merged column names in the form of dictionaries, that is:

import pandas as pd
df = pd.read_ Excel ("excel file to open. Xlsx", parse_dates = {'date': ['year', 'month', 'Day']})

date_parser

date_ Parse must be the same as parse_ Dates can be used together to convert data in the following format into date format:

Pandas operation excel learning notes (2) -- reading files

Date in Chinese format

The conversion code is as follows:

import pandas as pd
df = pd.read_excel(
                    "Excel file to open. Xlsx",
                    parse_dates = [0],
                    date_parser = lambda x: pd.to_datetime(
                    x. Format = '% y year% m month% d'
                    )
)# here's parse_ Dates = [0], used to specify the column to convert

na_values

na_ Values: other strings identified as Nan (missing value).

  • Nan means not a number < float >
  • na_ Values basic usage:
    — na_values = 0
    — na_ Values = ‘null value’
    — na_ Values = [‘null value’, 0]
    — na_ Values = {‘column name’: [‘null value’, 0]}
    Small pit: if there are spaces in the cell, Nan will not be displayed after reading.
    Replace 0 in the cell with Nan
import pandas as pd
df = pd.read_excel(
                    "Excel file to open. Xlsx",
                    na_values  = 0
)

Replace the string ‘a’ in the cell with Nan

import pandas as pd
df = pd.read_excel(
                    "Excel file to open. Xlsx",
                    na_values  = 'a'
)

Pass in the form of list, and replace the specified value in the cell with Nan, for example, replace the string ‘a’, 0 and space in the cell with Nan

import pandas as pd
df = pd.read_excel(
                    "Excel file to open. Xlsx",
                    na_values  = [ 'a',0,' ']
)

Pass in the form of dictionary, and replace the specified value in the specified column cell with Nan, such as:

import pandas as pd
df = pd.read_excel(
                    "Excel file to open. Xlsx",
                    na_ Values = {'column name', ['a ', 0,']}

converters

Converters: a value conversion function. The default value is none, and the incoming value is a dictionary. For example:

Pandas operation excel learning notes (2) -- reading files

Formatted cell value

To remove these space symbols, you can use Converters:

import pandas as pd
df = pd.read_excel(
                    "Excel file to open. Xlsx",
                    converters  ={
                    'Article No.': lambda x: x.strip ()
                    }
)
#You can also use the lambda function:
import pandas as pd
df = pd.read_excel(
                    "Excel file to open. Xlsx",
                    converters  ={
                    'Article No.': str.strip
                    }
)

It can also be used for some simple calculations, such as adding 5 to all sales values

import pandas as pd
df = pd.read_excel(
                    "Excel file to open. Xlsx",
                    converters  ={
                    'sales volume ': lambda x: x + 5
                    }
)

That’s all for today. Some parameters are relatively simple. No notes.