Detailed Explanation of Pandas.read_csv Parameters (Summary)


Overview of pandas.read_csv parameters

read_csv reads a CSV (comma-separated) file into a DataFrame.

Partial reading of files and chunked iteration are also supported.

For more help, see the pandas IO Tools documentation.


filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)

Can be a URL; valid URL schemes include http, ftp, s3 and file. Support for multiple files is still in preparation.

Example of reading a local file: file://localhost/path/to/table.csv

sep : str, default ','

Specifies the delimiter. If not given, a comma is tried. Separators longer than one character and different from '\s+' are interpreted as regular expressions and force use of the Python parser; note that regex separators are prone to ignoring quoted data (e.g. commas inside quotes). Regular expression example: '\r\t'.

delimiter : str, default None

Alternative delimiter (if specified, the sep parameter is ignored).

delim_whitespace : boolean, default False

Specifies whether whitespace (e.g. ' ' or '\t') is used as the delimiter, which is equivalent to setting sep='\s+'. If this parameter is set to True, the delimiter parameter is ignored.

New in version 0.18.1.

header : int or list of ints, default 'infer'

Row number(s) to use as the column names, and the start of the data. If the file has a header row this defaults to 0; if there are no column names, set header=None. Explicitly passing header=0 replaces any existing column names. The header can also be a list, e.g. [0, 1, 3]: rows 1, 2 and 4 of the file become a multi-level column heading (each column gets multiple labels), the row in between (row 3 here) is discarded, and the DataFrame's data starts at line 5.

Note: if skip_blank_lines=True, the header parameter ignores comment lines and blank lines, so header=0 denotes the first line of data rather than the first line of the file.
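
As a quick illustration of multi-row headers (a minimal sketch using made-up in-memory data):

```python
import io

import pandas as pd

# Hypothetical CSV with two heading rows followed by two data rows
csv_data = "a,b,c\nx,y,z\n1,2,3\n4,5,6\n"

# header=[0, 1]: both rows become levels of a MultiIndex of column labels
df = pd.read_csv(io.StringIO(csv_data), header=[0, 1])
print(df.columns.tolist())  # [('a', 'x'), ('b', 'y'), ('c', 'z')]
print(len(df))              # 2 -- only the remaining rows are data
```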

names : array-like, default None

List of column names to use for the result. If the data file has no header row, header=None should also be passed. Duplicates in this list are not allowed unless the parameter mangle_dupe_cols=True is set.
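
For example (a small sketch with made-up data), combining names with header=None for a file that has no heading row:

```python
import io

import pandas as pd

raw = "1,alice\n2,bob\n"  # hypothetical file with no header row

# header=None treats the first line as data; names supplies the labels
df = pd.read_csv(io.StringIO(raw), header=None, names=["id", "name"])
print(df.columns.tolist())  # ['id', 'name']
print(df["id"].tolist())    # [1, 2]
```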

index_col : int or sequence or False, default None

Column number or column name to use as the row index. If a sequence is given, a MultiIndex is used.

If the file is irregular and each line ends with a trailing delimiter, set index_col=False to force pandas not to use the first column as the row index.
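
A short sketch of both cases, using made-up data:

```python
import io

import pandas as pd

raw = "date,value\n2020-01-01,10\n2020-01-02,20\n"  # made-up data

# Use the first column as the row index
df = pd.read_csv(io.StringIO(raw), index_col=0)
print(df.index.tolist())  # ['2020-01-01', '2020-01-02']

# A malformed file with a trailing delimiter on each data line:
ragged = "a,b,c\n1,2,3,\n4,5,6,\n"
# index_col=False stops pandas from using the first column as the index
df2 = pd.read_csv(io.StringIO(ragged), index_col=False)
print(df2.columns.tolist())  # ['a', 'b', 'c']
```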

usecols : array-like, default None

Return a subset of the columns. The values must either be positions in the file's columns (integers matching the column locations) or strings matching the column names in the file. For example, valid usecols arguments would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter can speed up loading and reduce memory consumption.
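
Both forms side by side (a minimal sketch with made-up data):

```python
import io

import pandas as pd

raw = "foo,bar,baz,qux\n1,2,3,4\n"  # made-up data

# Select columns by position...
df1 = pd.read_csv(io.StringIO(raw), usecols=[0, 2])
# ...or by name
df2 = pd.read_csv(io.StringIO(raw), usecols=["foo", "baz"])
print(df1.columns.tolist())  # ['foo', 'baz']
print(df2.columns.tolist())  # ['foo', 'baz']
```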

as_recarray : boolean, default False

Deprecated: this parameter will be removed in a future version. Use pd.read_csv(...).to_records() instead.

Return a NumPy recarray instead of a DataFrame. If set to True, it takes precedence over the squeeze parameter; row indexes are no longer available and index columns are ignored.

squeeze : boolean, default False

If the parsed data contains only one column, return a Series.

prefix : str, default None

Prefix to add to the column numbers when there is no header, e.g. 'X' yields X0, X1, …

mangle_dupe_cols : boolean, default True

Rename duplicate columns: 'X', …, 'X' becomes 'X.0', …, 'X.N'. If set to False, later duplicate columns overwrite earlier ones.

dtype : Type name or dict of column -> type, default None

Data type for each column of data, e.g. {'a': np.float64, 'b': np.int32}
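
A minimal sketch with made-up data:

```python
import io

import numpy as np
import pandas as pd

raw = "a,b\n1,2\n3,4\n"  # made-up data

# Force column 'a' to float64 and column 'b' to int32
df = pd.read_csv(io.StringIO(raw), dtype={"a": np.float64, "b": np.int32})
print(df["a"].dtype, df["b"].dtype)  # float64 int32
```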

engine : {‘c’, ‘python’}, optional

Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.


converters : dict, default None

Dict of functions for converting values in specified columns. Keys can be column names or column numbers.
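
A common use (a small sketch with made-up data): keeping leading zeros that a numeric dtype would discard.

```python
import io

import pandas as pd

raw = "code,amount\n007,1.5\n042,2.5\n"  # made-up data

# Without a converter the leading zeros would be lost (007 -> 7);
# converting the column with str preserves them
df = pd.read_csv(io.StringIO(raw), converters={"code": str})
print(df["code"].tolist())  # ['007', '042']
```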

true_values : list, default None

Values to consider as True

false_values : list, default None

Values to consider as False

skipinitialspace : boolean, default False

Skip whitespace after the delimiter (default is False, i.e. not skipped).

skiprows : list-like or integer, default None

Number of lines to be ignored (counted from the beginning of the file) or a list of line numbers to be skipped (starting from 0).

skipfooter : int, default 0

Number of lines to skip at the end of the file (not supported by the C engine).

skip_footer : int, default 0

Deprecated: use skipfooter instead; it does the same thing.

nrows : int, default None

Number of rows to read (starting from the file header).
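
Combining skiprows and nrows (a minimal sketch with made-up data):

```python
import io

import pandas as pd

raw = "junk line\na,b\n1,2\n3,4\n5,6\n"  # made-up data

# Skip the first physical line, then read at most two data rows
df = pd.read_csv(io.StringIO(raw), skiprows=1, nrows=2)
print(df["a"].tolist())  # [1, 3]
```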

na_values : scalar, str, list-like, or dict, default None

Additional values to recognize as NA/NaN. If a dict is passed, it specifies NA values per column. By default the following are interpreted as NaN: '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan', among others.

keep_default_na : bool, default True

If na_values is specified and keep_default_na=False, the specified values replace the default NaN set; otherwise they are added to it.
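
The interplay of na_values and keep_default_na (a small sketch with made-up data):

```python
import io

import pandas as pd

raw = "a,b\nmissing,1\nNA,2\n"  # made-up data

# A dict restricts the extra NA marker 'missing' to column 'a';
# the defaults (such as 'NA') still apply because keep_default_na=True
df = pd.read_csv(io.StringIO(raw), na_values={"a": ["missing"]})
print(df["a"].isna().tolist())  # [True, True]

# With keep_default_na=False only 'missing' is treated as NaN;
# 'NA' survives as a plain string
df2 = pd.read_csv(io.StringIO(raw), na_values=["missing"], keep_default_na=False)
print(df2["a"].isna().tolist())  # [True, False]
```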

na_filter : boolean, default True

Detect missing-value markers (empty strings or NA values). For large files whose data contains no missing values, setting na_filter=False can improve reading speed.

verbose : boolean, default False

Whether to print the output information of various parsers, such as “Number of missing values in non-numeric columns”.

skip_blank_lines : boolean, default True

If True, skip blank lines rather than interpreting them as NaN values.

parse_dates : boolean or list of ints or names or list of lists or dict, default False

  • boolean: True -> parse the index
  • list of ints or names: e.g. [1, 2, 3] -> parse columns 1, 2 and 3 each as a separate date column
  • list of lists: e.g. [[1, 3]] -> combine columns 1 and 3 and parse as a single date column
  • dict: e.g. {'foo': [1, 3]} -> combine columns 1 and 3, parse as a date, and name the resulting column 'foo'
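
The simplest of these cases, parsing a named column into datetime64 values (a minimal sketch with made-up data):

```python
import io

import pandas as pd

raw = "date,value\n2020-01-01,10\n2020-01-02,20\n"  # made-up data

# parse_dates with a list of column names parses each one as dates
df = pd.read_csv(io.StringIO(raw), parse_dates=["date"])
print(df["date"].dtype)  # datetime64[ns]
```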

infer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas attempts to infer the datetime format of the strings and, if one can be inferred, switches to a faster parsing method. In some cases this speeds up parsing by 5 to 10 times.

keep_date_col : boolean, default False

If multiple columns are combined to parse a date, keep the original columns as well. The default is False.

date_parser : function, default None

Function to use for parsing dates; dateutil.parser.parser is used by default. pandas tries calling it in three different ways, advancing to the next one if an exception occurs:

1. Pass one or more arrays (as specified by parse_dates) as arguments;

2. Concatenate the string values of the columns to be combined (row-wise) into a single array and pass that;

3. Call date_parser once per row, passing one or more strings (as specified by parse_dates) as arguments.

dayfirst : boolean, default False

Parse dates in DD/MM format.

iterator : boolean, default False

Returns a TextFileReader object to process files block by block.

chunksize : int, default None

File block size, See IO Tools docs for more information on iterator and chunksize.
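
Chunked reading in action (a minimal sketch with made-up data):

```python
import io

import pandas as pd

raw = "a,b\n1,2\n3,4\n5,6\n"  # made-up data

# chunksize makes read_csv return an iterator of DataFrames
chunks = list(pd.read_csv(io.StringIO(raw), chunksize=2))
print([len(c) for c in chunks])  # [2, 1]
print(chunks[0]["a"].tolist())   # [1, 3]
```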

compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'

Decompress on-disk data directly. With 'infer', files whose names end in '.gz', '.bz2', '.zip' or '.xz' are decompressed with gzip, bz2, zip or xz respectively; otherwise no decompression is done. With 'zip', the ZIP archive must contain exactly one file. Setting to None performs no decompression.

New in version 0.18.1: support for zip and xz decompression.

thousands : str, default None

Thousands separator, such as ',' or '.'

decimal : str, default '.'

Character used as the decimal point (e.g. ',' for European data).

float_precision : string, default None

Specifies which converter the C engine should use for floating-point values. The options are None for the ordinary converter, high for the high-precision converter, and round_trip for the round-trip converter.


lineterminator : str (length 1), default None

Line splitter, used only under the C parser.

quotechar : str (length 1), optional

The character used to mark the start and end of a quoted item; delimiters inside the quotes are ignored.

quoting : int or csv.QUOTE_* instance, default 0

Controls field quoting behaviour using csv module constants: QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

doublequote : boolean, default True

When a quotechar is specified and quoting is not QUOTE_NONE, indicates whether two consecutive quotechar characters inside a field should be interpreted as a single quotechar character.

escapechar : str (length 1), default None

One-character string used to escape the delimiter when quoting is QUOTE_NONE.

comment : str, default None

Marks lines that should not be parsed. If this character appears at the beginning of a line, the whole line is ignored. The parameter must be a single character. Like blank lines (when skip_blank_lines=True), fully commented lines are ignored by the header and skiprows parameters. For example, with comment='#', parsing '#empty\na,b,c\n1,2,3' with header=0 uses 'a,b,c' as the header.
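
That example as runnable code (a minimal sketch):

```python
import io

import pandas as pd

raw = "#empty\na,b,c\n1,2,3\n"  # made-up data

# The '#empty' line is dropped entirely, so header=0 refers to
# 'a,b,c', the first non-comment line
df = pd.read_csv(io.StringIO(raw), comment="#", header=0)
print(df.columns.tolist())  # ['a', 'b', 'c']
print(df.iloc[0].tolist())  # [1, 2, 3]
```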

encoding : str, default None

Character encoding to use, usually 'utf-8'. See the list of Python standard encodings.

dialect : str or csv.Dialect instance, default None

If not specified, the Excel dialect is used by default. This parameter is ignored when sep is longer than one character. See the csv.Dialect documentation for details.

tupleize_cols : boolean, default False

Leave a list of tuples on the columns as-is (the default is to convert them into a MultiIndex on the columns).

error_bad_lines : boolean, default True

If a row contains too many fields, by default an error is raised and no DataFrame is returned; if set to False, these "bad lines" are dropped from the result (C parser only).

warn_bad_lines : boolean, default True

If error_bad_lines=False and warn_bad_lines=True, a warning is emitted for each "bad line" (C parser only).

low_memory : boolean, default True

Internally process the file in chunks, lowering memory use while parsing, at the risk of mixed type inference. To ensure there are no mixed types, set it to False or specify the type with the dtype parameter. Note that the whole file is still read into a single DataFrame regardless; use the chunksize or iterator parameters to actually get the data back in chunks (C parser only).

buffer_lines : int, default None

Deprecated: this parameter will be removed in a future version because its value is ignored by the parser.

compact_ints : boolean, default False

Deprecated: this parameter will be removed in a future version.

If compact_ints=True, any column of integer dtype is stored using the smallest integer type that fits; whether it is signed or unsigned depends on the use_unsigned parameter.

use_unsigned : boolean, default False

Deprecated: this parameter will be removed in a future version.

If integer columns are being compacted (i.e. compact_ints=True), specifies whether the compacted columns should be signed or unsigned.

memory_map : boolean, default False

If a file path is given for filepath_or_buffer, map the file object directly onto memory and access the data from there. Using this option avoids further I/O on the file.

That is all for this article. I hope it is helpful to your study, and I hope you will continue to support developpaer.
