Implementation of data filtering in pandas

Time:2021-9-23

Author: Amanda Iglesias Moreno | Compiled by: VK | Source: Towards Data Science

Filtering data from data frames is one of the most common operations when cleaning data. Pandas provides a range of methods for selecting data by row and column position or label. In addition, pandas lets you select a subset of columns by data type and filter rows using Boolean indexing.

In this article, we will introduce the most common operations for selecting data subsets from pandas data frames:

  • Select a single column by label
  • Select multiple columns by label
  • Select columns by data type
  • Select a row by label
  • Select multiple rows by label
  • Select a row by location
  • Select multiple rows by location
  • Select both rows and columns
  • Select scalar value
  • Select rows using Boolean

Data set

In this article, we use a small data set for learning purposes. Real-world data sets are much larger; however, the process used to filter the data remains the same.

The data frame contains information about a company’s 10 employees: (1) ID number, (2) name, (3) last name, (4) department, (5) telephone, (6) salary, and (7) contract type.

import pandas as pd
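
The original employee table is not reproduced in this text, so the following is only a hypothetical stand-in with the same layout: ten employees indexed by ID number, with six columns. Every name, phone number, salary and column name (for example type_contract) is invented for illustration; only the ID numbers 128, 478, 257, 299 and 222 come from the examples discussed later in the article.

```python
import pandas as pd

# Hypothetical employee records: all values below are invented for illustration.
data = {
    "id_number": [128, 478, 257, 299, 222, 305, 411, 536, 642, 759],
    "name": ["Ana", "Luis", "Alba", "Marc", "Eva",
             "Abel", "Sara", "Ivan", "Nora", "Hugo"],
    "last_name": ["Ruiz", "Soto", "Vega", "Pons", "Gil",
                  "Mora", "Leon", "Cano", "Rey", "Diaz"],
    "department": ["Sales", "IT", "IT", "Finance", "Sales",
                   "HR", "IT", "Finance", "Sales", "HR"],
    "telephone": ["957-1234", "611-5709", "620-4457", "633-9082", "657-1100",
                  "644-2057", "698-3321", "612-7845", "655-0912", "601-5746"],
    "salary": [30000, 52000, 46000, 81000, 38000,
               29000, 61000, 45000, 33000, 70000],
    "type_contract": ["permanent", "permanent", "temporary", "permanent",
                      "fixed-term", "temporary", "permanent", "permanent",
                      "fixed-term", "permanent"],
}

# Use the ID number as the row index, leaving six ordinary columns.
df = pd.DataFrame(data).set_index("id_number")
print(df.shape)
print(df.head())
```

The later sketches in this article use shortened versions of this frame so that each one can be run on its own.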

1. Select a single column by label

To select a single column in pandas, we can use either the . operator (attribute access) or the [] operator.

Select a single column by label

df[string]

The following code uses these two methods to access the salary column.

#Use the . notation to select the column (salary)

As shown above, when a single column is retrieved, the result is a Series object. To get a DataFrame object when selecting only one column, we need to pass in a list rather than a single string.

#Get a series object by passing a string to the index operator

In addition, it is important to remember that when a column name contains spaces, we cannot use the . notation to access that column. If we do, Python raises a syntax error.
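
The points in this section can be sketched as follows. The data frame is a small hypothetical example (names and salaries are invented), and the renamed column "annual salary" exists only to show the case of a name containing a space.

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {"name": ["Ana", "Luis", "Alba"], "salary": [30000, 52000, 46000]},
    index=pd.Index([128, 478, 257], name="id_number"),
)

by_attribute = df.salary      # . notation returns a Series
by_brackets = df["salary"]    # [] with a string also returns a Series
as_frame = df[["salary"]]     # [] with a list returns a DataFrame

# A column name with a space can only be reached with the [] operator.
df_spaces = df.rename(columns={"salary": "annual salary"})
spaced = df_spaces["annual salary"]

print(type(by_brackets).__name__, type(as_frame).__name__)
```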

2. Select multiple columns by label

We can select multiple columns of a data frame by passing in a list of column names, as shown below.

Select multiple columns by label

df[list_of_strings]
#Select multiple columns by passing a list containing column names to the index operator

As shown above, the result is a dataframe object that contains only the columns provided in the list.
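
A minimal sketch of selecting several columns at once, using a hypothetical data frame whose column names and values are invented for illustration:

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Alba"],
        "last_name": ["Ruiz", "Soto", "Vega"],
        "department": ["Sales", "IT", "IT"],
        "salary": [30000, 52000, 46000],
    },
    index=pd.Index([128, 478, 257], name="id_number"),
)

# Passing a list of column names returns a DataFrame with only those columns,
# in the order given in the list.
subset = df[["name", "last_name", "salary"]]
print(subset)
```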

3. Select columns by data type

We can use the pandas.DataFrame.select_dtypes(include=None, exclude=None) method to select columns according to their data types. The include and exclude parameters each accept either a single data type or a list of data types.

Remember that at least one of the parameters (include or exclude) must be provided, and they cannot contain overlapping elements.

Select columns by data type

df.select_dtypes(include=None, exclude=None)

In the following example, we pass the np.number object to the include parameter. Alternatively, we can provide the string 'number' as input to get the same result.

As you can see, the select_dtypes() method returns a DataFrame object that includes the data types in the include parameter and excludes the data types in the exclude parameter.

import numpy as np
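
As a sketch of the behaviour described above, the following selects numeric columns with both np.number and the string 'number', and then excludes them. The data frame is hypothetical and its values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Alba"],
        "department": ["Sales", "IT", "IT"],
        "salary": [30000, 52000, 46000],
    },
    index=pd.Index([128, 478, 257], name="id_number"),
)

numeric = df.select_dtypes(include=np.number)       # NumPy object
numeric_str = df.select_dtypes(include="number")    # equivalent string
non_numeric = df.select_dtypes(exclude="number")    # everything else

print(list(numeric.columns), list(non_numeric.columns))
```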

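As mentioned earlier, the select_dtypes() method accepts both strings and NumPy objects as input. The most common ways of referring to data types in pandas are:

  • Numeric types: np.number or 'number'
  • Strings: 'object' (in pandas versions that store text columns with the object data type)
  • Dates and times: np.datetime64, 'datetime' or 'datetime64'
  • Booleans: 'bool'
  • Categoricals: 'category'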

As a reminder, we can check the data types of the columns using the pandas.DataFrame.info method or the pandas.DataFrame.dtypes attribute. The former prints a concise summary of the data frame, including the column names and their data types, while the latter returns a Series containing the data type of each column.

#A concise summary of the data frame, including column names and their data types

#Check the data type of the column
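
Both checks can be sketched on a small hypothetical data frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {"name": ["Ana", "Luis", "Alba"], "salary": [30000, 52000, 46000]},
    index=pd.Index([128, 478, 257], name="id_number"),
)

# info() prints a summary (index, column names, non-null counts, dtypes).
df.info()

# dtypes is an attribute, not a method: it returns a Series indexed by column name.
column_types = df.dtypes
print(column_types)
```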

4. Select a single row by label

DataFrames and Series do not necessarily have a numeric index. By default, the index is an integer representing the row position; however, it can also be an alphanumeric string. In our current example, the index is the employee’s ID number.

#We can use the .index attribute to check the index of the data frame

To select a row by ID number, we can use the .loc[] indexer, passing the row label (the index value) as input.

Select a single row by label

df.loc[string]

The following code shows how to select an employee with ID number 478.

#Use the .loc[] indexer to select the employee with ID number 478

As shown above, when a single row is selected, the .loc[] indexer returns a Series object. However, we can also get a single-row data frame by passing a single-element list to the .loc[] indexer, as shown below.

#Use the .loc[] indexer with a single-element list to select the employee with ID number 478
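
A sketch of both forms of single-row selection by label. The data frame is hypothetical (values invented), and it assumes the ID numbers are stored as integers, so the label passed to .loc[] is the integer 478; with a string index the label would be a string instead.

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Alba"],
        "department": ["Sales", "IT", "IT"],
        "salary": [30000, 52000, 46000],
    },
    index=pd.Index([128, 478, 257], name="id_number"),
)

row_series = df.loc[478]     # single label: Series indexed by column name
row_frame = df.loc[[478]]    # single-element list: one-row DataFrame

print(row_series)
print(row_frame)
```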

5. Select multiple rows by label

We can use the .loc[] indexer to select multiple rows. In addition to a single label, the indexer accepts a list or a slice of labels as input.

Select multiple rows by label

df.loc[list_of_strings]
df.loc[slice_of_strings]

Next, we obtain a subset of data frames containing employees with ID numbers 478 and 222, as shown below.

#Use the .loc[] indexer to select employees with ID numbers 478 and 222

Note that when slicing with the .loc[] indexer the end label is always included, which means that the selection includes the last label.
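
The list and slice forms can be sketched as follows, on a hypothetical data frame (values invented) whose index holds the ID numbers in the order 128, 478, 257, 299, 222. The label slice works here because the labels are unique.

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Alba", "Marc", "Eva"],
        "salary": [30000, 52000, 46000, 81000, 38000],
    },
    index=pd.Index([128, 478, 257, 299, 222], name="id_number"),
)

# A list of labels returns exactly those rows, in the order requested.
by_list = df.loc[[478, 222]]

# A label slice follows row order and includes BOTH end points.
by_slice = df.loc[478:299]

print(by_list)
print(by_slice)
```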

6. Select a single row by location

The .iloc[] indexer is used to index a data frame by position. To select a single row using the .iloc[] indexer, we pass the row position (a single integer) to the indexer.

Select a single row by location

df.iloc[integer]

In the following code block, we select the row with index 0. In this case, the first row of the data frame is returned because the index starts at 0 in pandas.

#Select the first row of the data frame

In addition, the .iloc[] indexer also supports negative integers (starting from -1) as positions relative to the end of the data frame.

#Select the last row of the data frame

As shown above, when a single row is selected, the .iloc[] indexer returns a Series object indexed by the column names. However, as we did with the .loc[] indexer, we can also get a data frame by passing a single-element list of integers to the indexer, as follows.

#Select the last row of the data frame as a DataFrame

Finally, remember that an IndexError is raised when trying to access a position that is out of bounds.

#Shape of data frame - 10 rows and 6 columns
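
The single-row cases for .iloc[] can be sketched together: first row, last row, last row as a one-row data frame, and an out-of-bounds position. The data frame is a hypothetical five-row example with invented values.

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Alba", "Marc", "Eva"],
        "salary": [30000, 52000, 46000, 81000, 38000],
    },
    index=pd.Index([128, 478, 257, 299, 222], name="id_number"),
)

first = df.iloc[0]          # first row as a Series
last = df.iloc[-1]          # negative positions count from the end
last_frame = df.iloc[[-1]]  # single-element list: one-row DataFrame

# Position 10 does not exist in a five-row frame.
out_of_bounds_raised = False
try:
    df.iloc[10]
except IndexError as exc:
    out_of_bounds_raised = True
    print("IndexError:", exc)
```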

7. Select multiple rows by location

To extract multiple rows by position, we pass a list or a slice object to the .iloc[] indexer.

Select multiple rows by location

df.iloc[list_of_integers]
df.iloc[slice_of_integers]

The following code block demonstrates how to select the first five rows of a data frame using a list of integers.

#Use the list to select the first 5 rows of the dataframe

Alternatively, we can use slice notation to get the same result.

#Use the slice to select the first 5 rows of the dataframe

As shown above, the Python slicing rule (half-open interval) applies to the .iloc[] indexer, which means that the start position is included but the end position is not.
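
A sketch comparing the list and slice forms on a hypothetical six-row data frame (values invented), so that "the first five rows" is a proper subset:

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {"salary": [30000, 52000, 46000, 81000, 38000, 29000]},
    index=pd.Index([128, 478, 257, 299, 222, 305], name="id_number"),
)

# Explicit list of positions 0..4.
first_five_list = df.iloc[[0, 1, 2, 3, 4]]

# Slice 0:5 covers positions 0..4; position 5 is excluded.
first_five_slice = df.iloc[0:5]

print(first_five_slice)
```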

8. Select both rows and columns

So far, we have learned how to use the .loc[] and .iloc[] indexers to select rows in a data frame by label or position. However, both indexers can select not only rows but also rows and columns at the same time.

To do this, we must provide row and column labels / positions separated by commas, as follows:

Select both rows and columns

df.loc[row_labels, column_labels]
df.iloc[row_positions, column_positions]

Here, row and column labels can be a single string, a list of strings, or a slice of strings. Similarly, row and column positions can be a single integer, a list of integers, or a slice of integers.

The following examples demonstrate how to extract rows and columns simultaneously using the .loc[] and .iloc[] indexers.

Select scalar value

We select the salary of the employee with ID 478 as follows.

#Select the salary of the employee with ID No. 478 by location

In this case, the output of both indexers is an integer.
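
A sketch of scalar selection with both indexers. The data frame is hypothetical (values invented); employee 478 sits at row position 1 and the salary column at column position 2.

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Alba"],
        "last_name": ["Ruiz", "Soto", "Vega"],
        "salary": [30000, 52000, 46000],
    },
    index=pd.Index([128, 478, 257], name="id_number"),
)

salary_by_label = df.loc[478, "salary"]   # row label, column label
salary_by_position = df.iloc[1, 2]        # row position, column position

print(salary_by_label, salary_by_position)
```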

Select single row and multiple columns

We select the name, last name and salary of the employee with ID number 478 by passing a single value as the first argument and a list of values as the second argument, which returns a Series object.

#Select the name, last name and salary of the employee with ID card No. 478 by position
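
A sketch of selecting one row and several columns, by label and by position, on a hypothetical data frame (values and column order are invented):

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Alba"],
        "last_name": ["Ruiz", "Soto", "Vega"],
        "department": ["Sales", "IT", "IT"],
        "salary": [30000, 52000, 46000],
    },
    index=pd.Index([128, 478, 257], name="id_number"),
)

# Single row label + list of column labels: returns a Series.
by_label = df.loc[478, ["name", "last_name", "salary"]]

# Single row position + list of column positions: the same Series.
by_position = df.iloc[1, [0, 1, 3]]

print(by_label)
```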

Select disjoint rows and columns

To select multiple rows and columns, we need to pass two value lists to both indexers. The following code shows how to extract the name, last name and salary of employees with ID numbers 478 and 222.

#Select the name, last name and salary of employees with ID card numbers 478 and 222 by position

Unlike before, the output of both indexers is a dataframe object.
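
Selecting non-adjacent rows and columns with two lists can be sketched as follows, on a hypothetical data frame (values invented) where employees 478 and 222 sit at row positions 1 and 4:

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Alba", "Marc", "Eva"],
        "last_name": ["Ruiz", "Soto", "Vega", "Pons", "Gil"],
        "department": ["Sales", "IT", "IT", "Finance", "Sales"],
        "salary": [30000, 52000, 46000, 81000, 38000],
    },
    index=pd.Index([128, 478, 257, 299, 222], name="id_number"),
)

# Two lists (rows, columns) return a DataFrame.
by_label = df.loc[[478, 222], ["name", "last_name", "salary"]]
by_position = df.iloc[[1, 4], [0, 1, 3]]

print(by_label)
```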

Select consecutive rows and columns

We can use slice notation to extract consecutive rows and columns of a data frame. The following code snippet shows how to select the name, last name, and salary of employees with ID numbers 128, 478, 257, and 299.

#Select the name, last name and salary of employees with ID numbers 128, 478, 257 and 299 by position

As shown above, we use slice notation only for the rows of the data frame, because the ID numbers we want to select are consecutive (positions 0 to 3).

It is important to remember that the .loc[] indexer uses a closed interval, so both the start and the stop labels are included. In contrast, the .iloc[] indexer uses a half-open interval and therefore does not include the value at the stop position.
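
The difference between the closed interval of .loc[] and the half-open interval of .iloc[] can be sketched on a hypothetical data frame (values invented) in which IDs 128, 478, 257 and 299 occupy positions 0 to 3:

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Alba", "Marc", "Eva"],
        "last_name": ["Ruiz", "Soto", "Vega", "Pons", "Gil"],
        "department": ["Sales", "IT", "IT", "Finance", "Sales"],
        "salary": [30000, 52000, 46000, 81000, 38000],
    },
    index=pd.Index([128, 478, 257, 299, 222], name="id_number"),
)

# Label slice: the stop label 299 IS included.
by_label = df.loc[128:299, ["name", "last_name", "salary"]]

# Position slice: the stop position 4 is NOT included, so 0:4 means rows 0..3.
by_position = df.iloc[0:4, [0, 1, 3]]

print(by_label)
```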

9. Use the .at[] and .iat[] indexers to select scalar values

As mentioned above, we can select scalar values by passing two comma-separated strings or integers to the .loc[] and .iloc[] indexers. In addition, pandas provides two optimized accessors for extracting scalar values from a DataFrame object: the .at[] and .iat[] indexers. The former retrieves a single value by label, while the latter accesses a single value by position.

Select scalar values by label and location

df.at[string, string]
df.iat[integer, integer]

The following code shows how to use the .at[] and .iat[] indexers to select the salary of the employee with ID number 478 by label and by position.

#Select the salary of the employee with ID No. 478 by location
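
A minimal sketch of both scalar accessors on a hypothetical data frame (values invented), where employee 478 is at row position 1 and salary at column position 2:

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Alba"],
        "last_name": ["Ruiz", "Soto", "Vega"],
        "salary": [30000, 52000, 46000],
    },
    index=pd.Index([128, 478, 257], name="id_number"),
)

salary_at = df.at[478, "salary"]   # by row label and column label
salary_iat = df.iat[1, 2]          # by row position and column position

print(salary_at, salary_iat)
```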

We can use the %timeit magic function to measure the execution time of these Python statements. As shown below, the .at[] and .iat[] indexers are much faster than the .loc[] and .iloc[] indexers.

#Execution time of the .loc[] indexer

#Execution time of iloc indexer
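
The %timeit magic is only available in IPython and Jupyter. Outside a notebook, a rough comparison can be sketched with the standard-library timeit module instead; this is a substitute for the magic, not the article's original measurement. The data frame is hypothetical (values invented), and the absolute timings depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {"name": ["Ana", "Luis", "Alba"], "salary": [30000, 52000, 46000]},
    index=pd.Index([128, 478, 257], name="id_number"),
)

n = 1000  # number of repetitions per statement

timings = {
    ".loc[]": timeit.timeit(lambda: df.loc[478, "salary"], number=n),
    ".at[]": timeit.timeit(lambda: df.at[478, "salary"], number=n),
    ".iloc[]": timeit.timeit(lambda: df.iloc[1, 1], number=n),
    ".iat[]": timeit.timeit(lambda: df.iat[1, 1], number=n),
}

for accessor, seconds in timings.items():
    print(f"{accessor:8s} {seconds / n * 1e6:8.2f} microseconds per call")
```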

Finally, remember that the .at[] and .iat[] indexers can only be used to access a single value; they raise an error when we try to select multiple elements of a data frame.

#An exception is thrown when trying to select multiple elements
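
A sketch of what happens when a list is passed to .at[]. The data frame is hypothetical (values invented). The exact exception class can differ between pandas versions, so the sketch deliberately catches the general Exception and prints the type it actually received.

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {"name": ["Ana", "Luis", "Alba"], "salary": [30000, 52000, 46000]},
    index=pd.Index([128, 478, 257], name="id_number"),
)

raised = False
try:
    # .at[] expects exactly one row label and one column label.
    df.at[[478, 257], "salary"]
except Exception as exc:  # exception type may vary by pandas version
    raised = True
    print(type(exc).__name__, exc)

# For several elements, use .loc[] instead.
salaries = df.loc[[478, 257], "salary"]
print(salaries)
```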

10. Select rows using Boolean indexing

So far, we have filtered the rows and columns of a data frame by label and position. Alternatively, we can select a subset in pandas using Boolean indexing. Boolean selection consists of selecting rows of a data frame by providing a Boolean value (True or False) for each row.

In most cases, this Boolean array is calculated by applying a condition to the values of one or more columns. The condition evaluates to True or False, depending on whether the values meet the condition. However, you can also create Boolean arrays manually using other sequences such as NumPy arrays, lists, or pandas Series.

The Boolean sequence is then placed in square brackets [] to return the rows associated with a True value.

Select rows using Boolean

df[sequence_of_booleans]

Boolean selection based on single column values

The most common way to filter data frames based on single column values is to use comparison operators.

A comparison operator evaluates the relationship between two operands (a and b) and returns True or False according to whether the condition is met. The comparison operators available in Python are:

  • a == b : equal to
  • a != b : not equal to
  • a > b : greater than
  • a < b : less than
  • a >= b : greater than or equal to
  • a <= b : less than or equal to

These comparison operators can be used for a single column of data frames to obtain a sequence of Boolean values. For example, we use the greater than operator to determine whether an employee’s salary is greater than 45000, as shown below.

#Employees earning more than 45000

The output is a Series of Boolean values, where salaries above 45000 are True and salaries less than or equal to 45000 are False. As you may notice, the Boolean Series has the same index (ID number) as the original data frame.

You can pass this Series to the index operator [] to return only the rows where the value is True.

#Select employees with salary higher than 45000

As shown above, we obtained a data frame object, which only contains employees with wages higher than 45000.
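
The two steps, building the Boolean Series and then using it to filter, can be sketched on a hypothetical data frame (names and salaries are invented):

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Alba", "Marc", "Eva"],
        "salary": [30000, 52000, 46000, 81000, 38000],
    },
    index=pd.Index([128, 478, 257, 299, 222], name="id_number"),
)

# Step 1: the comparison returns a Boolean Series with the same index as df.
mask = df["salary"] > 45000

# Step 2: passing the mask to [] keeps only the rows where it is True.
high_earners = df[mask]

print(mask)
print(high_earners)
```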

Boolean selection based on multi column values

Previously, we have filtered a data frame according to a condition. However, we can also use logical operators to combine multiple Boolean expressions.

In Python, there are three logical operators: and, or, and not. However, these keywords cannot be used to combine multiple Boolean conditions in pandas. Instead, use the following operators:

  • & : logical and
  • | : logical or
  • ~ : logical not

The following code shows how to select employees whose salary is higher than 45000 and have a permanent contract, which contains two Boolean expressions and the logical operator &.

#Select employees with salary higher than 45000 and long-term contract

Note that in Python, comparison operators take precedence over the keyword operators and, or, and not. However, the operators &, | and ~ used with pandas are Python's bitwise operators, which take precedence over comparison operators. Therefore, we need to wrap each Boolean expression in parentheses to avoid errors.
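
Combining two conditions with & can be sketched as follows. The data frame, the column name type_contract and the contract labels are hypothetical and invented for illustration.

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "salary": [30000, 52000, 46000, 81000, 38000],
        "type_contract": ["permanent", "permanent", "temporary",
                          "permanent", "fixed-term"],
    },
    index=pd.Index([128, 478, 257, 299, 222], name="id_number"),
)

# Each condition is wrapped in parentheses because & binds more tightly
# than the comparison operators > and ==.
selected = df[(df["salary"] > 45000) & (df["type_contract"] == "permanent")]

# ~ negates a condition: everyone who does NOT have a permanent contract.
not_permanent = df[~(df["type_contract"] == "permanent")]

print(selected)
```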

Boolean selection using pandas method

Pandas provides a series of built-in functions that return sequences of Boolean values, which is an attractive alternative to more complex Boolean expressions that combine comparison operators and logical operators.

isin method

The pandas.Series.isin method accepts a list-like collection of values and returns True at the positions where the Series element matches one of those values.

This method allows us to check whether a column contains one or more elements without using the logical operator or (|). The following code shows how to select employees with permanent or temporary contracts, first with the logical operator | and then with the isin method.

#Use the logical operator | to select employees with permanent or temporary contracts

As you can see, the isin method is very convenient when checking multiple or conditions in the same column. Plus, it’s faster!
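
Both approaches can be sketched side by side to confirm that they select the same rows. The data frame, the column name type_contract and the contract labels are hypothetical.

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Alba", "Marc", "Eva"],
        "type_contract": ["permanent", "permanent", "temporary",
                          "permanent", "fixed-term"],
    },
    index=pd.Index([128, 478, 257, 299, 222], name="id_number"),
)

# Two comparisons combined with the | operator.
with_or = df[(df["type_contract"] == "permanent")
             | (df["type_contract"] == "temporary")]

# The same selection with a single isin call.
with_isin = df[df["type_contract"].isin(["permanent", "temporary"])]

print(with_isin)
```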

#Use logical operator | execution time

between method

The pandas.Series.between method accepts two comma-separated scalars, which represent the lower and upper boundaries of a range of values, and returns True at the positions that fall within the range.

The following code selects employees whose salary is higher than or equal to 30000 and less than or equal to 80000.

#Employees with salary higher than or equal to 30000 and lower than or equal to 80000

As you can see, both boundaries (30000 and 80000) are included. To exclude them, we must pass the inclusive parameter: inclusive=False in older pandas releases, or inclusive='neither' from pandas 1.3 onwards, as follows.

#Employees with salary above 30000 and below 80000

As you may notice, the code for the inclusive range is equivalent to writing two Boolean expressions and combining them with the logical operator &.

#Employees with salary higher than or equal to 30000 and lower than or equal to 80000
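
The three variants can be sketched together on a hypothetical data frame (salaries invented). The sketch assumes pandas 1.3 or later, where the inclusive parameter takes the strings 'both', 'neither', 'left' or 'right'.

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {"salary": [30000, 52000, 46000, 81000, 38000]},
    index=pd.Index([128, 478, 257, 299, 222], name="id_number"),
)

# Default: both boundaries are included (30000 <= salary <= 80000).
inclusive = df[df["salary"].between(30000, 80000)]

# Boundaries excluded (30000 < salary < 80000); requires pandas >= 1.3.
exclusive = df[df["salary"].between(30000, 80000, inclusive="neither")]

# Equivalent to the inclusive version, written with two comparisons and &.
explicit = df[(df["salary"] >= 30000) & (df["salary"] <= 80000)]

print(inclusive)
print(exclusive)
```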

String methods

In addition, we can use Boolean indexing with string methods, as long as they return a sequence of Boolean values.

For example, the pandas.Series.str.contains method checks whether a substring is present in each element of a column and returns a Series of Boolean values, which we can pass to the index operator to filter the data frame.

The following code shows how to select all phone numbers that contain 57.

#Select all phone numbers that contain 57

While the contains method checks whether the substring appears anywhere in each element of the Series, the pandas.Series.str.startswith method checks whether a substring is present at the beginning of each string. Similarly, pandas.Series.str.endswith tests for a substring at the end of each string.

The following code shows how to select an employee whose name begins with “a”.

#Select an employee whose name begins with "a"
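
A sketch of filtering with string methods. The data frame is hypothetical: names and phone numbers are invented, phone numbers are assumed to be stored as strings, and names are assumed to start with a capital letter, so the prefix tested is "A".

```python
import pandas as pd

# Hypothetical data: values invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Marc", "Alba", "Eva"],
        "telephone": ["957-1234", "611-0902", "620-4457",
                      "633-9082", "688-1100"],
    },
    index=pd.Index([128, 478, 257, 299, 222], name="id_number"),
)

# Rows whose phone number contains the substring "57" anywhere.
contains_57 = df[df["telephone"].str.contains("57")]

# Rows whose name starts with "A" (the match is case-sensitive).
starts_with_a = df[df["name"].str.startswith("A")]

# Rows whose phone number ends with "00".
ends_with_00 = df[df["telephone"].str.endswith("00")]

print(contains_57)
print(starts_with_a)
```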

Summary

In this article, we learned how to select subsets from a DataFrame and walked through several usage examples. Now it’s time to apply these techniques when cleaning your own data!
