Efficient implementation of conditional logic on pandas DataFrames

Time:2021-4-14

By Louis Chan
Compiled by VK
Source: Towards Data Science

Python is arguably the coolest programming language today (thanks to machine learning and data science), but it is not very efficient compared with C, one of the best programming languages.

When developing machine learning models, we very often need to update labels programmatically based on hard-coded rules derived from statistical analysis or from the results of the previous iteration. There is no shame in admitting it: I had been writing this kind of code with pandas apply until one day, fed up with all the nesting, I decided to explore (also known as Google) other, more maintainable and efficient methods.

Demo data set

The dataset we are going to use is the iris dataset, which you can get for free through pandas or seaborn.

import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

# import seaborn as sns
# iris = sns.load_dataset("iris")

Figure: the first five rows of the iris dataset

Figure: summary statistics of the dataset

Suppose that after the initial analysis, we want to label the dataset with the following logic:

  • If sepal length < 5.1, the label is 0;

  • Otherwise, if sepal width > 3.3 and sepal length < 5.8, the label is 1;

  • Otherwise, if sepal width > 3.3 and petal length > 5.1, the label is 2;

  • Otherwise, if sepal width > 3.3, petal length < 1.6, and (sepal length < 6.4 or petal width < 1.3), the label is 3;

  • Otherwise, if sepal width > 3.3 and (sepal length < 6.4 or petal width < 1.3), the label is 4;

  • Otherwise, if sepal width > 3.3, the label is 5;

  • Otherwise label 6

Before delving into the code, let’s quickly create a new label column and set it to None:

iris['label'] = None

pandas .iterrows() + nested if-else blocks

If you are still doing this, this blog post is definitely for you!

%%timeit
for idx, row in iris.iterrows():
  if row['sepal_length'] < 5.1:
    iris.loc[idx, 'label'] = 0
  elif row['sepal_width'] > 3.3:
    if row['sepal_length'] < 5.8:
      iris.loc[idx, 'label'] = 1
    elif row['petal_length'] > 5.1:
      iris.loc[idx, 'label'] = 2
    elif (row['sepal_length'] < 6.4) or (row['petal_width'] < 1.3):
      if row['petal_length'] < 1.6:
        iris.loc[idx, 'label'] = 3
      else:
        iris.loc[idx, 'label'] = 4
    else:
      iris.loc[idx, 'label'] = 5
  else:
    iris.loc[idx, 'label'] = 6
1min 29s ± 8.91 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

That took a long time. Well, let’s move on.

pandas .apply()

pandas.apply applies a function directly along an axis of a DataFrame, or to a Series. For example, suppose we have a function f that sums a sequence of numbers (it could be a list, np.array, tuple, and so on). We can pass it to a DataFrame as follows to sum across each row:

def f(numbers):
    return sum(numbers)
    
df['Row Subtotal'] = df.apply(f, axis=1)

Here the function is applied with axis=1. By default, apply uses axis=0, which passes each column to the function, while axis=1 passes each row to the function.
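To make the effect of the axis argument concrete, here is a minimal sketch on a made-up two-column frame (the frame and its column names a and b are assumptions for illustration, not part of the iris example):

```python
import pandas as pd

# Hypothetical toy frame, used only to show the effect of `axis`.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def f(numbers):
    return sum(numbers)

# axis=0 (the default): f receives each column in turn, one result per column.
column_totals = df.apply(f, axis=0)

# axis=1: f receives each row in turn, one result per row.
row_totals = df.apply(f, axis=1)

print(column_totals.tolist())  # [6, 60]
print(row_totals.tolist())     # [11, 22, 33]
```

Since the labeling rules look at several columns of the same row, axis=1 is the variant needed for this task.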

Now that we have a basic understanding of pandas.apply, let’s write the labeling logic and see how long it takes to run:

%%timeit
def rules(row):
  if row['sepal_length'] < 5.1:
    return 0
  elif row['sepal_width'] > 3.3:
    if row['sepal_length'] < 5.8:
      return 1
    elif row['petal_length'] > 5.1:
      return 2
    elif (row['sepal_length'] < 6.4) or (row['petal_width'] < 1.3):
      if row['petal_length'] < 1.6:
        return 3
      return 4
    return 5
  return 6

iris['label'] = iris.apply(rules, 1)
1.43 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Only 1.43 s for 150,000 rows, which is a great improvement over the previous approach, but it is still very slow.

Imagine you need to process a dataset with millions of transaction records or credit approvals: it would take more than 14 seconds every time we apply a set of rules and a function to a column. Run it on enough columns and there goes your afternoon.

pandas .loc[]

If you are familiar with SQL, using .loc[] to assign values to a new column is essentially an UPDATE statement with a WHERE condition. It should therefore be much faster than applying a function to each row or column.

%%timeit
iris['label'] = 6
iris.loc[iris['sepal_width'] > 3.3, 'label'] = 5
iris.loc[
  (iris['sepal_width'] > 3.3) & 
  ((iris['sepal_length'] < 6.4) | (iris['petal_width'] < 1.3)), 
  'label'] = 4
iris.loc[
  (iris['sepal_width'] > 3.3) & 
  ((iris['sepal_length'] < 6.4) | (iris['petal_width'] < 1.3)) & 
  (iris['petal_length'] < 1.6), 
  'label'] = 3
iris.loc[
  (iris['sepal_width'] > 3.3) & 
  (iris['petal_length'] > 5.1), 
  'label'] = 2
iris.loc[
  (iris['sepal_width'] > 3.3) & 
  (iris['sepal_length'] < 5.8), 
  'label'] = 1
iris.loc[
  (iris['sepal_length'] < 5.1), 
  'label'] = 0
13.3 ms ± 837 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
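
Note that the assignments above run from the most general rule to the most specific one, the reverse of the if/elif order. Each later .loc assignment overwrites the earlier ones on overlapping rows, so the highest-priority rule has to come last. A minimal sketch on a made-up frame (the column x and its thresholds are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 5, 9]})

# Equivalent if/elif logic: label 0 if x < 3, else 1 if x < 7, else 2.
toy["label"] = 2                     # default (the final `else`)
toy.loc[toy["x"] < 7, "label"] = 1   # broader rule first
toy.loc[toy["x"] < 3, "label"] = 0   # highest-priority rule last, so it wins

print(toy["label"].tolist())  # [0, 1, 2]
```

Swapping the last two assignments would label the first row 1 instead of 0, which is the kind of ordering mistake to watch for with this approach.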

Now we are spending only about a hundredth of the time the previous approach took (13.3 ms versus 1.43 s), which means you have fewer excuses to leave your desk when you’re working from home. So far, however, we have only used pandas’ built-in functionality. Although pandas gives us a very convenient high-level interface for working with tabular data, efficiency can suffer from the layers of abstraction.

numpy.where

NumPy has a lower-level interface that allows more efficient interaction with n-dimensional iterables (i.e., vectors, matrices, tensors, etc.). Its methods are usually implemented in C, and for more complex calculations it uses optimized algorithms, which makes it faster than our reinvented wheels.

According to NumPy’s official documentation, np.where() accepts the following syntax:

np.where(condition, return value if True, return value if False)

Essentially, this is a binary choice: the condition is evaluated as a Boolean and a value is returned accordingly. The trick here is that the condition can actually be an iterable (a Boolean ndarray). This means that we can use df['feature'] == 1 as the condition and write the where logic as:

np.where(
    df['feature'] == 1, 
    'It is one', 
    'It is not one'
)
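
As a runnable version of the snippet above, here is a minimal sketch on a made-up frame (the values in the feature column and the description column name are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature": [1, 0, 1, 2]})

# The condition is a Boolean Series; np.where picks a value element by element.
df["description"] = np.where(df["feature"] == 1, "It is one", "It is not one")

print(df["description"].tolist())
# ['It is one', 'It is not one', 'It is one', 'It is not one']
```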

So you might ask, how do we implement our labeling logic with np.where? The answer is simple, but disturbing: nested np.where() calls.

%%timeit
iris['label'] = np.where(
  iris['sepal_length'] < 5.1,
  0,
  np.where(
    iris['sepal_width'] > 3.3,
    np.where(
      iris['sepal_length'] < 5.8,
      1,
      np.where(
        iris['petal_length'] > 5.1,
        2,
        np.where(
          (iris['sepal_length'] < 6.4) | (iris['petal_width'] < 1.3),
          np.where(
            iris['petal_length'] < 1.6,
            3,
            4
          ),
          5
        )
      )
    ),
    6
  )
)
3.6 ms ± 149 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Congratulations, you made it. I can’t tell you how many tries it took me to count the closing brackets correctly, but hey, it works! We shaved roughly another 10 milliseconds off pandas .loc[]. However, this code fragment is not maintainable, which means it is not acceptable.
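
Because a deeply nested expression like this is easy to get wrong, one way to gain confidence is to compare it against the row-wise rules function. The sketch below does this on randomly generated values; the synthetic data, its ranges, and the short variable names are assumptions for illustration and are not the iris data.

```python
import numpy as np
import pandas as pd

# Synthetic measurements (an assumption for illustration, not the iris data).
rng = np.random.default_rng(0)
n = 1_000
demo = pd.DataFrame({
    "sepal_length": rng.uniform(4.0, 8.0, n),
    "sepal_width": rng.uniform(2.0, 4.5, n),
    "petal_length": rng.uniform(1.0, 7.0, n),
    "petal_width": rng.uniform(0.1, 2.5, n),
})

def rules(row):
    """Row-wise reference implementation of the labeling rules."""
    if row["sepal_length"] < 5.1:
        return 0
    if row["sepal_width"] > 3.3:
        if row["sepal_length"] < 5.8:
            return 1
        if row["petal_length"] > 5.1:
            return 2
        if (row["sepal_length"] < 6.4) or (row["petal_width"] < 1.3):
            return 3 if row["petal_length"] < 1.6 else 4
        return 5
    return 6

expected = demo.apply(rules, axis=1).to_numpy()

sl, sw = demo["sepal_length"], demo["sepal_width"]
pl, pw = demo["petal_length"], demo["petal_width"]
actual = np.where(
    sl < 5.1, 0,
    np.where(
        sw > 3.3,
        np.where(
            sl < 5.8, 1,
            np.where(
                pl > 5.1, 2,
                np.where((sl < 6.4) | (pw < 1.3), np.where(pl < 1.6, 3, 4), 5),
            ),
        ),
        6,
    ),
)

# True when the nested np.where reproduces the row-wise labels exactly.
matches = bool(np.array_equal(expected, actual))
print(matches)
```

A check like this is cheap to keep next to the vectorized code and catches a misplaced bracket immediately.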

numpy.select

Unlike numpy.where, numpy.select is a function designed for multi-branch conditional logic.

np.select(condlist, choicelist, default=0)

Its syntax is similar to np.where, but the first parameter is now a list of conditions, which should be the same length as the list of choices. One thing to remember when using np.select is that a choice is picked as soon as the first matching condition is found.

This means that if a superset rule appears before a subset rule in the list, the subset’s choice will never be selected. Specifically:

condlist = [
    df['A'] <= 1,
    df['A'] < 1
]

choicelist = ['<=1', '<1']

selection = np.select(condlist, choicelist, default='>1')

Because all rows that match df['A'] < 1 are also captured by df['A'] <= 1, no rows end up labeled '<1'. To avoid this, make sure the more specific rule comes before the less specific one:

condlist = [
    df['A'] < 1,   # <─┬ swapped
    df['A'] <= 1   # <─┘
]

choicelist = ['<1', '<=1']  # remember to update this too!

selection = np.select(condlist, choicelist, default='>1')
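
The two orderings can be compared side by side on a tiny made-up frame (the values in column A are an assumption for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2]})

# Superset rule first: '<1' can never be chosen.
superset_first = np.select(
    [df["A"] <= 1, df["A"] < 1], ["<=1", "<1"], default=">1"
)

# Subset rule first: each row gets the most specific matching label.
subset_first = np.select(
    [df["A"] < 1, df["A"] <= 1], ["<1", "<=1"], default=">1"
)

print(superset_first.tolist())  # ['<=1', '<=1', '>1']
print(subset_first.tolist())    # ['<1', '<=1', '>1']
```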

As you can see above, you need to update both condlist and choicelist to keep the code working correctly. Seriously, that is a time-consuming and error-prone step. By switching to a dictionary, we get roughly the same time and memory complexity, but with a more maintainable code fragment:

%%timeit
rules = {
  0: (iris['sepal_length'] < 5.1),
  1: (iris['sepal_width'] > 3.3) & (iris['sepal_length'] < 5.8),
  2: (iris['sepal_width'] > 3.3) & (iris['petal_length'] > 5.1),
  3: (
    (iris['sepal_width'] > 3.3) & \
    ((iris['sepal_length'] < 6.4) | (iris['petal_width'] < 1.3)) & \
    (iris['petal_length'] < 1.6)
  ),
  4: (
    (iris['sepal_width'] > 3.3) & \
    ((iris['sepal_length'] < 6.4) | (iris['petal_width'] < 1.3))
  ),
  5: (iris['sepal_width'] > 3.3),
}

iris['label'] = np.select(list(rules.values()), list(rules.keys()), default=6)
6.29 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is in the same ballpark as the nested np.where, but it not only saves you from debugging all that nesting, it also makes choicelist easy to change. I have forgotten to update choicelist so many times that I ended up spending more than four times as long debugging my machine learning models. Believe me, np.select with a dict is a very good choice.
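
One possible way to package the dictionary pattern is a small helper. In the sketch below, the helper name label_from_rules and the toy frame are hypothetical; the only library call involved is np.select. It relies on dictionaries preserving insertion order (guaranteed since Python 3.7), so rules must still be listed from highest to lowest priority.

```python
import numpy as np
import pandas as pd

def label_from_rules(rules, default):
    """Return labels from an ordered {label: Boolean mask} mapping.

    The first matching rule wins, so insertion order matters.
    """
    conditions = list(rules.values())
    choices = list(rules.keys())
    return np.select(conditions, choices, default=default)

# Hypothetical toy data: label 0 if x < 3, else 1 if x < 7, else 2.
toy = pd.DataFrame({"x": [1, 5, 9]})
toy_rules = {
    0: toy["x"] < 3,
    1: toy["x"] < 7,
}
toy["label"] = label_from_rules(toy_rules, default=2)

print(toy["label"].tolist())  # [0, 1, 2]
```

Keeping each label next to its condition means that reordering or removing a rule is a one-line change.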

Honorable mentions

  1. NumPy vectorized operations: if your code loops to evaluate unary functions, binary functions, or functions that operate on a sequence of numbers, you should refactor it by converting the data to a NumPy ndarray and make full use of NumPy’s vectorized operations to greatly speed up the script. For examples of unary functions, binary functions, and functions that operate on sequences of numbers, see: https://www.pythonlikeyoumeanit.com/Module3_IntroducingNumpy/VectorizedOperations.html#NumPy’s-Mathematical-Functions

  2. np.vectorize: don’t be fooled by the name of this function. It is just a convenience function and doesn’t make the code run faster. To use it, you first need to write your logic as a callable function and then run np.vectorize(your_function)(your_data_series). Another big drawback is that the DataFrame needs to be converted into one-dimensional iterables before being passed to the “vectorized” function. Conclusion: if np.vectorize is not convenient to use, don’t use it.

  3. numba.njit: now this is a real speed-up. It compiles code that works on NumPy values to bring it as close to C as possible and improve its efficiency. Although it can speed up numerical calculation, it is also limited to numerical calculation, which means no pandas Series, no string indexes, and only NumPy ndarrays with int, float, datetime, bool, and category types. Conclusion: if you can comfortably work with NumPy ndarrays and express your logic purely as numerical calculation, it is a very good choice. Learn more here: https://numba.pydata.org/numba-doc/dev/user/5minguide.html
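
To illustrate the difference between the first two items, here is a minimal sketch comparing np.vectorize with a genuinely vectorized NumPy expression. The function is_short and the sample values are assumptions for illustration; no timing is claimed here.

```python
import numpy as np

values = np.array([4.9, 5.5, 6.1])

def is_short(x):
    """Scalar rule: 1 if the value is below 5.1, else 0."""
    return 1 if x < 5.1 else 0

# np.vectorize: convenient, but it still calls the Python function per element.
looped = np.vectorize(is_short)(values)

# Vectorized operation: the comparison runs over the whole array at once.
vectorized = (values < 5.1).astype(int)

print(looped.tolist(), vectorized.tolist())  # [1, 0, 0] [1, 0, 0]
```

Both produce the same labels; only the second avoids a Python-level call for every element.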

Conclusion

If possible, go for numba.njit; otherwise, np.select with a dict will serve you well. Remember, every bit of improvement helps!

Link to the original text: https://towardsdatascience.com/efficient-implementation-of-conditional-logic-on-pandas-dataframes-4afa61eb7fce
