Playing with Python data processing in the way of tidyverse

Time:2022-6-20

preface

Today, we will introduce aPythonThird party Library——plydata, this library is based onRMediumdplyrtidyrandforcatsPackage. Many function names are borrowed directly

plydatause>>As a pipe operator, used to overrideply(data, *verbs)Function, currently only pandasDataFrameData structure, and database support may be added later

Let’s see how to use itplydataLet’s play with data manipulation

use

First, usepipInstall

pip install plydata

Let’s take a simple example and import the corresponding module first

import numpy as np
import pandas as pd
from plydata import define, query, if_else, ply

Create aDataFrame

df = pd.DataFrame({
    'x': [0, 1, 2, 3],
    'y': ['zero', 'one', 'two', 'three']}
)

usedefineFunction to add a column to the data frame (or use themutateFunction, both the same, corresponding totidyverseFunction with the same name in)

In [5]: df
Out[5]:
   x      y
0  0   zero
1  1    one
2  2    two
3  3  three
In [6]: df >> define(z='x')
Out[6]:
   x      y  z
0  0   zero  0
1  1    one  1
2  2    two  2
3  3  three  3

useif_elseTo add different values

In [7]: df >> define(z=if_else('x > 1', 1, 0))
Out[7]:
   x      y  z
0  0   zero  0
1  1    one  0
2  2    two  1
3  3  three  1

AndRUsed intidyverseSimilar to the functions in the package, you can also take the data frame as the first parameter of the function

In [8]: query(df, 'x > 1')
Out[8]:
   x      y
2  2    two
3  3  three
#Equivalent to the following operations
In [9]: df >> query('x > 1')
Out[9]:
   x      y
2  2    two
3  3  three

Or useplyFunctions instead of pipeline operations, for example

In [10]: ply(df,
    ...:     define(z=if_else('x > 1', 1, 0)),
    ...:     query('z == 1')
    ...: )
Out[10]:
   x      y  z
2  2    two  1
3  3  three  1

Pass each operation as a parameter toplyfunction

If andplotnine(based onggplot2ofPythonIt is easy toRDrawing code for, converting toPython

For example, forRCode, drawingsin(x)Function in[0,2𝜋]Graph of interval

library(tidyverse)

tibble(x = seq(0, 2*pi, length.out = 500)) %>%
  mutate(y = sin(x), sign = if_else(y >= 0, "positive", "negative")) %>%
  ggplot(aes(x, y)) +
  geom_line(aes(colour = sign), size = 1.5)
Playing with Python data processing in the way of tidyverse

Convert toPythoncode

from plotnine import ggplot, aes, geom_line

(
    pd.DataFrame({'x': np.linspace(0, 2*np.pi, 500)})
    >> define(y='np.sin(x)')
    >> define(sign=if_else('y >= 0', '"positive"', '"negative"'))
    >> (ggplot(aes('x', 'y'))
     + geom_line(aes(color='sign'), size=1.5))
)
Playing with Python data processing in the way of tidyverse

be careful: onPythonIn, operation expressions are placed inside quotation marks, and strings are expressed as nested quotation marks

usecallFunction to execute an external function orpd.DataFrameFunctions, for example, applying external functions

In [11]: df = pd.DataFrame({
    ...:     'A': {0: 'a', 1: 'b', 2: 'c'},
    ...:     'B': {0: 1, 1: 3, 2: 5},
    ...:     'C': {0: 2, 1: 4, 2: np.nan}
    ...: })

In [12]: df >> call(pd.melt)
Out[12]:
  variable value
0        A     a
1        A     b
2        A     c
3        B     1
4        B     3
5        B     5
6        C   2.0
7        C   4.0
8        C   NaN

In [13]: df >> call(pd.melt, id_vars=['A'], value_vars=['B'])
Out[13]:
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5

Apply object method

In [14]: df >> call('.dropna', axis=1)
Out[14]:
   A  B
0  a  1
1  b  3
2  c  5

In [15]: (df
    ...:  >> call(pd.melt)
    ...:  >> query('variable != "B"')
    ...:  >> call('.reset_index', drop=True)
    ...:  )
Out[15]:
  variable value
0        A     a
1        A     b
2        A     c
3        C   2.0
4        C   4.0
5        C   NaN

Example

1. single table operation

Single table operations mainly include the following functions

Playing with Python data processing in the way of tidyverse

Its functions are similar todplyrSame as the function with the same name in the package

For example,mutateFunction add column

In [16]: df >> mutate(x_sq = 'x**2')
Out[16]:
   x  x_sq
0  1     1
1  2     4
2  3     9

In [17]: df >> mutate(('x*2', 'x*2'), ('x*3', 'x*3'), x_cubed='x**3')
Out[17]:
   x  x*2  x*3  x_cubed
0  1    2    3        1
1  2    4    6        8
2  3    6    9       27

usearrangeFunction to sort the data frame

In [18]: df = pd.DataFrame({'x': [1, 5, 2, 2, 4, 0],
    ...:                    'y': [1, 2, 3, 4, 5, 6]})

In [19]: df >> arrange('x')
Out[19]:
   x  y
5  0  6
0  1  1
2  2  3
3  2  4
4  4  5
1  5  2

In [20]: df >> arrange('x', '-y')
Out[20]:
   x  y
5  0  6
0  1  1
3  2  4
2  2  3
4  4  5
1  5  2

In [21]: df >> arrange('np.sin(y)')
Out[21]:
   x  y
4  4  5
3  2  4
5  0  6
2  2  3
0  1  1
1  5  2

usegroup_byGroup

In [22]: df = pd.DataFrame({'x': [1, 5, 2, 2, 4, 0, 4],
    ...:                    'y': [1, 2, 3, 4, 5, 6, 5]})

In [23]: df >> group_by('x')
Out[23]:
groups: ['x']
   x  y
0  1  1
1  5  2
2  2  3
3  2  4
4  4  5
5  0  6
6  4  5

In [24]: df >> group_by('x') >> group_indices()
Out[24]: array([1, 4, 2, 2, 3, 0, 3])

be similar todefinegroup_byNew columns can be added

In [25]: df >> group_by('y-1', xplus1='x+1')
Out[25]:
groups: ['y-1', 'xplus1']
   x  y  y-1  xplus1
0  1  1    0       2
1  5  2    1       6
2  2  3    2       3
3  2  4    3       3
4  4  5    4       5
5  0  6    5       1
6  4  5    4       5

If subsequent verbs do not use grouping information, the newly generated columns will remain in the data frame

In [26]: df >> group_by('y-1', xplus1='x+1') >> select('y')
Out[26]:
groups: ['y-1', 'xplus1']
   y-1  xplus1  y
0    0       2  1
1    1       6  2
2    2       3  3
3    3       3  4
4    4       5  5
5    5       1  6
6    4       5  5

usequeryTo filter rows

In [27]: df = pd.DataFrame({'x': [0, 1, 2, 3, 4, 5],
    ...:                    'y': [0, 0, 1, 1, 2, 3]})

In [28]: df >> query('x % 2 == 0')
Out[28]:
   x  y
0  0  0
2  2  1
4  4  2

In [29]: df >> query('x % 2 == 0 and y > 0')
Out[29]:
   x  y
2  2  1
4  4  2

In [30]: df >> query('x % 2 == 0 & y > 0')
Out[30]:
   x  y
2  2  1
4  4  2

In [31]: df >> group_by('y') >> query('x == x.min()')
Out[31]:
groups: ['y']
   x  y
0  0  0
2  2  1
4  4  2
5  5  3

usesummarizeFunction to make statistics on data

In [32]: df = pd.DataFrame({'x': [1, 5, 2, 2, 4, 0, 4],
    ...:                    'y': [1, 2, 3, 4, 5, 6, 5],
    ...:                    'z': [1, 3, 3, 4, 5, 5, 5]})

In [33]: df >> summarize('np.sum(x)', max='np.max(x)')
Out[33]:
   np.sum(x)  max
0         18    5

In [34]: df >> group_by('y', 'z') >> summarize(mean_x='np.mean(x)')
Out[34]:
   y  z  mean_x
0  1  1     1.0
1  2  3     5.0
2  3  3     2.0
3  4  4     2.0
4  5  5     4.0
5  6  5     0.0

The following functions are supported:

  • min(x)numpy.amin()Alias for
  • max(x)numpy.amax()Alias for
  • sum(x)numpy.sum()Alias for
  • cumsum(x)numpy.cumsum()Alias for
  • mean(x)numpy.mean()Alias for
  • median(x)numpy.median()Alias for
  • std(x)numpy.std()Alias for
  • first(x)xFirst element of
  • last(x)xLast element of
  • nth(x, n)xOfnValues ornumpy.nan
  • n_distinct(x)xNumber of unique values in
  • n_unique(x)n_distinctAlias for
  • n()-Number of current groups
In [35]: df >> summarize('min(x)', 'max(x)', 'mean(x)', 'sum(x)',
    ...:                 'first(x)', 'last(x)', 'nth(x, 3)')
Out[35]:
   min(x)  max(x)   mean(x)  sum(x)  first(x)  last(x)  nth(x, 3)
0       0       5  2.571429      18         1        4          2

Group evaluation

In [36]: df >> group_by('y') >> summarize(y_count='n()')
Out[36]:
   y  y_count
0  1        1
1  2        1
2  3        1
3  4        1
4  5        2
5  6        1

In [37]: df >> group_by('y') >> summarize('mean(x)')
Out[37]:
   y  mean(x)
0  1      1.0
1  2      5.0
2  3      2.0
3  4      2.0
4  5      4.0
5  6      0.0

2. double meter operation

Dual table operations mainly include:

Playing with Python data processing in the way of tidyverse

AnddplyrFunctions with the same name perform the same functions. for example

In [38]: df1 = pd.DataFrame({
    ...:     'col1': ['one', 'two', 'three'],
    ...:     'col2': [1, 2, 3]
    ...: })

In [39]: df2 = pd.DataFrame({
    ...:     'col1': ['one', 'four', 'three'],
    ...:     'col2': [1, 4, 3]
    ...: })

In [40]: anti_join(df1, df2, on='col1')
Out[40]:
  col1  col2
1  two     2

In [41]: outer_join(df1, df2, on='col1')
Out[41]:
    col1  col2_x  col2_y
0    one     1.0     1.0
1    two     2.0     NaN
2  three     3.0     3.0
3   four     NaN     4.0

In [42]: inner_join(df1, df2, on='col1')
Out[42]:
    col1  col2_x  col2_y
0    one       1       1
1  three       3       3

In [43]: left_join(df1, df2, on='col1')
Out[43]:
    col1  col2_x  col2_y
0    one       1     1.0
1    two       2     NaN
2  three       3     3.0

In [44]: right_join(df1, df2, on='col1')
Out[44]:
    col1  col2_x  col2_y
0    one     1.0       1
1   four     NaN       4
2  three     3.0       3

In [45]: semi_join(df1, df2, on='col1')
Out[45]:
    col1  col2
0    one     1
2  three     3

3. Tidy Verbs

3.1 PivotTable

  1. gather
In [48]: from plydata.tidy import *

In [49]: df = pd.DataFrame({
    ...:     'name': ['mary', 'oscar', 'martha', 'john'],
    ...:     'math': [92, 83, 85, 90],
    ...:     'art': [75, 95, 80, 72]
    ...: })

In [50]: df >> gather('subject', 'grade', ['math', 'art'])
Out[50]:
     name subject  grade
0    mary    math     92
1   oscar    math     83
2  martha    math     85
3    john    math     90
4    mary     art     75
5   oscar     art     95
6  martha     art     80
7    john     art     72
  1. pivot_longer
In [51]: df = pd.DataFrame({
    ...:     'name': ['mary', 'mary', 'john', 'john'],
    ...:     'city':['dakar', 'dakar', 'lome', 'lome'],
    ...:     'year': [1990, 1992, 1996, 1998],
    ...:     'data_t1_sunny': [8, 6, 4, 7],
    ...:     'data_t2_rainy': [9, 7, 7, 6]
    ...: })

In [52]: df >> pivot_longer(
    ...:     cols=select(startswith='data'),
    ...:     names_to=['take', 'season'],
    ...:     values_to='score',
    ...:     names_pattern=r'data_(t\d)(_\w+)',
    ...:     names_prefix={'take': 't', 'season': '_'}
    ...: )
Out[52]:
   name   city  year take season  score
0  mary  dakar  1990    1  sunny      8
1  mary  dakar  1992    1  sunny      6
2  john   lome  1996    1  sunny      4
3  john   lome  1998    1  sunny      7
4  mary  dakar  1990    2  rainy      9
5  mary  dakar  1992    2  rainy      7
6  john   lome  1996    2  rainy      7
7  john   lome  1998    2  rainy      6
  1. pivot_wider
In [53]: df = pd.DataFrame({
    ...:     'name': ['mary', 'oscar', 'martha', 'john'] * 2,
    ...:     'initials': ['M.K', 'O.S', 'M.J', 'J.T'] * 2,
    ...:     'subject': np.repeat(['math', 'art'], 4),
    ...:     'grade': [92, 83, 85, 90, 75, 95, 80, 72],
    ...:     'midterm': [88, 83, 89, 93, 85, 95, 76, 79]
    ...: })

In [54]: df >> pivot_wider(
    ...:     names_from='subject',
    ...:     values_from=('grade', 'midterm')
    ...: )
Out[54]:
  initials    name  grade_art  grade_math  midterm_art  midterm_math
0      J.T    john         72          90           79            93
1      M.J  martha         80          85           76            89
2      M.K    mary         75          92           85            88
3      O.S   oscar         95          83           95            83
  1. spread
In [55]: df = pd.DataFrame({
    ...:     'name': ['mary', 'oscar', 'martha', 'john'] * 2,
    ...:     'subject': np.repeat(['math', 'art'], 4),
    ...:     'grade': [92, 83, 85, 90, 75, 95, 80, 72]
    ...: })

In [56]: df >> spread('subject', 'grade')
Out[56]:
     name  art  math
0    john   72    90
1  martha   80    85
2    mary   75    92
3   oscar   95    83

3.2 character string

  1. extract: use regular expressions to split character string columns
  2. separate: use the specified delimiter to split character string columns
  3. separate_rows: split a row of variable values into multiple rows
In [57]: df = pd.DataFrame({
    ...:     'parent': ['martha', 'james', 'alice'],
    ...:     'child': ['leah', 'joe,vinny,laura', 'pat,lee'],
    ...:     'age': ['3', '12,6,4', '2,7']
    ...: })

In [58]: df >> separate_rows('child', 'age')
Out[58]:
   parent  child age
0  martha   leah   3
1   james    joe  12
2   james  vinny   6
3   james  laura   4
4   alice    pat   2
5   alice    lee   7
  1. unite: merge multiple columns into one column

4. summary

plydataMany functions are also provided to process classification variables, so we won’t explain them one by one. Those interested can learn by referring to the following documents. Each function has corresponding examples, which are clear and easy to understand

https://plydata.readthedocs.io/en/stable/api.html