Implementation of pandas sparse data structure

Date: 2021-10-15
Catalogue
  • Brief introduction
  • Example of sparse data
  • SparseArray
  • SparseDtype
  • Sparse accessor
  • Computing with sparse arrays
  • SparseSeries and SparseDataFrame

Brief introduction

If the data contains many NaN values, storing all of them wastes space. To address this, pandas provides sparse data structures, which store such data efficiently by not physically storing the NaN values.

Example of sparse data

We create an array, set most of its values to NaN, and then use this array to create a SparseArray (assuming the usual import numpy as np and import pandas as pd):


In [1]: arr = np.random.randn(10)

In [2]: arr[2:-2] = np.nan

In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))

In [4]: ts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]

The dtype here is Sparse[float64, nan], which means the NaN values in the array are not actually stored; only the non-NaN data is stored, and that data is of type float64.
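To see the space savings in practice, here is a small sketch (not from the original article) comparing the memory footprint of a dense Series with its sparse counterpart:

```python
import numpy as np
import pandas as pd

arr = np.random.randn(100_000)
arr[100:] = np.nan  # all but the first 100 values are NaN

dense = pd.Series(arr)
sparse = pd.Series(pd.arrays.SparseArray(arr))

dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)

print(dense_bytes)   # 800000 bytes: 100,000 float64 values
print(sparse_bytes)  # far smaller: only the 100 non-NaN values are stored
```

The dense Series pays eight bytes for every element, NaN or not; the sparse one stores only the 100 real values plus their integer positions.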

SparseArray

pandas.arrays.SparseArray is an ExtensionArray used to store sparse data:


In [13]: arr = np.random.randn(10)

In [14]: arr[2:5] = np.nan

In [15]: arr[7:8] = np.nan

In [16]: sparr = pd.arrays.SparseArray(arr)

In [17]: sparr
Out[17]: 
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

Use numpy.asarray() to convert it back to a regular NumPy array:


In [18]: np.asarray(sparr)
Out[18]: 
array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
           nan,  0.606 ,  1.3342])
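Internally, a SparseArray keeps only the stored values and their positions, which is where the savings come from. A short sketch using the same array construction as above:

```python
import numpy as np
import pandas as pd

arr = np.random.randn(10)
arr[2:5] = np.nan
arr[7:8] = np.nan
sparr = pd.arrays.SparseArray(arr)

# Only the non-NaN values are physically stored:
print(sparr.sp_values)         # the 6 stored float64 values
print(sparr.sp_index.indices)  # their positions: [0 1 5 6 8 9]
print(sparr.fill_value)        # nan

# to_dense() is another way to get back a regular ndarray:
dense = sparr.to_dense()
print(type(dense))             # numpy.ndarray
```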

SparseDtype

SparseDtype represents the sparse type. It contains two pieces of information: the data type of the non-NaN values, and the fill value used for the omitted entries, such as NaN:


In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]

A SparseDtype can be constructed as follows:


In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], NaT]

You can also specify the fill value:


In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
   ....:                fill_value=pd.Timestamp('2017-01-01'))
   ....: 
Out[21]: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]
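A SparseDtype can also be passed to astype() to convert existing dense data; a sketch (not from the original article):

```python
import numpy as np
import pandas as pd

# NaN as the fill value: only 1.0 and 2.0 are stored.
s = pd.Series([1.0, np.nan, np.nan, 2.0])
sp = s.astype(pd.SparseDtype("float64", np.nan))
print(sp.dtype)  # Sparse[float64, nan]

# With 0 as the fill value, the zeros become the omitted entries:
z = pd.Series([0, 0, 1, 2]).astype(pd.SparseDtype("int64", 0))
print(z.dtype)   # Sparse[int64, 0]
```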

Sparse accessor

Sparse-specific attributes are available through the .sparse accessor:


In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

In [24]: s.sparse.density
Out[24]: 0.5

In [25]: s.sparse.fill_value
Out[25]: 0
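The accessor exposes a few more attributes of the underlying SparseArray; a sketch continuing the example above:

```python
import pandas as pd

s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")

# density = fraction of values actually stored (those differing from fill_value)
print(s.sparse.density)    # 0.5: two of the four values differ from fill_value 0
print(s.sparse.sp_values)  # [1 2], the stored values
print(s.sparse.npoints)    # 2, the number of stored points
```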

Computing with sparse arrays

NumPy ufuncs can be applied directly to a SparseArray and will return a SparseArray:


In [26]: arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])

In [27]: np.abs(arr)
Out[27]: 
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)
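The ufunc is applied to the fill value as well, so a non-NaN fill value is transformed along with the stored data. A sketch (the specific values are illustrative):

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([1.0, -1.0, 2.0, -2.0], fill_value=-3.0)
result = np.abs(arr)

print(result.fill_value)   # 3.0: abs() was applied to the fill value too
print(np.asarray(result))  # [1. 1. 2. 2.]
```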

SparseSeries and SparseDataFrame

SparseSeries and SparseDataFrame were removed in version 1.0.0, replaced by the more flexible SparseArray.
Here are the differences between the old and new usage:


# Previous way
>>> pd.SparseDataFrame({"A": [0, 1]})

# New way
In [31]: pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Out[31]: 
   A
0  0
1  1
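An existing dense DataFrame can likewise be converted column by column with astype(); a sketch (not from the original article):

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 1, 2], "B": [0, 3, 0, 0]})
sdf = df.astype(pd.SparseDtype("int64", 0))

print(sdf.dtypes)          # both columns become Sparse[int64, 0]
print(sdf.sparse.density)  # fraction of stored values across the whole frame
```

Here 3 of the 8 values differ from the fill value 0, so the density is 0.375.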

If you have a SciPy sparse matrix, you can use DataFrame.sparse.from_spmatrix():


# Previous way
>>> from scipy import sparse
>>> mat = sparse.eye(3)
>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])

# New way
In [32]: from scipy import sparse

In [33]: mat = sparse.eye(3)

In [34]: df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])

In [35]: df.dtypes
Out[35]: 
A    Sparse[float64, 0]
B    Sparse[float64, 0]
C    Sparse[float64, 0]
dtype: object
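The conversion also works in the other direction: the DataFrame-level .sparse accessor can hand the data back to SciPy or densify it. A sketch (assumes SciPy is installed):

```python
import pandas as pd
from scipy import sparse

mat = sparse.eye(3)
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])

coo = df.sparse.to_coo()      # back to a SciPy COO matrix
dense = df.sparse.to_dense()  # back to a plain dense DataFrame

print(coo.shape)     # (3, 3)
print(dense.dtypes)  # plain float64 columns
```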

This concludes this article on the implementation of pandas sparse data structures.