My main task these days is to review the thesis “towards long term fairness in recommendation”^{[1]}The algorithm described in is programmed, and then the effect of the algorithm is tested and recorded. The following describes my work in four parts: detailed implementation of model algorithm, data set, model evaluation criteria and test result record.

## Detailed implementation of model algorithm

According to the description of the paper, the core part of the algorithm in the paper is the solution of the following constrained optimization problems:

\theta_{k+1} = \underset{\theta}{\arg\max}g^{T}(\theta-\theta_{k}) \\

s.t. \quad c+b^{T}(\theta – \theta_{k}) \leqslant 0 \\

\frac{1}{2}(\theta – \theta_{k})^{T}H(\theta-\theta_{k}) \leqslant \delta

\end{matrix}

\tag{1}

\]

This problem is called constrained strategy optimization problem by the author of this paper. The paper mentioned that the specific solution algorithm of this problem can refer to another paper, “constrained policy optimization”^{[2]}。 I consulted another paper, which demonstrated in detail the solution method of the following problems:

p^{*} = \underset{x}{\min} g^{T}x \\

s.t. \quad b^{T}x + c \leqslant 0 \\

x^{T}Hx \leqslant \delta

\end{matrix}

\tag{2}

\]

here\(g, B, X \ in \ mathbb {r} ^ {n}, C, \ Delta \ in \ mathbb {r}, \ delta > 0, H \ in \ mathbb {s} ^ {n}, \ text {and} h \ succ 0 \)。

The problem is an optimization problem with linear objective function and linear and quadratic constraints. It is a typical convex optimization problem. When there is at least one strictly feasible solution, the strong duality is satisfied (according to Slater’s condition theory). In this paper, the strong duality is used to obtain the analytical solution of the problem. The following is the specific derivation. Let’s first define the Lagrange function:

\tag{3}

\]

According to Lagrange duality, the dual problem of the original problem is a minimax problem:

\tag{4}

\]

Ask first\(\underset{x}{\min}L(x, \lambda, \nu)\), we order\(\nabla_{x}L(x, \lambda, \nu)=0\), solution

\tag{5}

\]

take\(x^*\)When Lagrange function is introduced, the problem is further transformed into

– \frac{1}{2\lambda}(g+\nu b)^TH^{-1}(g+\nu b) + (\nu c – \frac{1}{2}\lambda \delta)

\tag{6}

\]

We replace the expression with variables to make\(q=g^T H^{-1} g, r = g^TH^{-1}b, s=b^TH^{-1} b\), which can be further simplified as

– \frac{1}{2\lambda}(q + 2\nu r + \nu^2 s) + (\nu c – \frac{1}{2} \lambda \delta)

\tag{7}

\]

We are again by\(\frac{\partial L(\lambda, \nu)}{\partial \nu} = – \frac{1}{2\lambda}(2r+2\nu s) + c\), we are\(\mathbb{R}_{+}\)The univariate convex quadratic function optimization theory can be obtained

\tag{8}

\]

Further, we turn the problem into

Attention, here\(\Lambda_{a} = \{\lambda | \lambda c – r > 0, \lambda \geqslant 0\}\)，\(\Lambda_{b} = \{\lambda | \lambda c – r \leqslant 0, \lambda \geqslant 0\}\)。

So, to sum up, when there is at least one feasible solution, the optimal solution\(x^{*}\)satisfy

\tag{10}

\]

here\(\lambda^*\)and\(\nu^*\)As defined below

here\(q=g^TH^{-1}g, r=g^TH^{-1}b\)And\(s=b^TH^{-1}b\)。

Next is the programming details. First, we need to calculate\(q\)，\(r\), \(s\)Value of. This requires efficient calculation\(H^{-1}g, H^{-1}b\), the time complexity of the inverse of Hessian matrix is very high. We solve the equations\(Hx_{1}=g,Hx_{2}=b\)To achieve. But the Hessian matrix is likely to be sparse, which further increases the

Our solution is difficult, so we use the conjugate gradient method to solve the equations. The specific implementation code is as follows:

```
def cg_solver(Avp_fun, b, device, max_iter=10):
device = device
x = torch.zeros_like(b).to(device)
r = b.clone()
p = b.clone()
for i in range(max_iter):
Avp = Avp_fun(p, retain_graph=True)
alpha = torch.matmul(r, r) / torch.matmul(p, Avp)
x += alpha * p
if i == max_iter - 1:
return x
r_new = r - alpha * Avp
beta = torch.matmul(r_new, r_new) / torch.matmul(r, r
r = r_new
p = r + beta * p
```

Note that here we solve the equation according to the conjugate gradient method\(Ax=b\)There is a step to be calculated halfway\(A\)Search direction vector\(p\)Product of\(Ap\)Here, we use a special function implementation. We define the function as a function closure (Python language features are used here). The variables defined in the inner layer of the closure are completely isolated from the outside world. The specific implementation is as follows:

```
def get_Hvp_fun(functional_output, inputs, damping_coef=0.0):
inputs = list(inputs)
grad_f = flat_grad(functional_output, inputs, create_graph=True)
def Hvp_fun(v, retain_graph=True):
gvp = torch.matmul(grad_f, v)
Hvp = flat_grad(gvp, inputs, retain_graph=retain_graph)
Hvp += damping_coef * v
return Hvp
return Hvp_fun
```

In this way, we efficiently calculate\(H^{-1}g\)and\(H^{-1}b\)So as to obtain\(q,r,s\)Value of. According to the definition of the original paper\(c=J_{C}(\pi_{k})-d\)。 In this way, the basic variables Q, R, s and C required by our algorithm have been obtained if the problem is solvable, that is, if\(c^2/s-\delta >0\)And\(c>0\)We can use the analytical formula derived above\((11)\)To calculate the dual variable\(\lambda^*\)and\(\nu^*\)And then according to the analytical formula\((10)\)Calculated\(x^*\)。 Where dual variables are calculated\(\lambda^*\)and\(\nu^*\)The functions are implemented as follows:

```
def calc_dual_vars(self, q, r, s, c):
if c < 0.0 and c ** 2 / s - 2 * self.max_kl > 0.0:
lam = torch.sqrt(q / (2 * self.max_kl))
nu = 0.0
return lam, nu
A = q - r ** 2 / s
B = 2 * self.max_kl - c ** 2 / s
lam_mid = r / c
# lam_a*
lam_a = torch.sqrt(A / B)
# lam_b*
lam_b = torch.sqrt(q / (2 * self.max_kl))
f_mid = -0.5 * (q / lam_mid + 2 * lam_mid * self.max_kl)
f_a = -torch.sqrt(A * B) - r * c / s
f_b = -torch.sqrt(2 * q * self.max_kl)
if lam_mid > 0:
if c < 0:
if lam_a > lam_mid:
lam_a = lam_mid
f_a = f_mid
if lam_b < lam_mid:
lam_b = lam_mid
f_b = f_mid
else:
if lam_a < lam_mid:
lam_a = lam_mid
f_a = f_mid
if lam_b > lam_mid:
lam_b = lam_mid
f_b = f_mid
else:
if c < 0:
lam = lam_b
else:
lam = lam_a
# lam*
lam = lam_a if f_a >= f_b else lam_b
# v*
nu = max(0.0, (lam * c - r) / s)
return lam, nu
```

So far, we have completed the core part of the algorithm, that is, the policy function\(\pi\)Parameters of\(\theta\)How to update according to constraints (although we have not fully implemented its training algorithm).

Next, we describe a more complete framework, the strategy function of reinforcement learning model\(\pi\)How to train. The complete training algorithm in the paper adopts the actor judge mode, and its schematic diagram is shown in the figure below:

Among them, the strategy function in the actor\(\pi_{\theta}\)State value function\(V_{\omega}^{\pi}(s_{t})\)、

Cost function\(V_{\phi}^{\pi}(s_{t})\)Are implemented by neural networks. The parameters of strategy function, state value function and cost function are\(\theta\)，\(\omega\)，\(\phi\)。

We construct the state value function\(V_{\omega}(s_t)\)To approximate the real state value function\(V_{w}^{\pi}(s_{t})\)And then used to optimize actors. The jury network is optimized according to time differential learning, in which the mean square error (MSE) needs to be minimized:

\tag{12}

\]

here\(y_{t} = r_{t} + \gamma_{r}V_{w}(s_{t+1})\)。 In addition, in order to improve the accuracy, we introduce a separate cost function for constraint strategy optimization

Judges\(V_{\phi}(s)\)And according to formula and\((12)\)Update in a similar way:

\tag{13}

\]

here\(y_{t} = c_{t} + \gamma_{c}V_{\phi}(s_{t+1})\)。 In this way, the core algorithm of our training part can be described, as shown in the following algorithm.

Note that the experience replay dataset stores a total of\(T\)Trajectories of time steps, from which we sample the state transition quintuples\((s_{t}, a_{t}, r_{t}, c_{t}, s_{t+1})\)。 The empirical playback data set can be obtained only after we process the data once.

In this way, our main algorithm flow has been described. Next, we need to test the model. As in the previous reproduction paper, we use the user transaction data (including user ID, item ID, user score and timestamp) in movielens-1m data set to test the FCPO model proposed in the paper.

## data set

We first sort the user transaction data in the order of timestamp, and then divide the data into training set and test set in the proportion of 4:1. Among them, we divide the last item pushed by each user in the training set into verification set. Some basic statistics of movielens-1m dataset are shown in the table below. We divide the data into categories based on popularity (for example, according to the number of exposures of items)\(G_{0}\)and\(G_{1}\)Two groups. Specifically, the items with the top 20% of exposure times belong to the popular group\(G_{0}\), the last 80% of items belong to the long tail group\(G_{1}\)。

Number of users | Number of items | Number of actions / users | Number of actions / items | Number of actions | density |
---|---|---|---|---|---|

6040 | 3706 | 166 | 270 | 1000209 | 4.468 |

In addition, for the recommendation system based on reinforcement learning, the initial state of each user during training is the first five items clicked in the training set, and the initial state during testing is in the training set

The last 5 items clicked. For simplicity, when we test here, we let the RL agent recommend one item to the user at a time, but in fact, the length of the recommendation list of the algorithm can be adjusted.

## Model evaluation criteria

We use some common criteria, such as recall rate and F1 score, to evaluate the performance of our recommendation system. In addition to model-based accuracy

At the same time, we also add two fairness measures. For items exposed to specific users, we measure the Gini coefficient; For items exposed to a specific group, we measure the prevalence rate.

Gini coefficient measures the unfairness between frequency distributions (such as the number of exposures of items), which can be regarded as an individual level fairness measure. Given exposure records from each article,

\(\mathcal{M}=[g_{1}, g_{2}, …, g_{|\mathcal{I}|}]\), Gini coefficient can be expressed as\((14)\)To calculate

\tag{14}

\]

here\(\bar{g}\)Represents the average number of exposures of all items. Popularity ratio refers to the proportion of popular items in the recommendation list, which can be regarded as a measure of fairness at the popularity level.

These two fairness measures are more fair than the original recommendation system.

## test result

Next is our test results on movielens-1m. The test results are shown in the table below. Here, the parameters of fairness constraints corresponding to fcpo-1, fcpo-2 and fcpo-3 are\(\alpha^{‘}=1, \alpha^{‘}=0.8\)And\(\alpha^{‘}=0.4\)，

And we assume that the length of the recommendation list is k = 5.

method | Recall rate (%)\(\uparrow\) | F1(%) \(\uparrow\) | Gini coefficient (%)\(\downarrow\) | Attention ranking (%)\(\uparrow\) |
---|---|---|---|---|

FCPO-1 | 2.033 | 2.668 | 99.81 | 99.28 |

FCPO-2 | 1.520 | 2.015 | 99.47 | 99.28 |

FCPO-3 | 0.998 | 1.449 | 99.47 | 99.28 |

## reference

- [1] Ge Y, Liu S, Gao R, et al. Towards Long-term Fairness in Recommendation[C]//Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 2021: 445-453.
- [2] Achiam J, Held D, Tamar A, et al. Constrained policy optimization[C]//International Conference on Machine Learning. PMLR, 2017: 22-31.