Reading papers on fairness of recommendation systems (5)

Time: 2021-10-25

My main task these days has been to work through the paper "Towards Long-term Fairness in Recommendation" [1], implement the algorithm it describes, and then test the algorithm and record the results. Below I describe this work in four parts: the detailed implementation of the model algorithm, the dataset, the model evaluation criteria, and the test results.

Detailed implementation of model algorithm

According to the paper, the core of the algorithm is solving the following constrained optimization problem:

\[\begin{matrix}
\theta_{k+1} = \underset{\theta}{\arg\max}\ g^{T}(\theta-\theta_{k}) \\
s.t. \quad c+b^{T}(\theta - \theta_{k}) \leqslant 0 \\
\frac{1}{2}(\theta - \theta_{k})^{T}H(\theta-\theta_{k}) \leqslant \delta
\end{matrix}
\tag{1}
\]

The authors call this a constrained policy optimization problem, and note that the solution procedure follows another paper, "Constrained Policy Optimization" [2]. I consulted that paper, which derives in detail the solution of the following problem:

\[\begin{matrix}
p^{*} = \underset{x}{\min} g^{T}x \\
s.t. \quad b^{T}x + c \leqslant 0 \\
x^{T}Hx \leqslant \delta
\end{matrix}
\tag{2}
\]

where \(g, b, x \in \mathbb{R}^{n}\), \(c, \delta \in \mathbb{R}\), \(\delta > 0\), \(H \in \mathbb{S}^{n}\), and \(H \succ 0\).
This is an optimization problem with a linear objective, one linear constraint, and one quadratic constraint, i.e., a typical convex optimization problem. When at least one strictly feasible point exists, strong duality holds (by Slater's condition), and the paper uses strong duality to obtain an analytical solution. The derivation is as follows. First, define the Lagrangian:

\[L(x, \lambda, \nu) = g^{T}x + \frac{\lambda}{2}(x^THx-\delta) + \nu(b^Tx+c)
\tag{3}
\]

According to Lagrangian duality, the dual of the original problem is the following max-min problem:

\[\underset{\nu \geqslant 0}{\underset{\lambda \geqslant 0}{\max}} \underset{x}{\min}L(x, \lambda, \nu)
\tag{4}
\]

We first solve \(\underset{x}{\min}L(x, \lambda, \nu)\): setting \(\nabla_{x}L(x, \lambda, \nu)=0\) and solving for \(x\) gives

\[ x^* = - \frac{1}{\lambda}H^{-1}(g+\nu b)
\tag{5}
\]

Substituting \(x^*\) back into the Lagrangian, the problem becomes

\[ \underset{\nu \geqslant 0}{\underset{\lambda \geqslant 0}{\max}}
- \frac{1}{2\lambda}(g+\nu b)^TH^{-1}(g+\nu b) + (\nu c - \frac{1}{2}\lambda \delta)
\tag{6}
\]
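For the record, here is the intermediate algebra behind \((6)\) (my own filling-in of the step; it is not written out in either paper). Substituting \(x^{*}\) from \((5)\) into each term of the Lagrangian gives

\[\begin{matrix}
g^{T}x^{*} = -\frac{1}{\lambda}g^{T}H^{-1}(g+\nu b), \\
\frac{\lambda}{2}\,x^{*T}Hx^{*} = \frac{1}{2\lambda}(g+\nu b)^{T}H^{-1}(g+\nu b), \\
\nu b^{T}x^{*} = -\frac{\nu}{\lambda}b^{T}H^{-1}(g+\nu b),
\end{matrix}
\]

and summing these three terms together with \(-\frac{1}{2}\lambda\delta + \nu c\) yields \((6)\).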

Introducing the shorthand \(q=g^T H^{-1} g\), \(r = g^TH^{-1}b\), and \(s=b^TH^{-1} b\), this can be further simplified to

\[ \underset{\nu \geqslant 0}{\underset{\lambda \geqslant 0}{\max}}
- \frac{1}{2\lambda}(q + 2\nu r + \nu^2 s) + (\nu c - \frac{1}{2} \lambda \delta)
\tag{7}
\]

Next, we take the partial derivative \(\frac{\partial L(\lambda, \nu)}{\partial \nu} = -\frac{1}{2\lambda}(2r+2\nu s) + c\). Maximizing this quadratic in \(\nu\) over \(\mathbb{R}_{+}\) (it is concave in \(\nu\) since \(s>0\) and \(\lambda>0\)) gives

\[ \nu = (\frac{\lambda c- r}{s})_{+}
\tag{8}
\]

Substituting \(\nu\) back, the problem further reduces to a one-dimensional problem in \(\lambda\):

\[\underset{\lambda \geqslant 0}{\max}
\begin{cases}
f_{a}(\lambda) \triangleq \frac{1}{2\lambda}\left(\frac{r^{2}}{s}-q\right)+\frac{\lambda}{2}\left(\frac{c^{2}}{s}-\delta\right)-\frac{rc}{s} & \text{if } \lambda \in \Lambda_{a} \\
f_{b}(\lambda) \triangleq -\frac{1}{2}\left(\frac{q}{\lambda}+\lambda\delta\right) & \text{if } \lambda \in \Lambda_{b}
\end{cases}
\tag{9}
\]
Note that here \(\Lambda_{a} = \{\lambda \mid \lambda c - r > 0, \lambda \geqslant 0\}\) and \(\Lambda_{b} = \{\lambda \mid \lambda c - r \leqslant 0, \lambda \geqslant 0\}\), i.e., the two sets correspond to whether the optimal \(\nu\) in \((8)\) is positive or zero.
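To go from \((9)\) to the closed form below, set the derivative of each branch to zero (this step is my own filling-in; it matches the implementation later in this post). The unconstrained maximizers of \(f_{a}\) and \(f_{b}\) are

\[ \lambda_{a} = \sqrt{\frac{q-r^{2}/s}{\delta-c^{2}/s}}, \qquad \lambda_{b} = \sqrt{\frac{q}{\delta}},
\]

and the optimal \(\lambda\) on each branch is obtained by projecting these values onto \(\Lambda_{a}\) and \(\Lambda_{b}\) respectively, then picking whichever branch attains the larger objective value.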

To summarize, when at least one feasible solution exists, the optimal solution \(x^{*}\) satisfies

\[ x^* = - \frac{1}{\lambda^*}H^{-1}(g+\nu^*b)
\tag{10}
\]

where \(\lambda^*\) and \(\nu^*\) are defined as

\[ \begin{matrix}
\lambda_{a}^{*} = \operatorname{Proj}\left(\sqrt{\frac{q-r^{2}/s}{\delta-c^{2}/s}},\ \Lambda_{a}\right), \quad
\lambda_{b}^{*} = \operatorname{Proj}\left(\sqrt{\frac{q}{\delta}},\ \Lambda_{b}\right) \\
\lambda^{*} = \begin{cases}
\lambda_{a}^{*} & \text{if } f_{a}(\lambda_{a}^{*}) \geqslant f_{b}(\lambda_{b}^{*}) \\
\lambda_{b}^{*} & \text{otherwise}
\end{cases}, \quad
\nu^{*} = \left(\frac{\lambda^{*}c-r}{s}\right)_{+}
\end{matrix}
\tag{11}
\]
where, as before, \(q=g^TH^{-1}g\), \(r=g^TH^{-1}b\), and \(s=b^TH^{-1}b\), and \(\operatorname{Proj}(\cdot, \Lambda)\) denotes projection onto the interval \(\Lambda\).
Now for the implementation details. First, we need to compute the values of \(q\), \(r\), and \(s\), which requires computing \(H^{-1}g\) and \(H^{-1}b\) efficiently; explicitly inverting the Hessian is far too expensive. Instead, we solve the linear systems \(Hx_{1}=g\) and \(Hx_{2}=b\). Moreover, the Hessian is high-dimensional and we never want to form it explicitly, which makes a direct solve impractical, so we use the conjugate gradient method to solve these systems. The implementation is as follows:

import torch

def cg_solver(Avp_fun, b, device, max_iter=10):
    # Conjugate gradient method for solving Ax = b, where Avp_fun(p) returns
    # the matrix-vector product A @ p without ever forming A explicitly.
    x = torch.zeros_like(b).to(device)
    r = b.clone()   # residual
    p = b.clone()   # search direction
    for i in range(max_iter):
        Avp = Avp_fun(p, retain_graph=True)
        alpha = torch.matmul(r, r) / torch.matmul(p, Avp)   # step size along p
        x += alpha * p
        if i == max_iter - 1:
            return x
        r_new = r - alpha * Avp
        beta = torch.matmul(r_new, r_new) / torch.matmul(r, r)
        r = r_new
        p = r + beta * p
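As a quick sanity check (my own, not part of the reproduced code), we can compare cg_solver against a direct solve on a small symmetric positive definite system:

import torch

# Build a small SPD system and solve it both ways.
torch.manual_seed(0)
n = 8
M = torch.randn(n, n)
A = M @ M.t() + n * torch.eye(n)                 # symmetric positive definite
b_vec = torch.randn(n)

Avp_fun = lambda p, retain_graph=True: A @ p     # explicit matrix-vector product
x_cg = cg_solver(Avp_fun, b_vec, device="cpu")
x_direct = torch.linalg.solve(A, b_vec)
print(torch.allclose(x_cg, x_direct, atol=1e-3)) # expected: True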

Note that when solving \(Ax=b\) with the conjugate gradient method, each iteration needs the product \(Ap\) of the matrix \(A\) with the current search direction \(p\). We implement this with a dedicated function, defined as a closure (a Python language feature): the variables captured inside the closure are isolated from the outside. The implementation is as follows:

def get_Hvp_fun(functional_output, inputs, damping_coef=0.0):
    # Returns a closure that computes Hessian-vector products H @ v, where H is
    # the Hessian of functional_output with respect to inputs.
    inputs = list(inputs)
    grad_f = flat_grad(functional_output, inputs, create_graph=True)  # first-order gradient, kept in the graph
    def Hvp_fun(v, retain_graph=True):
        gvp = torch.matmul(grad_f, v)                                 # scalar g^T v
        Hvp = flat_grad(gvp, inputs, retain_graph=retain_graph)       # gradient of g^T v equals H v
        Hvp += damping_coef * v                                       # optional damping for numerical stability
        return Hvp
    return Hvp_fun
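The trick behind this closure is standard: differentiating the scalar \(g^{T}v\) a second time yields \(Hv\) without ever building the full Hessian. Below is a minimal demonstration of the same idea written directly with torch.autograd (it does not use the repository's flat_grad helper), checked against an explicitly computed Hessian on a tiny test function:

import torch

# Hessian-vector product via double backprop, verified on
# f(x) = 0.5 x^T A x + sum(x^4), whose exact Hessian is A + diag(12 x^2).
torch.manual_seed(0)
x = torch.randn(4, requires_grad=True)
A = torch.randn(4, 4)
A = A @ A.t()                                               # symmetric test matrix
f = 0.5 * x @ A @ x + (x ** 4).sum()

grad_f = torch.autograd.grad(f, x, create_graph=True)[0]    # keep the graph for a second derivative
v = torch.randn(4)
Hv = torch.autograd.grad(grad_f @ v, x)[0]                  # H @ v, computed by differentiating g^T v

H_exact = A + torch.diag(12.0 * x.detach() ** 2)
print(torch.allclose(Hv, H_exact @ v, atol=1e-4))           # expected: True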

In this way we efficiently compute \(H^{-1}g\) and \(H^{-1}b\), and from them the values of \(q\), \(r\), and \(s\). According to the definition in the original paper, \(c=J_{C}(\pi_{k})-d\). With the basic quantities \(q\), \(r\), \(s\), and \(c\) in hand, and provided the problem is feasible, we can use the analytical formula \((11)\) derived above to compute the dual variables \(\lambda^*\) and \(\nu^*\), and then compute \(x^*\) from \((10)\). (In the special case \(c<0\) and \(c^{2}/s-\delta>0\), the whole trust region already satisfies the linear constraint, so \(\nu^*=0\) and \(\lambda^*=\sqrt{q/\delta}\); this is the first branch of the code below.) The function that computes the dual variables \(\lambda^*\) and \(\nu^*\) is implemented as follows:

def calc_dual_vars(self, q, r, s, c):
    # Note: 2 * self.max_kl plays the role of delta in (2), since the original
    # problem (1) bounds (1/2) x^T H x by max_kl.
    # Case 1: the whole trust region already satisfies the linear constraint,
    # so only the quadratic (trust-region) constraint is active.
    if c < 0.0 and c ** 2 / s - 2 * self.max_kl > 0.0:
        lam = torch.sqrt(q / (2 * self.max_kl))
        nu = 0.0
        return lam, nu
    A = q - r ** 2 / s                 # always >= 0 by Cauchy-Schwarz
    B = 2 * self.max_kl - c ** 2 / s
    lam_mid = r / c                    # boundary between Lambda_a and Lambda_b
    # lam_a: unconstrained maximizer of f_a
    lam_a = torch.sqrt(A / B)
    # lam_b: unconstrained maximizer of f_b
    lam_b = torch.sqrt(q / (2 * self.max_kl))
    f_mid = -0.5 * (q / lam_mid + 2 * lam_mid * self.max_kl)
    f_a = -torch.sqrt(A * B) - r * c / s
    f_b = -torch.sqrt(2 * q * self.max_kl)
    if lam_mid > 0:
        # Project lam_a and lam_b onto Lambda_a and Lambda_b respectively.
        if c < 0:
            if lam_a > lam_mid:
                lam_a = lam_mid
                f_a = f_mid
            if lam_b < lam_mid:
                lam_b = lam_mid
                f_b = f_mid
        else:
            if lam_a < lam_mid:
                lam_a = lam_mid
                f_a = f_mid
            if lam_b > lam_mid:
                lam_b = lam_mid
                f_b = f_mid
        # lam*: keep the branch with the larger objective value
        lam = lam_a if f_a >= f_b else lam_b
    else:
        # One of the two sets is empty; the other covers all lambda >= 0.
        if c < 0:
            lam = lam_b
        else:
            lam = lam_a
    # nu*
    nu = max(0.0, (lam * c - r) / s)
    return lam, nu
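To show how these pieces fit together, here is a hedged end-to-end sketch (my own wiring, not the paper's code): given the objective gradient g, the constraint gradient b, the constraint value c, and a Hessian-vector product function, it computes q, r, s with cg_solver, obtains the dual variables from calc_dual_vars (called here as a plain function with a stand-in "self" holding max_kl), and forms the update direction of formula \((10)\):

import torch
from types import SimpleNamespace

# Stand-in for the trainer object; only max_kl is needed by calc_dual_vars.
self_stub = SimpleNamespace(max_kl=0.01)

def constrained_step(g, b, c, Hvp_fun, device="cpu"):
    # Approximate H^{-1} g and H^{-1} b with conjugate gradient.
    H_inv_g = cg_solver(Hvp_fun, g, device)
    H_inv_b = cg_solver(Hvp_fun, b, device)
    q = torch.dot(g, H_inv_g)
    r = torch.dot(g, H_inv_b)
    s = torch.dot(b, H_inv_b)
    lam, nu = calc_dual_vars(self_stub, q, r, s, c)
    # Formula (10): x* = -(1/lambda*) H^{-1} (g + nu* b)
    return -(H_inv_g + nu * H_inv_b) / lam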

So far we have implemented the core of the algorithm: how the parameters \(\theta\) of the policy \(\pi\) are updated subject to the constraints (although this is not yet the full training algorithm). Next, we describe the more complete framework, i.e., how the policy \(\pi\) of the reinforcement learning model is trained. The complete training algorithm in the paper adopts the actor-critic paradigm, and its schematic diagram is shown in the figure below:
Schematic diagram of the actor-critic training method
Here the policy \(\pi_{\theta}\) in the actor, the state-value function \(V_{\omega}^{\pi}(s_{t})\), and the cost-value function \(V_{\phi}^{\pi}(s_{t})\) are all implemented as neural networks, with parameters \(\theta\), \(\omega\), and \(\phi\) respectively. We construct the value network \(V_{\omega}(s_t)\) to approximate the true state-value function \(V_{\omega}^{\pi}(s_{t})\), which is then used to optimize the actor. The critic network is trained by temporal-difference learning, minimizing the mean squared error (MSE):

\[ L(\omega) = \sum_{t}(y_{t} - V_{\omega}(s_{t}))^2
\tag{12}
\]

where \(y_{t} = r_{t} + \gamma_{r}V_{\omega}(s_{t+1})\). In addition, for the constrained policy optimization we introduce a separate cost critic \(V_{\phi}(s)\), which is updated in a way analogous to \((12)\):

\[ L(\phi) = \sum_{t}(y_{t}-V_{\phi}(s_{t}))^2
\tag{13}
\]

where \(y_{t} = c_{t} + \gamma_{c}V_{\phi}(s_{t+1})\). With this, the core of the training procedure can be described, as shown in the following algorithm.

Pseudocode of the FCPO algorithm

Note that the experience replay buffer stores trajectories totalling \(T\) time steps, from which we sample state-transition five-tuples \((s_{t}, a_{t}, r_{t}, c_{t}, s_{t+1})\). The replay buffer only becomes available after we have processed the data once.
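As a concrete illustration of the critic updates \((12)\) and \((13)\), here is a hedged PyTorch sketch (my own, simplified; the MLP architecture and hyperparameters are assumptions, and episode termination is ignored for brevity). The same update function serves both the reward critic \(V_{\omega}\) and the cost critic \(V_{\phi}\), with \(r_{t}\) or \(c_{t}\) as the target signal:

import torch
import torch.nn as nn

class Critic(nn.Module):
    # A small MLP state-value network (architecture assumed for illustration).
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)

def critic_update(v_net, optimizer, s_t, signal_t, s_next, gamma):
    # signal_t is r_t for the reward critic or c_t for the cost critic.
    with torch.no_grad():
        y = signal_t + gamma * v_net(s_next)     # TD target y_t
    loss = ((y - v_net(s_t)) ** 2).sum()         # squared-error loss as in (12)/(13)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: one update of a reward critic on a random batch (shapes only).
critic = Critic(state_dim=10)
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
s_t, s_next = torch.randn(32, 10), torch.randn(32, 10)
critic_update(critic, opt, s_t, torch.randn(32), s_next, gamma=0.9)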
With that, the main algorithm flow has been described. Next, we need to test the model. As in the previous paper reproductions, we use the user interaction data (user ID, item ID, rating, and timestamp) from the MovieLens-1M dataset to test the FCPO model proposed in the paper.

Dataset

We first sort the user interactions by timestamp and split the data into a training set and a test set at a ratio of 4:1. Within the training set, the last item of each user is held out as a validation set. Some basic statistics of the MovieLens-1M dataset are shown in the table below. We also split the items into two groups, \(G_{0}\) and \(G_{1}\), based on popularity (i.e., the number of exposures of each item): the top 20% of items by exposure count form the popular group \(G_{0}\), and the remaining 80% form the long-tail group \(G_{1}\).

| Number of users | Number of items | Actions per user | Actions per item | Number of actions | Density (%) |
|---|---|---|---|---|---|
| 6040 | 3706 | 166 | 270 | 1000209 | 4.468 |

In addition, for the RL-based recommender, each user's initial state during training is the first five items they clicked in the training set, and the initial state during testing is the last five items they clicked in the training set. For simplicity, in our tests the RL agent recommends one item to the user at a time, but the length of the recommendation list is in fact adjustable.
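For reference, here is a hedged pandas sketch (my own, simplified) of the preprocessing described above, assuming the standard MovieLens-1M ratings.dat file; the column names, the global chronological split, and the exact grouping details are my assumptions rather than the paper's code:

import pandas as pd

# Load and sort the interactions chronologically.
df = pd.read_csv("ml-1m/ratings.dat", sep="::", engine="python",
                 names=["user_id", "item_id", "rating", "timestamp"])
df = df.sort_values("timestamp")

# 4:1 chronological split into training and test data.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

# Hold out each user's last training interaction as validation.
last_idx = train.groupby("user_id")["timestamp"].idxmax()
valid = train.loc[last_idx]
train = train.drop(index=last_idx)

# Popularity groups: top 20% of items by exposure count -> G0, rest -> G1.
counts = train["item_id"].value_counts()
n_popular = int(len(counts) * 0.2)
G0 = set(counts.index[:n_popular])
G1 = set(counts.index[n_popular:])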

Model evaluation criteria

We use common metrics such as recall and F1 score to evaluate the accuracy of the recommender system. In addition to these accuracy measures, we add two fairness measures: for the exposure of individual items we measure the Gini Index, and for the exposure of item groups we measure the popularity rate.
The Gini Index measures the inequality of a frequency distribution (here, the number of exposures of each item) and can be regarded as an individual-level fairness measure. Given the exposure record of every item,
\(\mathcal{G}=[g_{1}, g_{2}, \ldots, g_{|\mathcal{I}|}]\), the Gini Index is computed as in \((14)\):

\[ Gini\quad Index(\mathcal{G}) = \frac{1}{2 |\mathcal{I}|^2 \bar{g}}\sum_{i=1}^{|\mathcal{I}|}\sum_{j=1}^{|\mathcal{I}|}|g_{i}-g_{j}|
\tag{14}
\]

where \(\bar{g}\) denotes the average number of exposures over all items. The popularity rate is the proportion of popular items in the recommendation list, and can be regarded as a fairness measure at the popularity (group) level.
These two fairness measures let us judge whether the recommendations are fairer than those of the original recommender system.
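For concreteness, here is a small NumPy sketch (my own) of the two fairness measures: gini_index is a direct transcription of \((14)\), and popularity_rate follows the description above (the function names and signatures are mine):

import numpy as np

def gini_index(exposures):
    # Direct transcription of formula (14); fine for a few thousand items.
    g = np.asarray(exposures, dtype=float)
    n = len(g)
    diff_sum = np.abs(g[:, None] - g[None, :]).sum()   # sum_i sum_j |g_i - g_j|
    return diff_sum / (2 * n ** 2 * g.mean())

def popularity_rate(recommended_items, popular_group):
    # Fraction of recommended items that belong to the popular group G0.
    return np.mean([item in popular_group for item in recommended_items])

print(gini_index([10, 10, 10, 10]))   # 0.0  -> perfectly equal exposure
print(gini_index([40, 0, 0, 0]))      # 0.75 -> highly concentrated exposure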

Test results

Finally, our test results on MovieLens-1M are shown in the table below. FCPO-1, FCPO-2, and FCPO-3 correspond to fairness-constraint parameters \(\alpha^{'}=1\), \(\alpha^{'}=0.8\), and \(\alpha^{'}=0.4\), respectively, and we fix the length of the recommendation list to \(k = 5\).

| Method | Recall (%) \(\uparrow\) | F1 (%) \(\uparrow\) | Gini Index (%) \(\downarrow\) | Popularity rate (%) \(\uparrow\) |
|---|---|---|---|---|
| FCPO-1 | 2.033 | 2.668 | 99.81 | 99.28 |
| FCPO-2 | 1.520 | 2.015 | 99.47 | 99.28 |
| FCPO-3 | 0.998 | 1.449 | 99.47 | 99.28 |

References

  • [1] Ge Y, Liu S, Gao R, et al. Towards Long-term Fairness in Recommendation[C]//Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 2021: 445-453.
  • [2] Achiam J, Held D, Tamar A, et al. Constrained policy optimization[C]//International Conference on Machine Learning. PMLR, 2017: 22-31.