# Reading papers on fairness in recommender systems (3)

Time: 2021-10-27

My main task these days has been to debug and run the code I wrote following the paper "Learning Fair Representations for Recommendation: A Graph-based Perspective", then test the model and record the results. I describe the work in four parts: dataset description, model evaluation strategy, hyperparameter tuning, and test result recording.

## Dataset description

MovieLens-1M is a benchmark recommendation dataset containing nearly 1 million ratings from 6,040 users on nearly 4,000 movies. Each user has three categorical attributes: gender (2 classes), age (7 classes), and occupation (21 classes). Following previous fairness-aware recommendation work, we split the data into training and test sets at a 9:1 ratio.
Lastfm-360K is a music recommendation dataset built from the music website Last.fm. It contains about 17 million ratings of 290,000 artists by roughly 360,000 users. We take the number of times a user plays an artist's music as the rating value. Because raw play counts span a very large range, we first apply a log transform and then normalize the scores into the range [1, 5]. Each user has a profile with a gender attribute (2 classes) and an age attribute (7 classes). Following the data-partition strategy of many classical recommender systems, we split the data into training, validation, and test sets at a 7:1:2 ratio.
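The play-count preprocessing described above can be sketched as follows. This is a minimal numpy version under my own reading of the text (log transform, then min-max scaling into [1, 5]); the count values are illustrative, not from the real dataset:

```python
import numpy as np

# Hypothetical raw play counts for a handful of user-artist pairs.
play_counts = np.array([1, 3, 10, 250, 8000], dtype=float)

# Step 1: log transform to compress the heavy-tailed count distribution.
log_counts = np.log(play_counts + 1.0)  # +1 guards against log(0)

# Step 2: min-max normalize the log counts into the rating range [1, 5].
lo, hi = log_counts.min(), log_counts.max()
ratings = 1.0 + 4.0 * (log_counts - lo) / (hi - lo)
```

The transform preserves the ordering of play counts, so a more-played artist always receives a rating at least as high.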

## Model evaluation strategy

To evaluate recommendation performance, we use root mean square error (RMSE). To measure the fairness of the algorithm effectively, we compute the fairness metrics on 20% of the test users.
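As a concrete reference, RMSE over a set of true and predicted ratings is computed as follows (toy values, not from the experiments):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# A perfect predictor has RMSE 0; constant error of 1 gives RMSE 1.
print(rmse([4, 3, 5], [4, 3, 5]))  # 0.0
```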
Because binary attributes such as gender are imbalanced in both datasets (roughly 70% male, 30% female), we use AUC to measure binary classification performance; for multi-valued attributes we use micro-averaged F1.
AUC and F1 measure whether sensitive information is exposed during representation learning: the lower the attacker's classification score, the less sensitive information leaks and the fairer the system.
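For illustration, both leakage metrics can be computed without any library. This is my own minimal sketch: AUC via the rank-sum (Mann-Whitney U) statistic, and micro-averaged F1, which for single-label multi-class prediction reduces to accuracy:

```python
import numpy as np

def auc_binary(labels, scores):
    """AUC of an attacker predicting a binary attribute.
    labels: 0/1 array; scores: predicted probability of class 1.
    Ties in scores are ignored in this sketch."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)  # ranks start at 1
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label multi-class predictions.
    With one true and one predicted label per sample it equals accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())
```

An attacker scoring near 0.5 AUC (or near the majority-class rate in F1) indicates the filtered embeddings leak little sensitive information.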
Because the model I reproduced treats the upstream model as unknown (i.e., to improve generality, the upstream recommender is handled as a black box) and can be applied to many multi-attribute recommendation scenarios, we designed several tests under different evaluation settings.
First, we choose a state-of-the-art graph convolutional network (GCN) recommendation model as our base model. Because the GCN-based recommender was originally designed with a ranking-based loss function, we modified it to use a score-based loss and incorporated the detailed rating values into the graph convolution to fit our setting.
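The loss modification described above, moving from a pairwise ranking objective to a pointwise rating objective, can be sketched as follows. The function names are my own, not from the paper's code; the ranking loss shown is BPR, a common choice for GCN recommenders:

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """Pairwise ranking loss (BPR): pushes positive items above negatives."""
    diff = np.asarray(pos_scores, float) - np.asarray(neg_scores, float)
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-diff)))))

def rating_loss(pred, true):
    """Pointwise score-based loss: mean squared error on observed ratings."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.mean((pred - true) ** 2))
```

The score-based loss lets RMSE be optimized directly, which matches the evaluation metric used here.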

## Hyperparameter tuning

In the actual implementation, we choose a multi-layer perceptron (MLP) as the architecture of each filter and discriminator, and set the filter's embedding size to $$D=64$$.
For the MovieLens dataset, each filter network has three layers, with hidden-layer dimensions 128 and 64; each discriminator has four layers, with hidden-layer dimensions 16, 8, and 4.
For the Lastfm-360K dataset, each filter network has four layers, with hidden-layer dimensions 128, 64, and 32; each discriminator likewise has four layers, with hidden-layer dimensions 16, 8, and 4.
We use LeakyReLU as the activation function. The balance parameter $$\lambda$$ is set to 0.1 on the MovieLens dataset and 0.2 on the Lastfm-360K dataset. All parameters in the objective functions are differentiable. We use the Adam optimizer with an initial learning rate of 0.005.
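The filter architecture above can be sketched as a plain numpy forward pass. This is only an illustration of the layer dimensions and the LeakyReLU placement; the weights here are random, and the output dimension of 64 (back to the embedding size) is my assumption, not stated in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.01):
    """LeakyReLU activation: identity for x > 0, small slope otherwise."""
    return np.where(x > 0, x, slope * x)

def mlp_forward(x, dims):
    """Forward pass through an MLP with LeakyReLU between layers.
    Weights are randomly initialized here purely for illustration."""
    for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
        W = rng.normal(0.0, 0.1, size=(d_in, d_out))
        b = np.zeros(d_out)
        x = x @ W + b
        if i < len(dims) - 2:  # no activation after the final layer
            x = leaky_relu(x)
    return x

# MovieLens filter: input embedding D=64, hidden layers 128 and 64,
# output projected back to 64 dimensions (assumed).
emb = rng.normal(size=(1, 64))
filtered = mlp_forward(emb, [64, 128, 64, 64])
```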

## Test result record

The two tables below record our test results. For simplicity, we adopt a simple ego-centric graph structure with first-order weighted aggregation during testing. The results show that when the GCN model directly incorporates the sensitive-information filters, recommendation performance drops by 5% to 10%, because we must suppress any latent dimensions that are useful for rating prediction but might expose sensitive information.
The training process on the MovieLens-1M dataset is as follows:

```
ga0--train-- 383.52958726882935
epoch:1 time:383.5  train_loss f:-192.2718 d:19.5616val_loss f:-192.8258 d:19.2919
val_rmse:0.9045  test_rmse:0.895
train data is end
ga0--train-- 360.9422023296356
epoch:2 time:360.9  train_loss f:-191.72 d:19.4652val_loss f:-200.0517 d:20.0125
val_rmse:0.7063  test_rmse:0.6894
train data is end
ga0--train-- 363.16574025154114
epoch:3 time:363.2  train_loss f:-200.8263 d:19.2499val_loss f:-203.8944 d:20.4799
val_rmse:2.8324  test_rmse:2.8068
train data is end
ga0--train-- 355.92360401153564
epoch:4 time:355.9  train_loss f:-189.3184 d:19.3741val_loss f:-180.7054 d:18.0778
ga0 clf_age/4
no model save path
val_rmse:0.7821  test_rmse:0.7787age f1:0.4683	0.4683 0.4683 0.4683
train data is end
ga0--train-- 356.7487156391144
epoch:5 time:356.7  train_loss f:-198.0661 d:19.8271val_loss f:-190.4692 d:19.0536
ga0 clf_age/5
no model save path
val_rmse:0.7407  test_rmse:0.7326age f1:0.469	0.469 0.469 0.469
```

The table below shows the performance of the aggregation networks with ego-centric structures on MovieLens-1M; "constant" denotes constant local-function aggregation and "learnable" denotes aggregation with learnable parameters.

| Sensitive attribute | RMSE | AUC/F1 |
|---|---|---|
| Gender | 0.8553 | 0.8553 |
| Age | 0.8553 | 0.3948 |
| Occupation | 0.8553 | 0.1556 |

The training process on the Lastfm-360K dataset is as follows:

```
ga0
--------training processing-------
train data is end
ga0--train-- 380.44703578948975
epoch:0 time:380.4  train_loss f:-200.3726 d:19.7304val_loss f:-193.2152 d:19.3319
val_rmse:0.9439  test_rmse:0.9304
train data is end
ga0--train-- 383.52958726882935
epoch:1 time:383.5  train_loss f:-192.2718 d:19.5616val_loss f:-192.8258 d:19.2919
val_rmse:0.9045  test_rmse:0.895
train data is end
ga0--train-- 360.9422023296356
epoch:2 time:360.9  train_loss f:-191.72 d:19.4652val_loss f:-200.0517 d:20.0125
val_rmse:0.7063  test_rmse:0.6894
train data is end
ga0--train-- 363.16574025154114
epoch:3 time:363.2  train_loss f:-200.8263 d:19.2499val_loss f:-203.8944 d:20.4799
val_rmse:2.8324  test_rmse:2.8068
train data is end
ga0--train-- 355.92360401153564
epoch:4 time:355.9  train_loss f:-189.3184 d:19.3741val_loss f:-180.7054 d:18.0778
ga0 clf_age/4
no model save path
val_rmse:0.7821  test_rmse:0.7787age f1:0.4683	0.4683 0.4683 0.4683
train data is end
```

The following is the performance on Lastfm-360K.

| Sensitive attribute | RMSE | AUC/F1 |
|---|---|---|
| Gender | 0.7358 | 0.5642 |
| Age | 0.7358 | 0.4953 |

## Reference

[1] Wu L, Chen L, Shao P, et al. Learning Fair Representations for Recommendation: A Graph-based Perspective[C]//Proceedings of the Web Conference 2021. 2021: 2198-2208.