My main task these days is to debug and run the code written according to the paper learning fair representations for recommendation: a graph based perspective, and then test the effect of the model and record it. The following describes my work in four parts: data set description, model evaluation strategy, super parameter adjustment and test result recording.
Movielens-1m is a benchmark data set of recommendation system. This data set includes nearly 1 million scoring information of 6040 users for nearly 4000 films. Users have three category attributes, including gender (2 categories), age (7 categories) and position (21 categories). Following the previous fairness based recommendation system, we divide the training set and test set according to the ratio of 9:1.
Lastfm-360k is a music recommendation data set, which includes the scores of music producers by users from the music website last.fm. This data set includes 17 million ratings of 290000 music producers by about 360000 users. We take the number of times the user plays music as the user’s score value. Because the value of the score may be in a large range, we first transform the log log, and then normalize the score to a range between 1 and 5. Users have their own user portraits, and their information includes gender attributes (2 categories) and age attributes (7 categories). Like the data partition strategy of many classical recommendation systems, we divide the training set, verification set and test set into 7:1:2.
Model evaluation strategy
To evaluate the performance of the recommendation system, we use root mean square error (RMSE). In order to effectively measure the fairness of our algorithm. We calculated the fairness performance of 20% of test users.
Because binary features (such as gender) are unbalanced in each data set, with 70% of men and 30% of women, we use AUC measure to measure the performance of binary classification. For multivalued attributes, we use micro averaged F1 to measure.
AUC or F1 can be used to measure whether sensitive gender information is exposed in the process of representing learning. The smaller the measure of this classification, the better the fairness of the system and less sensitive information is leaked.
Because the model I reproduced in this paper is “the upstream model is unknown” (that is, in order to enhance its universality, the upstream model is treated as a black box model), and can be applied to many multi-attribute recommendation scenarios, we designed many tests according to different model evaluation settings.
Firstly, we choose the most advanced state of the art graph convolution network (GCN) recommendation model as our benchmark model. Because the GCN based recommendation model was originally designed as a ranking based loss function, we modified it to a score based loss function, and added more detailed score values in the process of graph convolution to facilitate our setting.
Super parameter adjustment
In the actual model implementation, we choose multi-layer perceptron (MLP) as the specific architecture of each filter and discriminator, and the embedding size of the filter is set to\(D=64\)。
For the movielens dataset, each filter network has three layers, and the dimensions of the hidden layer are 128 and 64 respectively. The discriminator has four layers, and the dimensions of the hidden layer are 16, 8 and 4 respectively.
For the lastfm-360k dataset, each filter network has four layers, and the hidden layer dimensions are 128, 64 and 32 respectively. Each discriminator
There are 4 layers, and the hidden layer dimensions are 16, 8 and 4 respectively.
We use leakyrelu as the activation function. Balance parameters on movielens dataset\(\lambda\)Set to 0.1 and set the balance parameter to 0.2 on the lastfm-360k dataset. The parameters in all objective functions are differentiable. We use the Adam optimizer to adjust the initial learning rate to 0.005.
Test result record
The following two tables are our test results of the model. In the process of testing, for the sake of simplicity, we adopt a simple self center graph structure and use first-order weighted integration. According to the test results of the model, we can see that if the GCN model directly considers the sensitive information filter, the performance of the recommendation system will be reduced by 5% to 10%, because we need to exclude any hidden vector dimensions useful for scoring, but sensitive information may be exposed.
The training process on movielens-1m dataset is as follows:
ga0--train-- 383.52958726882935 epoch:1 time:383.5 train_loss f:-192.2718 d:19.5616val_loss f:-192.8258 d:19.2919 val_rmse:0.9045 test_rmse:0.895 train data is end ga0--train-- 360.9422023296356 epoch:2 time:360.9 train_loss f:-191.72 d:19.4652val_loss f:-200.0517 d:20.0125 val_rmse:0.7063 test_rmse:0.6894 train data is end ga0--train-- 363.16574025154114 epoch:3 time:363.2 train_loss f:-200.8263 d:19.2499val_loss f:-203.8944 d:20.4799 val_rmse:2.8324 test_rmse:2.8068 train data is end ga0--train-- 355.92360401153564 epoch:4 time:355.9 train_loss f:-189.3184 d:19.3741val_loss f:-180.7054 d:18.0778 ga0 clf_age/4 no model save path val_rmse:0.7821 test_rmse:0.7787age f1:0.4683 0.4683 0.4683 0.4683 train data is end ga0--train-- 356.7487156391144 epoch:5 time:356.7 train_loss f:-198.0661 d:19.8271val_loss f:-190.4692 d:19.0536 ga0 clf_age/5 no model save path val_rmse:0.7407 test_rmse:0.7326age f1:0.469 0.469 0.469 0.469
For the performance of different summary networks with different self-centered structures on the data set movielens-1m, the “constant” here represents the constant local function aggregation, and the “learnable” represents the parameter learnable aggregation.
The training process on the lastfm-360k dataset is as follows:
ga0 --------training processing------- train data is end ga0--train-- 380.44703578948975 epoch:0 time:380.4 train_loss f:-200.3726 d:19.7304val_loss f:-193.2152 d:19.3319 val_rmse:0.9439 test_rmse:0.9304 train data is end ga0--train-- 383.52958726882935 epoch:1 time:383.5 train_loss f:-192.2718 d:19.5616val_loss f:-192.8258 d:19.2919 val_rmse:0.9045 test_rmse:0.895 train data is end ga0--train-- 360.9422023296356 epoch:2 time:360.9 train_loss f:-191.72 d:19.4652val_loss f:-200.0517 d:20.0125 val_rmse:0.7063 test_rmse:0.6894 train data is end ga0--train-- 363.16574025154114 epoch:3 time:363.2 train_loss f:-200.8263 d:19.2499val_loss f:-203.8944 d:20.4799 val_rmse:2.8324 test_rmse:2.8068 train data is end ga0--train-- 355.92360401153564 epoch:4 time:355.9 train_loss f:-189.3184 d:19.3741val_loss f:-180.7054 d:18.0778 ga0 clf_age/4 no model save path val_rmse:0.7821 test_rmse:0.7787age f1:0.4683 0.4683 0.4683 0.4683 train data is end
The following is the performance on lastfm-360k.
-  Wu L, Chen L, Shao P, et al. Learning Fair Representations for Recommendation: A Graph-based Perspective[C]//Proceedings of the Web Conference 2021. 2021: 2198-2208.