wahcheung asked 1 year ago

Two dataframes have been read from hive, showrecords is the display record, playrecords is the playback record, and vid is the video ID (there are other fields, which are irrelevant here, omitted)
showRecords:

vid
1
1
2
2
3
3
4

playRecords:

vid
1
2
1

How to count the display times and playback times of each vid, and get a result similar to the following (result count_ Play in descending order)

vid  count_play  count_show
1    2           2
2    1           2
3    0           2
4    0           1

I tried

val mergeRecords = showRecords.groupBy("vid").agg("vid" -> "count").join(playRecords.groupBy("vid").agg("vid" -> "count"), Seq("vid"), "full_outer")

However, the result was not as expected and did not press count_ Play sort, and since it is full_ Outer join causes count_ Null value in play column. (other methods have been tried, but the vid is prone to problems because the primary key vid of the two tables has the same name.) How to solve this problem in a more elegant way?

1 Answers
wahcheung answered 1 year ago
val res = showRecords.groupBy("vid").agg(count("vid").as("showtimes")).join(playRecords.gr‌​oupBy("vid").agg(cou‌​nt("vid").as("playti‌​‌​mes")), Seq("vid"), "left").na.fill(0).orderBy(col("playtimes").desc)