Parsing JSON of nested object array by Spark SQL
gonefuture asked 1 year ago

1. The JSON data is as follows
{"id":11,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000},{"package":"com.browser7","activetime":1205000}]}
{"id":12,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000}]}
……
Each record lists the active time of apps; the goal is to compute the total active time of each app.
I used Spark SQL to parse the JSON:

    val sqlContext = spark.sqlContext  // `spark` is the SparkSession, e.g. in spark-shell

    val behavior = sqlContext.read.json("behavior-json.log")
    behavior.cache()

    behavior.createOrReplaceTempView("behavior")

    val appActiveTime = sqlContext.sql("SELECT data FROM behavior") // SQL query

    appActiveTime.show(100, 100)       // print the DataFrame
    appActiveTime.rdd.foreach(println) // print the RDD

But the printed DataFrame looks like this:

+----------------------------------------------------------------------+
|                                                                  data|
+----------------------------------------------------------------------+
|                        [[60000,com.browser1], [12870000,com.browser]]|
|                          [[60000,com.browser1], [120000,com.browser]]|
|                          [[60000,com.browser1], [120000,com.browser]]|
|                         [[60000,com.browser1], [1207000,com.browser]]|
|                                                [[120000,com.browser]]|
|                        [[60000,com.browser1], [1204000,com.browser5]]|
|                        [[60000,com.browser1], [12075000,com.browser]]|
|                          [[60000,com.browser1], [120000,com.browser]]|
|                         [[60000,com.browser1], [1204000,com.browser]]|
|                          [[60000,com.browser1], [120000,com.browser]]|
|                         [[60000,com.browser1], [1201000,com.browser]]|
|                                              [[1200400,com.browser5]]|
|                         [[60000,com.browser1], [1200400,com.browser]]|
|[[60000,com.browser1], [1205000,com.browser6], [1205000,com.browser7]]|
+----------------------------------------------------------------------+

And the RDD looks like this:

[WrappedArray([60000,com.browser1], [60000,com.browser1])]
[WrappedArray([120000,com.browser])]
[WrappedArray([60000,com.browser1], [1204000,com.browser5])]
[WrappedArray([12075000,com.browser], [12075000,com.browser])]

I want to convert the data into:

com.browser1 60000 
com.browser1 60000 
com.browser 12075000 
com.browser 12075000 
.......

That is, I want to turn the array elements of each row of the RDD into individual rows; any other structure that is easy to analyze would also work.
Because I am a beginner with Spark and Scala, I have tried for a long time without success, so I hope you can guide me.

3 Answers
liuxinsi answered 1 year ago

No problem.
If your original JSON is converted to a table structure as-is, it comes out in the format below, because the JSON has two attributes, an id and a data field, and data is a list. If you do not map it yourself, Spark cannot know what table structure your list should be mapped to.

+--------------------+---+
|                data| id|
+--------------------+---+
|[[60000, com.brow...| 11|
+--------------------+---+

So if you want to present your JSON as a flat table, you need to map it yourself:

    import java.util

    import scala.collection.mutable

    import org.apache.spark.api.java.function.FlatMapFunction
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
    import org.apache.spark.sql.{Row, SparkSession}

    val session = SparkSession.builder().getOrCreate()

    // target schema: one flat row per (id, package, activetime)
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("package", StringType),
      StructField("activetime", LongType)
    ))

    val encoder = RowEncoder(schema)

    val df = session.read.json("file:///E:/anc.json")
      .flatMap(new FlatMapFunction[Row, Row] {
        override def call(r: Row): util.Iterator[Row] = {
          val list = new util.ArrayList[Row]()
          // "data" is read back as an array of (activetime, package) structs
          val datas = r.getAs[mutable.WrappedArray.ofRef[Row]]("data")
          datas.foreach { data =>
            list.add(Row(r.getAs[Long]("id"),
              data.getAs[String]("package"),
              data.getAs[Long]("activetime")))
          }
          list.iterator()
        }
      }, encoder)

    df.show()

Finally, the table structure is as follows:

+---+------------+----------+
| id|     package|activetime|
+---+------------+----------+
| 11|com.browser1|     60000|
| 11|com.browser6|   1205000|
| 11|com.browser7|   1205000|
+---+------------+----------+

Alternatively, you can take the result of sqlContext.sql("SELECT data FROM behavior") and map it yourself; the key point is how to flatten the list.
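
For example, a minimal sketch of that map-after-SQL approach, assuming the appActiveTime DataFrame from the question (the inner structs keep their JSON field names, so they can be read by name):

    appActiveTime.rdd
      .flatMap { row =>
        // each row holds one array of (activetime, package) structs
        row.getSeq[Row](0).map { app =>
          (app.getAs[String]("package"), app.getAs[Long]("activetime"))
        }
      }
      .foreach(println) // e.g. (com.browser1,60000)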

gonefuture replied 1 year ago

I went with the Spark SQL approach and used explode(), which is more concise.
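
For reference, a minimal sketch of the explode() route through the DataFrame API, assuming the behavior DataFrame from the question (the "app" alias is just illustrative):

    import org.apache.spark.sql.functions.explode

    behavior
      .select(explode(behavior("data")).as("app"))      // one row per array element
      .select("app.package", "app.activetime")          // pull the struct fields out
      .show()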

Old t1234 answered 1 year ago

Do you have a solution?

gonefuture replied 1 year ago

There’s a solution, just like the one above.

JarShaw answered 1 year ago

Use the explode function to flatten the array:

SELECT EXPLODE(data) FROM behavior
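
Since the original goal was the total active time per app, the exploded structs can be aggregated directly. A sketch against the behavior temp view from the question (the t and app aliases are illustrative):

SELECT app.package, SUM(app.activetime) AS total_activetime
FROM behavior
LATERAL VIEW EXPLODE(data) t AS app
GROUP BY app.package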