How to deal with the problems encountered when inserting and querying massive data with spark-redis

Time: 2021-1-23

Abstract: Redis is an in-memory database, so its stability is limited, especially in standalone mode. As a result, we run into many problems when using spark-redis, especially in scenarios that insert or query massive amounts of data.

Massive data query

Redis keeps its data in memory, so reads are faster than in most other databases. But when we query tens of millions of records, even Redis takes a long time. If we then decide to terminate the select job, we want all of its running tasks to be killed immediately.

Spark has a job scheduling mechanism. SparkContext is the entry point of a Spark application, roughly equivalent to its main function, and its cancelJobGroup method can cancel the jobs of a running group.

/**
  * Cancel active jobs for the specified group. See `org.apache.spark.SparkContext.setJobGroup`
  * for more information.
  */
 def cancelJobGroup(groupId: String) {
   assertNotStopped()
   dagScheduler.cancelJobGroup(groupId)
 }
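
For context, cancelling a query from application code could look like the minimal sketch below; the group id "redis-select", the RDD redisRdd and the use of a Future are our own assumptions, not part of spark-redis.

import scala.concurrent.{ExecutionContext, Future}

// Assumes an existing SparkContext `sc` and an RDD `redisRdd` built by spark-redis.
// Tag the query with a job group so it can be cancelled later.
sc.setJobGroup("redis-select", "query massive keys from redis", interruptOnCancel = true)

// Run the long select job off the calling thread.
val result = Future {
  redisRdd.count()
}(ExecutionContext.global)

// When the user asks to stop the query, cancel every job in that group.
sc.cancelJobGroup("redis-select")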

In principle, once the job is cancelled, all tasks belonging to it should also be terminated. Indeed, when we cancel the select job, the executor throws a TaskKilledException, and the TaskContext in charge of the task runs killTaskIfInterrupted after the exception is caught.

 // If this task has been killed before we deserialized it, let's quit now. Otherwise,
 // continue executing the task.
 val killReason = reasonIfKilled
 if (killReason.isDefined) {
   // Throw an exception rather than returning, because returning within a try{} block
   // causes a NonLocalReturnControl exception to be thrown. The NonLocalReturnControl
   // exception will be caught by the catch block, leading to an incorrect ExceptionFailure
   // for the task.
   throw new TaskKilledException(killReason.get)
 }
/**
 * If the task is interrupted, throws TaskKilledException with the reason for the interrupt.
 */
 private[spark] def killTaskIfInterrupted(): Unit

In spark-redis, however, it still happens that the job is terminated while its tasks keep running. The task's computation logic ultimately lives in RedisRDD, whose compute() fetches the keys through Jedis. So to solve this problem we have to cancel the running task inside RedisRDD. There are two ways:

Method 1: following Spark's JdbcRDD, define a close() method and combine it with InterruptibleIterator.

def close() {
   if (closed) return
   try {
     if (null != rs) {
       rs.close()
     }
   } catch {
     case e: Exception => logWarning("Exception closing resultset", e)
   }
   try {
     if (null != stmt) {
       stmt.close()
     }
   } catch {
     case e: Exception => logWarning("Exception closing statement", e)
   }
   try {
     if (null != conn) {
       if (!conn.isClosed && !conn.getAutoCommit) {
         try {
           conn.commit()
         } catch {
           case NonFatal(e) => logWarning("Exception committing transaction", e)
         }
       }
       conn.close()
     }
     logInfo("closed connection")
   } catch {
     case e: Exception => logWarning("Exception closing connection", e)
   }
   closed = true
 }
 
 context.addTaskCompletionListener{ context => close() } 
CompletionIterator[InternalRow, Iterator[InternalRow]](
   new InterruptibleIterator(context, rowsIterator), close())
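
Applied inside a Redis RDD's compute(), the same pattern might look like the sketch below; conn (the Jedis connection), keysIterator and closed are our own placeholders, while CompletionIterator and InterruptibleIterator are the same Spark helpers used by JdbcRDD above.

// A sketch, not spark-redis source: release the Jedis connection when the task
// completes or is interrupted, mirroring JdbcRDD's close().
@volatile private var closed = false

def close(): Unit = {
  if (closed) return
  try {
    if (conn != null) {
      conn.close()   // return the Jedis connection
    }
  } catch {
    case e: Exception => logWarning("Exception closing jedis connection", e)
  }
  closed = true
}

context.addTaskCompletionListener { _ => close() }
CompletionIterator[String, Iterator[String]](
  new InterruptibleIterator(context, keysIterator), close())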

Method 2: run compute() on an asynchronous thread while the task thread checks whether the task has been interrupted.

// Executed inside RedisRDD.compute(): fetch the keys on a separate thread while the
// task thread polls the interrupt flag, so a cancelled job can stop the task promptly.
@volatile var isRequestFinished = false
var keys: Iterator[T] = Iterator.empty

val thread = new Thread() {
  override def run(): Unit = {
    try {
      keys = doCall   // the blocking call that fetches the keys from redis
    } catch {
      case e: Exception =>
        logWarning("Fetching keys failed.", e)
    }
    isRequestFinished = true
  }
}

// Poll until either the keys are back or the user has interrupted the job.
thread.start()
while (!context.isInterrupted() && !isRequestFinished) {
  Thread.sleep(GetKeysWaitInterval)
}
if (context.isInterrupted() && !isRequestFinished) {
  logInfo(s"try to kill task ${context.getKillReason()}")
  context.killTaskIfInterrupted()   // throws TaskKilledException
}
thread.join()
CompletionIterator[T, Iterator[T]](
  new InterruptibleIterator(context, keys), close)   // close() as defined in Method 1

We run compute() asynchronously and, on the task thread, keep checking whether the task has been interrupted; if it has, we call killTaskIfInterrupted on the TaskContext. In case killTaskIfInterrupted alone does not stop the task, we also combine it with InterruptibleIterator, an iterator that supports task termination by checking the interrupt flag in TaskContext.
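
InterruptibleIterator itself is only a thin wrapper; paraphrased from Spark's source, it behaves roughly as follows:

// Roughly how org.apache.spark.InterruptibleIterator behaves: each hasNext call first
// checks the kill flag on the TaskContext, so a cancelled task stops at the next
// element instead of draining the whole iterator.
class InterruptibleIterator[+T](val context: TaskContext, val delegate: Iterator[T])
  extends Iterator[T] {

  def hasNext: Boolean = {
    context.killTaskIfInterrupted()   // throws TaskKilledException if the task was killed
    delegate.hasNext
  }

  def next(): T = delegate.next()
}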

Massive data insertion

As we all know, Redis stores its data in memory (it also supports persistence and can back data up to disk). When massive data is inserted and Redis does not have enough memory, some of the data is obviously lost. What confuses users is that when the memory Redis uses exceeds its maximum available memory, Redis normally reports an error: command not allowed when used memory > 'maxmemory'. Yet when an insert job writes more data than Redis has memory for, part of the data is silently lost and no error is reported.

This is because neither the Jedis client nor the Redis server returns a response when an insert fails for lack of memory: the write simply does not succeed. So the only workaround we can offer at present is to increase the Redis memory when inserted data goes missing.
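
Until a better fix is available, a defensive pre-check at least makes the problem visible. The sketch below is our own illustration rather than part of spark-redis: it reads used_memory and maxmemory from INFO memory through Jedis and refuses a bulk write that clearly cannot fit.

import redis.clients.jedis.Jedis

// Hypothetical pre-check before a bulk insert: compare the estimated size of the data
// to be written with what the Redis instance can still hold.
def hasEnoughMemory(host: String, port: Int, bytesToWrite: Long): Boolean = {
  val jedis = new Jedis(host, port)
  try {
    val info = jedis.info("memory")
    def field(name: String): Long =
      info.linesIterator
        .find(_.startsWith(name + ":"))
        .map(_.split(":")(1).trim.toLong)
        .getOrElse(0L)

    val used = field("used_memory")
    val max  = field("maxmemory")   // 0 means no explicit limit is configured
    max == 0L || used + bytesToWrite <= max
  } finally {
    jedis.close()
  }
}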

Summary

spark-redis is an open-source project that is not yet widely used, unlike Spark's JDBC support, which is already used commercially. So spark-redis still has many problems. I believe that with the committers' efforts, spark-redis will become more and more powerful.
