Unexpectedly, a mongodb A bug in the driver caused the Net5 program deadlock!

Time:2021-12-17

1: Background

1. Tell a story

At the beginning of this month, a friend on the planet came to me and said that his program had a deadlock. He suspected that some of his writing methods had led to such an embarrassing situation in mongodb. The screenshot is as follows:

Unexpectedly, a mongodb A bug in the driver caused the Net5 program deadlock!

To tell you the truth, it’s the first time I’ve seen so many dumps and encountered a real deadlock. TMD is immediately interested… Talk to WinDbg.

2: WinDbg analysis

1. Is it really a deadlock

Since my friend said deadlock, I have to verify it first. You can use the command!syncblkView the synchronization block table.


0:000> !syncblk
Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
   97 000000F7B77CA1B8          107         1 000000F7D37A7210 8848  25   000000f7b853d480 System.Object
  144 000000F7D39BA2A8          495         1 000000F7DA4CDA70 75e0  63   000000f7b853de48 System.Object
-----------------------------
Total           603
CCW             1
RCW             1
ComClassFactory 0
Free            490

The following information can be obtained from the synchronization block table.

  1. Thread 25 is holding000000f7b853d480Lock object.
  2. Thread 63 is holding000000f7b853de48Lock object.

We know the so-calleddeadlockThis is an impasse caused by two threads eager to get the lock resources held by the other party and no one gives in. If I don’t understand, I’ll draw a picture:

Unexpectedly, a mongodb A bug in the driver caused the Net5 program deadlock!

The above figure is a deadlock. Incidentally, this situation is often encountered in SQL server. How will it be handled? This is interesting. A mediation thread in SQL server executes periodically. When this deadlock is detected, it will kill the thread with low priority, so that another thread can successfully obtain the lock. The thread that is killed will appear the following exception information:

System. Data. SqlClient. Sqlexception (0x80131904): the transaction (process ID 112) and another process are deadlocked on the lock communication buffer resource and have been selected as deadlock victims. Please rerun the transaction.
   In system Data. SqlClient. SqlConnection. OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   In system Data. SqlClient. SqlInternalConnection. OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   In system Data. SqlClient. TdsParser. ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   In system Data. SqlClient. TdsParser. TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   In system Data. SqlClient. SqlCommand. RunExecuteNonQueryTds(String methodName, Boolean async, Int32 timeout)
   In system Data. SqlClient. SqlCommand. InternalExecuteNonQuery(TaskCompletionSource`1 completion, String methodName, Boolean sendToPipe, Int32 timeout, Boolean asyncWrite)
   In system Data. SqlClient. SqlCommand. ExecuteNonQuery()
   In Microsoft ApplicationBlocks. Data. SqlHelper. ExecuteNonQuery(SqlConnection connection, CommandType commandType, String commandText, SqlParameter[] commandParameters)
   In Microsoft ApplicationBlocks. Data. SqlHelper. ExecuteNonQuery(String connectionString, CommandType commandType, String commandText, SqlParameter[] commandParameters)

Ha ha, is it deja vu? Well, after we have a certain understanding of deadlock, let’s assume that if it exists

  1. Thread 25 wants to get000000f7b853de48Lock object.
  2. Thread 63 wants to get000000f7b853d480Lock object.

In this case, it is bound to deadlock, right? How to use WinDbg for verification next? Switch to thread 25 to view the thread stack and stack object.


0:000> ~25s
ntdll!NtWaitForMultipleObjects+0xa:
00007ffb`9f230c7a c3              ret

0:025> !clrstack 
OS Thread Id: 0x8848 (25)
        Child SP               IP Call Site
000000F782904838 00007ffb9f230c7a [HelperMethodFrame_1OBJ: 000000f782904838] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000F782904990 00007ffb1d5e4d5c MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+ListConnectionHolder.Acquire()

0:025> !dso
OS Thread Id: 0x8848 (25)
RSP/REG          Object           Name
000000F782904648 000000f7b853de48 System.Object
000000F7829046D8 000000f7b84cb508 MongoDB.Driver.ReadPreference

You can see clearlyReliableEnterGetting000000f7b853de48When locking the object, it is stuck, and then switch to thread 63 to view it.


0:025> ~63s
ntdll!NtWaitForMultipleObjects+0xa:
00007ffb`9f230c7a c3              ret
0:063> !clrstack 
OS Thread Id: 0x75e0 (63)
        Child SP               IP Call Site
000000F787774EE8 00007ffb9f230c7a [HelperMethodFrame_1OBJ: 000000f787774ee8] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000F787775040 00007ffb1d5e4d5c MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+ListConnectionHolder.Acquire()
0:063> !dso
OS Thread Id: 0x75e0 (63)
RSP/REG          Object           Name
000000F787774A38 000000f7b82dc750 MongoDB.Bson.BsonBoolean
000000F787774BA0 000000f7b83a9a10 System.RuntimeType
000000F787774CF8 000000f7b853d480 System.Object

You can see clearlyReliableEnterGetting000000f7b853d480, this indicates that a deadlock does occur and there is nothing wrong with it.

2. Deadlock cause analysis

If you want to investigate the cause of the deadlock, you can only consider it carefullyThread stack + Thread stack object


0:063> !clrstack 
OS Thread Id: 0x75e0 (63)
        Child SP               IP Call Site
000000F787774EE8 00007ffb9f230c7a [HelperMethodFrame_1OBJ: 000000f787774ee8] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000F787775040 00007ffb1d5e4d5c MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+ListConnectionHolder.Acquire()
...
000000F78777BD10 00007ffb1e2fc69f System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/CancellationTokenSource.cs @ 724]
000000F78777BDA0 00007ffb1e2fc69f System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/CancellationTokenSource.cs @ 724]
000000F78777BE30 00007ffb1e2fc69f System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/CancellationTokenSource.cs @ 724]
000000F78777BEC0 00007ffb1e325d55 MongoDB.Driver.Core.Misc.SemaphoreSlimSignalable.Signal()
000000F78777BF10 00007ffb1e35b98e MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+ListConnectionHolder.Return(PooledConnection)
000000F78777BF70 00007ffb1e35b728 MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.ReleaseConnection(PooledConnection)
000000F78777C0A0 00007ffb1e35b4f9 MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+AcquiredConnection.Dispose()
000000F78777C0D0 00007ffb1e384164 MongoDB.Driver.Core.Operations.RetryableReadContext.Dispose()
000000F78777C100 00007ffb1e37ee31 MongoDB.Driver.Core.Operations.FindOperation`1+d__120[[System.__Canon, System.Private.CoreLib]].MoveNext()
...
000000F78777E980 00007ffb1dd807bf System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.InvokeContinuation(System.Action`1<System.Object>, System.Object, Boolean, Boolean) [/_/src/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs @ 1019]
000000F78777E9F0 00007ffb1dd80519 System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.OnCompleted(System.Net.Sockets.SocketAsyncEventArgs) [/_/src/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs @ 858]
000000F78777EA50 00007ffb1dd7f751 System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*) [/_/src/System.Private.CoreLib/src/System/Threading/Overlapped.cs @ 59]

0:063> !dso
OS Thread Id: 0x75e0 (63)
RSP/REG          Object           Name
000000F787774A38 000000f7b82dc750 MongoDB.Bson.BsonBoolean
000000F787774BA0 000000f7b83a9a10 System.RuntimeType
000000F787774CF8 000000f7b853d480 System.Object
...
000000F78777B658 000000f7be3407c8 System.String    OperationCanceled
000000F78777B670 000000f7b82c07e0 System.Resources.RuntimeResourceSet
000000F78777B698 000000f7b82bff38 System.Globalization.CultureInfo
000000F78777B730 000000f7a36525c8 System.Char[]
000000F78777B7E0 000000f7be3407c8 System.String    OperationCanceled
...
000000F78777BEF8 000000f7b853de10 MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+ListConnectionHolder
000000F78777BF10 000000f79f79e398 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[MongoDB.Driver.IAsyncCursor`1[[Czkj.AlibabaTools.Models.mongo.CopyRecords, Czkj.AlibabaTools.Models]], MongoDB.Driver.Core],[MongoDB.Driver.Core.Operations.FindOperation`1+<ExecuteAsync>d__120[[Czkj.AlibabaTools.Models.mongo.CopyRecords, Czkj.AlibabaTools.Models]], MongoDB.Driver.Core]]
000000F78777BF40 000000f7b853de48 System.Object
...

Because the code is everywhereawait,asyncTherefore, the decompiled thread stack is really large. After careful comparison, it is found that the code flow is probably:

  1. Start by processing the asynchronous request callback of mongodb (system. Threading. Overlappeddata).
  2. stayMongoDB.Driver.Core.Operations.FindOperationI don’t know why I threw the cancellation exceptionOperationCanceledAnd then callRetryableReadContext.Dispose()

Unexpectedly, a mongodb A bug in the driver caused the Net5 program deadlock!

  1. stayListConnectionHolder.Return()Method000000f7b853de48Lock object.

Unexpectedly, a mongodb A bug in the driver caused the Net5 program deadlock!

  1. staySignalOrReset() -> SemaphoreSlimSignalable.Signal()Method.

Unexpectedly, a mongodb A bug in the driver caused the Net5 program deadlock!

Note: there is no exit in event triggerlockArea.

  1. In the process of several handlers entering another thread poolListConnectionHolder.Acquire()Method, hoping to get the data in the pool000000f7b853d480Lock object.

Unexpectedly, a mongodb A bug in the driver caused the Net5 program deadlock!

At the same time, thread 25 is doing this operation in reverse. Because everyone is a double lock, it eventually leads todeadlockThe occurrence of.

3: There are several questions to answer

1. Why are there two thread pools?

fromThread stack objectLook, you should also see that there are two thread poolsExclusiveConnectionPool


0:000> !dumpheap -type ExclusiveConnectionPool -stat
Statistics:
              MT    Count    TotalSize Class Name
00007ffb1d25eca0        2          464 MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool  

This is because of friendsMongodb connection stringThe replica set mode of dual IP is used.


{
  "mongo": {
    "Connection": "mongodb://xxx.aliyuncs.com:3717,xxx.aliyuncs.com:3717/admin?replicaSet=mgset-500180544&readPreference=secondaryPreferred&maxPoolSize=1000&wtimeoutMS=10000"
  }
}

2. Is it a programmer’s pot?

From the stack information, it’s not the programmer’s pot. It’s something that happens when mongodb receives asynchronous callbacksOperationCanceledException. A deadlock bug occurs in the subsequent processing logic of the exception.

3. What is the current version of mognodb SDK?

The official mongodb driver is2.13.1.0, that is2021-8-15The latest one was released in October2.13.2.0

Unexpectedly, a mongodb A bug in the driver caused the Net5 program deadlock!

4. Feedback

After knowing this information, I communicated with my friend. The friend said that he submitted an issue to mongodb community. A few days later, the official answer was in the latestv2.14.beta1It is handled in.

  • https://github.com/mongodb/mo…
  • https://jira.mongodb.org/brow…

That is, in the futurerelease v2.14.0It will be solved in the version. At present, we can only wait! Looking forward to… ha-ha

4: Summary

Generally speaking, this is a deadlock problem caused by a bug at the bottom of mongodb. The analysis process of dump has experienced twists and turns, although it is officially authoritativeMongoDB.DriverPackage, but also worthy of doubt, rather than blindly doubt yourself deeply Finally, look forward to the upcomingrelease v2.14.0All right.

Unexpectedly, a mongodb A bug in the driver caused the Net5 program deadlock!