Need Quality Code? Get Silver Backed

XUnit async/await Foot Guns

21stNov

0

by Gary H

Many people feel they must multi-task because everybody else is multitasking, but this is partly because they are all interrupting each other so much.

Marilyn vos Savant

Let's set the scene: a number of integration tests have been written using the Leaping Gorilla Testing library which exercies a component interacting with a database. When exercising the component under test we perform a number of async/await calls in sequence relying on the results in our assertions. This works flawlessly on dev machines, the code works as expected under production load but running the integration tests in an Azure DevOps Hosted Agent deadlocks and causes the long running test killer (dotnet test blame flags are important for a healthy pipeline) to end the run. Why?

Let's start by seeing the guts for a test. This is one test, we have multiple test classes each running different scenarios in their own When clauses against the MetadataStore

[When]
public async Task WhenMetadataIsRoundTripped()
{
    _saveResult = await MetadataStore.SaveMetadataAsync(_newMetadata);
    _retrievedMetadata = await MetadataStore.GetMetadataAsync(_parentObjectId);
    if (_retrievedMetadata.IsSuccess)
    {
        _deleteResult = await MetadataStore.DeleteMetadataAsync(_retrievedMetadata.Value);
    }
}

[Then]
public void SaveShouldBeSuccessful()
{
    Assert.True(_saveResult.IsSuccess);
}

[Then]
public void GetShouldBeSuccessful()
{
    Assert.True(_retrievedMetadata.IsSuccess);
}

[Then]
public void DeleteShouldBeSuccessful()
{
    Assert.NotNull(_deleteResult);
    Assert.True(_deleteResult.IsSuccess);
}

When this is run in the pipeline with blame-hang enabled we see errors like:

[Long Running Test] 'SaveShouldBeSuccessful', Elapsed: 00:02:58
[Long Running Test] 'SaveShouldBeSuccessful', Elapsed: 00:02:38
[Long Running Test] 'SaveShouldBeSuccessful', Elapsed: 00:02:18

On investigation it was always the first access of an awaited async method which was causing the hang. But why was this happening with such consistency?

In .Net threading with Tasks is handled using the concept of a SychronizationContext. The idea of this is that it tracks all of the running tasks and ensures that any async/await state machines pass control back and forth to the relevant threads that should be receiving their results.

XUnit implements its own SychronizationContext. This behaves in a similar manner to the .Net one but with a key difference - it explicitly locks the maximum number of inflight actions to match the number of (v)CPUs in the machine running the test. The failing pipelines run on MS's blessed VM's (Standard DS2_V2) which have 2vCPUs. This will be important later.

When XUnit sees a class containing Tests (methods marked with a Fact attribute or in our case a Then attribute) it chooses to parallelize them. It does so by running all tests in each class in parallel. XUnit chooses to resolve any async/await calls as late as possible - which is on access of the awaited task. If each of our tests wish to access an awaited result we may exhaust the worker queue and if any task relies on another that has not yet been run it will never be processed.

This is the issue that was seen with the metadata tests. Each Then statement is equivalent to one Fact test. They all live in one class so are run in parallel. The custom XUnit sync context does not attempt to resolve wait calls until absolutely required. Each of these are waiting on the async result of a method call. XUnit processes the in-flight calls in a non-deterministic manner meaning that if we have many tests being run in parallel awaiting the same resource then odds are good that we will not perform the awaits in the "correct" order leading to a deadlock. Couple this to the VM only having 2 vCPU meaning we have a max of 2 actions in flight it becomes easy to see why we were deadlocking.

How to fix it

The answer to this is that we have to stop our Then attributes causing tests to be run in parallel. The easiest way to do this was to ensure all metadata tests inherited from a single base class and then decorate that class with a Collection attribute. This tells XUnit that "All metadata tests are in the same collection, don't try to run them in parallel, one at a time please". An alternative approach would be to decorate every Then method with an additional Priority attribute and set them all sequentially though this is more prone to user error through repeating the same priority value than the Collection approach.

What other things can we do to ensure this doesn't affect us in the future? If we were running on a VM with more than two cores this would be less pronounced - this is why the issue couldn't be easily replicated on dev machines, they have more cores, more in-flight actions and therefore less chance of a simple deadlock. There's just more breathing space for tasks to finish. We should always take special care to ensure either Priority or Collection attributes are applied with tests accessing shared resources in the same class.

C# , Testing , XUnit

Comments are Locked for this Post