April 7, 2024 Post Mortem

Background

On March 30th, at approximately 11:00AM UTC, Toshi alerted the NEM & Symbol Discord about a problem he was encountering with his node: a fork was occurring at block height 3,191,538.

Author’s Note: We very much appreciate his outreach, and we encourage all node operators to contact us promptly when they sense anything might be amiss!

By the time we’d responded to him, the data folder had already been purged, and we did not have any further data to perform additional diagnostics. We’d mistakenly thought that his node had recovered, and did not think too much of the issue.

A handful of people in the community reported similar issues, but we couldn’t find anything concrete and assumed they’d eventually sync back with the network.

On April 1st, Toshi reported that the same issue had occurred. Now, we were worried, and put some manpower behind the issue.

Dusan, Toshi, and the rest of the community jumped into action, helping provide diagnostics and suggestions on possible causes. By this time, only a handful of nodes (10-20) had forked - representing less than 2% of the network - so there was not a large risk profile…yet.

We shot out a tweet to let everyone know we were on it, and got to work.

Diving In For Data

In the logs of forked nodes, there was this commonly reported error: processing of block 3191538 failed with Failure_LockSecret_Hash_Already_Exists.

The error meant that an active SECRET_LOCK was attempting to be replaced, which the network rules do not allow. Despite that, this block was both confirmed, and finalized on a majority fork. Since this was the divergent block, we know this was where we should focus our attention.

The biggest problem we were facing is that nodes synching from scratch or a backup (a.k.a a checkpoint) could not progress past Block 3191538. We didn’t notice that backups were being maintained by the community, so we reviewed them, but there was nothing out of the ordinary.

It soon became obvious that there was something strange with the secret lock with the composite hash of 0D0D03B478CBFA3A9C0A0A292E2FECB2C09A5DA01659CFF20C199823D0E79E81. Astute students of Symbol will remember that a secret lock composite hash is derived from the secret lock's secret and recipient - if either are changed, the secret lock's composite hash will be different.

We observed the following usage in the blockchain for secret locks with that composite hash:

Block 3,191,088: ACTIVATE | SECRET_LOCK created, with a 420 block expiration;
Block 3,191,098: DEACTIVATE | SECRET_LOCK completed;
Block 3,191,525: - ACTIVATE | SECRET_LOCK created, with a 420 block expiration;
Block 3,191,538: - ACTIVATE | SECRET_LOCK created, with a 420 block expiration;
Block 3,191,576: - DEACTIVATE | SECRET_LOCK completed.

Hm. Interesting.

Symbol allows for a composite hash to be reused if the prior lock is inactive (meaning, it has been COMPLETED or EXPIRED). So, the recreation of the lock at 3,191,525 is valid and makes sense because the prior lock was completed at block height 3,191,098. However, the recreation of the lock at block height 3,191,538 is unexpected, as the prior lock should still be active.

Author’s Note: Unrelated to this issue, it should be stressed that it is generally insecure to reuse a secret. In order to reduce memory requirements, Symbol does not keep track of all secrets used (neither do most chains). So, please remember this when using secret locks!

Another important observation we made was that only long running nodes seemed to have record of this duplicate lock creation. Nodes that synced from scratch rejected the duplicate lock created at block height 3,191,538 - and that was about all of the information we would have to diagnose and fix this issue. We would need to do a lot of ad-hoc testing and analysis to figure out what was happening.

Our first thought was that this could be a rollback issue, but we couldn’t explain why only the nodes with a longer-than-average uptime would have been affected: rollback issues typically only affect a small number of nodes. We quickly eliminated this train of thought, and moved onto our next potential clue: the transactions.

Something was going on with the pruning of secret locks (and finalization).

First, some background: since Symbol allows a composite hash to be reused internally, a list of secret locks is stored for each composite hash. The data structure is essentially a stack.

In addition, since secret locks are time limited, they can be pruned from the cache when they can no longer change the blockchain’s state. In practice, a secret lock is pruned when the block at which it can expire is finalized.

With this in mind, the secret lock created at block 3,191,098 would be pruned at block height 3,191,508(420 + 58).

We were able to piece together the likely sequence on the majority fork:

There was a SECRET_LOCK created at 3,191,525;
Block 3,191,508 finalized;
All matching SECRET_LOCKs were removed;
The SECRET_LOCK was created at 3,191,538.

The bug, however, was that the entire secret lock history was being pruned if any secret lock expired.

The Solution

The problematic code was this:

	auto groupIter = groupedSet.find(key);
	const auto* pGroup = groupIter.get();
	if (!pGroup)
		return;

	for (const auto& identifier : pGroup->identifiers())
		set.remove(identifier);

	groupedSet.remove(key);

pGroup->identifiers() returns all the composite hashes that expire at the specified height (key).

set.remove(identifier); removes the entire secret lock history instead of only the expiring parts.

Spot the fix? We need to prune secret locks individually, instead of whole histories.

The new code is:

	std::unordered_set<typename TDescriptor::KeyType, utils::ArrayHasher<typename TDescriptor::KeyType>> pendingRemovalIds;
	ForEachIdentifierWithGroup(*m_pDelta, *m_pHeightGroupingDelta, height, [height, &pendingRemovalIds](auto& history) {
		history.prune(height);

		if (history.empty())
			pendingRemovalIds.insert(history.id());
	});

	for (const auto& identifier : pendingRemovalIds)
		m_pDelta->remove(identifier);

	m_pHeightGroupingDelta->remove(height);

Here, only histories without any remaining secret locks are removed. The important part is only marking a history for removal when it is empty (history.empty()).

In order to work around the duplicate secret lock already in the chain (at block *538), a new configuration setting is added. skipSecretLockUniquenessCheck should be set to 3'191'538 on mainnet and 0 on all other chains.

The secret lock uniqueness check will be skipped at the height to which it is set.

Author’s Note: It is interesting to understand why state hash validation succeeded throughout this process. At block height 3,191,525, the created (second) secret lock was added to the cache and the state hash changed. Somewhere between blocks *526 and *537, the secret lock was pruned.

Pruning is assumed to not affect state because it only removes entities that can no longer affect the state. At block *538, the created (third) secret lock was added to the cache and the state hash changed.

Effectively, this was captured as an update of the secret lock.

Due to the improper treatment of the secret lock history, two additional manifestations of the bug were identified.

First, there is a situation where a secret lock can expire multiple times. For example, consider the following sequence:

Block 3197 038 - Secret lock created with 420 block expiration
Block 3197 075 - Secret lock completed
Block 3197 446 - Secret lock created with 420 block expiration

The first lock should expire at height *(038 + 420), but this should not result in any state change because it has been completed.

Unfortunately, at the time of expiration, only the status of the most recent lock is checked - the one created at *446. Since this lock is incomplete, Symbol incorrectly generates an expiry receipt and transfer.

This was fixed by explicitly checking whether or not the expiring lock is the most recent one and only continuing processing when it is:

	if (height == lockInfo.EndHeight || forceHeights.cend() != forceHeights.find(height))
		accountStateConsumer(accountStateCache.find(lockInfo.OwnerAddress).get());

There were two locks exploiting this bug on mainnet:

Lock created at block 3197038 with 1 uint of mosaic 0BCF7F87A4175ABE
Lock created at block 3197361 with 1 uint of mosaic 00E26AEDCE86A630

The forceSecretLockExpirations setting was added to allow synchronizing nodes to emulate the original (wrong) behavior.

Second, there is a situation where an expiry receipt is not registered for a secret lock. For example, consider the following sequence:

Block 3197 038 - Secret lock created with 420 block expiration
Block 3197 075 - Secret lock completed
Block 3197 446 - Secret lock created with 420 block expiration

The first lock is eligible for pruning at height *(038 + 420) whereas the second is eligible for pruning at height *(446 + 420). Due to the primary bug, when the block *781 is finalized, the entire history is removed from the cache. This includes the second (active) lock created at height *440. As a consequence, the lock will be unable to complete.

Since the first lock was errantly completed AND expired, this balances out. Notice that, together, these two manifestations are related. Effectively, they shorten the second lock duration to the duration of the first lock.

There were two locks that were affected by this bug on mainnet:

Lock created at block 3197440 with 1 uint of mosaic 00E26AEDCE86A630
Lock created at block 3197446 with 1 uint of mosaic 0BCF7F87A4175ABE

The skipSecretLockExpirations setting was added to allow synchronizing nodes to emulate the original (wrong) behavior.

For correct syncing from scratch, we recommend the following settings for mainnet:

skipSecretLockUniquenessChecks = 3'191'538
skipSecretLockExpirations = 3'197'860, 3'197'866
forceSecretLockExpirations = 3'197'458, 3'197'781

Please upgrade to 1.0.3.7 as soon as possible.

This is a mandatory upgrade and will prevent similar issues from happening in the future!

We would like to thank all the community members who helped, especially toshi, dusan, bootaru (and the NFTDriveEX team) and neluta!