October 25th 2022, Post Mortem
The Discovery
On October 13th, Toshi reported some unexpected behavior to us on Discord in the #sdk-js
channel. He had been comparing the outputs of the deprecated TypeScript SDK to the newer JavaScript SDK and noticed that the aggregate transaction hashes were being calculated differently. That…was not expected behavior. Even more concerning was the fact that they were both being accepted (and confirmed) by the network.
Root Cause
The team quickly identified two critical bugs - one in Catapult (the Symbol client), and one in the TypeScript and Java SDKs.
Client
In Catapult’s “Fushicho 2” (November 8th 2019) release, a new validator was added to the Aggregate Transaction plugin for the purposes of validating the TransactionsHash field of completed and bonded aggregate transactions. Unfortunately, due to an oversight, it was never registered in the Aggregate Transaction plugin, and thus never called.
manager.addStatelessValidatorHook([config](auto& builder) {
// the following line should have been present but was not
builder.add(validators::CreateAggregateTransactionsHashValidator());
builder.add(validators::CreateBasicAggregateCosignaturesValidator(
config.MaxTransactionsPerAggregate,
config.MaxCosignaturesPerAggregate));
if (config.EnableStrictCosignatureCheck)
builder.add(validators::CreateStrictAggregateCosignaturesValidator());
});
Due to the rush to launch, we forgot to add a E2E test that would trigger the Failure_Aggregate_Transactions_Hash_Mismatch
failure code. Consequently, Symbol was launched without proper validation of aggregate transactions hashes.
On top of that, there was a discrepancy in the hash calculation for aggregate transactions between the deprecated TypeScript SDK; the Java SDK; and our Python and JavaScript SDKs.
While the Python and JavaScript SDKs were calculating aggregate transaction hashes correctly, the TypeScript SDK and Java SDK - which we had deprecated earlier this year - were not.
SDK
The (deprecated) TypeScript SDK has two bugs in its calculation.
Recall that an aggregate transaction is a container of embedded transactions. Each embedded transaction is guaranteed to start on an 8-byte bounday. Zeroed pad bytes are inserted between transactions, whenever necessary, in order to achieve this. When calculating the embedded transaction hash, only the meaningful data of the transaction should be hashed and the zero pad bytes should be excluded. Unfortunately, the TypeScript SDK was including these zero pad bytes in the hash calculation. While frustrating, this does not reduce security.
The real bug was found in the Merkle hash calculation. Due to a improper usage of splice, splice inserted instead of replaced an element. The second parameter should have been 1 to replace.
hashes.splice(i / 2, 0, this.hash([hashes[i], hashes[i + 1]]));
Due to both poor test coverage as well as not utilizing our standard test vectors, both of these oversights went unnoticed.
The (deprecated) Java SDK had the same Merkle hash calculation bug as the TypeScript SDK. Surprisingly, it calculated the embedded transaction hashes correctly (without padding).
It is somewhat amusing that these two SDKs have been calculating different aggregate transactions hashes for years and no one noticed until now. At a minimum, they should have been using common test vectors.
How an Attack Could Work
Due to the missing check of the aggregate transactions hashes, cosignatories are only signing the Aggregate Transaction header. So, the only constraint is that the total embedded transactions size must match the payload_size in the header.
Due to the bug in the invalid Merkle Hash algorithm used by the TypeScript and Java SDKs, the corrupt Merkle Hash only protects a subset of transactions from modification. For example, when there are three transactions, only the first two transactions are protected from modification but not the third. As long as the size of the third transaction does not change, it can be replaced with anything.
Example Attack: Aggregate Complete
Assume there are three participants Alice, Bob and Charlie that want to swap mosaics.
Alice wants to pay Bob and Charlie each 10 Alpha mosaics.
In exchange, Charlie has agreed to pay 100 XYM.
Alice creates an aggregate complete transaction with three parts:
Alice sends 10 Alpaca to Bob
Alice sends 10 Alpaca to Charlie
Charlie sends 100 XYM to Alice
After creating the transaction, Alice signs it. She passes it to Bob who cosigns it. Bob passes it to Charlie who wants to cheat Alice.
Charlie knows this and replaces the third transaction so that he sends Alice 0 XYM instead. Or, even worse, he could change it so Alice sends HIM 100 XYM!
After replacing the third transaction, Charlie cosigns it and sends it to the network. He and Bob each receive 10 Alpaca from Alice. Alice receives nothing and could end up even sending XYM to Charlie.
Example Attack: Aggregate Bonded
The previous attack can also be carried out against aggregate bonded transactions.
After creating the transaction, Alice sends it to the network where it is added to the partial transactions cache. Bob submits his cosignature to the network.
At this point, Charlie downloads the transaction. He modifies the third transaction as in the previous section. Next, he downloads Bob’s cosignature from the network and attaches it. Finally, he cosigns it himself, and sends the, now completed, aggregate to the network.
Assessment
Upon discovery, we had two priorities: get the network patched before a blackhat could take advantage of the bug, and assess the damage on the current chain.
At the time of analysis (October 25th), there were 38,8607 aggregates confirmed on the network.
Group 1
73,615 (19.07%) were using unpadded hashes along with the correct Merkle hash algorithm.
129,987 (33.68%) were using padded transactions hashes, along with the correct Merkle hash algorithm. These are highly likely to have been initiated with the TypeScript SDK. Due to the specifics of the bug in the Merkle hash algorithm, aggregates containing two or fewer aggregates will calculate a cryptographically verifiable hash.
Group 2
184,471 (47.80%) are using padded transaction hashes along with an invalid Merkle hash algorithm. These are highly likely to have been initiated with the TypeScript SDK. Within those:
514 (0.13%) are using unpadded transaction hashes along with the invalid Merkle hash algorithm. These are highly likely to have been initiated with the Java SDK;
20 (0.005%) are corrupt and not calculated by one of the above methods. Upon further analysis, these all seem to have been initiated by the miscellaneous scripts. Reviewing these scripts, some of them modify one or more embedded transactions after calculating the aggregate transactions hash but do not recalculate it.
All of the embedded transactions in Group 1 can be crypographically verified as intended in the design of Symbol. None of the embedded transactions in Group 2 can be cryptographically verified. Luckily, once they are finalized, an attacker will be unable to spoof them because the state hashes must match.
The Fix
Catapult
In v1.0.3.4, the AggregateTransactionsHashValidator
is correctly registered in the Aggregate Transaction plugin.
At the fork block of 1’690’500, all aggregate transaction hashes will be calculated from the embedded transaction hashes without padding, and with the correct Merkle hash algorithm.
Prior to the fork block, it will allow aggregate transactions with transactions hashes meeting any of the following criteria:
Unpadded transaction hashes with correct Merkle Hash algorithms
Padded transaction hashes with correct Merkle Hash algorithms (TS SDK, <= 2 embedded transactions)
Unpadded transaction hashes with invalid Merkle Hash algorithms (Java SDK, > 2 embedded transactions)
Padded transaction hashes with invalid Merkle Hash algorithms (TS SDK, > 2 embedded transactions)
Hash has matching entry in new corrupt_aggregate_transaction_hashes network configuration setting (20 miscellaneous)
For performance reasons, at and after the fork block, only aggregate transactions with version 2 will be allowed. This version bump will allow the aggregate transactions hash validator to be stateless and run in parallel outside the main commit lock.
SDKs
The Python and JavaScript SDKs have been updated to support V2 aggregate transactions (complete and bonded).
The TypeScript SDK has been updated to use the correct Merkle hash algorithm, calculate embedded transaction hashes without padding, enable creation of V2 aggregate transactions, and have limited support for reading and verifying V1 aggregate transactions.
The Java SDK has been updated to use the correct Merkle hash algorithm, and have limited support for V2 aggregate transactions.
Moving Forward
One of the challenges in being a client developer is how you handle vulnerabilities in the network. It’s a constant tradeoff between responsible disclosure and decentralization: speak publicly, and you may find yourself entering into a game of cat-and-mouse with an opportunistic hacker; speak privately, and you isolate a large portion of your userbase. Your attack surface increases with each user you inform, so you’re constantly balancing the tradeoffs.
There is no standard framework to follow - each situation is unique. You’re making battle plans on-the-fly.
It’s clear we need a better way of quickly updating the network.
Traditionally, projects will have “war rooms” where validators and service providers gather - very rarely will exchanges intermingle with developers and other operators, thus increasing the burden of information distribution. We should have a war room established for in the future to quickly disseminate informaton, and as a community hold each other accountable under a gentlemen’s agreement to not leak or distribute information outside of the channel.
It’s also clear we need a better way of releasing. We moved towards a monorepo structure earlier this year to reduce that burden (thanks, Jenkins), but our turnaround time from fork to announcement was almost eight hours. That is unacceptable.
Finally, we need to slow down. In a rush to release Symbol to meet some arbitrary deadline, end-to-end tests were skipped and so were critical audits. Our test coverage across products is poor. And a final bug in Bootstrap led to a subset of nodes not being able to upgrade - a bug that we’re still triaging as we speak.
We’ll be working on a dedicated node deployment and maintenance tool post-fork to help reduce the burden on node operators and ensure we have seamless upgrades in the future. Our team is working on upping the test coverage of both the desktop and mobile wallets and migrating off the TypeScript SDK to our new dual NEM and Symbol JavaScript SDK, and we’re working on a documentation overhaul to ensure undefined or incorrect behavior in the future can be caught by both observant community developers and passionate community members.
Finally, we are awarding a responsible disclosure bounty to Toshi of roughly ¥370,000 JPY. Thanks to our favorite space lion, the Symbol network is secure once again.