Cracking the Code: How Merkle Trees Ensure the Integrity of Data

Ensuring blockchain data integrity

As a blockchain developer, you are well aware of the importance of ensuring the integrity of the data you record in your blockchain.

Merkle trees offer a powerful solution to this challenge, leveraging the properties of good cryptographic hash functions to create a secure digital fingerprint for a set of data elements. In this blog, we’ll delve into the workings of Merkle trees and how they can be used to prove the integrity of transactions in proof-of-work blockchain systems.

By the end of this article, you’ll have a deep understanding of how Merkle trees work and how they can be used to improve the security of your blockchain.

What are Merkle trees?

Merkle trees are a data structure used in computer systems which ensures the integrity of all the elements included in a data set when there is a need to prove that no tampering has occurred anywhere in the records.

A Merkle tree leverages the properties of good cryptographic hash functions in that every data element that is to be recorded in the Merkle tree is first passed through a one-way hash function as the first step of recording.

Passing the data object through the hash function maps an arbitrary length data element to a fixed length output in an effectively unique and completely deterministic way.

If the SHA256 hash function is used, then regardless of if it’s a 10-byte file or a 4GB file, the output of passing the binary representation of that file through the hash function will always be 32 bytes (256bits).

Any data element at its essence is simply just 1’s and 0’s. 8 of these 1’s or 0’s makes a byte. Whether it’s a jpeg, pdf, mp3 or mp4, it is all fundamentally still those 1’s and 0’s where a file extension simply directs how those 1s and 0s are interpreted or written by a program.

Any change in any part of a data object, whether it’s an extra letter, a deleted word, an added punctuation mark, will all result in a change to the 1’s and 0’s of that data element.

The output of any data element put through a good hash function will be completely different following the change of even a single 1 or 0, to the extent that even though they are passed through the same SHA256 algorithm, there is no discernible mathematical relationship between the two versions of the same data object.

This means the output from the hash function can serve as a digital fingerprint that is specific to the data object in that state. Any change will be reflected in an entirely different hash output.

However, because the algorithm that underwrites the hash function is purely deterministic, every time you put the identical data object through the same hash function, you will create the same identical output.

How Merkle trees secure data

A Merkle tree leverages this property to create a record of a data set which can be used to prove that no tampering has occurred to any of the elements of the set.

If we have four files, A, B, C and D, then we can hash those four files to get H(A), H(B), H(C), and H(D). These first hashes are called the leaf node values of a Merkle tree.

As the relative ordering of those files is important, we can create a digital fingerprint for the pairs in order.

We can join H(A) and H(B) together to create a 64-byte string of characters, which itself can be passed through the SHA256 function to create the pairs’ digital fingerprint. This fingerprint will be entirely different if the pair is ordered as BA rather than AB. The same can be done with H(C) and H(D).

The digital fingerprint of pairs of values from a Merkle tree are known as the interior node values of the tree.

Now to record the ordering of these fingerprints of the pairs, we need to join them together again to make a 64-byte value, which is again put through the hash function to create a single secure fixed byte length summary value of the entire set. This value is called the Merkle root.

Remembering that AB will hash to a different result than BA, Similarly ABCD will hash to a different output than CDBA even though it’s the same files.

How Merkle proofs scale proof-of-work blockchains

This same process which was done on our four objects can also be done on millions or billions of data objects. Because as we move up through the output of ordered pairs being hashed, we halve the number of objects in the data set on the next layer of the tree.

In the above example, if I wanted to check whether data object A is still in the same state as it was when the published Merkle root was created, then I can simply hash the object, and combine it with the hash of object B. Then I combine the value I calculated there with the digital fingerprint of CD, and if I get the same Merkle root then I know that no changes have occurred.

So even in a batch of 4 billion data objects, if some specific values are provided, then the integrity of any element in that set can be determined by only performing the number of hashes as there are layers to the tree. In that case it will be 32 calculations as 232 = 4.29 billion.

Read: How the Merkle proof standard can improve your Bitcoin payment service

How Merkle proofs ensures integrity of blockchain data

In proof-of-work blockchain systems, we call the set ‘a block’, and the elements transactions. We can then ensure that the Merkle root which summarises all the transactions of a batch of transactions to be recorded as a block has not been tampered with by again leveraging the property of cryptographic hash functions.

For the SHA256 function, there are 2256 (or approximately 10 with 78 zeroes worth) of unique outputs which can be created. The outputs then are effectively just a randomised number between 0 and 2256. By the nature of probability and distribution, half the numbers will be bigger than 2255 and very few of them will start with a zero.

As even fewer of them will start with multiple zeroes, then if we include the Merkle root as one of a few data parameters which are put into a hash function as a serialised 80-byte string, then we can set a challenge that the only valid output is one that has a certain number of leading zeroes.

To find a solution that satisfies the leading number of zeroes is a challenge that can only be accomplished by brute force. After each unsuccessful attempt, one bit of the serialised 80-byte string is changed, and another attempt is made.

Although Bitcoin miners can execute billions of these hashes per second on their specialised machine, even with hundreds of these machines networked together and working on that challenge, it will still take them many minutes to find a valid output.

Because it takes several minutes of operating expensive machinery which uses immense amounts of power, the solution to the hash puzzle is known by all on the network to be very hard to recreate. The output then can be shown to be proof that you have done the work to find a valid solution and other entities can know that you mean business and are invested in the system.

A Merkle root being part of the input that was hashed to generate a winning output, can then be considered extremely secure and by extension, the chronology of all the transactions that are summarised by that Merkle root can be considered secure as well.

The same end could be achieved by leveraging the property of other hash-based data structures such as a linked list, where A is hashed to create H(A), then H(B) is added on to the end and hashed, to create H(AB) and then H(C) Is added on to the end and hashed to create H(ABC) etc, etc.

This approach, while secure, suffers in that in order to verify the integrity of a single element, many operations need to be performed, and it cannot be processed in parallel by computer systems.

A Merkle tree on the other hand is extremely parallelisable because substructures of the entire tree can be processed separately and only combined at the higher levels of the tree. In our example above AB can be processed on a separate core than CD and they can be combined later as one operation. In a Merkle tree with 32 layers, then the entire tree can be threaded into 2, 4, 8, 16 or even 32 cores in parallel.

Additionally, to verify the integrity of a data element, only 32 x 32-byte values need to be provided in addition to the element to determine whether it can create the desired Merkle root. This efficiency is at the heart of the Simplified Payment Verification mechanism where the integrity of the source of funds being committed to a transaction can be checked by the provision of only 32 x 32-bytes (1KB) of data for one transaction out of 232 or 2KB for one out of 264 transactions.

Bitcoin Merkle proofs illustrate the creator’s vision of scale

The Merkle tree is not necessarily the most efficient data structure to use in systems that only record a few transactions in a block. Simple hashed lists would achieve the same end and offer more simplicity. When the system grows to be processing billions of transactions a second, then the efficiency of a Merkle tree really comes into play and is critical to the scalability of the system.

It is truly impressive how much forethought went into these design considerations by Dr Craig Wright to build a system from the beginning which was capable to handle such astronomical throughput many years down the road.