Blockchain: Embedding the provenance of a digital file.

Do you know what the cool thing about immutable data is? You can’t change it…

Do you know what the problem with immutable data is? You can’t find it!

Unless, of course, someone points you in the right direction. But wait, then you have to trust them! Look -> that’s my immutable tree (e.g. a hash of data). Isn’t it wonderful, completely tamper-proof evidence?

But wait…

How do you know that is the ‘right’ tree? Let me explain in the best possible way, using an example from The Wolf of Wall Street (WoWS).

One of the famous scams in the film involved a mailing list of investors. The ‘Wolf’ predicts a series of financial events before the fact, splitting the mailing list into prediction A and prediction B. He mails exactly half of the list one prediction, and the other half the opposite one. If we start with 2000 people on the list, every ‘prediction’ halves the number who have received only ‘correct’ predictions. After 5 splits and 5 predictions, you have a mailing list of about 62 investors (2000 / 2^5 = 62.5) who have seen you impossibly predict real-life events before they occurred.

These 62 or so targets are then ripe for the picking. The ‘Wolf’ picks up his phone and proceeds to leverage his proof of omnipotence to scam the investors out of their hard-earned cash.
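
The halving arithmetic is easy to check. Here is a minimal sketch, assuming the list is split exactly in half each round and rounded down; the function name `remaining_after` is made up for illustration.

```python
def remaining_after(list_size: int, rounds: int) -> list[int]:
    """Size of the group that has only ever seen correct predictions,
    recorded after each round of splitting the list in half."""
    sizes = []
    for _ in range(rounds):
        list_size //= 2  # only one half received the prediction that came true
        sizes.append(list_size)
    return sizes


sizes = remaining_after(2000, 5)
print(sizes)  # [1000, 500, 250, 125, 62]
```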

What the Wolf has done here is create as many facts as he desired, and reference whichever one ‘became true’ and therefore suited him. So each prediction he pointed to was true, but in a greater context it wasn’t proof of the truth!

Our poor blockchain, how do we protect it from the WoWS?…

Hello — I am a cryptographic hash

Key to this problem and its solution is an understanding that blockchains often aren’t used as a ‘database’ with bells and whistles. That’s what databases are for. Blockchains are ledgers to record data, or even better, proof of data. For that reason people often choose to submit evidence of data, or references to data, using cryptographic hashes. Apart from the fact that this makes the blockchain far more lightweight and scalable, it also stays true to the ‘function’ of a blockchain: a distributed ledger to trustlessly prove immutable data. It has the added advantage that in a public setting you can prove data exists and is not corrupted without revealing the contents of that data to prying eyes.

A cryptographic hash, at its simplest level, is a way of generating a fixed-length string (the hash) from an input, such that in practice no two different inputs produce the same hash. Only the unaltered data will recreate this hash, and it is computationally infeasible to reconstruct the data from the hash. Therefore, if you can run the hashing function on the data and come up with the same hash that is on the blockchain, you have evidence that this was the data that was submitted. It is surprisingly simple: even one deviating byte from the original input will change the hash, showing that the data has been tampered with.
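
To make that concrete, here is a minimal sketch using SHA-256 from Python's standard `hashlib`. The choice of SHA-256 and the sample byte strings are assumptions for illustration; the article does not say which hash function a given blockchain uses.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


original = b"Divorce contract v1: both parties agree."
tampered = b"Divorce contract v1: both parties agree!"  # last byte changed

# Pretend this value is what was recorded on the blockchain.
on_chain_hash = sha256_hex(original)

print(sha256_hex(original) == on_chain_hash)  # True: the data matches the record
print(sha256_hex(tampered) == on_chain_hash)  # False: one byte differs
```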

So we have a way of validating data, but what we don’t have is a way for people to find that data. Going back to the WoWS example, people could maliciously store that data in many different contexts to suit their purpose.

The proof of the devious husband…

Let’s imagine that this particular data is a divorce contract between a devious husband and his long-suffering wife that they’d decided to sign on a blockchain. The devious husband can take the data and publish it on the blockchain, where his wife signs it! All good, right…


The devious husband has taken the opportunity to republish exactly the same information again on the blockchain. Same data, same contract. So his wife has happily put her digital signature onto the contract but, unbeknownst to her, the husband is looking at an entirely different truth. He takes the reference to the unsigned version to court to prove that his wife missed the deadline to sign their agreement, costing her the right to theoretical custody of their non-existent children.

This is an example of creating multiple truths like the WoWS, and referencing the one that happens to be convenient in each context. Let us remember, the data itself is immutable (so the tree itself is the same), but we don’t have much control over where it exists or is published and in what context.
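
The husband's trick can be sketched in a few lines. This is a toy model, not a real blockchain: the `Entry` class, the block heights, and the list standing in for the ledger are all hypothetical. The point is only that identical data yields identical hashes, so the hash alone cannot tell you which publication is the ‘right’ one.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Entry:
    block_height: int
    data_hash: str
    signatures: list[str] = field(default_factory=list)


contract = b"Divorce agreement between husband and wife"
h = hashlib.sha256(contract).hexdigest()

# The same contract is published twice; the hash is identical both times.
ledger = [Entry(100, h), Entry(101, h)]
ledger[0].signatures.append("wife")  # she signs only the entry she was shown

# Looking up the contract by its hash finds both publications.
matches = [e for e in ledger if e.data_hash == h]
unsigned = [e.block_height for e in matches if not e.signatures]

print(len(matches))  # 2
print(unsigned)      # [101]: the entry the husband takes to court
```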

The issue

There are ways to mitigate this problem that blockchain architects can use. For example, you can make the hashed data reference immutable content that would provably expose a lie.

To use supply chains as an example, imagine that we had a ledger of lettuces that Farmer Joe forgot in the back of his warehouse. A good provenance blockchain system would create the ‘assets’ and their unique ids at the time of harvesting, and make sure that the ‘time’ was registered as part of the data (as well as the timestamp on the blockchain).

This would mean that should Farmer Joe feel tempted to issue a new batch of lettuces (a new convenient truth) to pass off his over-ripe produce as fresh (switching the stamp) he will encounter a few problems.

  1. How to account for the issued lettuce in the first batch and;
  2. How to justify the re-harvesting of the new batch and;
  3. Someone noticing that his ‘fresh’ lettuce isn’t actually fresh at all.

We can’t force Farmer Joe to be honest and not lie about having created a ‘new lettuce’ that he didn’t really harvest, but we are certainly making it harder for him to maintain the consistency of the system (his charade!). Should any complaints arrive regarding his overripe produce, it would be easy to follow the audit trail and see patterns of misuse.

Hold on Farmer Joe, this ‘new’ moldy lettuce you delivered seems to have been inexplicably harvested 3 days after the batch of fresh lettuces, and there is a surprising similarity with those products that supposedly went missing off the back of a lorry from the first batch.
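
Why can’t Farmer Joe simply switch the stamp? If the harvest time is part of the hashed record, changing it produces a different hash, so he has to issue a new asset instead. A minimal sketch, assuming a JSON record hashed with SHA-256; the field names `asset_id` and `harvested_at` and the sample values are hypothetical.

```python
import hashlib
import json


def asset_hash(asset_id: str, harvested_at: str) -> str:
    """Hash an asset record that includes its harvest time."""
    record = {"asset_id": asset_id, "harvested_at": harvested_at}
    # sort_keys gives a stable byte representation for the same record
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


registered = asset_hash("lettuce-0001", "2020-03-01T08:00:00Z")
restamped = asset_hash("lettuce-0001", "2020-03-04T08:00:00Z")

print(registered == restamped)  # False: the new stamp no longer matches the ledger
```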

But of course, most people aren’t using blockchains to keep track of Farmer Joe’s lettuce. They are using them to transfer digital assets (in many cases with a corresponding fiat value), and to prove the provenance of all sorts of different things. So if we can be sure that the ‘data’ on the blockchain hasn’t been tampered with, the question is (as Obi-Wan might say): how can we be sure that this is the data we are looking for?

The current solution

As referenced above, part of the solution is to add data and make references within the hashed data that would act as proof. You could associate a cryptographic record of an author and a date with some data on the blockchain. Then if that data appeared elsewhere with a different author and date, you have a means of proof that one appeared ‘before’ the other.

Imagine, for example, that you are a world-class photographer, and you upload a photo (which you know will go viral) to your photographer’s provenance blockchain.

Some horrible dastardly malcontent then steals your photo and proceeds to enter it into the world series of photography under a different name. This dastard has also submitted it to a public blockchain, ‘proving’ their claim on the provenance. You have one piece of data being used in different contexts by different publishers. Currently, if you noticed this kind of infringement or use (assuming you actually noticed!), you would have to manually provide evidence that your blockchain record is the more valid one, and take evidence from the blockchain that your block came first and is the genuine ‘first’ claim on the data.

This isn’t a bad system, and in some cases it will work. It just requires an extra step in the middle to ‘locate the tree’ and to trust that the person who has planted the tree is acting honorably to find the proof of provenance. The photography competition could study the claims and award it to the original owner based on the fact that he was the first to submit cryptographic evidence of creating that data.
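
The manual check the competition would perform can be sketched as below. It assumes every claim carries a timestamp that is comparable across blockchains, which is a simplification; the `Claim` class and the sample values are hypothetical. Note that someone still has to gather all the claims first, which is exactly the ‘locate the tree’ step.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    author: str
    data_hash: str
    timestamp: int  # Unix time of the block holding the claim (assumed comparable)


def earliest_claim(claims: list[Claim], data_hash: str) -> Optional[Claim]:
    """Return the oldest claim on this hash, or None if nobody has claimed it."""
    matching = [c for c in claims if c.data_hash == data_hash]
    return min(matching, key=lambda c: c.timestamp) if matching else None


photo_hash = hashlib.sha256(b"viral photo bytes").hexdigest()
claims = [
    Claim("dastard", photo_hash, 1_600_000_500),
    Claim("photographer", photo_hash, 1_600_000_000),
]

winner = earliest_claim(claims, photo_hash)
print(winner.author)  # photographer
```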

A step further…

Here at Chainfrog we’ve filed a patent that makes headway on this particular problem, particularly in reference to signed digital assets. So, let’s imagine we’ve just created a very important document (VID), and we want to submit evidence to a public blockchain that the VID exists, and was created by us at a certain time.

Now let’s imagine that we send that VID to the blockchain, and we present the hash of the VID that we have submitted to some high-profile public blockchains! We have provided immutable evidence that the VID was published by us at a certain point of time that we can show to someone whenever it suits our purposes.


The evidence for the VID is provided by us, as and when we feel it is appropriate.


We have invented a mechanism for improving the provenance of digital files. It works by taking a file (for example a PDF, or an image file), and writing the reference to the blockchain, including the block height, into the metadata of the file. That way, the file itself contains a permanent immutable reference to its proof. If we change the metadata to point to a different ‘proof’, the hash will change and the file essentially becomes a ‘different’ one. So with our invention we permanently link a file, via its metadata, with its cryptographic proof of existence.
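
The property described here, that repointing the metadata changes the hash, can be illustrated with a toy sketch. This is not Chainfrog’s patented method: the JSON trailer below is only a stand-in for real file metadata (such as EXIF or PDF metadata), and `MARKER`, `embed_reference`, `read_reference`, and the chain name are hypothetical.

```python
import hashlib
import json

MARKER = b"\n--PROOF-REF--\n"  # hypothetical delimiter standing in for a metadata field


def embed_reference(content: bytes, chain: str, block_height: int) -> bytes:
    """Return a copy of the file with a blockchain reference attached."""
    ref = json.dumps({"chain": chain, "block_height": block_height}, sort_keys=True)
    return content + MARKER + ref.encode("utf-8")


def read_reference(stamped: bytes) -> dict:
    """Read the blockchain reference back out of a stamped file."""
    _, _, ref = stamped.rpartition(MARKER)
    return json.loads(ref)


image = b"...image bytes..."
stamped = embed_reference(image, "example-chain", 650000)
repointed = embed_reference(image, "example-chain", 650001)

print(read_reference(stamped))
# Same image, different reference: the two files no longer share a hash.
print(hashlib.sha256(stamped).hexdigest() == hashlib.sha256(repointed).hexdigest())  # False
```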

This takes away the possibility of somebody selectively supplying proof to support their own means, and it also means that you have a permanent reference to the proof of the data, meaning that you have an easy way to find it!

If you want a bit more information about ‘how’ we do this, feel free to follow the videos on LinkedIn or get in touch!

Working demo

We have a small working prototype showcasing this invention. It works with jpeg or jpg files (under 1 MB) and it allows you to upload an image file on a website. This website will then write the block height and a reference to the blockchain into the image metadata and provide the file for download. You then use another website to upload this file, where it will read the metadata and confirm that this uncorrupted file has been published to the blockchain exactly where it said it would be.
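
The upload-then-verify round trip might look something like the sketch below. It is a guess at the general shape, not the prototype’s actual code: a plain dictionary stands in for the blockchain, a JSON trailer stands in for JPEG metadata, and `publish`, `verify`, and `MARKER` are hypothetical names. A real system also cannot pick its block height freely the way this toy does.

```python
import hashlib
import json

MARKER = b"\n--PROOF-REF--\n"  # hypothetical stand-in for an image metadata field


def publish(ledger: dict, content: bytes, block_height: int) -> bytes:
    """Stamp the file with its target block height, then record the
    stamped file's hash in the mock ledger at that height."""
    ref = json.dumps({"block_height": block_height}).encode("utf-8")
    stamped = content + MARKER + ref
    ledger[block_height] = hashlib.sha256(stamped).hexdigest()
    return stamped


def verify(ledger: dict, stamped: bytes) -> bool:
    """Read the block height from the file and check that the ledger
    holds this exact file's hash at that height."""
    _, sep, ref = stamped.rpartition(MARKER)
    if not sep:
        return False  # no embedded reference found
    try:
        height = json.loads(ref)["block_height"]
        return ledger.get(height) == hashlib.sha256(stamped).hexdigest()
    except (ValueError, KeyError, TypeError):
        return False  # reference is missing or malformed


ledger = {}
downloaded = publish(ledger, b"jpeg bytes", 42)
corrupted = downloaded.replace(b"jpeg", b"jpgg")

print(verify(ledger, downloaded))  # True
print(verify(ledger, corrupted))   # False
```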

It is pretty fun in that it is a self-referencing digital file with evidence of its proof on the blockchain…

  1. Upload a small image here (and download)
  2. Verify the downloaded image here:

COVID-19 Blockchain response ->

In these difficult times we’ve published a POC system for a blockchain distributed system to provide data feeds and implement a traffic light system, follow the blogs here:


