Git: it’s a blockchain!

Ben Klemens
13 min read · Apr 30, 2024

This is a git tutorial, in the form of an introduction to block chains/This is a learning-by-doing exercise in building a block chain, with application to git.

Block chain-wise, this is an extremely basic demo in which we build an extremely basic chain, which will at least give you a more concrete idea of what’s going on — and how simple it is. Your hands will wave that much less next time blockchains come up. What makes it immutable and distributable? Why do I not need to pay somebody a million dollars to implement one? If you know what a Merkle tree is, you’re already ahead of this post, but out of the millions of people who talk about block chains, a very great many don’t.

Git-wise, take this as an intermediate-level tutorial, in the sense that you may have already done one of those “type this, now type this” tutorials that give you the bare minimum you need to get by, but you want to have a better idea of what’s actually going on. When I google “git tutorial”, the search engine tells me it found “About 565,000 results”. That means I can write a why article, and leave the how to the others.

OK, let’s build some block chains! I won’t mention git again for a long while. If you want to spoil it and jump to that part, <ctrl>-F on this page for “Everything above applies directly to git.”

I know it makes the post look needlessly intimidating, but in the spirit of radical replicability, I’m going to give you the commands to replicate the entire process on your own personal computer. If you cut/paste the blocks of code to your POSIX command line (“the bash prompt”), you’ll get the results shown and build the same databases I’m building. If you’re not in the mood to read other people’s code, feel free to skip the replication script to the output showing what’s new at each step. You’re not being tested on this.

Step 1: generate a file and its hash.

Start with the first block.

mkdir bc
cd bc

cat << 'x----x' > block1
This is our ICO. We're excited!
We are gonna make $$$$$
x----x

echo "Print block1 and its SHA-1 hash:"
cat block1
shasum block1

echo "Now add a second file:"
sed 's/\$//' < block1 > block2

cat block2
shasum block*

Output:

Print block1 and its SHA-1 hash:
This is our ICO. We're excited!
We are gonna make $$$$$

ebd82b8c386fabeaaadc0556f088a15d424bca66 block1

Now add a second file:
This is our ICO. We're excited!
We are gonna make $$$$

ebd82b8c386fabeaaadc0556f088a15d424bca66 block1
bc415b9add511c162ab680cbbe6b8c27e7ab7d55 block2

The script writes two lines of text to the file block1 (via a “here document”). Then, we use the shasum program, which computes a SHA-1 hash unless told otherwise, to reduce that file to a single unreasonably long hexadecimal number: a hash.

The SHA algorithm itself is far beyond the scope of this post, but here’s what you need to know:

  • It’s replicable, deterministic, nonrandom. Anywhere in the world, the shasum will give you the same value given the same inputs.
  • I can check for tampering. Knowing that block1 has a hash of ebd8, if you give me some text and tell me it’s block1, I can definitively verify your claim. If you were a true jerk, you’d tell me my hash is wrong and the hash of your version of block1 is the true, correct hash; we’ll address that shortly.
  • A tiny change to the input changes the whole hash. If you tried this and have a different output, it means your computer and mine do line endings differently, so in the context here we built entirely different files. block2 differs from block1 by lowering expectations, from making $$$$$ to making $$$$, and you can see that single-character change entirely rewrote the hash.
  • There is no way to back out the original data. Given the hash for block1, I can’t extract much information at all about block1 itself. That’s where the “secure” in Secure Hash Algorithm came from. After all, your input could be arbitrarily long, and you’d still get a one-line string of digits back.

Step 2: The block trick.

In which we write the hash of block1 into the text of block2.

getsum() {
shasum $1 | sed 's/\(.......\).*/\1/'
}

echo "The previous block was: $(getsum block1)" >> block2

cat << x----x > block3
Bad news: the bottom fell out and our ICO is now worthless.
Have a nice day.
The previous block was: $(getsum block2)
x----x

echo "Here is the new block2:"
cat block2

echo "
And here is block3:"
cat block3

Output:

Here is the new block2:
This is our ICO. We're excited!
We are gonna make $$$$
The previous block was: ebd82b8

And here is block3:
Bad news: the bottom fell out and our ICO is now worthless.
Have a nice day.
The previous block was: 282540d

The whole trick, the magic of it all, was to add to each file a note giving the hash of the previous file. This procedure is sometimes called a “Merkle tree”, after the person who patented a comparable concept for binary trees in 1979; a single line of blocks like ours is, more precisely, a hash chain. [Merkle got lucky; the USPTO, especially at the time, frowned heavily on patents on methods of arranging information. Look up the printed matter doctrine.]

What does this trick do for us?

  • If I have block3, I know the hash for block2, so I can verify whether it’s been modified. Once I’m satisfied with block2, I can pull the hash for block1 and do the same with it.
  • In Step 1, the truly devious changed block1 and also changed the report of its hash, so their fake block would pass the faked hash check. Now they can’t do that because we wrote the hash for block1 in the text of block2. If they change the hash printed in block2, that dominos to change the hash for block2, which got printed in block3. To tell one lie, you now have to tell a whole chain of them.
  • The records themselves now record the sequence. It’s not just naming them block 1–2–3 that links the blocks now.

Step 3: a database of blocks

Here, we go all-in on the fact that the chain of records is now encoded in the blocks themselves. We’re going to rename the blocks with their hashes, and leave ourselves a note about which block to open first (the most recent one):

main=$(getsum block3)
block2=$(getsum block2)

echo "main=$(getsum block3)" > Readme

rename_blocks () {
shasum block* | sed 's/\(.......\)[^ ]*  *\(.*\)/mv \2 \1/' > go
source ./go
rm go
}

rename_blocks
echo "Directory listing:"
ls
echo "Readme:"
cat Readme

Output:

Directory listing:
282540d d1aacf7 ebd82b8 Readme

Readme:
main=d1aacf7

The directory listing now looks more like a database: a list of blobs in no particular order.

In the Readme, we have a note that what I’m now calling the main thread starts at block d1aacf7. Did you ever do those scavenger hunts where you get a clue, which sends you to another spot with another clue pointing you to the next spot down? I hope you enjoyed those, because that’s what’s become of our data. I can open the block listed as the head of the main thread, enjoy its contents, and check its note for the previous block in the chain, and repeat back to the beginning of the chain.

Please note a little sleight-of-terminology: I am effectively equating the main thread with the last block on its thread. That is, the thread as a sequence of three files has no associated hash, but we gave it a hash-by-proxy by declaring main to equal the last block, d1aacf7.

Step 4: an alt history

Let’s write a little shell function to trace a chain. Don’t worry if it’s illegible; it just reads the last line of the next file and gets the file name at the end of that line while it makes sense to do so.

Also, sorry I was a downer about the ICO failing. Let’s write an alt history where it doesn’t, then look at all the chains:

trace_chain () {
thisblock=$1
while [ "$thisblock" ]; do
echo "$thisblock"
thisblock=$(sed -n '$p' $thisblock | grep -o '[0-9a-f]*$')
done
echo '<EOB>'
}

cat << x----x > block4
We won! We're all millionaires!
We don't know where the money came from, but we have it!
The previous block was $block2
x----x

echo "alt=$(getsum block4)" >> Readme
rename_blocks
source Readme

echo "Alt block:"
cat $alt
echo "
Directory listing:"
ls
echo "
Readme:"
cat Readme
echo "
Main chain:"
trace_chain $main
echo "
Alt chain:"
trace_chain $alt

Output:


Alt block:
We won! We're all millionaires!
We don't know where the money came from, but we have it!
The previous block was 282540d

Directory listing:
282540d d1aacf7 da6eb63 ebd82b8 Readme

Readme:
main=d1aacf7
alt=da6eb63

Main chain:
d1aacf7
282540d
ebd82b8
<EOB>

Alt chain:
da6eb63
282540d
ebd82b8
<EOB>

We now have two threads. The main is unaffected by this new step, and the alt behaves just like main, in that we can trace it back to the original post. We know the threads branched apart at the second block, 282540d; they have that and all prior blocks in common.

Step 5: clone

Let’s make a copy of the database, which means plain old copying of blocks and the Readme to a new directory. I’ll call one space the origin and the other the clone. We’ll continue the story in that new space.

cd ..
cp -r bc bc2
cd bc2

cat << x----x > block400
I'm cashing out.
This block records my selling all my coins.
The previous block was $alt
x----x

cat << x----x > Readme
main=$main
alt=$(getsum block400)
x----x

rename_blocks
source Readme

echo "New block:"
cat $alt
echo "This index:"
cat Readme
echo "The origin's index:"
cat ../bc/Readme

Output:

New block:
I'm cashing out.
This block records my selling all my coins.
The previous block was da6eb63

This index:
main=d1aacf7
alt=773a512

The origin's index:
main=d1aacf7
alt=da6eb63

Thanks to the database arrangement, making a clone of the original project just means copying over all the blocks, and the metadata in the Readme. If we wanted to be clever and only copy one thread, we could do that by tracing along the thread and copying only the relevant blocks. If we’re concerned that there were errors in copying, we of course have the hashes and can verify everything. Git allows “shallow clones” wherein you pull only those objects needed for the last handful of steps in the chain.

The lead developer of Git was the same guy who led development of the Linux operating system, and both make political statements about what it means to have ownership of software. At this point, we’ve stripped away the components other revision control systems had, like a server/client model or system to lock files or nontrivial metadata store, that gave one central maintainer more control than anybody else. If you have the blocks and know which commit is main, then your clone and the origin are indistinguishable, and your contributions have symmetric treatment on the blockchain to everybody else’s.

We now have several threads running: the new alt thread, the original alt thread, the new main thread, and the original main thread, where right now the two main threads coincide, but who knows what the future will bring.

Step 6: push back

I’m done working in the copy, and I want to ship my changes to the origin.

cp $alt ../bc
cat << x----x > ../bc/Readme
main=$main
alt=$alt
x----x

cd ../bc
echo "The database at the origin:"
ls
echo "The alt chain at the origin:"
trace_chain $alt

Output:

The database at the origin:
282540d 773a512 d1aacf7 da6eb63 ebd82b8 Readme

The alt chain at the origin:
773a512
da6eb63
282540d
ebd82b8
<EOB>

How easy is that? All we had to do was copy the new block, and rewrite the index to point to the new head of the alt branch. We’ll call this a fast-forward, because we just replayed the new blocks onto the original alt branch, and git was designed in the mid ’00s when people still had cassette tape players.

Note that there was no “intelligence” required at the origin. All it had to do was add blocks to its database and note where the alt branch wound up. The trace_chain function worked without issue, because when it looked for the parent of the just-added block, that parent was already present in the database.

Step 7: muck it all up

In this step, we’re going to add blocks to the main thread in both the origin copy and the clone.

cat << x----x > block300
Let's try again!
We're going to make it all back!
The previous block was $main
x----x

main_origin=$(getsum block300)
rename_blocks

cat << x----x > Readme
main=$main_origin
alt=$alt
x----x

cd ../bc2

cat << x----x > block300
We've all become librarians.
A library is a growing organism.
The previous block was $main
x----x

main_bc2=$(getsum block300)
rename_blocks

cat << x----x > Readme
main=$main_bc2
alt=$alt
x----x

echo "In repository2, the new block:"
cat $main_bc2
echo "
In repository2, the main thread:"
trace_chain $main_bc2

cd ../bc

echo "In the origin repository, the new block:"
cat $main_origin
echo "
In origin, the main thread:"
trace_chain $main_origin

Output:

In repository2, the new block:
We've all become librarians.
A library is a growing organism.
The previous block was d1aacf7

In repository2, the main thread:
40b3e94
d1aacf7
282540d
ebd82b8
<EOB>

In the origin repository, the new block:
Let's try again!
We're going to make it all back!
The previous block was d1aacf7

In origin, the main thread:
da80065
d1aacf7
282540d
ebd82b8
<EOB>

Can we apply a fast-forward? In the clone, main is now 40b3e94 (librarianship), and I can of course copy block 40b3e94 to the origin database. And I could change the annotation in the origin to say that main=40b3e94. So, what happened to block da80065 (try again)? It had been main, but now it’s nameless and lost. Symmetrically, I could fast-forward the clone to look like the origin, and trample the librarians in the process.

Content-wise, of course, entirely different things happened: in one, our narrator became a librarian, and in the other our narrator ran a new ICO, and it would take human intelligence to decide which of these alternatives actually happened. To make pushing from the second repository to the origin work:

  • Copy the blocks from the clone to the origin.
  • An intelligent designer at the origin looks at both blocks and makes a decision, keeping one, the other, or some sort of melange between them.
  • They write the result to a new block, which has two parents.

Everything above applies directly to git.

OK, we’ve built a chain of little three-line text files, and it’s immutable, and it’s distributed. Also, it’s useful: this is how projects are managed in the present day.

  • The biggest jump to Git is that, in our toy database, a block was a single file; in git, a block (herein a commit or snapshot, what gets written when you type git commit) holds everything you are tracking in a working directory, bundled into one blob. (In the object store of the .git directory, it could be multiple files, an implementation detail we won’t think about here.) We had an easy time looking at flat files using plain old cat, but checking out a git snapshot is more like unfurling a tent, unpacking all your equipment into it, and zipping yourself in. The commit you currently have checked out is given the name HEAD. The elements of the commit you are about to check in next are called “the index,” and now you need index management commands like git add, git rm, git mv. In an earlier version of this post, I’d had a step zipping a directory into a zip or tar file (stuffing the tent and everything in it into your backpack), then putting that entire blob on the chain.
  • Of course, a hash (SHA-1 by default) is used as the name of the commit, just as we did here. The parent(s) is recorded in the metadata, and if so desired you can get it via, e.g. git log --pretty=%P -1 HEAD. If you remove the -1 you’ll get the full list of all the parents, tracing back the chain. The system is doing what the trace_chain function did above, stepping from commit to commit.
  • Branches don’t have SHA names. They’re defined entirely by the head of the branch, just as we did in the Readme above. Note that “head” here (last in a branch) is an entirely different use from “HEAD” above (checked out into working directory), though on most days they coincide.
  • Cloning a repository is cheap, mostly just a question of copying blocks around. Want to delve into history or try flamboyant branching tricks? Just git clone your_repository a_clone and experiment in the copy all you want (or try the worktree command, which does effectively the same thing).
  • What comes after a block in a chain? It’s hard to tell. Our system is great for walking backward, but we can’t write down a block’s future successors into that block — that would change the block. And as you can see, we could have as many successors as we wanted. When I started using git, I’d expected the history you get from the log to be like a subway line, where if you’re at a commit somewhere in the middle you could look forward or backward. It bothered me that if you’re in one of those middle commits you could only see the history before that point. But the backward-looking is baked into the block chain, and the only mechanism to find what comes next would be to check every snapshot and see if it has the right parent.
  • You have three types of merge:
    1. Fast-forwards are easy, and just require copying blocks and fixing metadata.
    2. Because a commit is an entire working directory, you sometimes have changes in one file on one branch, and changes in another file in another branch. Git can apply these entirely non-conflicting changes on your behalf, even though this is more than just shunting blocks.
    3. Any content changes more complex than that, and git calls you, a human, to decide what to do.
  • Remote repositories are usually “bare”, meaning they have no working directory checked out, and there are no humans where they are. With no humans present to resolve conflicts, a push is accepted by default only when it is a fast-forward.
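To see the parent note inside a real commit, here is a small sketch run in a throwaway repository. The wrapper g, the file name plan, and the commit messages are made up for illustration; it assumes git is installed.

```shell
repo=$(mktemp -d)
git init -q "$repo"

# g: run git in the scratch repo with a throwaway identity, so the sketch
# does not depend on (or alter) your own git configuration.
g() {
    git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
        -c commit.gpgsign=false "$@"
}

echo 'We are gonna make $$$$$' > "$repo/plan"
g add plan
g commit -q -m 'ICO'
first=$(g rev-parse HEAD)        # hash of the first commit

echo 'We are gonna make $$$$' > "$repo/plan"
g commit -q -a -m 'Lower expectations'

g cat-file -p HEAD           # the raw commit: tree, parent, author, committer, message
g log --pretty=%P -1 HEAD    # just the parent hash, which equals $first
g rev-parse "$(g symbolic-ref --short HEAD)"   # the branch name resolves to the head commit
```

The parent line printed by cat-file plays the role of the “The previous block was” note, and the branch name plays the role of the Readme entry.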

Knowing that git is building itself a backward-looking sequence of commits pointing to other commits explains much (but nowhere near all) of its oddity. It’s a simple idea and clearly got the system far.

And yes, your code is “on the blockchain”. And if you’re looking for an immutable repository where each entry is linked to the previous one, to record your land deeds or what-have-you, maybe you’re looking for a git repository.

Ben Klemens

BK served as director of the FSF’s End Software Patents campaign, and is the lead author of Apophenia (http://apophenia.info), a statistics library.