Git: it’s a blockchain!
This is a git tutorial, in the form of an introduction to block chains/This is a learning-by-doing exercise in building a block chain, with application to git.
Block chain-wise, this is an extremely basic demo in which we build an extremely basic chain, which will at least give you a more concrete idea of what’s going on — and how simple it is. Your hands will wave that much less next time blockchains come up. What makes it immutable and distributable? Why do I not need to pay somebody a million dollars to implement one? If you know what a Merkle tree is, you’re already ahead of this post, but out of the millions of people who talk about block chains, a very great many don’t.
Git-wise, take this as an intermediate-level tutorial, in the sense that you may have already done one of those “type this, now type this” tutorials that give you the bare minimum you need to get by, but you want a better idea of what’s actually going on. When I google “git tutorial”, the search engine tells me it found “About 565,000 results”. That means I can write a why article, and leave the how to the others.
OK, let’s build some block chains! I won’t mention git again for a long while. If you want to spoil it and jump to that part, <ctrl>-F on this page for “Everything above applies directly to git.”
I know it makes the post look needlessly intimidating, but in the spirit of radical replicability, I’m going to give you the commands to replicate the entire process on your own personal computer. If you cut/paste the blocks of code to your POSIX command line (“the bash prompt”), you’ll get the results shown and build the same databases I’m building. If you’re not in the mood to read other people’s code, feel free to skip past each replication script to the output showing what’s new at that step. You’re not being tested on this.
Step 1: generate a file and its hash.
Start with the first block.
mkdir bc
cd bc
cat << 'x----x' > block1
This is our ICO. We're excited!
We are gonna make $$$$$
x----x
echo "Print block1 and its SHA-1 hash:"
cat block1
shasum block1
echo "Now add a second file:"
sed 's/\$//' < block1 > block2
cat block2
shasum block*
Output:
Print block1 and its SHA-1 hash:
This is our ICO. We're excited!
We are gonna make $$$$$
ebd82b8c386fabeaaadc0556f088a15d424bca66 block1
Now add a second file:
This is our ICO. We're excited!
We are gonna make $$$$
ebd82b8c386fabeaaadc0556f088a15d424bca66 block1
bc415b9add511c162ab680cbbe6b8c27e7ab7d55 block2
The script writes two lines of text to the file block1 (via a “here document”). Then we use the shasum program to reduce that file to a single unreasonably long hexadecimal number, a hash.
The SHA algorithm itself is far beyond the scope of this post, but here’s what you need to know:
- It’s replicable, deterministic, nonrandom. Anywhere in the world, shasum will give you the same value given the same inputs.
- I can check for tampering. Knowing that block1 has a hash of ebd8, if you give me some text and tell me it’s block1, I can definitively verify your claim. If you were a true jerk, you’d tell me my hash is wrong and the hash of your version of block1 is the true, correct hash; we’ll address that shortly.
- If you tried this and have a different output, it means your computer and mine do line endings differently, so in the context here we built entirely different files. block2 differs from block1 by lowering expectations, from making $$$$$ to making $$$$, and you can see that single-character change entirely rewrote the hash.
- There is no way to back out the original data. Given the hash for block1, I can’t extract much information at all about block1 itself. That’s where the “secure” in Secure Hash Algorithm came from. After all, your input could be famously long, and you’d still get a one-line string of digits back.
Step 2: The block trick.
In which we write the hash of block1 into the text of block2.
getsum() {
shasum $1 | sed 's/\(.......\).*/\1/'
}
echo "The previous block was: $(getsum block1)" >> block2
cat << x----x > block3
Bad news: the bottom fell out and our ICO is now worthless.
Have a nice day.
The previous block was: $(getsum block2)
x----x
echo "Here is the new block2:"
cat block2
echo "
And here is block3:"
cat block3
Output:
Here is the new block2:
This is our ICO. We're excited!
We are gonna make $$$$
The previous block was: ebd82b8
And here is block3:
Bad news: the bottom fell out and our ICO is now worthless.
Have a nice day.
The previous block was: 282540d
The whole trick, the magic of it all, was to add to each file a note giving the hash of the previous file. This procedure is sometimes called a “Merkle tree”, after the person who patented a comparable concept for binary trees in 1979. [Merkle got lucky; the USPTO frowned heavily on patents on methods of arranging information. Look up the printed matter doctrine, which the USPTO still respected at that time.]
What does this trick do for us?
- If I have block3, I know the hash for block2, so I can verify whether it’s been modified. Once I’m satisfied with block2, I can pull the hash for block1 and do the same with it.
- In Step 1, the truly devious changed block1 and also changed the report of its hash, so their fake block would pass the faked hash check. Now they can’t do that because we wrote the hash for block1 in the text of block2. If they change the hash printed in block2, that dominos to change the hash for block2, which got printed in block3. To tell one lie, you now have to tell a whole chain of them.
- The records themselves now record the sequence. It’s not just naming them block 1–2–3 that links the blocks now.
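Here is a small sketch of the tamper check, run on throwaway files rather than the blocks above. The names short_sha and check_link are hypothetical helpers of mine; short_sha stands in for getsum and falls back to sha1sum where shasum isn't installed.

```shell
# Hypothetical helper: first 7 hex digits of a file's SHA-1.
short_sha() {
    if command -v sha1sum >/dev/null 2>&1; then sha1sum "$1"; else shasum "$1"; fi | cut -c1-7
}

# check_link PREV CUR: succeed only if the hash written on CUR's last line
# matches what PREV hashes to right now.
check_link() {
    recorded=$(sed -n '$p' "$2" | grep -o '[0-9a-f]*$')
    [ "$recorded" = "$(short_sha "$1")" ]
}

demo=$(mktemp -d)
printf 'first.\n' > "$demo/a"
printf 'second.\nThe previous block was: %s\n' "$(short_sha "$demo/a")" > "$demo/b"

if check_link "$demo/a" "$demo/b"; then before=intact; else before=tampered; fi

printf 'first, quietly edited.\n' > "$demo/a"   # rewrite history
if check_link "$demo/a" "$demo/b"; then after=intact; else after=tampered; fi

echo "before the edit: $before"
echo "after the edit:  $after"
rm -rf "$demo"
```

To get the edited file past this check, the forger would have to rewrite the note in b as well, which changes b's hash, and so on up the chain.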
Step 3: a database of blocks
Here, we go all-in on the fact that the chain of records is now encoded in the blocks themselves. We’re going to rename the blocks with their hashes, and leave ourselves a note about which block is the head of the chain, the one to start reading from:
main=$(getsum block3)
block2=$(getsum block2)
echo "main=$(getsum block3)" > Readme
rename_blocks () {
# Turn each "hash  filename" line from shasum into "mv filename hash".
shasum block* | sed 's/\(.......\)[^ ]*  *\(.*\)/mv \2 \1/' > go
source ./go
rm go
}
rename_blocks
echo "Directory listing:"
ls
echo "Readme:"
cat Readme
Output:
Directory listing:
282540d d1aacf7 ebd82b8 Readme
Readme:
main=d1aacf7
The directory listing now looks more like a database: a list of blobs in no particular order.
In the Readme, we have a note that what I’m now calling the main thread starts at block d1aacf7. Did you ever do those scavenger hunts where you get a clue, which sends you to another spot with another clue pointing you to the next spot down? I hope you enjoyed those, because that’s what’s become of our data. I can open the block listed as the head of the main thread, enjoy its contents, and check its note for the previous block in the chain, and repeat back to the beginning of the chain.
Please note a little sleight-of-terminology: I am effectively equating the main thread with the last block on its thread. That is, the thread as a sequence of three files has no associated hash, but we gave it a hash-by-proxy by declaring main to equal the last block, d1aacf7.
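One side effect of naming each block after its hash: checking the whole database for corruption becomes a single loop. A minimal sketch, on a throwaway directory; short_sha and verify_db are hypothetical names of mine, and I'm assuming every block name is exactly seven hex digits, as in the listing above.

```shell
# Hypothetical helper: first 7 hex digits of a file's SHA-1.
short_sha() {
    if command -v sha1sum >/dev/null 2>&1; then sha1sum "$1"; else shasum "$1"; fi | cut -c1-7
}

# verify_db DIR: print the name of every block whose contents no longer hash
# to its name. Files not named like a 7-digit hash (the Readme) are skipped.
verify_db() {
    for f in "$1"/*; do
        name=$(basename "$f")
        case $name in
            [0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]) ;;
            *) continue ;;
        esac
        [ "$name" = "$(short_sha "$f")" ] || echo "$name"
    done
}

db=$(mktemp -d)
printf 'hello.\n' > "$db/pending"
good=$(short_sha "$db/pending"); mv "$db/pending" "$db/$good"
printf 'world.\n' > "$db/pending"
bad=$(short_sha "$db/pending"); mv "$db/pending" "$db/$bad"
echo "main=$bad" > "$db/Readme"

clean_report=$(verify_db "$db")     # empty: every name matches its contents
echo 'sneaky edit' >> "$db/$bad"    # modify a block behind the database's back
dirty_report=$(verify_db "$db")     # now names the edited block

echo "before: '$clean_report'"
echo "after:  '$dirty_report'"
rm -rf "$db"
```

Git ships a much more thorough check of its own object database as git fsck.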
Step 4: an alt history
Let’s write a little shell function to trace a chain. Don’t worry if it’s illegible; it just reads the last line of the next file and gets the file name at the end of that line while it makes sense to do so.
Also, sorry I was a downer about the ICO failing. Let’s write an alt history where it doesn’t, then look at all the chains:
trace_chain () {
thisblock=$1
while [ "$thisblock" ]; do
echo "$thisblock"
thisblock=$(sed -n '$p' $thisblock | grep -o '[0-9a-f]*$')
done
echo '<EOB>'
}
cat << x----x > block4
We won! We're all millionaires!
We don't know where the money came from, but we have it!
The previous block was $block2
x----x
echo "alt=$(getsum block4)" >> Readme
rename_blocks
source Readme
echo "Alt block:"
cat $alt
echo "
Directory listing:"
ls
echo "
Readme:"
cat Readme
echo "
Main chain:"
trace_chain $main
echo "
Alt chain:"
trace_chain $alt
Output:
Alt block:
We won! We're all millionaires!
We don't know where the money came from, but we have it!
The previous block was 282540d
Directory listing:
282540d d1aacf7 da6eb63 ebd82b8 Readme
Readme:
main=d1aacf7
alt=da6eb63
Main chain:
d1aacf7
282540d
ebd82b8
<EOB>
Alt chain:
da6eb63
282540d
ebd82b8
<EOB>
We now have two threads. The main is unaffected by this new step, and the alt behaves just like main, in that we can trace it back to the original post. We know the threads branched apart at the second block, 282540d; they have that and all prior blocks in common.
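Finding the branch point can be automated: walk one chain from its head and stop at the first block that also sits on the other chain. The sketch below builds its own four-block database in a temporary directory so it can run on its own; short_sha, add_block, and common_ancestor are hypothetical helpers, and this trace_chain is a slightly hardened variant of the one above.

```shell
# Hypothetical helper: first 7 hex digits of a file's SHA-1.
short_sha() {
    if command -v sha1sum >/dev/null 2>&1; then sha1sum "$1"; else shasum "$1"; fi | cut -c1-7
}
# add_block TEXT [PARENT]: write a block in the current directory, rename it
# to its hash, and print that hash.
add_block() {
    printf '%s\n' "$1" > pending
    if [ -n "${2:-}" ]; then
        printf 'The previous block was %s\n' "$2" >> pending
    fi
    name=$(short_sha pending)
    mv pending "$name"
    echo "$name"
}
trace_chain() {
    thisblock=$1
    while [ -n "$thisblock" ] && [ -f "$thisblock" ]; do
        echo "$thisblock"
        # grep finds nothing at the first block, which ends the loop
        thisblock=$(sed -n '$p' "$thisblock" | grep -o '[0-9a-f]*$' || true)
    done
}
# common_ancestor A B: the first block on A's chain that is also on B's.
common_ancestor() {
    for candidate in $(trace_chain "$1"); do
        if trace_chain "$2" | grep -x "$candidate" >/dev/null; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

cd "$(mktemp -d)"
root=$(add_block 'We are excited.')
shared=$(add_block 'Expectations lowered.' "$root")
main=$(add_block 'Bad news.' "$shared")
alt=$(add_block 'We won.' "$shared")

fork=$(common_ancestor "$main" "$alt")
echo "main and alt branched apart at $fork"
```

Git answers the same question with git merge-base.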
Step 5: clone
Let’s make a copy of the database, which means plain old copying of blocks and the Readme to a new directory. I’ll call one space the origin and the other the clone. We’ll continue the story in that new space.
cd ..
cp -r bc bc2
cd bc2
cat << x----x > block400
I'm cashing out.
This block records my selling all my coins.
The previous block was $alt
x----x
cat << x----x > Readme
main=$main
alt=$(getsum block400)
x----x
rename_blocks
source Readme
echo "New block:"
cat $alt
echo "This index:"
cat Readme
echo "The origin's index:"
cat ../bc/Readme
Output:
New block:
I'm cashing out.
This block records my selling all my coins.
The previous block was da6eb63
This index:
main=d1aacf7
alt=773a512
The origin's index:
main=d1aacf7
alt=da6eb63
Thanks to the database arrangement, making a clone of the original project just means copying over all the blocks, and the metadata in the Readme. If we wanted to be clever and only copy one thread, we could do that by tracing along the thread and copying only the relevant blocks. If we’re concerned that there were errors in copying, we of course have the hashes and can verify everything. Git allows “shallow clones” wherein you pull only those objects needed for the last handful of steps in the chain.
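The "only copy one thread" idea is short enough to sketch. As before, this builds a throwaway database so it can run on its own, and short_sha, add_block, and copy_thread are hypothetical helper names, not anything from the script above.

```shell
# Hypothetical helper: first 7 hex digits of a file's SHA-1.
short_sha() {
    if command -v sha1sum >/dev/null 2>&1; then sha1sum "$1"; else shasum "$1"; fi | cut -c1-7
}
# add_block TEXT [PARENT]: write a block, rename it to its hash, print the hash.
add_block() {
    printf '%s\n' "$1" > pending
    if [ -n "${2:-}" ]; then
        printf 'The previous block was %s\n' "$2" >> pending
    fi
    name=$(short_sha pending)
    mv pending "$name"
    echo "$name"
}
trace_chain() {
    thisblock=$1
    while [ -n "$thisblock" ] && [ -f "$thisblock" ]; do
        echo "$thisblock"
        thisblock=$(sed -n '$p' "$thisblock" | grep -o '[0-9a-f]*$' || true)
    done
}
# copy_thread HEAD DEST: copy only the blocks reachable from HEAD into DEST.
copy_thread() {
    mkdir -p "$2"
    for block in $(trace_chain "$1"); do
        cp "$block" "$2/"
    done
}

cd "$(mktemp -d)"
root=$(add_block 'We are excited.')
shared=$(add_block 'Expectations lowered.' "$root")
main=$(add_block 'Bad news.' "$shared")
alt=$(add_block 'We won.' "$shared")

copy_thread "$alt" alt_only
echo "alt=$alt" > alt_only/Readme
echo "Blocks in the alt-only clone:"
ls alt_only
```

The clone ends up with three blocks and no trace of the bad-news block, yet its alt chain still traces all the way back to the start.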
The lead developer of Git was the same guy who led development of the Linux operating system, and both make political statements about what it means to have ownership of software. At this point, we’ve stripped away the components other revision control systems had, like a server/client model or system to lock files or nontrivial metadata store, that gave one central maintainer more control than anybody else. If you have the blocks and know which commit is main, then your clone and the origin are indistinguishable, and your contributions have symmetric treatment on the blockchain to everybody else’s.
We now have several threads running: the new alt thread, the original alt thread, the new main thread, and the original main thread, where right now the two main threads coincide, but who knows what the future will bring.
Step 6: push back
I’m done working in the copy, and I want to ship my changes to the origin.
cp $alt ../bc
cat << x----x > ../bc/Readme
main=$main
alt=$alt
x----x
cd ../bc
echo "The database at the origin:"
ls
echo "The alt chain at the origin:"
trace_chain $alt
Output:
The database at the origin:
282540d 773a512 d1aacf7 da6eb63 ebd82b8 Readme
The alt chain at the origin:
773a512
da6eb63
282540d
ebd82b8
<EOB>
How easy is that? All we had to do was copy the new block, and rewrite the index to point to the new head of the alt branch. We’ll call this a fast-forward, because we just replayed the new blocks onto the original alt branch, and git was designed in the mid ’00s when people still had cassette tape players.
Note that there was no “intelligence” required at the origin. All it had to do was add blocks to its database and note where the alt branch wound up. The trace_chain function worked without issue, because when it looked for the parent of the just-added block, that parent was already present in the database.
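What made this push safe is that the origin's old alt head sits on the chain of the new one, so moving the label forward orphans nothing. That condition is mechanical enough to check with no intelligence at all. A sketch on a throwaway database; short_sha, add_block, and can_fast_forward are hypothetical helpers of mine.

```shell
# Hypothetical helper: first 7 hex digits of a file's SHA-1.
short_sha() {
    if command -v sha1sum >/dev/null 2>&1; then sha1sum "$1"; else shasum "$1"; fi | cut -c1-7
}
# add_block TEXT [PARENT]: write a block, rename it to its hash, print the hash.
add_block() {
    printf '%s\n' "$1" > pending
    if [ -n "${2:-}" ]; then
        printf 'The previous block was %s\n' "$2" >> pending
    fi
    name=$(short_sha pending)
    mv pending "$name"
    echo "$name"
}
trace_chain() {
    thisblock=$1
    while [ -n "$thisblock" ] && [ -f "$thisblock" ]; do
        echo "$thisblock"
        thisblock=$(sed -n '$p' "$thisblock" | grep -o '[0-9a-f]*$' || true)
    done
}
# can_fast_forward OLD NEW: true when OLD lies somewhere on NEW's chain.
can_fast_forward() {
    trace_chain "$2" | grep -x "$1" >/dev/null
}

cd "$(mktemp -d)"
root=$(add_block 'We won.')
old_alt=$(add_block 'Still winning.' "$root")
new_alt=$(add_block 'Cashing out.' "$old_alt")     # builds on old_alt
rival=$(add_block 'Became librarians.' "$root")    # skips old_alt entirely

if can_fast_forward "$old_alt" "$new_alt"; then ff_new=yes; else ff_new=no; fi
if can_fast_forward "$old_alt" "$rival"; then ff_rival=yes; else ff_rival=no; fi
echo "old alt -> new alt: fast-forward? $ff_new"
echo "old alt -> rival:   fast-forward? $ff_rival"
```

Git exposes the same test as git merge-base --is-ancestor.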
Step 7: muck it all up
In this step, we’re going to add blocks to the main thread in both the origin copy and the clone.
cat << x----x > block300
Let's try again!
We're going to make it all back!
The previous block was $main
x----x
main_origin=$(getsum block300)
rename_blocks
cat << x----x > Readme
main=$main_origin
alt=$alt
x----x
cd ../bc2
cat << x----x > block300
We've all become librarians.
A library is a growing organism.
The previous block was $main
x----x
main_bc2=$(getsum block300)
rename_blocks
cat << x----x > Readme
main=$main_bc2
alt=$alt
x----x
echo "In repository2, the new block:"
cat $main_bc2
echo "
In repository2, the main thread:"
trace_chain $main_bc2
cd ../bc
echo "In the origin repository, the new block:"
cat $main_origin
echo "
In origin, the main thread:"
trace_chain $main_origin
Output:
In repository2, the new block:
We've all become librarians.
A library is a growing organism.
The previous block was d1aacf7
In repository2, the main thread:
40b3e94
d1aacf7
282540d
ebd82b8
<EOB>
In the origin repository, the new block:
Let's try again!
We're going to make it all back!
The previous block was d1aacf7
In origin, the main thread:
da80065
d1aacf7
282540d
ebd82b8
<EOB>
Can we apply a fast-forward? In the clone, main is now 40b3e94 (librarianship), and I can of course copy block 40b3e94 to the origin database. And I could change the annotation in the origin to say that main=40b3e94. So, what happened to block da80065 (try again)? It had been main, but now it’s nameless and lost. Symmetrically, I could fast-forward the clone to look like the origin, and trample the librarians in the process.
Content-wise, of course, entirely different things happened: in one, our narrator became a librarian, and in the other our narrator ran a new ICO, and it would take human intelligence to decide which of these alternatives actually happened. To make pushing from the second repository to the origin work:
- Copy the blocks from the clone to the origin.
- An intelligent designer at the origin looks at both blocks and makes a decision, keeping one, the other, or some sort of melange between them.
- They write the result to a new block, which has two parents.
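The toy blocks so far record exactly one parent, so a two-parent block needs a convention this post never defined. As a sketch, assume the last line may list several hashes ("The previous blocks were: A B") and that walking history means following every one of them. The line format and the helpers short_sha, add_block, parents, and ancestors are all my own assumptions.

```shell
# Hypothetical helper: first 7 hex digits of a file's SHA-1.
short_sha() {
    if command -v sha1sum >/dev/null 2>&1; then sha1sum "$1"; else shasum "$1"; fi | cut -c1-7
}
# add_block TEXT [PARENT...]: write a block recording any number of parents,
# rename it to its hash, and print that hash.
add_block() {
    text=$1; shift
    printf '%s\n' "$text" > pending
    if [ "$#" -gt 0 ]; then
        printf 'The previous blocks were: %s\n' "$*" >> pending
    fi
    name=$(short_sha pending)
    mv pending "$name"
    echo "$name"
}
# parents BLOCK: every 7-digit hex word on BLOCK's last line (none at a root).
parents() {
    sed -n '$p' "$1" | grep -ow '[0-9a-f]\{7\}' || true
}
# ancestors BLOCK: follow every parent link; shared history shows up twice,
# so callers pipe the result through sort -u.
ancestors() {
    for p in $(parents "$1"); do
        echo "$p"
        ancestors "$p"
    done
}

cd "$(mktemp -d)"
base=$(add_block 'Bad news.')
origin_head=$(add_block 'Try again.' "$base")
clone_head=$(add_block 'Librarians.' "$base")
merged=$(add_block 'We run an ICO for libraries.' "$origin_head" "$clone_head")

history=$(ancestors "$merged" | sort -u)
echo "Merge block $merged has parents: $(parents "$merged" | tr '\n' ' ')"
echo "Everything reachable from it:"
echo "$history"
```

With the merge block as the new main, neither the try-again block nor the librarian block is nameless and lost: both are reachable from it.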
Everything above applies directly to git.
OK, we’ve built a chain of little three-line text files, and it’s immutable, and it’s distributed. Also, it’s useful: this is how projects are managed in the present day.
- The biggest jump to Git is that, in our toy database, a block was a single file; in git, a block (herein a commit or snapshot, what gets written when you type git commit) holds everything you are tracking in a working directory, bundled into one blob. (In the object store of the .git directory, it could be multiple files, an implementation detail we won’t think about here.) We had an easy time looking at flat files using plain old cat, but checking out a git snapshot is more like unfurling a tent, unpacking all your equipment into it, and zipping yourself in. The commit you currently have checked out is given the name HEAD. The elements of the commit you are about to check in next are called “the index,” and now you need index management commands like git add, git rm, git mv. In an earlier version of this post, I’d had a step zipping a directory into a zip or tar file (stuffing the tent and everything in it into your backpack), then putting that entire blob on the chain.
- Of course, the SHA-1 hash is used as the name of the commit, just as we did here. Each commit’s parent(s) are recorded in the metadata, and if so desired you can get them via, e.g., git log --pretty=%P -1 HEAD. If you remove the -1 you’ll get the full list of all the parents, tracing back the chain. The system is doing what the trace_chain function did above, stepping from commit to commit.
- Branches don’t have SHA names. They’re defined entirely by the head of the branch, just as we did in the Readme above. Note that “head” here (last in a branch) is an entirely different use from “HEAD” above (checked out into working directory), though on most days they coincide.
- Cloning a repository is cheap, mostly just a question of copying blocks around. Want to delve into history or try flamboyant branching tricks? Just git clone your_repository a_clone and experiment in the copy all you want (or try the worktree command, which does effectively the same thing).
- What comes after a block in a chain? It’s hard to tell. Our system is great for walking backward, but we can’t write down a block’s future successors into that block, because that would change the block. And as you can see, we could have as many successors as we wanted. When I started using git, I’d expected the history you get from the log to be like a subway line, where if you’re at a commit somewhere in the middle you could look forward or backward. It bothered me that if you’re in one of those middle commits you could only see the history before that point. But the backward-looking is baked into the block chain, and the only mechanism to find what comes next would be to check every snapshot and see if it has the right parent.
- You have three types of merge:
  1. Fast-forwards are easy, and just require copying blocks and fixing metadata.
  2. Because a commit is an entire working directory, you sometimes have changes in one file on one branch, and changes in another file in another branch. Git can apply these entirely non-conflicting changes on your behalf, even though this is more than just shunting blocks.
  3. Any content changes more complex than that, and git calls you, a human, to decide what to do.
- Remote repositories are usually “bare”, meaning they have no working directory checked out, and there are no humans where they are. With no humans present, only fast-forward merges are allowed.
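To watch the backward-only structure in a real repository, here is a minimal sketch that makes two empty commits in a throwaway repo, reads a parent directly, and then finds a child the only way available: by scanning every commit for one that names the right parent. It assumes git is installed; the commit function and the demo identity are placeholders of mine.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q .

# Placeholder identity passed inline, so the sketch doesn't depend on (or
# touch) your own git configuration.
commit() {
    git -c user.name=demo -c user.email=demo@example.com \
        -c commit.gpgsign=false commit -q --allow-empty -m "$1"
}

commit 'first'
first=$(git rev-parse HEAD)
commit 'second'
second=$(git rev-parse HEAD)

# Walking backward is a direct lookup: the parent is stored in the commit.
parent_of_second=$(git log --pretty=%P -1 HEAD)

# Walking forward means checking every commit's parent list for $first.
# rev-list --parents prints each commit followed by its parents.
children_of_first=$(git rev-list --all --parents |
    awk -v p="$first" '{ for (i = 2; i <= NF; i++) if ($i == p) print $1 }')

echo "parent of second:  $parent_of_second"
echo "children of first: $children_of_first"
```

The second lookup is the scan described in the list above, where every snapshot gets checked for the right parent.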
Knowing that git is building itself a backward-looking sequence of commits pointing to other commits explains much (but nowhere near all) of its oddity. It’s a simple idea and clearly got the system far.
And yes, your code is “on the blockchain”. And if you’re looking for an immutable repository where each entry is linked to the previous one to record your land deeds or what-have-you, maybe you’re looking for a git repository.