Spotlight: Git objects

Wednesday, 17 February 2016

Have you ever wondered how Git handles files and folders in the magical .git folder? Well, I did, and so I started to investigate. In this article I will present the results and explain how Git objects are interconnected.

Git¹ is a source code management (SCM) system created in 2005 by Linus Torvalds, the creator of the Linux kernel². His intention was to implement a distributed SCM system that supports common development workflows more easily than other SCMs such as SVN. Among other things, commits can be made locally without the need for a remote repository or server. SVN identifies revisions by incrementing numbers; Git cannot do that, because independently numbered local and remote histories would collide when they are merged. Hence Git identifies each and every tracked resource by a SHA1³ hash. The reason for using SHA1 is simple: it is a fast and reliable algorithm for calculating hashes of a given input. Although it is not very secure (it should no longer be used for hashing passwords, for instance, and is considered vulnerable to collision attacks), it is sufficient for the purpose of tracking file contents as Git does.

How does Git handle contents?

Git relies on the filesystem of the underlying operating system. A filesystem basically has two concepts: files and folders. Although Git tracks files rather than folders, there are two corresponding abstractions in Git’s source code: Tree and Blob. A Blob contains the content of a file, whereas a Tree is the counterpart of a folder and lists the Blobs it contains.

Blob object

Let’s assume that we have a file called test.txt which contains a simple but well-known string: Hello, World!. As mentioned above, Git tracks files by their content using a SHA1 hash. Hashing the string above yields 60fde9c2310b0d4cad4dab8d126b04387efba289.

Let’s verify whether Git calculates the same hash value by creating a new Git repository, adding the file and inspecting the .git/objects/ folder:

$ git init .
$ echo -n Hello, World! > test.txt
$ git add .

The .git/objects/ folder now contains one new folder with one file:

.git
 `-- objects
     +-- b4
     |   `-- 5ef6fec89518d314f546fd6c3025367b721684
     +-- info
     `-- pack

The folder and file names are obviously hash values of something: the folder name holds the first two characters of a hash and the file name the remaining 38. To inspect the content of such a file, one can use the command git cat-file -p b45e, which prints:

$ git cat-file -p b45e
Hello, World!
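The git cat-file command is needed here because the object file on disk is zlib-compressed and cannot simply be opened in a text editor. The following Python snippet is a minimal sketch of how such a file could be located and decompressed by hand. It assumes a loose (not packed) object, and the helper names loose_object_path and read_loose_object are made up for this illustration.

```python
import zlib
from pathlib import Path


def loose_object_path(git_dir: str, sha: str) -> Path:
    """Map a 40-character hash to its file: two characters of folder, 38 of file name."""
    return Path(git_dir) / "objects" / sha[:2] / sha[2:]


def read_loose_object(git_dir: str, sha: str) -> bytes:
    """Return the decompressed raw bytes of a loose object (assumes it is not packed)."""
    compressed = loose_object_path(git_dir, sha).read_bytes()
    return zlib.decompress(compressed)


# Possible usage inside the repository created above:
# raw = read_loose_object(".git", "b45ef6fec89518d314f546fd6c3025367b721684")
# print(raw)
```

Note that the raw bytes are not identical to what git cat-file -p prints: Git stores a little more than the plain file content, as the next paragraphs show.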

Wait, what?! Why does the file have a hash value of b4/5ef6fec89518d314f546fd6c3025367b721684? We calculated a different hash value, 60fde9c2310b0d4cad4dab8d126b04387efba289, didn’t we? Well, yes, we did, but Git tracks a little more information than just the plain content of a file: it also takes into account the file size as well as the type of Git object being tracked. In this case the file has a size of 13 bytes and is of type Blob. This information and the content of the file are combined into a new object, which is then hashed:

blob 13\0Hello, World!

The format for Blobs is specified as follows: blob<blank><filesize in bytes><null byte><file content>. The null byte separates the content of the file from the Git-specific header, which contains the metadata. If we hash the resulting string above, SHA1 returns b45ef6fec89518d314f546fd6c3025367b721684. Looks familiar, right? Yes, it is the very same hash as generated by Git!
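The calculation can be sketched in a few lines of Python. This is only an illustration of the format described above, not Git’s actual implementation; the function name git_blob_hash is made up for this example.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Hash file content the way the blob format above describes it."""
    # Header: the word "blob", a blank, the size in bytes, then a null byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


content = b"Hello, World!"  # 13 bytes, no trailing newline (echo -n)
print(git_blob_hash(content))              # hash of header + content
print(hashlib.sha1(content).hexdigest())   # hash of the plain content only
```

The first printed value can be compared with the file name Git created under .git/objects/; the second one differs because the header is missing.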

We are done, aren’t we? But wait… What about folders? Until now, Git has tracked the file and its content, so we will have a look at folders now.

Tree objects

Since we are very happy to have a mature test.txt file now, let’s first commit it and then investigate further.

$ git commit -m "Initial commit."

Let’s check the .git/objects/ folder once more. Git has created new files, one of which has the hash value 34/1cf04522a24fcf326c5e46ff7ce4f66ff310dd. To look inside that file, we can again use the git cat-file command as shown below.

.git
 `-- objects
     +-- b4
     |   |-- 5ef6fec89518d314f546fd6c3025367b721684
     |   `-- 587014507e76d7dcf5b5299949fae0b12b06ab
     +-- 34
     |   `-- 1cf04522a24fcf326c5e46ff7ce4f66ff310dd
     +-- info
     `-- pack

$ git cat-file -p 341c
100644 blob b45ef6fec89518d314f546fd6c3025367b721684    test.txt

The content is part of a Git tree object, which is the way Git tracks folders and their contents. Some of you might know that Git won’t track empty folders. This is because Git tracks files, not folders: a tree object only exists to list tracked files (and subfolders containing them), so a folder without any files never shows up in a tree.
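The output of git cat-file -p is a pretty-printed view. In the stored tree object, each entry consists of the mode, a blank, the file name, a null byte and the 20 raw bytes of the hash, without the word blob. The following Python sketch hashes such a tree for a flat folder; the names tree_entry and git_tree_hash are hypothetical, and the sorting shown is only meant for folders that contain files and no subfolders.

```python
import hashlib


def tree_entry(mode: str, name: str, sha_hex: str) -> bytes:
    """Serialize one entry: "<mode> <name>", a null byte, then the raw 20-byte hash."""
    return (mode.encode("ascii") + b" " + name.encode("utf-8")
            + b"\x00" + bytes.fromhex(sha_hex))


def git_tree_hash(entries: list[tuple[str, str, str]]) -> str:
    """Hash a flat tree given (mode, name, blob hash) tuples."""
    # Entries are stored sorted by name (bytewise); subfolders are not handled here.
    ordered = sorted(entries, key=lambda e: e[1].encode("utf-8"))
    body = b"".join(tree_entry(mode, name, sha) for mode, name, sha in ordered)
    # Same header scheme as for blobs, only with the type "tree".
    header = b"tree " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


print(git_tree_hash([
    ("100644", "test.txt", "b45ef6fec89518d314f546fd6c3025367b721684"),
]))
```

The printed value can be compared with the tree hash Git reported for the commit above.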

Commit object

There is another new file that we have not investigated yet; it has the hash value b4/587014507e76d7dcf5b5299949fae0b12b06ab. Once again we can use the git cat-file command to look inside the file.

$ git cat-file -p b458
tree 341cf04522a24fcf326c5e46ff7ce4f66ff310dd
author Stephan Köninger <github@stekoe.de> 1455660483 +0100
committer Stephan Köninger <github@stekoe.de> 1455660483 +0100

Initial commit.

This object obviously contains the message we entered when committing the test.txt file. Hence we can assume that this is the object which contains all information necessary for a commit. The first line of the commit object points to the tree object we have seen in the previous section. If one draws the different objects as a graph, the result looks as follows.

{% include image.html file="git-object-tree.svg" caption="Fig.1: Simple Git object tree" %}
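A commit object is hashed with the same header scheme as a blob, only with the type commit. The following Python sketch rebuilds the text shown by git cat-file -p and hashes it. The function names are made up for this example, and the sketch assumes that the message ends with a single newline and that parent commits, if any, are listed as parent lines after the tree line.

```python
import hashlib


def commit_body(tree, author, committer, message, parents=()):
    """Serialize the commit text: tree, optional parents, author, committer, message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return ("\n".join(lines) + "\n").encode("utf-8")


def git_commit_hash(tree, author, committer, message, parents=()):
    """Hash the commit text with a "commit <size>" header and a null byte."""
    body = commit_body(tree, author, committer, message, parents)
    header = b"commit " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


who = "Stephan Köninger <github@stekoe.de> 1455660483 +0100"
print(git_commit_hash("341cf04522a24fcf326c5e46ff7ce4f66ff310dd",
                      who, who, "Initial commit."))
```

Because author, committer and timestamps are part of the hashed text, the same change committed by another person or at another time gets a different commit hash, while the tree hash stays the same.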

A more complex graph is shown in the next figure, which depicts two commits. The second commit points back to the previous commit through a parent-child association. As the picture also shows, the second commit introduces a new subfolder containing one file. This implies that a tree object can contain not only blobs but other tree objects as well.

{% include image.html file="more-complex-git-object-graph.svg" caption="Fig.2: Extended Git object tree" %}
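The nesting of trees can be sketched recursively: a folder is hashed by first hashing everything it contains. The following Python example models a folder as a dictionary that maps names to file content (bytes) or to another dictionary (a subfolder). The function names and the folder and file names in the usage part are hypothetical, and the sketch assumes regular, non-executable files (mode 100644) and mode 40000 for subfolders.

```python
import hashlib


def hash_object(obj_type: str, body: bytes) -> bytes:
    """Return the raw 20-byte SHA1 of "<type> <size>", a null byte and the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).digest()


def hash_tree(folder: dict) -> bytes:
    """Hash a folder given as {name: bytes (file) or dict (subfolder)}."""
    entries = []
    for name, value in folder.items():
        raw_name = name.encode("utf-8")
        if isinstance(value, dict):
            # Subfolder: its entry points to another tree object.
            # For sorting, Git compares folder names as if they ended with "/".
            entries.append((raw_name + b"/", b"40000", raw_name, hash_tree(value)))
        else:
            # File: its entry points to a blob object.
            entries.append((raw_name, b"100644", raw_name, hash_object("blob", value)))
    entries.sort(key=lambda e: e[0])
    body = b"".join(mode + b" " + raw_name + b"\x00" + sha
                    for _, mode, raw_name, sha in entries)
    return hash_object("tree", body)


# Hypothetical second commit: test.txt plus a subfolder with one file.
root = {"test.txt": b"Hello, World!", "docs": {"readme.txt": b"Some text"}}
print(hash_tree(root).hex())
```

Since the hash of the subfolder’s tree is embedded in its parent tree, a change to any file changes the hashes of all trees above it, up to the root tree referenced by the commit.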

Conclusion

In this blog post I have shown the main objects Git uses to track changes, files and their folders. We have also learned that Git uses hashes both as revision identifiers and to track an object’s content. To illustrate how Git calculates hashes, I have implemented the algorithm in JavaScript. The resulting code can be found in the appendix of this blog post.

Appendix

{% gist SteKoe/20462841fa2b47ef6bcc %}

*[Blob]: Binary Large Object
*[SCM]: Source Code Management
*[SHA1]: Secure Hash Algorithm
*[SVN]: Apache Subversion