Git stores everything. It is a version control tool, after all. Once something is committed, it stays available for anyone to check out. But how does Git store all the versions of your files? Does it keep a full, separate copy of every version? Not quite. Git compresses its objects, and when it packs them it can store similar versions as deltas, so mostly the differences take up space. After all, adding a new function to your class is just inserting some lines into a text file. But this doesn’t always help you save repository space.
If you’ve added a file and then deleted it in the next commit, it doesn’t go away. It has to be stored somewhere so that you can access it in the future if you wish. Commits like these can make your repository grow until it is uncomfortably large to clone. “There must be a way to clean this mess!” I hear you say, and there is! But first, we have to figure out why your Git repository is unnecessarily large.
I want to demonstrate this on an example repository that I will create in my home directory. Then we will collect the data to find out why this repository takes up so much disk space even though it is almost empty.
cagdas@cagdas-mint:~$ mkdir repo
cagdas@cagdas-mint:~$ cd repo/
cagdas@cagdas-mint:~/repo$ git init
Initialized empty Git repository in /home/cagdas/repo/.git/
cagdas@cagdas-mint:~/repo$ echo "test message" > testfile
cagdas@cagdas-mint:~/repo$ git add testfile
cagdas@cagdas-mint:~/repo$ git commit -am "init"
[master (root-commit) 85f3338] init
1 file changed, 1 insertion(+)
create mode 100644 testfile
cagdas@cagdas-mint:~/repo$ echo "second init" >> testfile
cagdas@cagdas-mint:~/repo$ git commit -am "second commit"
[master 968d90d] second commit
1 file changed, 1 insertion(+)
cagdas@cagdas-mint:~/repo$ dd if=/dev/zero of=3gib-file bs=1073741824 count=3
3+0 records in
3+0 records out
3221225472 bytes (3,2 GB, 3,0 GiB) copied, 3,37279 s, 955 MB/s
cagdas@cagdas-mint:~/repo$ git add 3gib-file
cagdas@cagdas-mint:~/repo$ git commit -am "adds 3gb file"
[master 4ac6eda] adds 3gb file
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 3gib-file
cagdas@cagdas-mint:~/repo$ mv ~/video.mp4 .
cagdas@cagdas-mint:~/repo$ git add video.mp4
cagdas@cagdas-mint:~/repo$ git commit -am "adds video"
[master 1fe0acc] adds video
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100755 video.mp4
cagdas@cagdas-mint:~/repo$ rm video.mp4
cagdas@cagdas-mint:~/repo$ git commit -am "removes video"
[master 6779dfc] removes video
1 file changed, 0 insertions(+), 0 deletions(-)
delete mode 100755 video.mp4
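If you want to try this without a multi-gigabyte file, here is a scaled-down sketch of the same idea. It assumes a throwaway directory from mktemp, and the file name big.bin and the demo identity are made up for illustration. It commits a file, deletes it in the next commit, and then checks that the content is still readable from the object database.

```shell
#!/bin/sh
# Sketch: a file deleted in a later commit still lives in the object database.
set -e
demo=$(mktemp -d)                 # throwaway directory (hypothetical demo repo)
cd "$demo"
git init -q .
git config user.name "demo"       # local identity so the commits work anywhere
git config user.email "demo@example.com"

echo "temporary data" > big.bin   # stand-in for the large file
git add big.bin
git commit -q -m "adds big.bin"
blob=$(git rev-parse HEAD:big.bin)   # object ID of the file's content

git rm -q big.bin
git commit -q -m "removes big.bin"

ls                                # the working tree no longer has the file
git cat-file -t "$blob"           # the object still exists; prints its type
git cat-file -p "$blob"           # and its content can still be printed
```

The last two commands succeed even though the file is gone from the working tree, which is exactly why the repository does not shrink after a delete.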
1. Check current directory sizes in your repository
The first thing I’d recommend is to check whether any large and unnecessary file is still in the project. Surprise, surprise! It is not! Remember, I deleted it.
cagdas@cagdas-mint:~/repo$ ls -altr
total 16
-rw-r--r-- 1 cagdas cagdas 25 Sep 1 23:36 testfile
drwxr-xr-x 3 cagdas cagdas 4096 Sep 2 00:02 .
drwxr-xr-x 8 cagdas cagdas 4096 Sep 2 00:02 .git
drwxr-xr-x 46 cagdas cagdas 4096 Sep 2 19:19 ..
cagdas@cagdas-mint:~/repo$ du -h
4,0K ./.git/objects/info
8,0K ./.git/objects/39
8,0K ./.git/objects/47
8,0K ./.git/objects/96
8,0K ./.git/objects/4a
8,0K ./.git/objects/8a
4,0G ./.git/objects/pack
8,0K ./.git/objects/b7
8,0K ./.git/objects/75
8,0K ./.git/objects/1f
8,0K ./.git/objects/da
8,0K ./.git/objects/1b
8,0K ./.git/objects/85
8,0K ./.git/objects/67
4,0G ./.git/objects
8,0K ./.git/info
8,0K ./.git/logs/refs/heads
12K ./.git/logs/refs
20K ./.git/logs
4,0K ./.git/refs/tags
8,0K ./.git/refs/heads
16K ./.git/refs
52K ./.git/hooks
4,0K ./.git/branches
4,0G ./.git
4,0G .
So as you can see, we have a huge .git directory while the only file I have in my project is a small text file called "testfile". Why is that? Because I’ve added, committed, and deleted several files in my repository. I want to find out which commit is making my ".git/objects/pack" directory grow so large. To do that, we first need to find the large pack file in our Git repository.
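For a shorter summary than the full du listing, something like the sketch below may help. The function name repo_size_report is made up for this example, and it assumes a normal (non-bare) repository with a .git directory inside it.

```shell
# repo_size_report [DIR]: compare the whole directory with its .git folder.
# The function name is hypothetical; DIR defaults to the current directory.
repo_size_report() {
    repo=${1:-.}
    echo "whole directory: $(du -sh "$repo" | cut -f1)"
    echo ".git directory:  $(du -sh "$repo/.git" | cut -f1)"
    # count-objects -v lists loose and packed object counts and sizes;
    # -H prints the sizes in human-readable units.
    git -C "$repo" count-objects -vH
}
```

Calling repo_size_report ~/repo would print both sizes followed by Git's own counters. The size-pack line is the one to watch here, since it reports how much space the pack files take.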
2. Find large pack files in git repository
cagdas@cagdas-mint:~/repo$ cd .git/objects/pack/
cagdas@cagdas-mint:~/repo/.git/objects/pack$ ls -altr
total 4170984
-r--r--r-- 1 cagdas cagdas 3130927 Sep 1 23:42 pack-ded0f1878c9d35bce7a53a0083af2bca3975f563.pack
-r--r--r-- 1 cagdas cagdas 1100 Sep 1 23:42 pack-ded0f1878c9d35bce7a53a0083af2bca3975f563.idx
-r--r--r-- 1 cagdas cagdas 4267932401 Sep 2 00:01 pack-bbd317c8ab9b282182a18017e49e5692769557b7.pack
-r--r--r-- 1 cagdas cagdas 1100 Sep 2 00:01 pack-bbd317c8ab9b282182a18017e49e5692769557b7.idx
drwxr-xr-x 2 cagdas cagdas 4096 Sep 2 00:01 .
drwxr-xr-x 16 cagdas cagdas 4096 Sep 2 00:02 ..
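If there are many pack files, sorting them by size saves some squinting. This is a small sketch with a made-up function name (largest_packs). It assumes a non-bare repository that already has at least one pack file and an ls that supports -S and -h.

```shell
# largest_packs [DIR]: list pack files, largest first.
# The function name is hypothetical; DIR defaults to the current directory.
largest_packs() {
    repo=${1:-.}
    # -S sorts by file size (largest first), -h prints human-readable sizes.
    ls -lSh "$repo"/.git/objects/pack/*.pack
}
```

Running largest_packs ~/repo should put the biggest pack on the first line.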
3. View the objects in a pack file
With its large size compared to the other pack file, it seems obvious that our friend pack-bbd317 is hiding something large. To see what a pack file holds, we can use a command called verify-pack with the --verbose flag. So if we call verify-pack on pack-bbd317, we can see which objects are in it.
cagdas@cagdas-mint:~/repo/.git/objects/pack$ git verify-pack -v pack-bbd317c8ab9b282182a18017e49e5692769557b7.pack
7cde50ff361fb1388f9c0e22759965de1a5c96bc blob 4284456062 4267932369 12
non delta: 1 object
pack-bbd317c8ab9b282182a18017e49e5692769557b7.pack: ok
The format of this output, as mentioned in the documentation, is shown below. So we can see each object’s size in the third column.
SHA-1 type size size-in-packfile offset-in-packfile
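The first column is an object ID, not a file name, so it helps to translate it back into a path. Here is one possible sketch, with a made-up function name (object_path), that looks the ID up in the list of all objects reachable from any branch or tag.

```shell
# object_path DIR ID: print "<object id> <path>" for a blob or tree in history.
# The function name is hypothetical; ID may be a full ID or a unique prefix.
object_path() {
    repo=$1
    id=$2
    # rev-list --objects --all prints every reachable object; blobs and trees
    # are followed by the path they were found under.
    git -C "$repo" rev-list --objects --all | grep "^$id"
}
```

For the output above, object_path ~/repo 7cde50ff would print the ID together with the file name it was committed under, even if that file has since been deleted.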
4. Sort objects by size
Because my Git repository is fairly new and doesn’t have many commits in it, verify-pack shows only one object in my pack file. But if you have an old repository, this list will be very, very long. In these situations it is a bit harder to sort the lines by size, so you need to work on that data a bit. I’ve personally found it easier to replace the spaces with commas and save the output as a CSV file in order to sort it in Calc or Excel. You can use the regex [^\S\r\n]+ in any text editor that supports it to replace each run of spaces with a comma. With sed, which does not understand \S inside brackets, the equivalent is sed -E 's/[[:blank:]]+/,/g'.
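If you would rather stay in the terminal, the sorting can also be done with a short pipeline. The sketch below uses a made-up function name (largest_objects) and assumes a non-bare repository with at least one pack file. It keeps only the object lines of the verify-pack output and sorts them by the third column, the size reported by verify-pack.

```shell
# largest_objects [DIR] [COUNT]: print "<size> <type> <object id>", largest first.
# The function name is hypothetical; defaults are the current directory and 10 lines.
largest_objects() {
    repo=${1:-.}
    count=${2:-10}
    (
        cd "$repo" || exit 1
        git verify-pack -v .git/objects/pack/*.idx |
            # Skip the summary lines ("non delta: ...", "...pack: ok") and
            # put the size first so that sort can order by it.
            awk '$2 ~ /^(blob|tree|commit|tag)$/ { print $3, $2, $1 }' |
            sort -rn |
            head -n "$count"
    )
}
```

For example, largest_objects ~/repo 5 would list the five biggest objects across all pack files, which is usually enough to spot the culprit.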
5. View changes in a commit
After this point, you can locate that exact commit and view its changes to figure out what happened. Sometimes you just forget to add a file to .gitignore and commit it without noticing, or someone who is not familiar with the project commits something without knowing it is unnecessary. In times like these, we need to figure out what happened in the past and try to clean it up. Yes, you can clean it up. But that’s a topic for another day. 🙂
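To go from a large object to the commits that introduced or removed it, one option is the --find-object option of git log. As far as I know it needs a reasonably recent Git (around version 2.16 or later), so treat this as a sketch. The function name commits_touching_object is made up for the example.

```shell
# commits_touching_object DIR ID: list commits that added or removed the object.
# The function name is hypothetical; ID is an object ID taken from verify-pack.
commits_touching_object() {
    repo=$1
    id=$2
    # --all searches every branch and tag; --find-object keeps only commits
    # whose diff changes the number of occurrences of that object.
    git -C "$repo" log --all --oneline --find-object="$id"
}
```

With the commit IDs in hand, git show --stat followed by one of them displays which files that commit changed.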