Mr Branding: Managing Huge Repositories with Git

Tuesday, September 1, 2015

Managing Huge Repositories with Git

huge git cat Linus Torvalds created Git in the mid 2000s to solve a problem that other open source version control systems at that time could not—to be distributed, reliable and fast.

As he mentions in this Google tech talk on Git, Git was created out of necessity for the Linux project. At the time the talk was given, Git was very young and people were getting used to it. It seemed to solve all the problems faced by version control software, and this contributed to its meteoric rise.

Git's Shortcomings

Fast forward a few years, and people started noticing the first real flaw in Git: it was difficult to handle very large repositories. How large are we talking here? It's Facebook large. Facebook's team projected that in a few years, a simple git status would take up to half a minute to show the result, as Facebook adds a large number of commits from thousands of developers every day. Facebook shifted its whole codebase to Mercurial, and its team actively started contributing to Mercurial to meet Facebook's needs.

Where did Git fail? A Mercurial contributor, Prasoon Shukla, on the comparison of scaling in Git and Mercurial, says that this can be attributed to the way Mercurial and Git store commits. Mercurial manages a couple of objects (or files) for each file in your repository, whereas Git creates an object for each commit. Therefore, on increasing the number of commits, the number of objects in Mercurial remains constant, in contrast to a linear increase in Git. Therefore, when you run a simple git status command, Git has to sift through all these objects, which takes a considerable amount of time (in spite of the high efficiency of Git).

Another area where Git might fall short is managing large binary files in the repository. Because Git tracks the changes in files, it's not able to interpret the content of binary files. And the size of the repository increases with every commit, because Git has to store the exact binary, rather than the change from the last version.

Over the years, developers of Git have tried to solve these problems. Each third-party service has come up with solutions to enable Git to manage larger repositories—such as GitHub's Large File Storage extension.

This post looks at techniques that can be used to handle large repositories in Git—in terms of large histories, as well as the presence of large binary files, or both.

Projects with a Large Number of Commits

I'll firstly look at a few ways to manage repositories with large histories more efficiently.

Shallow clone of repositories

As mentioned earlier, the primary reason why projects with large histories slow down is the huge number of commits. In a distributed system like Git, when you clone a repository, its full project history gets downloaded. However, Git provides a way to specify the number of commits you want to have in your clone of a project. This is known as a shallow clone. When you get the number of commits down, your Git operations run faster.

To perform shallow cloning, you need to add the --depth option, with the number of commits we want, to the clone command:

git clone --depth [number_of_commits] [url_of_remote]

In earlier versions of Git, there was limited support for shallow clones. If your truncated history didn't stretch long enough, you weren't allowed to push or pull. However, with the release of Git 1.9.0, support for shallow clones was increased significantly.

Clone a single branch

When you clone a repository, all the branches in the remote get downloaded. (If you run git branch in a newly cloned repository, it shows only the master branch. You should run git branch -a to list all the branches that were a part of the remote.) It's probable that many of the commits present in other branches are irrelevant to one developer's work. Therefore, you can clone just the master or the branch relevant to your development. Doing so significantly reduces the number of commits that make up the history of the cloned version, especially if branches in the repository have divergent histories.

To clone only a single branch of a remote, you can run the following command:

git clone [url_of_remote] --branch [branch_name] --single-branch

This command instructs Git to clone only the branch_name branch from the remote.

Continue reading %Managing Huge Repositories with Git%

by Shaumik Daityari via SitePoint

Mr Branding