
This is super interesting, as I maintain a 1M-commit / 10GB repo at work, and I'm researching ways to have it cloned by users faster. Basically, for now I do a very similar thing manually: storing a "seed" repo in S3 and having a custom script fetch from S3 instead of doing `git clone`. (It's faster than cloning from GitHub, as apart from not having to enumerate millions of objects, S3 doesn't throttle the download, while GitHub seems to throttle at 16MiB/s.)
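
Roughly the shape of that approach, in case it's useful to anyone (the bucket, org and repo names here are made up, and a real script would do a bit more):

  # one-time / periodic seed creation, uploaded to S3
  cd big-repo
  git bundle create ../seed.bundle --all
  aws s3 cp ../seed.bundle s3://example-bucket/big-repo/seed.bundle

  # user-side "clone": pull the seed, clone from it, then point origin at the real remote
  aws s3 cp s3://example-bucket/big-repo/seed.bundle .
  git clone seed.bundle big-repo
  cd big-repo
  git remote set-url origin git@github.com:example-org/big-repo.git
  git fetch origin    # only the objects created since the seed was made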

Semi-related: I've always wondered, but never had time to dig into, what exactly the contents of the exchange between server and client are. I sometimes notice that when creating a new branch off main (still talking about the 1M-commit repo), with just one new tiny commit, the amount of data the client sends is way bigger than I expected (tens of MBs). I always assumed the client somehow established with the server that it already has a certain sha and only uploads the missing commit, but it seems that's not exactly the case when creating a new branch.



Funny you say this. At my last job I managed a 1.5TB Perforce depot with hundreds of thousands of files and had the problem of “how can we speed up CI”. We were on AWS, so I synced the repo, created an EBS snapshot, and used that to make a volume, with the intention of reusing it (as we could shove build intermediates in there too).

It was faster to just sync the workspace over the internet than it was to create the volume from the snapshot, and a clean build was quicker from the just-synced workspace than from the snapshotted one, presumably to do with how EBS volumes work internally.

We just moved our build machines to the same VPC as the server and our download speeds were no longer an issue.


When you create an EBS volume from a snapshot, the content is streamed in from S3 on a pull-through basis. You can enable FSR (fast snapshot restore), which creates the EBS volume with all the data up front, but it is an extra-cost option.
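
For reference, enabling it is roughly this (the snapshot ID is made up; pricing is per snapshot, per AZ, per hour):

  aws ec2 enable-fast-snapshot-restores \
      --availability-zones us-east-1a \
      --source-snapshot-ids snap-0123456789abcdef0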


Yeah, this is exactly my point. Despite provisioning (and paying for) io1 SSDs, it doesn’t matter, because you’re still pulling through on demand over a network connection to access it.

It was faster to just not do any of this. At my current job we pay $200/mo for a single bare metal server, and our CI is about 50% quicker than it was for 20% of the price.


Hmm, I don't know that making a new volume from a snapshot should fundamentally be faster than what a P4 sync could do. You're still paying for a full copy.

You could have possibly had existing volumes with mostly up-to-date workspaces. Then you're just paying for the attach time and the sync delta.


> I don't know that making a new volume from a snap should fundamentally be faster than what a P4 sync could do. You're still paying for a full copy.

My experience with running a C++ build farm in the cloud is that in theory all of this is true, but in practice it costs an absolute fortune and is painfully slow. At the end of the day it doesn’t matter if you’ve provisioned io1 storage; you’re still pulling it across something that vaguely resembles a SAN, and most of the operations AWS performs are not as quick as you think they are. It took about 6 minutes to boot a Windows EC2 instance, for example. Our incremental build was actually quicker than that, so we spent more time waiting for the instance to start up and attach to our volume cache than we did actually running CI. The machines were so expensive that we couldn’t justify keeping them running all day.

> You could have possibly had existing volumes with mostly up to date workspaces.

This is what we did for incremental builds. The problem was that when you want an extra instance, that volume needs to be created. We also saw roughly a 5x difference in speed (IIRC; this was 2021 when I set this up) between a no-op build on a freshly mounted volume and a no-op build on a volume we had just built on.


I used to use FUSE and overlayfs for this. I’m not sure it still works well, as I’m not a build engineer and I did it for myself.

It’s a lot faster in my case (a little over 3TiB for the latest revision only).


There’s a service called p4vfs [0] which does this for p4. The problem we had with it at the time was that our build tool unfortunately scanned everything (which was slow in and of itself), and that caused p4vfs to pull every file anyway. So it didn’t actually help.

[0] https://help.perforce.com/helix-core/server-apps/p4vfs/curre...


VMware?


What about it?


The Linux kernel does the same thing, and publishes bundle files over a CDN [0] for CI systems, using a script called linux-bundle-clone [1].

[0]: https://www.kernel.org/best-way-to-do-linux-clones-for-your-...

[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/mricon/k...
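
The rough shape of the technique, for anyone curious (this is not the actual script, and the bundle URL below is a made-up placeholder):

  # fetch a recent bundle from the CDN, clone from it, then top up from git.kernel.org
  curl -sSfO https://cdn.example.org/linux/clone.bundle    # placeholder URL
  git clone clone.bundle linux
  cd linux
  git remote set-url origin https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  git fetch origin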


This is fascinating, I didn't know they did this. It's actually not using the built-in functionality that Git has; they use a shell script that does basically the same thing rather than just advertising the bundle refs.

However, the shell script they use doesn't have the bug that I submitted a patch to address - it should have all the refs that were bundled.
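
For reference, advertising bundles via the built-in mechanism is roughly this bit of server-side config (the bundle name and URLs are placeholders), and clients currently have to opt in:

  # server side
  git config uploadpack.advertiseBundleURIs true
  git config bundle.version 1
  git config bundle.mode all
  git config bundle.seed.uri https://example.org/linux/seed.bundle

  # client side (opt-in)
  git -c transfer.bundleURI=true clone https://example.org/linux.git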


If I read the script correctly, it still points to git.kernel.org

However, it seems to use the git bundle technique mentioned in the article.


git.kernel.org hits one of the frontends based on geographic location. I'm not sure how often it's discussed, but see [1], and also `dig git.kernel.org`.

[1] https://www.reddit.com/r/linux/comments/2xqn12/im_part_of_th...


Have you looked into Scalar? It's built into MSFT git and designed to deal with the much larger repos Microsoft has internally.

  microsoft/git is focused on addressing these performance woes and making the monorepo developer experience first-class. The Scalar CLI packages all of these recommendations into a simple set of commands.
https://github.com/microsoft/scalar

https://github.com/microsoft/git
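
If it helps, the entry point is basically a drop-in for clone (the URL is a placeholder):

  # sets up partial clone, sparse checkout and background maintenance
  scalar clone https://github.com/example-org/big-repo.git

  # or keep full history but still get the other optimizations
  scalar clone --full-clone https://github.com/example-org/big-repo.git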


Scalar and MSFT git (many of whose features have made it into mainline by now) mostly address things like improving local speed by enabling filesystem caching, etc.

It doesn't address the issue of "how to clone the entire 10GB with full history faster". (Although it facilitates sparse checkouts, which can be beneficial for "multi-repos" where it makes sense to only clone a part of the repo, like in good old svn.)
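
For completeness, a sparse checkout of one slice of a monorepo looks roughly like this (the URL and paths are made up):

  git clone --filter=blob:none --sparse https://github.com/example-org/big-repo.git
  cd big-repo
  git sparse-checkout set services/api libs/common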


To try this feature out, you could have the server advertise a bundle ref file made with `git bundle create [bundle-file] --branches` that is hosted on a server within your network - it _should_ make a pretty big difference in local clone times.
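
Concretely, something along these lines (the internal hostname and repo are placeholders):

  # on a machine with a full clone: build the bundle, then host it on any internal HTTP server
  git bundle create big-repo.bundle --branches

  # on the client: seed the clone from the bundle, fetch whatever's newer from the real remote
  git clone --bundle-uri=https://cdn.internal.example/big-repo.bundle \
      https://github.com/example-org/big-repo.git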


The `--branches` option will work with how git works today. If my patch gets in, future versions of Git will be better with `--all`.


I can't imagine you haven't looked at this, but I'm curious: Do shallow clones help at all, or if not what was the problem with them? I'm willing to believe that there are usecases that actually use 1M commits of history, but I'd be interested to hear what they are.


People really want to have history locally so that "git blame" / GitLens IDE extension work locally.


These days if you do a blobless clone, Git will ask for missing files as it needs them. It's slower, but it's not broken.
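
i.e. something like (placeholder URL):

  # blobless clone: full commit/tree history, file contents fetched on demand
  git clone --filter=blob:none https://github.com/example-org/big-repo.git

  # treeless clone: smaller still, but more round-trips later on
  git clone --filter=tree:0 https://github.com/example-org/big-repo.git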


Maybe I was doing something wrong, but I had a very bad experience with one of them - tbh I don't remember whether blobless or treeless - when I evaluated it on a huge, fast-moving monorepo (150k files, hundreds of merges per day).

I cloned the repo, then was doing occasional `git fetch origin main` to keep main fresh - so far so good. At some point I wanted to `git rebase origin/main` a very outdated branch, and this made git want to fetch all the missing objects, serially one by one, which was taking extremely long compared to `git fetch` on a normal repo.

I did not find a way to convert the repo back to a "normal" full clone and get all the missing objects reasonably fast. The only thing I observed was git enumerating / checking / fetching missing objects one by one, which with thousands of missing objects takes so long that it becomes impractical.


The newest version of Git has a new `git backfill` command that may help with this.

https://git-scm.com/docs/git-backfill
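
Something like this, run inside the partial clone (I'm going from the docs here, so double-check the options):

  # download the missing blobs in batches instead of one by one on demand
  git backfill

  # optionally restrict it to the paths in your sparse-checkout
  git backfill --sparse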


Nice timing! Thanks!


For rebasing `--reapply-cherry-picks` will avoid the annoying fetching you saw. `git backfill` is great for fetching the history of a file before running `git blame` on that file. I'm not sure how much it will help with detecting upstream cherry-picks.
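
As I understand it, the default cherry-pick detection has to compute patch-ids for the upstream commits, which means reading their contents and therefore fetching blobs in a partial clone; skipping it looks like:

  # don't try to detect commits already applied upstream
  git rebase --reapply-cherry-picks origin/main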


Oh, interesting! Tbh I don't fully understand what "--reapply-cherry-picks" really does, because the docs are very concise and hand-wavy. Why doesn't it need the fetches? And why is it not the default?


Yeah, it basically has to advertise everything it has, so if you have a lot of references, it can be quite a large exchange before anything is done.


You can see basically what that part of the communication is by running `git ls-remote` and seeing how big the output is.
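
For example:

  # total size of the ref advertisement
  git ls-remote origin | wc -c

  # which ref namespaces dominate it
  git ls-remote origin | cut -f2 | cut -d/ -f1-2 | sort | uniq -c | sort -rn | head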


Indeed, `git ls-remote` produces 14MB of output; interestingly, 12MB of it is `refs/pull/<n>/head` refs, as it lists all PRs (including closed ones), and the repo has had ~200,000 PRs already.

It seems like large GitHub repos get an ever-growing penalty from GitHub exposing `refs/pull/...` refs then, which is not great.

I will do some further digging and perhaps reach out to GitHub support. That's been very helpful, thanks Scott!


Have you already switched to the "new" git protocol version 2? [1]

> An immediate benefit of the new protocol is that it enables reference filtering on the server-side, this can reduce the number of bytes required to fulfill operations like git fetch on large repositories.

[1] https://github.blog/changelog/2018-11-08-git-protocol-v2-sup...
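
e.g. (the URL is a placeholder); with v2, a pattern like the one below is filtered on the server instead of after downloading the full ref list:

  git config --global protocol.version 2

  # one-off test: with protocol v2 this only transfers the matching ref
  git -c protocol.version=2 ls-remote https://github.com/example-org/big-repo.git refs/heads/main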


Have you tried downloading the .zip archive of the repo? Or does that run into similar throttling?


The .zip archive of the repo contains just the current code checkout, no git history.


Why does a user need all 1M commits? Can they perform their work with only a few?



