Next.js App Router + React Server Components Demo

new
past
show
ask
show
jobs
submit

▲Going down the rabbit hole of Git's new bundle-URI (blog.gitbutler.com)

293 points by chmaynard 331 days ago | 92 comments

jakub_g 328 days ago [-]

This is super interesting, as I maintain a 1M commits / 10GB size repo at work, and I'm researching ways to have it cloned by the users faster. Basically for now I do a very similar thing manually, storing a "seed" repo in S3 and having a custom script to fetch from S3 instead of doing `git clone`. (It's faster than cloning from GitHub, as apart from not having to enumerate millions of objects, S3 doesn't throttle the download, while GH seem to throttle at 16MiB/s.)

Semi-related: I always wondered but never got time to dig into what exactly are the contents of the exchange between server and client; I sometimes notice that when creating a new branch off main (still talking the 1M commits repo), with just one new tiny commit, the amount of data the client sends is way bigger than I expected (tens of MBs). I always assumed the client somehow established with the server that it has a certain sha, and only uploads missing commit, but it seems it's not exactly the case when creating a new branch.

maccard 328 days ago [-]

Funny you say this. At my last job I managed a 1.5TB perforce depot with hundreds of thousands of files and had the problem of “how can we speed up CI”. We were on AWS, so I synced the repo, created an ebs snapshot and used that to make a volume, with the intention of reusing it (as we could shove build intermediates in there too.

It was faster to just sync the workspace over the internet than it was to create the volume from the snapshot, and a clean build was quicker from the just sync’ed workspace than the snapshotted one, presumably to do with however EBS volumes work internally.

We just moved our build machines to the same VPC as the server and our download speeds were no longer an issue.

coredog64 327 days ago [-]

When you create an EBS volume from a snapshot, the content is streamed in from S3 on a pull-through basis. You can enable FSR which creates the EBS volume with all the data up front, but it is an extra cost option.

maccard 327 days ago [-]

Yeah, this is exactly my point. Despite provisioning (and paying for) io1 ssd’s it doesn’t matter because you’re still pulling through on demand over a network connection to access it.

It was faster to just not do any of this. At my current job we pay $200/mo for a single bare metal server, and our CI is about 50% quicker than it was for 20% of the price.

jayd16 327 days ago [-]

Hmm I don't know that making a new volume from a snap should fundamentally be faster than what a P4 sync could do. You're still paying for a full copy.

You could have possibly had existing volumes with mostly up to date workspaces. Then you're just paying for the attach time and the sync delta.

maccard 327 days ago [-]

> I don't know that making a new volume from a snap should fundamentally be faster than what a P4 sync could do. You're still paying for a full copy.

My experience with running a c++ build farm in the cloud is that in theory all of this is true but in practice it costs an absolute fortune, and is painfully slow. At the end of the day it doesn’t matter if you’ve provisioned io1 storage; you’re still pulling it across something that vaguely resembles a SAN, and that most of the operations that AWS perform are not as quick as you think they are. It took about 6 minutes to boot a windows ec2 instance, for example. Our incremental build was actually quicker than that, so we spent more time waiting for the instance to start up and attach to our volume cache than we did actually running CI. The cost of the machines was expensive that we couldn’t justify keeping them running all day.

> You could have possibly had existing volumes with mostly up to date workspaces.

This is what we did for incremental builds. The problem was when you want an extra instance that volume needs to be created. We also saw roughly a 5x difference in speed (IIRC, this was 2021 when I set this up) between a noop build on a mounted volume and a noop build that we had just performed the build on.

dijit 328 days ago [-]

I used to use fuse and overlayfs for this, I’m not sure it still works well as I’m not a build engineer and I did it for myself.

Its a lot faster in my case (little over 3TiB for latest revision only).

maccard 327 days ago [-]

There’s a service called p4vfs [0] which does this for p4. The problem we had with this at the time was that unfortunately our build tool scanned everything (which was slow in and of itself) but that caused p4vfs to pull the file anyway. So it didn’t actually help.

[0] https://help.perforce.com/helix-core/server-apps/p4vfs/curre...

jclarkcom 328 days ago [-]

VMware?

maccard 327 days ago [-]

What about it?

captn3m0 328 days ago [-]

The linux kernel does the same thing, and publishes bundle files over CDN[0] for CI systems using a script called linux-bundle-clone[1]

[0]: https://www.kernel.org/best-way-to-do-linux-clones-for-your-...

[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/mricon/k...

schacon 327 days ago [-]

This is fascinating, I didn't know they did this. This is actually not using the built in functionality that Git has, they use a shell script that does basically the same thing rather than just advertising the bundle refs.

However, the shell script they use doesn't have the bug that I submitted a patch to address - it should have all the refs that were bundled.

miyuru 328 days ago [-]

If I read the script correctly, it still points to git.kernel.org

however, it seems to use the git bundle technique mentioned in the article.

opello 327 days ago [-]

git.kernel.org hits one of the frontends based on geographic location. I'm not sure how often it's discussed, but see [1], and also `dig git.kernel.org`.

[1] https://www.reddit.com/r/linux/comments/2xqn12/im_part_of_th...

bastardoperator 327 days ago [-]

Have you looked into Scalar? It's built into MSFT git and designed to deal with repos that are much larger internally.

  microsoft/git is focused on addressing these performance woes and making the monorepo developer experience first-class. The Scalar CLI packages all of these recommendations into a simple set of commands.

https://github.com/microsoft/scalar

https://github.com/microsoft/git

jakub_g 327 days ago [-]

scalar and msft git (whose many features made it into mainline by now) addresses mostly things like improving local speed by enabling filesystem caching etc.

It doesn't address the issue of "how to clone entire 10GB with full history faster". (Although it facilitates sparse checkouts, which can be beneficial for "multi-repos" where it makes sense to only clone a part of repo, like in old good svn.)

schacon 327 days ago [-]

To try this feature out, you could have the server advertise a bundle ref file made with `git bundle create [bundle-file] --branches` that is hosted on a server within your network - it _should_ make a pretty big difference in local clone times.

schacon 327 days ago [-]

The `--branches` option will work with how git works today. If my patch gets in, future versions of Git will be better with `--all`.

yjftsjthsd-h 327 days ago [-]

I can't imagine you haven't looked at this, but I'm curious: Do shallow clones help at all, or if not what was the problem with them? I'm willing to believe that there are usecases that actually use 1M commits of history, but I'd be interested to hear what they are.

jakub_g 327 days ago [-]

People really want to have history locally so that "git blame" / GitLens IDE extension work locally.

schacon 327 days ago [-]

These days if you do a blobless clone, Git will ask for missing files as it needs them. It's slower, but it's not broken.

jakub_g 327 days ago [-]

Maybe I was doing something wrong, but I had a very bad experience with - tbh don't remember, either blobless or treeless clone - when I evaluated it on a huge fast-moving monorepo (150k files, 100s of merges per day).

I cloned the repo, then was doing occasional `git fetch origin main` to keep main fresh - so far so good. At some point I wanted to `git rebase origin/main` a very outdated branch, and this made git want to fetch all the missing objects, serially one by one, which was taking extremely long compared to `git fetch` on a normal repo.

I did not find a way to to convert the repo back to "normal" full checkout and get all missing objects reasonably fast. The only way I observed happening was git enumerating / checking / fetching missing objects one by one, which in case of 1000s of missing objects takes so long that it becomes impractical.

schacon 327 days ago [-]

The brand newest version of Git has a new `git backfill` command that may help with this.

https://git-scm.com/docs/git-backfill

jakub_g 327 days ago [-]

Nice timing! Thanks!

p_wood 326 days ago [-]

For rebasing `--reapply-cherry-picks` will avoid the annoying fetching you saw. `git backfill` is great for fetching the history of a file before running `git blame` on that file. I'm not sure how much it will help with detecting upstream cherry-picks.

jakub_g 326 days ago [-]

Oh, interesting! Tbh I don't fully understand what "--reapply-cherry-picks" really does, because the docs are very concise and hand-wavy, and _why_ it doesn't need the fetches? Why it is not the default?

schacon 327 days ago [-]

Yeah, it basically has to advertise everything it has, so if you have a lot of references, it can be a quite large exchange before anything is done.

schacon 327 days ago [-]

You can see basically what part of the communication is by running `git ls-remote` and see how big it is.

jakub_g 327 days ago [-]

Indeed, `git ls-remote` produces 14MB output; interestingly, 12MB of it are `refs/pull/<n>/head` as it lists all PRs (including closed ones), and the repo has had ~200,000 PRs already.

It seems like large GitHub repos get an ever-growing penalty for GitHub exposing `refs/pull/...` refs then, which is not great.

I will do some further digging and perhaps reach out to GitHub support. That's been very helpful, thanks Scott!

kvemkon 327 days ago [-]

Have you switched already to the "new" git protocol version 2? [1]

> An immediate benefit of the new protocol is that it enables reference filtering on the server-side, this can reduce the number of bytes required to fulfill operations like git fetch on large repositories.

[1] https://github.blog/changelog/2018-11-08-git-protocol-v2-sup...

djfivyvusn 328 days ago [-]

Have you tried downloading the .zip archive of the repo? Or does that run into similar throttling?

jakub_g 327 days ago [-]

.zip archive of the repo has just current code checkout, no git history

sunnybeetroot 327 days ago [-]

Why does a user need all 1M commits? Can they perform their work with only a few?

ks2048 328 days ago [-]

How much bandwidth and time is wasted cloning the entire history of large projects when people only need single snapshot in a single branch?

According to SO, newer versions of git can do,

  git init
  git remote add origin <url>
  git fetch --depth 1 origin <sha1>
  git checkout FETCH_HEAD

acheong08 328 days ago [-]

git clone --depth 1 works as well. If you're just cloning to build and not contributing it makes much more sense

hiccuphippo 327 days ago [-]

There's also the partial clone which has the tree but not the blobs:

git clone --filter=blob:none

Reommended for developers by github over the shallow clone: https://github.blog/open-source/git/get-up-to-speed-with-par...

jakub_g 327 days ago [-]

Note: this makes sense on CI for a throwaway build, but not for a local dev clone. Blobless clones break or make painfully slow and expensive many local git operations.

pabs3 326 days ago [-]

Also --filter tree:0

mikepurvis 328 days ago [-]

Github can also just serve you a tarball of a snapshot, which is faster and smaller than a shallow clone (and therefore it's the preferred option for a lot of source package managers, like nix, homebrew, etc).

It’s frustrating that tarball urls are a proprietary thing and not something that was ever standardized in the git protocol.

Cthulhu_ 327 days ago [-]

Yeah that's what I try to push for if the user (CI, whichever) just wants the files, using "git archive --remote=" is the fastest way to get just the files.

However, a lot of CIs / build processes rely on the SHA of the head as well, although I'm sure that's also cheap / easy to do without cloning the whole repository.

But that falls apart when you want to make a build / release and generate a changelog based on the commits. But, that's not something that happens all that often in the greater scheme of things.

mikepurvis 327 days ago [-]

As long as there's some envvars with the SHA, branch name, remote, etc, all that should be handleable by a wrapper (or git itself) being able to fall back on those in instances where it's invoked in a tarball of a repo rather than a real repo.

EDIT: Or alternatively (and probably better), the forges could include a dummy .git directory in the tarball that declares it an "archive"-type clone (vs shallow or full), and the git client would read that and offer the same unshallow/fetch/etc options that are available to a regular shallow clone.

skissane 328 days ago [-]

> It’s frustrating that tarball urls are a proprietary thing and not something that was ever standardized in the git protocol.

I think there’s a lot of stuff which is common to the major Git hosters (GitHub, GitLab, etc) - PRs/MRs, issues, status checks, etc - which I wish we had a common interoperable protocol for. Every forge has its own REST API which provides many of the same operations and fields just in an incompatible way. There really should be standardisation in this area but I suppose that isn’t really in the interests of the major incumbents (especially GitHub) since it would reduce the lock-in due to switching costs

mikepurvis 327 days ago [-]

Yeah, the motivation question is definitely a tricky one. A common REST story also feels like a piece of eventually getting to federated PRs between forges, though it may well be that that's just impossible, particularly given that GitLab has been thinking about it for a decade and hasn't even got a story for federation between instances of itself much less with Github or Bitbucket:

https://gitlab.com/gitlab-org/gitlab/-/issues/14116

p_wood 326 days ago [-]

> It’s frustrating that tarball urls are a proprietary thing and not something that was ever standardized in the git protocol.

`git archive --remote` will create a tarball from a remote repository so long as the server has enabled the appropriate config

arkh 327 days ago [-]

> If you're just cloning to build

... the last commit. If you have to rollback a deployment, you'll want to add some depth to your clone.

327 days ago [-]

jes5199 328 days ago [-]

I have a vague recollection that GitHub is optimized for whole repo cloning and they were asking projects not to do shallow fetching automatically, for performance reasons

nyanpasu64 328 days ago [-]

As I understand this issue affected Homebrew and CocoaPods: https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...

> Apparently, most of the initial clones are shallow, meaning that not the whole history is fetched, but just the top commit. But then subsequent fetches don't use the --depth=1 option. Ironically, this practice can be much more expensive than full fetches/clones, especially over the long term. It is usually preferable to pay the price of a full clone once, then incrementally fetch into the repository, because then Git is better able to negotiate the minimum set of changes that have to be transferred to bring the clone up to date.

sureIy 328 days ago [-]

I don't know if that applies anymore or if it doesn't apply on GitHub Actions, but shallow clones is the default there. See `actions/checkout`

jakub_g 327 days ago [-]

GH Actions generally need a throwaway clone. The issue with shallow clones is that subsequent fetches can be expensive. But in CI most of the time you don't need to fetch after clone.

masklinn 327 days ago [-]

It’s subsequent updates which didn’t work well. And it is (or at least was) a limitation of git itself.

bobbylarrybobby 328 days ago [-]

I believe there is a bit of a footgun here because if you don't git clone then you don't fetch all branches, just the default. Can be very confusing and annoying if you know a branch exists on remote but don't have it locally (the first time you hit it, at least).

autarch 328 days ago [-]

> This has resulted in a contender for the world's smallest open source patch:

Hah, got you beat: https://github.com/eki3z/mise.el/pull/12/files

It's one ASCII character, so a one-byte patch. I don't think you can get smaller than that.

wavemode 327 days ago [-]

A commit which does nothing more than change permissions of a file would probably beat that, from an information theory perspective.

You might say, "nay! the octal triple of a file's unix permissions requires 3+3+3 bits, which is 9, which is greater than the 8 bits of a single ascii character!"

But, actually, Git does not support any file permissions other than 644 and 755. So a change from one to the other could theoretically be represented in just one bit of information.

masklinn 327 days ago [-]

> Git does not support any file permissions other than 644 and 755.

It does not but kinda does, git stores the entire unix mode in the tree object (in ascii encoded octal to boot), it has a legacy 775 mode (which gets signaled by fsck —strict, maybe, if that’s been fixed), and it will internally canonicalise all sorts of absolute nonsense.

chungy 327 days ago [-]

> But, actually, Git does not support any file permissions other than 644 and 755. So a change from one to the other could theoretically be represented in just one bit of information.

Git also supports mode 120000 (symbolic links), so you could add another bit of information there.

ithkuil 327 days ago [-]

Ok. But you need to specify the name of the file. So it also needs to be a one character filename.

I wonder if file deletion is theoretically even smaller than permission change.

arghwhat 327 days ago [-]

> So it also needs to be a one character filename.

No one counts the path being changed, just like they also don't count the commit message, committer and author name/email, etc.

retroflexzy 328 days ago [-]

There is a cursor rendering fix in xf86-video-radeonhd (or perhaps -radeon) that flips a single bit.

It took the group several years to narrow in on.

immibis 327 days ago [-]

The Dolphin patch that fixed the heat haze in Dragon Roost Island was also a single bit, changing a 3 to a 7 in source code, and also took several years.

https://dolphin-emu.org/blog/2014/01/06/old-problem-meets-it...

This comment has been delayed by about 12 hours due to the Hacker News rate limit.

328 days ago [-]

timdorr 328 days ago [-]

Sure you can: https://github.com/timdorr/-/commit/9e5a571abd3fc4f8714e8c40...

falcor84 328 days ago [-]

What's the story behind that? Did you just deploy a blank commit to trigger a hook?

nine_k 328 days ago [-]

Only accepted and merged commits count!

schacon 327 days ago [-]

Damn you!

I did find it a little funny that my patch was so small but my commit message was so long. Also, I haven't successfully landed it yet, I keep being too lazy to roll more versions.

ZeWaka 328 days ago [-]

That's a line modification, so presumably you'd count just an insertion or just a deletion as 'smaller'.

autarch 328 days ago [-]

Yes, but so is the PR shown in the article. You're not going to get a diff that's less than one line unless you are using something besides the typical diff and patch tools.

san1t1 328 days ago [-]

My smallest PR was adding a missing executable file permission.

autarch 328 days ago [-]

I think that wins, since presumably it was smaller than a one-byte change. I guess the smallest would be a single-bit file mode change, maybe?

dullcrisp 328 days ago [-]

Only the user permission? Or group and others also?

Izkata 327 days ago [-]

In terms of file contents, I have a bugfix commit somewhere that just adds a single newline to a config file (no line deleted in the diff). We were using something that had its own own custom parser for what looked like an INI file, and we managed to find an edge case where the wrong number of blank lines would break it.

pR0Ps 324 days ago [-]

Same: https://github.com/ianare/exif-py/pull/75/files

I feel like a bit of a fraud because this was the PR that got me the "Mars 2020 Contributor" badge...

geenat 328 days ago [-]

git needs built in handling of large binary files without a ton of hassle, it's all I ask. It'd make git universally applicable to all projects.

mercurial had it for ages.

svn had it for ages.

perforce had it for ages.

just keep the latest binary, or last x versions. Let us purge the rest easily.

pjc50 327 days ago [-]

As someone who used and administered p4 for ages, I regard git as a regression in this regard. Making git a fully distributed system is really expensive for certain use cases. My current employer still uses p4 for large integrated-circuit workflow assets.

A previous workplace was trying to migrate from svn to git, when we realized that every previous official build had checked in the resulting binaries. A sane thing to do in svn, where the cost is only on the server, but would have resulted in a naive conversion costing 50Gb on every client.

GrantMoyer 328 days ago [-]

Getting better: https://git-scm.com/docs/partial-clone

Cthulhu_ 327 days ago [-]

If git lfs fulfills this role, it could become pre-installed with new git installations.

WorldMaker 327 days ago [-]

It is preinstalled on Windows and macOS today by the official installers. Ask your Linux distribution if out-of-the-box git lfs is right for you.

robertlagrant 327 days ago [-]

Nothing to do with the article, but I appreciate the slightly idiosyncratic GitButler YouTube videos that explain how bits of Git work.

schacon 327 days ago [-]

If you want us to cover something on Bits and Booze, just let me know! :)

andrewshadura 328 days ago [-]

Interestingly, Mercurial had solved the bundles more than ten years ago and back then they already worked better than Git's today

capitainenemo 328 days ago [-]

Not the only mercurial feature where that's the case.. sad, I keep rooting for the project to implement mercurial frontend over a git db, but they seem to be limited by missing git features.

kps 328 days ago [-]

Jujutsu (jj) is heavily inspired by Mercurial (though with some significant differences) and can operate with git as a storage backend. https://github.com/jj-vcs/jj

capitainenemo 328 days ago [-]

Yeah. That sounds a bit right. But from what I read they were limited in what they could implement due to missing git features. For example phases.

https://ahal.ca/blog/2024/jujutsu-mercurial-haven/ was a post on that.

But, it looks like they are trying, and at least they imposed some sanity like in the base commit ID. I wonder if they have anything like hg grep --all, hg absorb and hg fa --deleted yet.

They do have revsets ♥

kps 327 days ago [-]

I should say I've never been expert with any VCS — just enough to get things done. I know it has `absorb`, and `grep` can be approximated with a `diff_contains()` revset, but it's not pretty (and aliases are not powerful enough to compensate). Hopefully polish will come over time. I don't know `hg fa` at all.

I'm cautiously optimistic for jj's future because the git backend eliminates the main barrier to adoption.

capitainenemo 327 days ago [-]

hg fa is basically just an optimised blame/annotate "fast-annotate" the awesome feature is --deleted which gives you the file with all the prior removals in context with revision info in it, which is incredibly useful for seeing how the file changed over time in a fairly concise way. Not a full unreadable mess, but those removals often have key details...

327 days ago [-]

dgfitz 328 days ago [-]

Someone once put together an llm backed list of things people on hn post about a lot, mine was about this “other” dvcs system.

It is superior, and it’s not even much of a comparison.

Already__Taken 328 days ago [-]

I used mercurial in anger for about 9 months or something, with a gitlab fork too. when git goes wrong there's forums, blogs, books and manuals. When hg does it's a python stack trace, good luck.

capitainenemo 328 days ago [-]

When I've had mercurial issues, I went to the mercurial channel on libera, or to their manual. But then, haven't ended up with a stack trace yet.

theamk 327 days ago [-]

how did it solve them, and how are mercurial's bundles better than git's ones?

if I am reading the manpage right, the feature set seems pretty compatible. "hg bundle" looks pretty identical to "git bundle".. and "hg clone"'s "-R" option seems pretty similar to "git clone"'s "--reference".

nine_k 328 days ago [-]

But branches were more problematic.

capitainenemo 328 days ago [-]

Mercurial has had git-like "lightweight branches"/bookmarks without the revision record of mercurial named branches for over 15 years. There are good reasons to use the traditional branches though.

https://mercurial.aragost.com/kick-start/en/bookmarks/

DrinkyBird 328 days ago [-]

The topics[0] feature in the evolution extension is probably even closer to Git branches, since they are completely mutable and needn't be a permanent part of your repo. Bookmarks are just pointers to changesets, and although that's technically how Git branches work, it's not how they work in practice in Mercurial because of its focus on immutability (and because hg and git work differently).

[0]: https://www.mercurial-scm.org/doc/evolution/tutorials/topic-...

mbac32768 328 days ago [-]

One consequence of git clone is that if you have mega repos, it kind of ejects everything else from your cache for no win.

You'd actually rather special case full clones and instruct the storage layer to avoid adding to the cache for the clone. But this isn't always possible to do.

Git bundles seem like a good way to improve the performance of other requests, since they punt off to a CDN and protect the cache.

jedimastert 328 days ago [-]

This actually might solve a massive CI problem we've been having...will report back tomorrow

jwpapi 328 days ago [-]

[flagged]

Rendered at 16:23:45 GMT+0000 (Coordinated Universal Time) with Vercel.