Next.js App Router + React Server Components Demo

new
past
show
ask
show
jobs
submit

▲TernFS – An exabyte scale, multi-region distributed filesystem (xtxmarkets.com)

251 points by rostayob 142 days ago | 109 comments

charleshn 142 days ago [-]

A few questions if the authors are around!

> Is hardware agnostic and uses TCP/IP to communicate.

So no RDMA? It's very hard to make effective use of modern NVMe drives bandwidth over TCP/IP.

> A logical shard is further split into five physical instances, one leader and four followers, in a typical distributed consensus setup. The distributed consensus engine is provided by a purpose-built Raft-like implementation, which we call LogsDB

Raft-like, so not Raft, a custom algorithm? Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?

> Read/write access to the block service is provided using a simple TCP API currently implemented by a Go process. This process is hardware agnostic and uses the Go standard library to read and write blocks to a conventional local file system. We originally planned to rewrite the Go process in C++, and possibly write to block devices directly, but the idiomatic Go implementation has proven performant enough for our needs so far.

The document mentions it's designed to reach TB/s though. Which means that for an IO intensive workload, one would end up wasting a lot of drive bandwidth, and require a huge number of nodes.

Modern parallel filesystems can reach 80-90GB/s per node, using RDMA, DPDK etc.

> This is in contrast to protocols like NFS, whereby each connection is very stateful, holding resources such as open files, locks, and so on.

This is not true for NFSv3 and older, it tends to be stateless (no notion of open file).

No mention of the way this was developed and tested - does it use some formal methods, simulator, chaos engineering etc?

jleahy 141 days ago [-]

> So no RDMA?

We can saturate the network interfaces of our flash boxes with our very simple Go block server, because it uses sendfile under the hood. It would be easy to switch to RDMA (it’s just a transport layer change) but right now we didn’t need to. We’ve had to make some difficult prioritisation decisions here.

PRs welcome!

> Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?

We’re used to building things like this, trading systems are giant distributed systems with shared state operating at millions of updates per second. We also cheated, right now there is no automatic failover enabled. Failures are rare and we will only enable that post-Jepsen.

If we used somebody else’s implementation we would never be able to do the multi-master stuff that we need to equalise latency for non-primary regions.

> This is not true for NFSv3 and older, it tends to be stateless (no notion of open file).

Even NFSv3 needs a duplicate request cache because requests are not idempotent. Idempotency of all requests is hard to achieve but rewarding.

AtlasBarfed 142 days ago [-]

Not to mention you simply want a large distributed system implemented in multiple clouds / on prems / use cases, with battle tested procedures on node failure, replacement, expansion, contraction, backup/restore, repair/verification, install guides, an error "zoo".

Not to mention a Jepsen test suite, detailed CAP tradeoff explanation, etc.

There's a reason those big DFS at the FAANGs aren't really implemented anywhere else: they NEED the original authors with a big, deeply experienced infrastructure/team in house.

menaerus 141 days ago [-]

DeepSeek team, which is also an HFT shop, also implemented their DFS - https://github.com/deepseek-ai/3FS

Yoric 141 days ago [-]

My memories are a bit sketchy, but isn't CAP worked around by the eventual consistency of Paxos/Raft/...?

immibis 141 days ago [-]

The protocols you mentioned are always consistent. You will know if they are not consistent because they will not make progress. Yes there's a short delay where some nodes haven't learned about the new thing yet and only know that they're about to learn the new thing, but that's not what's meant by "eventual consistency", which is when inconsistent things may happen and become consistent at some time later. In Paxos or Raft, nodes that know the new consistent data is about to arrive can wait for it and present the illusion of a completely consistent system (as long as the network isn't partitioned so the data eventually arrives). These protocols are slow. so they're usually only used for the most important coordination, like knowing which servers are online.

CAP cannot be worked around. In the event of a partition, your system is either C or A, no buts. Either the losing side of the partition refuses to process writes and usually reads as well (ensuring consistency and ensuring unavailability) or it does not refuse (ensuring availability and ensuring data corruption). There are no third options.

Well, some people say the third option is to just make sure the network is 100% reliable and a partition never occurs. That's laughable.

AtlasBarfed 141 days ago [-]

Yes, so when I see a distributed system that does not tell me explicitly whether or not it is sacrificing consistency or availability, I get suspicious.

Or has mechanisms for tuning on a request basis what you want to prioritize: consistency or availability, and those depend on specific mechanisms for reads and writes.

If I don't see a distributed system that explains such things, then I'm assuming that they made a lot of bad assumptions.

Yoric 141 days ago [-]

> Yes there's a short delay where some nodes haven't learned about the new thing yet, but that's not what's meant by "eventual consistency", which is when inconsistent things may happen and become consistent at some time later.

Thanks, I haven't looked at these problems in a while.

> In the event of a partition, your system is either C or A, no buts.

Fair enough. Raft and Paxos provide well-understood tradeoffs but not a workaround.

foota 142 days ago [-]

Out of curiosity, you seem knowledgeable here, is it possible to do NVME over RDMA in public cloud (e.g., on AWS)? I was recently looking into this and my conclusion was no, but I'd love to be wrong :)

stonogo 142 days ago [-]

Amazon FSx for Lustre is the product. They do have information on DIY with the underlying tech: https://aws.amazon.com/blogs/hpc/scaling-a-read-intensive-lo...

foota 142 days ago [-]

Thanks for the link! I had seen this, but it wasn't clear to me either how to configure the host as an nvme-of target, nor whether it would actually bypass the host CPU. The article (admittedly now 4 years old) cites single digit GB/second, while I was really hoping for something closer to the full NVME bandwidth. Maybe that's just a reflection of the time though, drives have gotten a lot faster since then.

Edit: this is more like what I was hoping for: https://aws.amazon.com/blogs/aws/amazon-fsx-for-lustre-unloc... although I wasn't looking for a file system product. Ideally a tutorial like... "Create a couple VMs, store a file on one, do XYZ, and then read it from another with this API" was what I was hoping for, or at least some first party documentation of how to use these things together.

stonogo 142 days ago [-]

Probably something more like this? https://github.com/aws-samples/amazon-fsx-tutorial/tree/mast...

foota 141 days ago [-]

So... The issue is that I'm not using lustre. As far as I can tell, NVME over fabrics (nvme-of) for RDMA is implemented by kernel modules nvmet-rdma and nvme-rdma (the first being for the target). This kernel modules supports infibiband and I think fiber channels, but _not_ EFA, and EFA itself is not an implementation of infiniband. There are user space libraries that paper over these differences when using them for just network transport (E.g., libfabrics) and EFA sorta pretends to be IB, but afaict this is just meant to ease integration at the user space level. Unfortunately, since there's no kernel module support for EFA in the nvme-of kernel modules, it doesn't seem possible to use without lustre. I don't know exactly how they're doing it for lustre clients. There seems to be a lustre client kernel module though, so my guess is that it's in there? The lustre networking module, lnet, does have an EFA integration, but it seems to only be as a network transit. I don't see anything in lustre about nvme-of though, so I'm not sure.

Maybe there's something I'm missing though, and it'll just work if I give it try :)

lustre-fan 141 days ago [-]

Yeah, Lustre supports EFA as a network transit between a Lustre client and a Lustre server. It's lnet/klnds/kefalnd/ in the Lustre tree. But Lustre doesn't support NVMeoF directly. It uses a custom protocol. And neither does EFA. Someone would have to modify the NVMeoF RDMA target/host drivers to support it. EFA already supports in-kernel IB clients (that's how Lustre uses EFA today). So it's not an impossible task. It's just that no one has done it.

foota 141 days ago [-]

Hey, thanks for the comment! Also, I'm amused by the specificity of your account haha, do you have something set to monitor HN for mentions of Lustre?

> "But Lustre doesn't support NVMeoF directly. It uses a custom protocol."

Could you link me to this? I searched the lustre repo for nvme and didn't see anything that looked promising, but would be curious to read how this works.

> "And neither does EFA. Someone would have to modify the NVMeoF RDMA target/host drivers to support it."

To confirm, you're saying there'd need to be something like an EFA equivalent to https://kernel.googlesource.com/pub/scm/linux/kernel/git/tor... (and corresponding initiator code)?

> "EFA already supports in-kernel IB clients (that's how Lustre uses EFA today). So it's not an impossible task. It's just that no one has done it."

I think you're saying there's already in-kernel code for interfacing with EFA, because this is how lnet uses EFA? Is that https://kernel.googlesource.com/pub/scm/linux/kernel/git/tyc...? I found this but I wasn't sure if this was actually the dataplane (for lack of a better word) part of things, from what I read it sounded like most of the dataplane was implemented in userspace as a part of libfabric, but it sounds like I might be wrong.

Does this mean you can generally just pretend that EFA is a normal IB interface and have things work out? If that's the case, why doesn't NVME-of just support it naturally? Just trying to figure out how these things fit together, I appreciate your time!

In case you're curious, I have a stateful service that has an NVME backed cache over object storage and I've been wondering what it would take to make it so that we could run some proxy services that can directly read from that cache to scale out the read throughput from an instance.

lustre-fan 140 days ago [-]

> do you have something set to monitor HN for mentions of Lustre?

Nothing, beside browsing hackernews a bit too much.

> "But Lustre doesn't support NVMeoF directly. It uses a custom protocol."

To be specific, Lustre is a parallel filesystem. Think of it like a bigger version of NFS. You format the NVMe as ext4 or ZFS and mount them as Lustre servers. Once you have an MGS, MDS, and OSS - you can mount the servers as a filesystem. Lustre won't export the NVMe to client as a block device. But you could mount individual Lustre files as a block device, if you want.

> To confirm, you're saying there'd need to be something like an EFA equivalent to https://kernel.googlesource.com/pub/scm/linux/kernel/git/tor... (and corresponding initiator code)?

Essentially, yeah.

> I think you're saying there's already in-kernel code for interfacing with EFA, because this is how lnet uses EFA?

Yes. EFA implements kernel verbs support. Normal user-space applications use user verbs i.e. https://www.kernel.org/doc/html/latest/infiniband/user_verbs.... Kernel verbs support allows kernel-space applications to also use EFA. This is currently implemented in the out-of-tree version of the EFA driver https://github.com/amzn/amzn-drivers/tree/master/kernel/linu.... Lustre interfaces with that with the driver in lnet/klnds/efalnd/. NVMeoF would need some similar glue code.

> Does this mean you can generally just pretend that EFA is a normal IB interface and have things work out? If that's the case, why doesn't NVME-of just support it naturally? Just trying to figure out how these things fit together, I appreciate your time!

There are some similarities (the EFA driver is implemented in the IB subsystem, after all). But the semantics for adding/removing ports/interfaces would be different - so it wouldn't "just work" without some changes. I don't know the full scope of the changes (I haven't dived into it too much). Although, I suspect support would look fairly similar to drivers/nvme/target/rdma.c.

> In case you're curious, I have a stateful service that has an NVME backed cache over object storage and I've been wondering what it would take to make it so that we could run some proxy services that can directly read from that cache to scale out the read throughput from an instance.

If you're looking for a scale out cache in front of s3, that's essentially Lustre/s3 integration https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-dr.... It's a filesystem, so I guess it depends on how your service expects access objects.

harshaw 142 days ago [-]

Sounds more like an object system (immutable) with the veneer of a file system for their use cases. I sort of read the doc - sounds like data is replicated and not erasure encoded (so perhaps more expensive?).

I think many people have said this, but "file systems" get a lot easier if you don't have to worry about overwrites, appends, truncates, etc. Anyway, always interesting to see what people come up with for their use cases.

rostayob 142 days ago [-]

We do use Reed-Solomon codes, as the blog post explains.

Balinares 141 days ago [-]

Append-only is pretty much the only way to do robust replicated storage at scale, else you get into scenarios where, instead of a given logical block having two possible states (either existing somewhere or not existing anywhere), the block can exist with multiple values at different times, including an unbounded number of invalid ones, for instance in case a client died halfway through a block mutation. Immutability is just plain a very strong invariant.

It also does not at all preclude implementing a read-write layer on top of it, for instance with a log-structured FS design. That's however the solution to a problem these people are, it seems, not having.

rickette 142 days ago [-]

Over 500PB of data, wow. Would love to know how and why "statistical models that produce price forecasts for over 50,000 financial instruments worldwide" require that much storage.

guerby 142 days ago [-]

If you keep all order book changes for a large number of financial instruments volume adds up quickly.

rickette 142 days ago [-]

Would that kind of data not compress like crazy? Or would they need to keep all that data hot and fast.

guerby 142 days ago [-]

From just a single exchange you can reach up to 1 million messages of order book change per second

https://www.nasdaqtrader.com/snippets/inet2.html

   Message Volume  1,684,103,265
   Messages per Second  1,134,640
   Order Volume  871,875,595
   Orders per Second  581,696
   Share Volume  12,814,454,760
   Executions per Second  193,350

Also if you look at equity derivative products which have parameters like type call/put, strike, maturity can be hundreds of financial products for one underlying stock.

I worked in this sector and volume of data is a real challenge, no wonder you often get custom software to handle that :)

rickette 141 days ago [-]

Thanks for the insight!

pcthrowaway 141 days ago [-]

How do you propose lossless compression for all orderbook data? Of course if you are willing to lose granularity/information, it can be compressed a lot

taneliv 137 days ago [-]

I would imagine to lesser extent government policy changes and news articles, and to larger extent online discussions on topics relevant to these instruments. Models then attempt to extract signals with predictive value from all the noise. Probably contains non-trivial amount of history to correlate words to market performance in the past, say 20 years or more.

But it's really just a guess, I haven't worked in this domain.

Beijinger 142 days ago [-]

Me too. Is is really hard for me to understand, what XTX is actually doing. Trading? VC? AI/ML?

Have you seen their portfolio?

PS: Company seems legit. Impressive growth. But I still don't understand what they are doing. Provide "electronic liquidity". Well....

orbifold 142 days ago [-]

computing correlations between 50.000 financial instruments (X^T X) and doing linear regression ;).

adastra22 142 days ago [-]

High frequency trading.

mrbluecoat 142 days ago [-]

Cool project and kudos for open sourcing it. Noteworthy limitation:

> TernFS should not be used for tiny files — our median file size is 2MB.

jandrewrogers 142 days ago [-]

I have worked on exabyte-scale storage engines. There is a good engineering reason for this type of limitation.

If you had 1 KiB average file size then you have quadrillions of metadata objects to quickly search and manage with fine-granularity. The kinds of operations and coordination you need to do with metadata is difficult to achieve reliably when the metadata structure itself is many PB in size. There are interesting edge cases that show up when you have to do deep paging of this metadata off of storage. Making this not slow requires unorthodox and unusual design choices that introduce a lot of complexity. Almost none of the metadata fits in memory, including many parts of conventional architectures we assume will always fit in memory.

A mere trillion objects is right around the limit of where the allocators, metadata, etc can be made to scale with heroic efforts before conventional architectures break down and things start to become deeply weird on the software design side. Storage engines need to be reliable, so avoiding that design frontier makes a lot of sense if you can avoid it.

It is possible to break this barrier but it introduces myriad interesting design and computer science problems for which there is little literature.

toast0 142 days ago [-]

Small files suck on normal filesystems too. There's reasons to have them, but if the stars align and you can go from M directories of N directories of O files to M directories of N files with O sub-files, it can make a lot of operations way faster, but probably not updates to individual sub-files (but if you're updating all the files and can update all of the M/N.db at once, then that probably is faster)

stuartjohnson12 142 days ago [-]

This sounds like a fascinating niche piece of technical expertise I would love to hear more about.

What are the biggest challenges in scaling metadata from a trillion to a quadrillion objects?

jandrewrogers 142 days ago [-]

It is dependent on the intended workload but there are a few common design problems. Keep in mind that you can't just deal in the average case, you have to design for the worst possible cases of extremely skewed or pathologically biased distributions. A lot of the design work is proving worst case resource bounds under various scenarios and then proving the worst case behavior of designs intended to mitigate that.

An obvious one is bulk deletion, which is rarely fast at any scale. This may involve trillions of updates to search indexing structures, which in naive implementations could look like pointer-chasing across disk. Releasing storage to allocators has no locality because you are streaming the allocations to release off that storage in semi-random order. It is unhelpfully resistant to most scheduling-based locality optimization techniques. You also want to parallelize this as much as possible and some of these allocators will be global-ish.

The most interesting challenge to me is meta-scheduling. Cache replacement algorithms usually don't provide I/O locality at this scale so standard mitigations for cache-resistant workloads like dynamic schedule rewriting and latency-hiding are used instead. Schedulers are in the center of the hot path so you really want these to be memory-resident and fast. Their state size is loosely correlated with the number of objects, so in some extreme cases these can easily exceed available memory on large servers. You can address this by designing a "meta-scheduler" that adaptively optimizes the scheduling of scheduler state, so that the right bits are memory-resident at the right time so that the scheduler can optimally schedule its workload. It is difficult to overstate how much of a breaking change to conventional architecture this turns out to be. These add some value even if the state is memory resident but they greatly increase design complexity and make tail latencies more difficult to manage.

A more basic challenge is that you start dealing with numbers that may not be representable in 64-bits. Similarly, many popular probabilistic algorithms may not offer what you need when the number of entities is this large.

I aggressively skirted these issues for a long time before relenting. I deal more with database storage engines than filesystems, but to a first approximation "files" and "shards" are equivalent for these purposes.

stuartjohnson12 141 days ago [-]

This is fascinating. If you wrote a long essay about this, I (and probably most of hacker news) would surely love to read it.

comprev 142 days ago [-]

This is quite fascinating, thank you!

KaiserPro 142 days ago [-]

you really notice metadata performance (try a git checkout on EFS on AWS. loads of small files takes fucking ages) However EFS is actually pretty fast. you can get decent throughput if you're writing to just one file. but if you're trying to open 1000 1meg files to read from vs 1 1G file, it'll be much slower (unless they'd dramatically improved performance recently)

Trying to have a fast globally consistent database for quadrillion items in the _same_ name space is super hard. You need to chose a tradeoff between speed, partition resistance and consistency.

You're much better off sharding into discreet logical units. Its very rare that you need a global namespace for a filesystem. For VFX where we used lustre a lot, the large namespace was a nice to have, it was more about getting a raid-0 across file servers (well object stores) to get performance.

For filesystems specifically, if you're using folders, then you don't actually need to guarantee much outside of a folder. So long as filenames are unique to that folder, you can get away with a lot of shit you can't do in a normal database. you also don't need directories to be on the same filesystem (well in linux at least) so you can also shard by using directories as a key.

The directory-key-filesystem approach is actually hilariously simple, fast scalable and reliable. If a single server/Fs goes down it only takes out that area. On the downside it does mean that you can overwhelm/get hot spots.

dekhn 142 days ago [-]

We are truly spoiled by all the improvements that went into local filesystems that are lacking in network filesystems. So much of our perception of "computer is fast" is really just write-caching, read-caching, read-ahead.

KaiserPro 141 days ago [-]

Oh nvme and commodity 10/40/100gig networks mean that NFS shares can be _faster_ than local disk

In 2008 when I was a youngen, 100tb filesystem that could sustain 1-3gigabytes of streaming throughput took something like 40 racks. Huge amounts of cost and power were needed to set it up and maintain it. Any kind of random IO would kneecap the performance for everyone

Now you can have a 2u server with 100tb of NVME storage and the only bottleneck is the network adaptor! not only that but its pretty cheap too.

Eikon 142 days ago [-]

Shameless plug: https://github.com/Barre/ZeroFS

I initially developed it for a usecase where I needed to store billions of tiny files, and it just requires a single s3 bucket as infrastructure.

HackerNewt-doms 137 days ago [-]

> ZeroFS - The Filesystem That Makes S3 your Primary Storage.

What is the motivation to use s3 as primary storage?

Datagenerator 142 days ago [-]

Interesting, can I use SeaweedFS as bucket provider?

Eikon 141 days ago [-]

If SeaweedFS supports conditional PUTs, yes.

mdaniel 141 days ago [-]

It claims to have it: https://github.com/seaweedfs/seaweedfs/wiki/S3-Conditional-O...

heipei 142 days ago [-]

Yeah, that was the first thing I checked as well. Being suited for small / tiny files is a great property of the SeaweedFS system.

pandemic_region 142 days ago [-]

What happens if you put a tiny file on it then? Bad perf, possible file corruption, ... ?

jleahy 142 days ago [-]

It's just not optimised for tiny files. It absolutely would work with no problems at all, and you could definitely use it to store 100 billion 1kB files with zero problems (and that is 100 terabytes of data, probably on flash, so no joke). However you can't use it to store 1 exabyte of 1 kilobyte files (at least not yet).

KaiserPro 142 days ago [-]

Bad space efficiency and possibly exhausting your inode system (ie there is space left on the device, but you can't put any files on it)

redundantly 142 days ago [-]

Probably wasting space and lower performance.

stonogo 142 days ago [-]

...which places it firmly in the "just like every other so-called exascale file system." We already had GPFS...

ttfvjktesd 142 days ago [-]

How does TernFS compare to CephFS and why not CephFS, since it is also tested for the multiple Petabyte range?

rostayob 142 days ago [-]

(Disclaimer: I'm one of the authors of TernFS and while we evaluated Ceph I am not intimately familiar with it)

Main factors:

* Ceph stores both metadata and file contents using the same object store (RADOS). TernFS uses a specialized database for metadata which takes advantage of various properties of our datasets (immutable files, few moves between directories, etc.).

* While Ceph is capable of storing PBs, we currently store ~600PBs on a single TernFS deployment. Last time we checked this would be an order of magnitude more than even very large Ceph deployments.

* More generally, we wanted a system that we knew we could easily adapt to our needs and more importantly quickly fix when something went wrong, and we estimated that building out something new rather than adapting Ceph (or some other open source solution) would be less costly overall.

mgrandl 142 days ago [-]

There are definitely insanely large Ceph deployments. I have seen hundreds of PBs in production myself. Also your usecase sounds like something that should be quite manageable for Ceph to handle due to limited metadata activity, which tends to be the main painpoint with CephFS.

rostayob 142 days ago [-]

I'm not fully up to date since we looked into this a few years ago, at the time the CERN deployments of Ceph were cited as particularly large examples and they topped out at ~30PB.

Also note that when I say "single deployment" I mean that the full storage capacity is not subdivided in any way (i.e. there are no "zones" or "realms" or similar concepts). We wanted this to be the case after experiencing situations where we had significant overhead due to having to rebalance different storage buckets (albeit with a different piece of software, not Ceph).

If there are EB-scale Ceph deployments I'd love to hear more about them.

mrngm 141 days ago [-]

Ceph has opt-in telemetry since a couple of years. This dashboard[0] panel suggests there are about 4-5 clusters (that send telemetry) within the 32-64 PiB range.

It would be really interesting to see larger clusters join in on their telemetry as well.

[0] https://telemetry-public.ceph.com/d/ZFYuv1qWz/telemetry?orgI...

mdaniel 141 days ago [-]

That was interesting, thank you. Other folks may also enjoy its sibling dashboard showing the used capacity https://telemetry-public.ceph.com/d/ZFYuv1qWz/telemetry?orgI...

mgrandl 142 days ago [-]

There are much larger Ceph clusters, but they are enterprise owned and not really publicly talked about. Sadly I can’t share what I personally worked on.

rostayob 142 days ago [-]

The question is whether there are single Ceph deployments are that large. I believe Hetzner uses Ceph for its cloud offering, and that's probably very large, but I'd imagine that no single tenant is storing hundreds of PBs in it. So it's very easy to shard across many Ceph instances. In our use-case we have a single tenant which stores 100s of PBs (and soon EBs).

ttfvjktesd 142 days ago [-]

Digital Ocean is also using Ceph[1]. I think these cloud providers could easily have 100s of PBs Clusters at their size, but it's not public information.

Even smaller company's (< 500 employees) in today's big data collection age often have more than 1 PB of total data in their enterprise pool. Hosters like Digital Ocean hosts thousands of these companies.

I do think that Ceph will hit performance issues at that size and going into the EB range will likely require code changes.

My best guess would be that Hetzner, Digital Ocean and similar, maintain their own internal fork of Ceph and have customizations that tightly addresses their particular needs.

[1]: https://www.digitalocean.com/blog/why-we-chose-ceph-to-build...

kachapopopow 142 days ago [-]

Ceph is more of: here's a raw block of data, do whatever the hell you want with it, not really good for immutable data.

mgrandl 142 days ago [-]

Well sure you would have to enforce immutability at the client side.

kachapopopow 142 days ago [-]

It's more that it has all the systems to allow mutability which add a lot of overhead when used as an immutable system.

eps 142 days ago [-]

Last point is an extremely important advantage that is often overlooked and denigrated. But having a complex system that you know inside-out because you made it from scratch pays in gold in the long term.

_jsmh 142 days ago [-]

Any compression at the filesystem level?

rostayob 142 days ago [-]

No, we have our custom compressor as well but it's outside the filesystem.

jleahy 142 days ago [-]

The seamless realtime intercontinental replication is a key feature for us, maybe the most important single feature, and AFAIK you can’t do that with Ceph (even if Ceph could scale to our original 10 exabyte target in one instance).

cmdrk 142 days ago [-]

CephFS implements a (fully?) POSIX filesystem while it seems that TernFS makes tradeoffs by losing permissions and mutability for further scale.

Their docs mention they have a custom kernel module, which I suppose is (today) shipped out of tree. Ceph is in-tree and also has a FUSE implementation.

The docs mention that TernFS also has its own S3 gateway, while RADOSGW is fully separate from CephFS.

jcul 142 days ago [-]

My (limited) understanding is that cephfs, RGW (S3), RBD (block device) are all different things using the same underlying RADOS storage.

You can't mount and access RGW S3 objects as cephfs or anything, they are completely separate (not counting things like goofys, s3fs etc.), even if both are on the same rados cluster.

Not sure if TernFS differs there, would be kind of nice to have the option of both kinds of access to the same data.

KaiserPro 142 days ago [-]

Ceph isn't that well suited for high performance. its also young and more complex than you'd want it to be (ie you get a block storage system, which you then have to put a FS layer on after.)

if you want performance, then you'll probably want lustre, or GPFS, or if you're rich a massive isilon system.

eps 142 days ago [-]

That was a good read. Compliments to the chefs.

It'd be helpful to have a couple of usage examples that illustrate common operations, like creating a file or finding and reading one, right after the high-level overview section. Just to get an idea what happens at the service level in these cases.

bitonico 142 days ago [-]

Yes, that would be very useful, we just didn't get to it and we didn't want perfect to be the enemy of good, since otherwise we would have never open sourced :).

But if we have the time it would definitely be a good addition to the docs.

mindslight 142 days ago [-]

Is anyone else bored of seeing the endless line of anti-human-scale distributed filesystems?

It's like the engineers building them keep trying to scratch their own itch for a better filesystem that could enable seamless cross-device usage, collaboration, etc. But the engineers only get paid if they express themselves in terms of corporate desires, and corpos aren't looking to pay them to solve those hard problems. So they solve the horizontal scaling problem for the thousandth time, but only end up creating things that requires a full time engineer (or perhaps even a whole team) to use. Hooray, another centralizing "distributed" filesystem.

maknee 140 days ago [-]

Great to see another distributed file system open sourced! It has some interesting design decisions.

Have a couple of questions:

- How do you go about benchmarking throughput / latency of such a system? Curious if it's different compared to how other distributed filesystems benchmark their systems.

- Is network or storage the bottleneck for nodes (at least for throughput)?

  - From my observations from RDMA-based distributed filesystems, network seems to be the case.

- How does the system respond to rand / seq + reads / writes? A lot of systems struggle to scale writes. Does this matter for what workload TernFS is designed for?

- Very very interesting to go down the path of writing a kernel module instead of using FUSE or writing a native client in userspace (referring to 3FS [1])

  - Any crashes in production? And how do you go about tracking it down?

  - What's the difference in performance between using the kernel module versus using FUSE?

[1] https://github.com/deepseek-ai/3FS/blob/main/docs/design_not...

hintymad 142 days ago [-]

> Most of the metadata activity is contained within a single shard: > > - File creation, same-directory renames, and deletion. > - Listing directory contents. > - Getting attributes of files or directories.

I guess this is a trade-off between a file system and an object store? As in S3, ListObjects() is a heavy hitter and there can be potentially billions of objects under any prefix. Scanning only on a single instance won't be sufficient.

jeffinhat 142 days ago [-]

It's definitely a different use case but given they haven't had to tap into their follower replicas for scale, it must be pretty efficient and lightweight. I suspect not having ACLs helps. They also cite a minimum 2MB size, so not expecting exabtyes of little bytes.

I wonder if a major difference is listing a prefix in object storage vs performing recursive listings in a file system?

Even in S3, performing very large lists over a prefix is slow and small files will always be slow to work with, so regular compaction and catching file names is usually worthwhile.

jleahy 141 days ago [-]

2MB median to be fair, so half of our files are under 2MB.

chatmasta 142 days ago [-]

> There's a reason why every major tech company has developed its own distributed filesystem

I haven't worked at FAANG, but is this a well-known fact? I've never heard of it. Unless they're referring to things like S3? Are these large corps running literal custom filesystem implementations?

foota 142 days ago [-]

Not sure about everyone, but probably yes. Imo the killer feature is durability and replication. You can use single disks if you don't care about data loss, but once you need to start replicating data you need a distributed filesystem.

Tectonic is Facebooks, Google's is Colossus. I'm not sure about the others.

StrangeDoctor 142 days ago [-]

Deepseek has their own and they’re relatively small https://github.com/deepseek-ai/3FS

It’s specialized knowledge, hard to do “correctly” (read posix here) but obtainable and implementable by a small team if you pick your battles right. Also supporting very specific use cases helps a lot.

It’s also pretty easy to justify as the hardware and software from vanguard tech companies is outrageously expensive. I used to develop software for a blue colored distributed filesystem.

allset_ 142 days ago [-]

I can only comment about the one I work for, but yes. It's also discussed publicly to some degree.

https://cloud.google.com/blog/products/storage-data-transfer...

loeg 142 days ago [-]

Yes, at least Facebook and Google have distributed file services. (Facebook's was historically based on HDFS, which was based on an old Google paper.)

mdaniel 142 days ago [-]

GPLv2-or-later, in case you were wondering https://github.com/XTXMarkets/ternfs/blob/7a4e466ac655117d24...

coolspot 142 days ago [-]

Licensing

TernFS is Free Software. The default license for TernFS is GPL-2.0-or-later.

The protocol definitions (go/msgs/), protocol generator (go/bincodegen/) and client library (go/client/, go/core/) are licensed under Apache-2.0 with the LLVM-exception. This license combination is both permissive (similar to MIT or BSD licenses) as well as compatible with all GPL licenses. We have done this to allow people to build their own proprietary client libraries while ensuring we can also freely incorporate them into the GPL v2 licensed Linux kernel.

hardwaregeek 142 days ago [-]

Hudson River Trading's distributed file system for comparison: https://www.hudsonrivertrading.com/hrtbeat/distributed-files...

rickette 142 days ago [-]

Cool. Also https://github.com/deepseek-ai/3FS by DeepSeek which came out of High-Flyer a Chinese HFT firm.

Balinares 141 days ago [-]

This is a ridiculously valuable piece of tech to be open sourcing, wow. My thanks to whoever fought that battle and made this happen.

rostayob 141 days ago [-]

There was no battle actually, everybody (well, my boss and my boss's boss) was very supportive of open sourcing. And thanks for the kind words!

d12bb 142 days ago [-]

> The firm started out with a couple of desktops and an NFS server, and 10 years later ended up with tens of thousands of high-end GPUs, hundreds of thousands of CPUs, and hundreds of petabytes of storage.

So much resources for producing nothing of real value. What a waste.

Great project though, appreciate open sourcing it.

candiddevmike 142 days ago [-]

If price action trading is horoscopes for adults, they're a modern a day oracle.

EugeneG 142 days ago [-]

In theory what they are doing of value, is that at any time you can go to an exchange and say "I want to buy x" or "I want to sell y" and someone will buy it from you our sell it from you... at a price that's likely to be the accurate price.

At the extreme if nobody was providing this service, investors (e.g. pension funds), wouldn't be confident that they can buy/sell their assets as needed in size and at the right price... and because of that, in aggregate stocks would be worth less, and companies wouldn't be able to raise as much capital.

The theoretical model is: - You want to have efficient primary markets that allow companies to raise a lot of assets at the best possible prices - To enable efficient primary markets, investors want efficient secondary markets (so they don't need to buy and hold forever, but feel they can sell) - To enable efficient secondary markets, you need many folks that are in the business of XTX ... it just so happens that XTX is quite good at it, and so they do a lot of this work.

lsecondario 142 days ago [-]

> In theory

> At the extreme

> The theoretical model

These qualifiers would seem to belie the whole argument. Surely the volume of HFT arbitrage is some large multiple of what would be necessary to provide commercial liquidity with an acceptable spread?

formerly_proven 142 days ago [-]

Does the HFT volume actually matter? Is it a real problem that the HFT volume exceeds the theoretical minimum amount of volume needed to maintain liquid markets?

scandox 142 days ago [-]

Your comment contradicts itself. They produced this project at least.

miovoid 142 days ago [-]

higher competition increases market efficiency - this is the real value

rob_c 141 days ago [-]

Or there's https://cvmfs.readthedocs.io/en/stable/

Beijinger 142 days ago [-]

Thanks a lot. Regarding your company, it is really hard for me to understand, what XTX is actually doing. Trading? VC? AI/ML?

jleahy 142 days ago [-]

Trading using ML.

sreekanth850 142 days ago [-]

Wow, great project.

nunobrito 142 days ago [-]

Thanks for sharing.

bananapub 142 days ago [-]

seems like a colossusly nice design.

arcade79 142 days ago [-]

Sadly lacking a nice big table to lay out the metadata on.

VikingCoder 142 days ago [-]

I see what you did there.

jleahy 142 days ago [-]

could be a tectonic shift in the open source filesystem landscape?

YouAreWRONGtoo 142 days ago [-]

[dead]

771753 142 days ago [-]

10000000000B أبي خشب وحديد درع الفضاء سراح الفضاء

eigenvalue 142 days ago [-]

This sounds like it would be a good underpinning for a decentralized blockchain file storage system with its focus on immutability and redundancy.

MadnessASAP 142 days ago [-]

But a blockchain is already immutable. It becomes decentralised and redundant if you have multiple nodes sharing blocks.

No need for an underpinning, it is the underpinning.

KaiserPro 142 days ago [-]

if you're storing the blocks in one place, its not decentralised.

The metadata would be crucial for performance, and given that I assume you'll want a full chain of history for every file, your metadata table will get progressively bigger every time you do any kind of metadata operation.

Plus you can only have one person write metadata at one time, so you're gonna get huge top of line blocking.

mrtesthah 142 days ago [-]

And yet no one needed a blockchain to implement this.

doctorpangloss 142 days ago [-]

Ha ha, I forecast, SPY goes up, and I’ve already made more money than XTX or any of its clients…

Look I like technology as much as anyone. Improbable spent $500 million on product development, and its most popular product is its grpc-web client. It didn't release any of its exotic technology. You could also go and spend that money on making $500m of games without any exotic technology, and also make it open source.

Rendered at 16:50:39 GMT+0000 (Coordinated Universal Time) with Vercel.