Show HN: Vortex – a high-performance columnar file format (github.com)
the_mitsuhiko 65 days ago [-]
> One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data within the file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to the file format specification.

That is quite interesting. One challenge in general with Parquet and Arrow in the OTel / observability ecosystem is that the shape of the data is not quite known with spans. There are arbitrary attributes on them, and they can change. To the best of my knowledge no particularly great solution exists today for encoding this. I wonder to what degree this system could be "abused" for that.

robert3005 65 days ago [-]
The thing we are trying to achieve is to be able to experiment with and tune the way data is grouped on disk. Parquet has one way of laying data out, CSV is another (though it's a text format, so a bit moot), ORC is another, and Lance has yet another method. The file format itself stores how it's physically laid out on disk, so you can tune and tweak physical layouts to match the specific storage needs of your system (this is the toolkit part, where you can take Vortex and use it to implement your own file format). Having said that, we will have an implementation of the file format that follows a particular layout.
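
A toy sketch of the "footer describes the physical layout" idea (made-up file name, JSON footer, and fields; Vortex uses its own binary footer, this just illustrates the concept):

  import json, struct

  columns = {"id": b"\x01\x02\x03\x04", "score": b"\x09\x08\x07\x06"}

  # Writer: blocks can land in any order; the footer records where each one is.
  with open("toy.bin", "wb") as f:
      layout = {}
      for name, blob in columns.items():
          layout[name] = {"offset": f.tell(), "length": len(blob)}
          f.write(blob)
      footer = json.dumps(layout).encode()
      f.write(footer)
      f.write(struct.pack("<I", len(footer)))  # last 4 bytes: footer length

  # Reader: start from the end, then fetch only the byte ranges it needs.
  data = open("toy.bin", "rb").read()
  footer_len = struct.unpack("<I", data[-4:])[0]
  layout = json.loads(data[-4 - footer_len:-4])
  meta = layout["score"]
  score_bytes = data[meta["offset"]:meta["offset"] + meta["length"]]
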
infogulch 65 days ago [-]
Wow, I think this is the thing I've wished existed for years! Most file formats leave a huge compression opportunity on the table just because of their choice of physical layout. (I call the simple case "striding order", idk.) But getting it right takes a lot of experimentation, which becomes too much churn for applications, and can result in storage layouts that are great for compression but annoying to code against. So the obvious answer (to me at least) is that you need to decouple physical and logical layouts. I'm glad someone is finally trying it!
sa46 65 days ago [-]
Parquet also encodes the physical layout using footers [1], as does ORC [2]. Perhaps the author meant support for semi-structured data, like the spans you mention.

[1]: https://parquet.apache.org/docs/file-format/

[2]: https://orc.apache.org/specification/ORCv2/#file-tail

danking00 65 days ago [-]
Yeah, we should be clearer in our description about how our footers differ from Parquet's. Parquet is a bit more prescriptive; for example, it requires row groups, which are not required by Vortex. If you have a column with huge values and another column of 8-bit ints, they can be paged separately, if you like.
gigatexal 65 days ago [-]
As someone who works in data, schema-on-read formats like Parquet are amazing. I hate having to guess schemas with CSVs.
physicsguy 65 days ago [-]
Pandera is quite nice for at least forcing validation in Pandas for this
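
For anyone who hasn't used it, a minimal sketch of what that looks like (the column names and checks below are made up for illustration):

  import pandas as pd
  import pandera as pda

  # Declare the schema you expect rather than guessing it from the CSV.
  schema = pda.DataFrameSchema({
      "user_id": pda.Column(int),
      "score": pda.Column(float, pda.Check.ge(0.0)),
  })

  df = pd.read_csv("scores.csv")
  validated = schema.validate(df)  # raises SchemaError if the data doesn't conform
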
marginalia_nu 65 days ago [-]
I've been experimenting with taking this self-description paradigm even farther, for a file format I've cooked up for ephemeral data in my search engine.

Basically, since I ended up building a custom library for this, I wanted to solve the portability problem by making it stupidly simple to reverse engineer, so I cooked up a convention where each column (and supporting column) is a file, with a file name that describes its format and role.

So a real-world production table looks like this if you ls in the directory (omitting a few columns for brevity):

  combinedId.0.dat.s64le.bin
  documentMeta.0.dat.s64le.bin
  features.0.dat.s32le.bin
  size.0.dat.s32le.bin
  termIds.0.dat-len.varint.bin
  termIds.0.dat.s64le[].zstd
  termMetadata.0.dat-len.varint.bin
  termMetadata.0.dat.s8[].zstd

The design goal is that just based on an ls output, someone who has never seen the code of the library producing the files should be able to trivially write code that reads it.
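
For what it's worth, here's a sketch of reading two of the fixed-width columns, written only from the ls listing above (the ".0." is presumably a shard index; the varint/zstd columns would take a bit more code):

  import numpy as np

  # size.0.dat.s32le.bin -> column "size", signed 32-bit little-endian
  sizes = np.fromfile("size.0.dat.s32le.bin", dtype="<i4")

  # combinedId.0.dat.s64le.bin -> signed 64-bit little-endian
  combined_ids = np.fromfile("combinedId.0.dat.s64le.bin", dtype="<i8")

  print(len(sizes), len(combined_ids))
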
gatesn 65 days ago [-]
Internally the design of Vortex is very similar. The file consists of a whole bunch of "messages" (your files), which then have some metadata attached, and the read logic decides which messages it needs when.
hiatus 65 days ago [-]
Do you have a deeper writeup of this anywhere?
marginalia_nu 65 days ago [-]
Not yet, but I will compile one at some point. I'm in the middle of moving right now so I don't quite have the time to sit down and finish the write-up...
agoose77 65 days ago [-]
For fun, the ROOT file format used in high energy physics has this kind of feature: https://root.cern.ch/root/SchemaEvolution.pdf

It's also a very old format, so not without its warts :)

jnordwick 65 days ago [-]
If it's in the footer, then appending to the columns is out of the question, it seems, without moving the footer.
yawnxyz 65 days ago [-]
I think it was in a blog post or a podcast (a16z with MotherDuck?) where they said Snowflake apparently largely solved this problem, but since it's proprietary and locked away, most people won't get a chance to use or implement it?

I have no idea since I've never had access to Snowflake...

cle 65 days ago [-]
Isn't this what the Arrow IPC File format does too? Is there something unique about this?
_willmanning 65 days ago [-]
Compression! Vortex can easily be 10x smaller than the equivalent Arrow representation (and decompresses very quickly into Arrow)
cle 65 days ago [-]
Nice!
Centigonal 65 days ago [-]
Let me see if I understand this right.

Vortex is a file format. In their canonicalized, uncompressed form, Vortex files are simply Apache Arrow IPC files with some of the bits and bobs moved around a bit (enabling transformation to/from Arrow), plus some extra metadata about types, summary statistics, data layout, etc.

The Vortex spec supports fancy strategies for compressing columns, the ability to store summary statistics alongside data, and the ability to specify special compute operations for particular data columns. Vortex also specifies the schema of the data as metadata, separately from the physical layout of the data on disk. All Arrow arrays can be converted zero-copy into Vortex arrays, but not vice-versa.

Vortex also supports extensions in the form of new encodings and compression strategies. The idea here is that, as new ways of encoding data appear, they can be supported by Vortex without creating a whole new file format.

Vortex-serde is a serde library for Vortex files. In addition to classic serialization/deserialization, it supports giving applications access to all those fancy compute and summary statistics features I mentioned above.

You say "Vortex is a toolkit for working with compressed Apache Arrow arrays in-memory, on-disk, and over-the-wire," but that's kind of like saying "MKV is a toolkit for working with compressed AVI and WAV files." It sounds like Vortex is a flexible file spec that lets you:

1. Work with Arrow arrays on disk with options for compression.

2. Create files that model data that can't be modeled in Arrow due to Arrow's hard coupling between encoding and logical typing.

3. Utilize a bunch of funky and innovative new features not available in existing data file formats and probably only really interesting to people who are nerds about this (laypeople will be interested in the performance improvements, though).

miere 65 days ago [-]
Imagine explaining to a newcomer that you write your app using Vert.x, it consumes AI models from GCP Vertex and uses Vortex for its high-performance columnar file structure.
grumpy-cowboy 52 days ago [-]
A newcomer with a Vertx backpack (vertx.com) ;)
speed_spread 65 days ago [-]
You forgot to mention Verticle and Vectrex.
CapeTheory 64 days ago [-]
Also, the demo site is running on Vercel and the docs are written in LaTeX.
p2detar 65 days ago [-]
Vert.x has Verticles
mikeqq2024 65 days ago [-]
confusing indeed
ericyd 65 days ago [-]
Thank God this file format is written in Rust, otherwise I'd be extremely skeptical.
smartmic 65 days ago [-]
It's funny how "written in Rust" has become a running gag here on HN - but only if mentioned already in the title…
keybored 65 days ago [-]
Is this a pun or something?
ericyd 65 days ago [-]
I was being sarcastic, yes. Also the title used to include "written in Rust"
neeh0 65 days ago [-]
It gave me a moment of pause why Rust is part of the equation, but I concluded I'm too dumb
aduffy 65 days ago [-]
Buried under the memes/vibes there is an actual reason this is important for data tools.

The previous generation of analytics/"Big Data" projects (think Hadoop, Spark, Kafka, Elastic) were all built on the JVM. They were monolithic distributed systems clusters hosted on VMs or on-premise. They were servers with clients implemented in Java. It is effectively impossible to embed a Java library into anything non-Java; the best you can do is fork a JVM with a carefully maintained classpath and hit it over the network (cf. PySpark). Kafka has externally maintained bindings that lag the official JVM client.

Parquet was built during this era, so naturally its reference implementation was written in Java. For many years, the only implementation of Parquet was in Java. Even when parquet-cpp and subsequent implementations began to pop up, the Parquet Java implementation was still the best maintained. Over time as the spec got updated and new features made their way into Parquet, different implementations had different support. Files written by parquet-cpp or parquet-rs could not be opened via Spark or Presto.

The newer generation of data analytics tooling is meant to be easily embedded, so that generally means a native language that can export shared objects with a C ABI that can be consumed by the FFI layer of different languages. That leaves you a few options, and of those Rust is arguably the best for reasons of tooling and ecosystem, though different projects make different choices. DuckDB, for example, is an extremely popular library with bindings in several languages, and it was built in C++ long after Rust became in vogue.
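
As a sketch of what "consumed by the FFI layer of different languages" looks like from Python, using ctypes (the shared object and symbol below are hypothetical, not a Vortex API):

  import ctypes

  # Hypothetical native library exporting a C ABI function:
  #   int64_t sum_i64(const int64_t *values, size_t len);
  lib = ctypes.CDLL("./libexample.so")
  lib.sum_i64.argtypes = [ctypes.POINTER(ctypes.c_int64), ctypes.c_size_t]
  lib.sum_i64.restype = ctypes.c_int64

  values = (ctypes.c_int64 * 4)(1, 2, 3, 4)
  print(lib.sum_i64(values, len(values)))  # no JVM to fork, just a dlopen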

While Vortex doesn't (yet) have a C API, we do have Python bindings that we expect to be the main way people use it.

beAbU 65 days ago [-]
For a while "written in Rust" was sort of a "trust me, bro" label. The hivemind asserted that something written in Rust must be automatically good and safe, because Rust is good and safe.

Thank god everyone wised up. The tool maketh not the craftsman. These days the "written in Rust" tag is met with knee-jerk skepticism, as if the hive mind overcorrected.

cnity 64 days ago [-]
Now the pendulum has swung too far in the opposite direction. The linked repo README makes no song and dance, barely even a mention (and even then just to explain some setup alongside Python instructions), and yet here we are obsessing over the source language of the repo to no benefit.
gkapur 65 days ago [-]
Not an expert in the space at all, but it does seem like people are exploring new file and table formats, so that is really cool!

How does this compare to Lance (https://lancedb.github.io/lance/)?

What do you think the key applied use case for Vortex is?

jarpineh 65 days ago [-]
I do applaud this kind of work. Better, faster tooling for data as files and moving across runtimes is sorely needed.

Two things I would hope to see before I'd start using Vortex are geospatial data support (there's already GeoParquet [1]) and WASM for in-browser Arrow processing. Things like Lonboard [2] and Observable Framework [3] rely on Parquet, Arrow and DuckDB files for their powerful data analytics and visualisation.

[1] https://geoparquet.org

[2] https://developmentseed.org/lonboard/latest/

[3] https://observablehq.com/framework/

jagged-chisel 65 days ago [-]
“Vortex is a toolkit for working with compressed Apache Arrow arrays in-memory, on-disk, and over-the-wire.”

So it’s a toolkit written in Rust. It is not a file format.

_willmanning 65 days ago [-]
Perhaps that verbiage is just confusing. "On-disk" sort of implies "file format" but could be more explicit.

That said, the immediate next line in the README perhaps clarifies a bit?

"Vortex is designed to be to columnar file formats what Apache DataFusion is to query engines (or, analogously, what LLVM + Clang are to compilers): a highly extensible & extremely fast framework for building a modern columnar file format, with a state-of-the-art, "batteries included" reference implementation."

jagged-chisel 65 days ago [-]
“Vortex is […] a highly extensible & extremely fast framework for building a modern columnar file format.”

It’s a framework for building file formats. This does not indicate that Vortex is, itself, a file format.

aduffy 65 days ago [-]
Will and I actually work on Vortex :wave:

Perhaps we should clean up the wording in the intro, but yes there is in fact a file format!

We actually built the toolkit first, before building the file format. The interesting thing here is that we have a consistent in-memory and on-disk representation of compressed, typed arrays.

This is nice for a couple of reasons:

(a) It makes it really easy to test out new compression algorithms and compute functions. We just implement a new codec and it's automatically available for the file format.

(b) We spend a lot of energy on efficient push down. Many compute functions such as slicing and cloning are zero-cost, and all compute operations can execute directly over compressed data.

Highly encourage you to check out the vortex-serde crate in the repo for file format things, and the vortex-datafusion crate for some examples of integrating the format into a query engine!
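
A toy illustration of point (b), with run-length encoding standing in for the real encodings (this is the general idea, not Vortex code):

  # Logical array: [7, 7, 7, 7, 0, 0, 3, 3, 3, 3, 3]
  run_values = [7, 0, 3]
  run_lengths = [4, 2, 5]

  # Sum without materializing the decoded array.
  total = sum(v * n for v, n in zip(run_values, run_lengths))

  decoded = [v for v, n in zip(run_values, run_lengths) for _ in range(n)]
  assert total == sum(decoded)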

kwillets 65 days ago [-]
Does this fragment columns into rowgroups like Parquet, or is it more of a pure columnstore? IME a data warehouse works much better if each column isn't split into thousands of fragments.
danking00 65 days ago [-]
Yeah, you and us are on the same page (heh). We don’t want the format to require row grouping. The file format has a layout schema written in a footer. A row group style layout is supported but not required. Specification of the layout will probably evolve, but currently the in-memory structure becomes the on-disk structure. So, if you have a ChunkedArray of StructArray of ChunkedArray you’ll get row groups and pages within them. If you had a StructArray of ChunkedArray you’ll just get per-column pages.

I’m working on the Python API now. I think we probably want the user to specify, on write, whether they want row groups or not and then we can enforce that as we write.

gazpacho 65 days ago [-]
Very cool! Any plans to offer more direct integrations with DataFusion, e.g. a `VortexReaderFactory`, hooks for pushdowns, etc?
aduffy 65 days ago [-]
We have a TableProvider for use with DataFusion; check out this crate and its examples: https://github.com/spiraldb/vortex/tree/develop/vortex-dataf...
gazpacho 65 days ago [-]
Thanks!
jcgrillo 64 days ago [-]
This is awesome, you folks are doing great work. I also really enjoyed your blog posts on FSST[1] and FastLanes[2].

[1] https://blog.spiraldb.com/compressing-strings-with-fsst/ [2] https://blog.spiraldb.com/life-in-the-fastlanes

xiaodai 65 days ago [-]
There are a bunch of these, including fst in the R ecosystem, JDF.jl in the Julia ecosystem, etc.
danking00 65 days ago [-]
Thanks for introducing me to these other formats! I hadn't heard of them yet. All three of fst, JDF, and Vortex appear to share the goal of high-throughput (de)serialization of tabular data and random access to the data. However, it is not clear to me how JDF and fst permit random access on compressed data, because both appear to use block compression (respectively Blosc, and LZ4 or Zstd). While both Blosc and Zstd are extremely fast, accessing a single value of a single row necessarily requires decompressing a whole block of data. Instead of O(1) random access you get O(N_ROWS_PER_BLOCK) random access.

In Vortex, we've specifically invested in high throughput compression techniques that admit O(1) random access. These kinds of techniques are also sometimes called "lightweight compression". The DuckDB folks have a good writeup [1] on the common ones.

[1] https://duckdb.org/2022/10/28/lightweight-compression.html
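
As an example, frame-of-reference encoding (one of the schemes in the DuckDB write-up) keeps O(1) random access; a toy sketch, skipping the bit-packing a real implementation would add:

  data = [10_000, 10_003, 10_001, 10_007, 10_002]

  base = min(data)
  deltas = [x - base for x in data]  # small values, bit-packed in practice

  def get(i):
      # Reconstruct one element without touching the rest of the block.
      return base + deltas[i]

  assert get(3) == 10_007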

kwillets 64 days ago [-]
This paper compares the benefits of lightweight compression and other techniques:

https://blog.acolyer.org/2018/09/26/the-design-and-implement...

xiaodai 64 days ago [-]
I see. Very nice. So it's a trade-off. I imagine the throughput of these lightweight compression schemes suffers a little. In analytical workloads, it's common to do things like compute the mean of a vector or compute the gradient for a batch of data, so random access appears to be less of an issue there.
danking00 64 days ago [-]
We’ll post a blog post soon with specific, benchmarked numbers, but, in this case, you can have your cake and eat it too!

The compression and decompression throughputs of Vortex (and other lightweight compression schemes) are similar or better than Parquet for many common datasets. Unlike Zstd or Blosc, the lightweight encodings are, generally, both computationally simple and SIMD friendly. We’re seeing multiple gibibytes per second on an M2 MacBook Pro on various datasets in the PBI benchmark [1].

The key insight is that most data we all work with has common patterns that don't require sophisticated, heavyweight compression algorithms. Let's take advantage of that fact to free up more cycles for compute kernels!

[1] https://github.com/cwida/public_bi_benchmark

xiaodai 64 days ago [-]
Cool looking forward to it.
Havoc 65 days ago [-]
Can one edit it in place?

That’s the main thing currently irritating me about parquet

aduffy 65 days ago [-]
You're unlikely to find this with any analytic file format (including Vortex). The main reason is that OLAP systems generally assume an immutable distributed object/block layer (S3, HDFS, ABFS, etc.).

It's then generally up to a higher-level component called a table format to handle the idea of edits. See for example how Apache Iceberg handles deletes https://iceberg.apache.org/spec/#row-level-deletes

Havoc 65 days ago [-]
I see. Hadn’t made the connection to S3 etc. that makes sense though. Thanks for explaining
slotrans 65 days ago [-]
This is true, and in principle a good thing, but in the time since Parquet and ORC were created GDPR and CCPA are things that have come to exist. Any format we build in that space, today, needs to support in-place record-level deletion.
aduffy 65 days ago [-]
Yea so the thing you do for this is called "compaction", where you effectively merge the original + edits/deletes into a new immutable file. You then change your table metadata pointer to point at the new compacted file, and delete the old files from S3.

Due to the way S3 and its ilk are structured as globally replicated KV stores, you're not likely to get in-place edits anytime soon, and until the cost structure incentivizes otherwise you're going to continue to see data systems that favor immutable cloud storage.
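
A rough sketch of that compaction step, using Parquet via pyarrow as a stand-in format (file names and the delete representation here are illustrative, not Iceberg's actual spec):

  import pyarrow as pa
  import pyarrow.parquet as pq

  table = pq.read_table("data-000.parquet")
  deleted_positions = {3, 17, 42}  # row-level deletes tracked by the table format

  mask = pa.array([i not in deleted_positions for i in range(table.num_rows)])
  compacted = table.filter(mask)

  pq.write_table(compacted, "data-001.parquet")
  # The table metadata is then repointed at data-001 and data-000 is deleted.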

mkesper 65 days ago [-]
You can avoid that if you store only per-user encrypted content (expensive, I know). That way you should just have to revoke that key to remove access to the data. The advantage is that you cannot forget any old backup, etc.
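
A sketch of that "crypto-shredding" approach (per-user keys via the cryptography package; the key store and record layout here are simplified assumptions):

  from cryptography.fernet import Fernet

  # Per-user keys live in a small mutable key store, separate from the data files.
  keys = {"user-123": Fernet.generate_key()}

  def encrypt_record(user_id, payload):
      return Fernet(keys[user_id]).encrypt(payload)

  ciphertext = encrypt_record("user-123", b'{"email": "a@example.com"}')

  # Erasure request: drop the key, and every copy of the ciphertext
  # (old backups included) becomes unreadable.
  del keys["user-123"]
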
FridgeSeal 65 days ago [-]
I mean, you can have it; you've just got to be happy to bear the cost of rewriting the file every time you mutate a row.
runeblaze 65 days ago [-]
Did not read too deeply into the original post, but if you use Arrow you can (not sure if one ever should) do random lookups into the storage buffer and wipe out bytes + put tombstones, at least if the column is encoded "naively".

Of course, if your Arrow file is in some object store, how you delete random bytes from it is unclear.

xiaodai 64 days ago [-]
Question: if Vortex can "canonicalize" Arrow vectors, why doesn't Arrow incorporate this feature?
Bnjoroge 65 days ago [-]
How does this compare to Lance?