We built a new powerful JSON data type for ClickHouse (clickhouse.com)
fuziontech 2 hours ago [-]
Using ClickHouse is one of the best decisions we've made here at PostHog. It has allowed us to scale performance all while allowing us to build more products on the same set of data.

Since we've been using ClickHouse long before this JSON functionality was available (or even before the earlier version of it, `Object('json')`, was available), we ended up setting up a job that would materialize JSON fields out of a JSON blob into materialized columns, based on query patterns against the keys in the blob. Then, once those materialized columns were created, we would route queries to them at runtime if they were available. This saved us a _ton_ on CPU and IO utilization. Even though ClickHouse uses some really fast SIMD JSON functions, the best way to make a computer go faster is to make it do less, and this new JSON type does exactly that. And it's so turnkey!

https://posthog.com/handbook/engineering/databases/materiali...
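
A minimal sketch of that pattern, with hypothetical table and key names: extract a hot JSON key into a materialized column once, then route queries to the cheap column instead of re-parsing the blob per row.

    -- Raw events land with the JSON blob in a String column.
    CREATE TABLE events
    (
        uuid UUID,
        properties String
    )
    ENGINE = MergeTree
    ORDER BY uuid;

    -- Extract a frequently queried key once, at insert/merge time.
    ALTER TABLE events
        ADD COLUMN mat_plan String
        MATERIALIZED JSONExtractString(properties, 'plan');

    -- Route queries to the cheap column...
    SELECT count() FROM events WHERE mat_plan = 'enterprise';
    -- ...instead of parsing the blob on every row:
    SELECT count() FROM events
    WHERE JSONExtractString(properties, 'plan') = 'enterprise';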

The team over at ClickHouse Inc., as well as the community behind it, moves surprisingly fast. I can't recommend it enough, and I'm excited for everything else on the roadmap, especially what's on the horizon with Parquet and Iceberg support.

breadwinner 1 hour ago [-]
If you're evaluating ClickHouse, take a look at Apache Pinot as well. ClickHouse was designed for single-machine installations, although it has been enhanced to support clusters. But this support is lacking: for example, if you add additional nodes, it is not easy to redistribute data. Pinot is much easier to scale horizontally. Also take a look at Pinot's star-tree indexes [1]. If you're doing multi-dimensional analysis (pivot tables etc.), there is a huge difference in performance if you take advantage of star-tree.

[1] https://docs.pinot.apache.org/basics/indexing/star-tree-inde...

zX41ZdbW 9 minutes ago [-]
> ClickHouse was designed for single-machine installations

This is incorrect. ClickHouse was designed for distributed setups from the beginning, including cross-DC installations. It was used on large production clusters even before it was open-sourced. When it became open source in June 2016, the largest cluster was 394 machines across 6 data centers, with 25 ms RTT between the most distant ones.

haolez 1 hour ago [-]
What's the use case? Analytics on humongous quantities of data? Something besides that?
breadwinner 58 minutes ago [-]
The use case is "user-facing analytics". For example, consider ordering food from Uber Eats: you have thousands of concurrent users, latency should be in milliseconds, and things like delivery time estimates must be updated in real time.

Spark can do analysis on huge quantities of data, and so can Microsoft Fabric. What Pinot can do that those tools can't is extremely low latency (milliseconds vs. seconds), high concurrency (thousands of queries per second), and updating data in real time.

Excellent intro video on Pinot: https://www.youtube.com/watch?v=_lqdfq2c9cQ

listenallyall 46 minutes ago [-]
I don't think Uber's estimated time-to-arrival is a statistic a database vendor, or development team, should brag about. It's horribly imprecise.
whalesalad 10 minutes ago [-]
I thought “humongous quantities of data” was a baseline assumption for a discussion involving ClickHouse et al.?
ramraj07 3 hours ago [-]
Great to see it in ClickHouse.

Snowflake released a white paper back before its IPO that mentioned this same feature (secretly exploding JSON into columns). It explains why Snowflake feels faster than it should: they've quietly done a lot of amazing things and just offered it all as a polished product, like Apple.

everfrustrated 2 hours ago [-]
>Dynamically changing data: allow values with different data types (possibly incompatible and not known beforehand) for the same JSON paths without unification into a least common type, preserving the integrity of mixed-type data.

I'm so excited for this! One of my major bugbears with storing logs in Elasticsearch is the set-type-on-first-seen-occurrence headache.

Hope to see this leave experimental status soon!
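
As a minimal sketch of the mixed-type behavior quoted above (table name illustrative; the type is experimental at the time of the post, hence the setting): the same JSON path can hold different types in different rows, and you can inspect which type each row got.

    SET allow_experimental_json_type = 1;

    CREATE TABLE logs (data JSON) ENGINE = MergeTree ORDER BY tuple();

    -- The same path holds an integer in one row and a string in another,
    -- with no coercion to a least common type.
    INSERT INTO logs VALUES ('{"status": 200}'), ('{"status": "OK"}');

    SELECT data.status, dynamicType(data.status) FROM logs;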

abe94 5 hours ago [-]
We've been waiting for more JSON support in ClickHouse. The new type looks promising, and the dynamic columns, with no need to specify subtypes, are particularly helpful for us.
notamy 3 hours ago [-]
Clickhouse is great stuff. I use it for OLAP with a modest database (~600 million rows, ~300 GB before compression) and it handles everything I throw at it without issues. I'm hopeful this new JSON data type will be better at a use case that I currently solve with nested tuples.
philosopher1234 2 hours ago [-]
Postgres should be good enough for 300GB, no?
whalesalad 8 minutes ago [-]
For write-heavy workloads I find Postgres to be a dog, tbh. I use it everywhere but am anxious to try new tools.

For truly big data (terabytes per month) we rely on BigQuery. For smaller data that is more OLTP write-heavy we use Postgres... but I think there is room in the middle.

tempest_ 53 minutes ago [-]
It depends. If you want to do any kind of aggregation, counts, or count distinct, pg falls over pretty quickly.
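
For comparison (hypothetical table and column names), ClickHouse lets you choose between approximate and exact distinct counts, which is a big part of why it holds up under this kind of workload:

    -- Approximate distinct count: fast, with bounded memory.
    SELECT uniq(user_id) FROM events;

    -- Exact distinct count, for when you need precision.
    SELECT uniqExact(user_id) FROM events;
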
notamy 2 hours ago [-]
Probably, but Clickhouse has been zero-maintenance for me, and my dataset is growing at 100-200 GB/month. Having Clickhouse's automatic compression makes me worry a lot less about disk space.
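
If you're curious where that compression lands, ClickHouse exposes per-column sizes in its system tables (table name hypothetical):

    SELECT
        name,
        formatReadableSize(data_compressed_bytes) AS compressed,
        formatReadableSize(data_uncompressed_bytes) AS uncompressed
    FROM system.columns
    WHERE table = 'events'
    ORDER BY data_compressed_bytes DESC;
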
trollied 1 hour ago [-]
What on earth has gone wrong? Just use a database with rows and columns.

The amount of extra storage required to deal with metadata about unstructured data is crazy.

JSON is OK for chucking data about.

rockostrich 49 minutes ago [-]
Analytical databases do have rows and columns. What do you do when you're ingesting TBs, if not PBs, of unstructured data and need to make it actually usable?

A couple of MBs (or even GBs) of storage for metadata is peanuts compared to the actual data, not to mention the material savings from storing it in a column-oriented engine.

officex 4 hours ago [-]
Great to see! I remember checking you guys out in Q1, great team
baq 3 hours ago [-]
Clickhouse is criminally underused.

It's common knowledge that 'Postgres is all you need', but if you somehow reach the stage of 'Postgres isn't all I need, and I have hard proof', this should be the next tech you look at.

Also, clickhouse-local is rather amazing at CSV processing using SQL. Highly recommended for when you are fed up with Google Sheets or even Excel.
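
For example (file and column names hypothetical), you can run SQL straight over a CSV from the shell:

    -- Run with: clickhouse local --query "SELECT ..."
    SELECT
        customer,
        sum(amount) AS total
    FROM file('orders.csv', CSVWithNames)
    GROUP BY customer
    ORDER BY total DESC
    LIMIT 10;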

mrsilencedogood 3 hours ago [-]
This is my take too. At one of my old jobs, we were early (very early) to the Hadoop and then Spark games. Maybe too early, because by the time Spark 2 made it all easy, we had already written a lot of MapReduce-streaming and then some RDD-based code. Towards the end of my tenure there, I was experimenting with alternate datastores, and ClickHouse was one I evaluated. It worked really, really well in my demos. But I couldn't get buy-in because management was a little wary of the Russian side of it (which they have now distanced/divorced from, I think?) and also didn't really have the appetite for such a large undertaking anymore (the org was going through some things). So instead a different team blessed by the company owner basically DIY'd a system to store .feather files on NVMe SSDs... anyway.

If I were still there, I'd be pushing a lot harder to finally throw away the legacy system (which has lost so many people it's basically ossified anyway) and just "rebase" it all onto ClickHouse and PySpark SparkSQL. We would throw away so much shitty cruft, and a lot of the newer MapReduce and RDD code is portable enough that it could be plugged into RDD's pipe() method.

Anyway: at my current job, we just stood up a new product that, from day 1, was ingesting billions of rows of event data (~nothing for ClickHouse, to be clear, but obviously way too much for pg). And it's just chugging along. ClickHouse is definitely in my toolbox right after Postgres, as you state.

osigurdson 2 hours ago [-]
Agree. CH is a great technology to have some awareness of. I use it for "real things" (100B+ data points), but honestly it can really simplify little things as well.

I'd throw in one more to round it out, though. The three rings of power are Postgres, ClickHouse, and NATS. Postgres is the most powerful ring, however, and lots of times it's all you need.

oulipo 3 hours ago [-]
Would you recommend ClickHouse over DuckDB? And why?
PeterCorless 3 hours ago [-]
Note that every use case is different and YMMV.

https://www.vantage.sh/blog/clickhouse-local-vs-duckdb

hn1986 1 hour ago [-]
Great link. Curious how it compares now that DuckDB is 1.0+.
nasretdinov 2 hours ago [-]
IMO the only reason not to use ClickHouse is when you have either a "small" amount of data or "small" servers (<100 GB of data, servers with <64 GB of RAM). Otherwise ClickHouse is the better solution, since it's a standalone DB that supports replication and in general has very robust cluster support, easily scaling to hundreds of nodes.

Typically you discover the need for an OLAP DB when you reach that scale, so I'm personally not sure what the real use case for DuckDB is, to be completely honest.
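
A minimal sketch of that replication support, assuming a cluster with the standard {shard} and {replica} macros configured (table and ZooKeeper path names hypothetical):

    CREATE TABLE events_replicated
    (
        ts DateTime,
        body String
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
    ORDER BY ts;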

geysersam 59 minutes ago [-]
DuckDB probably performs better per core than ClickHouse does for most queries. So as long as your workload fits on a single machine (it likely does), it's often the most performant option.

Besides, it's so simple, just a single executable.

Of course if you're at a scale where you need a cluster it's not an option anymore.

zX41ZdbW 2 minutes ago [-]
The good parts of DuckDB that you've mentioned, including the fact that it is a single executable, are modeled after ClickHouse.
theLiminator 3 hours ago [-]
Not to mention Polars, DataFusion, etc. The single-node OLAP space is really heating up.
fiddlerwoaroof 2 hours ago [-]
Clickhouse scales from a local tool like DuckDB up to a database cluster that can back your reporting and other OLAP applications.
CalRobert 2 hours ago [-]
Clickhouse and Postgres are just different tools, though: OLAP vs. OLTP.
anonygler 3 hours ago [-]
I keep misreading this company as ClickHole and expecting some sort of satirical content.