SQLite-based databases on the Postgres protocol (blog.chiselstrike.com)
maxmcd 11 days ago [-]
Wild stuff

- Fork of SQLite with new features https://github.com/libsql/libsql

- And you can embed it in your application and it will really talk to your SQLite/libsql database over the network (subject of this blogpost): https://github.com/libsql/sqld

- Oh, and if you're wondering about backup to S3, they have that too: https://github.com/libsql/bottomless

- Uh, sqld can be integrated with this https://github.com/losfair/mvsqlite, so now your SQLite is backed by FoundationDB!?

- Meanwhile Litestream exists https://github.com/benbjohnson/litestream/

- Ben is developing https://github.com/superfly/litefs at Fly so that you can talk to your SQLite through a FUSE filesystem. (and work has stopped on the streaming replication effort https://github.com/benbjohnson/litestream/issues/8)

- And, of course, SQLite has announced a new backend that hopes to support concurrent writes and streaming replication: https://sqlite.org/hctree/doc/hctree/doc/hctree/index.html

What a time for SQLite

Presumably all of these things provide a wildly different experience in terms of consistency, availability, latency, complexity, and concurrent use. Would be so nice to read a lengthy blogpost comparing all or some of these with their pros and cons.

eduction 11 days ago [-]
> Presumably all of these things provide a wildly different experience in terms of consistency, availability, latency, complexity, and concurrent use. Would be so nice to read a lengthy blogpost comparing all or some of these with their pros and cons.

I’m glad you’re into it but none of that sounds fun or appealing to me. I have no issue with SQLite or any of these projects, but my default for stuff that needs eg concurrent writes or streaming replication is Postgres which was designed from the ground up to be this sort of database and had those features built into the mainline release for some time.

I trust the default of Postgres way more than I trust myself to pick some new fork of SQLite via what I learn in blog posts and have it be well fit to the problem and reliable and adaptable.

I get that SQLite can be easier to set up and faster on reads but I feel like we’ve been down this road before trading away Postgres for those sorts of seeming gains - I’m getting MongoDB flashbacks.

Don’t get me wrong, I’d choose SQLite for a local app storage solution in a heartbeat but I’m wary of overstretching it.

aaviator42 11 days ago [-]
I wrote a small PHP library that gives you a key-value storage interface to SQLite files: https://github.com/aaviator42/StorX

I've been dogfooding for a while by using it in my side projects.

And there's a basic API too, to use it over a network: https://github.com/aaviator42/StorX-API
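The idea is simple enough to sketch. Here's a hypothetical minimal version using Python's built-in sqlite3 module (illustrative only; this is not StorX's actual API, which is PHP):

```python
import sqlite3


class KV:
    """Minimal key-value wrapper over a SQLite file (hypothetical sketch)."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)"
        )

    def put(self, key: str, value: str) -> None:
        # Upsert: requires SQLite 3.24+ for ON CONFLICT ... DO UPDATE
        with self.db:
            self.db.execute(
                "INSERT INTO kv(k, v) VALUES(?, ?) "
                "ON CONFLICT(k) DO UPDATE SET v = excluded.v",
                (key, value),
            )

    def get(self, key: str, default=None):
        row = self.db.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        return row[0] if row else default


store = KV(":memory:")  # pass a file path to persist
store.put("greeting", "hello")
assert store.get("greeting") == "hello"
```

The durability and locking semantics come for free from SQLite, which is most of the appeal of building a KV layer this way.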

vdm 11 days ago [-]
ICYMI, similar concept for python https://dataset.readthedocs.io/
maxmcd 11 days ago [-]
Questions for libsql:

- Looks like bottomless does automatic restoring of the database? Litestream seems to avoid this, I assume because of concerns about accidentally creating multiple writers on a rolling-deploy (or similar mistake). Any concerns about this possible footgun?

- Bottomless is a sqlite extension, not a separate process? Pros and cons compared to Litestream?

- Similar questions for sqld. How does sqld handle the lifecycle of deploys and multiple readers/writers talking to the database? Anything a new user should be concerned about?

psarna 11 days ago [-]
First of all, let me start by reiterating the first sentence from the bottomless repo - it's work in heavy progress (and we'll move it under the sqld/ repo soon, to keep everything in one place). Now, answers:

> - Looks like bottomless does automatic restoring of the database? Litestream seems to avoid this, I assume because of concerns about accidentally creating multiple writers on a rolling-deploy (or similar mistake). Any concerns about this possible footgun?

It's a valid concern, but what always happens on boot is starting a new generation (a generation is basically a snapshot of the main file + its continuously replicated WAL), distinguished by a uuid v7, the timestamped one. So even if a collision happens, it would be recoverable - e.g. one of the stray generations should be deleted.
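For readers unfamiliar with UUIDv7: the leading bits are a millisecond timestamp, so generation ids sort by creation time, which is what makes stray generations easy to order and clean up. A rough sketch of the idea (not bottomless's actual code; the version/variant bits are omitted for brevity):

```python
import os
import time


def uuid7_like() -> str:
    """Build a UUIDv7-style identifier: 48-bit millisecond timestamp up
    front, random bits after, so ids sort lexicographically by creation
    time. (Simplified: real UUIDv7 also sets version/variant bits.)"""
    ts_ms = int(time.time() * 1000)
    ts_bytes = ts_ms.to_bytes(6, "big")   # 48-bit timestamp prefix
    rand_bytes = os.urandom(10)           # 80 random bits
    raw = (ts_bytes + rand_bytes).hex()
    return f"{raw[:8]}-{raw[8:12]}-{raw[12:16]}-{raw[16:20]}-{raw[20:]}"


a = uuid7_like()
time.sleep(0.002)
b = uuid7_like()
assert a < b  # lexicographic order tracks creation time
```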

> - Bottomless is a sqlite extension, not a separate process? Pros and cons compared to litestream?

The only con I see is that a bug in the extension could interfere with the database. As for pros: way less maintenance work, because everything is already embedded, we're also hooked into the database via a virtual WAL interface (libSQL-only), so we have full control over when to replicate, without having to observe the .wal file and reason about it from a separate process.

> - Similar questions for sqld. How does sqld handle the lifecycle of deploys and multiple readers/writers talking to the database? Anything a new user should be concerned about?

There are going to be multiple flavors of sqld, but the rough idea would be to trust the users to only launch a single primary. In the current state of the code, replicas contact the primary in order to register themselves, so the model is centralized. Once we build something on top of a consensus algorithm, leader election will be pushed to the algorithm itself.

maxmcd 11 days ago [-]
Very cool, thank you for the insights
stuaxo 11 days ago [-]
Related: michaelc @ uktrade did some interesting work streaming sqlite with python in this repo - https://github.com/uktrade/stream-sqlite
v3ss0n 9 days ago [-]
Cool, but NIH syndrome... I don't see any point here. SQLite was doing one thing and doing it well. The rest is well handled by pg.
houqp 11 days ago [-]
bottomless looks really nice, thanks for sharing!
houqp 11 days ago [-]
Very cool and well executed project. Love the sprinkle of Rust in all the other companion projects as well :)

The ROAPI(https://github.com/roapi/roapi) project I built also happened to support a similar feature set, i.e. to expose sqlite through a variety of remote query interfaces including pg wire protocols, rest apis and graphqls.

cutler 11 days ago [-]
Excuse my ignorance, but isn't the whole point of SQLite that it's embedded, not client-server?
bob1029 11 days ago [-]
For us, a major point of SQLite is that we can execute queries in ~100µs when deployed on top of NVMe storage. Everything being in the same process is 50% of the value proposition for our product. Once you involve a network and another computer, it gets a lot slower unless you are doing RDMA-style shenanigans with very expensive hardware.

The other 50% of the value proposition is simplified operations in a B2B setting.
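The in-process claim is easy to sanity-check. A toy measurement with Python's built-in sqlite3 (numbers vary wildly by hardware and workload; a trivial point query on an in-memory table is not the same as bob1029's production queries):

```python
import sqlite3
import time

# In-process point queries against a small table: no network, no IPC.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany(
    "INSERT INTO t(val) VALUES (?)",
    [(f"row{i}",) for i in range(1000)],
)

N = 1000
start = time.perf_counter()
for i in range(N):
    conn.execute("SELECT val FROM t WHERE id = ?", (i % 1000 + 1,)).fetchone()
elapsed_us = (time.perf_counter() - start) / N * 1e6
print(f"~{elapsed_us:.1f}us per in-process point query")
```

Even through Python's interpreter overhead this typically lands in the single-digit-microsecond to tens-of-microseconds range, which is the gap a network hop has to compete with.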

MuffinFlavored 11 days ago [-]
> For us, a major point of SQLite is that we can execute queries in ~100µs when deployed on top of NVMe storage.

Postgres can't achieve these query times?

bob1029 11 days ago [-]
How long does it take to round trip through the network stack of 2 machines in a typical datacenter?
bawolff 11 days ago [-]
But this article is about using sqlite over a web socket so both would pay that price.
jbverschoor 11 days ago [-]
Use a unix file socket for the connection, or just embed postgresql :-)

The nice thing about sqlite, i.m.o., is mainly that it is just one file.

MuffinFlavored 11 days ago [-]
I thought we'd be comparing apples-to-apples with Postgres running locally in this case?
bob1029 11 days ago [-]
Ok - How many microseconds does it take to round trip SELECT 1 to localhost on a typical cloud VM?

You are still using the network stack and involving multiple processes.

Spivak 11 days ago [-]
People really do be forgetting how unbelievably fast doing things in-process is. It's why people suffer the complexity of threads instead of multiple processes with IPC - like the network.
bob1029 11 days ago [-]

In-process is where you get things like instruction-level parallelism.

I think our university programs have failed us the most in this regard. ILP should be drilled into every student's brain as the ultimate performance hack. Once you start separating everything into a bunch of different buckets/systems/processes, none of these advantages can be exploited anymore.

In some cases, ILP can give you a 100x improvement in performance. That's serial throughput. This is the reason many developers consider bit manipulation/hacking to be essentially free.

justinclift 11 days ago [-]
> Ok - How many microseconds does it take to round trip SELECT 1 to localhost on a typical cloud VM?

As a data point, when PG is on the same box as the querying app, you can run the connection over a unix domain socket instead of going over localhost.

In theory (!), that's supposed to have reduced latency + higher bandwidth too. It's been a bunch of years since I last saw benchmarks about it though, so things might have changed a bit in the meantime (no idea).
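If you want numbers for your own box, the round trip is easy to measure with a toy echo server. A sketch comparing TCP over localhost with a unix domain socket (this measures raw socket round trips only, not Postgres query processing):

```python
import os
import socket
import tempfile
import threading
import time


def echo_server(listener: socket.socket) -> None:
    conn, _ = listener.accept()
    with conn:
        while data := conn.recv(64):
            conn.sendall(data)  # echo the "query" straight back


def round_trip_us(family: int, addr, n: int = 2000) -> float:
    """Average microseconds for a tiny request/response over one socket."""
    listener = socket.socket(family, socket.SOCK_STREAM)
    listener.bind(addr)
    listener.listen(1)
    threading.Thread(target=echo_server, args=(listener,), daemon=True).start()
    client = socket.socket(family, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    start = time.perf_counter()
    for _ in range(n):
        client.sendall(b"SELECT 1")
        client.recv(64)
    client.close()
    return (time.perf_counter() - start) / n * 1e6


tcp_us = round_trip_us(socket.AF_INET, ("127.0.0.1", 0))
# AF_UNIX is POSIX-only
uds_us = round_trip_us(socket.AF_UNIX,
                       os.path.join(tempfile.mkdtemp(), "echo.sock"))
print(f"TCP localhost: ~{tcp_us:.0f}us, unix socket: ~{uds_us:.0f}us")
```

On typical Linux boxes both land in the tens of microseconds, with the unix socket usually a bit cheaper; either way, that fixed cost is the floor under every localhost query.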

linuxdude314 11 days ago [-]
There is no “round trip” and no network stack involved in querying a local Postgres instance (web app on same VM as db).

Attention is better spent designing an appropriate schema than worrying about the network being too slow to run a single query (if it is, you’ve already failed).

irq-1 11 days ago [-]
A Dev once told me that running SQL Server and IIS (Microsoft webserver) on the same machine would allow shared memory. Just pass a pointer from db to web; no copying of the data. Postgres doesn't do this AFAIK.
JohnBooty 11 days ago [-]
Purely FYI for those wondering: yeah, this is the default when connecting to MSSQL from the same box. It's not an IIS thing per se. Any app can do this with the appropriate connection string.

It's been ages since I worked in Microsoft land, but this is one of the reasons why MS shops often scale up instead of out. Imagine a single box with dozens of beefy cores and a couple hundred gigs of RAM running both IIS and SQL. That's... a significant amount of power.

It's also a potentially massive blast radius for security oopsies, but I don't know the practical history of how secure it has or hasn't been, or how it integrates with things like clustering etc.

bob1029 11 days ago [-]
I've personally found that one computer is substantially easier to keep an eye on than many computers.
JohnBooty 11 days ago [-]
Yeah. And a lot of businesses just don't need the high availability that a cluster of lesser boxen can provide. If they need to have a few hours of downtime each month to install patches on the one-and-only Big Ass Server, that's not necessarily a problem. Oh, and even an idiot can admin the thing. (Not a literal idiot, but it's about the simplest admin experience possible)

I honestly think a lot of businesses would be better off with this kind of simplified "scale up" architecture.

The downsides are clear and it's not the right fit for everything, obviously. But I see the IT world defaulting to these complicated 95384-piece AWS instances "Because that's how Netflix does it" or whatever and it's like... you're not Netflix and hardly anybody is.

Kinrany 11 days ago [-]
Ideally you want an embedded database, a library for manipulating requests understood by the database, and a few libraries for exposing the same interface over different networks.
Thaxll 11 days ago [-]
Basically re-inventing MySQL / PG but worse. Next step: we don't have auth over the network, so let's bake RBAC into SQLite.
vidarh 11 days ago [-]
From what else they're doing, it appears the point is to be able to use sqlite in serverless setups where MySQL/Postgres would be way too heavy to deploy on a per-customer or per-function basis.
matesz 11 days ago [-]
Honestly I don't get the point of edge. Do people really care whether their website loads in 50ms vs 300ms?
mirzap 11 days ago [-]
Of course they care. It directly impacts conversion rates. I'm still stunned by the simple fact that most people think 100ms doesn't make a difference (for their business).


billythemaniam 11 days ago [-]
While it matters, ±100ms doesn't matter in the vast majority of web use cases. There is even such a thing as too fast. If a user doesn't notice the page updated due to speed, that is also a poor experience. 300ms would actually feel incredibly fast to most web users for most websites. Less than 1 second is a good guideline for page load though as shown by your link.
jamil7 11 days ago [-]
> There is even such a thing as too fast. If a user doesn't notice the page updated due to speed, that is also a poor experience.

If you're relying on your app's execution time to demonstrate this to a user, then you have poor UX. Instant changes and updates have been the norm in mobile apps forever and optimistic UI is becoming pretty standard on the web too, see Linear for example.

matesz 11 days ago [-]
Can't they just cache, let's say, these 1000 marketing, sales and customer support pages into a CDN then?
travisjungroth 11 days ago [-]
It applies to applications as well. My general beliefs on the topic: most improvements that people come up with don't move the needle for the business. Doing the same thing but a little better will sometimes have an impact, generally not. The exception to this is speed. Speed improvements deliver more consistently than anything else, and they keep delivering as you improve. Improving speed is the surest bet you can make.
VWWHFSfQ 11 days ago [-]
One request being 50ms or 300ms probably doesn't really matter.

But 10,000 requests _per second_ being 50 or 300ms matters a lot

maxmcd 11 days ago [-]
Sure it's on a bit of a hype wave at the moment, but yes, most internet users are mobile users that don't live in major metropolitan areas with robust internet infrastructure? I think at least having the option to serve some things at the edge can dramatically improve the browsing experience for those users (and many others!).
vidarh 11 days ago [-]
To take a different tack then the other two (at time of writing) replies to this: It's not just at the edge. Being able to add databases with wild abandon and just provide a key to an S3 compatible bucket to replicate it to, and not have to worry about any server setup or replication is appealing whether or not your setup is otherwise centralised. For some setups you do want to share data across user accounts, but for others, isolation along customer or individual user boundaries is not just fine but an advantage (no accidentally writing a query that returns too much data). And then, it's a plus if having a million databases and selectively spread them out over your servers is trivial.
matesz 11 days ago [-]
I never thought people took restricting access from within the database seriously, like row security policies [1] in Postgres. But now as I look at it, it must be taken seriously, just because those features exist.

Is anybody here using it in production with success?

[1] https://www.postgresql.org/docs/current/ddl-rowsecurity.html

sally_glance 11 days ago [-]
Yep, my company uses Postgres RLS for a pretty big project for a client in the finance sector. It's the foundation of our multi-tenant setup there. We connect with a single DB user, but set a session variable at the start of each transaction. This session variable is then used by the RLS policies to filter the data.

Works like a charm, you basically get to build your app like it was dealing with a single tenant DB. Just make sure it's as hard as humanly possible for application developers to forget setting that tenant ID when they query... In our case we have a custom JPA base repository which takes care of that, and some PMD rules to catch rogue repos/queries.
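For the curious, the pattern described above looks roughly like this in SQL (table, policy, and variable names are invented for illustration):

```sql
-- Tenant isolation via RLS; all names here are hypothetical.
ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON invoices
    USING (tenant_id = current_setting('app.tenant_id')::int);

-- The application sets the session variable at the start of each transaction:
-- BEGIN;
-- SET LOCAL app.tenant_id = '42';
-- ... queries in this transaction only see rows where tenant_id = 42 ...
```

One caveat worth knowing: the table owner and superusers bypass RLS by default, so connecting as a non-owner role (or using FORCE ROW LEVEL SECURITY) matters for this setup.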

aobdev 11 days ago [-]
If I understand correctly, it seems to be a cornerstone of Supabase’s Authorization features. https://supabase.com/docs/guides/auth/row-level-security
giraffe_lady 11 days ago [-]
Yes I've used it a bunch in production it's great. It is highly valued in some PII-conscious fields for compliance reasons that I don't fully understand, but becoming competent with postgres RLS has been a huge benefit for my career.

The main practical downside is that it is more precise and rigorous than your app- or business-level auth logic is likely to be, and has no real escape hatches. If you're trying to drop it on an existing system you're going to spend a lot of time going back and forth with whoever to shake out edge case ambiguities in your rbac policy that weren't obvious or relevant before.

pphysch 11 days ago [-]
> But now as I look at it, it must be taken seriously, just because those features exist.

You can do pretty much everything within Postgres, from ETL to Authz to serving HTML templates and GIS apps. However, that doesn't mean you should, or that anyone is seriously using it in production on a large scale (after evaluating alternatives).

michaelsalim 11 days ago [-]
Yep, I use it for all my projects now. It's nice to know that there's basically no way for me to accidentally mess up access control other than messing up the roles. Got a loose SQL command running? Or even maybe SQL injection? No problem at all.
drej 11 days ago [-]
Note that this already exists on top of SQLite proper - authored by Ben Johnson (Litestream, Fly.io etc.) - https://github.com/benbjohnson/postlite
maxmcd 11 days ago [-]
I think they are quite different. Postlite expects you to connect over the Postgres wire protocol. Sqld is compiled into your application, so your application behaves like it's talking to an embedded SQLite; the calls are then made over the network (using one of three available transports) before returning to your application.
MuffinFlavored 11 days ago [-]
Dumb question, with all of this newfangled WASM stuff, why couldn't we also bake the Postgres server (and then client) into the code? I know the WASM runtime would need to expose the low-level ability to do filesystem operations, mmap()ing, network sockets, etc.
maxmcd 11 days ago [-]
Then you'd need to run and maintain Postgres, which is much more complicated, not just a single database file.

Postgres also can't be embedded (according to some brief googling), so you'd need to run it as a separate process.

stuaxo 11 days ago [-]
This would be handy for apps that use Postgres features but would nonetheless work in WASM.
ronanyeah 11 days ago [-]
I've been using render.com to host Rust+sqlite stacks recently, and I'd like to leverage this.

Should my Rust server be running sqld, with a passthrough endpoint added to its existing HTTP API for SQLite access?

Or alternatively, should I be using sqld locally to access my SQLite instance through SSH or something?

chasil 11 days ago [-]
This is quite an accomplishment. It is unfortunate that it could not be done in the main project.
Existenceblinks 11 days ago [-]
Enlighten me: if it needs to be hosted at a different network location than the app, what is the advantage over Postgres? Is it because it's easier to "set up" and "maintain"?
glommer 11 days ago [-]
- Extremely easy to get started.
- Unmatched testability, since you can now run the same code in CI and production and pass .db files to your tests.
- Extremely cheap and lightweight replication.
Existenceblinks 11 days ago [-]
What kind of code runs on CI AND production? You mean passing a prod-clone.db to test suites instead of fixtures?
travisjungroth 11 days ago [-]
I would imagine they mean passing tests.db to the suite instead of fixtures. Normally, you have a "pick 2" situation between run a DB with network access, run SQLite when testing, and run the same DB everywhere. This gets you all three (at other costs, of course). For tests, SQLite is great because it's super fast and easy/fast to backup/setup/teardown.
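The fixture-free style being described might look like this (a hypothetical test, using Python's built-in sqlite3 for illustration):

```python
import os
import sqlite3
import tempfile


def create_user(conn: sqlite3.Connection, name: str) -> None:
    """Toy application code under test."""
    conn.execute("INSERT INTO users(name) VALUES (?)", (name,))
    conn.commit()


def test_create_user() -> None:
    # A throwaway SQLite file per test: fast to set up, trivial to tear
    # down, and (in the setup described above) the same engine the app
    # talks to in production.
    path = os.path.join(tempfile.mkdtemp(), "test.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE users(id INTEGER PRIMARY KEY, name TEXT)")
    create_user(conn, "ada")
    assert conn.execute("SELECT count(*) FROM users").fetchone()[0] == 1
    conn.close()


test_create_user()
print("ok")
```

Each test gets a pristine database for the cost of creating one file, which is what makes the setup/teardown so cheap compared to resetting a shared server.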
vidarh 11 days ago [-]
Given their webassembly sqlite function support and other changes, the advantage would seem to be that you're most of the way there to being able to provide "serverless" databases. Have a proxy handle auth, spin up an instance if it's not running/shut it down after a while, and add syncing of changes to/from object storage, and you're there.
jimperio 11 days ago [-]
Interesting, hadn't thought about it like that but it makes sense. Used SQLite for a desktop app but never thought it would make any sense on the server.
Existenceblinks 11 days ago [-]
> syncing of changes to/from object storage

So you are suggesting to have both the "serverless" db and the storage? If so, you now have 3 problems.

vidarh 11 days ago [-]
I'm describing capabilities that already exist. Reliably streaming sqlite data to object storage is not a new thing and supported by multiple separate implementations at this point.
Existenceblinks 11 days ago [-]
The problem with having multiple sources of data is consistency, availability, and partition tolerance. It also depends on the IO characteristics of the app: who the readers are, who the writers are, how long stale data is acceptable. Basically all the distributed-systems problems. In a web app, a frontend + backend + DB sitting next to it is already enough of a problem (e.g. the SPA vs MPA state problem).
vidarh 11 days ago [-]
Did not at any point suggest multiple sources of data.

EDIT: While there are things in their repos that suggest they might be thinking about moving towards allowing multiple writers, what's currently there suggests a single active instance of each database, with the WAL being sync'ed to object storage so that in the case of failure and/or when doing a cold-start, the database can be brought back from object storage.

Existenceblinks 11 days ago [-]
Ok sorry, I think I misinterpreted your architecture. I think you mean something like HA with Litestream.
vidarh 11 days ago [-]
Yes, similar to that.

So you'd put up a proxy to handle auth, and match an incoming request either to a running database or to a cold database. If it's for a running database, it'd replicate to S3 or similar. If it's for a cold database, you sync from S3 or similar and start the server side process.

To your point, you absolutely need to be able to reliably grant a lease of some sort to whichever frontend pulls down the database and starts an endpoint, or you're absolutely right that you'll have huge problems.

Absolutely won't be suitable for every kind of workload, but if you've already committed to running your stuff in a serverless setup, having your database(s) handled that way might be appealing.

Existenceblinks 11 days ago [-]
> but if you've already committed to running your stuff in a serverless setup

Sounds good, but I'm curious what the criteria (need) would be to architect like this in the first place.

vidarh 11 days ago [-]
Let's say you want to run a huge number of databases for customers; too many to fit on an individual server. Now you have to shard. You can either try to split them between multiple MySQL/Postgres etc. servers, which are now each individually major risk factors, or you can design your system so you can just hook up more servers at will, as long as the largest individual customer database can run on a single instance.

I've run large numbers of Postgres databases, and it's not hard to automate, but it's hard to optimise for a setup where the usage patterns of individual databases are hard to estimate. Is your customer using it for batch jobs, or for persistent streams of data? Who can you colocate with whom? When the cost of shutting a database down one place and migrating it elsewhere becomes very low, this kind of scenario can potentially become a lot easier.

From the user's perspective, I'd expect you really wouldn't care, other than in terms of cold start times and cost. Except perhaps for batch jobs etc., where being able to write apps that check state, obtain a lease, download the most recent version, do their job, and upload the result to durable storage might well be convenient, rather than having to e.g. keep a bunch of databases constantly running.

Existenceblinks 11 days ago [-]
> a huge number of databases for customers

Is this one.db per customer? If so, how do you deal with schema migration?

vidarh 11 days ago [-]
Carefully ;) One option is to build your app to check for migrations on startup. I've run systems that'd do schema migrations on a per user basis for data stored in objects per user, and it worked just fine across a userbase of a couple of million accounts, on the basis that the only clients that connect directly also owns the schema, so nothing ever connects without immediately checking if a migration needs to run.

But consider that for setups like this the database might well be the customers own database, so their schema might not be something it's your job to touch at all.
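The check-on-startup approach is straightforward to sketch. One common pattern uses SQLite's `PRAGMA user_version` as a schema version counter (hypothetical migrations for illustration, not libSQL's or anyone's actual mechanism):

```python
import sqlite3

# Hypothetical ordered migrations; each customer database records how
# many it has applied in PRAGMA user_version.
MIGRATIONS = [
    "CREATE TABLE notes(id INTEGER PRIMARY KEY, body TEXT)",
    "ALTER TABLE notes ADD COLUMN created_at TEXT",
]


def migrate_on_connect(path: str) -> sqlite3.Connection:
    """Open a per-customer database and bring its schema up to date."""
    conn = sqlite3.connect(path)
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for i, stmt in enumerate(MIGRATIONS[version:], start=version):
        conn.execute(stmt)
        # PRAGMA values can't be parameterized; i is trusted, not user input.
        conn.execute(f"PRAGMA user_version = {i + 1}")
    conn.commit()
    return conn


conn = migrate_on_connect(":memory:")
assert conn.execute("PRAGMA user_version").fetchone()[0] == 2
```

Because every client checks the version before doing anything else, an out-of-date database is migrated on first contact and an up-to-date one pays only a single PRAGMA read.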

bawolff 11 days ago [-]
I don't get why you would want this. It seems like its taking away all the key advantages of the sqlite approach and leaving only the disadvantages.
_448 11 days ago [-]
A noob question: how does one specify a unique constraint on a column in SQLite?
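For what it's worth: UNIQUE can be declared inline on the column, or added later as a unique index. A quick illustration via Python's built-in sqlite3 (table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users(
        id INTEGER PRIMARY KEY,
        email TEXT UNIQUE,   -- inline column constraint
        handle TEXT
    )
""")
# A unique index is equivalent and can be added after the fact:
conn.execute("CREATE UNIQUE INDEX idx_users_handle ON users(handle)")

conn.execute("INSERT INTO users(email, handle) VALUES ('a@example.com', 'a')")
try:
    conn.execute("INSERT INTO users(email, handle) VALUES ('a@example.com', 'b')")
except sqlite3.IntegrityError as err:
    print("duplicate rejected:", err)
```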
philipwhiuk 11 days ago [-]
"libSQL" is a silly name. It's not a library is it?

I wish people would at least vaguely try to be helpful naming products.

zffr 11 days ago [-]
From the SQLite homepage:

"SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine."

Wouldn't a fork also be a library?

another2another 9 days ago [-]
Well, it can be linked as an external library, probably also statically linked (never tried), or - how I use it - compiled into your program as a single sqlite.c file.
Spivak 11 days ago [-]
It seems like it’s a library you’re meant to embed in an application and that sqld is just one application built on libSQL.

It does seem like it would make more sense to call it something that alludes to the Postgres compatibility.

glommer 11 days ago [-]
Of course it is a library. An embedded database is essentially a library that you add to your application.

That you can build stuff around it doesn't invalidate the fact that the core of it is a library.