I've been thinking about this a lot - nearly every problem these days is a synchronisation problem. You're regularly downloading something from an API? That's a sync. You've got a distributed database? Sync problem. Cache invalidation? Basically a sync problem. You want online and offline functionality? Sync problem. Collaborative editing? Sync problem.
And 'synchronisation' as a practice gets very little attention or discussion. People just start with naive approaches like 'download what's marked as changed' and then get stuck in the quagmire of known problems and known edge cases (handling deletions, handling transport errors, handling changes that didn't get marked with a timestamp, how to repair after a bad sync, dealing with conflicting updates, etc.).
The one piece of discussion or attempt at a systematic approach I've seen to 'synchronisation' recently is to do with Conflict-free Replicated Data Types (https://crdt.tech), which is essentially restricting your data and the rules for dealing with conflicts to situations that are known to be resolvable, and then packaging it all up into an object.
klabb3 2 days ago [-]
> The one piece of discussion or attempt at a systematic approach I've seen to 'synchronisation' recently is to do with Conflict-free Replicated Data Types https://crdt.tech
I will go against the grain and say CRDTs have been a distraction and the overfocus on them has been delaying real progress. They are immature and highly complex, and thus hard to debug and understand, and have extremely limited cross-language support in practice - let alone any indexing or storage engine support.
Yes, they are fascinating and yes they solve real problems but they are absolute overkill for your problems (except collab editing), at least currently. Why? Because they are all about conflict resolution. You can get very far without addressing this problem: for instance a cache, like you mentioned, has no need for conflict resolution. The main data store owns the data, and the cache follows. If you can have single ownership (single writer), or last write wins, or similar, you can drop a massive pile of complexity on the floor and not worry about it. (In the rare cases it's necessary, like Google Docs or Figma, I would be very surprised if they use off-the-shelf CRDT libs – I would bet they have extremely bespoke, domain-specific data structures that are inspired by CRDTs.)
Instead, what I believe we need is end-to-end bidirectional stream-based data communication, simple patch/replace data structures to efficiently notify of updates, and standard algorithms and protocols for processing it all. Basically adding async reactivity on the read path of existing data engines like SQL databases. I believe even this is a massive undertaking, but feasible, and it delivers lasting tangible value.
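To sketch the shape I have in mind (a hypothetical subscribe/patch API, names purely illustrative, not any existing engine):

    // A read-path subscription that pushes incremental updates instead of
    // requiring clients to re-query.
    type RowKey = string;

    type Update<T> =
      | { kind: "replace"; snapshot: Map<RowKey, T> }                    // full resync
      | { kind: "patch"; upserts: [RowKey, T][]; deletes: RowKey[] };    // incremental

    interface LiveQuery<T> {
      // Called once with a replace, then repeatedly with patches.
      subscribe(onUpdate: (u: Update<T>) => void): () => void;
    }

    // The client keeps a local materialized view by folding updates into a map.
    function applyUpdate<T>(view: Map<RowKey, T>, u: Update<T>): Map<RowKey, T> {
      if (u.kind === "replace") return new Map(u.snapshot);
      for (const [k, v] of u.upserts) view.set(k, v);
      for (const k of u.deletes) view.delete(k);
      return view;
    }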
mweidner 2 days ago [-]
Indeed, the simple approach of "send your operations to the server and it will apply them in the order it receives them" gives you good-enough conflict resolution in many cases.
It is still tempting to turn to CRDTs to solve the next problem: how to apply server-side changes to a client when the client has its own pending local operations. But this can be solved in a fully general way using server reconciliation, which doesn't restrict your operations or data structures like a CRDT does. I wrote about it here: https://mattweidner.com/2024/06/04/server-architectures.html...
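A minimal sketch of the rebase step (simplified and with illustrative names, not the exact code from the post): the client applies its mutations optimistically, then replays whatever is still pending on top of each new authoritative state.

    type State = Record<string, unknown>;
    type Mutation = { id: string; apply: (s: State) => State };

    class ReconcilingClient {
      private pending: Mutation[] = [];
      private serverState: State = {};
      view: State = {};

      // Local edit: apply optimistically and queue for the server.
      mutate(m: Mutation) {
        this.pending.push(m);
        this.view = m.apply(this.view);
        // send m to the server here
      }

      // Server pushes its latest state plus the ids it has already applied.
      onServerUpdate(state: State, appliedIds: Set<string>) {
        this.serverState = state;
        this.pending = this.pending.filter((m) => !appliedIds.has(m.id));
        // Rebase: start from authoritative state, replay what is still pending.
        this.view = this.pending.reduce((s, m) => m.apply(s), this.serverState);
      }
    }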
klabb3 19 hours ago [-]
Just got to reading this.
> how to apply server-side changes to a client when the client has its own pending local operations
I liked the option of restore and replay on top of the updated server state. I'm wondering when this causes perf issues, though. Ideally local changes should still propagate fast after e.g. a network partition, even if the person has queued up a lot of them (say, during a flight).
Anyway, my thinking is that you can avoid many consensus problems by just partitioning data ownership. The like example is interesting in this way. A like count is an aggregate based on multiple data owners, and everyone else just passively follows with read replication. So thinking in terms of shared write access is the wrong problem description, imo, when in reality ”liked posts” is data exclusively owned by all the different nodes doing the liking (subject to a limit of one like per post). A server aggregate could exist but is owned by the server, so no shared write access is needed.
Similarly, say you have a messaging service. Each participant owns their own messages and others follow. No conflict resolution is needed. However, you can still break the protocol (say, liking twice); those updates can be considered malformed and e.g. ignored. In some cases you can copy someone else's data and make it your own, for instance to protect against impersonation: say you can change your own nickname, and others follow. This can be exploited to impersonate, but you can keep a local copy of the last-seen nickname and then display a "changed name" warning.
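A toy sketch of what I mean by partitioned ownership, using likes as the example (all names illustrative):

    // Each user exclusively owns the set of post ids they liked; the server
    // owns the derived count. No shared write access, so nothing to resolve.
    type UserId = string;
    type PostId = string;

    // Owned by each individual user (single writer per set).
    const likesByUser = new Map<UserId, Set<PostId>>();

    function like(user: UserId, post: PostId) {
      const set = likesByUser.get(user) ?? new Set<PostId>();
      set.add(post); // "at most one like per post" falls out of using a Set
      likesByUser.set(user, set);
    }

    // Owned by the server: a passive aggregate that everyone else replicates.
    function likeCount(post: PostId): number {
      let n = 0;
      for (const set of likesByUser.values()) if (set.has(post)) n++;
      return n;
    }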
Anyway, I’m just a layman who wants things to be simple. It feels like CRDTs have been the ultimate nerd-snipe, and when I did my own evaluations I was disappointed with how heavyweight and opaque they were a few years ago (and probably still).
ochiba 2 days ago [-]
> Yes, they are fascinating and yes they solve real problems but they are absolute overkill for your problems (except collab editing), at least currently. Why? Because they are all about conflict resolution. You can get very far without addressing this problem: for instance a cache, like you mentioned, has no need for conflict resolution. The main data store owns the data, and the cache follows. If you can have single ownership (single writer), or last write wins, or similar, you can drop a massive pile of complexity on the floor and not worry about it. (In the rare cases it's necessary, like Google Docs or Figma, I would be very surprised if they use off-the-shelf CRDT libs – I would bet they have extremely bespoke, domain-specific data structures that are inspired by CRDTs.)
I agree with this. CRDTs are cool tech but I think in practice most folks would be surprised by the high percentage of use cases that can be solved with a much simpler conflict resolution mechanism (perhaps combined with server reconciliation as Matt mentioned). I also agree that collaborative document editing is a niche where CRDTs are indeed very useful.
satvikpendem 2 days ago [-]
You might not need a CRDT [0]. But also, CRDTs are the future [1].
> In the rare cases it’s necessary like Google Docs or Figma I would be very surprised if they use off-the-shelf CRDT libs
Or CRDTs at all. Google Docs is based on operational transforms and Figma on what they call multiplayer technology.
halfcat 2 days ago [-]
> what I believe we need is end-to-end bidirectional stream-based data communication
I suspect the generalized solution is much harder to achieve, and looks more like batch-based reconciliation of full snapshots than streaming or event-driven.
The challenge is if you aim to sync data sources where the parties managing each data source are not incentivized to provide robust sync. Consider Dropbox or similar, where a single party manages the data set, and all software (server and clients), or ecosystems like Salesforce and Mulesoft which have this as a stated business goal, or ecosystems like blockchains where independent parties are still highly incentivized to coordinate and have technically robust mechanisms to accomplish it like Merkle trees and similar. You can achieve sync in those scenarios because independent parties are incentivized to coordinate (or there is only one party).
But if you have two or more independent systems, all of which provide some kind of API or import/export mechanism, you can never guarantee those systems will stay in sync using a streaming or event-driven approach. Worse, those systems will inevitably drift out of sync, or worse still, will propagate incorrect data across multiple systems, which can then only be reconciled by batch-like point-in-time snapshots - which raises the question of why use streaming if you ultimately need batch to make it work reliably.
Put another way, people say batch is a special case of streaming, so just use streaming. But you could also say streaming is a fragile form of sync, so just use sync. But sync is a special case of batch, so just use batch.
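A rough sketch of what that batch reconciliation step looks like, assuming full snapshots keyed by id (illustrative only):

    // Compare a remote snapshot against the local one and emit the
    // creates/updates/deletes needed to converge.
    type Snapshot<T> = Map<string, T>;

    function reconcile<T>(local: Snapshot<T>, remote: Snapshot<T>,
                          equal: (a: T, b: T) => boolean) {
      const creates: [string, T][] = [];
      const updates: [string, T][] = [];
      const deletes: string[] = [];

      for (const [id, value] of remote) {
        const existing = local.get(id);
        if (existing === undefined) creates.push([id, value]);
        else if (!equal(existing, value)) updates.push([id, value]);
      }
      for (const id of local.keys()) {
        if (!remote.has(id)) deletes.push(id);
      }
      return { creates, updates, deletes };
    }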
josephg 2 days ago [-]
I agree! Lots more things are sync problems. Also: the state of my source files -> my compiler (in watch mode), and about 20 different APIs in the kernel - from keyboard state to filesystem watching to process monitoring to connected USB devices.
Also, HTTP caching is sort of a special case of sync - where the cache (say, nginx) is trying to keep a synchronised copy of a resource from the backend web server. But because there’s no way for the web server to notify nginx that the resource has changed, you get both stale reads and unnecessary polling. Doing fan-out would be way more efficient than a keep-alive header if we had a way to do it!
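A toy sketch of what push-invalidated caching could look like if the origin had such a channel (hypothetical names, not a real nginx or HTTP feature):

    // A cache that is invalidated by push notifications from the origin
    // rather than by TTL-based polling.
    class PushInvalidatedCache {
      private entries = new Map<string, Response>();

      constructor(invalidations: EventTarget) {
        // Origin emits an event naming the resource that changed.
        invalidations.addEventListener("invalidate", (e) => {
          this.entries.delete((e as CustomEvent<string>).detail);
        });
      }

      async get(url: string): Promise<Response> {
        const hit = this.entries.get(url);
        if (hit) return hit.clone();        // fresh until told otherwise
        const res = await fetch(url);       // miss: fetch from the origin
        this.entries.set(url, res.clone());
        return res;
      }
    }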
CRDTs are cool tech. (I would know - I’ve been playing with them for years.) But I think it’s worth dividing data interfaces into two types: owned data and shared data. Owned data has a single owner (eg the database, the kernel, the web server) and other devices live downstream of that owner. Shared data sources have more complex systems - eg everyone in the network has a copy of the data and can make changes, then it’s all eventually consistent. Or raft/paxos. Think git, or a distributed database. And they can be combined - eg the app server is downstream of a distributed database. GitHub Actions is downstream of a git repo.
I’ve been meaning to write a blog post about this for years. Once you realise how ubiquitous this problem is, you see it absolutely everywhere.
miki123211 2 days ago [-]
And then there's the third super-special category of shared data with no central server, and where only certain users should be allowed to perform certain operations. This comes up most often in p2p networks, censorship resistance etc.
In most cases, the easiest approach there is just "slap a blockchain on it", as a good and modern (think Ethereum, not Bitcoin) blockchain essentially "abstracts away" the decentralization and mostly acts like a centralized computer to higher layers.
That is certainly not the only viable approach, and I wish we looked at others more. For example, a decentralized DNS-like system, without an attached cryptocurrency, but with global consensus on what a given name points to, would be extremely useful. I'm not convinced that such a thing is possible; you need some way of preventing one bad actor from grabbing all the names, and monetary compensation seems like the easiest one. But we should be looking in this direction a lot more.
josephg 1 day ago [-]
> And then there's the third super-special category of shared data with no central server, and where only certain users should be allowed to perform certain operations. This comes up most often in p2p networks, censorship resistance etc.
In my mind, this is just the second category again. It’s just a shared data system, except with data validation & Byzantine fault tolerance requirements.
It’s a surprisingly common and thorny problem. For example, I could change my local git client to generate invalid / wrong hashes for my commits. When I push my changes, other peers should - in some way - reject them. PVH (of Ink&Switch) has a rule when thinking about systems like this. He says you’re free to deface your own copy of the US constitution. But I don’t have to pull your changes.
Access control makes the BFT problem much worse. The classic problem is that if two admins concurrently remove each other, it’s not clear what happens. In a CRDT (or git), peers are free to backdate their changes to any arbitrary point in the past. If you try to implement user roles on top of a CRDT, it’s a nightmare. I think CRDTs are just the wrong tool for thinking about access control.
jkaptur 2 days ago [-]
I can't wait to read that blog post. I know you're an expert in this and respect your views.
One thing I think that is missing in the discussion about shared data (and maybe you can correct me) is that there are two ways of looking at the problem:
* The "math/engineering" way, where once state is identical you are done!
* The "product manager" way where you have reasonable-sounding requests like "I was typing in the middle of a paragraph, then someone deleted that paragraph, and my text was gone! It should be its own new paragraph in the same place."
Literally having identical state (or even identical state that adheres to a schema) is hard enough, but I'm not aware of techniques to ensure 1) identical state 2) adhering to a schema 3) that anyone on the team can easily modify in response to "PM-like" demands without being a sync expert.
ochiba 2 days ago [-]
> And 'synchronisation' as a practice gets very little attention or discussion. People just start with naive approaches like 'download what's marked as changed' and then get stuck in the quagmire of known problems and known edge cases (handling deletions, handling transport errors, handling changes that didn't get marked with a timestamp, how to repair after a bad sync, dealing with conflicting updates, etc.).
I've spent 16 years working on a sync engine and have worked with hundreds of enterprises on sync use cases during this time. I've seen countless cases of developers underestimating the complexity of sync. In most cases it happens exactly as you said: start with a naive approach and then the fractal complexity spiral starts. Even if the team is able to do the initial implementation, maintaining it usually turns into a burden that they eventually find too big to bear.
danielvaughn 2 days ago [-]
CRDTs work well for linear data structures, but there are known issues with hierarchical ones. For instance, if you have a tree, two clients could concurrently send transactions that, once merged, make a node an ancestor of itself.
That said, there’s work that has been done towards fixing some of those issues.
Evan Wallace (I think he’s the CTO of Figma) has written about a few solutions he tried for Figma’s collaborative features. And then Martin Kleppmann has a paper proposing a solution:
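Not Evan Wallace's or Kleppmann's actual algorithm, but a minimal sketch of the usual guard: once moves are applied in a single agreed order, a move that would create a cycle can be detected and dropped (or reparented).

    // `parent` maps child id -> parent id.
    type NodeId = string;

    function wouldCreateCycle(parent: Map<NodeId, NodeId>,
                              node: NodeId, newParent: NodeId): boolean {
      // Walk up from the proposed parent; if we reach `node`, it's a cycle.
      for (let cur: NodeId | undefined = newParent; cur !== undefined;
           cur = parent.get(cur)) {
        if (cur === node) return true;
      }
      return false;
    }

    function applyMove(parent: Map<NodeId, NodeId>, node: NodeId, newParent: NodeId) {
      if (wouldCreateCycle(parent, node, newParent)) return; // drop invalid move
      parent.set(node, newParent);
    }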
Martin Kleppmann, in one of his recent talks about the future of local-first, mentions the need for a generic sync service for the 'local-first end-game' [0], as he calls it. Standardization is needed. Right now everyone and their mother is doing sync differently and building production platforms around their own protocols and mechanisms.
The problem is that the requirements can be vastly different. A collaborative editor is very different from, say, syncing encrypted blobs. Perhaps there is a one-size-fits-all solution, but I doubt it.
I've been working on sync for the latter use case for a while and CRDTs would definitely be overkill.
layer8 2 days ago [-]
Automatic conflict resolution will always be limited. For example, who seriously believes that we’ll ever be able to fully automate the handling of merge conflicts in version control? (Even if we recorded every single edit operation at the syntax-tree level.) And in regular documents the situation is worse, because you don’t have formal parsers and type checkers and unit tests for them. Even for schematized structured data, there are similar issues at the semantic level that a mere “it conforms to the schema” doesn’t solve.
lifty 18 hours ago [-]
Indeed. So conflict resolution that takes input from the user needs to be part of the protocol. Just like in Git.
jdvh 2 days ago [-]
As long as all clients agree on the order of CRDT operations then cycles are no problem. It's just an invalid transaction that can be dropped. Invalid or contradictory updates can always happen (regardless of sync mechanism) and the resolution is a UX issue. In some cases you might want to inform the user, in other cases the user can choose how to resolve the conflict, in other cases quiet failure is fine.
jakelazaroff 2 days ago [-]
Unfortunately, a hard constraint of (state-based) CRDTs is that merging causally concurrent changes must be commutative. I.e. it is possible that clients will not be able to agree on the order of CRDT operations, and they must be able to arrive at the same state after applying them in any order.
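The simplest illustration of that constraint is a max register: merging in either order gives the same state, which is exactly what concurrent replicas rely on (toy sketch).

    type MaxRegister = { value: number };

    function merge(a: MaxRegister, b: MaxRegister): MaxRegister {
      return { value: Math.max(a.value, b.value) };
    }

    // merge(a, b) and merge(b, a) are always equal, so two replicas that saw
    // the same set of updates converge no matter the order they received them in.
    const replicaA = { value: 7 };
    const replicaB = { value: 3 };
    console.log(merge(replicaA, replicaB), merge(replicaB, replicaA)); // both { value: 7 }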
jdvh 2 days ago [-]
I don't think that's required, unless you definitionally believe otherwise.
When clients disagree about the order of events and a conflict results, clients can be required to roll back (apply the inverse of each change) to the last point in time where all clients were in agreement about the world state. Then, all clients re-apply all changes in the new, now-agreed-upon order. Now all changes have been applied, there is agreement about the world state, and the process starts anew.
This way multiple clients can work offline for extended periods of time and then reconcile with other clients.
satvikpendem 2 days ago [-]
Eg-walker seems similar to what you're proposing [0]. A more in-depth video by the creator [1].
You're free to argue that this isn't "pure" CRDT, but the CRDT algorithm still runs normally, just a bit later than it otherwise would.
mrkeen 2 days ago [-]
I've looked at CRDTs, and the concept really appeals to me in the general case, but in the specific cases, my design always ends up being "keep-all-the-facts" about a particular item. But then you defer the problem of 'which facts can I throw away?'. It's like inventing a domain-specific GC.
I'd love to hear about any success cases people have had with CRDTs.
FjordWarden 2 days ago [-]
There was an article on this website not so long ago about using CRDTs for collaborative editing, and there was this silly example to show how leaky this abstraction can be. What if you have the word "color", one user replaces it with "colour", and another deletes the word: what does the CRDT do in this case? Well, it merges these two edits into "u". This sort of makes me skeptical of using CRDTs for user-facing applications.
jakelazaroff 2 days ago [-]
There isn’t a monolithic “CRDT” in the way you’re describing. CRDTs are, broadly, a kind of data structure that allows clients to eventually agree on a final state without coordination. An integer `max` function is a simple example of a CRDT.
The behavior the article found is peculiar to the particular CRDT algorithms they looked at. But they’re probably right that it’s impossible for all conflicting edits to “just work” (in general, not just with CRDTs). That doesn’t mean CRDTs are pointless; you could imagine an algorithm that attempts to detect such semantic conflicts so the application can present some sort of resolution UI.
> There isn’t a monolithic “CRDT” in the way you’re describing.
I can't blame people for thinking otherwise, pretty much every self-called "CRDT library" I've come across implements exactly one such data structure, maybe parameterized.
It's like writing a "semiring library" and it's simply (min, +).
jdvh 2 days ago [-]
It's still early, but we have a checkpointing system that works very well for us. And once you have checkpoints you can start dropping inconsequential transactions in between checkpoints, which, you're right, can be considered GC. However, checkpointing is desirable anyway; otherwise new users have to replay the transaction log from T=0 when they join, and that's impractical.
dtkav 2 days ago [-]
I've also had success with this method.
"domain-specific GC" is a fitting term.
yccs27 2 days ago [-]
For me the main issue with CRDTs is that they have a fixed merge algorithm baked in - if you want to change how conflicts get resolved, you have to change the whole data structure.
WorldMaker 2 days ago [-]
I feel like the state of the art here is slowly starting to change. I think CRDTs for too many years got too caught up in "conflict-free" as a "manifest destiny" sort of thing rather than a "hope and prayer", and thought they'd keep finding the right fixed merge algorithm for every situation. I started watching CRDTs from the perspective of source control, with a strong inkling that "data is always messy" and "conflicts are human" (conflicts are kind of inevitable in any structure trying to encode data made by people).
I've been thinking for a bit that it is probably about time the industry renamed that first C to something other than "conflict-free". There is no freedom from conflicts. There's conflict resistance, sure, and CRDTs can provide a lot of conflict resistance in their various data structures. But at the end of the day, if the data structure is meant to encode an application for humans, it needs every merge tool and review tool and audit tool it can offer to deal with those conflicts.
I think we're finally starting to see some light at the end of the tunnel in the major CRDT efforts, and we're finally leaving the detour of "no, it must be conflict-free, we named it that so it must be true". I don't think any one library is yet delivering it at a good high level, but I have that feeling that "one of the next libraries" is maybe going to start getting the ergonomics of conflict handling right.
dtkav 2 days ago [-]
This seems right to me -- imagine being able to tag objects or sub-objects with conflict-resolution semantics in a more supported way (LWW, edits from a human, edits from automation, human resolution required with or without optimistic application of defaults, etc.).
Throwing small language models into the mix could make merging less painful too — like having the system take its best guess at what you meant, apply it, and flag it for later review.
satvikpendem 2 days ago [-]
I just want some structure where it is conflict-free most of the time but I can write custom logic in certain situations that is used, sort of like an automated git merge conflict resolution function.
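Something like a per-field merge policy registry, defaulting to LWW, could give that shape (purely illustrative, not any particular library's API):

    type Versioned<T> = { value: T; updatedAt: number };
    type Merge<T> = (local: Versioned<T>, remote: Versioned<T>) => Versioned<T>;

    // Default policy: last write wins.
    const lww: Merge<unknown> = (l, r) => (r.updatedAt >= l.updatedAt ? r : l);

    const policies: Record<string, Merge<any>> = {
      // Custom rule: tags accumulate instead of one side winning.
      tags: (l: Versioned<string[]>, r: Versioned<string[]>) => ({
        value: [...new Set([...l.value, ...r.value])],
        updatedAt: Math.max(l.updatedAt, r.updatedAt),
      }),
    };

    function mergeDoc(local: Record<string, Versioned<any>>,
                      remote: Record<string, Versioned<any>>) {
      const out: Record<string, Versioned<any>> = { ...local };
      for (const field of Object.keys(remote)) {
        const policy = policies[field] ?? lww;
        out[field] = field in local ? policy(local[field], remote[field]) : remote[field];
      }
      return out;
    }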
dtkav 2 days ago [-]
I've been running into this with automated regex edits. Our product (Relay [0]) makes Obsidian real-time collaborative using yjs, but I've been fighting with the automated process that rewrites markdown links within notes.
The issue happens when a file is renamed by one client, and then all other clients pick up the rename and make the change to the local files on disk. Since every edit is broken down into delete/keep/insert runs, the automated process runs rapidly in all clients and can break the links.
I could limit the edits to just one client, but it feels clunky. Another thought I've had is to use ytext annotations, or just also store a ymap of the link metadata and only apply updates if they can meet some kind of check (kind of like schema validation for objects).
If anyone has a good mental model for modeling automated operations (especially find/replace) in ytext please let me know! (email in bio).
Absolutely. My current product relies heavily on a handful of partner systems, adds an opinionated layer on top of these systems, and propagates data to CRM, DW, and other analytical systems.
One early insight was that we needed a representation of partner data in our database (and the downstream systems need a representation of our opinionated view as well). This is clearly an (eventually consistent) synchronization problem.
We also realized that we often fail to sync (due to bugs, timing, or whatever) and need a regular process to resync data.
We've ended up with a homegrown framework that does both things, such that the same business logic gets used in both cases. (This also makes it easy to backfill data if a chosen representation changes.)
We're now on the third or fourth iteration of this system and I'm pretty happy with it.
delusional 2 days ago [-]
Once you add a periodic resync you have moved the true synchronization away from the online "(eventually consistent) synchronization" and into the batch resync. At that point the online synchronization is just a performance optimization on top of the batch resync.
I've been in that situation a lot, and I'd always carefully consider if you even need the online synchronization at that point. It's pretty rarely required.
jbmsf 2 days ago [-]
In our case it absolutely is. There are user facing flows that require data from partner systems to complete. Waiting for the next sync cycle isn't a good UX.
pwdisswordfishz 2 days ago [-]
> Cache invalidation? Basically a sync problem.
Do naming things and off-by-one errors also count?
mattnewport 2 days ago [-]
UI is also a sync problem if you squint a bit. React like systems are an attempt to be a sync engine between model and view in a sense.
Multiplayer games too.
mackopes 2 days ago [-]
I'm not convinced that there is one generalised solution to sync engines. To make them truly performant at large scale, engineers need to have deep understanding of the underlying technology, their query performance, database, networking, and build a custom sync engine around their product and their data.
Abstracting all of this complexity away in one general tool/library and pretending that it will always work is snake oil. There are no shortcuts to building truly high quality product at a large scale.
wim 2 days ago [-]
We've built a sync engine from scratch. Our app is a multiplayer "IDE" but for tasks/notes [1], so it's important to have a fast local-first/offline experience like other editors, and have changes sync in the background.
I definitely believe sync engines are the future as they make it so much easier to enable things like no-spinners browsing your data, optimistic rendering, offline use, real-time collaboration and so on.
I'm also not entirely convinced yet though that it's possible to get away with something that's not custom-built, or at least large parts of it. There were so many micro decisions and trade-offs going into the engine: what is the granularity of updates (characters, rows?) that we need and how does that affect the performance. Do we need a central server for things like permissions and real-time collaboration? If so do we want just deltas or also state snapshots for speedup. How much versioning do we need, what are implications of that? Is there end-to-end-encryption, how does that affect what the server can do. What kind of data structure is being synced, a simple list/map, or a graph with potential cycles? What kind of conflict resolution business logic do we need, where does that live?
It would be cool to have something general purpose so you don’t need to build any of this, but I wonder how much time it will save in practice. Maybe the answer really is to have all kinds of different sync engines to pick from and then you can decide whether it's worth the trade-off not having everything custom-built.
Optimally, a sync engine could be configured with the best settings for the project (e.g. central server or completely decentralised). It'd be great if one engine were that performant/configurable, but having a lot of sync engines to choose from for your project is the best alternative.
btw: excellent questions to ask / insights - about the same ones I also came across in my local-first ventures.
Would be great if someone could assemble all these questions in a "walkthrough" step-by-step interface where, at the end, the user gets a list of the best-matching engines.
Edit: Mh ... maybe something small enough to vibe code ... if someone is interested to help let me know!
jdvh 2 days ago [-]
Completely decentralized is cool, but I think there are two key problems with it.
1) in a decentralized system who is responsible for backups? What happens when you restore from a backup?
2) in a decentralized system who sends push notifications and syncs with mobile devices?
I think that in an age of $5/mo cloud vms and free SSL having a single coordination server has all the advantages and none of the downsides.
tonsky 2 days ago [-]
- You can have many sync engines
- Sync engines might only solve small and medium scale; that would be a huge win even without large scale
thr0w 2 days ago [-]
> Abstracting all of this complexity away in one general tool/library and pretending that it will always work is snake oil.
Remember Meteor?
xg15 2 days ago [-]
That might be true, but you might not have those engineers or they might be busy with higher-priority tasks:
> It’s also ill-advised to try to solve data sync while also working on a product. These problems require patience, thoroughness, and extensive testing. They can’t be rushed. And you already have a problem on your hands you don’t know how to solve: your product. Try solving both, fail at both.
Also, you might not have that "large scale" yet.
(I get that you could also make the opposite case, that the individual requirements for your product are so special that you cannot factor out any common behavior. I'd see that as a hypothesis to be tested.)
tbrownaw 2 days ago [-]
> decoupled from the horrors of an unreliable network
The first rule of network transparency is: the network is not transparent.
> Or: I’ve yet to see a code base that has maintained a separate in-memory index for data they are querying
Is boost::multi_index_container no longer a thing?
Also there's SQLite with the :memory: database.
And this ancient 4gl we use at work has in-memory tables (as in database tables, with typed columns and any number of unique or not indexes) as a basic language feature.
anonyfox 2 days ago [-]
In Elixir/Erlang that's quite common I think, at least I do this when performance matters. Put the specific subset of commonly used data into an ETS table (= in-memory cache, allowing concurrent reads) and have a GenServer (which owns that table) listen to certain database change events to update the data in the table as needed.
Helps a lot with high read situations and takes considerable load off the database with probably 1 hour of coding effort if you know what you're doing.
TeMPOraL 2 days ago [-]
> Is boost::multi_index_container no longer a thing?
Depends on the shop. I haven't seen one in production so far, but I don't doubt some people use it.
> Also there's SQLite with the :memory: database.
Ah, now that's cheating. I know, because I did that too. I did that because of the realization that half the members I'm stuffing into classes to store my game state are effectively poor man's hand-rolled tables, indices and spatial indices, so why not just use a proper database for this?
> And this ancient 4gl we use at work has in-memory tables (as in database tables, with typed columns and any number of unique or not indexes) as a basic language feature.
Which one is this? I've argued in the past that this is a basic feature missing from 4GL languages, and a lot of work in every project is wasted on hand-rolling in-memory databases left and right, without realizing it. It would seem I've missed a language that recognized this fact?
(But then, so did most of the industry.)
tbrownaw 2 days ago [-]
> Which one is this? I've argued in the past that this is a basic feature missing from 4GL languages, and a lot of work in every project is wasted on hand-rolling in-memory databases left and right, without realizing it. It would seem I've missed a language that recognized this fact?
I've been very curious about electric -- the idea of giving your application a replicated subset of your database, using your API as a proxy, is quite interesting for apps where the business layer between the db and the client is thin (our case).
edit: Also their decision to make it just one-way sync makes a LOT of sense. Write access brings a lot of scary cases, so making it read-only sync eases some of my anxieties. I can still use REST/RPC for updating the data.
mentalgear 2 days ago [-]
Convex I didn't know yet - looks really crisp (even has Svelte support)!
Do you have experience with it? Does it support (decentralized) E2E?
aboodman 2 days ago [-]
No, Convex is a client/server system like zero, electric, instant, powersync.
If you want a fully decentralized system, check out jazz. It is the best of these currently IMO.
hop_n_bop 2 days ago [-]
More the basis for building your own backend sync solution in Go than a complete product, this library does an rsync-like protocol to minimize data transferred to sync up two filesystems; it's a very general building block:
> have a theory that every major technology shift happened when one part of the stack collapsed with another.
If that was true, we would ultimately end up with a single layer. Instead I would say that major shifts happen when we move the boundaries between layers.
The author here proposes to replace servers by synced client-side data stores.
That is certainly a good idea for some applications, but it also comes with drawbacks. For example, it would be easier to avoid stale data, but it would be harder to enforce permissions.
worthless-trash 2 days ago [-]
I feel like this is the "serverless" discussion all over again.
There was still a server, it's just not YOUR server. In this case, there will still be servers, just maybe not something that you need to manage state on.
This misnaming creates endless conflict when trying to communicate this with hyper excited management who want to get on the latest trend.
Can't wait to be in the meeting hearing: "We don't need servers when we migrate to client-side data stores".
TeMPOraL 2 days ago [-]
I think the management isn't hyper-excited about naming - in fact, they couldn't care less about what the name means (it's just a buzzword). What they're excited about is what the thing does - which is, turn more capex into opex. With "cloud", we can subscribe to servers instead of owning them. With "serverless", we can subscribe directly to what servers do, without managing servers themselves. Etc.
Diederich 2 days ago [-]
Recently, something quite rare happened. I needed to Xerox some paper documents. Well, such actions are rare today, but years ago, it was quite common to Xerox things.
Over time, the meaning of the word 'Xerox' changed. More specifically, it gained a new meaning. For a long time, Xerox only referred to a company named in 1961. Some time in the late 60s, it started to be used as a verb, and as I was growing up in the 70s and 80s, the word 'Xerox' was overwhelmingly used in its verb form.
Our society decided as a whole that it was ok for the noun Xerox to be used as a verb. That's a normal and natural part of language development.
As others have noted, management doesn't care whether the serverless thing you want to use is running on servers or not. They care that they don't have to maintain servers themselves. CapEx vs OpEx and all that.
I agree that there could be some small hazard with the idea that, if I run my important thing in a 'serverless' fashion, then I don't have to associate all of the problems/challenges/concerns I have with 'servers' to my important thing.
It's an abstraction, and all abstractions are leaky.
If we're lucky, this abstraction will, on average, leak very little.
philsnow 2 days ago [-]
> Over time, the meaning of the word 'Xerox' changed. More specifically, it gained a new meaning. For a long time, Xerox only referred to a company named in 1961. Some time in the late 60s, it started to be used as a verb, and as I was growing up in the 70s and 80s, the word 'Xerox' was overwhelmingly used in its verb form.
https://www.youtube.com/watch?v=PZbqAMEwtOE#t=5m58s I don't think this dramatization (of court proceedings from 2010) is related to Xerox's plight with losing their trademark, but said dramatization is brilliant nonetheless
zx8080 2 days ago [-]
> decoupled from the horrors of an unreliable network
There's no such thing as a reliable network in the world. The world is network-connected; there are almost no local-only systems anymore (and haven't been for a long, long time).
Some engineers dream that there are cases where the network is reliable, like when a system fully lives in the same region and a single AZ. But even then it's actually not reliable and can have glitches quite frequently (like once per month or so, depending on luck).
01HNNWZ0MV43FF 2 days ago [-]
True. Even the network between the CPU and an SD card or USB drive is not reliable
jimbokun 2 days ago [-]
I believe the point is that given an unreliable network, it's nice to have access to all the data available locally up to the point when you had a network issue. And then when the network is working again, your data comes up to date with no extra work on the application developer's part.
tonsky 2 days ago [-]
> There's no such thing as a reliable network in the world
I’m not saying there is
PaulHoule 2 days ago [-]
Lotus Notes was a product far ahead of its time (nearly forgotten today) which was an object database with synchronization semantics. They made a lot of decisions that seem really strange today, like building an email system around it, but that empowered it for long-running business workflows. It's something everybody in the low-code/no-code space really needs to think about.
myflash13 2 days ago [-]
Locally synced databases seem to be a new trend. Another example is Turso, which works by maintaining a sort of SQLite-DB-per-tenant architecture. Couple that with WASM and we’ve basically come full circle back to old school desktop apps (albeit with sync-on-load). Fat client thin client blah blah.
skybrian 2 days ago [-]
This is also a tricky UI problem. Live updates, where web pages move around on you while you’re reading them, aren’t always desirable. When you’re collaborating with someone you know on the same document, you want to see edits immediately, but what about a web forum? Do you really need to see the newest responses, or is this a distraction? You might want a simple indicator that a reload will show a change, though.
A white paper showing how Instant solves synchronization problems might be nice.
slifin 2 days ago [-]
I'm surprised to see Tonsky here
Mostly because I consider the state of the art on this to be Clojure Electric, and he presumably is aware of it at least to some degree but does not mention it.
tonsky 2 days ago [-]
Clojure Electric is different. It's not really sync, it's more of a thin client. It relies on having a fast connection to the server at all times, and re-fetches everything all the time. Their innovation is that they found a really, really ergonomic way to do it.
dustingetz 2 days ago [-]
Electric's network state distribution is fully incremental. I'm not sure what you mean by "re-fetches everything all the time", but that is not how I would describe it.
If you are referring to virtual scroll over large collections - yes, we use the persistent connection to stream the window of visible records from the server in realtime as the user scrolls, affording approximately realtime virtual scroll over arbitrarily large views (we target collections of size 500-50,000 records and test at 100ms artificial RT latency, my actual prod latency to the Fly edge network is 6ms RT ping), and the Electric client retains in memory precisely the state needed to materialize the current DOM state, no more no less. Which means the client process performance is decoupled from the size of the dataset - which is NOT the case for sync engines, which put high memory and compute pressure on the end user device for enterprise scale datasets. It also inherits the traditional backend-for-frontend security model, which all enterprise apps require, including consumer apps like Notion that make the bulk of their revenue from enterprise citizen devs and therefore are exposed to enterprise data security compliance. And this is in an AI-focused world where companies want to defend against AI scrapers so they can sell their data assets to foundation model providers for use in training!
Which IMO is the real problem with sync engines: they are not a good match for enterprise applications, nor are they a good match for hyper scale consumer saas that aspire to sell into enterprise. So what market are they for exactly?
quotemstr 2 days ago [-]
Clojure Electric is proprietary software, which disqualifies it immediately no matter its other purported benefits
mananaysiempre 2 days ago [-]
I'm also surprised, but more because I remember very vividly his previous post on sync [1], which described a much more user-friendly (and much less startup-friendly) system.
Thank you for mentioning it! I have been reading a lot about sync engines and never saw Clojure Electric mentioned here on HN!
ForTheKidz 2 days ago [-]
> You’ll get your data synced for you
How does this happen without an interface for conflict resolution? That's the hard part.
phito 2 days ago [-]
Right, the first thing I did after opening the article was CTRL-F for "conflict", and got zero results. How are they not talking about the only real problem with the local-first approach? The rest is just boilerplate code.
All this recent hype about sync engines and local first applications completely disregards conflict resolution. It's the reason syncing isn't mainstream already; it isn't solved and arguably cannot be.
Imagine if git just on its own picked what to keep and what to throw away when there's a conflict. You fundamentally need the user to make the choice.
aboodman 2 days ago [-]
Zero (zerosync.dev) uses transactional conflict resolution, which is what our prior products Replicache and Reflect both used. It is very similar to what multiplayer games have done for decades.
It works really well and we and our customers have found it to be quite general.
It allows you to run an arbitrary transaction on the server side to decide what to do in case of conflicts. It is the software equivalent of git asking the user what to do. Zero asks your code what to do.
But it asks it in the form of the question "please run the function named x with these inputs on the current backend db state". Which is a much more ergonomic way to ask it than "please do a 3-way merge between these three states".
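Conceptually, the shape is something like this (a simplified sketch of the idea, not the real API surface): the same named mutator runs optimistically on the client and is replayed on the server against the authoritative state, and whatever the function decides is the resolution.

    type Db = Map<string, any>;
    type Mutator = (db: Db, args: any) => void;

    // The same mutator definitions are shared by client and server.
    const mutators: Record<string, Mutator> = {
      claimTask(db, { taskId, userId }) {
        const task = db.get(taskId);
        if (!task || task.assignee) return;   // conflict policy lives in app code
        db.set(taskId, { ...task, assignee: userId });
      },
    };

    // Client: apply optimistically against the local replica, queue for upload.
    // Server: replay the named mutation, in arrival order, against the
    // authoritative state.
    function applyMutation(db: Db, name: string, args: any) {
      mutators[name]?.(db, args);
    }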
Conflict resolution is not the reason why there has not been a general-purpose sync engine. None of our customers have ~ever complained about conflict resolution.
The reason there has not been a general-purpose sync engine is actually on the read side:
- Previous sync engines really want you to sync all data. This is impractical for most apps.
- Previous sync engines do not have practical approaches to permissions.
These problems are being solved in next generation of sync engines.
I think with good presence (being able to see what other users are doing) and an app that isn't used offline, conflicts are essentially not a problem. As long as whatever is resolving the conflicts resolves them in a way that doesn't break the app, e.g. making sure there aren't cycles in some multiplayer app with a tree data structure. Sounds like Zero has the right idea here; I'll build something on it imminently to try it out.
Sammi 2 days ago [-]
Agree that if you don't have offline support, then conflict resolution is such a minor issue that you can just do "last write wins" and call it a day.
Sammi 2 days ago [-]
"It is the software equivalent of git asking the user what to do. Zero asks your code what to do."
You are asking the dev what to do. You are _not_ asking the user what to do. This is akin to the git devs baking a choice into git on what to keep in a merge conflict.
It's hard to trust you guys when you misrepresent like this. I thought long and hard on whether to respond confrontationally like this, but decided you really need to hear the push back on this.
aboodman 2 days ago [-]
lol wut?
I represented that we ask the dev what to do:
> Zero asks your code what to do
You agree that's what we do:
> You are asking the dev what to do. You are _not_ asking the user what to do.
I get that your actual issue is you don't think that what we do is "the software equivalent of git asking the user what to do". But like, I also said what we do concretely in the same paragraph. It's not like I was trying to hide something. This is a metaphor for how to understand our approach to conflict resolution that works for most developers. Like all metaphors it is not perfect.
FWIW, there is nothing stopping a developer from having this function just save off a forked copy and ask the user what to do. Some developers do this.
Also FWIW, Zero does not allow offline writes specifically because we want to educate people how to properly handle conflicts before we do. I see down-thread this is the majority of your concern.
Sammi 2 days ago [-]
I assumed you were doing offline support, yeah. I've heard a lot about local first development lately, so I guessed this was what you guys are tackling too.
Without offline support AND you're doing real-time updating of data, then conflict resolution is not a real-world practical concern. Users will be looking at the same data at the same time anyway, so they generally see which data won out in case of a conflict, as they are looking at real-time data as they are editing.
IF you had offline support, and for other sync engines that do: there is a real and meaningful difference between a backend dev and an end user of the application choosing what to do in case of a conflict. A backend dev cannot make a general-case algorithm that knows what two end users want to keep or throw away in a conflict, because this is completely situational - users could be doing whatever. And if you push the conflict resolution to the end users, then you are asking a lot of those users. They need to be technically inclined and motivated people in order to take the time to understand and resolve the conflict. Like with git users.
aboodman 2 days ago [-]
> Without offline support AND you're doing real-time updating of data, then conflict resolution is not a real-world practical concern.
I disagree with this. There are many real-world cases where keywise LWW does the wrong thing. The article I linked up-thread covers many of them. Even a simple counter does the wrong thing.
This is where robust conflict resolution really matters in these systems, not the long-time offline case people often ask about.
You need robust conflict resolution to make correct software and maintain invariants in the face of write/write conflicts.
> A backend dev cannot make a general-case algorithm that knows what two end users want to keep or throw away in a conflict, because this is completely situational - users could be doing whatever. And if you push the conflict resolution to the end users, then you are asking a lot of those users. They need to be technically inclined and motivated people in order to take the time to understand and resolve the conflict. Like with git users.
I agree completely. In my opinion the ideal offline-first write/write UI has never been built, but the team at Ink & Switch are closest:
I think the perfect UX in many cases is that syncs goes ahead and tries to land the offline writes, but the user has a history UI where they can see what happened. Like how many collaborative apps do today.
But importantly in this UI the app would represent branches and merges. But unlike Git's fine grained branch/merge points, in this UI it would literally represent points where people went offline and made changes.
Users could then go back and recover the version of their data from when they were offline, or compare (probably manually in two tabs) the two different versions of the data and recover.
This does still ask users to compare and resolve conflicts in the worst case, but it is not a blocking operation or one that is final. The more common case is the user will go ahead with the merge and sometimes find some corruption. They can always go back and see what went wrong after the fact and fix. This seems like the right tradeoff to me of making the common case (no conflict) easy and automatic but making the uncommon but scary case at least not dangerous.
There also needs to be clear first-class UX telling users that they're going offline and what will happen when they come online.
I'm looking forward to someday working on this, but it's not what our users ask about most often so we're just disabling offline writes for now.
probabletrain 2 days ago [-]
> Previous sync engines really want you to sync all data
Linear had to do all sorts of shenanigans to be able to sync all data, for orgs with lots of it – there's a talk on that here:
No. When we started the project we used prolly trees initially, but we don't need the content addressing feature for Zero and all the hashing was quite expensive. So now we just use a plain immutable b-tree:
> All this recent hype about sync engines and local first applications completely disregards conflict resolution
The main concern of sync engines is precisely the conflict resolution! Everything else is simple in comparison.
The good news is that under some circumstances it is possible to solve conflicts without user intervention. The simplest example is a counter that can only be incremented.
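A grow-only counter is a handy concrete example (toy sketch): each replica increments only its own slot, and merging takes the per-replica maximum, so concurrent increments never conflict.

    type GCounter = Map<string, number>; // replicaId -> count

    function increment(c: GCounter, replicaId: string): void {
      c.set(replicaId, (c.get(replicaId) ?? 0) + 1);
    }

    function mergeCounters(a: GCounter, b: GCounter): GCounter {
      const out = new Map(a);
      for (const [id, n] of b) out.set(id, Math.max(out.get(id) ?? 0, n));
      return out;
    }

    function total(c: GCounter): number {
      return [...c.values()].reduce((sum, n) => sum + n, 0);
    }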
More advanced data structures that automatically solve conflicts exist, for example for strings, and those are good enough for a text editor.
I agree that there will be conflicts that are resolved in a way that yields nonsensical text, for example if there are 2 edits of the sentence "One cat":
One cat => Two cats
One cat => One dog
The resulting merge may be something like "Two cats dog".
Something else (the user, an LLM...) will then have to fix it.
But that's totally OK, because in practice this will happen extremely rarely, only when the user would have been offline for a long time. That user will be happy to have been able to work offline, which largely compensates for the fact that they have to proofread the text again.
SkiFire13 2 days ago [-]
This doesn't "solve" conflict resolution, it just picks one of the possible answers and then doesn't care whether it was the correct one or not.
It can be acceptable for some use cases, but not for others where you're still concerned about stuff that happens "extremely rarely" and is not under your direct control.
> Something else (the user, an LLM...) will then have to fix it.
This assumes that user/llm knows the conflict was automatically solved and might need to be fixed, so the conflict is still there! You just made the manual part delayed and non-mandatory, but if you want correctness it will still have to be there.
brulard 2 days ago [-]
> in practice this will happen extremely rarely, only when the user would have been offline for a long time.
I don't think it would happen "extremely rarely". Drops in connectivity happen a lot, especially on cellular connection and this can absolutely happen a lot for some applications. Especially when talking about "offline first" apps.
Jyaif 2 days ago [-]
You have to use another device during that drop of connectivity on cellular connection, and edit the same content. That doesn't happen often.
sgt 2 days ago [-]
> All this recent hype about sync engines and local first applications completely disregards conflict resolution.
Not really true though. I've used a couple of local sync engines, one internally built and another one which is both commercial and now open source called PowerSync[1]. Conflict resolution is definitely on the agenda, and a developer is definitely going to be mindful of conflicts when designing the application.
My unfortunate point is that the dev cannot know what the user is doing, and so cannot in principle know what choice to make on behalf of the user in case of a conflict. This is not a code problem. It cannot be solved with code.
sgt 2 days ago [-]
I've found that in almost all cases - the latest update "wins" strategy is fine. You could have two sessions working with conventional API calls and still have a conflict. As a dev you need to restrict what the user can do.
porridgeraisin 2 days ago [-]
Precisely. The hype articles write all about the journey to The Wall, and then leave out the bit where you smash headfirst into it.
lifty 2 days ago [-]
Very good point. The local-sync ecosystem is still in a young phase, and conflict resolution hasn't been tackled or solved yet. Most systems have a "last write wins" approach.
jamil7 2 days ago [-]
> All this recent hype about sync engines and local first applications
Kind of but only really in the web world, it was the default on desktop for a long time and is pretty common on mobile.
tonsky 2 days ago [-]
Ah, no. Not really. People sometimes think about conflict resolution as a problem that needs to be solved. But it’s not solvable, not really. It’s part of the domain, it’s not going anywhere, it’s irreducible complexity.
You _will_ have conflicts (because your app is distributed and there are concurrent writes). They will happen at the semantic level, so only you (the app developer) _will_ be able to solve them. A database (or any other magical tool) can’t do it for you.
Another misconception is that conflict resolution needs to be “solved” perfectly before any progress can be made. That is not true as well. You might have unhandled conflicts in your system and still have a working, useful, successful product. Conflicts might be rare, insignificant, or people (your users) will just correct for/work around them.
I am not saying “drop data on the floor”, of course, if you can help it. But try not to overthink it, either.
DaiPlusPlus 2 days ago [-]
> But it’s not solvable, not really. It’s part of the domain, it’s not going anywhere, it’s irreducible complexity. You _will_ have conflicts (because your app is distributed and there are concurrent writes). [...] Another misconception is that conflict resolution needs to be “solved” perfectly before any progress can be made. That is not true as well. You might have unhandled conflicts in your system and still have a working, useful, successful product. Conflicts might be rare, insignificant, or people (your users) will just correct for/work around them.
I can't speak for whatever application-level problems you were trying to solve, but many problem cases can be massaged into being conflict-free by adding constraints (or rather: discovering constraints inherent in the business domain that you can use). The best example is to use an append-only logical model: then the synchronization problem reduces down to a merge sort. Another kind of constraint might be to simply disallow "edit" access to local data when working offline (without a prior lock or lease being taken) but still allow "create".
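A toy sketch of how far the append-only constraint gets you: with immutable events carrying unique ids, sync is just a union of the two logs plus a deterministic sort (illustrative only).

    type Event = { id: string; ts: number; payload: unknown };

    function mergeLogs(a: Event[], b: Event[]): Event[] {
      const byId = new Map<string, Event>();
      for (const e of [...a, ...b]) byId.set(e.id, e); // dedupe by unique id
      return [...byId.values()].sort(
        (x, y) => x.ts - y.ts || x.id.localeCompare(y.id), // deterministic order
      );
    }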
> Database (or any other magical tool) can’t do it for you.
Yes-and-no.
While I'm no fan of CORBA and COM+ (...or SOAP, or WS-OhGodMakeItStop), being "enterprise-y" meant they brought distributed transactions to any application, and that includes RDBMS-mediated distributed transactions (let's agree, an RDBMS is in a far better position to be the canonical transaction server than an application server running in front of it). For distributed systems needing transient distributed locks to prevent conflicts in the first place (so only used by interactive users on the same LAN, really) this worked just as well as a local-only solution - and made it fault-tolerant too.
...so it is unfortunate that with the (absolutely justified) back-to-basics approach of REST [1] we lose built-in support for distributed transactions (even some of the more useful and legitimate parts of WebDAV (and so, piggy-backing on our web servers' built-in support for WebDAV verbs) seem to be going away). This all raises the barrier to entry for doing distributed transactions _right_, which means the next set of college hires won't have been exposed to it, which means it won't be a standard expected feature in the next major internal application they'll write for your org, which means you'll either have a race condition impacting a multi-billion-dollar business thing that no one knows how to fix or, more likely, just a crappy UX where you have to tell your users not to reload the page too quickly "just in case". Yes, I see advisories like that in the Zendesk pages of the next line-of-business SaaS you'll be voluntold to integrate into your org.
(I think today, the "best" way to handle distributed-locking between interactive-users in a web-app would necessitate using a ServiceWorker using WebRTC, SSE, or a highly-reliable WebSocket - which itself is a load of work right there - and don't forget to do all your JS feature-checks because eventually someone will try to use your app on an old Safari edition because they want to keep on using their vintage Mac) - or anyone using Incognito mode, _gah_.
Have been using Instant for a few side projects recently and it has been a phenomenal experience. 10/10, would build with it again. I suspect this is also at least partially true of client-server sync engines in general.
kenrick95 2 days ago [-]
I concur with this. Been using it on my side project that only has a front-end. The "back-end" is 100% InstantDB. Although for me, I found the permissions part a bit hard to understand, especially when it involves linking to other namespaces. Haven't checked them for a while, maybe they've improved on this...
zelon88 2 days ago [-]
Here's an idea.... Stop putting your critical business data on disparate third party systems that you don't have access to. Problem solved!
joeeverjk 2 days ago [-]
If sync really is the future, do you think devs will finally stop pretending local-first apps are some niche thing and start building around sync as the core instead of the afterthought? Or are we doomed to another decade of shitty conflict resolution hacks?
Zanfa 2 days ago [-]
> Or are we doomed to another decade of shitty conflict resolution hacks?
Conflict resolution is never going away. It's important to distinguish between syntactic and semantic conflicts though: the first can be solved, but the other will always require manual intervention.
Tobani 2 days ago [-]
I think this makes sense for applications that are just managing data, maybe? But if your application needs to do things when you change that data (like call out to a third-party system)... syncing is maybe not the solution. What happens when the total dataset is large: do you need to download 6GB of data every time you log in? Now you've blown up the quota on local storage. How do you make sure the appropriate data, or enough data, is downloaded? How do you prioritize the data you need NOW instead of waiting for that last byte of the 6GB to download?
It seems like a useful tool, but not the only future.
Nelkins 2 days ago [-]
Discussion of sync engines typically goes hand in hand with local-first software. But it seems to be limited to use cases when the amount of data is on the smaller side. For example, can anyone imagine how there might be a local-first version of a recommendation algorithm (I'm thinking something TikTok-esque)? This would be a case where the determination of the recommendation relies on a large amount of data.
Or think about any kind of large-ish scale enterprise SaaS. One of the clients I'm working with currently sells a Transportation Management Software system (think logistics, truck loads, etc). There are very small portions of the app that I can imagine relying on a sync engine, but being able to search over hundreds of thousands of truck loads, their contents, drivers, etc seems like it would be infeasible to do via a sync engine.
I mention this because it seems that sync engines get a lot of hype and interest these days, but they apply to a relatively small subset of applications. Which may still be a lot, but it's a bit much to say they're the future (I'm inferring "of application development"--which is what I'm getting from this article).
ochiba 2 days ago [-]
I think that is where sync engines come in that allow arbitrary hybrid queries (across local and remote data) and then keep the results of those hybrid queries in sync on the client.
This is one of the ideas that appears to be central to the genesis of Zero [1]
ElectricSQL allows for a similar pattern and PowerSync is also working on this [2]
Edit: I watched the presentation (which I really enjoyed) and also read the blog post. For anyone with less time, the answer is essentially: don't sync everything, treat the local data like a cache. Sync as much as you can into that cache, and then reach out to the server for other things.
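A sketch of that "local data as a cache" read path (the class and helper names are hypothetical, not any particular engine's API):

// Hybrid read path: answer from the local cache when possible, otherwise
// fall back to the server and populate the cache on the way back.
interface Row { id: string; [key: string]: unknown }

class HybridStore {
  private cache = new Map<string, Row>();

  constructor(private fetchRemote: (id: string) => Promise<Row | null>) {}

  async get(id: string): Promise<Row | null> {
    const hit = this.cache.get(id);
    if (hit) return hit;                    // served locally, works offline
    const row = await this.fetchRemote(id); // reach out for the long tail
    if (row) this.cache.set(id, row);
    return row;
  }

  // Called by the sync layer when the server pushes a change to a cached row.
  applyServerChange(row: Row): void {
    if (this.cache.has(row.id)) this.cache.set(row.id, row);
  }
}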
paduc 2 days ago [-]
Before I write anything to the DB, I validate with business logic.
Should I write this logic in the DB itself? Seems impractical.
TeMPOraL 2 days ago [-]
> Should I write this logic in the DB itself?
Yes?
If it sounds impractical, it's because the whole industry got used to not learning databases beyond the most basic SQL, and doing everything by hand in application code itself. But given how much of the code in most applications is just an ad-hoc reimplementation of databases, and how much of the business logic is tied to data rather than application-specific things, I can't help but wonder - maybe a better way would be to treat the RDBMS as an application framework and have the application itself be a thin UI layer on top?
On paper it definitely sounds like grouping concerns better.
brulard 2 days ago [-]
While stored procedures/triggers etc. can be powerful, it has been taught for decades now that putting business logic in the RDBMS is an antipattern (for more or less valid reasons). Some concerns I would have are vendor lock-in and the limits of the provided language.
Tobani 2 days ago [-]
In very simple systems that makes sense. But as soon as your validation requires talking to a third party, or you have side effects like sending emails, you suddenly have to move all that logic back out. You end up with a system that isn't very easy to iterate on.
Nextgrid 2 days ago [-]
You can model external system interactions with tables representing "mailboxes" - so for example if a DB stored procedure needs to call a third-party API to create a resource, it writes a row in the "outbox" table for that API, then application-level code picks that up, makes the API call, parses the response (extracts the required fields) and stores it in an "inbox" table so now the database has access to the response (and a trigger can run the remainder of the business process upon insertion of that row).
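A sketch of the application-level worker half of that outbox/inbox pattern (the table names and the query/callApi helpers are hypothetical):

// Drain the "outbox" rows written by the stored procedure, call the external
// API, and write the response into the "inbox" so a trigger can continue the flow.
async function drainOutbox(
  query: (sql: string, params?: unknown[]) => Promise<any[]>,
  callApi: (payload: unknown) => Promise<unknown>,
): Promise<void> {
  const pending = await query(
    "SELECT id, payload FROM api_outbox WHERE processed_at IS NULL ORDER BY id LIMIT 10");
  for (const row of pending) {
    const response = await callApi(row.payload); // the actual third-party call
    await query(
      "INSERT INTO api_inbox (outbox_id, response) VALUES ($1, $2)",
      [row.id, JSON.stringify(response)]);       // insert trigger resumes the business process
    await query("UPDATE api_outbox SET processed_at = now() WHERE id = $1", [row.id]);
  }
}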
Tobani 2 days ago [-]
Yes, but then you've abandoned the parent comment's assertion that everything should be done by the RDBMS. And you've changed the contract of the action.
TeMPOraL 2 days ago [-]
Surely some RDBMS has the ability to run REST queries, possibly via SQL by pretending it's a table or something.
I can imagine that working on a good day. I don't dare imagine error handling (though would love to look at examples).
Ultimately, it probably makes no sense to do everything in the database, but I still believe we're doing way too much in the application, and too little in the DB. Some of the logic really belongs to data (and needs to be duplicated for any program using the same data, or else...; probably why people don't like to share databases between programs).
And, at a higher level, I wonder how far we could go if we pushed all data-specific logic into the DB, and the rest (like REST calls) into dedicated components, and used a generic orchestrator to glue the parts together? What of the "application code" would remain then, and where would it sit?
lloeki 2 days ago [-]
> treat RDBMS as an application framework and have application itself be a thin UI layer on top?
Stored procedures have been a thing. I've seen countless apps that had a thin VB UI and a MSSQL backend where most of the logic is implemented. Or, y'know, Access. Or spreadsheets even!
And before that AS/400&al.
But ORMs came in and the impedance mismatch is then too great. Splitting data wrangling across two completely differing points of views makes it extremely hard to reason about.
tonsky 2 days ago [-]
If you think of an existing database, like Postgres, sure. It’s not very convenient.
What I am saying is, in a perfect world, the database and the server would be one and the same, running code _and_ holding data at the same time. There's really no good reason why they are separated, and it causes a lot of inconvenience right now.
Tobani 2 days ago [-]
Sure, in an ideal world we don't need to worry about resources and everything is easy. There are very good reasons why they are separated now. There have been systems like 4th dimension and K that have combined them for decades. They're great for systems of a certain size, but they do struggle once their workload is heavy enough, and they seem to struggle to scale out. Being able to update my application without updating the storage engine reduces risk. Having standardized backup solutions for my RDBMS means a whole level of effort I don't have to worry about. Data storage can even be optimized without my application having to be updated.
Terr_ 2 days ago [-]
> logic in the DB
Something similar but in the opposite direction of lessening DB-responsibilities in favor of logic-layer ones: Driving everything from an event log. (Related to CQRS, Event-Sourcing.)
It means a bit less focus on "how do I ensure this data-situation never ever ever happens" logic, and a bit more "how shall I model escalation and intervention when weird stuff happens anyway."
This isn't as bad as it sounds, because any sufficiently old/large software tends to accrue a bunch of informal tinkering processes anyway. It's what drives the unfortunate popularity of DB rows with a soft-deleted mark (that often require manual tinkering to selectively restore) because somebody always wants a special undo which is never really just one-time-only.
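For readers unfamiliar with the pattern, a minimal event-sourcing sketch (the event names are made up; note how the "special undo" becomes just another event rather than a soft-delete flag):

// The log is the source of truth; current state is a fold over it.
type Event =
  | { type: "ItemAdded"; id: string; title: string }
  | { type: "ItemArchived"; id: string }
  | { type: "ItemRestored"; id: string };

interface Item { id: string; title: string; archived: boolean }

function replay(log: Event[]): Map<string, Item> {
  const items = new Map<string, Item>();
  for (const e of log) {
    switch (e.type) {
      case "ItemAdded":
        items.set(e.id, { id: e.id, title: e.title, archived: false });
        break;
      case "ItemArchived": {
        const it = items.get(e.id);
        if (it) it.archived = true;   // tolerate weird states instead of forbidding them
        break;
      }
      case "ItemRestored": {
        const it = items.get(e.id);
        if (it) it.archived = false;
        break;
      }
    }
  }
  return items;
}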
scotty79 2 days ago [-]
I think that's the main issue. It's not enough to have a database that can automatically sync between frontend and backend. It would also need to be complex enough to keep some logic just on the backend (because you don't want to reveal it or trust the client to adhere to it) and to reject changes made on the frontend if they are invalid. The database would become the app itself.
acac10 2 days ago [-]
Which many DBs allow:
- stored procedures
- Oracle PL/SQL
I used to work for Oracle but never liked that approach.
scotty79 2 days ago [-]
I don't think a stored procedure that operates only on the master copy of the database can reject an update coming from a second copy and nicely communicate that this happened, so that the other copy can inform the user through some UI.
Sammi 2 days ago [-]
The issue with stored procedures is testing and code maintenance. How do I run unit tests? How do I version control and code review?
TeMPOraL 2 days ago [-]
It's the same issue that killed image-based programming in favor of the edit-compile-run cycle we're all doing. "How do I test? How do I do version control? How do I migrate?".
These are valid concerns, but $deity I wish we focused on finding solutions for them, because the current paradigm of edit/compile/run + a plaintext single-source-of-truth codebase is already severely limiting our ability to build and maintain complex software.
brulard 2 days ago [-]
While I don't like the idea of putting logic in the RDBMS (unless there's a really good reason), you can do unit tests and code reviews. In a serious project you should already have a way to handle migrations and versioning of the DB itself (for example using prisma, drizzle, etc.). Procedures would be just another entry in the migrations, and unit tests can create a temporary test DB, run the procedures and compare the results. I agree the tooling is (AFAIK) not good and there would be much more work around that, but it is possible.
x0x0 2 days ago [-]
The other issue, from experience, is needing to reimplement logic as well -- you end up with stored procedures that duplicate logic that must also run either on your server or on your client, e.g. "given the state of the system, is this mutation valid?"
Then those multiple implementations inevitably suffer different bugs and drift, leading to really ugly bugs.
theanirudh 1 days ago [-]
How do sync engines address issues where we need something to be more dynamic? Currently I'm building a language learning app and we need to display your "learning path" - what lessons you have finished and what your next lessons are. The next lessons aren't fixed/the same for everyone. They will change depending on the scores of the completed lessons. Is any query language dynamic enough to support use cases like this? Or is it expected to recalculate the next lessons whenever the user completes a lesson and write them out to a table which can then be queried easily?
theanirudh 1 days ago [-]
Seems like a lot of extra work in cases where we change the scoring mechanism: we will then have to invalidate the existing entries, recalculate, and write them out again, compared to just having an endpoint that takes all previous lessons and generates the next lessons on demand.
spankalee 2 days ago [-]
The problem I have with "moving the database to the client" is the same one I have in practice with CRDTs: In my apps, I need to preserve the history of changes to documents, and I need to validate and authenticate based on high-level change descriptions, not low-level DB access.
This always leads me back to operational transforms. Operations, being reified changes, function as undo records; a log of changes; and a narrower, semantically-meaningful API amenable to validation and authz.
For the Roam Firebase example: this only works if you can either trust the client to always perform valid actions, or you can fully validate with Firebase's security rules.
OT has its critiques, but almost all of them fall away in my experience when you have a star topology with a central service that mediates everything - defines the canonical order of operations, performs validation & auth, and records the operation log.
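A sketch of that star-topology core, under the assumption of a single central log (the types and validation hook are illustrative; a real OT server would also transform concurrent operations):

// The central service receives ops, validates them against the log so far,
// assigns the canonical order, and appends. The log doubles as audit/undo history.
interface Op { clientId: string; kind: string; payload: unknown }
interface LoggedOp extends Op { seq: number }

class CentralLog {
  private log: LoggedOp[] = [];

  constructor(private validate: (op: Op, log: readonly LoggedOp[]) => boolean) {}

  submit(op: Op): LoggedOp | { rejected: true } {
    if (!this.validate(op, this.log)) return { rejected: true }; // validation & authz hook
    const logged: LoggedOp = { ...op, seq: this.log.length };    // canonical order = arrival order
    this.log.push(logged);
    return logged;
  }

  opsSince(seq: number): LoggedOp[] {
    return this.log.slice(seq); // what a client needs to catch up
  }
}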
jimbokun 2 days ago [-]
> This always leads me back to operational transforms. Operations, being reified changes, function as undo records; a log of changes; and a narrower, semantically-meaningful API amenable to validation and authz.
Sounds like another kind of synchronization database.
spankalee 2 days ago [-]
I think it's only a database if you come down on the "logs are the source of truth, not tables" side of the logs vs tables debate. And if you do, any log is a database, I guess...
Phelinofist 2 days ago [-]
The largest feature my team develops is a sync engine. We have a distributed speech assistant app (multiple embedded devices [think car and smartphone] & cloud) that utilizes the Blackboard pattern. The sync engine keeps the blackboards on all instances in sync.
It is based on gRPC and uses a state machine on all instances that transitions through different states for connection setup, "bulk sync", "live sync" and connection wind down.
Bulk sync is the state that is used when an instance comes online and needs to catch up on any missed changes. It is also the self-heal mechanism if something goes wrong.
Unfortunately some embedded instances have super unreliable clocks that drift quite a bit (in both directions). We consider switching to a logical clock.
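For reference, a logical (Lamport) clock is a tiny amount of code; a minimal sketch:

// One counter per instance; timestamps order events without trusting wall clocks.
class LamportClock {
  private counter = 0;

  // Call when this instance produces a local event (e.g. a blackboard write).
  tick(): number {
    return ++this.counter;
  }

  // Call when a message stamped by another instance arrives.
  receive(remoteTimestamp: number): number {
    this.counter = Math.max(this.counter, remoteTimestamp) + 1;
    return this.counter;
  }
}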
We have quite a bit of code that deals with conflicts.
I inherited this from my predecessor. Nowadays I would probably not implement something like this again, as it is quite complex.
exceptione 2 days ago [-]
I believe the idea of a Blackboard is that there is a single blackboard for all processes to asynchronously scribble and read from.
Syncing blackboards sounds like going straight against the spirit of that design pattern.
Pamar 2 days ago [-]
Maybe I am just dumb but I really cannot see how data synch could solve what (in my kind of business) is a real problem.
Example: you develop a web app to book for flights online.
My browser points to it and I login.
Should synchronization start right now? Before I even input my departure point and date?
Ok, no. I write NYC -> BER, and a dep date.
Should I start synching now?
Let's say I do. Is this really more efficient than querying a webservice?
Ok, now all data are synched. Even potentially the ones for business class, even if I just need economy.
You know, I could always change my mind later. Or find out that on the day I need to travel no economy seats are available anymore.
Whatever. I have all the inventory data that I need. Raw.
Guess what? As a LH frequent flyer I get special treatment in terms of price. Not just for LH, but most Business Alliance airlines.
This logic is usually on the server, because airlines want maximum creativity and flexibility in handling inventory.
Should we just synch data and make the offer selection algorithm run on the webserver instead?
Let's say it does not matter... I have somehow in front of me all the options for my trip. So I call my wife to confirm she agrees with my choice. I explain the alternatives to her... this takes 5 minutes.
In this period, 367 other people are buying/cancelling trips to Europe. So I either see my selection constantly change (yay! Synchronization!!!) or I press confirm, and if my choice is gone I get a warning message and I repeat my query.
Now add two elements:
- airlines prefer not to show real numbers of available seats - they will usually send you a single digit from 1 to 9 or a "*" to mean "10 or more".
So just synching raw data and letting the combinatorial engine work in the browser is not a very good idea.
Also, I see the potential to easily mount DDoS attacks if every client is constantly being synchronized by copying high-contention tables in real time.
What am I missing here?
earthnail 2 days ago [-]
Your use case doesn’t benefit from your own data. There’s nothing you can do that doesn’t require direct interaction with the server.
I write an audio recording app, and in my app, users have the most to gain from their own data. For most people, syncing is basically an afterthought. In this use case, having your recordings on your phone is the most important thing.
The difference is that in my app, the user generates all the valuable data themselves. In your app, nothing valuable can happen without communication with the airline.
Pamar 2 days ago [-]
Ok, fine. We are in full agreement here.
But then the post's claim that "everything is a synchronization problem" should be qualified better.
Also, most of the comments before mine seemed to be in full agreement that yeah, full synchronization would be a silver bullet, even for cache invalidation.
voidpointer 2 days ago [-]
Probably a silly question, but if you take this all the way and treat everything as a DB that is synchronized in the background, how do you manage access control where not every user/client is supposed to have access to every object represented in the DB? Where does that logic go?
If you do it on the document level like figma or canvas, every document is a DB and you sync the changes that happen to the document, but first you need access to the document/DB. But doesn't this whole idea break apart if you need to do access control on individual parts of what you treat as the DB? Because then you would need to have that logic on the client, which could never be secure...
loquisgon 2 days ago [-]
The local first people (https://localfirstweb.dev/) have some cool ideas about how to solve the data synch problem. Check it out.
profstasiak 2 days ago [-]
so... what do people that want to have sync engines do?
I want to try it for hobby project and I think I will go the route of just one way sync (from database to clients) using electric sql and I will have writes done in a traditional way (POST requests).
I like the idea of having server db and local db in sync, but what happens with writes? I know people say CRDT etc... but they are solving conflicts in unintuitive ways...
I know I probably sound uneducated, but I think the biggest part of this is still solving conflicts in a good way, and I don't really see how you can solve those in a way that works for all different domains and have it "collapsed" as the author says
qudat 2 days ago [-]
The problem with sync engines is needing full-stack buy-in in order for it to work properly. Having a separate backend-for-frontend service defeats the purpose in my mind. So what do you do when a company already has an API and other clients beyond a web app? The web app has to accommodate. I see this as the major downside with sync engines.
The author would be excited to learn that CouchDB has been solving this problem for 20 years.
The use case the article describes is exactly the idea behind CouchDB: a database that is at the same time the server, and that's made to be synced with the client.
You can even put your frontend code into it and it will happily serve it (aka CouchApp).
I think an underappreciated library in this space is Logux [1]
It requires deeper (and more) integration work compared to solutions that sync your state for you, but is a lot more flexible wrt. the backend technology choices.
At its core, it is an action synchronizer. You manage both your local state and remote state through redux-style actions, and the library takes care of syncing and resequencing them (if needed) so that all clients converge at the same state.
Isn't this what CouchDB/PouchDB solves in quite a nice way?
fridder 2 days ago [-]
That was my first thought! https://couchdb.apache.org/ is pretty good, though is it still the incremental views with JS?
paul_h 2 days ago [-]
I always found the documentation lacking; it was never 100% clear what was in couchbase (commercial & OSS) vs couchdb, and which one I really wanted.
rockmeamedee 2 days ago [-]
Idk man. It's a nice idea, but it has to be 10x better than what we currently have to overcome the ecosystem advantages of the existing tech. In practice, people in the frontend world already use Apollo/Relay/Tanstack Query to do data caching and querying, and don't worry too much about the occasional overfetching/unoptimized-ness of the setup. If they need to do a complex join they write a custom API endpoint for it. It works fine. Everyone here is very wary of a "magic data access layer" that will fix all of our problems. Serverless turned out to be a nightmare because it only partially solves the problem.
At the same time, I had a great time developing on Meteorjs a decade ago, which used Mongo on the backend and then synced the DB to the frontend for you. It was really fluid. So I look forward to things like this being tried. In the end though, Meteor is essentially dead today, and there's nothing to replace it. I'd be wary of depending so fully on something so important. Recently Faunadb (a "serverless database") went bankrupt and is closing down after only a few years.
I see the product being sold is pitched as a "relational version of firebase", which I think is a good idea. It's a good idea for starter projects/demos all the way up to medium-sized apps (and might even scale further than firebase by being relational), but it's not "The Future" of all app development.
Also, I hate to be that guy, but the SQL in the example could be simpler: when aggregating into JSON it's nice to use a LATERAL join, which essentially turns the join into a for loop and synthesises rows "on demand":
SELECT g.*,
       COALESCE(t.todos, '[]'::json) as todos   -- empty JSON array when a goal has no todos
FROM goals g
LEFT JOIN LATERAL (
    SELECT json_agg(t.*) as todos               -- aggregate this goal's todos into JSON
    FROM todos t
    WHERE t.goal_id = g.id
) t ON true
That still proves the author's point that SQL is a very complicated tool, but I will say the query itself looks simpler (only 1 join vs 2 joins and a group by) if you know what you're doing.
timita 2 days ago [-]
> Meteor is essentially dead today
Care to explain what you mean by "dead"? Just today v3.2 came out, and the company, the community, and their paid-for hosting service seem pretty alive to me.
finolex 2 days ago [-]
If anyone could be so kind as to give feedback on the local-first x data-ownership DB we're building, I would really appreciate it! https://docs.basic.tech/
Will do my best to take action on any feedback I receive here
shikhar 2 days ago [-]
We have had interest in using our serverless stream API (https://s2.dev/) to power sync engines. Very excited about these kinds of use cases, email in profile if anyone wants to chat.
avodonosov 2 days ago [-]
I never understood why he hasn't implemented a full Datomic Peer for his DataScript.
Having a datalog query engine, supplying it with data from Datomic indexes - b-tree like collections storing entity-attribute-value records - seems simple. Updating the local index cache from log is also simple.
And that gets you a db in browser.
tonsky 2 days ago [-]
It’s not as simple as you make it sound:
- Reliable communication is hard
- Optimistic writes on the client are hard
- Tracking subsets of data is hard (you don't want the entirety of Datomic on the client, do you?)
- Permissions are hard in this model
Why didn't I implement it? Mostly comes down to free time. It's a hobby project and it's hard to find time for it. I also stopped writing web apps so immediate pressure for this went away.
avodonosov 18 hours ago [-]
> - Optimistic writes on the client are hard
This is out of scope - I don't mean a functional equivalent of instantdb. Just a database in browser.
> - Reliable communication is hard
The same, no special requirements. Just send a request, maybe retry several times (with increasing delays), and then give up, throwing an error.
> Tracking subsets of data is hard (you don't want the entirety of Datomic on the client, do you?)
That's the only thing really missing. And it doesn't seem hard. I think Datomic Peer just keeps a fixed number of index pages in cache. Pages missing in the cache are just retrieved from storage.
As a result, the cache keeps the working subset - the elements related to the entities needed for the queries and entity API requests made by the application. Especially since the indexes are ordered (EAVT, AEVT, AVET, VAET), much of the data in the cache will be relevant to the application.
> - Permissions are hard in this model
Permissions are a question, but there are useful applications where permission control is not needed.
Similar to what you say in another comment about conflict resolution: "Another misconception is that conflict resolution needs to be “solved” perfectly before any progress can be made. That is not true as well. You might have unhandled conflicts in your system and still have a working, useful, successful product."
Back in the day when DataScript first appeared and I was eager to see it working with larger-than-memory datasets (and maybe even reading data saved by Datomic by understanding its format), I wanted that so the public could run (read-only) queries on a large database I was assembling that didn't fit into memory.
In some applications all users may have equal write access to the document / data.
Server-side usage of DataScript could be another case that does not require permissions support in the DB. That's how Datomic itself is used.
I am not complaining, and I understand there are limits on what people can do in their free time. You did huge work on your open-source projects. But I regretted that a seemingly small step to open DataScript to out-of-memory data, which I thought would greatly expand its applicability, was missing.
Good luck with instantdb. Hopefully commercial success will allow you to continue putting work into it and improving the tech landscape.
beders 2 days ago [-]
I found it quite disappointing to find a marketing piece from Nikki.
It is full of general statements that are only true for a subset of solutions.
Enterprise solutions in particular are vastly more complex and can't be magically made simple by a syncing database.
(no solution comes even close to "99% business code". Not unless you re-define what business code is)
It is astounding how many senior software engineers or architects don't understand that their stack contains multiple data models and even in a greenfield project you'll end up with 3 or more.
Reducing this to one is possible for simple cases - it won't scale up.
(Rama's attempt is interesting and I hope it proves me wrong)
From: "yeah, now you don't need to think about the network too much" to "humbug, who even needs SQL"
I've seen much bigger projects fail because they fell for one or both of these ideas.
While I appreciate some magic on the front-end/back-end gap, being explicit (calling endpoints, receiving server-sent events) is much easier to reason about.
If we have calls failing, we know exactly where and why.
Sprinkle enough magic over this gap and you'll end up in debugging hell.
Make this a laser focused library and I might still be interested because it might remove actual boilerplate.
Turn it into a full-stack and your addressable market will be tiny.
asdffdasy 2 days ago [-]
> Such a library would be called a database.
bold of them to assume a database can manage even the most trivial of conflicts.
There's a reason you bombard all your writes to a "main/master/etc"
hamilyon2 2 days ago [-]
I am feeling a bit confused. Isn't the stated problem 99.9% solved by decades-old, battle-proven optimistic locking and some careful retries?
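For readers who haven't seen it spelled out, a sketch of that classic loop (the load/storeIfVersion helpers are hypothetical; storeIfVersion stands in for a conditional write such as an UPDATE guarded by a version column):

interface Versioned<T> { value: T; version: number }

// Retry loop: read, mutate, write back only if nobody else wrote in between.
async function updateWithRetry<T>(
  load: () => Promise<Versioned<T>>,
  storeIfVersion: (next: T, expectedVersion: number) => Promise<boolean>,
  mutate: (current: T) => T,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { value, version } = await load();
    const next = mutate(value);
    if (await storeIfVersion(next, version)) return next; // no concurrent write: done
    // lost the race: reload and try again
  }
  throw new Error("gave up after repeated write conflicts");
}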
arkh 2 days ago [-]
The future of webapps: wasm in the browser, direct SQL for the API.
Main problem? No result caching but that's "just" a middleware to implement.
TeMPOraL 2 days ago [-]
Also the past of webapps. We don't have that because doing this properly, in a way that's maximally useful and ergonomic for the users, pretty much kills the entire business of the web. If you give direct SQL access to the underlying data, you can no longer seek rent by putting a bloated, barely-functional app in front of the database, nor can you use it to funnel users or upsell them stuff. Most of the money in this industry is made from rent-seeking.
mike_hearn 2 days ago [-]
I recently took a part time role at Oracle Labs and have been learning PL/SQL as part of a project. Seeing as Niki is shilling for his employer, perhaps it's OK for me to do the same here :) [1]. HN discourse could use a bit of a shakeup when it comes to databases anyway. This may be of only casual interest to most readers, but some HN readers work at places with Oracle licenses and others might be surprised to discover it can be cheaper than an AWS managed Postgres [2].
It has a couple of features relevant to this blog post.
The first: Niki points out that in standard SQL producing JSON documents from relational tables is awkward and the syntax is terrible. This is true, so there's a better syntax:
CREATE JSON RELATIONAL DUALITY VIEW dept_w_employees_dv AS
SELECT JSON {'_id' : d.deptno,
'departmentName' : d.dname,
'location' : d.loc,
'employees' :
[ SELECT JSON {'employeeNumber' :e.empno,
'name' : e.ename}
FROM employee e
WHERE e.deptno = d.deptno ]
}
FROM department d WITH UPDATE INSERT DELETE;
It makes compound JSON documents from data stored relationally. This has three advantages: (1) JSON documents get materialized on demand by the database instead of requiring frontend code to do it, (2) the ORDS proxy server can serve these over HTTP via generic authenticated endpoints (e.g. using OAuth or cookie based auth) so you may not need to write any code beyond SQL to get data to the browser, and (3) the JSON documents produced can be written to, not only read.
The second feature is query change notifications. You can issue a command on a connection that starts recording the queries issued on it and then get a callback or a message posted to an MQ when the results change (without polling). The message contains some info about what changed. So by wiring this up to a web socket, which is quite easy, the work of an hour or two in most web frameworks, then you can stream changes to the client directly from the database without needing much logic or third party integrations. You either use the notification to trigger a full requery and send the entire result json back to the browser, or you can get fancier and transform the deltas to json subsets.
It'd be neat if there was a way to join these two features together out of the box, but AFAIK if you want full streaming of document deltas to the browser and reconstituting them there, it would need a bit more on top.
Again, you may feel this is irrelevant because doesn't every self-respecting HN reader use Postgres for everything, but it's worth knowing what's out there. Especially as the moment you decide to pay a cloud for hosting your DB you have crossed the Rubicon anyway (all the hosted DBs are proprietary forks of Postgres), so you might as well price out alternatives.
[1] and you know the drill, views are my own and nobody has reviewed this post.
Lots of good people work at Oracle, and I am sure you're one of them.
HOWEVER. There is no world where the lifetime costs of using Postgres for any successful company anywhere in the world are greater than using Oracle. I understand that's a key message for your sales team to get out, but only one of the CEOs at Oracle and Percona has flown a fighter jet underneath the Golden Gate Bridge.
Oracle licensing is famously, famously sticky. Extremely. Incredibly. It's how the company was built and is maintained.
mike_hearn 2 days ago [-]
Great, let's debate!
I've never talked to database sales people and have no idea what messages they have or care about. Actually I'm 99% sure they don't care about the HN/startup crowd at all - did you see anyone except me talk about this stuff here? Me neither. I'm making this argument basically because I like making arguments early that are surprising but correct, and databases feel like fertile ground for such arguments. There's a lot of groupthink in this space. And you know all about my history with surprising technology arguments, Peter ;)
Anyway I'd be interested to see a spreadsheet with a worked set of scenarios for both cost and "stickiness" however it's defined (genuinely). I think it's going to depend heavily on:
a) Whether you cloud host or not. The cost of a small Postgres that you run yourself is pretty much whatever your own time is valued at, as self-hosted hardware is cheap. The costs of a Postgres you outsource can be really un-intuitively high. I already showed that a cloud-hosted elastic Oracle DB can be cheaper than a same-spec AWS-managed Postgres despite a massive feature disparity on one side. Costs here aren't dominated by hardware or software purchase costs.
b) What features and scaling level you need, combined with cost of labour in your area. If you want to scale up a Postgres based operation very fast then that's going to take a ton of skilled engineering effort, devs will be slowed down a lot as they spend time on implementing custom sharding schemes etc. At some point the cost of rolling your own ad-hoc solutions to these things will cross with the cost of just buying a system that already solves them all out of the box. Where that cross-point is will depend on all kinds of things like opportunity cost, cost of hiring, cost of developer productivity....
c) Whether you consider unique features to be "stickiness". You're claiming the licensing is sticky here but companies negotiate all kinds of licenses, so what does that mean? By default it's charged per core like any other commercial DB (or in the cloud by core seconds/storage). If unique features are the problem then that's an aspect of choosing any tech platform. If you're taking advantage of full SQL joins on a 50-node horizontally scaled multi-master cluster then yeah, trying to migrate to something else is going to be sticky because there aren't many other products that offer that. That's tech for you. Still, these days I guess it must be less sticky because there are other people selling very scalable SQL-speaking databases like Spanner.
As for Larry Ellison's stunts, that's great, but if you're deciding what platform to use on the basis of executive horsepower then you can pick between fighter jets, Jeff Bezos' rockets, Bill Gates' yachts or Larry Page's flying cars. Selling databases seems to go hand in hand with high-tech vehicles, which is probably a sign there's some actual value being delivered there, somewhere.
vessenes 2 days ago [-]
:)
I referenced Larry as a proxy for his extreme wealth. Although it is true he’s one of the great businessmen of the late 20th century. Just not the sort you want to be in a business deal with in general.
Oracle has always been good both at adding helpful features that developers rely on, making switching difficult, and at teasing companies into using more licenses than they’ve purchased, then smacking them with audits and fees as a stick and dangling a ‘cheaper’ larger license as a carrot to avoid the audit fees.
In the 90s, this was tech like PL/SQL and materialized views - I’m long out of the Oracle game, so I have no idea where they compete on features now vis-a-vis open source - but I will say that I have owned companies where the Oracle license was both HATED - and outlived all the original owners of the company. It’s hard to replace once it’s in your workflow, and that is 100% by design.
mike_hearn 2 days ago [-]
I guess audits are fading away as more people move to the cloud. Audits are used by other enterprise tech sellers as well because you don't want DRM or telemetry in something like a mission critical HA DB that runs behind a firewall. So audits it is. Cloud solves all that (admittedly, whilst trading off against data privacy).
vessenes 2 days ago [-]
Makes sense. And metering is a level playing field in terms of cost assessments (if you discount tail costs to $0, that is).
ativzzz 2 days ago [-]
I've always wondered, how do applications with more stringent security requirements handle this?
Assume that permissions to any row in the DB can be removed at any time. If we store the data offline, this security measure is already violated. If you don't care about a user potentially storing data they no longer have access to, when they come online, any operations they make are invalid and that's fine
But, if security access is part of your business logic, and is complex enough to the point where it lives in your app and not in your DB (other than using DB tools like RLS), how do you verify that the user still has access to all cached data? Wouldn't you need to re-query every row every time?
I'm still uncertain how these sync engines can be secured properly
sreekanth850 2 days ago [-]
We use indexedDB and signalr for real time sync. What is new about this?
wslh 2 days ago [-]
Sync, in general, is a very complex topic. There are past examples, such as just trying to sync contacts across different platforms where no definitive solution emerged. One fundamental challenge is that you can’t assume all endpoints behave fairly or consistently, so error propagation becomes a core issue to address.
Returning to the contacts example, Google Contacts attempts to mitigate error propagation by introducing a review stage, where users can decide how to handle duplicates (e.g., merge contacts that contain different information).
In the broader context of sync, this highlights the need for policies to handle situations where syncing is simply not possible beyond all the smart logic we may implement.
hyperbolablabla 2 days ago [-]
How does this compare to supabase?
jFriedensreich 2 days ago [-]
that question comes up all the time for some reason, but supabase does not support offline or sync, only some form of subscription updates, which has nothing to do with having sync or local data.
VikingCoder 2 days ago [-]
There are two hard problems:
1. Naming things
2. Caching
3. Off-by-one errors
jiggawatts 2 days ago [-]
> Such a library would be called a database. But we’re used to thinking of a database as something server-related, a big box that runs in a data center. It doesn’t have to be like that! Databases have two parts: a place where data is stored and a place where data is delivered. That second part is usually missing.
Yes! A thousand times this!
Databases can't just "live on a server somewhere"; their code should extend into the clients. The client isn't just a network protocol parser/serialiser, it should implement what is essentially an untrusted, read-only replica. For writes, it should implement what is essentially a local write-ahead log (WAL), kept in memory and optionally fsync-d to local storage. All of this should use the same codebase as the database engine, or be machine-generated in multiple languages from some sort of formal specification.
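A sketch of the client-side half of that, assuming a pending-write log overlaid on a read-only replica (all names are illustrative, not any engine's actual API):

// Reads see the replica with pending (unacknowledged) local writes applied on top.
interface PendingWrite { key: string; value: unknown }

class ClientReplica {
  private replica = new Map<string, unknown>(); // follows the server
  private pending: PendingWrite[] = [];         // local WAL, optionally persisted/fsync'd

  read(key: string): unknown {
    for (let i = this.pending.length - 1; i >= 0; i--) {
      if (this.pending[i].key === key) return this.pending[i].value; // newest local write wins
    }
    return this.replica.get(key);
  }

  write(key: string, value: unknown): void {
    this.pending.push({ key, value }); // queued for upload to the server
  }

  // Server acknowledged the first `count` pending writes: fold them into the replica.
  acknowledge(count: number): void {
    for (const w of this.pending.splice(0, count)) this.replica.set(w.key, w.value);
  }

  // Server pushed an authoritative change.
  applyServerChange(key: string, value: unknown): void {
    this.replica.set(key, value);
  }
}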
erichocean 2 days ago [-]
I designed the sync engine for Things Cloud [0] over a decade ago. It seems to have worked out pretty well for them. (The linked page has some details about what it can do.)
When sync Just Works™, it's a magical thing.
One of the reasons my design has been reliable from its very first release, even across multiple refactors/rewrites (I believe it's currently on its third, this time to Swift), is that it uses a Git-like model internally with pervasive hashing. It's almost impossible for sync to work incorrectly (if it works at all).
I started using Things about 4-5 years ago and it's been amazing, partly because of the first-class support for syncing between devices and their cloud. Thanks for making this great!
I would be interested to read any articles you've written about Things's sync pattern, if any.
erichocean 1 days ago [-]
Hashing + custom server merge is the main logical device, the rest is just encoding tricks to make everything fast.
Push becomes like git push to a tmp branch, then you do a server-side merge, then a pull on remaining clients. Push/pull is fast just like git, and due to content hashing, there are never any problems overwriting another device's work on the server.
I think I had some clever tricks with text syncing (IIRC I implemented a Merkle tree-inspired approach), but that's the general concept.
(I don't remember anything else, it was so long ago. :-)
ltbarcly3 2 days ago [-]
This has been solved every 5 years or so, and along the way people learn why this solution doesn't actually work.
quantadev 2 days ago [-]
IPFS is a technology very helpful for syncing. One way it's being used in a modern context (although only sub-parts of the IPFS stack) is how BlueSky engineers, during their design process a few years ago, accepted my proposal that for a new social media protocol, each user should have his own "Repository" (basically a Merkle tree) of everything he's ever posted. Then there's just a "Sync" up to some master service provider node (a decentralized set of nodes/servers) for the rest of the world to consume.
Merkle-tree based synching is as performant as you can possibly get (used by the Git protocol too, I believe) because you can tell if the root of a tree structure is identical to some other remote tree structure just by comparing the hash strings. And this can be applied recursively down any "changed branches" of a tree to implement very fast syncing mechanisms.
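A sketch of that recursive comparison (node shape simplified for illustration; real systems build each parent hash from the children's hashes):

interface MerkleNode {
  hash: string;
  children?: Map<string, MerkleNode>; // absent on leaves
}

// Returns the paths (as child-name arrays) where the two trees diverge.
// Identical hashes prune whole subtrees, so only changed branches are visited.
function changedPaths(local: MerkleNode, remote: MerkleNode, path: string[] = []): string[][] {
  if (local.hash === remote.hash) return [];               // subtree identical: skip
  if (!local.children || !remote.children) return [path];  // a differing leaf
  const diffs: string[][] = [];
  const names = new Set([...local.children.keys(), ...remote.children.keys()]);
  for (const name of names) {
    const l = local.children.get(name);
    const r = remote.children.get(name);
    if (!l || !r) diffs.push([...path, name]);              // branch added or removed
    else diffs.push(...changedPaths(l, r, [...path, name]));
  }
  return diffs;
}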
I think we need a NEW INTERNET (i.e. Web3, and dare I say Semantic Web built in) where everyone's basically got their own personal "Tree of Stuff" they can publish to the world, all naively built into some new kind of tree structure-based killer app. Like imagine having Jupyter Notebooks in Tree form, where everything on it (that you want to be) is published to the web.
didn't know that about roam research. I was a user, but also that app convinced me that front-end went in the wrong direction for a decade...
The Rocicorp Zero Sync / instantdb / Linear-app-like trend is great -- sync will be big. I hope a lot of the SPA slop gets fixed!
delusional 2 days ago [-]
> I’ve yet to see a code base that has maintained a separate in-memory index for data they are querying
Define "separate" but my old X11 compositor project neocomp I did something like that with a series of AOS arrays and bitfields that combined to make a sort of entity manager. Each index in the arrays was an entity, and each array held a data associated with a "type" of entity. An entity could hold multiple types that would combine to specify behavior. The bitfield existed to make it quick to query.
It waaay too complicated for what it was, but it was fun to code and worked well enough. I called it a "swiss" (because it was full of holes). It's still online on github (https://github.com/DelusionalLogic/NeoComp/blob/master/src/s...) even though I don't use it much anymore.
theamk 2 days ago [-]
TL/DR:
> If your database is smart enough and capable enough, why would you even need a server? Hosted database saves you from the horrors of hosting and lets your data flow freely to the frontend.
(this is a blog of one such hosted database provider)
Sytten 2 days ago [-]
That quote is why security people will always be employed.
Jokes aside, firebase access control is a nightmare and all those database-as-an-API things have the same problem.
SuperNinKenDo 2 days ago [-]
Apropos of the other reply to you about security. Maybe some security people could let me know their thoughts on this.
It seems like generally, the best way to expose your database to the internet is considered to be not doing so in the first place, i.e., have your webserver query and cache a hosted database that isn't directly exposed.
Is my understanding correct? It seems that almost all data breaches we hear about are directly exposed databases or their cloud equivalents.
Is doing this in the era of "cloud" being made impossible?
TeMPOraL 2 days ago [-]
That's in some sense a "Swiss cheese security model". It's not that databases should, in principle, never be directly exposed. It's that they rarely are designed for it security-wise[0]; meanwhile, adding whatever complex assembly of containers and applications written in random languages and frameworks to sit between users and the database introduces a swamp of better-secured systems that attackers also need to get through. The more cruft you pile on, the more annoying it gets for attackers and users alike.
In fact, there are many benefits of directly exposed databases - many of which would remove the need for applications normally sitting on top of those databases, which are strictly inferior and less ergonomic and overall more shitty than a generic database browsing interface. But that's another reason for why things are the way they are: people wanna make money, and having your application be a toll booth between useful data you own and the rest of the world, is tried and true way of making money.
--
[0] - Because they're not normally exposed, because they're not designed for it, because... it's a self-reinforcing loop.
worthless-trash 2 days ago [-]
I don't think most large-scale breaches are directly exposed databases; they are just the ones that summon the largest face palms.
dvrp 2 days ago [-]
What do you propose as a solution for companies to be able to embrace more "liberating" philosophies such as anti-lock-in measures or copyleft-friendly measures?
It seems that solving that is a cultural/economic problem, not a technical one, and that's a shame.
DeathArrow 2 days ago [-]
I've solved data sync in distributed apps a long time ago. I send outgoing data to /dev/null and receive incoming data from /dev/zero. This way data is always consistent. That also helps with availability and partition tolerance.
> how to apply server-side changes to a client when the client has its own pending local operations
I liked the option of restore and replay on top of the updated server state. I’m wondering when this causes perf issues. First, local changes should propagate fast after e.g. a network partition, even if the person has queued up a lot of them (say, during a flight).
Anyway, my thinking is that you can avoid many consensus problems by just partitioning data ownership. The like example is interesting in this way. A like count is an aggregate based on multiple data owners, and everyone else just passively follows with read replication. So thinking in terms of shared write access is the wrong problem description, imo, when in reality ”liked posts” is data exclusively owned by all the different nodes doing the liking (subject to a limit of one like per post). A server aggregate could exist but is owned by the server, so no shared write access is needed.
Similarly, say you have a messaging service. Each participant owns their own messages and others follow. No conflict resolution is needed. However, you can still break the protocol (say, liking twice). Those messages can be considered malformed and e.g. ignored. In some cases, you can copy someone else’s data and make it your own, for instance to protect against impersonation: say that you can change your own nickname, and others follow. This can be exploited to impersonate, but you can keep a local copy of the last-seen nickname and then display a ”changed name” warning.
Anyway, I’m just a layman who wants things to be simple. It feels like CRDTs have been the ultimate nerd-snipe, and when I did my own evaluations I was disappointed with how heavyweight and opaque they were a few years ago (and probably still).
I agree with this. CRDTs are cool tech, but I think in practice most folks would be surprised by the high percentage of use cases that can be solved with a much simpler conflict resolution mechanism (perhaps combined with server reconciliation, as Matt mentioned). I also agree that collaborative document editing is a niche where CRDTs are indeed very useful.
[0] https://news.ycombinator.com/item?id=33865672
[1] https://news.ycombinator.com/item?id=24617542
Or CRDTs at all. Google Docs is based on operational transforms and Figma on what they call multiplayer technology.
I suspect the generalized solution is much harder to achieve, and looks more like batch-based reconciliation of full snapshots than streaming or event-driven.
The challenge is if you aim to sync data sources where the parties managing each data source are not incentivized to provide robust sync. Consider Dropbox or similar, where a single party manages the data set, and all software (server and clients), or ecosystems like Salesforce and Mulesoft which have this as a stated business goal, or ecosystems like blockchains where independent parties are still highly incentivized to coordinate and have technically robust mechanisms to accomplish it like Merkle trees and similar. You can achieve sync in those scenarios because independent parties are incentivized to coordinate (or there is only one party).
But if you have two or more independent systems, all of which provide some kind of API or import/export mechanisms, you can never guarantee those systems will stay in sync using a streaming or event-driven approach. And worse, those systems will inevitably drift out of sync, or even more worse, will propagate incorrect data across multiple systems, which can then only be reconciled by batch-like point-in-time snapshots, which then begs the question of why use streaming if you ultimately need batch to make it work reliably.
Put another way, people say batch is a special case of streaming, so just use streaming. But you could also say streaming is a fragile form of sync, so just use sync. But sync is a special case of batch, so just use batch.
Also, http caching is sort of a special case of sync - where the cache (say, nginx) is trying to keep a synchronised copy of a resource from the backend web server. But because there’s no way for the web server to notify nginx that the resource has changed, you get both stale reads and unnecessary polling. Doing fan-out would be way more efficient than a keep alive header if we had a way to do it!
CRDTs are cool tech. (I would know - I’ve been playing with them for years). But I think it’s worth dividing data interfaces into two types: owned data and shared data. Owned data has a single owner (eg the database, the kernel, the web server) and other devices live down stream of that owner. Shared data sources have more complex systems - eg everyone in the network has a copy of the data and can make changes, then it’s all eventually consistent. Or raft / paxos. Think git, or a distributed database. And they can be combined - eg, the app server is downstream of a distributed database. GitHub actions is downstream of a git repo.
I’ve been meaning to write a blog post about this for years. Once you realise how ubiquitous this problem is, you see it absolutely everywhere.
In most cases, the easiest approach there is just "slap a blockchain on it", as a good and modern (think Ethereum, not Bitcoin) blockchain essentially "abstracts away" the decentralization and mostly acts like a centralized computer to higher layers.
That is certainly not the only viable approach, and I wish we looked at others more. For example, a decentralized DNS-like system, without an attached cryptocurrency, but with global consensus on what a given name points to, would be extremely useful. I'm not convinced that such a thing is possible, you need some way of preventing one bad actor from grabbing all the names, and monetary compensation seems like the easiest one, but we should be looking in this direction a lot more.
In my mind, this is just the second category again. It’s just a shared data system, except with data validation & Byzantine fault tolerance requirements.
It’s a surprisingly common and thorny problem. For example, I could change my local git client to generate invalid / wrong hashes for my commits. When I push my changes, other peers should - in some way - reject them. PVH (of Ink&Switch) has a rule when thinking about systems like this. He says you’re free to deface your own copy of the US constitution. But I don’t have to pull your changes.
Access control makes the BFT problem much worse. The classic problem is that if two admins concurrently remove each other, it’s not clear what happens. In a crdt (or git), peers are free to backdate their changes to any arbitrary point in the past. If you try and implement user roles on top of a crdt, it’s a nightmare. I think CRDTs are just the wrong tool for thinking about access control.
One thing I think is missing in the discussion about shared data (and maybe you can correct me) is that there are two ways of looking at the problem:
* The "math/engineering" way, where once state is identical you are done!
* The "product manager" way, where you have reasonable-sounding requests like "I was typing in the middle of a paragraph, then someone deleted that paragraph, and my text was gone! It should be its own new paragraph in the same place."
Literally having identical state (or even identical state that adheres to a schema) is hard enough, but I'm not aware of techniques to ensure 1) identical state 2) adhering to a schema 3) that anyone on the team can easily modify in response to "PM-like" demands without being a sync expert.
I've spent 16 years working on a sync engine and have worked with hundreds of enterprises on sync use cases during this time. I've seen countless cases of developers underestimating the complexity of sync. In most cases it happens exactly as you said: start with a naive approach and then the fractal complexity spiral starts. Even if the team is able to do the initial implementation, maintaining it usually turns into a burden that they eventually find too big to bear.
That said, there’s work that has been done towards fixing some of those issues.
Evan Wallace (I think he’s the CTO of Figma) has written about a few solutions he tried for Figma’s collaborative features. And then Martin Kleppmann has a paper proposing a solution:
https://martin.kleppmann.com/papers/move-op.pdf
[0] https://www.youtube.com/watch?v=NMq0vncHJvU&t=1016s
I've been working on sync for the latter use case for a while and CRDTs would definitely be overkill.
When clients disagree about the order of events and a conflict results, clients can be required to roll back (apply the inverse of each change) to the last point in time where all clients were in agreement about the world state. Then all clients re-apply all changes in the new, now-agreed-upon order. At that point all changes have been applied, there is agreement about the world state, and the process starts anew. (A sketch of this rewind-and-replay loop is below the links.)
This way multiple clients can work offline for extended periods of time and then reconcile with other clients.
[0] https://loro.dev/docs/advanced/event_graph_walker
[1] https://www.youtube.com/watch?v=rjbEG7COj7o
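A minimal TypeScript sketch of that rewind-and-replay loop (the types are hypothetical, and a real implementation also needs to persist inverses and agree with peers on the common prefix):

    interface Change {
      id: string;
      apply(state: State): State;
      // The inverse is computed against the state the change was applied to,
      // so it must be captured at apply time in a real implementation.
      invert(state: State): State;
    }
    type State = Record<string, unknown>;

    // appliedLocally: changes this client has applied, oldest first.
    // agreedPrefixLength: how many of those every client already agrees on.
    // agreedOrder: the newly agreed-upon order of everything after that prefix.
    function reconcile(
      state: State,
      appliedLocally: Change[],
      agreedPrefixLength: number,
      agreedOrder: Change[],
    ): State {
      // 1. Roll back to the last point of agreement, newest change first.
      for (let i = appliedLocally.length - 1; i >= agreedPrefixLength; i--) {
        state = appliedLocally[i].invert(state);
      }
      // 2. Re-apply everything in the agreed-upon order.
      for (const change of agreedOrder) {
        state = change.apply(state);
      }
      return state;
    }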
I'd love to hear about any success cases people have had with CRDTs.
The behavior the article found is peculiar to the particular CRDT algorithms they looked at. But they’re probably right that it’s impossible for all conflicting edits to “just work” (in general, not just with CRDTs). That doesn’t mean CRDTs are pointless; you could imagine an algorithm that attempts to detect such semantic conflicts so the application can present some sort of resolution UI.
Here’s the article, if interested (it’s very good): https://www.moment.dev/blog/lies-i-was-told-pt-1
I can't blame people for thinking otherwise; pretty much every self-described "CRDT library" I've come across implements exactly one such data structure, maybe parameterized.
It's like writing a "semiring library" where the only semiring is (min, +).
I've been thinking for a bit that it is probably about time the industry renamed that first C to something other than "conflict-free". There is no freedom from conflicts. There's conflict resistance, sure, and CRDTs can provide a lot of conflict resistance in their various data structures. But at the end of the day, if the data structure is meant to encode an application for humans, it needs every merge tool, review tool, and audit tool it can get to deal with those conflicts.
I think we're finally starting to see some light at the end of the tunnel in the major CRDT efforts, and we're finally leaving the detour of "no it must be conflict-free, we named it that so it must be true". I don't think any one library is yet delivering it at a good high level, but I have the feeling that "one of the next libraries" is maybe going to start getting the ergonomics of conflict handling right.
Throwing small language models into the mix could make merging less painful too — like having the system take its best guess at what you meant, apply it, and flag it for later review.
The issue happens when a file is renamed by one client, and then all other clients pick up the rename and make the change to the local files on disk. Since every edit is broken down into delete/keep/insert runs, the automated process runs rapidly in all clients and can break the links.
I could limit the edits to just one client, but it feels clunky. Another thought I've had is to use ytext annotations, or to also store a ymap of the link metadata and only apply updates if they pass some kind of check (kind of like schema validation for objects). One minimal-diff idea is sketched below.
If anyone has a good mental model for modeling automated operations (especially find/replace) in ytext please let me know! (email in bio).
[0] https://system3.md/relay
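The minimal-diff idea mentioned above, as a TypeScript sketch using the plain Yjs text API. It only handles the first occurrence and touches just the characters that actually differ, so it's a starting point rather than a full answer to the rename problem:

    import * as Y from "yjs";

    // Replace the first occurrence of `from` with `to` in a Y.Text, touching only
    // the characters that actually differ (instead of whole delete/insert runs),
    // so concurrent edits elsewhere keep their positions.
    function replaceMinimal(ytext: Y.Text, from: string, to: string): void {
      const text = ytext.toString();
      const idx = text.indexOf(from);
      if (idx === -1) return;

      // Trim the common prefix and suffix of the old and new strings.
      let start = 0;
      while (start < from.length && start < to.length && from[start] === to[start]) start++;
      let common = 0;
      while (
        common < from.length - start &&
        common < to.length - start &&
        from[from.length - 1 - common] === to[to.length - 1 - common]
      ) common++;

      const edit = () => {
        const deleteLen = from.length - start - common;
        if (deleteLen > 0) ytext.delete(idx + start, deleteLen);
        const insertStr = to.slice(start, to.length - common);
        if (insertStr.length > 0) ytext.insert(idx + start, insertStr);
      };
      // Group the delete+insert so remote peers see one atomic change.
      if (ytext.doc) ytext.doc.transact(edit); else edit();
    }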
One early insight was that we needed a representation of partner data in our database (and the downstream systems need a representation of our opinionated view as well). This is clearly an (eventually consistent) synchronization problem.
We also realized that we often either fail to sync (due to bugs, timing, or whatever) and need a regular process to resync data.
We've ended up with a homegrown framework that does both things, such that the same business logic gets used in both cases. This also makes it easy to backfill data if a chosen representation changes. (The rough shape is sketched below.)
We're now on the third or fourth iteration of this system and I'm pretty happy with it.
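For what it's worth, the rough shape of the pattern in TypeScript (the names here are hypothetical placeholders, not our actual code):

    // Placeholders for the pieces specific to the system.
    type PartnerRecord = { id: string; [key: string]: unknown };
    declare const db: { upsert(table: string, id: string, row: unknown): Promise<void> };
    declare const partnerApi: { listAll(): AsyncIterable<PartnerRecord> };
    declare function mapToInternal(r: PartnerRecord): { id: string; [key: string]: unknown };

    // One idempotent function holds the business logic for turning a partner
    // record into our opinionated representation. Both paths below call it.
    async function upsertPartnerRecord(record: PartnerRecord): Promise<void> {
      const mapped = mapToInternal(record);
      await db.upsert("partner_items", mapped.id, mapped);
    }

    // Path 1: incremental sync, driven by partner webhooks / change feeds.
    async function onPartnerEvent(event: { record: PartnerRecord }): Promise<void> {
      await upsertPartnerRecord(event.record);
    }

    // Path 2: periodic resync / backfill, which repairs whatever path 1 missed
    // (bugs, timing issues, or a changed representation).
    async function resyncAll(): Promise<void> {
      for await (const record of partnerApi.listAll()) {
        await upsertPartnerRecord(record);
      }
    }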
I've been in that situation a lot, and I'd always carefully consider if you even need the online synchronization at that point. It's pretty rarely required.
Do naming things and off-by-one errors also count?
Multiplayer games too.
Abstracting all of this complexity away in one general tool/library and pretending that it will always work is snake oil. There are no shortcuts to building truly high quality product at a large scale.
I definitely believe sync engines are the future as they make it so much easier to enable things like no-spinners browsing your data, optimistic rendering, offline use, real-time collaboration and so on.
I'm also not entirely convinced yet, though, that it's possible to get away with something that's not custom-built, or at least large parts of it. There were so many micro decisions and trade-offs going into the engine:
- What is the granularity of updates (characters, rows?) that we need, and how does that affect performance?
- Do we need a central server for things like permissions and real-time collaboration? If so, do we want just deltas, or also state snapshots for speedup?
- How much versioning do we need, and what are the implications of that?
- Is there end-to-end encryption, and how does that affect what the server can do?
- What kind of data structure is being synced: a simple list/map, or a graph with potential cycles?
- What kind of conflict resolution business logic do we need, and where does that live?
It would be cool to have something general purpose so you don’t need to build any of this, but I wonder how much time it will save in practice. Maybe the answer really is to have all kinds of different sync engines to pick from and then you can decide whether it's worth the trade-off not having everything custom-built.
[1] https://thymer.com
btw: excellent questions to ask / insights - about the same ones I came across in my own local-first (lo-fi) ventures.
Would be great if someone could assemble all these questions in a "walkthrough" step-by-step interface and in the end, the user gets a list of the best matching engines.
Edit: Mh ... maybe something small enough to vibe code ... if someone is interested in helping, let me know!
1) in a decentralized system who is responsible for backups? What happens when you restore from a backup?
2) in a decentralized system who sends push notifications and syncs with mobile devices?
I think that in an age of $5/mo cloud VMs and free SSL, having a single coordination server has all the advantages and none of the downsides.
- Sync engines might only solve small and medium scale, but that would be a huge win even without large scale
Remember Meteor?
> It’s also ill-advised to try to solve data sync while also working on a product. These problems require patience, thoroughness, and extensive testing. They can’t be rushed. And you already have a problem on your hands you don’t know how to solve: your product. Try solving both, fail at both.
Also, you might not have that "large scale" yet.
(I get that you could also make the opposite case, that the individual requirements for your product are so special that you cannot factor out any common behavior. I'd see that as a hypothesis to be tested.)
The first rule of network transparency is: the network is not transparent.
> Or: I’ve yet to see a code base that has maintained a separate in-memory index for data they are querying
Is boost::multi_index_container no longer a thing?
Also there's SQLite with the :memory: database.
And this ancient 4gl we use at work has in-memory tables (as in database tables, with typed columns and any number of unique or not indexes) as a basic language feature.
Helps a lot with high read situations and takes considerable load off the database with probably 1 hour of coding effort if you know what you're doing.
Depends on the shop. I haven't seen one in production so far, but I don't doubt some people use it.
> Also there's SQLite with the :memory: database.
Ah, now that's cheating. I know, because I did that too. I did that because of the realization that half the members I'm stuffing into classes to store my game state are effectively poor man's hand-rolled tables, indices and spatial indices, so why not just use a proper database for this?
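In TypeScript/Node terms the trick looks roughly like this (assuming the better-sqlite3 package; the table layout is invented for illustration):

    import Database from "better-sqlite3";

    // An in-memory database as an indexed store for game entities.
    const db = new Database(":memory:");
    db.exec(`
      CREATE TABLE entities (
        id   INTEGER PRIMARY KEY,
        kind TEXT NOT NULL,
        x    REAL NOT NULL,
        y    REAL NOT NULL,
        hp   INTEGER NOT NULL
      );
      CREATE INDEX idx_entities_kind ON entities(kind);
      CREATE INDEX idx_entities_pos  ON entities(x, y);
    `);

    const insert = db.prepare("INSERT INTO entities (kind, x, y, hp) VALUES (?, ?, ?, ?)");
    insert.run("goblin", 10.5, 4.2, 30);
    insert.run("chest", 12.0, 4.0, 1);

    // The hand-rolled "spatial index" becomes a plain indexed query.
    const nearby = db
      .prepare("SELECT * FROM entities WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ?")
      .all(9, 13, 3, 5);
    console.log(nearby);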
> And this ancient 4gl we use at work has in-memory tables (as in database tables, with typed columns and any number of unique or not indexes) as a basic language feature.
Which one is this? I've argued in the past that this is a basic feature missing from 4GL languages, and a lot of work in every project is wasted on hand-rolling in-memory databases left and right, without realizing it. It would seem I've missed a language that recognized this fact?
(But then, so did most of the industry.)
https://en.wikipedia.org/wiki/OpenEdge_Advanced_Business_Lan...
Dates back to 1981, called "Progress 4GL" until 2006.
https://docs.progress.com/bundle/abl-reference/page/DEFINE-T...
- Zero Sync: https://github.com/rocicorp/mono
- Triplit: https://github.com/aspen-cloud/triplit
Doesn't even have a readme :D Raise the bar a bit maybe.
Hard to raise the bar on Zero. It’s a brilliant system.
Would you recommend it for side projects?
But you don't have to. GitHub shows the readme just below the partial file list. That's what all the same-page docs on GitHub/GitLab repositories are.
Full docs are linked from the readme.
- https://github.com/electric-sql/electric
- https://github.com/powersync-ja
- https://github.com/get-convex
- https://github.com/tinyplex/tinybase
- https://github.com/garden-co/jazz
edit: Also, their decision to make it just one-way sync makes a LOT of sense. Write access brings a lot of scary cases, so making it read-only sync eases some of my anxieties. I can still use REST / RPC for updating the data.
If you want a fully decentralized system, check out jazz. It is the best of these currently IMO.
https://github.com/glycerine/jcp
If that was true, we would ultimately end up with a single layer. Instead I would say that major shifts happen when we move the boundaries between layers.
The author here proposes to replace servers by synced client-side data stores.
That is certainly a good idea for some applications, but it also comes with drawbacks. For example, it would be easier to avoid stale data, but it would be harder to enforce permissions.
There was still a server, it's just not YOUR server. In this case, there will still be servers, just maybe not something that you need to manage state on.
This misnaming creates endless conflict when trying to communicate this with hyper excited management who want to get on the latest trend.
Can't wait to be in a meeting and hear: "We don't need servers when we migrate to client-side data stores".
Over time, the meaning of the word 'Xerox' changed. More specifically, it gained a new meaning. For a long time, Xerox only referred to a company named in 1961. Some time in the late 60s, it started to be used as a verb, and as I was growing up in the 70s and 80s, the word 'Xerox' was overwhelmingly used in its verb form.
Our society decided as a whole that it was ok for the noun Xerox to be used as a verb. That's a normal and natural part of language development.
As others have noted, management doesn't care whether the serverless thing you want to use is running on servers or not. They care that they don't have to maintain servers themselves. CapEx vs OpEx and all that.
I agree that there could be some small hazard with the idea that, if I run my important thing in a 'serverless' fashion, then I don't have to associate all of the problems/challenges/concerns I have with 'servers' to my important thing.
It's an abstraction, and all abstractions are leaky.
If we're lucky, this abstraction will, on average, leak very little.
https://www.youtube.com/watch?v=PZbqAMEwtOE#t=5m58s I don't think this dramatization (of a court proceedings from 2010) is related to Xerox's plight with losing their trademark, but said dramatization is brilliant nonetheless
There's no such thing as a reliable network in the world. The world is network-connected; there have been almost no local-only systems for a long, long time now.
Some engineers dream that there are cases when the network is reliable, like when a system lives entirely in the same region and a single AZ. But even then it's actually not reliable and can have glitches quite frequently (like once per month or so, depending on luck).
I’m not saying there is
A white paper showing how Instant solves synchronization problems might be nice.
Mostly because I consider the state of the art on this to be Clojure Electric and he presumably is aware of it at least to some degree but does not mention it
If you are referring to virtual scroll over large collections - yes, we use the persistent connection to stream the window of visible records from the server in realtime as the user scrolls, affording approximately realtime virtual scroll over arbitrarily large views (we target collections of size 500-50,000 records and test at 100ms artificial RT latency; my actual prod latency to the Fly edge network is 6ms RT ping), and the Electric client retains in memory precisely the state needed to materialize the current DOM state, no more no less.
Which means the client process performance is decoupled from the size of the dataset - which is NOT the case for sync engines, which put high memory and compute pressure on the end user device for enterprise scale datasets.
It also inherits the traditional backend-for-frontend security model, which all enterprise apps require, including consumer apps like Notion that make the bulk of their revenue from enterprise citizen devs and therefore are exposed to enterprise data security compliance. And this is in an AI-focused world where companies want to defend against AI scrapers so they can sell their data assets to foundation model providers for use in training!
Which IMO is the real problem with sync engines: they are not a good match for enterprise applications, nor are they a good match for hyper scale consumer saas that aspire to sell into enterprise. So what market are they for exactly?
[1] https://tonsky.me/blog/crdt-filesync/
How does this happen without an interface for conflict resolution? That's the hard part.
(make sure to also read the footnote [28] there).
Imagine if git just on its own picked what to keep and what to throw away when there's a conflict. You fundamentally need the user to make the choice.
It is described here:
https://rocicorp.dev/blog/ready-player-two
It works really well and we and our customers have found it to be quite general.
It allows you to run an arbitrary transaction on the server side to decide what to do in case of conflicts. It is the software equivalent of git asking the user what to do. Zero asks your code what to do.
But it asks it in the form of the question "please run the function named x with these inputs on the current backend db state". Which is a much more ergonomic way to ask it than "please do a 3-way merge between these three states".
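To make that concrete without getting into Zero's actual API, the general pattern looks something like this in TypeScript (all names here are hypothetical, not Zero's):

    // The client optimistically runs a named mutator against its local copy and
    // sends { name, args } to the server. The server re-runs the same mutator
    // against the authoritative state, inside a transaction, and that result wins.
    interface Tx {
      get(key: string): Promise<any>;
      set(key: string, value: any): Promise<void>;
    }
    type Mutators = Record<string, (tx: Tx, args: any) => Promise<void>>;

    const mutators: Mutators = {
      // Conflict handling is ordinary application code: this runs against
      // whatever the server state is when the write actually arrives.
      async claimTask(tx, { taskId, userId }: { taskId: string; userId: string }) {
        const task = await tx.get(`task/${taskId}`);
        if (task.assignee && task.assignee !== userId) {
          return; // someone else got here first; decide what to do (here: no-op)
        }
        await tx.set(`task/${taskId}`, { ...task, assignee: userId });
      },
    };

    // Server side: apply queued client mutations in arrival order.
    async function applyMutation(tx: Tx, m: { name: string; args: any }): Promise<void> {
      const mutator = mutators[m.name];
      if (!mutator) throw new Error(`unknown mutator: ${m.name}`);
      await mutator(tx, m.args);
    }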
Conflict resolution is not the reason why there has not been a general-purpose sync engine. None of our customers have ~ever complained about conflict resolution.
The reason there has not been a general-purpose sync engine is actually on the read side:
These problems are being solved in the next generation of sync engines. For more on this, I talk about it some here:
https://www.youtube.com/watch?v=rqOUgqsWvbw
You are asking the dev what to do. You are _not_ asking the user what to do. This is akin to the git devs baking a choice into git on what to keep in a merge conflict.
It's hard to trust you guys when you misrepresent like this. I thought long and hard on whether to respond confrontationally like this, but decided you really need to hear the push back on this.
I represented that we ask the dev what to do:
> Zero asks your code what to do
You agree that's what we do:
> You are asking the dev what to do. You are _not_ asking the user what to do.
I get that your actual issue is you don't think that what we do is "the software equivalent of git asking the user what to do". But like, I also said what we do concretely in the same paragraph. It's not like I was trying to hide something. This is a metaphor for how to understand our approach to conflict resolution that works for most developers. Like all metaphors it is not perfect.
FWIW, there is nothing stopping a developer from having this function just save off a forked copy and ask the user what to do. Some developers do this.
Also FWIW, Zero does not allow offline writes specifically because we want to educate people how to properly handle conflicts before we do. I see down-thread this is the majority of your concern.
Without offline support, and with real-time updating of data, conflict resolution is not a real-world practical concern. Users will be looking at the same data at the same time anyway, so they generally see which data won out in case of a conflict, as they are looking at real-time data as they edit.
IF you had offline support - and for other sync engines that do: there is a real and meaningful difference between a backend dev and an end user of the application choosing what to do in case of a conflict. A backend dev cannot make a general-case algorithm that knows what two end users want to keep or throw away in a conflict, because this is completely situational - users could be doing anything. And if you push the conflict resolution to the end users, then you are asking a lot of those users. They need to be technically inclined and motivated people in order to take the time to understand and resolve the conflict. Like with git users.
I disagree with this. There are many real-world cases where keywise LWW (last-write-wins) does the wrong thing. The article I linked up-thread covers many of them. Even a simple counter does the wrong thing.
This is where robust conflict resolution really matters in these systems, not the long-time offline case people often ask about.
You need robust conflict resolution to make correct software and maintain invariants in the face of write/write systems.
> A backend dev cannot make a general case algorithm that knows that two end users want to keep or throw away in a conflict, because this is completely situational - users could be doing whatever. And if you push the conflict resolution to the end users, then you are asking a lot of those users. They need to be technically inclined and motivated people in order to take the time to understand and resolve the conflict. Like with git users.
I agree completely. In my opinion the ideal offline-first write/write UI has never been built, but the team at Ink & Switch are closest:
https://www.inkandswitch.com/patchwork/notebook/
I think the perfect UX in many cases is that sync goes ahead and tries to land the offline writes, but the user has a history UI where they can see what happened. Like how many collaborative apps do today.
But importantly in this UI the app would represent branches and merges. But unlike Git's fine grained branch/merge points, in this UI it would literally represent points where people went offline and made changes.
Users could then go back and recover the version of their data from when they were offline, or compare (probably manually in two tabs) the two different versions of the data and recover.
This does still ask users to compare and resolve conflicts in the worst case, but it is not a blocking operation or one that is final. The more common case is the user will go ahead with the merge and sometimes find some corruption. They can always go back and see what went wrong after the fact and fix. This seems like the right tradeoff to me of making the common case (no conflict) easy and automatic but making the uncommon but scary case at least not dangerous.
There also needs to be clear first-class UX telling users that they're going offline and what will happen when they come online.
I'm looking forward to someday working on this, but it's not what our users ask about most often so we're just disabling offline writes for now.
Linear had to do all sorts of shenanigans to be able to sync all data, for orgs with lots of it – there's a talk on that here:
https://www.youtube.com/watch?v=Wo2m3jaJixU&t=1473s
https://github.com/rocicorp/mono/tree/main/packages/replicac...
The main concern of sync engines is precisely the conflict resolution! Everything else is simple in comparison.
The good news is that under some circumstances it is possible to resolve conflicts without user intervention. The simplest example is a counter that can only be incremented. More advanced data structures that resolve conflicts automatically exist too, for example for strings, and those are good enough for a text editor.
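For reference, the increment-only counter is tiny - a standard G-Counter, sketched here in TypeScript:

    // Grow-only counter: each replica increments its own slot; merging takes the
    // per-replica maximum, so merges commute and every replica converges.
    type GCounter = Record<string, number>; // replicaId -> count

    function increment(c: GCounter, replicaId: string, by = 1): GCounter {
      return { ...c, [replicaId]: (c[replicaId] ?? 0) + by };
    }

    function merge(a: GCounter, b: GCounter): GCounter {
      const out: GCounter = { ...a };
      for (const [id, n] of Object.entries(b)) {
        out[id] = Math.max(out[id] ?? 0, n);
      }
      return out;
    }

    function value(c: GCounter): number {
      return Object.values(c).reduce((sum, n) => sum + n, 0);
    }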
I agree that there will be conflicts that are resolved in a way that yields non-sensical text, for example if there are 2 edits of the sentence "One cat":
One cat => Two cats
One cat => One dog
The resulting merge may be something like "Two cats dog". Something else (the user, an LLM...) will then have to fix it.
But that's totally OK, because in practice this will happen extremely rarely, only when the user has been offline for a long time. That user will be happy to have been able to work offline, which largely compensates for the fact that they have to proofread the text again.
It can be acceptable for some use cases, but not for others where you're still concerned about stuff that happens "extremely rarely" and is not under your direct control.
> Something else (the user, an LLM...) will then have to fix it.
This assumes that user/llm knows the conflict was automatically solved and might need to be fixed, so the conflict is still there! You just made the manual part delayed and non-mandatory, but if you want correctness it will still have to be there.
I don't think it would happen "extremely rarely". Drops in connectivity happen a lot, especially on cellular connection and this can absolutely happen a lot for some applications. Especially when talking about "offline first" apps.
Not really true though. I've used a couple of local sync engines, one internally built and another one which is both commercial and now open source called PowerSync[1]. Conflict resolution is definitely on the agenda, and a developer is definitely going to be mindful of conflicts when designing the application.
[1] https://www.powersync.com/
Kind of but only really in the web world, it was the default on desktop for a long time and is pretty common on mobile.
You _will_ have conflicts (because your app is distributed and there are concurrent writes). They will happen on the semantic level, so only you (the app developer) _will_ be able to solve them. A database (or any other magical tool) can’t do it for you.
Another misconception is that conflict resolution needs to be “solved” perfectly before any progress can be made. That is not true as well. You might have unhandled conflicts in your system and still have a working, useful, successful product. Conflicts might be rare, insignificant, or people (your users) will just correct for/work around them.
I am not saying “drop data on the floor”, of course, if you can help it. But try not to overthink it, either.
I can't speak for whatever application-level problems you were trying to solve, but many problem-cases can be massaged into being conflict-free by adding constraints (or rather: discovering constraints inherent in the business-domain you can use). For example (and the best example, too) is to use an append-only logical model: then the synchronization problem reduces down to merge-sort. Another kind of constraint might be to simply disallow "edit" access to local data when working-offline (without a prior lock or lease being taken) but still allowing "create".
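To make the merge-sort point concrete, a TypeScript sketch (the ordering key is whatever total order the replicas agree on; timestamp plus replica id is just one choice):

    interface LogEntry {
      timestamp: number; // or a Lamport / hybrid logical clock
      replicaId: string; // tie-breaker so the order is total and deterministic
      payload: unknown;
    }

    function compare(x: LogEntry, y: LogEntry): number {
      if (x.timestamp !== y.timestamp) return x.timestamp - y.timestamp;
      return x.replicaId < y.replicaId ? -1 : x.replicaId > y.replicaId ? 1 : 0;
    }

    // Each replica keeps its entries sorted; synchronizing two replicas is just
    // merging two sorted lists and dropping duplicates.
    function mergeLogs(a: LogEntry[], b: LogEntry[]): LogEntry[] {
      const out: LogEntry[] = [];
      let i = 0;
      let j = 0;
      while (i < a.length || j < b.length) {
        let next: LogEntry;
        if (j >= b.length) next = a[i++];
        else if (i >= a.length) next = b[j++];
        else if (compare(a[i], b[j]) <= 0) next = a[i++];
        else next = b[j++];
        const last = out[out.length - 1];
        if (!last || compare(last, next) !== 0) out.push(next); // skip duplicates
      }
      return out;
    }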
> Database (or any other magical tool) can’t do it for you.
Yes-and-no.
I'm no fan of CORBA and COM+ (...or SOAP, or WS-OhGodMakeItStop), but being "enterprise-y" meant they brought distributed transactions to any application, and that includes RDBMS-mediated distributed transactions (let's agree, an RDBMS is in a far better position to be the canonical transaction server than an application server running in front of it). For distributed systems needing transient distributed locks to prevent conflicts in the first place (so only used by interactive users in the same LAN, really), this worked just as well as a local-only solution - and made it fault-tolerant too.
...so it is unfortunate that with the (absolutely justified) back-to-basics approach of REST[1] we lose built-in support for distributed transactions (even some of the more useful and legitimate parts of WebDAV - and so, piggy-backing on our web servers' built-in support for WebDAV verbs - seem to be going away). This all raises the barrier to entry for doing distributed transactions _right_, which means the next set of college hires won't have been exposed to it, which means it won't be a standard expected feature in the next major internal application they'll write for your org. Which means you'll either have a race condition impacting a multi-billion-dollar business thing that no-one knows how to fix, or, more likely, just a crappy UX where you have to tell your users not to reload the page too quickly "just in case". Yes, I see advisories like that in the Zendesk pages of the next line-of-business SaaS you'll be voluntold to integrate into your org.
(I think today, the "best" way to handle distributed-locking between interactive-users in a web-app would necessitate using a ServiceWorker using WebRTC, SSE, or a highly-reliable WebSocket - which itself is a load of work right there - and don't forget to do all your JS feature-checks because eventually someone will try to use your app on an old Safari edition because they want to keep on using their vintage Mac) - or anyone using Incognito mode, _gah_.
[1]: https://devblast.com/b/calling-your-web-api-restful-youre-do...
Conflict resolution is never going away. It's important to distinguish between syntactical and semantical conflicts though, the first of which can be solved, but the other will always require manual intervention.
It is a useful tool, but not the only future.
Or think about any kind of large-ish scale enterprise SaaS. One of the clients I'm working with currently sells a Transportation Management Software system (think logistics, truck loads, etc). There are very small portions of the app that I can imagine relying on a sync engine, but being able to search over hundreds of thousands of truck loads, their contents, drivers, etc seems like it would be infeasible to do via a sync engine.
I mention this because it seems that sync engines get a lot of hype and interest these days, but they apply to a relatively small subset of applications. Which may still be a lot, but it's a bit much to say they're the future (I'm inferring "of application development"--which is what I'm getting from this article).
This is one of the ideas that appears to be central to the genesis of Zero [1]
ElectricSQL allows for a similar pattern and PowerSync is also working on this [2]
[1] https://www.youtube.com/watch?v=rqOUgqsWvbw
[2] https://www.powersync.com/blog/powersync-2025-roadmap-sqlite...
Edit: I watched the presentation (which I really enjoyed) and also read the blog post. For anyone with less time, the answer is essentially: don't sync everything, treat the local data like a cache. Sync as much as you can into that cache, and then reach out to the server for other things.
Should I write this logic in the DB itself ? Seems impractical.
Yes?
If it sounds impractical, it's because the whole industry got used to not learning databases beyond the most basic SQL, and doing everything by hand in application code itself. But given how much of the code in most applications is just an ad-hoc reimplementation of databases, and how much of the business logic is tied to data rather than application-specific things, I can't help but wonder - maybe a better way would be to treat the RDBMS as an application framework and have the application itself be a thin UI layer on top?
On paper it definitely sounds like grouping concerns better.
I can imagine that working on a good day. I don't dare imagine error handling (though would love to look at examples).
Ultimately, it probably makes no sense to do everything in the database, but I still believe we're doing way too much in the application, and too little in the DB. Some of the logic really belongs to data (and needs to be duplicated for any program using the same data, or else...; probably why people don't like to share databases between programs).
And, at a higher level, I wonder how far we could go if we pushed all data-specific logic into the DB, and the rest (like REST calls) into dedicated components, and used a generic orchestrator to glue the parts together? What of the "application code" would remain then, and where would it sit?
Stored procedures have been a thing. I've seen countless apps that had a thin VB UI and a MSSQL backend where most of the logic is implemented. Or, y'know, Access. Or spreadsheets even!
And before that, AS/400 et al.
But ORMs came in, and then the impedance mismatch is too great. Splitting data wrangling across two completely different points of view makes it extremely hard to reason about.
What I am saying is, in a perfect world, the database and the server will be one and the same, running code _and_ data at the same time. There's really no good reason why they are separated, and it causes a lot of inconvenience right now.
Something similar but in the opposite direction of lessening DB-responsibilities in favor of logic-layer ones: Driving everything from an event log. (Related to CQRS, Event-Sourcing.)
It means a bit less focus on "how do I ensure this data-situation never ever ever happens" logic, and a bit more "how shall I model escalation and intervention when weird stuff happens anyway."
This isn't as bad as it sounds, because any sufficiently old/large software tends to accrue a bunch of informal tinkering processes anyway. It's what drives the unfortunate popularity of DB rows with a soft-deleted mark (that often require manual tinkering to selectively restore) because somebody always wants a special undo which is never really just one-time-only.
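A toy TypeScript sketch of that shape - the log is the source of truth, current state is a fold over it, and "weird stuff happened anyway" becomes just another event appended later (event names are made up):

    type Event =
      | { type: "ItemAdded"; id: string; name: string }
      | { type: "ItemRemoved"; id: string }
      | { type: "ItemRestored"; id: string }; // the "intervention" case

    interface State {
      items: Map<string, { name: string; removed: boolean }>;
    }

    function reduce(state: State, event: Event): State {
      const items = new Map(state.items);
      switch (event.type) {
        case "ItemAdded":
          items.set(event.id, { name: event.name, removed: false });
          break;
        case "ItemRemoved": {
          const item = items.get(event.id);
          if (item) items.set(event.id, { ...item, removed: true });
          break;
        }
        case "ItemRestored": {
          const item = items.get(event.id);
          if (item) items.set(event.id, { ...item, removed: false });
          break;
        }
      }
      return { items };
    }

    // Current state is always derivable by replaying the log from the start.
    function currentState(log: Event[]): State {
      const initial: State = { items: new Map() };
      return log.reduce(reduce, initial);
    }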
I used to work for Oracle but never liked that approach.
These are valid concerns, but $deity I wish we focused on finding solutions for them, because the current paradigm of edit/compile/run + plaintext single source of truth codebase, is already severely limiting our ability to build and maintain complex software.
Then those multiple implementations inevitably suffer different bugs and drift, leading to really ugly bugs.
This always leads me back to operational transforms. Operations being reified changes function as undo records; a log of changes; and a narrower, semantically-meaningful API, amenable to validation and authz.
For the Roam Firebase example: this only works if you can either trust the client to always perform valid actions, or you can fully validate with Firebase's security rules.
OT has critiques, but almost all of them fall away in my experience when you have a star topology with a central service that mediates everything - defining the canonical order of operations, performing validation & auth, and recording the operation log.
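Sketched in TypeScript, the central service really only has to do this (the auth, validation, and broadcast hooks are placeholders for the application's own rules):

    interface Operation {
      clientId: string;
      kind: string;
      payload: unknown;
    }
    interface LoggedOperation extends Operation {
      seq: number; // canonical position, assigned by the server
    }

    // Placeholder hooks for the application's own rules.
    declare function isAuthorized(op: Operation): boolean;
    declare function isValid(op: Operation, log: LoggedOperation[]): boolean;
    declare function broadcast(op: LoggedOperation): void;

    const log: LoggedOperation[] = [];

    // The server mediates everything: authorize, validate, assign the canonical
    // order, record, rebroadcast. Clients rebase pending local ops on top of
    // what they receive back.
    function submit(op: Operation): LoggedOperation {
      if (!isAuthorized(op)) throw new Error("rejected: not authorized");
      if (!isValid(op, log)) throw new Error("rejected: invalid against current log");
      const logged: LoggedOperation = { ...op, seq: log.length };
      log.push(logged);
      broadcast(logged);
      return logged;
    }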
Sounds like another kind of synchronization database.
It is based on gRPC and uses a state machine on all instances that transitions through different states for connection setup, "bulk sync", "live sync" and connection wind down.
Bulk sync is the state that is used when an instance comes online and needs to catch up on any missed changes. It is also the self-heal mechanism if something goes wrong.
Unfortunately some embedded instances have super unreliable clocks that drift quite a bit (in both directions). We're considering switching to a logical clock.
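For reference, the logical-clock version is tiny (a plain Lamport clock in TypeScript; a hybrid logical clock mixes wall time back in, but the idea is the same):

    // Lamport clock: a counter that only moves forward. Local events get
    // increasing timestamps; receiving a message pulls you ahead of the sender,
    // so causally related events are always ordered correctly.
    class LamportClock {
      private time = 0;

      // Call before a local event or before sending a message.
      tick(): number {
        this.time += 1;
        return this.time;
      }

      // Call when receiving a message carrying the sender's timestamp.
      receive(remoteTime: number): number {
        this.time = Math.max(this.time, remoteTime) + 1;
        return this.time;
      }
    }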
We have quite a bit of code that deals with conflicts.
I inherited this from my predecessor. Nowadays I would probably not implement something like this again, as it is quite complex.
Syncing blackboards sounds like going straight against the spirit of that design pattern.
Example: you develop a web app to book for flights online.
My browser points to it and I login. Should synchronization start right now? Before I even input my departure point and date?
Ok, no. I write NYC -> BER, and a dep date.
Should I start synching now?
Let's say I do. Is this really more efficient than querying a webservice?
Ok, now all data are synched. Even potentially the ones for business class, even if I just need economy.
You know, I could always change my mind later. Or find out that on the day I need to travel no economy seats are available anymore.
Whatever. I have all the inventory data that I need. Raw.
Guess what? As a LH frequent flyer I get special treatment in terms of price. Not just for LH, but most Business Alliance airlines.
This logic is usually on the server, because airlines want maximum creativity and flexibility in handling inventory.
Should we just synch data and make the offer selection algorithm run on the webserver instead?
Let's say it does not matter... I have somehow in front of me all the options for my trip. So I call my wife to confirm she agrees with my choice. I explain the alternatives to her... this takes 5 minutes.
In this period, 367 other people are buying/cancelling trips to Europe. So I either see my selection constantly change (yay! Synchronization!!!) or I press confirm, and if my choice is gone I get a warning message and I repeat my query.
Now add two elements: - airlines prefer not to show real numbers of available seats - they will usually send you a single digit from 1 to 9 or a "*" to mean "10 or more".
So just synching raw data and letting the combinatorial engine work in the browser is not a very good idea.
Also, I see the potential to easily mount DDoS attacks if every client is constantly being synchronized by copying high-contention tables in RT.
What am I missing here?
I write an audio recording app, and in my app, users have most to gain from their own data. For most people, syncing is basically an afterthought. In this use case, the ability of having your recordings in your phone is the most important thing.
The difference here lies that in my app, the user generates all the valuable data themselves. In your app, nothing valuable can happen without communication with the airline.
But then the post's claim that "everything is a synchronization problem" seems like it should be qualified better.
Also, most of the comments before mine seemed to be in full agreement that yeah, full synchronization would be a silver bullet, even for cache invalidation
I want to try it for a hobby project, and I think I will go the route of just one-way sync (from database to clients) using Electric SQL, and have writes done in a traditional way (POST requests).
I like the idea of having server db and local db in sync, but what happens with writes? I know people say CRDT etc... but they are solving conflicts in unintuitive ways...
I know I probably sound uneducated, but I think the biggest part of this is still solving conflicts in a good way, and I don't really see how you can solve those in a way that works for all different domains and have it "collapsed" as the author says
I've been using `starfx` which is able to "sync" with APIs using structured concurrency: https://github.com/neurosnap/starfx
The use case the article describes is exactly the idea behind CouchDB: a database that is at the same time the server, and that's made to be synced with the client.
You can even put your frontend code into it and it will happily serve it (aka CouchApp).
https://couchdb.apache.org
It requires deeper (and more) integration work compared to solutions that sync your state for you, but is a lot more flexible wrt. the backend technology choices.
At its core, it is an action synchronizer. You manage both your local state and remote state through redux-style actions, and the library takes care of syncing and resequencing them (if needed) so that all clients converge at the same state.
[1] https://logux.org/
At the same time, I had a great time developing on Meteorjs a decade ago, which used Mongo on the backend and then synced the DB to the frontend for you. It was really fluid. So I look forward to things like this being tried. In the end though, Meteor is essentially dead today, and there's nothing to replace it. I'd be wary of depending so fully on something so important. Recently Faunadb (a "serverless database") went bankrupt and is closing down after only a few years.
I see the product being sold is pitched as a "relational version of firebase", which I think is a good idea. It's a good idea for starter projects/demos all the way up to medium-sized apps (and might even scale further than Firebase by being relational), but it's not "The Future" of all app development.
Also, I hate to be that guy, but the SQL in the example could be simpler; when aggregating into JSON it's nice to use a LATERAL join, which essentially turns the join into a for loop and synthesises rows "on demand":
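(I don't have the article's exact query in front of me, so here is a generic illustration of the shape in TypeScript with the postgres client; the issues/comments tables are made up:)

    import postgres from "postgres";

    const sql = postgres(process.env.DATABASE_URL!);

    // LATERAL runs the subquery once per outer row, like a for loop, so every
    // issue row arrives with a ready-made JSON array of its comments.
    const issues = await sql`
      SELECT i.id,
             i.title,
             c.comments
      FROM issues i
      LEFT JOIN LATERAL (
        SELECT coalesce(
                 json_agg(json_build_object('id', cm.id, 'body', cm.body)
                          ORDER BY cm.created_at),
                 '[]'::json
               ) AS comments
        FROM comments cm
        WHERE cm.issue_id = i.id
      ) c ON true
    `;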
That still proves the author's point that SQL is a very complicated tool, but I will say the query itself looks simpler (only 1 join vs 2 joins and a group by) if you know what you're doing.
Care to explain what you mean by "dead"? Just today v3.2 came out, and the company, the community, and their paid-for hosting service seem pretty alive to me.
Will do my best to take action on any feedback I receive here
Having a datalog query engine, supplying it with data from Datomic indexes - b-tree like collections storing entity-attribute-value records - seems simple. Updating the local index cache from log is also simple.
And that gets you a db in browser.
- Reliable communication is hard
- Optimistic writes on the client are hard
- Tracking subsets of data is hard (you don't want the entirety of Datomic on the client, do you?)
- Permissions are hard in this model
Why didn't I implement it? Mostly comes down to free time. It's a hobby project and it's hard to find time for it. I also stopped writing web apps so immediate pressure for this went away.
This is out of scope - I don't mean a functional equivalent of instantdb. Just a database in browser.
> - Reliable communication is hard
The same, no special requirements. Just send a request, maybe retry several times (with increasing delays), and give up by throwing an error.
> Tracking subsets of data is hard (you don't want the entirety of Datomic on the client, do you?)
That's the only thing really missing. And it doesn't seem hard. I think Datomic Peer just keeps a fixed number of index pages in cache. Pages missing in the cache are just retrieved from storage.
As a result, the cache keeps the working subset - the elements related to the entities needed for the queries and entity API requests made by the application. And since the indexes are ordered (EAVT, AEVT, AVET, VAET), much of the data in the cache will be relevant to the application.
> - Permissions are hard in this model
Permissions are a question, but there are useful applications where permission control is not needed.
Similar to what you say in another comment about conflict resolution: "Another misconception is that conflict resolution needs to be “solved” perfectly before any progress can be made. That is not true as well. You might have unhandled conflicts in your system and still have a working, useful, successful product."
Back in the day, when DataScript first appeared and I was eager to see it working with larger-than-memory datasets (and maybe even reading data saved by Datomic by understanding its format), I wanted that in order to let the public run read-only queries on a large database I was assembling that didn't fit into memory.
In some applications all users may have equal write access to the document / data.
Server-side usage of DataScript could be another case that does not require permissions support in the DB. That's how Datomic itself is used.
I am not complaining, and I understand there are limits on what people can do in their free time. You did huge work on your open source projects. But I regretted that a seemingly small step to open DataScript to out-of-memory data, which I thought would greatly expand its applicability, was missing.
Good luck with instantdb. Hopefully commercial success will allow you to continue putting work into it and improving the tech landscape.
It is full of general statements that are only true for a subset of solutions. Enterprise solutions in particular are vastly more complex and can't be magically made simple by a syncing database. (no solution comes even close to "99% business code". Not unless you re-define what business code is)
It is astounding how many senior software engineers or architects don't understand that their stack contains multiple data models and even in a greenfield project you'll end up with 3 or more. Reducing this to one is possible for simple cases - it won't scale up. (Rama's attempt is interesting and I hope it proves me wrong)
From: "yeah, now you don't need to think about the network too much" to "humbug, who even needs SQL"
I've seen much bigger projects fail because they fell for one or both of these ideas.
While I appreciate some magic on the front-end/back-end gap, being explicit (calling endpoints, receiving server-side-events) is much easier to reason about. If we have calls failing, we know exactly where and why. Sprinkle enough magic over this gap and you'll end up in debugging hell.
Make this a laser focused library and I might still be interested because it might remove actual boilerplate. Turn it into a full-stack and your addressable market will be tiny.
bold of them to assume a database can manage even the most trivial of conflicts.
There's a reason you bombard all your writes to a "main/master/etc"
Main problem? No result caching but that's "just" a middleware to implement.
It has a couple of features relevant to this blog post.
The first: Niki points out that in standard SQL, producing JSON documents from relational tables is awkward and the syntax is terrible. This is true, so there's a better syntax for it.
It makes compound JSON documents from data stored relationally. This has three advantages: (1) JSON documents get materialized on demand by the database instead of requiring frontend code to do it, (2) the ORDS proxy server can serve these over HTTP via generic authenticated endpoints (e.g. using OAuth or cookie based auth), so you may not need to write any code beyond SQL to get data to the browser, and (3) the JSON documents produced can be written to, not only read.
The second feature is query change notifications. You can issue a command on a connection that starts recording the queries issued on it, and then get a callback or a message posted to an MQ when the results change (without polling). The message contains some info about what changed. So by wiring this up to a web socket, which is quite easy - the work of an hour or two in most web frameworks - you can stream changes to the client directly from the database without needing much logic or third-party integrations. You either use the notification to trigger a full requery and send the entire result JSON back to the browser, or you can get fancier and transform the deltas to JSON subsets.
It'd be neat if there was a way to join these two features together out of the box, but AFAIK if you want full streaming of document deltas to the browser and reconstituting them there, it would need a bit more on top.
Again, you may feel this is irrelevant because doesn't every self-respecting HN reader use Postgres for everything, but it's worth knowing what's out there. Especially as the moment you decide to pay a cloud for hosting your DB you have crossed the Rubicon anyway (all the hosted DBs are proprietary forks of Postgres), so you might as well price out alternatives.
[1] and you know the drill, views are my own and nobody has reviewed this post.
[2] https://news.ycombinator.com/item?id=42855546
HOWEVER. There is no world where the lifetime costs of using Postgres for any successful company anywhere in the world are greater than the lifetime costs of using Oracle. I understand that's a key message for your sales team to get out, but only one of the CEOs at Oracle and Percona has flown a fighter jet underneath the Golden Gate Bridge.
Oracle licensing is famously, famously sticky. Extremely. Incredibly. It's how the company was built and is maintained.
I've never talked to database sales people and have no idea what messages they have or care about. Actually I'm 99% sure they don't care about the HN/startup crowd at all - did you see anyone except me talk about this stuff here? Me neither. I'm making this argument basically because I like making arguments early that are surprising but correct, and databases feel like fertile ground for such arguments. There's a lot of groupthink in this space. And you know all about my history with surprising technology arguments, Peter ;)
Anyway I'd be interested to see a spreadsheet with a worked set of scenarios for both cost and "stickiness" however it's defined (genuinely). I think it's going to depend heavily on:
a) Whether you cloud host or not. The cost of a small Postgres that you run yourself is pretty much whatever your own time is valued at, as self-hosted hardware is cheap. The costs of a Postgres you outsource can be really un-intuitively high. I already showed that a cloud-hosted elastic Oracle DB can be cheaper than a same-spec AWS-managed Postgres despite a massive feature disparity on one side. Costs here aren't dominated by hardware or software purchase costs.
b) What features and scaling level you need, combined with cost of labour in your area. If you want to scale up a Postgres based operation very fast then that's going to take a ton of skilled engineering effort, devs will be slowed down a lot as they spend time on implementing custom sharding schemes etc. At some point the cost of rolling your own ad-hoc solutions to these things will cross with the cost of just buying a system that already solves them all out of the box. Where that cross-point is will depend on all kinds of things like opportunity cost, cost of hiring, cost of developer productivity....
c) Whether you consider unique features to be "stickiness". You're claiming the licensing is sticky here, but companies negotiate all kinds of licenses, so what does that mean? By default it's charged per core like any other commercial DB (or in the cloud by core seconds/storage). If unique features are the problem, then that's an aspect of choosing any tech platform. If you're taking advantage of full SQL joins on a 50-node horizontally scaled multi-master cluster, then yeah, trying to migrate to something else is going to be sticky because there aren't many other products that offer that. That's tech for you. Still, these days I guess it must be less sticky because there are other people selling very scalable SQL-speaking databases like Spanner.
As for Larry Ellison's stunts, that's great, but if you're deciding what platform to use on the basis of executive horsepower then you can pick between fighter jets, Jeff Bezos' rockets, Bill Gates' yachts or Larry Page's flying cars. Selling databases seems to go hand in hand with high-tech vehicles, which is probably a sign there's some actual value being delivered there, somewhere.
I referenced Larry as a proxy for his extreme wealth. Although it is true he’s one of the great businessmen of the late 20th century. Just not the sort you want to be in a business deal with in general.
Oracle has always been good both at adding helpful features that developers rely on, making switching difficult, and at teasing companies into using more licenses than they've purchased, then smacking them with audits and fees as a stick, and a 'cheaper' larger license as a carrot to avoid the audit fees.
In the 90s, this was tech like PL/SQL and Materialized views - I’m long out of the Oracle game, so I have no idea where they compete on features now vis-a-vis open source — but I will say that I have owned companies where the Oracle license was both HATED — and outlived all original owners of the company. It’s hard to replace once it’s in your workflow, and that is 100% by design.
Assume that permissions to any row in the DB can be removed at any time. If we store the data offline, this security measure is already violated. If you don't care about a user potentially storing data they no longer have access to, when they come online, any operations they make are invalid and that's fine
But, if security access is part of your business logic, and is complex enough to the point where it lives in your app and not in your DB (other than using DB tools like RLS), how do you verify that the user still has access to all cached data? Wouldn't you need to re-query every row every time?
I'm still uncertain how these sync engines can be secured properly
Returning to the contacts example, Google Contacts attempts to mitigate error propagation by introducing a review stage, where users can decide how to handle duplicates (e.g., merge contacts that contain different information).
In the broader context of sync, this highlights the need for policies to handle situations where syncing is simply not possible beyond all the smart logic we may implement.
1. Naming things
2. Caching
3. Off-by-one errors
Yes! A thousand times this!
Databases can't just "live on a server somewhere"; their code should extend into the clients. The client isn't just a network protocol parser / serialiser, it should implement what is essentially an untrusted, read-only replica. For writes, it should implement what is essentially a local write-ahead log (WAL), either in memory or optionally fsync'd to local storage. All of this should use the same codebase as the database engine, or be machine-generated in multiple languages from some sort of formal specification.
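A bare-bones TypeScript sketch of the write side of that idea (purely illustrative; a real client also persists the queue and reconciles against what the server streams back):

    import { randomUUID } from "node:crypto";

    interface Write {
      id: string; // client-generated, so retries are idempotent
      mutation: string;
      args: unknown;
    }

    // Placeholder transport and local store.
    declare function sendToServer(write: Write): Promise<void>;
    declare function applyLocally(write: Write): void;

    // The client's local "WAL": writes are applied optimistically and kept in a
    // queue until the server acknowledges them, surviving offline periods.
    const pending: Write[] = [];

    function write(mutation: string, args: unknown): void {
      const w: Write = { id: randomUUID(), mutation, args };
      applyLocally(w); // optimistic apply against the local replica
      pending.push(w);
      void flush();
    }

    async function flush(): Promise<void> {
      while (pending.length > 0) {
        try {
          await sendToServer(pending[0]); // server applies it and echoes it back in the replica stream
          pending.shift();
        } catch {
          break; // offline or server error: keep it queued, retry later
        }
      }
    }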
When sync Just Works™, it's a magical thing.
One of the reasons my design has been reliable from its very first release, even across multiple refactors/rewrites (I believe it's currently on its third, this time to Swift), is that it uses a Git-like model internally with pervasive hashing. It's almost impossible for sync to work incorrectly (if it works at all).
[0] https://culturedcode.com/things/cloud/
I would be interested to read any articles you've written about Things's sync pattern, if any.
Push becomes like git push to a tmp branch, then you do a server-side merge, then a pull on remaining clients. Push/pull is fast just like git, and due to content hashing, there are never any problems overwriting another device's work on the server.
I think I had some clever tricks with text syncing (IIRC I implemented a Merkle tree-inspired approach), but that's the general concept.
(I don't remember anything else, it was so long ago. :-)
Merkle-tree based synching is about as performant as you can possibly get (used by the Git protocol too, I believe), because you can tell if the root of a tree structure is identical to some other remote tree structure just by comparing the hash strings. And this can be applied recursively down any "changed branches" of the tree to implement very fast syncing mechanisms.
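A compact TypeScript illustration of that recursion (hashing with Node's crypto; in practice you would cache the hashes and also walk remote-only children):

    import { createHash } from "node:crypto";

    interface TreeNode {
      name: string;
      children?: TreeNode[]; // internal node
      content?: string; // leaf
    }

    // A node's hash covers either its content or its children's hashes, so equal
    // root hashes mean the entire subtrees are identical.
    function hashNode(n: TreeNode): string {
      const h = createHash("sha256");
      if (n.children) {
        for (const child of n.children) h.update(child.name + ":" + hashNode(child));
      } else {
        h.update(n.content ?? "");
      }
      return h.digest("hex");
    }

    // Recurse only into branches whose hashes differ; identical subtrees are
    // skipped entirely, which is what makes the sync fast. (One direction only:
    // a full diff also visits remote-only children.)
    function changedPaths(local: TreeNode, remote: TreeNode, prefix = ""): string[] {
      if (hashNode(local) === hashNode(remote)) return [];
      if (!local.children || !remote.children) return [prefix + local.name];
      const remoteByName = new Map(remote.children.map((c) => [c.name, c]));
      const out: string[] = [];
      for (const child of local.children) {
        const other = remoteByName.get(child.name);
        if (!other) out.push(prefix + local.name + "/" + child.name);
        else out.push(...changedPaths(child, other, prefix + local.name + "/"));
      }
      return out;
    }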
I think we need a NEW INTERNET (i.e. Web3, and dare I say Semantic Web built in) where everyone's basically got their own personal "Tree of Stuff" they can publish to the world, all natively built into some new kind of tree-structure-based killer app. Like imagine having Jupyter Notebooks in tree form, where everything on it (that you want to be) is published to the web.
- https://news.ycombinator.com/item?id=43436645
- https://greenvitriol.com/posts/sync-engine-for-everyone
The Rocicorp Zero Sync / instantdb / Linear-app-like trend is great -- sync will be big. I hope a lot of the SPA slop gets fixed!
Define "separate", but in my old X11 compositor project neocomp I did something like that with a series of AOS arrays and bitfields that combined to make a sort of entity manager. Each index in the arrays was an entity, and each array held data associated with a "type" of entity. An entity could hold multiple types that would combine to specify behavior. The bitfield existed to make it quick to query.
It was waaay too complicated for what it was, but it was fun to code and worked well enough. I called it a "swiss" (because it was full of holes). It's still online on GitHub (https://github.com/DelusionalLogic/NeoComp/blob/master/src/s...) even though I don't use it much anymore.
> If your database is smart enough and capable enough, why would you even need a server? Hosted database saves you from the horrors of hosting and lets your data flow freely to the frontend.
(this is a blog of one such hosted database provider)
Jokes aside, Firebase access control is a nightmare, and all those database-as-an-API things have the same problem.
It seems like generally, the best way to expose your database to the internet is considered to be not doing so in the first place, i.e., have your webserver query and cache a hosted database that isn't directly exposed.
Is my understanding correct? It seems that almost all data breaches we hear about are directly exposed databases or their cloud equivalents.
Is doing this in the era of "cloud" being made impossible?
In fact, there are many benefits of directly exposed databases - many of which would remove the need for applications normally sitting on top of those databases, which are strictly inferior and less ergonomic and overall more shitty than a generic database browsing interface. But that's another reason for why things are the way they are: people wanna make money, and having your application be a toll booth between useful data you own and the rest of the world, is tried and true way of making money.
--
[0] - Because they're not normally exposed, because they're not designed for it, because... it's a self-reinforcing loop.
It seems that solving that is a cultural/economic problem, not a technical one, and that's a shame.