Skip down to the bottom for a simple approach that just works, in sufficiently recent PG releases (MERGE). Back up just a bit for something that's nearly as good, and works with everything since Postgresql 10 (INSERT ... ON CONFLICT). Most of the post describes more complicated approaches that have subtle race conditions, and are worth looking at mainly to understand what the problems are.
geekodour 163 days ago [-]
at one point i was using upserts for everything but then had a case where it was impossible to get it working with generated id.
Small nitpick - the author mentions upsert in the beginning of the article only to "forget" about it and use it in the end (insert .. on conflict ignore) and that's the obvious solution there.
The final query is very neat though! And special thanks for mentioning "MERGE ... RETURNING" in PG17, that's really cool
atemerev 163 days ago [-]
I wonder why this really common use case requires so much esoteric SQL/Postgres knowledge. While everything is well explained, the solution is not simple, by all means. I don’t like complex solutions to common problems.
robertlagrant 163 days ago [-]
SQL is partitioned along the major axes of GET, INSERT, UPDATE, DELETE. Combining any of them just isn't so natural for SQL. Upsert is another one.
da_chicken 163 days ago [-]
It's not that difficult to do in the general case. Usually, you're just doing it through the program.
- Open connection
- Begin transaction
- Do INSERT
- If you get a PK violation, do an UPDATE instead
- Commit
The thing is, while it's a common pattern, it also has many equally common similar patterns. Sometimes you want to try the UPDATE first. Sometimes you want the program to throw an exception if the INSERT fails. Sometimes you're doing something with row versioning or pessimistic locking.
It's same reason why your file IO library might have a ReadAllText() method, but only after it has Open() and Read() and ReadLine().
zokier 163 days ago [-]
- Begin transaction
- Do INSERT
- If you get a PK violation, do an UPDATE instead
- Commit
This is not safe in the default transaction mode, other transactions can delete the row between insert and update/select.
Also, as explained by the article, this has the table bloating problem
da_chicken 162 days ago [-]
If that's not safe, then Postgres has not implemented transactions correctly. That seems unlikely, so my guess is that what you think I'm suggesting is not what I'm actually suggesting.
zokier 162 days ago [-]
there is nothing wrong postgres transactions, they work exactly as intended, but they aren't magic. this is pretty easy to test:
Session 1:
psql (16.3 (Debian 16.3-1.pgdg120+1))
Type "help" for help.
postgres=# CREATE TABLE tags (
id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
name VARCHAR(50) NOT NULL
);
CREATE TABLE
postgres=# ALTER TABLE tags ADD CONSTRAINT tags_name_unique UNIQUE(name);
ALTER TABLE
postgres=# INSERT INTO tags (name) VALUES ('A'), ('B') RETURNING *;
id | name
----+------
1 | A
2 | B
(2 rows)
INSERT 0 2
postgres=# BEGIN;
BEGIN
postgres=*# INSERT INTO tags (name) VALUES ('B') ON CONFLICT DO NOTHING RETURNING *;
id | name
----+------
(0 rows)
INSERT 0 0
Session 2:
psql (16.3 (Debian 16.3-1.pgdg120+1))
Type "help" for help.
postgres=# DELETE FROM tags WHERE name = 'B';
DELETE 1
Session 1 continues:
postgres=*# SELECT * FROM tags WHERE name = 'B';
id | name
----+------
(0 rows)
postgres=*# END;
COMMIT
(yes, I added `ON CONFLICT DO NOTHING` to avoid aborting the transaction prematurely)
da_chicken 162 days ago [-]
I did some more research. Postgres doesn't handle transactions correctly. Once an error on a constraint occurs, subsequent commands are ignored. You can't handle errors in a transaction. Most RDBMSs don't require a rollback in that situation. It's weird.
anarazel 162 days ago [-]
You can, you just need to use savepoints.
zokier 162 days ago [-]
Yeah the error handling is a side-note here. The main thing to understand is that postgresql transactions have "read committed" isolation level by default, which means that reads will see the writes from any committed other transactions, which means that there can be arbitrary changes to data between operations, e.g. between insert and select, even if no errors or conflicts are involved. This is pretty much what the sample I posted in parent comment demonstrates.
BeefySwain 163 days ago [-]
What transaction mode would you need to make this safe? Can you explain further?
VWWHFSfQ 163 days ago [-]
I believe PG will abort the transaction on an exception (PK violation) and the subsequent update will not run in the same context that had original violation. So it could result in a data race. I don't know what isolation level would fix that, if any.
My understanding is that in general, if you hit an exception in postgres at all then you can't trust the isolation of the current transaction anymore.
That's what MERGE and speculative insertion (on conflict do update) addresses.
zokier 162 days ago [-]
well, yes, you are right that exception does abort the transaction. but even if you use something like 'on conflict do nothing' to avoid the exception, you still can get problems with concurrent writes (see my sibling comment https://news.ycombinator.com/item?id=41169638)
REPEATABLE READ or SERIALIZABLE isolation levels would help with that, as the name suggest repeatable read ensures that the read made by inserts constraint check is repeatable in successive select (or update) statements.
Most RDBMSs don't behave this way. They allow you to correct exceptions that occur for anything less than a deadlock without a total rollback, but not Postgres.
eknkc 163 days ago [-]
Oh I've been doing the `on conflict do update set id = EXCLUDED.id` thing for ages now. Thought the update would be no-op because nothing changed. Interesting.
arkh 163 days ago [-]
> Thought the update would be no-op because nothing changed.
You can use a specific trigger function just to prevent this: suppress_redundant_updates_trigger
> The suppress_redundant_updates_trigger function, when applied as a row-level BEFORE UPDATE trigger, will prevent any update that does not actually change the data in the row from taking place. This overrides the normal behavior which always performs a physical row update regardless of whether or not the data has changed.
agwa 163 days ago [-]
I just tested that and the trigger causes the INSERT to return no rows if the row already exists.
It basically turns the ON CONFLICT DO UPDATE into ON CONFLICT DO NOTHING.
teget 163 days ago [-]
If you update a column that is not indexed you might get away with it without stressing the cleanup processes too much (vacuum and friends):
This is super informative, thank you. I did a small OSS project around postgresql a while ago to implement a simple document store with Kotlin and a low level co-routine friendly postgresql driver called jasync-postgresql that doesn't do blocking IO. I ended up using a few of the other solutions mentinoed in the article. But based on this, I might want to upgrade it to use MERGE.
My intention with this project is to use it for storage and transactions. Most of our query heavy lifting is done via external search engines like Elasticsearch. And my attitude with databases is that if you are not querying on it, it doesn't need separate tables, columns, and indices. The point with a document store is that you only use limited querying strategies. I've used documents stores for various projects for over ten years and that style of persistence just fits a lot of things I do.
I'm currently planning to roll this out for one of our products. So, this is not just a hobby project. If somebody has time; I wouldn't mind some eyeballs on this project: https://github.com/formation-res/pg-docstore
163 days ago [-]
shiandow 163 days ago [-]
Interesting and very educational article.
I do wonder why in the first example of inserting multiple tags with get_or_create_tag the id of 'C' suddenly changes to 4. Were some queries perhaps not run in the order they appear?
CRConrad 163 days ago [-]
I got the same impresssion, that the demo wasn't implemented as a single "clean" run. Another indication of this is that the next passage, about eliminating "bloat" from failed inserts, shows the table containing two live tuples again, not three or four as it should after the preceding example.
haki 163 days ago [-]
Since at least two people noticed, I would just say that the insert-select-insert approach was added later, after I took some inspiration from Django's implementation of get_or_create.
163 days ago [-]
Rendered at 09:30:42 GMT+0000 (Coordinated Universal Time) with Vercel.
wrote about it very roughly here https://blog.geekodour.org/posts/upserts/
The final query is very neat though! And special thanks for mentioning "MERGE ... RETURNING" in PG17, that's really cool
- Open connection
- Begin transaction
- Do INSERT
- If you get a PK violation, do an UPDATE instead
- Commit
The thing is, while it's a common pattern, it also has many equally common similar patterns. Sometimes you want to try the UPDATE first. Sometimes you want the program to throw an exception if the INSERT fails. Sometimes you're doing something with row versioning or pessimistic locking.
It's same reason why your file IO library might have a ReadAllText() method, but only after it has Open() and Read() and ReadLine().
- Do INSERT
- If you get a PK violation, do an UPDATE instead
- Commit
This is not safe in the default transaction mode, other transactions can delete the row between insert and update/select.
Also, as explained by the article, this has the table bloating problem
Session 1:
Session 2: Session 1 continues: (yes, I added `ON CONFLICT DO NOTHING` to avoid aborting the transaction prematurely)My understanding is that in general, if you hit an exception in postgres at all then you can't trust the isolation of the current transaction anymore.
That's what MERGE and speculative insertion (on conflict do update) addresses.
REPEATABLE READ or SERIALIZABLE isolation levels would help with that, as the name suggest repeatable read ensures that the read made by inserts constraint check is repeatable in successive select (or update) statements.
https://www.postgresql.org/docs/current/transaction-iso.html is pretty comprehensive.
Most RDBMSs don't behave this way. They allow you to correct exceptions that occur for anything less than a deadlock without a total rollback, but not Postgres.
You can use a specific trigger function just to prevent this: suppress_redundant_updates_trigger
https://www.postgresql.org/docs/current/functions-trigger.ht...
> The suppress_redundant_updates_trigger function, when applied as a row-level BEFORE UPDATE trigger, will prevent any update that does not actually change the data in the row from taking place. This overrides the normal behavior which always performs a physical row update regardless of whether or not the data has changed.
It basically turns the ON CONFLICT DO UPDATE into ON CONFLICT DO NOTHING.
https://www.postgresql.org/docs/current/storage-hot.html
My intention with this project is to use it for storage and transactions. Most of our query heavy lifting is done via external search engines like Elasticsearch. And my attitude with databases is that if you are not querying on it, it doesn't need separate tables, columns, and indices. The point with a document store is that you only use limited querying strategies. I've used documents stores for various projects for over ten years and that style of persistence just fits a lot of things I do.
I'm currently planning to roll this out for one of our products. So, this is not just a hobby project. If somebody has time; I wouldn't mind some eyeballs on this project: https://github.com/formation-res/pg-docstore
I do wonder why in the first example of inserting multiple tags with get_or_create_tag the id of 'C' suddenly changes to 4. Were some queries perhaps not run in the order they appear?