Vector databases are the wrong abstraction (timescale.com)
morgango 2 hours ago [-]
Great point!

(Disclaimer: I work for Elastic)

Elasticsearch has recently added a data type called semantic_text, which automatically chunks text, calculates embeddings, and stores the chunks with sensible defaults.

Queries are similarly simplified: vectors are calculated and compared internally, which means much less I/O and much simpler client code.

https://www.elastic.co/search-labs/blog/semantic-search-simp...
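A minimal sketch of the request bodies this implies, written as plain Python dicts (the index and field names here are placeholder assumptions; see the linked post for the real walkthrough):

```python
# Mapping: declaring the field as semantic_text tells Elasticsearch to
# chunk and embed the text on ingest with sensible defaults.
mapping = {
    "mappings": {
        "properties": {
            "content": {"type": "semantic_text"}
        }
    }
}

# Query: a "semantic" query embeds the query text server-side, so the
# client never handles raw vectors at all.
query = {
    "query": {
        "semantic": {
            "field": "content",
            "query": "how do I rotate my API keys?"
        }
    }
}
```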

pjot 1 hours ago [-]
I made something similar, but used DuckDB as the vector store (and query engine)! It’s impressively fast

https://github.com/patricktrainer/duckdb-embedding-search

jdthedisciple 2 hours ago [-]
How does their embedding model compare in terms of retrieval accuracy to, say `text-embedding-3-small` and `text-embedding-3-large`?
binarymax 1 hours ago [-]
It’s impossible to answer that question without knowing what content/query domain you are embedding. Checkout MTEB leaderboard, dig into the retrieval benchmark, and look for analogous datasets.
splike 1 hours ago [-]
You can use openai embeddings in elastic if you don't want to use their elser sparse embeddings
jdthedisciple 2 hours ago [-]
What's wrong with using FAISS as your single db?

It's like sqlite for vector embeddings, and you can store metadata (the primary data, foreign keys, etc) along with the vectors, preserving the relationship.

Not sure if the metadata is indexed, but at least iirc it's more or less trivial to update the embeddings when your data changes (tho I haven't used it in a while so not sure).
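The pattern being described can be sketched in pure Python: stable integer IDs link vectors to a metadata side table, which is roughly what FAISS's `IndexIDMap` gives you (the brute-force cosine search below is a stand-in for a real ANN index; all names here are illustrative):

```python
import math

vectors = {}    # id -> embedding
metadata = {}   # id -> primary data / foreign keys

def add(doc_id, embedding, meta):
    vectors[doc_id] = embedding
    metadata[doc_id] = meta

def update(doc_id, embedding):
    # Updating is trivial because the ID is stable.
    vectors[doc_id] = embedding

def search(query, k=1):
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    ranked = sorted(vectors, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return [(i, metadata[i]) for i in ranked[:k]]

add(1, [1.0, 0.0], {"title": "doc one"})
add(2, [0.0, 1.0], {"title": "doc two"})
print(search([0.9, 0.1]))  # -> [(1, {'title': 'doc one'})]
```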

avthar 1 hours ago [-]
Good q. For most standalone vector search use cases, FAISS or a library like it is good.

However, FAISS is not a database. It can store metadata alongside vectors, but it doesn't have things you'd want in your app db like ACID compliance, non-vector indexing, and proper backup/recovery mechanisms. You're basically giving up all the DBMS capabilities.

For new RAG and search apps, many teams prefer just using a single app db with vector search capabilities included (Postgres, Mongo, MySQL etc) vs managing an app db and a separate vector db.

avthar 8 hours ago [-]
Hey HN! Post co-author here, excited to share our new open-source PostgreSQL tool that re-imagines vector embeddings as database indexes. It's not literally an index but it functions like one to update embeddings as source data gets added, deleted or changed.

Right now the system only supports OpenAI as an embedding provider, but we plan to extend with local and OSS model support soon.

Eager to hear your feedback and reactions. If you'd like to leave an issue or better yet a PR, you can do so here [1]

[1]: https://github.com/timescale/pgai

hhdhdbdb 2 hours ago [-]
Pretty smart. Why is the DB API the abstraction layer though? Why not two columns and a microservice? I assume you are making async calls to get the embeddings?

I say that because it seems unusual. An index suits sync work better. But async things like embeddings, geocoding an address, or checking whether an email is spam feel like app-level stuff.

cevian 49 minutes ago [-]
(post co-author here)

The DB is the right layer from an interface point of view -- because that's where the data properties should be defined. We also use the DB for bookkeeping of what needs to be done, because we can leverage transactions and triggers to make sure we never miss any data. From an implementation point of view, the actual embedding does happen outside the database, in a Python worker or cloud functions.

Merging the embeddings and the original data into a single view allows the full feature set of SQL rather than being constrained by a REST API.
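The bookkeeping idea can be modeled in a few lines of Python (a toy sketch: in pgai the enqueue happens via a trigger in the same transaction, and the worker runs outside the database; the function names here are illustrative, not the real API):

```python
source = {}      # id -> document text (the "source table")
embeddings = {}  # id -> embedding
queue = []       # pending ids awaiting (re)embedding

def upsert(doc_id, text):
    # Trigger-equivalent: every write also records what changed,
    # so embeddings can never silently drift out of sync.
    source[doc_id] = text
    queue.append(doc_id)

def fake_embed(text):
    # Stand-in for an external embedding API call.
    return [float(len(text))]

def run_worker():
    # The out-of-database worker drains the queue.
    while queue:
        doc_id = queue.pop(0)
        if doc_id in source:
            embeddings[doc_id] = fake_embed(source[doc_id])
        else:
            embeddings.pop(doc_id, None)  # source row was deleted

upsert(1, "hello")
run_worker()
print(embeddings[1])  # -> [5.0]
```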

dinobones 3 hours ago [-]
Wow, actually a good point I haven't seen anyone make.

Taking raw embeddings and then storing them into vector databases, would be like if you took raw n-grams of your text and put them into a database for search.

Storing documents makes much more sense.

choilive 2 hours ago [-]
Been using pgvector for a while, and to me it was kind of obvious that the source document and the embeddings are fundamentally linked, so we always stored them "together". Basically anyone doing embeddings at scale is doing something similar to what pgai Vectorizer is doing, and it's certainly a nice abstraction.
jdthedisciple 2 hours ago [-]
I used FAISS as it also allowed me to trivially store them together.

Idk how well it scales though, it's just doing its job at my hobby-project scale.

For my few hundred thousand embeddings, I must say the performance was satisfactory.

markusw 2 hours ago [-]
I’m using sqlite-vec along with FTS5 in (you guessed it) SQLite and it’s pretty cool. :)
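For the lexical half of that combo, FTS5 ships with most SQLite builds and works from Python's stdlib `sqlite3` module (sqlite-vec would add the vector side as a loadable extension; this sketch only shows the FTS5 part):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: full-text indexed column, no schema ceremony.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.execute("INSERT INTO docs(body) VALUES ('postgres vector search')")
conn.execute("INSERT INTO docs(body) VALUES ('sqlite full text search')")

# MATCH runs a full-text query; rank orders by BM25 relevance.
rows = conn.execute(
    "SELECT body FROM docs WHERE docs MATCH 'sqlite' ORDER BY rank"
).fetchall()
print(rows)  # -> [('sqlite full text search',)]
```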
ok123456 44 minutes ago [-]
Yes. Materialized Views are good.
unholyguy001 35 minutes ago [-]
That was just what I was thinking. This approach will have the same issues that materialized views have as well
cevian 35 minutes ago [-]
haha. We had a good internal debate as to whether this is more like indexes or more like materialized views. It's kind of a mixture of the two.
sgarland 33 minutes ago [-]
> the responsibility for generating and updating them as the underlying data changes can be handed over to the database management system

And now we shift ever more slightly back towards logic in the DB. I for one am thrilled; there’s no reason other than unfamiliarity to not let RDBMS perform functions it’s designed to do. As long as these offloads are documented in code, embrace not needing to handle it in your app.

mattxxx 2 hours ago [-]
This reads solely as a sales pitch, which quickly cuts to the "we're selling this product so you don't have to think about it."

...when you actually do want to think about it (in 2024).

Right now, we're collectively still figuring out:

  1. Best chunking strategies for documents
  2. Best ways to add context around chunks of documents
  3. How to mix and match similarity search with hybrid search
  4. Best way to version and update your embeddings
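Point 1 alone has a big design space; even the simplest strategy, fixed-size chunks with overlap, has knobs to tune (the sizes below are arbitrary, not recommendations):

```python
def chunk(text, size=100, overlap=20):
    # Slide a window of `size` characters, stepping by size - overlap,
    # so each chunk shares `overlap` characters with the previous one.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

pieces = chunk("a" * 250, size=100, overlap=20)
print(len(pieces))  # -> 3 (windows starting at 0, 80, 160)
```

Real systems usually chunk on token or sentence boundaries rather than raw characters, which is exactly the kind of strategy choice still being worked out.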
cevian 2 hours ago [-]
(post co-author here)

We agree a lot of stuff still needs to be figured out. Which is why we made vectorizer very configurable. You can configure chunking strategies, formatting (which is a way to add context back into chunks). You can mix semantic and lexical search on the results. That handles your 1,2,3. Versioning can mean a different version of the data (in which case the versioning info lives with the source data) OR a different embedding config, which we also support[1].

Admittedly, right now we have predefined chunking strategies. But we plan to add custom-code options very soon.

Our broader point is that the things you highlight above are the right things to worry about, not the data workflow ops and babysitting your lambda jobs. That's what we want to handle for you.

[1]: https://www.timescale.com/blog/which-rag-chunking-and-format...
