Launch HN: Sift Dev (YC W25) – AI-Powered Datadog Alternative
nextts 58 days ago [-]
Funny, I was thinking just this week that logging needs some magic.

Log diving takes a lot of time, especially during an outage/downtime/bug where the whole team might be watching a screen share of someone diving into logs.

At the same time, I am sceptical about "AI", especially if it is just an LLM stumbling around.

Understanding logs is probably the most brain intensive part of the job for me, more so than system design, project planning or coding.

This is because you need to know where the code is logging, imagine code paths in your head, and constantly wade through stuff that is a red herring or doesn't make sense.

I hope you can improve this space but it won't be easy!

Akula112233 58 days ago [-]
Very relatable experience with log diving, feels very much like a needle-in-haystack problem that gets so much harder when you're not the only one who contributed to the source of errors (often the case).

As for the skepticism with LLMs stumbling around raw logs: it's super deserved. Even the developers who wrote the program often refer to larger app context when debugging, so it's not as easy as throwing a bunch of logs into an LLM. Plus, context window limits & the relative lack of "understanding" with increasingly larger contexts is troublesome.

We found it helped a lot to profile application logs over time. Think aggregation, but for individual flows rather than similar logs. By grouping and ordering flows together, this brings the context of thousands of (repetitive) logs down to the core flows, making it much easier to spot when things are out of the ordinary.

There's still a lot to improve around false positives and variations in application flows.
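
To make that concrete, the grouping step looks roughly like this (a toy sketch, not our production pipeline; the traceId/ts/msg field names are just illustrative):

    // Toy sketch: group logs into per-request flows, order each flow by
    // time, and reduce it to a sequence of message templates. Rare
    // template sequences are the "out of the ordinary" candidates.
    interface LogLine { traceId: string; ts: number; msg: string }

    // Crude templating: collapse numbers/hex so repeated messages
    // reduce to a single pattern.
    const template = (msg: string): string =>
      msg.replace(/\b0x[0-9a-f]+\b|\b\d+\b/gi, "<*>");

    function flowSignatures(lines: LogLine[]): Map<string, string> {
      const flows = new Map<string, LogLine[]>();
      for (const line of lines) {
        const flow = flows.get(line.traceId);
        if (flow) flow.push(line); else flows.set(line.traceId, [line]);
      }
      const sigs = new Map<string, string>();
      for (const [id, flow] of flows) {
        flow.sort((a, b) => a.ts - b.ts);
        sigs.set(id, flow.map(l => template(l.msg)).join(" -> "));
      }
      return sigs; // count identical signatures to find the core flows
    }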

ohgr 58 days ago [-]
The best way to improve this is to just generate decent, useful, and actionable logs. Sifting through a trash heap is where the problem is. No magic will suddenly turn that trash into gold.

You have to do this at the inception of the software you're building rather than strap it on the donkey when something breaks (the usual way).

Akula112233 58 days ago [-]
Yep, but it's a trade-off people are sometimes unwilling to make. Too often I hear (and have seen via DD customers) horror stories about initiatives to fix observability getting squashed by teams in the rush to ship.

Moving fast has its downsides, and I can't say I blame people for deprioritizing good logging practices. But it does come back to bite...

Though as a caveat, you don't always have control over your logs -- especially with third party services, large but fragmented engineering organizations, etc. -- even with great internal practices, there's always something.

On another note, access to the codebase + live logs gives room to develop better auto-instrumentation tooling. Though perhaps Cursor could do a decent enough job at starting folks off.

ohgr 58 days ago [-]
[flagged]
bmurphy1976 58 days ago [-]
This is part of hardening a system for production. Making it easy to operate:

* Make sure the logs are actionable

* Make sure the logs are readable

* Make sure you are collecting operational metrics

* Make sure the metrics are useful

* Make sure you have error handling

* Make sure you have alerting

* Make sure you document how to support the application

* Make sure you have knobs and levers you can pull in an emergency to change the system's behavior or fix things

* Make sure you have vetted the system for security issues

etc.
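
To make the first two concrete, here's roughly what I mean by actionable (field names are illustrative, not a standard):

    // Illustrative: an actionable log line carries enough structure to
    // search on, correlate across services, and act on without reading code.
    const log = (level: string, event: string, fields: Record<string, unknown>) =>
      console.log(JSON.stringify({
        ts: new Date().toISOString(), level, event, ...fields,
      }));

    // Bad:    console.log("payment failed!!")
    // Better: a searchable event name, an ID to correlate on, and a hint
    // at what the operator can actually do about it.
    log("error", "payment.charge_failed", {
      orderId: "ord_123",   // hypothetical ID for illustration
      provider: "stripe",
      retryable: true,
      hint: "safe to re-run the charge worker",
    });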

cthuen 58 days ago [-]
Disclaimer: I'm a founder at Gravwell, a log analytics startup

I agree: even where applicable, LLMs are relegated to analyzing subselected data, so logs have to go somewhere else first. I think understanding logs is brain intensive because it can be a tricky problem. It gets easier with good tools, but often those tools are the kind you use to build something that solves the problem, rather than tools that solve it themselves (e.g. building a good query + automation). I think LLMs can get better at creating the queries, which would help a lot.

We started Gravwell to try to bring some magic. It's a schema-on-read time-series data lake that will eat text or binary, and it comes in SaaS or self-hosted (on-prem) flavors. We built our backend from scratch to offer maximum flexibility in query. The search syntax looks like a Linux command line, and kind of behaves like one too: chain modules together to extract, filter, aggregate, enrich, etc. An automation system is included. If you like Splunk, you should check us out.

There's a free community edition (personal or commercial use) for 2GB/day anon or 14GB/day w/ email. Tech docs are open at docs.gravwell.io.

evil-olive 58 days ago [-]
> SiftDev flags silent failures, such as two microservices updating the same record within 50ms

I don't understand, what about that is a "silent failure"?

in order for your product to even know about it, wouldn't I need to write a log message for every single record update?

and if my architecture allows two microservices to update the same row in the same database...maybe it happening within 50ms is expected?

that could be an inefficient architecture for sure, but I'm confused as to whether your product is also trying to give me recommendations about "here's an architectural inefficiency we found based on feeding your logs to an LLM"

> You can then directly ask your logs questions like, “What's causing errors in our checkout service?” or “Why did latency spike at 2 AM?” and immediately receive insightful, actionable answers that you’d otherwise manually be searching for.

the general question I have with any product that's marketing itself as being "AI-powered" - how do hallucinations get resolved?

I already have human coworkers who will investigate some error or alert or performance problem, and come to an incorrect conclusion about the cause.

when that happens I can walk through their thought process and analysis chain with them and identify the gap that led them to the incorrect conclusion. often this is a useful signal that our system documentation needs to be updated, or log messages need to be clarified, or a dashboard should include a different metric, etc etc.

if I ask your product "what caused such-and-such outage" and the answer that comes back is incorrect, how do I "teach" it the correct answer?

Akula112233 58 days ago [-]
> I don't understand, what about that is a "silent failure"?

Silent failures can be "allowed" behaviors in your applications that aren't actually labeled as errors but are still irregular. Think race conditions, deadlocks, silent timeouts, or even just mislabeled error logs.

> in order for your product to even know about it, wouldn't I need to write a log message for every single record update?

That's right, and this may not always be feasible (or necessary!), but if your application can be impacted by errors like these, it may be worth logging anyway.
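
To illustrate (a toy sketch, not our actual implementation), once updates are logged, the check itself is a simple time-window scan:

    // Toy sketch: flag two different services updating the same record
    // within a 50 ms window. Assumes each update logs { recordId,
    // service, ts } -- illustrative fields, not a real schema.
    interface Update { recordId: string; service: string; ts: number }

    function flagNearSimultaneous(updates: Update[], windowMs = 50): [Update, Update][] {
      const flagged: [Update, Update][] = [];
      const sorted = [...updates].sort((a, b) => a.ts - b.ts);
      for (let i = 0; i < sorted.length; i++) {
        // Only look ahead while we're still inside the time window.
        for (let j = i + 1; j < sorted.length && sorted[j].ts - sorted[i].ts <= windowMs; j++) {
          if (sorted[i].recordId === sorted[j].recordId &&
              sorted[i].service !== sorted[j].service) {
            flagged.push([sorted[i], sorted[j]]);
          }
        }
      }
      return flagged;
    }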

> the general question I have with any product that's marketing itself as being "AI-powered" - how do hallucinations get resolved?

> and if my architecture allows two microservices to update the same row in the same database...maybe it happening within 50ms is expected?

> if I ask your product "what caused such-and-such outage" and the answer that comes back is incorrect, how do I "teach" it the correct answer?

For these concerns, human-in-the-loop feedback is our preliminary approach! We have our own feedback system running internally to account for changes and false errors, but having explanations from human input (even as simple as "Not an error" or "Missed error" buttons) is very helpful.

> when that happens I can walk through their thought process and analysis chain with them and identify the gap that led them to the incorrect conclusion. often this is a useful signal that our system documentation needs to be updated, or log messages need to be clarified, or a dashboard should include a different metric, etc etc.

Got it, I imagine it'll be very helpful for us to display our chain of thought from our dashboards too. Great feedback, thank you!

evil-olive 58 days ago [-]
> Think race conditions, deadlocks, silent timeouts, or even just mislabeled error logs.

I agree that those are bad things.

but how does your product help me with them?

I have some code that has a deadlock. are you suggesting that I can find the deadlock by shipping my logs to a 3rd-party service that will feed them into an LLM?

999900000999 58 days ago [-]
Can it run completely on-prem?

In most of the industries I work in we would never just send you our logs.

What stops me from building my own logger that writes a record to a DB and later asks an LLM what it means?

Where is the pricing information?

Why do I need to log in to see your homepage? How would I pitch this to my boss if they can't read what it does?

Edit: https://runsift.com/pricing.html

I see the landing page. The pricing should be clear, though; “Contact Us” is scary.

Akula112233 58 days ago [-]
> Can it run completely on-prem?

Yep, we have an on-prem offering as well; we've gotten similar notes from folks before!

> What stops me from building my own logger that writes a record to a DB and later asks an LLM what it means?

Great question! The main limitation of brute force is the sheer volume of noise, which drowns out the relevant context. We tried this and realized it wasn't working. From a numbers perspective, at even just a 10s-of-GBs/day scale of data (not even close to enterprise scale), mainstream LLMs can't provide the context windows you need for more than a few minutes of operational data. And larger models suffer from other factors (like attention diffusion/dilution & drift).
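
Back-of-envelope: 10 GB/day is about 116 KB/s; at ~4 bytes per token, that's roughly 29K tokens per second, so even a 128K-token context window holds only ~4-5 seconds of raw logs. Aggregation isn't optional at that scale.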

> I see the landing page. The pricing should be clear, though; “Contact Us” is scary.

Noted!

999900000999 58 days ago [-]
Thanks!

I hope my tone wasn’t too brash.

If you can update the pricing, I might be able to pitch this to my org later this year. We'd definitely like an on-prem solution, though!

vardaro 58 days ago [-]
Neat idea. Why logs, and not metrics too? You can characterize an accurate "baseline" of system behavior through a combination of system-level and userspace metrics. That profile would offer more depth than what you'd otherwise piece together from userspace logs.
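
One generic way to characterize such a baseline (just a sketch, nothing vendor-specific) is a running mean and deviation threshold per metric:

    // Generic baseline sketch: maintain a running mean/stddev per metric
    // (Welford's algorithm) and flag samples more than k sigma out.
    class Baseline {
      private n = 0;
      private mean = 0;
      private m2 = 0;

      observe(x: number): void {
        this.n++;
        const delta = x - this.mean;
        this.mean += delta / this.n;
        this.m2 += delta * (x - this.mean);
      }

      isAnomalous(x: number, k = 3): boolean {
        if (this.n < 30) return false; // not enough history yet
        const std = Math.sqrt(this.m2 / (this.n - 1));
        return std > 0 && Math.abs(x - this.mean) > k * std;
      }
    }
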
Akula112233 58 days ago [-]
Agreed! Metrics are a high priority, especially since we're working to increase the available context around each anomaly we flag.

Logs were a natural starting point because that’s where developers often spend a significant amount of time stuck reading & searching for the right information, manually tracking down issues + jumping between logs across services. In a way, just finding & summarizing relevant logs for the user gave people an easier time debugging.

But metrics will introduce more dimensions to establish baseline behavior, so we're pretty excited about it too.

vardaro 58 days ago [-]
I tend to use logs the least when debugging production issues. I realize that's a personal anecdote, so I see your point.
csomar 58 days ago [-]
Can you explain what goes through an LLM and what does not? You offer 100K logs per day for free, but if all of these go through an LLM, that will burn thousands(?) of dollars every month for a free customer who is milking the machine.
Akula112233 57 days ago [-]
Our free tier doesn't include the anomaly/error detection (noted on the site, we can make it more clear though). And your numbers do add up! That's why you can't just run all your logs through an LLM.

Aggregated (+ simplified) versions of your logs + flagged anomalies are what get passed through our LLMs.

mdaniel 58 days ago [-]
Your Python SDK's <https://pypi.org/project/sift-dev-logger> GitHub link is a 404: <https://github.com/sift-dev/python-sdk>. Navigating upward shows the fork of SigNoz, which I think is funny.

There was no GitHub link for your npm dep, so maybe they're both private. Although npmjs shows your npm one as ISC-licensed, likely because of the default in package.json.

Akula112233 58 days ago [-]
Ah, any particular reason to want these SDKs public? Happy to make them public, especially since you can see the source on install anyway. Just curious!

And kudos to SigNoz as well - have to check out other folks in the space :)

mdaniel 58 days ago [-]
My initial concern was what transitive deps it was pulling in, but the other answer to your question is the thing that most GH repos are good for: submitting bugs and submitting fixes

It is also good for finding out what the buffering story is, because I would want to know whether I'm dragging an unbounded queue into my app (putting memory pressure on me), or whether your service returning 503s is going to eat logs. That's the kind of thing only looking at the source would say for sure, because the docs don't even hint at such operational concerns.
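
Concretely, the shape of thing I'd hope to find in the source (a generic sketch, not Sift's SDK):

    // Generic sketch of the buffering trade-off: a bounded queue that
    // sheds load instead of growing without limit, and a flush that
    // requeues on 503 rather than silently eating logs.
    class BoundedLogBuffer {
      private queue: string[] = [];
      dropped = 0;

      constructor(private readonly maxSize = 10_000) {}

      push(line: string): void {
        if (this.queue.length >= this.maxSize) { this.dropped++; return; }
        this.queue.push(line);
      }

      async flush(send: (batch: string[]) => Promise<{ status: number }>): Promise<void> {
        const batch = this.queue.splice(0);
        if (batch.length === 0) return;
        const res = await send(batch);
        if (res.status === 503) {
          // Requeue what fits; anything beyond the bound is dropped.
          const room = this.maxSize - this.queue.length;
          if (room > 0) this.queue.unshift(...batch.slice(-room));
          this.dropped += Math.max(0, batch.length - room);
        }
      }
    }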

Anyway, the only reason I mentioned the dead link is because your PyPI page linked to GH in the first place. So if you don't intend people to think there's supposed to be a repo, then I'd suggest removing the repo link

Akula112233 58 days ago [-]
Noted, thank you! Will make some changes accordingly.
theogravity 58 days ago [-]
Hi, I'm the author of LogLayer (https://loglayer.dev) for TypeScript, which has integrations with Datadog and its competitors. Sift looks easy to integrate with since you have a TS library and the API is straightforward.

Would you like me to create a transport for it (I'm not implying I'd be charging to do this; it'd be free)?

The benefit of LogLayer is that they'd just use the loglayer library to make their log calls, and it ships them to whatever transports they have defined. Better than having them manage two separate loggers (e.g. Sift and Pino) or write their own wrapper.
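
For a sense of scale, a custom transport is a small amount of code; something like this sketch (the Sift endpoint and payload here are hypothetical placeholders, not Sift's real ingest API):

    import { LoggerlessTransport, LogLayerTransportParams } from "@loglayer/transport";

    // Hypothetical sketch of a Sift transport; endpoint and payload
    // shape are placeholders.
    class SiftTransport extends LoggerlessTransport {
      shipToLogger({ logLevel, messages, data, hasData }: LogLayerTransportParams) {
        fetch("https://example.invalid/ingest", { // placeholder endpoint
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ level: logLevel, messages, ...(hasData ? { data } : {}) }),
        }).catch(() => { /* never let log shipping take down the app */ });
        return messages; // hand messages back to the LogLayer pipeline
      }
    }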

Ishirv 58 days ago [-]
Hey, loglayer looks super cool! Would love to chat and set something up, send us an email at founders@runsift.com
theogravity 58 days ago [-]
Sent an e-mail!
kbouck 58 days ago [-]
From the docs it looks like you ingest directly from apps instrumented with your libraries. Do you also plan to ingest OpenTelemetry events, such as those exported from an OpenTelemetry agent or collector?
Akula112233 57 days ago [-]
Yes, we support OpenTelemetry ingestion! Also Datadog, Splunk, and various other vendors’ agents/forwarders - even custom HTTP Daemons.

If you've already set up logging, there's a good chance you can just point your existing instrumentation at us and we'll know how to ingest and handle it.
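
For example, with the OpenTelemetry JS SDK it's usually just repointing the OTLP exporter; the endpoint below is a placeholder, and the processor wiring varies a bit by SDK version:

    import { LoggerProvider, BatchLogRecordProcessor } from "@opentelemetry/sdk-logs";
    import { OTLPLogExporter } from "@opentelemetry/exporter-logs-otlp-http";

    // Placeholder endpoint: point the existing OTLP pipeline at a new backend.
    const exporter = new OTLPLogExporter({ url: "https://example.invalid/v1/logs" });
    const provider = new LoggerProvider();
    provider.addLogRecordProcessor(new BatchLogRecordProcessor(exporter));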

TZubiri 58 days ago [-]
Consider not marketing yourself as an X alternative when launching? That might fly in slide decks and investor meetings, but I don't know what Datadog is, and I certainly don't care; I won't look into what Datadog is just so I can be qualified to learn about your product.

I guess it may be the case that you really know who your target is, but why miss the majority of the market and position yourself as Pepsi in the same stroke?

Sytten 58 days ago [-]
Datadog is the industry standard at this point. If you don't know what Splunk or Datadog is, you are likely not their ICP, and their marketing is not targeting you.
theogravity 58 days ago [-]
Agreed, if you don't know what Datadog is then you're probably not the target audience for this product.
TZubiri 58 days ago [-]
Do you think if I don't know what datadog is, I am not the target audience for datadog?
csomar 58 days ago [-]
Kinda? There aren't that many players in this niche, and Datadog is the "dog".
chzblck 58 days ago [-]
probably
Ishirv 58 days ago [-]
Hey - thanks for the feedback. We were trying to give people a good idea of where we fit in quickly, but I can see where you're coming from!
n2d4 58 days ago [-]
Even if it won't work for everyone — some people (including me) are looking for Datadog alternatives, so this is the easiest way for them to speak to their ICP.
reconnecting 58 days ago [-]
TINLA, but perhaps you should ensure your product doesn't run into potential trademark issues related to sift[.]com.
super_ar 58 days ago [-]
It looks awesome! What are you guys using under the hood? I've lately seen a lot of companies building on top of ClickHouse.
r_singh 58 days ago [-]
How does this compare with Axiom? I'm looking to shift out of Datadog ASAP, and Axiom was the choice. Would consider Sift.
Akula112233 57 days ago [-]
We offer competing core functionality in terms of storage and search, but we’re also focused on intelligence: real-time anomaly detection, semantic log analysis, and natural language search.

Would recommend the demo video and playground environment we linked above! Feel free to reach out at founders@runsift.com if you’d like to learn more

waffletower 59 days ago [-]
Java bindings would be welcomed by many.
Akula112233 58 days ago [-]
Absolutely! Java bindings are on our radar. Any specific use cases/implementations you'd like to see? In the meantime, we also support a couple of off-the-shelf collectors that should already work with Java applications!
ritvikpandey21 58 days ago [-]
curious how LLM hallucinations will work on logging info - gonna be a hard problem to solve
Akula112233 57 days ago [-]
I assume this is mostly in regard to our anomaly/error detection! Deterministic rules for flagging anomalies + human feedback help us adapt our flagging system accordingly, so hallucinations won't directly impact flagged anomalies. The rules (patterns) we generate are on the stricter end, so they err on the side of flagging more.

However, the rules themselves aren't deterministically generated (and are therefore prone to LLM hallucinations). To address this, we currently have a simpler system that lets you mark incorrectly flagged anomalies so they can be incorporated into our generated rules. There's room to improve here that we're actively working on: exposing our generated patterns in a human-digestible manner (so they can be corrected), introducing metrics and more data sources for context, and connecting with a codebase.
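
As a rough illustration of that split (not our actual internals): the LLM only proposes patterns, flagging stays deterministic, and feedback prunes bad rules:

    // Illustrative: flagging is deterministic rule evaluation, so a
    // hallucination can only produce a bad *rule*, which feedback removes.
    interface Rule { id: string; pattern: RegExp; suppressed: boolean }

    const rules: Rule[] = [
      // An LLM-proposed pattern (hypothetical example)
      { id: "r1", pattern: /timeout after \d+ ?ms/i, suppressed: false },
    ];

    const flag = (line: string): Rule[] =>
      rules.filter(r => !r.suppressed && r.pattern.test(line));

    // The "Not an error" button: deterministically suppress the rule
    // that produced the false positive.
    function markFalsePositive(ruleId: string): void {
      const rule = rules.find(r => r.id === ruleId);
      if (rule) rule.suppressed = true;
    }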

mattfrommars 58 days ago [-]
What is your background for building an 'AI-powered Datadog' alternative? Datadog is a massive company... how much experience do you guys have to build a product that competes with them?
adelowo 58 days ago [-]
OP literally said they worked at Datadog and Splunk. That's enough tbh, as those are the leaders in this space.
dang 59 days ago [-]
[stub for offtopicness]
graphman 59 days ago [-]
Is it common practice to display fake realtime numbers on the homepage?

    let storedNumber = getCookie("countingNumber");
    let startNumber = storedNumber !== null
      ? storedNumber
      : Math.floor(Math.random() * (10300000 - 10000000 + 1)) + 10000000;
    let currentNumber = startNumber;

    function updateNumber() {
      let randomIncrement = Math.floor(Math.random() * (275 - 101 + 1)) + 101;
      currentNumber += randomIncrement;
      element.textContent = formatNumber(currentNumber);
      setCookie("countingNumber", currentNumber, 7); // Save number in cookie for 7 days
    }

    element.textContent = formatNumber(currentNumber);
    setInterval(updateNumber, 1000);

Akula112233 59 days ago [-]
Ah! That was a leftover from the initial dev version of our website. I've taken it out now. Thank you!
kadomony 59 days ago [-]
The marketing design approach feels very off to me. You barrage me with an annoying scrolling marquee showing me the most abstract, unrecognizable logos telling me I should trust you because they do. 10+ companies on board feels rather small.

You mention AI-driven analysis to identify logs, but I'm already skeptical of AI doing tasks like this, and you obfuscate it further by not actually showing me how it works - just another generic, abstract marketing graphic.

I dunno. It just seems like vaporware-as-a-service from the design vibes.

dang 59 days ago [-]
Early-stage startups often have websites that are little more than landing pages. That's because a full commercial website isn't in their critical path yet—first they need to build their product and attract early users, who don't typically come in through general web traffic.

That's one reason why Launch HNs usually include a demo video. That's the link you should be clicking on if you want to see these guys' product. If you do that, you'll see that it isn't vaporware.

We also advise startups doing Launch HNs to provide a link for users to try the product (preferably without a signup gate, but that's not always doable). There's such a link in the text above as well.

I suppose one way to avoid complaints about stub websites would be not to link to them at all—but then other comments would say "why would I trust you, you don't even have a website"!

Edit: I've replaced https://runsift.com/ with https://app.trysift.dev/docs in the text above. Perhaps that will help.

Velorivox 59 days ago [-]
I want to jump in here and post this Launch HN form [0]. Obviously do not submit it if you are not a YC startup, but the questions on there are very helpful in terms of thinking about how to post about your startup on HN and elsewhere.

[0] https://docs.google.com/forms/d/1pRMkNiD-FKjYL-La5JWMwwrcWsp...

dang 58 days ago [-]
There's also https://news.ycombinator.com/yli.html, which is the guide for YC startups who want to launch on Hacker News. The formal mechanism is YC-only but the principles apply more broadly.
jascination 58 days ago [-]
I'm not in YC, but I want to launch my startup here as it's relevant to the audience. Can I go through a process like this to coordinate with you for a launch, or should we just follow the guidelines, make a submission and hope for the best?
Velorivox 57 days ago [-]
You would have to do a “Show HN”; the YC launch (a post to the front page) is only for YC startups. You can certainly try to go through the process to do a “Launch HN” - but it would start with applying to YC.

Apart from show vs launch I think following the guidelines and hoping for the best is the norm. Launch HN is nice to get a one-time boost but it doesn’t confer any long-term special treatment on your post afaict.

paularmstrong 59 days ago [-]
What's not recognizable about Duck, Square, Triangle, Asterisk, C, two different cubes, and the letter 'n'?

These, coupled with the random number generator used to claim how many logs they're processing, make me wonder if the entire product is just AI-generated slop.

Jeslijar 58 days ago [-]
Hey, maybe you can have better hiring practices than Datadog, with its five-question test where getting a single answer wrong in even the smallest of ways disqualifies you from getting a job with them for six months.

I'm guessing they lost a wealth of great talent due to this test on how to support their platform, which they give to fresh-off-the-street applicants rather than providing even a modicum of training about their product. They want you to study it for free, probably as a marketing tactic - but also so they don't have to pay to train employees. It's great, like cancer.

Disclaimer: I have never applied to a role with Datadog, nor interviewed with them. I've just had multiple friends complete the process, with mixed results. It seems like you need to put in ~two full weeks of self-directed study to pass their on-site interview 'exam', where they don't tell you the exam is 100%-or-fail (but it is!).
