Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals (blog.skyvern.com)
happyopossum 3 hours ago [-]
Many of the examples given for agents such as this are things I just flat wouldn’t trust an LLM to do - buying something on Amazon for example: Will it pick new or ‘renewed’? Will it select an item that is from a janky looking vendor and may be counterfeit? Will it pick the cheapest option for me? What if multiple colors are offered?

This one example alone has so many branches that would require knowing what’s in my head.

On the flip side, it’s a ridiculously simple task for a human to do for themselves, so what am I truly saving?

Call me when I can ask it to check the professional reviews of X category on N websites (plus YouTube), summarize them for me, and find the cheapest source for the top 2 options in the category that will arrive in Y days or sooner.

That would be useful.

suchintan 3 hours ago [-]
This is a great point -- the example we chose was meant to be a consumer example that we could all relate to. However, a similar example exists for the enterprise, which may be more interesting.

Let's say that you are a parts procurement shop and want to order 10,000 of SKU1 and 20,000 of SKU2. If you go on parts websites like finditparts.com, you'll see that there is little ambiguity when it comes to ordering specific SKUs.

We've seen cases of companies that want to automate item ordering like this on tens of different websites, and have people (usually the CEO) spending a few hours a week doing it manually.

Writing a script can take ~10-20 hours (if you know how to code), but we can help you automate it in <30 minutes with Skyvern, even if you don't know how to code!

Fnoord 3 hours ago [-]
I have Amazon Prime. If an item has Prime, it is a no-brainer. Free returns for 30 days. No S&H costs. The only cost is my time.
CryptoBanker 37 minutes ago [-]
If it fails enough times and you have to return enough items…well, Amazon has been known to ban people for that.

If you have an AWS account created before 2017, an Amazon ban means an AWS ban.

drdaeman 2 hours ago [-]
Yea, but LLMs cannot reason - we've all seen them blurt out complete non sequiturs, or end up in death loops of pseudo-reasoning (e.g. https://news.ycombinator.com/item?id=42734681 has a few examples). I don't think one should trust an LLM to pick Prime products all the time even if that's very explicitly requested - I'm sure it's possible to minimize errors so it'll do the right thing most of the time, but having a guarantee that it won't pick a non-Prime item sounds impossible. Same for any other task - if there is a way to make a mistake, a mistake will eventually be made.

(Idk if we can trust a human either - brain farts are a thing after all, but at least humans are accountable. Machines are not - at least not at the moment.)

lyime 2 hours ago [-]
To your last point -- humans make mistakes too. I asked my EA to order a few things for our office a few days ago, and she ended up ordering things that I did not want. In this case I could have written a better prompt. Even with a better prompt she could have ordered the unwanted item. This is a reversible decision.

So my point is that while you might get some false positives, it's worth automating as long as many of the decisions are reversible or correctable.

You might not want to use this in all cases, but it's still worthwhile for many, many cases. Whether a given use case is worth automating depends on the error rate you can accept for it.
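
The trade-off above can be put in rough numbers. This is a minimal sketch, assuming errors are reversible and cost a fixed amount of time to correct; the function names and the inputs (minutes saved per run, error rate, minutes to fix a bad run) are hypothetical and not from any tool discussed in this thread.

```python
def expected_net_minutes(minutes_saved: float, error_rate: float,
                         minutes_to_fix: float) -> float:
    """Expected minutes saved per automated run, after paying for fixes.

    Assumes every error is reversible and costs `minutes_to_fix` to correct.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return minutes_saved - error_rate * minutes_to_fix


def break_even_error_rate(minutes_saved: float, minutes_to_fix: float) -> float:
    """Error rate at which automation stops paying off, capped at 1.0."""
    if minutes_to_fix <= 0:
        raise ValueError("minutes_to_fix must be positive")
    return min(1.0, minutes_saved / minutes_to_fix)
```

For instance, if a run saves 10 minutes, fails 5% of the time, and a failure takes 60 minutes to undo, the expected saving is still 7 minutes per run. An irreversible error has no finite cost to fix, so it falls outside this simple model.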

skull8888888 34 minutes ago [-]
Isn't Browser Use SOTA on WebVoyager? At this point WebVoyager seems to be outdated; there's def a need for a new, harder benchmark.
lyime 2 hours ago [-]
This is an impressive tool. I especially like the observability around the workflow and the steps it takes to achieve the outcome. We are potentially interested in exploring this if we can get the cost down at scale.
suchintan 2 hours ago [-]
I'd love to chat to see how we can help! Here's my email: suchintan@skyvern.com

We're working on 2 major improvements that will get cost down at scale:

1. We're building a code generation layer under the hood that will start to memorize actions Skyvern has taken on a website, so repeated runs will be nearly free.

2. We're exploring some graph re-ranking techniques to eliminate useless elements from the HTML DOM when analyzing the page. For example, if you're looking at the product page and want to add a product to cart, the likelihood you'll need to interact with the Reviews page will be 0. No need to send that context along to the LLM.
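
The two ideas above could be sketched roughly as follows. This is an illustration only, not Skyvern's implementation: `ActionCache`, `Element`, and `prune_elements` are hypothetical names, the `planner` callback stands in for an LLM call, and plain keyword overlap is swapped in for the graph re-ranking mentioned in the comment.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Element:
    element_id: str
    tag: str
    text: str


def _tokens(text: str) -> set:
    """Lowercased alphanumeric words in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def prune_elements(elements: List[Element], goal: str, keep: int = 2) -> List[Element]:
    """Keep the `keep` elements whose text overlaps most with the goal.

    Elements with no overlap are dropped, so they never reach the LLM.
    Ties keep their original page order.
    """
    goal_tokens = _tokens(goal)
    scored = [(len(_tokens(e.text) & goal_tokens), i, e)
              for i, e in enumerate(elements)]
    relevant = sorted((s for s in scored if s[0] > 0),
                      key=lambda s: (-s[0], s[1]))
    return [e for _, _, e in relevant[:keep]]


class ActionCache:
    """Remember the planned actions per (site, goal) so repeats skip the LLM."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], List[str]] = {}
        self.planner_calls = 0

    def get_actions(self, site: str, goal: str,
                    planner: Callable[[str, str], List[str]]) -> List[str]:
        key = (site, goal)
        if key not in self._store:
            self.planner_calls += 1  # only a cache miss pays for planning
            self._store[key] = planner(site, goal)
        return self._store[key]
```

With the goal "add product to cart", a "Read all reviews" link shares no words with the goal and is dropped, while "Add to cart" ranks first. A real system would also need to invalidate cached actions when the site's layout changes, which this sketch does not attempt.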

dataviz1000 8 minutes ago [-]
> We're exploring some graph re-ranking techniques to eliminate useless elements from the HTML DOM when analyzing the page.

Computer vision is useful and very quick; however, in my experience, parsing the stacking context is much more useful. The problem is creating a stacking context when a news site embeds a YouTube or Bluesky post. It requires injecting a script into each embed using Playwright. (Not mine, but prior art [0].)

In my free time, I've been quietly solving a problem I encountered while creating browser agents, one that didn't have a solution 2 years ago. Most webpages are several independent global execution contexts, and I'm developing a coherent way to get them all to speak with each other. [1]

[0] https://github.com/andreadev-it/stacking-contexts-inspector

[1] https://news.ycombinator.com/item?id=42576240

govindsb 3 hours ago [-]
congrats Suchintan! huge achievement!