WebSockets cost us $1M on our AWS bill (recall.ai)
gwbas1c 8 minutes ago [-]
Classic story of a startup taking a "good enough" shortcut and then coming back later to optimize.

---

I have a similar story: Where I work, we had a cluster of VMs that were always high CPU and a bit of a problem. We had a lot of fire drills where we'd have to bump up the size of the cluster, abort in-progress operations, or some combination of both.

Because this cluster of VMs was doing batch processing that the founder believed should be CPU intense, everyone just assumed that increasing load came with increasing customer size; and that this was just an annoyance that we could get to after we made one more feature.

But at one point the bean counters pointed out that we spent disproportionately more on cloud than a normal business did. After one round of combining different VM clusters (that really didn't need to be separate servers), I decided that I could take some time to hook this very CPU-intense cluster up to a profiler.

I thought I was going to be in for a 1-2 week project and would follow a few worms. Instead, the CPU load was because we were constantly loading an entire table, that we never deleted from, into the application's process. The table had transient data that should only last a few hours at most.

I quickly deleted almost a decade's worth of obsolete data from the table. After about 15 minutes, CPU usage for this cluster dropped to almost nothing. The next day we made the VM cluster a fraction of its size, and in the next release, we got rid of the cluster and merged the functionality into another cluster.

I also made a pull request that introduced a simple filter to the query to only load 3 days of data; and then introduced a background operation to clean out the table periodically.
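
A minimal sketch of that kind of fix, assuming a SQL table with a creation timestamp. The table name `transient_items`, the column names, and the use of SQLite are all hypothetical stand-ins, not details from the comment:

```python
import sqlite3
import time

RETENTION_SECONDS = 3 * 24 * 60 * 60  # the 3-day window mentioned in the comment


def make_db():
    """Create an in-memory table standing in for the transient-data table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transient_items (id INTEGER PRIMARY KEY, payload TEXT, created_at REAL)"
    )
    # An index on the timestamp keeps both the filtered read and the purge cheap.
    conn.execute("CREATE INDEX idx_created_at ON transient_items (created_at)")
    return conn


def load_recent(conn, now=None):
    """Load only rows inside the retention window instead of the whole table."""
    now = time.time() if now is None else now
    cutoff = now - RETENTION_SECONDS
    return conn.execute(
        "SELECT id, payload FROM transient_items WHERE created_at >= ? ORDER BY id",
        (cutoff,),
    ).fetchall()


def purge_old(conn, now=None):
    """Periodic cleanup: delete rows older than the window and return how many were removed."""
    now = time.time() if now is None else now
    cutoff = now - RETENTION_SECONDS
    cur = conn.execute("DELETE FROM transient_items WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

The filter bounds how much each load pulls into the process, and the purge, run from whatever scheduler is already available, keeps the table itself from growing without limit.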

wiml 1 minutes ago [-]
> One complicating factor here is that raw video is surprisingly high bandwidth.

It's weird to be living in a world where this is a surprise but here we are.

Nice write-up though. WebSockets has a number of nonsensical design decisions, but I wouldn't have expected that this is the one that would be chewing up all your CPU.

trollied 17 minutes ago [-]
>In a typical TCP/IP network connected via ethernet, the standard MTU (Maximum Transmission Unit) is 1500 bytes, resulting in a TCP MSS (Maximum Segment Size) of 1448 bytes. This is much smaller than our 3MB+ raw video frames.

> Even the theoretical maximum size of a TCP/IP packet, 64k, is much smaller than the data we need to send, so there's no way for us to use TCP/IP without suffering from fragmentation.

Just highlights that they do not have enough technical knowledge in house. Should spend the $1m/year saving on hiring some good devs.
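
For reference, the quoted figures can be reproduced with a little arithmetic. This sketch assumes IPv4 with the TCP timestamp option enabled (which is what yields 1448 rather than 1460) and a 1080p frame at 1.5 bytes per pixel. Note that splitting a byte stream into MSS-sized segments is ordinary TCP segmentation, which is a different mechanism from IP fragmentation:

```python
import math

IPV4_HEADER = 20           # bytes, header without options
TCP_HEADER = 20            # bytes, header without options
TCP_TIMESTAMP_OPTION = 12  # 10 bytes of option plus 2 bytes of padding


def tcp_mss(mtu, timestamps=True):
    """Payload bytes that fit in one TCP segment for a given MTU."""
    overhead = IPV4_HEADER + TCP_HEADER
    if timestamps:
        overhead += TCP_TIMESTAMP_OPTION
    return mtu - overhead


def segments_needed(payload_bytes, mss):
    """Number of TCP segments required to carry a payload of the given size."""
    return math.ceil(payload_bytes / mss)


frame_bytes = 1920 * 1080 * 3 // 2  # one raw 1080p frame at 1.5 bytes per pixel
print(tcp_mss(1500))                                 # MSS on standard Ethernet
print(segments_needed(frame_bytes, tcp_mss(1500)))   # segments per frame
```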

turtlebits 2 hours ago [-]
Is this really an AWS issue? Sounds like you were just burning CPU cycles, which is not AWS related. WebSockets makes it sound like it was a data transfer or API gateway cost.
brazzy 1 hours ago [-]
Neither the title nor the article are painting it as an AWS issue, but as a websocket issue, because the protocol implicitly requires all transferred data to be copied multiple times.
VWWHFSfQ 1 hours ago [-]
> Is this really an AWS issue?

I doubt they would have even noticed this outrageous cost if they were running on bare-metal Xeons or Ryzen colo'd servers. You can rent real 44-core Xeon servers for like, $250/month.

So yes, it's an AWS issue.

JackSlateur 1 hours ago [-]

> You can rent real 44-core Xeon servers for like, $250/month.

Where, for instance?
Faaak 1 hours ago [-]
Hetzner for example. An EPYC 48c (96t) goes for 230 euros
dilyevsky 1 hours ago [-]
Hetzner's network is a complete dog. They also sell you machines that should long since have been EOL'ed. No serious business should be using them.
dijit 55 minutes ago [-]
What cpu do you think your workload is using on AWS?

GCP exposes their cpu models, and they have some Haswell and Broadwell lithographies in service.

That's a 10+ year old part, for those paying attention.

tsimionescu 7 minutes ago [-]
I think they meant that Hetzner is offering specific machines they know to be faulty and should have EOLd to customers, not that they use deprecated CPUs.
dijit 3 minutes ago [-]
That's scary if true, any sources? My google-fu is failing me. :/
dilyevsky 34 minutes ago [-]
Most of GCP and some AWS instances will migrate to another node when it’s faulty. Also disk is virtual. None of this applies to baremetal hetzner
dijit 33 minutes ago [-]
Why is that relevant to what I said?
dilyevsky 12 minutes ago [-]
Only relevant if you care about reliability
dijit 8 minutes ago [-]
AWS was working “fine” for about 10 years without live migration, and I’ve had several individual machines running without a reboot or outage for quite literally half a decade. Enough to hit bugs like this: https://support.hpe.com/hpesc/public/docDisplay?docId=a00092...

Anyway, depending on individual nodes to always be up for reliability is incredibly foolhardy. Things can happen, cloud isn't magic, I’ve had instances become unrecoverable. Though it is rare.

So, I still don’t understand the point, that was not exactly relevant to what I said.

blibble 39 minutes ago [-]
I just cat'ed /proc/cpuinfo on my Hetzner and AWS machines

AWS: E5-2680 v4 (2016)

Hetzner: Ryzen 5 (2019)

speedgoose 41 minutes ago [-]
I know serious businesses using Hetzner for their critical workloads. I wouldn’t unless money is tight, but it is possible. I use them for my non critical stuff, it costs so much less.
VWWHFSfQ 1 hours ago [-]
There are many colos that offer dedicated server rental/hosting. You can just google for colos in the region you're looking for. I found this one

https://www.colocrossing.com/server/dedicated-servers/

petcat 1 hours ago [-]
I don't know anything about Colo Crossing (are they a reseller?) but I would bet their $60 per month 4-core Intel Xeons would outperform a $1,000 per month "compute optimized" EC2 server.
phonon 16 minutes ago [-]
For $1000 per month you can get a c8g.12xlarge (assuming you use some kind of savings plan).[0] That's 48 cores, 96 GB of RAM and 22.5+ Gbps networking. Of course you still need to pay for storage, egress etc., but you seem to be exaggerating a bit....

[0]https://instances.vantage.sh/aws/ec2/c8g.12xlarge?region=us-...

fragmede 59 minutes ago [-]
What benchmark would you like to use?
petcat 51 minutes ago [-]
This blog is about doing video processing on the CPU, so something akin to that.
marcopolo 50 minutes ago [-]
Masking in the WebSocket protocol is kind of a funny and sad fix to the problem of intermediaries trying to be smart and helpful, but failing miserably.

The linked section of the RFC is worth the read: https://www.rfc-editor.org/rfc/rfc6455#section-10.3
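
For anyone who hasn't looked at it, masking as defined in RFC 6455 section 5.3 is a byte-wise XOR with a 4-byte key that the client picks per frame. A minimal sketch (the function name is mine, not from any library):

```python
def mask_payload(payload: bytes, key: bytes) -> bytes:
    """Apply RFC 6455 masking: XOR byte i of the payload with key[i % 4].

    The operation is its own inverse, so the same function also unmasks.
    """
    if len(key) != 4:
        raise ValueError("masking key must be exactly 4 bytes")
    return bytes(b ^ key[i % 4] for i, b in enumerate(payload))


# Masked "Hello" example from RFC 6455 section 5.7.
key = bytes([0x37, 0xFA, 0x21, 0x3D])
masked = mask_payload(b"Hello", key)
print(masked.hex())
print(mask_payload(masked, key))
```

Because client-to-server frames must be masked, every payload byte gets touched once by the sender and once more by the receiver, on top of whatever copies the transport already makes.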

handfuloflight 2 hours ago [-]
Love the transparency here. Would also love if the same transparency was applied to pricing for their core product. Doesn't appear anywhere on the site.
DrammBA 16 minutes ago [-]
I use that as a litmus test when deciding whether to use a service: if I can't find a prominently linked pricing page on the homepage, I'm out.
lawrenceduk 2 hours ago [-]
It’s ok, it’s now a million dollars/year cheaper when your renewal comes up!

Jokes aside though, some good performance sleuthing there.

cperciva 13 minutes ago [-]
> We use atomic operations to update the pointers in a thread-safe manner

Are you sure about that? Atomics are not locks, and not all systems have strong memory ordering.

pier25 24 minutes ago [-]
Why were they using websockets to send video in the first place?

Was it because they didn't want to use some multicast video server?

cogman10 1 hours ago [-]
This is such a weird way to do things.

Here they have a nicely compressed stream of video data, so they take that stream and... decode it. But they aren't processing the decoded data at the source of the decode, so instead they forward that decoded data, uncompressed(!!), to a different location for processing. Surprisingly, they find out that moving uncompressed video data from one location to another is expensive. So, they compress it later (Don't worry, using a GPU!)

At so many levels this is just WTF. Why not forward the compressed video stream? Why not decompress it where you are processing it instead of in the browser? Why are you writing it without any attempt at compression? Even if you want lossless compression there are well-known and fast algorithms like FFV1 for that purpose.

Just weird.

isoprophlex 1 hours ago [-]
Article title should have been "our weird design cost us $1M".

As it turns out, doing something in Rust does not absolve you of the obligation to actually think about what you are doing.

dylan604 38 minutes ago [-]
TFA opening graph "But it turns out that if you IPC 1TB of video per second on AWS it can result in enormous bills when done inefficiently. "
tbarbugli 36 minutes ago [-]
Possibly because they capture the video from xvfb or similar (they run a headless browser to capture the video) so at that point the decoding already happened (webrtc?)
rozap 1 hours ago [-]
Really strange. I wonder why they omitted this. Usually you'd leave it compressed until the last possible moment.
dylan604 38 minutes ago [-]
> Usually you'd leave it compressed until the last possible moment.

Context matters? As someone working in production/post, we want to keep it uncompressed until the last possible moment. At least as far as no more compression than how it was acquired.

DrammBA 8 minutes ago [-]
> Context matters?

It does, but you just removed all context from their comment and introduced a completely different context (video production/post) for seemingly no reason.

Going back to the original context, which is grabbing a compressed video stream from a headless browser, the correct approach to handle that compressed stream is to leave it compressed until the last possible moment.

cosmotic 2 hours ago [-]
Why decode to then turn around and re-encode?
pavlov 2 hours ago [-]
Reading their product page, it seems like Recall captures meetings on whatever platform their customers are using: Zoom, Teams, Google Meet, etc.

Since they don't have API access to all these platforms, the best they can do to capture the A/V streams is simply to join the meeting in a headless browser on a server, then capture the browser's output and re-encode it.

MrBuddyCasino 1 hours ago [-]
They're already hacking Chromium. If the compressed video data is unavailable in JS, they could change that instead.
pavlov 16 minutes ago [-]
If you want to support every meeting platform, you can’t really make any assumptions about the data format.

To my knowledge, Zoom’s web client uses a custom codec delivered inside a WASM blob. How would you capture that video data to forward it to your recording system? How do you decode it later?

Even if the incoming streams are in a standard format, compositing the meeting as a post-processing operation from raw recorded tracks isn’t simple. Video call participants have gaps and network issues and layer changes; you can’t assume much of anything about the samples as you would with typical video files. (Coincidentally this is exactly what I’m working on right now at my job.)

moogly 1 hours ago [-]
They did what every other startup does: put the PoC in production.
ketzo 2 hours ago [-]
I had the same question, but I imagine that the "media pipeline" box with a line that goes directly from "compositor" to "encoder" is probably hiding quite a lot of complexity

Recall's offering allows you to get "audio, video, transcripts, and metadata" from video calls -- again, total conjecture, but I imagine they do need to decode into raw format in order to split out all these end-products (and then re-encode for a video recording specifically.)

Szpadel 2 hours ago [-]
my guess is either that the video they get uses some proprietary encoding format (js might do some magic on the feed) or that it's a latency-optimized stream that consumes a lot of bandwidth
Dylan16807 2 hours ago [-]
The title makes it sound like there was some kind of blowout, but really it was a tool that wasn't the best fit for this job, and they were using twice as much CPU as necessary, nothing crazy.
a_t48 2 hours ago [-]
Did they consider iceoryx2? From the outside, it feels like it fits the bill.
akira2501 2 hours ago [-]
> A single 1080p raw video frame would be 1080 * 1920 * 1.5 = 3110.4 KB in size

They seem to not understand the fundamentals of what they're working on.

> Chromium's WebSocket implementation, and the WebSocket spec in general, create some especially bad performance pitfalls.

You're doing bulk data transfers into a multiplexed short messaging socket. What exactly did you expect?

> However there's no standard interface for transporting data over shared memory.

Yes there is. It's called /dev/shm. You can use shared memory like a filesystem, and no, you should not be worried about user/kernel space overhead at this point. It's the obvious solution to your problem.

> Instead of the typical two-pointers, we have three pointers in our ring buffer:

You can use two back to back mmap(2) calls to create a ringbuffer which avoids this.

Scaevolus 1 hours ago [-]
It's pretty funny that they assumed that memory copying was the limiting factor when they're pushing a mere 150MB/s around instead of the various websocket overheads, then jumped right into over-engineering a zero copy ring buffer. I get it, but come on!

More than 50 GB/s of memory bandwidth is common nowadays[1], and will basically never be the bottleneck for 1080P encoding. Zero copy matters when you're doing something exotic, like Netflix pushing dozens of GB/s from a CDN node.

[1]: https://lemire.me/blog/2024/01/18/how-much-memory-bandwidth-...
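
To put the numbers in this subthread side by side, here is a rough back-of-the-envelope sketch. The 1.5 bytes per pixel and the 150 MB/s and 50 GB/s figures come from the comments above; the 30 fps frame rate is purely an assumption for illustration:

```python
MEMORY_BANDWIDTH = 50e9  # bytes/s, the "50 GB/s" figure from the comment


def raw_frame_bytes(width, height, bytes_per_pixel=1.5):
    """Size of one uncompressed frame at the given bytes per pixel."""
    return int(width * height * bytes_per_pixel)


def stream_rate(frame_bytes, fps):
    """Bytes per second for a single raw stream at a given frame rate."""
    return frame_bytes * fps


def bandwidth_share(rate, bandwidth=MEMORY_BANDWIDTH):
    """Fraction of the memory bandwidth consumed by moving `rate` bytes/s once."""
    return rate / bandwidth


frame = raw_frame_bytes(1920, 1080)
print(frame)                   # bytes per 1080p frame
print(stream_rate(frame, 30))  # bytes/s at an assumed 30 fps
print(bandwidth_share(150e6))  # share of 50 GB/s used by 150 MB/s
```

Each extra copy multiplies the traffic, since a copy both reads and writes the data, but even several copies of 150 MB/s stay a small share of 50 GB/s.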

didip 25 minutes ago [-]
I agree with you. The moment they said shared memory, I was thinking /dev/shm. Lots of programming languages have libraries for /dev/shm already.

And since it behaves like a filesystem, you can swap it with a real filesystem during testing. Very convenient.

I am curious if they tried this already or not and if they did, what problems did they encounter?
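
As one illustration of such a library, Python's standard `multiprocessing.shared_memory` module wraps POSIX shared memory, which on Linux shows up under /dev/shm. This is only a sketch of handing one frame from a producer to a consumer by name; the helper names are hypothetical, and it has none of the synchronization that a real ring buffer like the one in the article would need:

```python
from multiprocessing import shared_memory


def write_frame(frame: bytes) -> shared_memory.SharedMemory:
    """Producer side: create a named shared-memory segment and copy one frame into it."""
    shm = shared_memory.SharedMemory(create=True, size=len(frame))
    shm.buf[:len(frame)] = frame
    return shm  # the caller keeps this handle and is responsible for close() and unlink()


def read_frame(name: str, size: int) -> bytes:
    """Consumer side: attach to the segment by name and read the frame back.

    The size is passed explicitly because the segment may be rounded up to a
    page boundary. A real consumer would work on shm.buf directly rather than
    copying it into a bytes object as done here.
    """
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bytes(shm.buf[:size])
    finally:
        shm.close()


frame = bytes(range(256)) * 100  # stand-in for raw video data
segment = write_frame(frame)
try:
    received = read_frame(segment.name, len(frame))
finally:
    segment.close()
    segment.unlink()  # remove the segment once both sides are done
```

In a real setup only the segment name and an offset would travel over a small control channel, while the frame data itself stays in the shared mapping.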

anonymous344 55 minutes ago [-]
well someone will feel like an idiot after reading your facts. This is why education and experience are important. Not just a React/Rust course and then you are a full stack senior :D
ComputerGuru 2 hours ago [-]
I don't mean to be dismissive, but this would have been caught very early on (in the planning stages) by anyone that had/has experience in system-level development rather than full-stack web js/python development. Quite an expensive lesson for them to learn, even though I'm assuming they do have the talent somewhere on the team if they're able to maintain a fork of Chromium.

(I also wouldn't be surprised if they had even more memory copies than they let on, marshalling between the GC-backed JS runtime to the GC-backed Python runtime.)

I was coming back to HN to include in my comment a link to various high-performance IPC libraries, but another commenter already beat me linking to iceoryx2 (though of course they'd need to use a python extension).

SHM for IPC has been well-understood as the better option for high-bandwidth payloads from the 1990s and is a staple of Win32 application development for communication between services (daemons) and clients (guis).

diroussel 2 hours ago [-]
Sometimes it is more important to work on proving you have a viable product and market to sell it in before you optimise.

On the outside we can’t be sure. But it’s possible that they took the right decision to go with a naïve implementation first. Then profile, measure and improve later.

But yes, the whole idea of running a headless web browser to run JavaScript to get access to a video stream is a bit crazy. But I guess that’s just the world we are in.

CharlieDigital 2 hours ago [-]

> I don't mean to be dismissive, but this would have been caught very early on (in the planning stages) by anyone that had/has experience in system-level development rather than full-stack web js/python development

Based on their job listing[0], Recall is using Rust on the backend.

[0] https://www.workatastartup.com/companies/recall-ai

Sesse__ 2 hours ago [-]
It's not even clear why they need a browser in the mix; most of these services have APIs you can use. (Also, why fork Chromium instead of using CEF?)
randomdata 1 hours ago [-]
> rather than full-stack web js/python development.

The product is not a full-stack web application. What makes you think that they brought in people with that kind of experience just for this particular feature?

Especially when they claim that they chose that route because it was what was most convenient. While you might argue that wasn't the right tradeoff, it is a common tradeoff developers of all kinds make. “Make It Work, Make It Right, Make It Fast” has become pervasive in this industry, for better or worse.

whatever1 2 hours ago [-]
Wouldn’t something like Redis also be an alternative?
beoberha 51 minutes ago [-]
Classic Hacker News getting hung up on the narrative framing. It’s a cool investigation! Nice work guys!
bauruine 53 minutes ago [-]
FWIW: The MTU of the loopback interface on Linux is 64KB by default
londons_explore 2 hours ago [-]
They are presumably using the GPU for video encoding....

And the GPU for rendering...

So they should instead just be hooking into Chromium's GPU process and grabbing the pre-composited tiles from the LayerTreeHostImpl[1] and dealing with those.

[1]: https://source.chromium.org/chromium/chromium/src/+/main:cc/...

orf 1 hours ago [-]
One of the first parts of the post explains how they are using CPUs only
mbb70 1 hours ago [-]
They are very explicit in the article that they run everything on CPUs.
isoprophlex 1 hours ago [-]
You'd think so but nope, they deliberately run on CPU, as per the article...
OptionOfT 2 hours ago [-]
Did they originally NOT run things on the same machine? Otherwise the WebSocket would be local and incur no cost.
nemothekid 2 hours ago [-]
>WebSocket would be local and incur no cost.

The memcopys are the cost that they were paying, even if it was local.

magamanlegends 2 hours ago [-]
our websocket traffic is roughly 40% of recall.ai and our bill was $150 USD this month using a high memory VPS
jgauth 2 hours ago [-]
Did you read the article? It is about the CPU cost of using WebSockets to transfer data over loopback.
kunwon1 1 hours ago [-]
I read the entire article and that wasn't my takeaway. After reading, I assumed that AWS was (somehow) billing for loopback bandwidth, it wasn't apparent (to me) from the article that CPU costs were the sticking point
dbrower 1 hours ago [-]
How much did the engineering time to make this optimization cost?
jazzyjackson 28 minutes ago [-]
I for one would like to praise the company for sharing their failure; hopefully next time someone Googles "transport video over websocket" they'll find this thread.
renewiltord 2 hours ago [-]
That's a good write-up with a standard solution in some other spaces. Shared memory buffers are very fast too. It's interesting to see them being used here. Nice write up. It wasn't what I expected: that they were doing something dumb with API Gateway Websockets. This is actual stuff. Nice.
CyberDildonics 2 hours ago [-]
Actual reality beyond the fake title:

"using WebSockets over loopback was ultimately costing us $1M/year in AWS spend"

then

"and the quest for an efficient high-bandwidth, low-latency IPC"

Shared memory. It has been there for 50 years.

jgalt212 2 hours ago [-]
> But it turns out that if you IPC 1TB of video per second on AWS it can result in enormous bills when done inefficiently.

As a point of comparison, how many TB per second of video does Netflix stream?

ffsm8 1 hours ago [-]
I don't think that number is as easy to figure out as most people think.

Netflix has hardware that ISPs can install so that its content can be served without saturating the ISPs' lines.

There is a statistic floating around that Netflix was responsible for 15% of global traffic in 2022/2023, and YouTube 12%. If that number is real... That'd be a lot more

hipadev23 2 hours ago [-]
what was the actual cost? cpu?
cynicalsecurity 1 hours ago [-]
They are desperately trying to blame anyone except themselves.
yapyap 2 hours ago [-]
> But it turns out that if you IPC 1TB of video per second on AWS it can result in enormous bills when done inefficiently.

that’s surprising to.. almost no one? 1TBPS is nothing to scoff at

blibble 1 hours ago [-]
in terms of IPC, DDR5 can do about 50GB/s per memory channel

assuming you're only shuffling bytes around, on bare metal this would be ~20 DDR5 channels worth

or 2 servers (12 channels/server for EPYC)

you can get an awful lot of compute these days for not very much money

(shipping your code to the compressed video instead of the exact opposite would probably make more sense though)
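
The estimate above is easy to check. This sketch only redoes the arithmetic with the figures from the comment (1 TB/s total, about 50 GB/s per DDR5 channel, 12 channels per EPYC server) and, like the comment, assumes the data is moved across memory exactly once:

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


TOTAL_BYTES_PER_SEC = 1_000_000_000_000  # 1 TB/s of IPC traffic
CHANNEL_BYTES_PER_SEC = 50_000_000_000   # ~50 GB/s per DDR5 channel
CHANNELS_PER_SERVER = 12                 # EPYC figure from the comment

channels = ceil_div(TOTAL_BYTES_PER_SEC, CHANNEL_BYTES_PER_SEC)
servers = ceil_div(channels, CHANNELS_PER_SERVER)
print(channels, servers)
```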

thadk 1 hours ago [-]
Could Arrow be a part of the shared memory solution in another context?