Classic story of a startup taking a "good enough" shortcut and then coming back later to optimize.
---
I have a similar story: Where I work, we had a cluster of VMs that were always high CPU and a bit of a problem. We had a lot of fire drills where we'd have to bump up the size of the cluster, abort in-progress operations, or some combination of both.
Because this cluster of VMs was doing batch processing that the founder believed should be CPU intense, everyone just assumed that increasing load came with increasing customer size; and that this was just an annoyance that we could get to after we made one more feature.
But, at one point the bean counters pointed out that we spent disproportionately more on cloud than a normal business did. After one round of combining different VM clusters (that really didn't need to be separate servers), I decided that I could take some time to hook this very CPU intense cluster up to a profiler.
I thought I was going to be in for a 1-2 week project and would follow a few worms. Instead, the CPU load was because we were constantly loading an entire table, that we never deleted from, into the application's process. The table had transient data that should only last a few hours at most.
I quickly deleted almost a decade's worth of obsolete data from the table. After about 15 minutes, CPU usage for this cluster dropped to almost nothing. The next day we made the VM cluster a fraction of its size, and in the next release, we got rid of the cluster and merged the functionality into another cluster.
I also made a pull request that introduced a simple filter to the query to only load 3 days of data; and then introduced a background operation to clean out the table periodically.
sly010 57 days ago [-]
Hah. At a previous place I found that our cloud cost consisted of 90% storage costs. The data? Tens of thousands of incomplete backups of the in-office file server. 3 years of the NAS continuously trying to back itself up to S3 and failing every time.
declan_roberts 57 days ago [-]
I love these stories. I have a few as well. In the end I know we're all just doing our job, but I've been tempted at times to say to my manager: "I will save the company $10k/month tomorrow if you give me a cut of the pie."
euroderf 57 days ago [-]
This should be the norm, actually.
It drives ya nuts to read a story where some guy on the shop floor saves his employer ten million dollars and to reward him they give him a 20% off coupon for Home Depot.
freedomben 56 days ago [-]
This sounds nice in theory, but it would incentivize people to introduce unnecessarily expensive things initially and optimize them later to claim some of the savings. We would like to think that nobody would do something like that, but the sad reality is that there are plenty who would, especially as the potential reward gets high enough.
7jjjjjjj 56 days ago [-]
Boss makes a dollar, I make a dime, that's why my code runs in exponential time.
zuhsetaqi 56 days ago [-]
> It drives ya nuts to read a story where some guy on the shop floor saves his employer ten million dollars and to reward him they give him a 20% off coupon for Home Depot.
It’s your job as an employee, it’s why you get paid in the first place
mgkimsal 55 days ago [-]
Is it? Or is an employee's job to just do the work they're asked to do?
"It depends" is, of course, the common answer, but in most places I've worked, "please help find operational optimizations that can have a positive impact for the team, department or organization" has certainly never been an explicit ask.
Asking teammates to change something to help on a project may fall under the same category, but usually the effect isn't felt beyond that project.
My default mode when coming in to jobs is to try to get a 'full company' view, because I want to know how things work and how they might be made better. That approach is usually not met with much enthusiasm, and usually more with "that's not your job, you don't need to know that", etc.
I took a daily import routine that was taking 25+ hours (meaning we couldn't show 'daily info' because it was out of date before finishing import) and got it down to 30 minutes. This was after having to fight/argue to see the data, and being told for a couple weeks "it can't be sped up, we'll have to buy faster hardware" ($8-$10k min, but they weren't looking at $15-20k IIRC). I spent a few hours over a weekend and got it down to 30 minutes, and saved the company minimum $8k. But I had to fight/argue to even do that ("that's not your job", "Charles is taking care of that", "the client will just have to deal with more delays while we upgrade", etc).
evoke4908 54 days ago [-]
Uh, no. Have you ever been employed? Your job is what is laid out in your contract, period. You are paid to do that specific set of tasks and nothing else. "Other duties as required" be damned.
Fixing the business is very explicitly not your job, and is absolutely not what you're paid for. Any value you create for the business outside of those bounds is at your own cost and you absolutely will not be compensated unless the business is so small you don't have six layers of management trying to extract any kind of promotion.
antisthenes 57 days ago [-]
99% of the time, it's either a quadratic (or exponential) algorithm or a really bad DB query.
dd82 57 days ago [-]
can also be a linear algorithm that does N+1 queries. ORMs can be very good at hiding this implementation detail
It wasn't a financial cost, but the biggest single performance improvement I've seen firsthand came from optimizing a SQL query. One of our Professional Services people had written a query that did repeated self-joins on a fairly large table, which took ~15 minutes to run. A DBA-turned-dev on our team rewrote it using MSSQL's PIVOT operator, and the query started executing in less than a second.
alsetmusic 57 days ago [-]
As much as you can say (perhaps not hard numbers, but as a percentage), what was the savings to the bottom line / cloud costs?
gwbas1c 57 days ago [-]
Probably ~5% of cloud costs. Combined with the prior round of optimizations, it was substantial.
I was really disappointed when my wife couldn't get the night off from work when the company took everyone out to a fancy steak house.
chgs 57 days ago [-]
So you saved the company $10k a month and got a $200 meal in gratitude? Awesome.
ChadNauseam 57 days ago [-]
I'm not sure how they feel, but when it happens to me, it's not a big deal because it's my job to do things like that. If I fuck up and cost them $10k/month I'm certainly not going to offer to reimburse them.
bagels 57 days ago [-]
They're presumably already being paid a salary to do this work.
ponty_rick 57 days ago [-]
They're more pissed about the 1.2M they spent than about the 10k a month they saved.
_hyn3 57 days ago [-]
> So you saved the company $10k a month and got a $200 meal in gratitude? Awesome.
You seem to be assuming that a $200 meal was the only compensation the person received, and they weren't just getting a nice meal as a little something extra on top of getting paid for doing their job competently and efficiently.
But that's the kind of deal I make when I take a job: I do the work (pretty well most of the time), and I get paid. If I stop doing the work, I stop getting paid. If they stop paying, I stop doing the work. (And bonus, literally, if I get a perk once in a while like a free steak dinner that I wasn't expecting)
It doesn't have to be more complicated than that.
meiraleal 57 days ago [-]
Yeah? Well, proper rewards make those savings and optimizations more common. Otherwise most people will do the work needed just to have work tomorrow.
groby_b 57 days ago [-]
Depends. There are people who put in the absolute minimum work they can get away with, and there are people who have pride in their profession.
That's independent of pay scale.
Granted, if you pay way below expectations, you'll lose the professionals over time. But if you pay lavishly no matter what, you get the 2021/2022 big tech hiring cycle instead. Neither one is a great outcome.
meiraleal 56 days ago [-]
A business that relies on people having pride in their profession won't scale. Proper rewards scale.
groby_b 54 days ago [-]
We demonstrated in 2022 that they don't.
There is no single mechanism that does. Paying well is always a component in attracting talent (see my original comment)
It is not a guarantor of quality/motivation. That's ongoing leadership work. And part of that is maintaining/kindling pride in people's work (and firing the ones who are just there for the money)
Sohcahtoa82 57 days ago [-]
It creates a perverse incentive to deliberately do things a more expensive way at the beginning and then be a hero 6 months down the line by refactoring it to be less expensive.
fn-mote 57 days ago [-]
Ha ha, software developers already have this incentive. Viz: "superhero 10x programmer" writing unmaintainable code to provide some desirable features, whose work later turns out to be much less awesome than originally billed.
Of course the truth is more complicated than the sound bite, but still...
gwbas1c 56 days ago [-]
The dinner was to celebrate getting acquired, which was a wide team effort. The cost savings I did was one of the pieces I contributed.
Don't assume that a steak dinner was the only recognition we got.
As far as comp: I was well taken care of, and I won't discuss more in a public forum.
Cyphase 57 days ago [-]
What should they have gotten?
tempest_ 57 days ago [-]
They are in theory owed nothing more than their salary, but it can be very good for morale to reward that type of thing (assuming it doesn't introduce a perverse incentive).
wellthisisgreat 57 days ago [-]
Yeah, oppositely they should share the downsides of the burdens lol
misstercool 57 days ago [-]
It is a good problem to have for a startup; most startups are struggling to find customers to use their thing. Better to go with the "good enough" shortcut and prioritize growth. Recall is a YC company, and I am sure they took advantage of a huge amount of AWS credits in the first few years.
magundu 57 days ago [-]
Great.
It would be great if someone wrote a checklist (playbook) for working through CPU, memory, disk, IO, and network issues.
wiml 57 days ago [-]
> One complicating factor here is that raw video is surprisingly high bandwidth.
It's weird to be living in a world where this is a surprise but here we are.
Nice write up though. WebSockets has a number of nonsensical design decisions, but I wouldn't have expected this to be the one chewing up all your CPU.
arccy 57 days ago [-]
I think it's just rare for a lot of people to be handling raw video. Most people interact with highly efficient (lossy) codecs on the web.
adastra22 57 days ago [-]
Even compressed video is massive data though.
carlhjerpe 57 days ago [-]
I was surprised when calculating and sizing the shared memory for my gaming VM for use with "Looking-Glass". At 165 Hz 2K HDR it's many gigabytes per second; that's why HDMI and DisplayPort are specced really high.
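Rough math (assuming "2K" here means 2560x1440 and either a 32-bit SDR or a 64-bit FP16 HDR surface, which is a guess about the pixel format):

    // Back-of-the-envelope bandwidth for an uncompressed desktop capture.
    // Pixel formats are assumptions: 4 bytes/px for SDR, 8 bytes/px for FP16 HDR.
    fn main() {
        let (w, h, hz) = (2560u64, 1440u64, 165u64);
        for (label, bytes_per_px) in [("32-bit SDR", 4u64), ("64-bit FP16 HDR", 8u64)] {
            let bytes_per_sec = w * h * hz * bytes_per_px;
            println!("{label}: {:.1} GB/s", bytes_per_sec as f64 / 1e9);
        }
    }

That works out to roughly 2.4 GB/s for SDR and about 4.9 GB/s for FP16 HDR, before any protocol overhead.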
jazzyjackson 56 days ago [-]
I'd still like to get a looking glass. What did you end up with?
sensanaty 57 days ago [-]
I always knew video was "expensive", but my mark for what expensive meant was a good few orders of magnitude off when I researched the topic for a personal project.
I can easily imagine the author being in a similar boat, knowing that it isn't cheap, but then not realizing that expensive in this context truly does mean expensive until they actually started seeing the associated costs.
handfuloflight 57 days ago [-]
> It's weird to be living in a world where this is a surprise but here we are.
I think it's because the cost of it is so abstracted away with free streaming video all across the web. Once you take a look at the egress and ingress sides you realize how quickly it adds up.
turnsout 56 days ago [-]
I’m so confused… they were sending uncompressed video to an AWS server?
If so, they deserve a $1M bill.
7jjjjjjj 56 days ago [-]
It was on a loopback interface. The problem was CPU usage, not bandwidth costs.
turnsout 56 days ago [-]
Let me rephrase. They were processing uncompressed video via a loopback interface?
turtlebits 57 days ago [-]
Is this really an AWS issue? Sounds like you were just burning CPU cycles, which is not AWS related. WebSockets makes it sound like it was a data transfer or API gateway cost.
brazzy 57 days ago [-]
Neither the title nor the article are painting it as an AWS issue, but as a websocket issue, because the protocol implicitly requires all transferred data to be copied multiple times.
bigiain 57 days ago [-]
I disagree. Like @turtlebits, I was waiting for the part of the story where websocket connections between their AWS resources somehow got billed at Amazon's internet data egress rates.
turtlebits 57 days ago [-]
If you call out your vendor, the issue usually lies with some specific issue with them or their service. The title obviously states AWS.
If I said that "childbirth cost us 5000 on our <hospital name> bill", you assume the issue is with the hospital.
Capricorn2481 57 days ago [-]
Only for people that just read headlines and make technical decisions based on them. Are we catering to them now? The title is factual and straightforward.
Nevermark 57 days ago [-]
And also highlights a meaningful irrelevance.
The idea that clearer titles are just babying some class of people is perverse.
Titles are the foremost means of deciding what to read, for anyone of any sophistication. Clearer titles benefit everyone.
The subject matter is meaningful to more than AWS users, but non-AWS users are going to be less likely to read it based on the title.
anitil 57 days ago [-]
I didn't know this - why is this the case?
VWWHFSfQ 57 days ago [-]
> Is this really an AWS issue?
I doubt they would have even noticed this outrageous cost if they were running on bare-metal Xeons or Ryzen colo'd servers. You can rent real 44-core Xeon servers for like, $250/month.
So yes, it's an AWS issue.
JackSlateur 57 days ago [-]
> You can rent real 44-core Xeon servers for like, $250/month.
Where, for instance ?
Faaak 57 days ago [-]
Hetzner for example. An EPYC 48c (96t) goes for 230 euros
Hetzner's network is a complete dog. They also sell you machines that should have long been EOL'ed. No serious business should be using them.
dijit 57 days ago [-]
What cpu do you think your workload is using on AWS?
GCP exposes their CPU models, and they have some Haswell and Broadwell lithographies in service.
That's a 10+ year old part, for those paying attention.
tsimionescu 57 days ago [-]
I think they meant that Hetzner is offering specific machines they know to be faulty and should have EOLd to customers, not that they use deprecated CPUs.
dijit 57 days ago [-]
That's scary if true, any sources? My google-fu is failing me. :/
akvadrako 57 days ago [-]
It's not scary, it's part of the value proposition.
I used to work for a company that rented lots of Hetzner boxes. Consumer grade hardware with frequent disk failures was just what we accepted for saving a buck.
tsimionescu 57 days ago [-]
Sorry, I have no idea if this is true. I was just pointing out what the GP was trying to claim.
dilyevsky 57 days ago [-]
Most of GCP and some AWS instances will migrate to another node when it’s faulty. Also disk is virtual. None of this applies to baremetal hetzner
dijit 57 days ago [-]
Why is that relevant to what I said?
dilyevsky 57 days ago [-]
Only relevant if you care about reliability
dijit 57 days ago [-]
AWS was working “fine” for about 10 years without live migration, and I’ve had several individual machines running without a reboot or outage for quite literally half a decade. Enough to hit bugs like this: https://support.hpe.com/hpesc/public/docDisplay?docId=a00092...
Anyway, depending on individual nodes to always be up for reliability is incredibly foolhardy. Things can happen, cloud isn't magic, I’ve had instances become unrecoverable. Though it is rare.
So, I still don’t understand the point, that was not exactly relevant to what I said.
blibble 57 days ago [-]
I just cat'ed /proc/cpuinfo on my Hetzner and AWS machines
AWS: E5-2680 v4 (2016)
Hetzner: Ryzen 5 (2019)
dilyevsky 57 days ago [-]
Now do hard drives
blibble 57 days ago [-]
the hetzner one is a dedicated pcie 4.0 nvme device and wrote at 2.3GB/s (O_DIRECT)
the AWS one is some emulated block device, no idea what it is, other than it's 20x slower
luuurker 57 days ago [-]
You keep moving the goal posts with these replies.
Hetzner isn't the best provider in the world, but it's also not as bad as you say they are. They're not just renting old servers.
speedgoose 57 days ago [-]
I know serious businesses using Hetzner for their critical workloads. I wouldn’t unless money is tight, but it is possible. I use them for my non critical stuff, it costs so much less.
VWWHFSfQ 57 days ago [-]
There are many colos that offer dedicated server rental/hosting. You can just google for colos in the region you're looking for. I found this one
I don't know anything about Colo Crossing (are they a reseller?) but I would bet their $60 per month 4-core Intel Xeons would outperform a $1,000 per month "compute optimized" EC2 server.
phonon 57 days ago [-]
For $1000 per month you can get a c8g.12xlarge (assuming you use some kind of savings plan).[0] That's 48 cores, 96 GB of RAM and 22.5+ Gbps networking. Of course you still need to pay for storage, egress etc., but you seem to be exaggerating a bit....they do offer a 44 core Broadwell/128 GB RAM option for $229 per month, so AWS is more like a 4x markup[1]....the C8g would likely be much faster at single threaded tasks though[2][3]
Wouldn't a c8g.12xlarge with 500 GB of storage (only EBS is possible), plus 1 Gbps from/to the internet, be about 5,700 USD per month? That's some discount you have.
If I try to match the actual machine: 16 GB RAM, and a rough estimate is that their Xeon E3-1240 would be ~2 AWS vCPUs. So an r6g.large is the instance that would roughly match this one. Add a 500 GB disk + 1 Gbps to/from the internet and ... monthly cost 3,700 USD.
Without any disk and without any data transfer (which would be unusable) it's still ~80USD. Maybe you could create a bootable image that calculates primes.
These are still not the same thing, I get it, but ... it's safe to say you cannot get anything remotely comparable on AWS. You can only get a different thing for way more money.
If you have high bandwidth needs on AWS you can use AWS Lightsail, which has some discounted transfer rates.
spwa4 56 days ago [-]
Even just the compute, without even disk, is barely competitive.
phonon 55 days ago [-]
I'm not sure I understand your point anymore.
petcat 57 days ago [-]
> That's 48 cores
That's not dedicated 48 cores, it's 48 "vCPUs". There are probably 1,000 other EC2 instances running on those cores stealing all the CPU cycles. You might get 4 cores of actual compute throughput. Which is what I was saying
phonon 57 days ago [-]
That's not how it works, sorry. (Unless you use burstable instances, like T4g) You can run them at 100% as long as you like, and it has the same performance (minus a small virtualization overhead).
petcat 57 days ago [-]
Are you telling me that my virtualized EC2 server is the only thing running on the physical hardware/CPU? There are no other virtualized EC2 servers sharing time on that hardware/CPU?
phonon 57 days ago [-]
If you are talking about regular EC2 (not T series, or Lambda, or Fargate etc.) you get the same performance (within say 5%) of the underlying hardware. If you're using a core, it's not shared with another user. The pricing validates this...the "metal" version of a server on AWS is the same price as the full regular EC2 version.
In fact, you can even get a small discount with the -flex series, if you're willing to compromise slightly. (Small discount for 100% of performance 95% of the time).
petcat 57 days ago [-]
This seems pretty wild to me. Are you saying that I can submit instructions to the CPU and they will not be interleaved and the registers will not be swapped-out with instructions from other EC2 virtual server applications running on the same physical machine?
doctorpangloss 57 days ago [-]
Only the t instances and other VM types that have burst billing are overbooked in the sense that you are describing.
nostrebored 57 days ago [-]
Yes — you can validate this by benchmarking things like l1 cache
phonon 57 days ago [-]
Welcome to the wonderful world of multi-core CPUs...
fragmede 57 days ago [-]
What benchmark would you like to use?
petcat 57 days ago [-]
This blog is about doing video processing on the CPU, so something akin to that.
> In a typical TCP/IP network connected via ethernet, the standard MTU (Maximum Transmission Unit) is 1500 bytes, resulting in a TCP MSS (Maximum Segment Size) of 1448 bytes. This is much smaller than our 3MB+ raw video frames.
> Even the theoretical maximum size of a TCP/IP packet, 64k, is much smaller than the data we need to send, so there's no way for us to use TCP/IP without suffering from fragmentation.
Just highlights that they do not have enough technical knowledge in house. They should spend the $1M/year savings on hiring some good devs.
hathawsh 57 days ago [-]
Why do you say that? Their solution of using shared memory (structured as a ring buffer) sounds perfect for their use case. Bonus points for using Rust to do it. How would you do it?
Edit: I guess perhaps you're saying that they don't know all the networking configuration knobs they could exercise, and that's probably true. However, they landed on a more optimal solution that avoided networking altogether, so they no longer had any need to research network configuration. I'd say they made the right choice.
This is because, reading how they came up with the solution, it is clear they have little understanding of how low-level stuff works. For example, they were surprised by the amount of data, that TCP packets are not the same as application-level packets or frames, etc.
As for the ring buffer design, I'm not sure I understand their solution. The article mentions the media encoder runs in a separate process. Chromium threads live in their processes (afaik) as well. But the ring buffer requirement says "lock free", which only makes sense inside a single process.
rstuart4133 56 days ago [-]
> But ring buffer requirement says “lock free” which only make sense inside a single process.
No, "lock free" is a thing that's nice to have when you've got two threads accessing the same memory. It doesn't matter if those two threads are in the same process or it's two different processes accessing the same memory. It's almost certainly two different processes in this case, and the shared memory is probably memory mapped file.
Whatever it is, the shared memory approach is going to be much faster than using the kernel to ship the data between the two processes. Going via the kernel means two copies, and probably two syscalls as well.
kikimora 47 days ago [-]
I understand you can setup a data structure in shared memory and use lock free instructions to access it. However, I have never seen this is done in practice due to complexity. One particularly complicated scenario that comes to mind is dealing with unexpected process failures. This is quite different to dealing with exceptions in a thread.
evoke4908 54 days ago [-]
"Lock-free" does not in any way imply a single process. Quite the opposite. We don't call single-thread code lock-free because all single-thread code is lock free by definition. You kind of can't use locks at all in this context, so it makes no sense to describe it as lock-free. This is like gluten-free water, complete nonsense.
Lock-free code is designed for concurrent access, but using some clever tricks to handle synchronization between processes without actually invoking a lock. Lock-free explicitly means parallel.
kikimora 47 days ago [-]
I’m talking about single process with multiple threads where lock free makes sense.
karamanolev 57 days ago [-]
I fail to see how TCP/IP fragmentation really affects this use case. I don't know why it's mentioned, or why it would cause issues given that there aren't multiple network devices with different MTUs. Am I right? Is that the lack of technical knowledge you're referring to, or am I missing something?
drowsspa 57 days ago [-]
Sounds weird that apparently they expected to send 3 MB in a single TCP packet
bcrl 57 days ago [-]
Modern NICs will do that for you via a feature called TSO -- TCP Segmentation Offload.
More shocking to me is that anyone would attempt to run network throughput oriented software inside of Chromium. Look at what Cloudflare and Netflix do to get an idea what direction they should really be headed in.
oefrha 57 days ago [-]
They use Chromium (or any other browser) not out of choice but because they have to in order to participate in third party video conference sessions. Of course it’s best to reverse engineer the video conferencing clients and do HTTP requests directly without a headless browser, but I presume they’ve tried that and it’s very difficult, not to mention prone to breaking at any moment.
What’s surprising to me is they can’t access the compressed video on the wire and have to send decoded raw video. But presumably they’ve thought about that too.
dmazzoni 57 days ago [-]
I'm assuming it's because the compressed video on the wire is encrypted?
ahoka 57 days ago [-]
Especially considering there are no packets in TCP.
rstuart4133 55 days ago [-]
There are no packets in the user's API. But under the hood everything is sent in packets, numbered, ACK'ed and checksummed. The maximum packet size supported by IP is 64KB, as they say. I'm surprised the kernel supports that because I'm not aware of any real device that supports packets that big (Ethernet jumbo frames are only 9KB), but I guess it must.
lttlrck 57 days ago [-]
The article reads a like a personal "learn by doing" blog post.
adamrezich 57 days ago [-]
This reminds me of when I was first starting to learn “real game development” (not using someone else's engine)—I was using C#/MonoGame, and, while having no idea what I was doing, decided I wanted vector graphics. I came across libcairo, figured out how to use it, set it all up correctly and everything… and then found that, whoops, sending 1920x1080x4 bytes to your GPU to render, 60 times a second, doesn't exactly work—for reasons that were incredibly obvious, in retrospect! At least it didn't cost me a million bucks to learn from my mistake.
namibj 56 days ago [-]
The sending is fine; cairo just won't create these bitmaps fast enough.
adamrezich 56 days ago [-]
Was this true back in 2011 or so? I'm genuinely curious—this may be yet another layer of me having no idea of what I was doing at the time, but I thought I remember determining (somehow) that the problem was the CPU-to-GPU bottleneck. It may have been that I got 720p 30FPS working just fine, but then 1080p was in the single digits, and I just made a bad assumption, or something.
jmb99 56 days ago [-]
1080p@60 is “only” around 500MB/s, which should have been possible a decade ago. PCIe 1.0 x16 bandwidth maxed out at 4GB/s, so even if you weren’t on a top of the line system with PCIe 2.0 (or brand new 3.0!) you should have been fine on that front[1].
More than likely the CPU wasn’t able to keep up. The pipeline was likely generating a frame, storing it to memory, copying from memory to the PCIe device memory, displaying the frame, then generating the next frame. It wouldn’t surprise me if a ~2010 era CPU struggled doing so.
[1] Pretty much any GPU’s memory bandwidth is going to be limited by link speed. An 8800GTS 320MB from 2007 had a theoretical memory bandwidth of around 64GB/s, for reference.
maxmcd 57 days ago [-]
Please explain?
IX-103 57 days ago [-]
Chromium already has a zero-copy IPC mechanism that uses shared memory built in. It's called Mojo. That's how the various browser processes talk to each other. They could just have passed mojo::BigBuffer messages to their custom process and not had to worry about platform-specific code.
But writing a custom ring buffer implementation is also nice, I suppose...
doctorpangloss 57 days ago [-]
The best way to find out something valuable is to be wrong on the Internet. Next time, bill them $10k for this secret!
handfuloflight 57 days ago [-]
Love the transparency here. Would also love if the same transparency was applied to pricing for their core product. Doesn't appear anywhere on the site.
DrammBA 57 days ago [-]
I use that as a litmus test when deciding whether to use a service: if I can't find a prominently linked pricing page on the homepage, I'm out.
lawrenceduk 57 days ago [-]
It’s ok, it’s now a million dollars/year cheaper when your renewal comes up!
Jokes aside though, some good performance sleuthing there.
h4ck_th3_pl4n3t 57 days ago [-]
The problem is not on network level.
The problem is that the developers behind this way of streaming video data seem to have no idea of how video codecs work.
If they are in control of the headless chromium instances, the video streams, and the receiving backend of that video stream...why not simply use RDP or a similar video streaming protocol that is made exactly for this purpose?
This whole post reads like an article from a web dev that is totally over their head, trying to implement something that they didn't take the time to even think about. Arguing with TCP fragmentation when that is not even an issue, and trying to use a TCP stream when that is literally the worst thing you can do in that situation because of roundtrip costs.
But I guess that there is no JS API for that, so it's outside the development scope? Can't imagine any reason not to use a much more efficient video codec here other than this running in node.js, potentially missing offscreen canvas/buffer APIs and C encoding libraries that you could use for that.
I would not want to work at this company, if this is how they develop software. Must be horribly rushed prototypical code, everywhere.
dmazzoni 57 days ago [-]
Their business is joining meetings from 7 different platforms (Zoom, Meet, WebEx, etc.) and capturing the video.
They don't have control of the incoming video format.
They don't even have access to the incoming video data, because they're not using an API. They're joining the meeting using a real browser, and capturing the video.
Is it an ugly hack? Maybe. But it's also a pretty robust one, because they're not dependent on an API that might break or reverse-engineering a protocol that might change. They're a bit dependent on the frontend, but that changes rarely and it's super easy to adapt when it does change.
h4ck_th3_pl4n3t 57 days ago [-]
I'm not sure you understood what I meant.
They are in control of the bot server that joins with the headless chrome client. They can use the CDP protocol to use the screencast API to write the recorded video stream to filesystem/disk, and then they can literally just run ffmpeg on that on-disk-on-server file and stream it somewhere else.
But instead they decided to use websockets to send it from that bot client to their own backend API, transmitting the raw pixels as either a raw blob or base64 encoded data, each frame, not encoded anyhow. And that is where the huge waste in bandwidth comes from.
(The article hints to this in a lot of places)
yencabulator 56 days ago [-]
They are doing e.g. transcriptions live as the stream happens, not writing a file and batch processing later.
lostmsu 57 days ago [-]
Even in this case it is non-sensical. Dunno about Linux, but on Windows you'd just feed the GPU window surface into a GPU hardware encoder via a shared texture with basically 0 data transmission, and get a compressed stream out.
doctorpangloss 57 days ago [-]
It’s alright.
It is difficult to say, I’ve never used the product. They don’t describe what it is they are trying to do.
If you want to pipe a Zoom call to a Python process it’s complicated.
Everything else that uses WebRTC, I suppose Python should generate the candidates, and the fake browser client hands over the Python process’s candidates instead of its own. It could use the most basic bindings to libwebrtc.
If the bulk of their app is JavaScript, they ought to inject a web worker and use encoded transforms.
But I don’t know though.
marcopolo 57 days ago [-]
Masking in the WebSocket protocol is kind of a funny and sad fix to the problem of intermediaries trying to be smart and helpful, but failing miserably.
How is this a problem of WebSockets and not HTTP in general?
The RFC has a link to a document describing the attack, but the link is broken.
cogman10 57 days ago [-]
This is such a weird way to do things.
Here they have a nicely compressed stream of video data, so they take that stream and... decode it. But they aren't processing the decoded data at the source of the decode, so instead they forward that decoded data, uncompressed(!!), to a different location for processing. Surprisingly, they find out that moving uncompressed video data from one location to another is expensive. So, they compress it later (Don't worry, using a GPU!)
At so many levels this is just WTF. Why not forward the compressed video stream? Why not decompress it where you are processing it instead of in the browser? Why are you writing it without any attempt at compression? Even if you want lossless compression there are well known and fast codecs like FFV1 for that purpose.
Just weird.
bri3d 57 days ago [-]
I think the issue with compression is that they're scraping the online meeting services rather than actually reverse engineering them, so the compressed video stream is hidden inside some kind of black box.
I'm pretty sure that feeding the browser an emulated hardware decoder (ie - write a VAAPI module that just copies compressed frame data for you) would be a good semi-universal solution to this, since I don't think most video chat solutions use DRM like Widevine, but it's not as universal as dumping the framebuffer output off of a browser session.
They could also of course one-off reverse each meeting service to get at the backing stream.
What's odd to me is that even with this frame buffer approach, why would you not just recompress the video at the edge? You could even do it in Javascript with WebCodecs if that was the layer you were living at. Even semi-expensive compression on a modern CPU is going to be way cheaper than copying raw video frames, even just in terms of CPU instruction throughput vs memory bandwidth with shared memory.
It's easy to cast stones, but this is a weird architecture and making this blog post about the "solution" is even stranger to me.
cogman10 57 days ago [-]
> I think the issue with compression is that they're scraping the online meeting services rather than actually reverse engineering them, so the compressed video stream is hidden inside some kind of black box.
I mean, I would presume that the entire reason they forked chrome was to crowbar open the black box to get at the goodies. Maybe they only did it to get a framebuffer output stream that they could redirect? Seems a bit much.
Their current approach is what I'd think would be a temporary solution while they reverse engineer the streams (or even get partnerships with the likes of MS and others. MS in particular would likely jump at an opportunity to AI something).
> What's odd to me is that even with this frame buffer approach, why would you not just recompress the video at the edge? You could even do it in Javascript with WebCodecs if that was the layer you were living at. Even semi-expensive compression on a modern CPU is going to be way cheaper than copying raw video frames, even just in terms of CPU instruction throughput vs memory bandwidth with shared memory.
Yeah, that was my final comment. Even if I grant that this really is the best way to do things, I can't for the life of me understand why they'd not immediately recompress. Video takes such a huge amount of bandwidth that it's just silly to send around bitmaps.
> It's easy to cast stones, but this is a weird architecture and making this blog post about the "solution" is even stranger to me.
Agreed. Sounds like a company that likely has multiple million dollar savings just lying around.
dmazzoni 57 days ago [-]
> Their current approach is what I'd think would be a temporary solution while they reverse engineer the streams (or even get partnerships with the likes of MS and others. MS in particular would likely jump at an opportunity to AI something).
They support 7 meeting platforms. Even if 1 or 2 are open to providing APIs, they're not all going to do that.
Reverse-engineering the protocol would be far more efficient, yes - but it'd also be more brittle. The protocol could change at any time, and reverse-engineering it again could take anywhere between days and weeks. Would you want a product with that sort of downtime?
Also, does it scale? Reverse-engineering 7+ protocols is a lot of engineering work, and it's very specialized work that not any software engineer could just dive into quickly.
In comparison, writing web scrapers to find the video element for 7 different meeting products is super easy to write, and super easy to fix.
lostmsu 57 days ago [-]
If they forked Chromium, they should have direct access to compressed stream of a particular video element without much fuss.
isoprophlex 57 days ago [-]
Article title should have been "our weird design cost us $1M".
As it turns out, doing something in Rust does not absolve you of the obligation to actually think about what you are doing.
dylan604 57 days ago [-]
TFA opening graph "But it turns out that if you IPC 1TB of video per second on AWS it can result in enormous bills when done inefficiently. "
dmazzoni 57 days ago [-]
> Here they have a nicely compressed stream of video data
But they don't.
They support 7 different meeting providers (Zoom, Meet, WebEx, ...), none of which have an API that give you access to the compressed video stream.
In theory, you could try to reverse-engineer each protocol...but then your product could break for potentially days or weeks anytime one of those companies decides to change their protocol - vs web scraping, where if it breaks they can probably fix it in 15 minutes.
Their solution is inefficient, but robust. And that's ultimately a more monetizable product.
tbarbugli 57 days ago [-]
Possibly because they capture the video from xvfb or similar (they run a headless browser to capture the video) so at that point the decoding already happened (webrtc?)
rozap 57 days ago [-]
Really strange. I wonder why they omitted this. Usually you'd leave it compressed until the last possible moment.
dylan604 57 days ago [-]
> Usually you'd leave it compressed until the last possible moment.
Context matters? As someone working in production/post, we want to keep it uncompressed until the last possible moment. At least as far as no more compression than how it was acquired.
DrammBA 57 days ago [-]
> Context matters?
It does, but you just removed all context from their comment and introduced a completely different context (video production/post) for seemingly no reason.
Going back to the original context, which is grabbing a compressed video stream from a headless browser, the correct approach to handle that compressed stream is to leave it compressed until the last possible moment.
pavlov 57 days ago [-]
Since they aim to support every meeting platform, they don’t necessarily even have the codecs. Platforms like Zoom can and do use custom video formats within their web clients.
With that constraint, letting a full browser engine decode and composite the participant streams is the only option. And it definitely is an expensive way to do it.
sfink 57 days ago [-]
...and this is why I will never start a successful business.
The initial approach was shipping raw video over a WebSocket. I could not imagine putting something like that together and selling it. When your first computer came with 64KB in your entire machine, some of which you can't use at all and some you can't use without bank switching tricks, it's really really hard to even conceive of that architecture as a possibility. It's a testament to the power of today's hardware that it worked at all.
And yet, it did work, and it served as the basis for a successful product. They presumably made money from it. The inefficiency sounds like it didn't get in the way of developing and iterating on the rest of the product.
I can't do it. Premature optimization may be the root of all evil, but I can't work without having some sense for how much data is involved and how much moving or copying is happening to it. That sense would make me immediately reject that approach. I'd go off over-architecting something else before launching, and somebody would get impatient and want their money back.
dmazzoni 57 days ago [-]
The initial approach was shipping raw video over a WebSocket...between two processes running on the same machine.
That doesn't sound like a ridiculous idea to me. How else would you get video data out of a sandboxed Chromium process?
sfink 56 days ago [-]
Short answer: raw video is big.
With my mindset, you have a gigantic chunk of data. Especially if you're recording multiple streams per machine. The immediate thought is that you want to avoid copying as much as possible. If you really, really have to, you can copy it once. Maybe even twice, though before moving from 1 to 2 copies you should spend some time thinking about whether it's possible to move from 1 to 0, or never materializing the full data at all (i.e., keep it compressed, which could apply here but only as an optimization for certain video applications and so is irrelevant to the bootstrapping phase).
WebSockets take your giant chunk of data and squeeze it through a straw. How many times does each byte get copied in the process? I don't know, but probably more than twice. Even worse, it's going to process it in chunks, so you're going to have per-chunk overhead (maybe including a context switch?) that is O(number of chunks in a giant data set).
But the application fundamentally requires squishing that giant data back down again, which immediately implies moving the computation to the data. I would want to experiment with a wasm-compiled video compressor (remember, we already have the no GPU constraint, so it's ok to light the CPU on fire), and then get compressed video out of the sandbox. WebSockets don't seem unreasonable for that -- they probably cost a factor of 2-4 over the raw data size, but once you've gained an order of magnitude from the compression, that's in the land of engineering tradeoffs. The bigger concern is dropping frames by combining the frame generation and reading with the compression, though I think you could probably use a Web Worker and SharedArrayBuffers to put those on different cores.
But I'm wrong. The data isn't so large that the brute force approach wouldn't work at all. My version would take longer to get up and running, which means they couldn't move on to the rest of the system.
ketzo 57 days ago [-]
If you can feel that way about your work, but still understand that this approach has its own benefits, you're probably a really good person to hire when a startup does hit scaling issues from their crappy original code!
Knowing thyself is a superpower all its own; we need people to write scrappy code to validate a business idea, and we need people who look at code with disgust, throw it out, and write something 100x as efficient.
austin-cheney 57 days ago [-]
> We read through the WebSocket RFC, and Chromium's WebSocket implementation, dug through our profile data, and discovered two primary causes of slowness: fragmentation, and masking.
So they are only halfway correct about masking. The RFC does mandate that client-to-server communication be masked, but that is only enforced by web browsers. If the client is absolutely anything else, just ignore masking. Since the RFC requires a bit to identify whether a message is masked, and that bit is in no way associated with the client/server role of the communication, there is no way to really mandate enforcement. So, just don't mask messages and nothing will break.
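For reference, the mask flag is just the top bit of the second header byte. A minimal sketch of an RFC 6455 frame header builder (illustrative, not taken from any particular library):

    // Build a minimal WebSocket frame header (RFC 6455) for a binary message.
    // Browsers always set `masked` for client->server frames; a custom client
    // controls the flag itself.
    fn frame_header(payload_len: usize, masked: bool, mask_key: [u8; 4]) -> Vec<u8> {
        let mut h = vec![0x82u8]; // FIN = 1, opcode = 0x2 (binary frame)
        let mask_bit = if masked { 0x80u8 } else { 0x00 };
        if payload_len < 126 {
            h.push(mask_bit | payload_len as u8);
        } else if payload_len <= u16::MAX as usize {
            h.push(mask_bit | 126);
            h.extend_from_slice(&(payload_len as u16).to_be_bytes());
        } else {
            h.push(mask_bit | 127);
            h.extend_from_slice(&(payload_len as u64).to_be_bytes());
        }
        if masked {
            // The payload bytes must then be XOR'ed with this 4-byte key.
            h.extend_from_slice(&mask_key);
        }
        h
    }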
Fragmentation is completely unavoidable though. The RFC does allow for messages to be fragmented at custom lengths in the protocol itself, and that is avoidable. However, TLS imposes message fragmentation. In some run times messages sent at too high a frequency will be concatenated and that requires fragmentation by message length at the receiving end. Firefox sometimes sends frame headers detached from their frame bodies, which is another form of fragmentation.
You have to account for all that fragmentation from outside the protocol and it is very slow. In my own implementation receiving messages took just under 11x longer to process than sending messages on a burst of 10 million messages largely irrespective of message body length. Even after that slowness WebSockets in my WebSocket implementation proved to be almost 8x faster than HTTP 1 in real world full-duplex use on a large application.
simoncion 57 days ago [-]
> However, TLS imposes message fragmentation.
If one is doing websockets on the local machine (or any other trusted network) and one has performance concerns, one should maybe consider not doing TLS.
If the websocket standard demands TLS, then I guess getting to not do that would be another benefit of not using a major-web-browser-provided implementation.
CyberDildonics 57 days ago [-]
Actual reality beyond the fake title:
"using WebSockets over loopback was ultimately costing us $1M/year in AWS spend"
then
"and the quest for an efficient high-bandwidth, low-latency IPC"
Shared memory. It has been there for 50 years.
pier25 57 days ago [-]
Why were they using websockets to send video in the first place?
Was it because they didn't want to use some multicast video server?
dmazzoni 57 days ago [-]
They join a 3rd-party meeting using a browser.
Then they capture the video from the meeting in Chromium.
Then they need to send that captured video to another process for compression and processing.
No, WebSockets isn't the most efficient, but there aren't that many options once you're capturing inside a web page.
dilyevsky 57 days ago [-]
Not totally sure but they probably extract video via Chrome DevTools Protocol which uses WebSocket as transport.
IamLoading 57 days ago [-]
My surprise too, what's the issue with WebRTC again?
remram 56 days ago [-]
Since their overhead is memcpy, I don't see how WebRTC would help.
Dylan16807 57 days ago [-]
The title makes it sound like there was some kind of blowout, but really it was a tool that wasn't the best fit for this job, and they were using twice as much CPU as necessary, nothing crazy.
renewiltord 57 days ago [-]
That's a good write-up with a standard solution in some other spaces. Shared memory buffers are very fast too. It's interesting to see them being used here. Nice write up. It wasn't what I expected: that they were doing something dumb with API Gateway Websockets. This is actual stuff. Nice.
cosmotic 57 days ago [-]
Why decode to then turn around and re-encode?
pavlov 57 days ago [-]
Reading their product page, it seems like Recall captures meetings on whatever platform their customers are using: Zoom, Teams, Google Meet, etc.
Since they don't have API access to all these platforms, the best they can do to capture the A/V streams is simply to join the meeting in a headless browser on a server, then capture the browser's output and re-encode it.
MrBuddyCasino 57 days ago [-]
They‘re already hacking Chromium. If the compressed video data is unavailable in JS, they could change that instead.
pavlov 57 days ago [-]
If you want to support every meeting platform, you can’t really make any assumptions about the data format.
To my knowledge, Zoom’s web client uses a custom codec delivered inside a WASM blob. How would you capture that video data to forward it to your recording system? How do you decode it later?
Even if the incoming streams are in a standard format, compositing the meeting as a post-processing operation from raw recorded tracks isn’t simple. Video call participants have gaps and network issues and layer changes, you can’t assume much anything about the samples as you would with typical video files. (Coincidentally this is exactly what I’m working on right now at my job.)
cosmotic 57 days ago [-]
At some point, I'd hope the result of Zoom's code quickly becomes something that can be hardware decoded. Otherwise the CPU, battery consumption, and energy usage are going to be through the roof.
pavlov 57 days ago [-]
The most common video conferencing codec on WebRTC is VP8, which is not hardware decoded either almost anywhere. Zoom’s own codec must be an efficiency improvement over VP8, which is best described as patent-free leftovers from the back of the fridge.
Hardware decoding works best when you have a single stable high bitrate stream with predictable keyframes — something like a 4K video player.
Video meetings are not like that. You may have a dozen participant streams, and most of them are suffering from packet loss. Lots of decoder context switching and messy samples is not ideal for typical hardware decoders.
MrBuddyCasino 56 days ago [-]
This makes sense. I find it curious that a WASM codec could be competitive with something that is presumably decoded natively. I know Teams is a CPU hog, but I don't remember Zoom being one.
moogly 57 days ago [-]
They did what every other startup does: put the PoC in production.
ketzo 57 days ago [-]
I had the same question, but I imagine that the "media pipeline" box with a line that goes directly from "compositor" to "encoder" is probably hiding quite a lot of complexity
Recall's offering allows you to get "audio, video, transcripts, and metadata" from video calls -- again, total conjecture, but I imagine they do need to decode into raw format in order to split out all these end-products (and then re-encode for a video recording specifically.)
Szpadel 57 days ago [-]
My guess is either that the video they get uses some proprietary encoding format (JS might do some magic on the feed), or that it's a latency-optimized stream that consumes a lot of bandwidth.
a_t48 57 days ago [-]
Did they consider iceoryx2? From the outside, it feels like it fits the bill.
nottorp 56 days ago [-]
Wait but you can hire cheap kids to do it in javascript with web 5.0 technologies. Or pay a little more and have it done with web 7.0 tech.
Always. Always ?!?
Article summary: if you're moving a lot of data, your protocol's structure and overhead matters. A lot.
akira2501 57 days ago [-]
> A single 1080p raw video frame would be 1080 * 1920 * 1.5 = 3110.4 KB in size
They seem to not understand the fundamentals of what they're working on.
> Chromium's WebSocket implementation, and the WebSocket spec in general, create some especially bad performance pitfalls.
You're doing bulk data transfers into a multiplexed short messaging socket. What exactly did you expect?
> However there's no standard interface for transporting data over shared memory.
Yes there is. It's called /dev/shm. You can use shared memory like a filesystem, and no, you should not be worried about user/kernel space overhead at this point. It's the obvious solution to your problem.
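A minimal sketch of that approach, assuming the memmap2 crate and a made-up path; coordinating readers and writers still needs something like the atomic indices discussed elsewhere in the thread:

    // /dev/shm behaves like a filesystem, so two processes can share a frame
    // buffer just by opening the same path and mmap'ing it.
    // Assumes the `memmap2` crate; error handling trimmed for brevity.
    use std::fs::OpenOptions;
    use memmap2::MmapOptions;

    const FRAME_BYTES: u64 = 1920 * 1080 * 3 / 2; // one raw I420 1080p frame

    fn main() -> std::io::Result<()> {
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open("/dev/shm/raw-frames")?; // hypothetical path
        file.set_len(FRAME_BYTES)?;

        // SAFETY: this sketch has a single writer; real code needs a protocol
        // (e.g. a ring buffer with atomic indices) to coordinate access.
        let mut map = unsafe { MmapOptions::new().map_mut(&file)? };
        map[0..4].copy_from_slice(&[0xde, 0xad, 0xbe, 0xef]);
        map.flush()?;
        Ok(())
    }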
> Instead of the typical two-pointers, we have three pointers in our ring buffer:
You can use two back to back mmap(2) calls to create a ringbuffer which avoids this.
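A sketch of that double-mapping trick on Linux, assuming the libc crate; the size must be page-aligned and error handling is reduced to asserts:

    // Back a ring buffer with a memfd and map the same pages at two adjacent
    // virtual ranges, so reads/writes that run off the end of the first
    // mapping wrap to the start of the buffer automatically.
    fn ring_alloc(size: usize) -> *mut u8 {
        // `size` must be a multiple of the page size.
        let name = std::ffi::CString::new("ring").unwrap();
        unsafe {
            let fd = libc::memfd_create(name.as_ptr(), 0);
            assert!(fd >= 0);
            assert_eq!(libc::ftruncate(fd, size as libc::off_t), 0);

            // Reserve 2*size of address space, then map the same fd into both halves.
            let base = libc::mmap(
                std::ptr::null_mut(),
                2 * size,
                libc::PROT_NONE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            );
            assert_ne!(base, libc::MAP_FAILED);
            for i in 0..2 {
                let at = (base as usize + i * size) as *mut libc::c_void;
                let mapped = libc::mmap(
                    at,
                    size,
                    libc::PROT_READ | libc::PROT_WRITE,
                    libc::MAP_SHARED | libc::MAP_FIXED,
                    fd,
                    0,
                );
                assert_eq!(mapped, at);
            }
            base as *mut u8
        }
    }

With the same pages mapped twice, a write that runs past the end of the first mapping lands at the start of the buffer, so a producer can always copy a whole frame in one contiguous memcpy; that is presumably the wrap-around case the three-pointer design is working around.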
Scaevolus 57 days ago [-]
It's pretty funny that they assumed memory copying was the limiting factor, rather than the various websocket overheads, when they're pushing a mere 150MB/s around, and then jumped right into over-engineering a zero-copy ring buffer. I get it, but come on!
>50 GB/s of memory bandwidth is common nowadays[1], and will basically never be the bottleneck for 1080P encoding. Zero copy matters when you're doing something exotic, like Netflix pushing dozens of GB/s from a CDN node.
I agree with you. The moment they said shared memory, I was thinking /dev/shm. Lots of programming languages have libraries to /dev/shm already.
And since it behaves like filesystem, you can swap it with real filesystem during testing. Very convenient.
I am curious if they tried this already or not and if they did, what problems did they encounter?
anonymous344 57 days ago [-]
Well, someone will feel like an idiot after reading your facts. This is why education and experience are important. Not just a React/Rust course and then you are a full stack senior :D
pyeri 57 days ago [-]
WebSockets has become popular only due to this "instant" mindset. IRL, only a handful of messages or notifications need a truly real-time priority such as Bank OTPs, transaction notifs, etc. Others can wait for a few seconds, and other architectures like middleware, client-side AJAX polling, etc. could be both cheaper and sufficient there.
jjeryini 53 days ago [-]
You mentioned in the article that you went searching for an alternative to WebSocket for transporting the raw decoded video out of Chromium's Javascript environment. Have you also considered WebTransport?
ComputerGuru 57 days ago [-]
I don't mean to be dismissive, but this would have been caught very early on (in the planning stages) by anyone that had/has experience in system-level development rather than full-stack web js/python development. Quite an expensive lesson for them to learn, even though I'm assuming they do have the talent somewhere on the team if they're able to maintain a fork of Chromium.
(I also wouldn't be surprised if they had even more memory copies than they let on, marshalling between the GC-backed JS runtime to the GC-backed Python runtime.)
I was coming back to HN to include in my comment a link to various high-performance IPC libraries, but another commenter already beat me linking to iceoryx2 (though of course they'd need to use a python extension).
SHM for IPC has been well-understood as the better option for high-bandwidth payloads from the 1990s and is a staple of Win32 application development for communication between services (daemons) and clients (guis).
diroussel 57 days ago [-]
Sometimes it is more important to work on proving you have a viable product and market to sell it in before you optimise.
On the outside we can’t be sure. But it’s possible that they took the right decision to go with a naïve implementation first. Then profile, measure and improve later.
But yes, the whole idea of running a headless web browser to run JavaScript to get access to a video stream is a bit crazy. But I guess that's just the world we are in.
CharlieDigital 57 days ago [-]
> I don't mean to be dismissive, but this would have been caught very early on (in the planning stages) by anyone that had/has experience in system-level development rather than full-stack web js/python development
Based on their job listing[0], Recall is using Rust on the backend.
It's not even clear why they need a browser in the mix; most of these services have APIs you can use. (Also, why fork Chromium instead of using CEF?)
dmazzoni 57 days ago [-]
> It's not even clear why they need a browser in the mix; most of these services have APIs you can use
They have APIs to schedule meetings.
They don't have APIs that give you access to the compressed video stream.
Sesse__ 57 days ago [-]
Many of them do, they're just not something any random person can go and sign up for.
randomdata 57 days ago [-]
> rather than full-stack web js/python development.
The product is not a full-stack web application. What makes you think that they brought in people with that kind of experience just for this particular feature?
Especially when they claim that they chose that route because it was what was most convenient. While you might argue that wasn't the right tradeoff, it is a common tradeoff developers of all kinds make. “Make It Work, Make It Right, Make It Fast” has become pervasive in this industry, for better or worse.
whatever1 57 days ago [-]
Wouldn’t also something like redis be an alternative?
devit 57 days ago [-]
The problem seems to be that they are decompressing video in Chromium and then sending the uncompressed video somewhere else.
A more reasonable approach would be to have Chromium save the original compressed video to disk, and then use ffmpeg or similar to reencode if needed.
Even better not use Chromium at all.
ec109685 57 days ago [-]
They’re scraping video from VC services, so only have access to the frame buffer.
devit 56 days ago [-]
Presumably those services use Chromium codecs and perhaps standard WebRTC, so they have access to the encoded data stream as well (via JavaScript injection or patching Chromium to dump the data).
mannyv 57 days ago [-]
Why don't they just write the meeting output to HLS, then use the HLS stream as input to the rest of their stuff?
Cheaper and more straightforward.
Their discussion of fragmentation shows they are clueless as to the details of the stack. All that shit is basically irrelevant.
dmazzoni 57 days ago [-]
> Why don't they just write the meeting output to hls
They're capturing video from inside a Chromium process. How exactly do you expect to send the raw captured video frames to hls?
Are you proposing implementing the HLS server inside a web process?
mannyv 55 days ago [-]
No, what I'm saying is pipe the video output to an HLS encoder. HLS live will rewrite the m3u8 as more segments come in. In fact, they can render the audio into its own m3u8 and use that as input for their transcriber, saving even more bandwidth/data transfer/etc.
Since it's coming from a headless process, they can just pipe it into ffmpeg, which is probably what they're using on the back-end anyway. Send the output to a file, then copy those to s3 as they're generated. And you can drop the frame rate and bitrate on that while you're at it, saving time and latency.
It's really not rocket science. You just have to understand your problem domain better.
Shipping uncompressed video around is ridiculous, unless you're doing video editing. And even then you should use low-res copies and just push around EDLs until you need to render (unless you need high-res to see something).
Given that they're doing all that work, they might as well try to get an HLS encoder running in Chrome. There was just an MP3 codec in WebAssembly on HN, so an HLS live encoder may not be too hard. I mean, if they were blowing a million because of their bad design, they could blow another million building a browser-based HLS encoder.
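For what it's worth, the "pipe it into ffmpeg" idea looks roughly like this (a sketch under my own assumptions, not Recall's pipeline: ffmpeg on the PATH, raw RGB frames arriving on stdin, and illustrative resolution/encoder flags):

    use std::io::Write;
    use std::process::{Command, Stdio};

    fn main() -> std::io::Result<()> {
        // ffmpeg reads raw frames from stdin and emits HLS segments + playlist.
        let mut ffmpeg = Command::new("ffmpeg")
            .args([
                "-f", "rawvideo",           // input is raw, unframed pixels...
                "-pix_fmt", "rgb24",
                "-video_size", "1920x1080",
                "-framerate", "30",
                "-i", "-",                  // ...coming from stdin
                "-c:v", "libx264",
                "-preset", "veryfast",
                "-f", "hls",
                "-hls_time", "4",           // 4-second segments
                "-hls_list_size", "0",
                "out.m3u8",
            ])
            .stdin(Stdio::piped())
            .spawn()?;

        let stdin = ffmpeg.stdin.as_mut().expect("ffmpeg stdin");
        let frame = vec![0u8; 1920 * 1080 * 3]; // placeholder black RGB frame
        for _ in 0..30 {
            stdin.write_all(&frame)?;           // in reality: frames captured from the browser
        }
        drop(ffmpeg.stdin.take());              // close stdin so ffmpeg finalizes the playlist
        ffmpeg.wait()?;
        Ok(())
    }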
jgalt212 57 days ago [-]
> But it turns out that if you IPC 1TB of video per second on AWS it can result in enormous bills when done inefficiently.
As a point of comparison, how many TB per second of video does Netflix stream?
ffsm8 57 days ago [-]
I don't think that number is as easy to figure out as most people think.
Netflix provides caching hardware that ISPs can host, so Netflix can serve its content without saturating the ISPs' links.
There is a statistic floating around that Netflix was responsible for 15% of global traffic in 2022/2023, and YouTube for 12%. If that number is real... that'd be a lot more.
bauruine 57 days ago [-]
FWIW: The MTU of the loopback interface on Linux is 64KB by default
cperciva 57 days ago [-]
> We use atomic operations to update the pointers in a thread-safe manner
Are you sure about that? Atomics are not locks, and not all systems have strong memory ordering.
Sesse__ 57 days ago [-]
Rust atomics, like C++ atomics, include memory barriers (the programmer chooses how strong, the compiler/CPU is free to give stronger).
CodesInChaos 57 days ago [-]
> not all systems have strong memory ordering
Atomics require you to explicitly specify a memory ordering for every operation, so the system's memory ordering doesn't really matter. It's still possible to get it wrong, but a lot easier than in (traditional) C.
reitzensteinm 57 days ago [-]
It's still possible to incorrectly use relaxed operations, and have your algorithm only incidentally work because the compiler hasn't reordered them and you're on a CPU with a stronger memory model.
But yes, it's an order of magnitude easier to get portability right using the C++/Rust memory model than what came before.
jpc0 57 days ago [-]
> ... update the pointers ...
Pretty sure the ARM and x86 CPUs you would be seeing on AWS do have strong memory ordering, and have atomic operations that operate on something the size of a single register...
cperciva 57 days ago [-]
Graviton has weaker memory ordering than amd64. I know this because FreeBSD had a ring buffer which was buggy on Graviton...
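To make the ordering point concrete, here is a stripped-down sketch (mine, not the article's ring buffer) of the publish pattern an index update relies on. The Release store pairs with the Acquire load, so the payload written before the flag is guaranteed visible once the flag is observed; with Relaxed on the flag, this can fail on a weakly ordered CPU such as Graviton even though it will almost always appear to work on x86.

    use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
    use std::thread;

    static FRAME_LEN: AtomicU64 = AtomicU64::new(0);
    static READY: AtomicBool = AtomicBool::new(false);

    fn main() {
        let producer = thread::spawn(|| {
            // Write the payload first...
            FRAME_LEN.store(1920 * 1080 * 4, Ordering::Relaxed);
            // ...then publish it. Release guarantees the payload store above
            // is not reordered after this flag store.
            READY.store(true, Ordering::Release);
        });

        let consumer = thread::spawn(|| {
            // Acquire pairs with the Release above: once READY reads true,
            // the FRAME_LEN written before it is guaranteed to be visible.
            while !READY.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
            assert_eq!(FRAME_LEN.load(Ordering::Relaxed), 1920 * 1080 * 4);
            // With Relaxed on the flag instead, this assert could fail on a
            // weakly ordered CPU, even though it would usually pass on x86.
        });

        producer.join().unwrap();
        consumer.join().unwrap();
    }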
CyberDildonics 56 days ago [-]
A huge part of atomic operations in programming languages and modern CPUs (meaning in the last few decades) is updating pointers in a thread-safe manner.
Why would you think you need locks?
londons_explore 57 days ago [-]
They are presumably using the GPU for video encoding....
And the GPU for rendering...
So they should instead just be hooking into Chromium's GPU process and grabbing the pre-composited tiles from the LayerTreeHostImpl[1] and dealing with those.
One of the first parts of the post explains how they are using CPUs only
mbb70 57 days ago [-]
They are very explicit in the article that they run everything on CPUs.
isoprophlex 57 days ago [-]
You'd think so but nope, they deliberately run on CPU, as per the article...
yjftsjthsd-h 57 days ago [-]
> We do our video processing on the CPU instead of on GPU, as GPU availability on the cloud providers has been patchy in the last few years.
I dunno, when we're playing with millions of dollars in costs, I hope they're regularly evaluating whether they could run at least some of the workload on GPUs for better perf/$.
londons_explore 57 days ago [-]
And their workload is rendering and video encoding. Using GPUs should have been where they started, even if it limits their choice of cloud providers a little.
OptionOfT 57 days ago [-]
Did they originally NOT run things on the same machine? Otherwise the WebSocket would be local and incur no cost.
nemothekid 57 days ago [-]
> WebSocket would be local and incur no cost.
The memcpys are the cost that they were paying, even if it was local.
ted_dunning 57 days ago [-]
The article describes why this isn't the problem. You might enjoy reading it.
The basic point is that WebSockets requires that data move across channels that are too general and cause multiple unaligned memory copies. The CPU cost to do the copies was what cost the megabuck, not network transfer costs.
magamanlegends 57 days ago [-]
our WebSocket traffic is roughly 40% of recall.ai's, and our bill was $150 USD this month using a high-memory VPS
jgauth 57 days ago [-]
Did you read the article? It is about the CPU cost of using WebSockets to transfer data over loopback.
kunwon1 57 days ago [-]
I read the entire article and that wasn't my takeaway. After reading, I assumed that AWS was (somehow) billing for loopback bandwidth; it wasn't apparent (to me) from the article that CPU costs were the sticking point.
DrammBA 57 days ago [-]
> We set a goal for ourselves to cut this CPU requirement in half, and thereby cut our cloud compute bill in half.
From the article intro before they dive into what exactly is using the CPU.
calibas 57 days ago [-]
Why use Chromium at all? Isn't it just decoding video and sending it over a websocket?
nutanc 57 days ago [-]
A little off topic, but if you need a bot for a meeting, do you even need a meeting?
jazzyjackson 57 days ago [-]
I for one would like to praise the company for sharing their failure; hopefully next time someone Googles "transport video over websocket" they'll find this thread.
dbrower 57 days ago [-]
How much did the engineering time to make this optimization cost?
nicwolff 53 days ago [-]
This is very confusing naming:
> write pointer: the next address to write to
OK
> peek pointer: the address of the next frame to read
> read pointer: the address where data can be overwritten
What? If the "write pointer" is the "the next address to write to" then the "read pointer" had better be "the next address to read from".
The "peek pointer" should be the "read pointer", and the pointer to the end of the free sector should be the "stop pointer" or "unfreed pointer" or "in-use pointer" or literally anything else. Even "third pointer" would be less confusing!
beoberha 57 days ago [-]
Classic Hacker News getting hung up on the narrative framing. It’s a cool investigation! Nice work guys!
ahmetozer 57 days ago [-]
At least it's $1M in a year, not a week.
hipadev23 57 days ago [-]
what was the actual cost? cpu?
cynicalsecurity 57 days ago [-]
They are desperately trying to blame anyone except themselves.
ted_dunning 57 days ago [-]
Yes. CPU costs due to multiple memcpy operations.
cyberax 57 days ago [-]
Egress fees strike again.
ted_dunning 57 days ago [-]
No. That isn't what they said.
Read the article.
algobro 57 days ago [-]
Good flex. These days access to capital is attractive for talent.
yapyap 57 days ago [-]
> But it turns out that if you IPC 1TB of video per second on AWS it can result in enormous bills when done inefficiently.
that's surprising to... almost no one? 1 TB/s is nothing to scoff at
blibble 57 days ago [-]
in terms of IPC, DDR5 can do about 50GB/s per memory channel
assuming you're only shuffling bytes around, on bare metal this would be ~20 DDR5 channels worth
or 2 servers (12 channels/server for EPYC)
you can get an awful lot of compute these days for not very much money
(shipping your code to the compressed video instead of the exact opposite would probably make more sense though)
pyrolistical 57 days ago [-]
Terabits vs gigabytes
CyberDildonics 56 days ago [-]
Terabits vs gigabytes
What does this mean? The article says 'TB' which would be terabytes. Terabytes are made out of gigabytes. There is nothing faster than straight memory bandwidth. DDR5 has 64 GB/s max. 12 channels of that is 768 GB/s.
Terabytes per second is going to take multiple computers, but it will be a lot fewer computers if you're using shared memory bandwidth and not some sort of networking loopback.
blibble 57 days ago [-]
multiply 50 gigabytes * 20 and tell me what you get
pro-tip: it's quite a bit bigger than a terabit
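Spelling out the arithmetic with the numbers above (roughly 50 GB/s per DDR5 channel, and taking the article's 1 TB/s figure at face value):

    50 GB/s per channel × 20 channels = 1,000 GB/s ≈ 1 TB/s ≈ 8 Tbit/s
    one 12-channel EPYC socket ≈ 12 × 50 GB/s = 600 GB/s, so roughly 2 sockets' worth
    (and a memcpy both reads and writes, so each extra pass over the data costs
    roughly twice its size in memory bandwidth)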
mlhpdx 57 days ago [-]
It seems like UDP would be a valid option.
thadk 57 days ago [-]
Could Arrow be a part of the shared memory solution in another context?
remram 56 days ago [-]
No, why?
> Arrow libraries provide convenient methods for reading and writing columnar file formats
apitman 57 days ago [-]
I've been toying around with a design for a real-time chat protocol, and was recently in a debate of WebSockets vs HTTP long polling. This should give me some nice ammunition.
pavlov 57 days ago [-]
No, this story is about interprocess communication on a single computer, it has practically nothing to do with WebSockets vs something else over an IP network.
apitman 57 days ago [-]
Why do they claim their profile data showed that WebSocket fragmentation and masking were the hot spots?
pavlov 57 days ago [-]
Because they were sending so much data to another process over the Websocket.
An uncompressed 1920*1080 30fps RGB stream is 178 megabytes / second. (This is 99% likely what they were capturing from the headless browser, although maybe at a lower frame rate - you don’t need full 30 for a meeting capture.)
In comparison, a standard Netflix HD stream is around 1.5 megabits / s, so about 0.19 megabytes / second.
The uncompressed stream is almost a thousand times larger. At that rate, the Websocket overhead starts having an impact.
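The arithmetic behind those figures, for anyone checking:

    1920 × 1080 pixels × 3 bytes (RGB) × 30 fps ≈ 187 MB/s (≈ 178 MiB/s)
    Netflix HD: 1.5 Mbit/s ÷ 8 ≈ 0.19 MB/s
    ratio: roughly 1,000×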
apitman 56 days ago [-]
It should still have the same impact at scale, right? I.e., if I had a server handling enough WebSocket connections to be at 90% CPU usage, switching to a protocol with lower overhead should reduce the usage and thus save me money. This is of course assuming the system isn't I/O bound.
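On the masking hot spot specifically: RFC 6455 requires every client-to-server frame to be XOR-masked with a 4-byte key, so the sender touches every payload byte and the receiver touches it again to unmask. At chat-message sizes that's noise; at hundreds of megabytes per second it shows up in a profile. A minimal illustration of the operation (not Chromium's actual implementation, which is more optimized):

    /// Apply (or remove -- XOR is its own inverse) the RFC 6455 client mask.
    /// This is O(payload length): every single byte is read, XORed, written.
    fn mask_in_place(payload: &mut [u8], key: [u8; 4]) {
        for (i, byte) in payload.iter_mut().enumerate() {
            *byte ^= key[i % 4];
        }
    }

    fn main() {
        // One uncompressed 1080p RGB frame, as in the estimate above.
        let mut frame = vec![0u8; 1920 * 1080 * 3];
        mask_in_place(&mut frame, [0xde, 0xad, 0xbe, 0xef]);
        // At 30 fps this loop body runs ~186 million times per second per
        // stream, on top of the copies the socket itself already makes.
    }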