Next.js App Router + React Server Components Demo

new
past
show
ask
show
jobs
submit

▲Show HN: I turned my face rec system into a video codec (vertigo.ai)

518 points by jacobgorm 1250 days ago | 145 comments

akrymski 1250 days ago [-]

Well done for putting it out there!

We've worked on this about 3 years ago, plus background removal (realtime alpha-matting still not really done well by anyone), portrait re-lighting (G-Meet is now doing this) and even eye-contact (adjusting eye position to create the illusion of looking directly into the camera).

Some findings:

- Competing with dedicated H264/5 chips is very hard, especially when it comes to energy efficiency, which is something that mobile users ultimately care a lot about. Super-resolution on H264 is probably the way to go.

- It's hard to turn this into a business (corporates that pay for Zoom don't seem to care much).

PS Also a big fan of super-tiny AI models (binarized NNs, frequency-domain NNs, etc) for edge applications. Happy to chat!

jacobgorm 1250 days ago [-]

Thanks!

Wrt the speed, I worked very long and hard on finding the right NN architecture to do this without too much overhead.

My concern wrt super-resolution H264 is that you are going to have to encode and decode the full image anyway, so the cost should be very similar to doing encode-decode with network transmission in the middle. I've tried various DCT and DWT approaches, and yet not found them to be a win, but I'd be happy to learn what you guys found out.

I have sent you an invite to connect at Linkedin, I am https://www.linkedin.com/in/jacob-gorm-hansen-85b724/ if anybody else wants to connect there.

jamra 1249 days ago [-]

Are you sending occasional key frames and then just some points for a GAN to generate movement over the wire?

gnramires 1250 days ago [-]

Awesome!

A thought: now that neural compression is becoming widespread, it could be a good idea to put some kind of indicator or watermark stating the compression is neural (learned/function approximation in general). I think this would avoid liabilities and criticism around the fact that some weird things may appear (incorrect detail generation), maybe giving a wrong semantic idea. It may also be a good idea to put a mean squared error term in your objective function to help preserve general meaning.

danuker 1250 days ago [-]

> incorrect detail generation

Absolutely. Reminds me of Xerox number mangling:

https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

sitkack 1250 days ago [-]

Or when AI resolution enhancement inserts Ryan Goslings face.

https://petapixel.com/2020/08/17/gigapixel-ai-accidentally-a...

monkeybutton 1250 days ago [-]

Also white Obama! https://www.theverge.com/21298762/face-depixelizer-ai-machin...

gnramires 1250 days ago [-]

Interesting, and it may also indicate a way to address this issue with learning.

For example, you could train a network to give semantic image descriptions of significant features in the image, and maybe also transcribe text. Then you can include semantic preservation in the objective, or some kind of graceful degradation when semantic preservation isn't achieved.

newaccount74 1250 days ago [-]

> put some kind of indicator or watermark stating the compression is neural

That ship has sailed.

Smartphone cameras and laptop webcams already use machine learning algos to improve low light performance and noise. The result is that the images already contain details that are generated.

And it's impossible to turn off.

mchusma 1250 days ago [-]

You lost me at "patent pending". This idea has been obvious and in progress for a while now with lots of prior work. The issue is more the standards. Please don't sell this to a patent troll who will harass the industry for 20 years.

sounds 1250 days ago [-]

Seriously, this hits all the wrong points before sliding into "don't sell this to a patent troll."

"We, the armchair internet, have come to shout at you, innovative dude, that your ideas are worthless and that your interesting product is just fodder for patent trolls," ignores a few minor things -

• The video codec industry is absolutely crawling with overlapping patents enforced by shell companies whose only purpose is to extract license fees from every screen that ever existed. The USPTO seems to not care.

• This video codec was produced cleanly, and does not appear to overlap with any of the existing codecs.

I'd say give this smart person some credit.

latexr 1250 days ago [-]

Please don’t strawman. I see no evidence of shouting or calling the idea worthless in the root comment. Quite the contrary: if the poster had found it worthless, they wouldn’t be worried about it being sold to patent trolls.

Furthermore, according to the App Store what’s patent-pending is the algorithm that boosts eye contact, so comments on the codec don’t apply.

sounds 1250 days ago [-]

Not a strawman - I think I'm seeing a video codec. You're seeing a "boosts eye contact" video filter?

Can you clarify your statement? Maybe I missed what you meant.

riskable 1250 days ago [-]

> You lost me at "patent pending".

I had the same thought: As soon as I saw, "patent pending" I stopped reading. Not worth my time to learn about something that's going to be locked away. Talk to me in 18 years.

Patents on software and algorithms (aka "math") shouldn't even exist.

Kleto 1250 days ago [-]

Nvidia already showed this last year I think.

So whatever he is trying to patent, big companies already patent something.

lmc 1250 days ago [-]

> This idea has been obvious and in progress for a while now with lots of prior work.

Such as?

latexr 1250 days ago [-]

Apple’s Eye Contact feature for FaceTime, introduced in 2019: https://www.fastcompany.com/90372724/welcome-to-post-reality...

atleta 1250 days ago [-]

E.g.: "Neural Image Compression and Explanation" https://arxiv.org/abs/1908.08988

causalmodels 1250 days ago [-]

https://www.dpreview.com/news/5756257699/nvidia-research-dev...

atleta 1250 days ago [-]

Yep, it's indeed a trivial idea. I'd say I probably seen it mentioned in the explanation of convnets: they compress the images more and more as you go deeper in the network (i.e. they extract the features).

Compression+decompression is basically what an autoencoder does. See https://en.wikipedia.org/wiki/Autoencoder .

bambax 1250 days ago [-]

> Because it uses face rec, you can ONLY show your face, and if you disappear from view your audio will stop after a while, to avoid situations like when you need to go the the restroom but forget to mute.

Of course, the real killer app for Zoom calls is the opposite of this: some kind of deep fakery that makes it seem we're there when we're not.

Yet as it is, this is a fantastic idea. It's surprising video codecs deal so little with the nature of the images (AFAIK) and try to be generalists.

As this demonstrates, not all pixels are created equal.

pvillano 1250 days ago [-]

OBS allows you to create a virtual webcam, where you can use the live compositor to switch between, say, a looping video of you sitting still and your actual webcam. Smoke and mirrors could include a low frame rate and heavy compression to hide the transition

TranquilMarmot 1249 days ago [-]

Somebody tried this and had pretty good success

https://youtu.be/b-VCzLiyFxc

punnerud 1250 days ago [-]

If you have a helmet with a camera pointing at you, this could work.

Then it could look like you are at home in the meeting, but you are out in the forrest walking.

willy_k 1250 days ago [-]

Well if you just want to trick people into thinking you’re present when you aren’t, you can just use a video file as your camera feed. There are a bunch of tools to do this. Now, the main issue would be if you want to speak using this setup, but I guess one solution to that would be to use a model to lip sync your non-speaking video on-the-fly, which seems to be being discussed here[0].

[0] https://github.com/Rudrabha/Wav2Lip/issues/358

lostgame 1250 days ago [-]

>> Well if you just want to trick people into thinking you’re present when you aren’t, you can just use a video file as your camera feed.

I’m immediately reminded of the classic Simpsons episode where Homer plays a clip of him, Carl and Lenny; on a poorly-edited 5-second loop, of them clearly from 10-15 years before, overtop of the security cameras.

If faking webcam presence ends up being a thing, it’ll just be another bit of the future the Simpsons nailed.

telesilla 1250 days ago [-]

OBS virtual camera is another easy way. Load a video of you paying attention and have a quick key assigned to a scene of your live cam if needed.

billiam 1250 days ago [-]

>It's surprising video codecs deal so little with the nature of the images (AFAIK) and try to be generalists.

This. A small number of pixels matter, the rest don't.

gfodor 1250 days ago [-]

I have a prototype thing that creates a virtual webcam and drives an avatar off of your microphone. It's not photorealistic, but it works.

bambax 1250 days ago [-]

Please share!

cupofpython 1250 days ago [-]

>and if you disappear from view your audio will stop after a while

I sure hope this "feature" can be turned off in the settings

kordlessagain 1250 days ago [-]

Wouldn't this be an "auto mute"? I haven't tried it, but maybe it unmutes when it sees your face again.

jacobgorm 1250 days ago [-]

Yeah that is a feature, but I have plans for adding a button that disables it. I like not having to worry about muting my mic when I leave the camera, but there are times where you want the opposite.

Nowado 1250 days ago [-]

Now that's mixing up user and client.

kzrdude 1250 days ago [-]

Does it have any interesting "nonlinear" effects where it can show an entirely different face (the wrong face) based on misidentification or even adversary input?

These kinds of failure modes for "AI" are the most interesting to me.

It seems extremely smart - but don’t you think that to have success in a mass market product - think like MS Teams - it would need to be a combined solution. Where it both can do this for faces, efficiently, but also continues to work in a predictable way if I want to show an item/page from a book/my cat/my kid to other people in the call?

jacobgorm 1250 days ago [-]

It will not show another person's face, it does not (yet ;) possess that level of understanding of the image contents.

Yes I agree I have to work more on the business side of things, definitely on the lookup for for a business-savvy co-founder, hints at potential companies to partner with, etc.!

noduerme 1250 days ago [-]

also, what if you want to share dicks? It would be creepy if they all had eyes. There should be a mode for that where you drop into jitsi space

hyperdimension 1250 days ago [-]

Good lord. If it does, don't ever mention it to anyone on chatroulette. It's bad enough without that horror.

withinboredom 1250 days ago [-]

I bit my tongue off at one point in my life (jumped from a high height and my knee hit my chin as I was screaming). The fact that it captures most of the details of the scar where it was reattached is phenomenal. Majorly impressed.

There’s some weird banding/rainbow effects around my glasses and the background (not on my face), but that’s the only major artifact that stood out to me.

jacobgorm 1250 days ago [-]

Thanks!

Glasses are sometimes a little bit of a problem, I don't have enough of those in my training sets.

fao_ 1250 days ago [-]

I mean the obvious question here is... how many BIPOC (Black, Indigenous, People of Colour) do you have in your training sets?

samhw 1249 days ago [-]

Nah, it's "how many Black, Indigenous, People of Colour do you have who wear glasses and have facial scars from having fallen from a great height while screaming?". If you can't find enough preëxisting BIPOCWWGAHFSFHFFAGHWS people, I suppose you're limited to finding other BIPOC people, giving them glasses, and throwing them from a great height. (Manufacturing them the other way around might be too offensive.)

fao_ 1246 days ago [-]

My question was unironic, because to date there have been a large amount of issues resulting from black people using e.g. Zoom's autobackground, and it detecting them as the background, because black people were not featured in the training sets or considered when the AI was being constructed.

Likewise, many cameras do not properly pick up the skin tone of black, indigenous, people of colour. This is partly because of technological limits with respect to the number of F-stops available in commercial cameras. But also because there is a wide variety of human skin tones existing in the world, and camera manufacturers do not test for a majority of them.

Perhaps these papers speak louder for me, given that Hacker News only accepts neoliberal anecdotes :)

"Until recently, due to a light-skin bias embedded in colour film stock emulsions and digital camera design, the rendering of non-Caucasian skin tones was highly deficient and required the development of compensatory practices and technology improvements to redress its shortcomings"

https://cjc-online.ca/index.php/journal/article/view/2196/30...

"For a fixed decision threshold, the African-American image cohort has a higher false match rate and a lower false non-match rate. "

https://arxiv.org/abs/1904.07325

" The Gender Shades project revealed discrepancies in the classification accuracy of face recognition technologies for different skin tones and sexes. These algorithms consistently demonstrated the poorest accuracy for darker-skinned females and the highest for lighter-skinned males."

https://sitn.hms.harvard.edu/flash/2020/racial-discriminatio...

samhw 1245 days ago [-]

Ah, I see. Well diagnosed, by the way - I did indeed think you were being ironic, and was going along with the joke, as opposed to being antagonistic in the knowledge you were being sincere.

To respond in sincere mode: while I don’t think it’s terribly important whether black people are rendered correctly by some Zoom feature, nevertheless including a reasonable number of black people in one’s training data sounds like it shouldn’t require any extra effort, and so I think it’s a reasonable enough point to make.

moritonal 1250 days ago [-]

Firstly, this is pretty awesome, love it. I have a few questions:

* I applaud the work to have it run on tiny-bandwidths, how hard would it be to up the frame-rate to 60?

* How well does "framing" work? Are you able to add flexible amounts of padding around the head or is very focussed on a face taking up the whole canvas?

* How much does it "cheat". Is it firing only feature-maps so if I have a spot on my chin does it loose it in translation?

* How did you build the face-recogniser? Is it bespoke or a library?

* Is there a testing framework? Does it work on diverse faces?

jacobgorm 1250 days ago [-]

Thanks!

Wrt upping the frame rate the main problem is that the phone may run a bit hot, newer iPhones/iPads should be able to handle it just fine, but the older ones based on, say, the A10, might have trouble keeping up, especially with multiple remote parties connected.

* The framing depends on a transformation derived from the face landmarks, and the amount of padding is somewhat flexible. Distance from the camera seems to impact this, so it could be that my landmarks model needs some tweaking to work better when you are sitting very close to the camera.

* This is closer to being a general video codec than a face-generating GAN, so there is not a lot of "cheating" in that respect. It is optimized for transmission of faces, but other images will pass through if you let them (which I currently don't).

* I built the AI engine and the face recognizer etc from scratch, though with the help of a former co-founder who was originally the one training our models (in pytorch). The vertigo.ai home page has some demo videos. We initially targeted raspberry-pi style devices, NVIDIA Jetsons, etc., but have since ported to IOS and MacOS. Our initial customers were startups, mostly in the US, and a large Danish university that uses us for auditorium head counting.

* It empirically does seem to work on diverse faces, both in real life and when testing on for example the "coded bias" trailer. Ideally I would like to test more systematically on something like Facebook/Meta's "casual conversations" dataset.

samstave 1250 days ago [-]

>Danish university that uses us for auditorium head counting.

Just wait until you find out the Chinese have the same, but they train theirs for Uygher locating...

Yeah, these technologies are amazing, but also terrifying when viewed through the OBEY lens.

Jolter 1250 days ago [-]

What are you saying, that the Danish are selling the technology on to the Chinese? As if the Chinese government didn’t already have massively deployed facial recognition tech?

throwaway14356 1250 days ago [-]

for poor hardware a face generator with a set of mouth and eye states seems a good failback. It would be a huge difference if both hw and bw are bad.

espadrine 1250 days ago [-]

Eventually, a neural-net approach to video codecs is inevitable, as including high-level semantics is much more dense. I wonder about a few things though:

• How much of the 8.7MB of the app are the weights?

• Did you measure the energy consumption difference between this and H265? Especially considering Apple has hardware acceleration for this.

• Do you plan for a Web port as well?

• Is the performance envelope tied to CoreML, or has the Android version already been confirmed to have the same performance even without a Tensor chip?

• Do you have plans to address the enterprise market? How many participants could you scale to?

(I don’t think any of this would be a fundamental issue, but it could help frame the target market. Maybe phone conversations are not as marketable because of the limited battery, but daily team meetings with ~10 people could have adoption if a few features were added.)

jacobgorm 1250 days ago [-]

* The weights are currently around 6MiB uncompressed, but most of the networks can be sparsified to some extent, so that could be reduced somewhat. I also have a very fast sparse inference engine, but that is currently not in use, as the main win is on CPU, whereas I am mostly using GPUs for the NNs at the moment as it draws less power.

* I did not measure it methodically, but am always careful to not overheat the device when testing (XCode allows you to track this). My main testing device is an iPhone 11, and battery drain does not seem to be an issue compared with e.g. Zoom or Facetime. Where H265 currently wins is when you want to run in higher resolutions, but H265 is not available everywhere, say on slightly older iPads and there is no license included on Windows unless the customer pays separately.

* A WebGPU port would be nice, but I am currently waiting for the APIs to stabilize. If I can find some funding this will be a priority.

* I am not using CoreML but writing my own Metal compute shaders, but am using the "neural" parts of the Apple GPU through other APIs (MPS). I also have support for D3D and OpenCL, but have only tested the latter on teensy Mali GPUs, which at the time did not show impressive performance. On Android my approach would be to target Vulkan now that OpenCL is deprecated, I believe I have most of the plumbing in place, and speculate that things would work on modern mid-to-high end devices.

* When not cutting code, I am working on a plan for enterprise markets. Personally I have found the MacOS version really useful for pair-programming style scenarios, so that could be what I will be going after.

(The reason the MacOS version is still only in beta is because I hit a bug in AVFoundation where capturing from the web camera seems to be burning lots of CPU for absolutely no reason, and I don't want people to come away with the impression that it is my app doing that.)

Cadwhisker 1250 days ago [-]

This looks very similar to nVidia's method. Vertigo's website says this is patented, does nVidia have prior art here?

https://developer.nvidia.com/ai-video-compression

jacobgorm 1250 days ago [-]

NVIDIA's solution seems very expensive, using GANs. They synthesize more of the face, where ours is closer to learning a better general image codec by training on faces. I don't think they can run it on current generation edge/mobile devices. Also, our patent to the best of my knowledge does not overlap with theirs.

syrusakbary 1250 days ago [-]

WOW. This is amazing. I really believe your project can be game changing for the video-call industry.

Have you considered entering into the YC Program? I think it could be an awesome match. There are many startups I know they may want to take use of your service, and even fly.io is part of YC family!

Also, have you thought about open-sourcing it? (perhaps using a dual license could work great for an enterprise offering)

jacobgorm 1250 days ago [-]

Thanks!

I tried entering YC in the fall 2021 batch, and got to the top 10%. I believe my main problem wrt YC is that I currently lack a co-founder, so I did not apply in the Spring as this was still the case.

I am seriously thinking about open source, I believe for instance WebRTC found a good model with dual-licensing, where you have something like AGPL with the option of buying exceptions.

I have had multiple advisors telling me not to though, they fear it would scare away any potential investors ;)

syrusakbary 1250 days ago [-]

I entered YC as a solo-founder with Wasmer, so I think it might just be a circumstantial thing (they receive a lot of applications so it's always hard to judge who should enter with the limited time and data they have to make a decision). I would really encourage you to apply again!

In any case I'd love to help you on both aspects (YC application and OSS), I believe your idea has really great potential. Please ping me to syrus@wasmer.io and we can schedule some time to chat!

dicknuckle 1250 days ago [-]

Do what your heart feels is right.

nicr_22 1250 days ago [-]

Nice work, but you might find it's not super unique - video codec people have been thinking about how to apply face recognition ML tech to this use case for 5+ years.

For instance, have you seen https://developer.nvidia.com/maxine ? They released some pretty nice demos 2 years ago.

jacobgorm 1250 days ago [-]

Their approach is more heavy-weight as it uses GANs (IIRC) to dream up a reconstruction of your face. They need GPU VMs in the cloud, whereas mine runs on device.

samstave 1250 days ago [-]

>*whereas mine runs on device*

$$$

This IS the killer feature.

Now make a face recog PI (as you stated you tried) -- or a cheap Android Phone which best serves HW(gpu) for your needs and you have solved some complex surveillance matters.

Ventito 1250 days ago [-]

I'm not sure if the original post is the same as this: https://developer.nvidia.com/ai-video-compression

As maxime can do more like live translation.

bsenftner 1250 days ago [-]

Very cool work. I'd love to sit down and talk with you, jacobgorm. I spent 7 years in FR after failing my startup working on Personalized Advertising, which was based on Automated Actor Replacement in Filmed Media. The VC/Angel world wanted the startup to pursue deep fake pornography, but I refused, and ultimately went bankrupt. However, I managed to globally patent the actor replacement technology, create an automated digital double of real people pipeline, as well as get really deep into face ML/DL. That's how I ended up the principal engineer for the #3 FR company in the world for 7 years. I have since left FR, and am CTO of an AI-integrated 3D media production pipeline (I have a long games industry history). From the information in your post, it sounds like we are both on similar trajectories. It would be interesting and potentially synergistic if we met.

jacobgorm 1250 days ago [-]

I'd love to talk. Could you ping me at jacob@vertigo.ai?

noduerme 1250 days ago [-]

This is an amazing idea and I can't wait to try it. Just one question...

>> This also solves dick-pics etc.

Is this a problem on zoom meetings, for people other than Jeffrey Toobin?

Jolter 1250 days ago [-]

Maybe not on Zoom but I’ll bet it is on FaceTime and Messenger video chats.

nonrandomstring 1250 days ago [-]

Well done on creating a useful codec. One specifically optimised for face data seems apropos the emerging demand for more intimate remote interaction. Many teachers, therapists, doctors and social workers who conduct remote sessions rely on clear non-verbal signalling and need to read affect.

But the story has a deeper meaning for me (because of the book I am writing). You switched from street face surveillance (an arguably highly unethical application) to more intimate videoconferencing (a more humanistic and socially beneficial end).

May I ask you in all sincerity, what if any ethical considerations played a part in your change of direction?

I suspected from the title to read at least some sub-text that you turned your back on mass-surveillance to find a "better" use for your work. But you express no value judgements and only really mention that the pandemic took away potential targets.

jacobgorm 1250 days ago [-]

The ethical considerations played a large part of the pivot. I would rather help people communicate than surveil them.

That said, we build embeddable/edge face rec because we could, and I believe our partners who used it in the real pre-pandemic world found some very innocuous uses for it. In one case we replaced a system running all the faces through Rekognition with one running purely on devices and not storing any long-term data, which I think was an ethics win overall.

UncleEntity 1250 days ago [-]

> You switched from street face surveillance (an arguably highly unethical application) to more intimate videoconferencing (a more humanistic and socially beneficial end).

Or that makes it easier to identify individuals as they give consent to have their face ‘fingerprinted’ as part of the app’s EULA.

If I were going to sell a mass-surveillance solution I’d certainly want to have the ability to identify individuals without having to scrape all of Facebook or whatever. As much as people hate on apple they do make it so carrying around one of their phones doesn’t make it easy for someone to identify you.

I, for one, would think twice about installing an app from someone who “pivoted away” from their Orwellian surveillance unicorn dreams.

jacobgorm 1250 days ago [-]

Hi, we don't collect any data from the app, and have filled in the privacy etc. statements on the App Store accordingly.

Ideally I would like to collect faces to train the compression on, in which case we would have to consult with lawyers to come up with an EULA allowing us to do this. The advantage compared to using broadly available datasets to train on would be more realistic shot noise, low light images, and so on. I don't see any other valid business purpose of collecting people's faces.

We've been sitting on the face recognition tech since 2018, so if we'd wanted to become Clearview.ai we probably would have a long time ago.

UncleEntity 1250 days ago [-]

> We've been sitting on the face recognition tech since 2018, so if we'd wanted to become Clearview.ai we probably would have a long time ago.

It says right at the beginning of the post you were doing quite well until the pandemic shut down businesses.

I try not to be overly critical (I really do) but this is one of those cases I just can’t help myself, I see no reason individual businesses should be running facial recognition on their customers and am kind of wary of someone who would enable that. And cities adding it to their collection of public cameras is beyond wrong.

IDK, somewhere we, as a society, decided 1984 was an instruction manual and not a warning…

jacobgorm 1250 days ago [-]

There is a difference between doing well and becoming China in terms of surveillance. Most of our revenue were from just plain head counting and from tracking cars in a smart city project.

The one face rec system we actually sold was used to measure waiting times in a retail setting, and replaced an existing system that was using AWS Rekognition in the same manner, except with all the video footage going to the cloud. That license has long expired and the system is no longer running.

In any case, what is stated in the app's privacy statement is what we are doing. At the moment I don't even collect the IP addresses of users connecting.

nonrandomstring 1250 days ago [-]

Thanks for your honest answers Jacob, and the good spirit to do so in public. This kind of discussion really helps with the topics I am researching (and btw if I mention this or anyone else from HN in the text it will always be in a non-identifying way). Good luck with your project.

iforgotpassword 1250 days ago [-]

This looks awesome. Looking forward to a non-Apple version to try it out myself. Great idea to solve some of the issues of video conferences. I assume it also upscales people when they move away further from the camera, as some sort of welcome side effect. So you only need to be close to the camera once and then make yourself more comfortable a little farther away, or even with suboptimal lighting.

One thing that struck me as odd on the page is that the H265 still looks considerably worse than H264, despite being the better codec and being larger. What's up with that?

jacobgorm 1250 days ago [-]

In general, H265 is better than H264, but when you really squeeze it it seems to fall over its own feet. This is measured against ffmpeg's x265 implementation, the hardware-accelerated version on the iPhone will just refuse to go down to that bitrate.

londons_explore 1250 days ago [-]

How do you deal with network weights versioning?

I assume the version that does the compressing and decompressing needs to match? And if you release an update and half the users install it, this is a problem?

Do you have some mechanism to dynamically download and update weights to ensure that all users in a call at least have a common version of the network? Or will you just globally require all users to update before joining a call? (which in turn means every time you release an update, all calls must end, which isn't very enterprise-friendly)

jacobgorm 1250 days ago [-]

Good question. The protocol is versioned, so you will be prompted to upgrade if/when I change the network doing the encoding/decoding. Downloading new weights on the fly should not be a problem (Apple would allow it as long as I don't change the code), but in many cases when evolving the protocol I've had to make changes to other parts of the code too, so not sure if it will be worthwhile.

londons_explore 1250 days ago [-]

Thinking about it... The obvious solution is to make every version support the current protocol and the previous protocol. You might sometimes have to downgrade a call sometimes if someone joins with an older client.

Then anyone can join any call unless they are 2 or more versions behind.

londons_explore 1250 days ago [-]

One benefit of lower bandwidth is you have the possibility of reducing the glass-to-glass latency, since network queues will be less crowded.

But with this you have the downside of more milliseconds spent compressing and decompressing the frames.

Do you have any indication which effects dominate in typical 4G/wifi environments?

jacobgorm 1250 days ago [-]

I don't really have a good answer, except that I don't do any frame batching that would cause delays, and that my UDP protocol ships the frames (and audio) out as soon as they are ready, as to reduce latency to a minimum.

parentheses 1250 days ago [-]

What’s funny is I’ve had this very idea but no AI skills or extra time to pull it off. Bravo!!

capableweb 1250 days ago [-]

Yet another example that having ideas is worth close to 0, while being able to execute on your own or others ideas is worth > 0 :)

mromanuk 1250 days ago [-]

Yes. It's weird and forced when people tell you that ideas are super powerful, there is a good book "Made to stick" which is really cool. There's a section where they look at JFK's "get to the moon and back safely" as the driving force behind the whole thing. Thousands or millions before him, had the same idea, even the URSS was planning a trip. Even the travel is portrayed in a highly popular 1902 film. What set apart the moon landing of 1969 (by USA) was the execution.

jacobgorm 1250 days ago [-]

That is where I started ~5 years ago :-) Thanks!

losvedir 1250 days ago [-]

Really? Impressive! Can I ask how you went about learning it all then? Any books or online courses you can recommend?

jacobgorm 1250 days ago [-]

I learned by joining an early AI startup with some co-founders who knew about old-school AI (but didn't believe in backprob!), and then reading absolutely every ML paper I could find, following AI hotshots on Twitter, reading the darknet source code, and experimenting with pytorch.

Eventually two of us left to start Vertigo.ai, and found a customer who would fund a fast object detector to run on a $18 Nano-PI. That was a fun challenge and forces me to think about how to make the AI run fast and with relatively low footprint.

Today fast.ai might be a good starting point, definitely recommend going with pytorch, cloning cool projects from github, and going from there.

lyind 1250 days ago [-]

The future alternative to watching TV: run an AI writer/director/actor pipeline

daenz 1250 days ago [-]

Very cool, but where is the video demo?

ant6n 1250 days ago [-]

Very interesting! I’m getting more like 400-600 kbit/s, maybe too much beard and long hair.

The face boxing seems very aggressive, I feel like I’m trapped in the virtual prison of some 90s dr who episode.

jacobgorm 1250 days ago [-]

Try moving a bit further away from the camera, and placing the device on a steady surface. It might help.

thinkski 1250 days ago [-]

Very cool. Out of curiosity, why is the H.265 size slightly more than the H.264 size? How does the compute complexity for encoding and decoding compare with those two codecs?

jacobgorm 1250 days ago [-]

I got these results by using Vertigo's bitrate as the target, and squeezing the other ones until they got as close as possible to that. H265 is in general better than H264, but when you put the thumbscrews on it seems to get itself into a bit of trouble.

Wrt encoding/decoding complexity this is the major bottleneck, because you have the GPU competing with custom ASICs. I have a version of the codec that works in combination with H265, but still gets largish bandwidth gains, so if all you wanted was an insane hi-res and hi-bitrate transmissions that might be the way to go near-term.

ricochet11 1250 days ago [-]

this is really smart, one of those ideas that seems so obvious but i'd have never thought of it. i think the content moderation angle is pretty interesting to expand upon, a lot of livestream platforms have "if x is detected stop the stream" but this idea of make it impossible to show x in the first place would be much cheaper if it can be expanded enough for the relevant domains.

mateo1 1250 days ago [-]

I believe that algorithms reconstructing your face or part if it (ie facial expressions) are already in use, it's just not advertised.

sam_bristow 1250 days ago [-]

This reminds me of an idea I had about 15 years ago but never pursued. The concept was using basic object detection as an input as a first pass to to a standard video codec to guide where it should spend it's data budget. So, for example, a TV news broadcast could put more details on the host's eyes and mouth while a parking garage camera would be trained to get clearer number plates.

andai 1250 days ago [-]

Tried it out on my phone, app crashes every time I press "go to room". I'm using iPhone SE from 2016 so maybe my phone is too old for the GPU features? Alas! Was looking forward to trying it.

As a side note the UI looks like a toy or joke app. I'm not sure what market you're going for (it seems like a general purpose video chat app?) but you might want to reconsider the aesthetic.

jacobgorm 1250 days ago [-]

Could you post the exact specs? I've tested on iPhone 6s until recently and it used to work, but it could be I am doing something silly.

andai 1250 days ago [-]

I believe the 6s and (original) SE had the exact same internals. Here is a comparison: https://www.gsmarena.com/compare.php3?idPhone1=7969&idPhone2...

I haven't upgraded to iOS 15 though, I'm on 14.4, maybe that's the issue?

jacobgorm 1249 days ago [-]

I've gotten a repro of that issue on an A9 IPad (happened to be on IOS 14.x too, but deemed not relevant), something silly to do with Metal Language versions that was easily fixed. A new build that also fixes the UI issues others have reported is on its way out, please email me at jacob@vertigo.ai if you want beta access via Testflight. Should be GA within 1-2 days.

londons_explore 1250 days ago [-]

Can we have a video demo rather than a still image?

savolai 1250 days ago [-]

Hi. I’m on iphone XS. Could you help me understand what the buttons on the right do?

https://imgur.com/a/r8SbYfp

Also not sure - is the URL so that this can be used in browser for the other party, or - i am assuming - it’s just for accessing in the app on the other end too?

Thanks. Inspiring!

jacobgorm 1250 days ago [-]

Sorry about that, I've gotten that bug reported from a few people today, not sure what is causing SwiftUI to scale things like that, looking into it. I think it has more to do with display/font settings than the exact phone model.

Until I can get a fix out, a workaround is to rotate the device to landscape mode, you should be able to read the text on the buttons that way.

The text on the buttons to the right of the QR should say: * Clear * Copy URL * Copy QR

jacobgorm 1249 days ago [-]

I believe that issue has been fixed now, thanks for reporting it. Just waiting for app store review...

CTDOCodebases 1250 days ago [-]

Who would have thought anti dick pic technology would become a product feature but here we are in 2022.

robertlagrant 1250 days ago [-]

Stock ticker: NOTHOTDOG

daniel_iversen 1250 days ago [-]

Cool, but just tried to try it and if I send the “room” (or whatever) url from the front page by copying and pasting it into an iMessage to my wife, when she clicks it it says the page does not exist. Just as an fyi. Same if I recreate the url, use your “copy” buttons or even get her to scan the qr code.

jacobgorm 1250 days ago [-]

That is a known and very annoying problem, I think killing and restarting iMessage might help.

I am registering the correct URL handler for the app, but it seems to not always work immediately.

ge96 1250 days ago [-]

Curious if you were to compare this with standard webRtc and say Tensor Flow JS face landmark detection running against the mediastream, what is the difference?

edit: oh the "recognize your face" and compression I see (referring to Nvidia link someone else posted) wow

TruthWillHurt 1250 days ago [-]

This is awesome. Reminds me of what Nvidia did with facial-puppetry to reduce bandwith, but while theirs was just a POC, you're amazing for making it actually available.

Looking forward to Linux port!

1250 days ago [-]

wfme 1250 days ago [-]

This is super cool. The UI on the app could do with a bit of work - it’s easy to use but the text in the buttons next to the QR code wrap and cut off.

The actual video works well. kudos!

jacobgorm 1250 days ago [-]

I agree the UX needs more work.

swayvil 1250 days ago [-]

Neural compression has philosophical implications.

The map may not serve understanding. It may serve economics.

The cheap videophones make everbody look like Keanu Reeves.

tr33house 1250 days ago [-]

Really cool idea. Unfortunately I can't try it because app crashes for me. I'm guessing it's because of the HN crowd

Semaphor 1250 days ago [-]

From another comment:

> The rooms are hosted on a Raspberry Pi 3 lying on the floor of my office.

Hah, probably :D

bsnal 1250 days ago [-]

Do you have any benchmarks? Because this looks like it will be unusable on the average $300 windows laptop.

jacobgorm 1250 days ago [-]

I currently have (the prototype port of) it running on my T14-2 (11th gen i5) and my 2018 Mac Mini without problems, and have it confirmed to work back to 2015 macs, but older laptops may suffer. The main problem the cheap laptops (and even my quite expensive T14) have is that the cameras are absolutely horrible compared to anything Apple would ship.

chayesfss 1250 days ago [-]

Didn't take a look yet but will, very cool. Where are the meetings rooms hosted?

jacobgorm 1250 days ago [-]

The rooms are hosted on a Raspberry Pi 3 lying on the floor of my office. With the traffic it seems to be getting I will move it to a cloud server soonish.

It should do peer-to-peer for the majority of connections though, the server just does the initial hand-shake.

vletal 1250 days ago [-]

What about battery life on iPhones? Is the hardware utilised efficiently?

jacobgorm 1250 days ago [-]

Most of the heavy lifting is done on the GPU, which runs at a lower clock rate and uses less energy than the CPU. The neural nets used are fairly light-weight. I think the power consumption is similar to a Zoom call.

bobthehomeless 1250 days ago [-]

makes me think if this kind of algorithm would work with anything, not just faces. Using something like general object recognition or DeepDream as start point.

quanto 1250 days ago [-]

Great work!

Also, where can I learn about your edge AI smart camera system?

jacobgorm 1250 days ago [-]

https://vertigo.ai/sensoros/ has some info, but you probably read through that already. There is a public github repo, but I think it needs some explanation to be useful. You are welcome to ping me at jacob@vertigo.ai.

julienfr112 1250 days ago [-]

This is amazing !

jacobgorm 1250 days ago [-]

Glad to hear it :)

mutant 1250 days ago [-]

I'm IOS-less is there a video demo?

ReactiveJelly 1250 days ago [-]

> I uses real time AI

Should be "It uses"

Only works on iOS?

timeimp 1250 days ago [-]

impressive-very-nice.gif

This is just a phenomenal idea - I hope your patent is approved too!

jacobgorm 1250 days ago [-]

Thanks! :)

poniko 1250 days ago [-]

Impressive work, well done!

jacobgorm 1250 days ago [-]

Thank you! :)

st3ve445678 1250 days ago [-]

Does it leverage middle out compression?

jacobgorm 1250 days ago [-]

We're working on that part.

ochronus 1250 days ago [-]

Wow! Congrats, amazing work!

jacobgorm 1250 days ago [-]

Thanks!

sydthrowaway 1250 days ago [-]

gamechanger

spicyramen_ 1249 days ago [-]

axiosgunnar 1250 days ago [-]

bhaval 1249 days ago [-]

samstave 1250 days ago [-]

This codec needs to be implemented in Tesla Cars.

Full Stop.

jacobgorm 1249 days ago [-]

If anyone can hook me up with the right people I'd love to do it :-). Not sure if it's legal to use it that way, but doing video calls with my app while driving with the phone in a holder on the dashboard already works really well.

varispeed 1250 days ago [-]

> is a new from-the-ground-up patent pending

What is the invention? The models are just complex mathematical formulas and these cannot be patented.

jacobgorm 1250 days ago [-]

The patent that I filed is not around the models, but in how it boosts the parts of the face most relevant to face-to-face conversations.

I am not a super fan of patents, but for background please consider that Asger Jensen and I could have patented VM live migration in 2002 and chose not to for idealistic reasons, just to see VMware do it.

quickthrower2 1250 days ago [-]

Would be interesting if a foundation with a charter could own a a patent, to prevent later trolling.

jazzyjackson 1250 days ago [-]

If that were the case, Google would not own a patent on PageRank and we wouldn't have to bother with open source audio/video codecs.

jhgb 1250 days ago [-]

> Google would not own a patent on PageRank

Well, in my country they actually don't. YMMV as per one's location.

kleer001 1250 days ago [-]

I love the idea and I hope it succeeds.

Only one small bit of cosmetic feedback. Maybe think about hiring a face model to work your demo. I personally don't care, but I think it might improve your optics. And yes, I understand it's a tech demo. But booth babes were a thing (are they still a thing?) for good reason, grab those eyeballs, yo.

samstave 1250 days ago [-]

Babes are always a thing. Stop objectifying Babes.

kleer001 1250 days ago [-]

Which is it?

Rendered at 13:42:46 GMT+0000 (Coordinated Universal Time) with Vercel.