I've been working with scrapers quite a lot. I started with python requests, then moved to scrapy, then selenium, then selenium via undetected_chromedriver, and once that started being detected after a chrome update about a year ago, I switched over to seleniumbase. It got by undetected, but to get it working with pre-downloaded drivers, I had to look into the code. I have never, and I mean never, in all my python years, seen such a horrible mess of code. We are talking 1000-line-long methods, with 20-30 different flags and branches. Just horrible. I have since switched to Playwright, which also seems to be undetected, and offers a much saner interface.
seleniumbase 3 days ago [-]
SeleniumBase modifies the webdriver so that it doesn't get detected when used alongside the CDP stealth mode and methods. It'll download chromedriver for you. Not sure what you mean by the multiple branches, as there's just the primary one. What 1000-line methods are you referring to? By "flags", do you mean the different command-line options available? As for Playwright, they aren't undetected: See https://github.com/microsoft/playwright/issues/23884#issueco... - "Playwright is an end-to-end testing framework, where we expect you test on your own environments. Bypassing any form of bot protection is not something we can act on. Thanks for your understanding." On the contrary, SeleniumBase is OK with bypassing bot detection: https://github.com/seleniumbase/SeleniumBase/blob/master/exa...
cyanmagenta 3 days ago [-]
Not the commenter, but “multiple branches” in this context is referring to if/else statements in the code, not source-control branches. Similarly, “flags” is referring to function arguments like a boolean “is_original.” More generally, they are just saying that the code has long, complicated, bug-prone functions.
That said, I just spent a few minutes browsing the SeleniumBase repo, and honestly it didn’t seem that unusual to me. Would be interested in seeing a specific example of what the commenter had in mind.
That's not amazing code but that's not that bad. In the grand scheme of things, that's not code debt that would ever seriously make my life any harder.
TeMPOraL 2 days ago [-]
Yup. At least it's self-contained and easy to step through and modify if something breaks or needs to be changed.
And, as my previous PM would point out, even the copy-pasting and verifying no mistakes were made was a solution that took a fraction of the time a modern "clean" approach would. She had a point; as much as I'm against writing this simple code in the general case, plenty of devs tend to err towards overcomplicating solutions when given a chance.
I mean, the modern, proper, Clean Code™ solution would have this split into multiple files (not counting tests), and across two or three abstraction levels. I've seen this happen enough that I can tell I'd much prefer working with code like this capabilities parser (and hell, it can be beaten into near-perfection in an hour or three).
the_real_cher 2 days ago [-]
Amen!
I think the more experienced you get in coding, the more you appreciate straightforward code you can immediately look at and understand.
Call it "legacy code" if you'd like. That specific part is from a less common feature for setting options when running on a Selenium Grid. The new CDP Mode isn't compatible with The Grid (since CDP Mode makes direct CDP API calls without making Selenium API calls).
MstWntd 3 days ago [-]
it's always easier for people today to look at the work of other people in the past and draw stupid conclusions.. don't mind them..
the_real_cher 2 days ago [-]
It's not really bad though.
It's clear, it's intuitive, it's easy to understand at first glance, it's a single-purpose function, it's easy to step through.
You don't have anything to defend here.
bryanrasmussen 2 days ago [-]
Maybe I am just a cynic but I would expect Playwright to be detected when using Chrome, I mean I would expect it was to the benefit of Google to make that happen for the sake of making reCaptcha detect bots better.
That's actually why I've been scrapping my Playwright automation (because I expect I will encounter problems even if it hasn't happened yet, cynical and paranoid) and moving towards writing a browser extension to automate Firefox.
Basically my use case is automating tedious things for myself, not running bots at scale, so that's why it is imperative not to get caught being "not human", because then I risk account problems.
robertlagrant 2 days ago [-]
How can Google make that happen? Playwright's made by Microsoft. It can use Firefox as a browser as well as Chrome.
pryelluw 3 days ago [-]
Enterprise Python code. Somehow ends up being worse than Java enterprise code. I’m too used to it at this point.
seleniumbase 2 days ago [-]
The "Python vs Java" debate is probably one for a different Hacker News post. :)
pryelluw 2 days ago [-]
I meant that some of the code reminds me of enterprise python. The kicker is that code that works > pretty code. People here act as if ugly code is somehow lesser just because it’s ugly. Meanwhile there’s a lot of ugly code making millions of dollars.
Didn’t mean to bash your project. Sorry if it came across that way.
seleniumbase 2 days ago [-]
It's OK. No offense was taken. It almost looked like the conversation was expanding into a "Python vs Java" debate, but (thankfully) it did not. I've seen both worlds. I've seen advantages to both. I decided to stay in the Python world.
pryelluw 22 hours ago [-]
Same. Although enterprise python is akin to wrestling a boa constrictor.
edm0nd 3 days ago [-]
Not sure if you have explored rolling captcha solving services into your code. It's easy as fuck and you can do it in a few lines of code. Check out DeathByCaptcha or AntiCaptcha. It's like $2.99 per 1,000 successfully solved captchas.
I guess my point is, you don't always have to be undetected or write 1000 lines of code to scrape or do whatever you need to do. Saved me a ton of headaches and time when captchas are involved.
mintzworld 3 days ago [-]
SeleniumBase is free, open-source, can bypass CAPTCHAs with a few lines of code, and it works from the free tier of GitHub Actions.
edm0nd 3 days ago [-]
It can't bypass all captchas, and that's what I'm talking about.
That patches chromedriver, (which gets renamed to uc_driver), but patching by itself isn't enough to bypass bot-detection. SeleniumBase also sets specific Chrome options and modifies methods to use the Chrome Devtools Protocol.
lyu07282 2 days ago [-]
I was more astonished that you could just search and replace a string in a PE/ELF binary without breaking everything, but I take your solution over recompiling chrome anytime. Awesome job, very well done!
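A rough illustration of why a search and replace inside a compiled binary can work without breaking everything: if the replacement has exactly the same length as the original string, no byte after it moves, so every offset the loader relies on stays valid. This is a minimal sketch on a toy byte string, not the actual patching code of any project; the marker `marker_abc` and its replacement are made up for the example.

```python
def patch_same_length(blob: bytes, old: bytes, new: bytes) -> bytes:
    """Replace every occurrence of `old` with `new` without changing the size."""
    if len(old) != len(new):
        # A different length would shift everything after the patch point
        # and invalidate offsets stored elsewhere in the file.
        raise ValueError("replacement must be the same length as the original")
    return blob.replace(old, new)


# Toy stand-in for a binary: header bytes, an embedded string, padding, a trailer.
fake_binary = (
    b"\x7fELF\x02\x01" + b"\x00" * 8 + b"var marker_abc = 1;" + b"\x00" * 4 + b"END"
)

# Hypothetical marker and replacement, both 10 bytes long.
patched = patch_same_length(fake_binary, b"marker_abc", b"xyzzyx_abc")

# The trailer sits at the same offset before and after patching.
print(fake_binary.index(b"END"), patched.index(b"END"))
```

Real binaries can still break for other reasons (checksums, code signing, the string being referenced by value elsewhere), so the same-length rule is necessary rather than sufficient.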
seleniumbase 2 days ago [-]
Thank you!
theanonymousone 3 days ago [-]
Is it demonstrably better than Playwright in bypassing Cloudflare measures? I have some scraping projects and the "cat and mouse game" (what's the right expression here?) took so much energy that I finally went with an external dedicated scraping service. It doesn't feel right that some scrapers are considered friendly (e.g. Google), while smaller ones are vilified.
Is there a reason why the crawling and browser automation people don't just patch the browser to be controlled with no possibility of detection?
The web page is heavily restricted in what it can access through various interfaces and you can feed it anything you want by patching the browser. Once you do that the problem becomes just simulating a legitimate user to a sufficient degree.
I wonder if that's what's already happening with CDP and ReCAPTCHA and hCaptcha - the two services mentioned that are strong and a problem. Are they detecting the "Stealth" or is it just the lack of user activity and reputation? Is CDP by itself detectable by some means?
seleniumbase 2 days ago [-]
Patching chromedriver is a lot easier than patching the browser. Plus, if you're just using a regular Chrome browser for the automation, then there's nothing to patch. Automated CDP calls aren't detectable if they don't leave any trace of automation activity. However, since Google created CDP, they might have ways of detecting automated CDP in ways that other services cannot.
coppsilgold 2 days ago [-]
What about faking mouse movement from inside the browser? PyAutoGUI is not the right way to be doing this for interacting with JavaScript that has no hope of interrogating user operating system GUI interactions.
And it seems like it would be important to try and adopt user-like mouse movement since JavaScript has access to this information.
mintzworld 2 days ago [-]
PyAutoGUI is the optimal tool for clicking things inside of closed shadow-root elements, which are hidden to JavaScript. Can use CDP for clicking other elements.
ghxst 2 days ago [-]
The reason in my experience is that there's a high barrier of entry for most devs when it comes to setting up an environment for Chromium and a workflow for patches that still allows you to quickly and easily pull in and apply upstream changes whenever a new Chromium version releases.
In reality, if you know how to use CDP correctly and you have control over the environment that you run the browser in, you have to make very few browser patches.
What I mean with using CDP correctly is that, yes it is detectable to a certain extent but it comes down to things like enabling Runtime domain for example which you can easily mitigate in your own solution but is something that libraries like puppeteer / playwright often do out of the box (this is where the "stealth" versions of these libraries come in, they will either mitigate by disabling features or use some hacky approaches to instrument the JS that runs on the pages).
Then when you move into an environment that is a lot more stripped down (let's say from your home machine to docker) now you run into A LOT of issues that you definitely are better off fixing with browser patches, however figuring out what those issues are and how to fix them is a huge feat in itself and often will require you to have the ability to reverse engineer things like Cloudflare, Akamai and other anti bot vendors just to know what leaks you still have to patch.
It doesn't help that there is no end to misinformed articles on things like "browser fingerprinting" that you encounter when you try to solve your issues the first time you encounter them, a lot of articles based on nothing but superstition, articles that basically say "proxies are never good enough", "captchas are getting out of hand" that get things wrong and will just eat away at your sanity while trying to debug issues.
This is long enough of a rant already but maybe offers you some insight, if you have any specific questions feel free to ask.
coppsilgold 2 days ago [-]
Why not create a library that you inject into the Chrome process though?
It seems to me that playing a cat and mouse game with these anti-bot systems is unnecessary. Design a system which mimics a legitimate user to such a degree that it's either indistinguishable from an actual user or would produce an unacceptable level of false positives for the detection system. This is not an even playing field, the bot has all the advantages.
For example:
- Enumerate all the possible ways in which the webpage can glean insight into user input/activity.
- Hook all these functions by injecting code into the browser. At a level above and completely inaccessible to anything the web page can do to detect/interfere.
- Create functions that mimic user activities (mouse pathing, aimless mouse wandering, random scrolls, clicks, text selections, etc)
- Feed the outputs of these functions into the functions that you hooked.
- Rip out whatever information you want from the Chrome data structures in memory. Can probably reuse CDP code here.
After all this, the only challenge that would remain is to perfect the input functions that are supposed to mimic a legitimate user. Depending on how sophisticated these anti-bot systems can/will get, you may also need to cultivate user browsing habit profiles to enter advertising/spying databases as real humans.
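For the "mouse pathing" step in the list above, one common way to get a curved, non-uniform pointer path is a cubic Bezier curve with randomly displaced control points plus ease-in/ease-out timing. The sketch below only generates coordinates; the function name, the `jitter` parameter, and the defaults are assumptions made for illustration, and nothing here is claimed to match what any anti-bot system accepts as human.

```python
import random


def bezier_mouse_path(start, end, steps=30, jitter=0.2, rng=None):
    """Return a list of (x, y) points curving from `start` to `end`."""
    if rng is None:
        rng = random.Random()
    (x0, y0), (x3, y3) = start, end
    dx, dy = x3 - x0, y3 - y0

    def control(t):
        # Start on the straight line, then push sideways (perpendicular to
        # the direction of travel) by a random fraction of the distance.
        offset = rng.uniform(-jitter, jitter)
        return (x0 + dx * t - dy * offset, y0 + dy * t + dx * offset)

    (x1, y1), (x2, y2) = control(1 / 3), control(2 / 3)

    path = []
    for i in range(steps + 1):
        t = i / steps
        t = t * t * (3 - 2 * t)  # smoothstep: slow start, fast middle, slow end
        u = 1 - t
        x = u**3 * x0 + 3 * u * u * t * x1 + 3 * u * t * t * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u * u * t * y1 + 3 * u * t * t * y2 + t**3 * y3
        path.append((x, y))
    return path


# Seeded generator so the example is reproducible.
path = bezier_mouse_path((100, 200), (640, 360), rng=random.Random(7))
print(len(path), path[0], path[-1])
```

The path always starts and ends exactly on the requested points; only the route between them varies with the random control points.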
ghxst 1 days ago [-]
> It seems to me that playing a cat and mouse game with these anti-bot systems is unnecessary. Design a system which mimics a legitimate user to such a degree that it's either indistinguishable from an actual user or would produce an unacceptable level of false positives for the detection system.
This is the most common misconception, challenges you face with browser automation at scale are not *automation* challenges.
You can use real human input, by having actual humans doing the input and you will still get blocked.
Automation at scale means running dozens to 100s of browser instances concurrently on the same hardware, then after you mitigate IP related issues is when you start running into actual challenges that are completely different from the actual automation part.
You have to research all the little quirks browsers have through the various APIs that they offer and then compare that data to real world data before you can start to actually fix the problems.
coppsilgold 22 hours ago [-]
There are browsers which randomize such fingerprints such as Brave. The web page does not have any insight into your hardware that you cannot mitigate by having the browser fake the responses.
You can also use Linux features such as namespaces & TUN's[1] to properly utilize proxies. Something I noticed is that Chrome under --proxy-server=socks5:// is incapable of using HTTP3 (UDP) for example, perhaps a deliberate oversight.
When scaling browser automation, generating random fingerprints for most common high entropy data points is counterproductive. It just ends up lowering your trust score and shifts attention to other browser properties with less entropy, making those primary identifiers.
For example, degrading canvas, WebGL, or WebGPU fingerprints (e.g., by introducing noise like Brave does) might lead anti-bot systems to either ignore them or punish you with captchas. Once ignored, other signals, such as screen resolution (just an example), become more important.
While this helps people with privacy by blending in with users and a single user visiting a website normally will probably not notice much, an influx of multiple users with degraded fingerprints and similar resolutions becomes easy to detect and might get a captcha or get blocked (e.g. 30-50+ browser sessions generating cookies for a specific captcha concurrently).
You can spoof multiple resolutions and then add some other properties, but it requires consistency across all of them, which can come down to weird browser specific quirks as well as whatever the data set of the anti bot vendor contains (regardless of how accurate). There are only so many plausible values for each low entropy data point that anti-bot systems will give you a high score for, forcing you to spoof as many data points as possible to maintain a high trust score across many concurrent sessions and eventually scale back or hit a limit for your operation, or deal with captchas by solving them and lose to the competition that doesn't have to do that.
Fingerprinting at scale isn’t just about spoofing individual data points - it’s about aligning all points in a realistic way and knowing which and how they relate to each other, which requires extensive data and research.
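A toy way to see why a randomized value can hurt rather than help: measure how "surprising" a value is against a sample of values seen in the wild. A common value carries few bits of information, while a never-seen random value carries many and therefore stands out. The sample data, the add-one smoothing, and the function name below are all assumptions for illustration; real vendors' data sets and scoring are not public.

```python
import math
from collections import Counter


def surprisal_bits(value, observed):
    """Bits of information revealed by `value`, given a sample of observed values."""
    counts = Counter(observed)
    # Add-one smoothing so a never-seen value gets a finite (but large) score.
    probability = (counts[value] + 1) / (len(observed) + len(counts) + 1)
    return -math.log2(probability)


# Made-up sample of screen resolutions reported by 100 visitors.
sample = ["1920x1080"] * 60 + ["1366x768"] * 25 + ["2560x1440"] * 15

common = surprisal_bits("1920x1080", sample)
randomized = surprisal_bits("1873x1011", sample)  # value nobody else reports

print(f"common value: {common:.2f} bits, randomized value: {randomized:.2f} bits")
```

Summing such scores across attributes gives a crude picture of the point made above: the goal is a combination of plausible values that fit together, not maximum randomness on each one.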
On proxies: flagged IPs with residential ASNs often work fine if the overall trust score is high, but degraded fingerprints like Brave’s can undermine that advantage, and then IP reputation becomes a lot more important, though it's always nice to eliminate that factor if you are able to do so.
mintzworld 1 days ago [-]
Even a single script that performs actions too quickly on a website can trigger anti-bot measures, even if the bot isn't detected directly.
ghxst 22 hours ago [-]
I'm not denying that, I'm saying it's not a difficult challenge to solve when you compare it to the others I mentioned.
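Pacing is indeed the easy part. A minimal sketch of running a list of actions with a random pause between them, assuming nothing about any particular automation library: the actions are plain callables, and the delay bounds are arbitrary example values rather than thresholds known to satisfy any site.

```python
import random
import time


def run_paced(actions, min_delay=0.8, max_delay=2.5, sleep=time.sleep, rng=None):
    """Call each action in order, sleeping a random interval between calls."""
    if rng is None:
        rng = random.Random()
    results = []
    for index, action in enumerate(actions):
        if index > 0:
            # No pause before the first action, a jittered pause before the rest.
            sleep(rng.uniform(min_delay, max_delay))
        results.append(action())
    return results


# Usage with stand-in actions and a fake sleep so the example runs instantly.
recorded_delays = []
outcome = run_paced(
    [lambda: "open page", lambda: "click", lambda: "read result"],
    sleep=recorded_delays.append,
    rng=random.Random(1),
)
print(outcome, len(recorded_delays))
```

Passing `sleep` and `rng` as parameters keeps the helper testable; in real use the defaults apply and the script actually waits.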
seleniumbase 2 days ago [-]
The biggest issue with going from a home machine to a server is that you may lose having a "residential IP address", which is something that you'll want to have in order to prevent automation from being blocked outright. Hence the popularity of residential proxies. However, some servers live in a residential IP space, which makes them optimal for running web automation in. As was partially covered in https://www.youtube.com/watch?v=Mr90iQmNsKM, GitHub Actions appears to live in a "Residential IP space", which makes it a good server choice for web automation.
ghxst 2 days ago [-]
IP is definitely not the biggest issue in my experience, as proxies are required at scale regardless, unless you get into more theoretical areas like p0f.
The biggest issues are the ones that aren't obvious or easily tested for like missing a particular font, being on an abnormal gfx driver that produces an unidentified hash for particular fingerprint methods, not having certain APIs available that require browser patches, and then these aspects will differ between anti bot vendors and the data sets that they have.
The reason they can be hard to test for is that everything is based on a trust score, which is potentially influenced by anything from website load to things tied to your personal session and for some vendors optionally even input data.
cruffle_duffle 3 days ago [-]
As somebody who is now on the “need to scrape a website to get my customers data for them” side of the fence… I get the reason bot detection exists. If you want people to not scrape, offer API’s that allow customers or their software to log in using oauth and let their software / LLM agent grab their data for them.
chii 2 days ago [-]
> If you want people to not scrape, offer API’s
Many sites want to prevent scrapers because they don't want their information aggregated - things like price lists and product availability etc.
I know grocery sites do this, to prevent customers from knowing price histories of products. They want to raise prices, then offer a discount to make it seem like the discount is legitimate.
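This is the kind of check an aggregated price history makes possible. A minimal sketch, assuming a list of daily prices collected before the sale: compare the sale price with the median of the recent history instead of the advertised "was" price. The window, the 5% threshold, and the function name are arbitrary choices for the example.

```python
from statistics import median


def discount_looks_genuine(history, sale_price, window=30, min_drop=0.05):
    """True if `sale_price` is meaningfully below the recent typical price.

    `history` is a list of daily prices, oldest first, recorded before the sale.
    """
    baseline = median(history[-window:])
    return sale_price <= baseline * (1 - min_drop)


# Price sat at 4.00 for weeks, was raised to 5.00 for five days,
# then advertised as "was 5.00, now 4.00".
raised_then_cut = [4.00] * 25 + [5.00] * 5
# Price was 5.00 for the whole month before dropping to 4.00.
steady_then_cut = [5.00] * 30

print(discount_looks_genuine(raised_then_cut, 4.00))
print(discount_looks_genuine(steady_then_cut, 4.00))
```

The median is used rather than the latest price so that a short price bump just before the sale does not move the baseline.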
It seems weird to me that this works - when I scroll elements into view or do similar things in other code, I use a random scroll speed to simulate human behavior, but SeleniumBase evidently doesn't.
Maybe I am just too paranoid.
seleniumbase 2 days ago [-]
SeleniumBase CDP Mode uses `DOM.scrollIntoViewIfNeeded` (https://chromedevtools.github.io/devtools-protocol/tot/DOM/#...), so it only scrolls when elements are offscreen, rather than always scrolling. This reduces the number of scrolls needed. Also, it seems that most anti-bot services are not looking at scrolling as a way of identifying users.
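For readers who haven't looked at raw CDP: a command is just a JSON message with an `id`, a `method`, and `params`, sent over the browser's debugging WebSocket. The sketch below only builds the message for `DOM.scrollIntoViewIfNeeded`; it does not open a connection, the `nodeId` value is a placeholder, and the helper name is made up. It is not SeleniumBase's internal code.

```python
import itertools
import json

# Each command needs a unique id so the response can be matched to it.
_command_ids = itertools.count(1)


def cdp_command(method, **params):
    """Serialize a Chrome DevTools Protocol command as a JSON string."""
    return json.dumps(
        {"id": next(_command_ids), "method": method, "params": params}
    )


# Placeholder nodeId; a real one would come from an earlier DOM query.
msg = cdp_command("DOM.scrollIntoViewIfNeeded", nodeId=42)
print(msg)
```

Per the protocol docs, the method is a no-op when the node is already visible, which is why it scrolls less often than an unconditional scroll.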
I was scraping my own oai chatgpt.com with playwright, and cloudflare blocked every attempt, and the same happened with selenium and puppeteer. Only seleniumbase got past it.
Write a ton of code or just roll in a solving service API. Ez decision and save a ton of time + get to scraping faster.
Can your script even do Google CAPTCHA and HCaptcha? What about the captcha from Dread? (aint no way it can)
There is no need to bypass them when you can just solve them.
There is no need to solve them when you can just bypass them.
That's.. that works?? :D
I was scraping my own oai data
[1] <https://github.com/xjasonlyu/tun2socks>
a good example is realestate.com.au