Methods for checking if a client is not a scraper bot without client-side JS #652

Xe · 2025-01-20T17:03:03Z

Xe
Jan 20, 2025
Maintainer

#651 raised the idea of having a way to verify that clients are valid without having to run client-side JavaScript. I personally think that such a browser configuration is broken and that clients should at least run first-party JavaScript. However, there are security reasons why people would not want to run JavaScript at all.

I've had some ideas but need input.

Using Privacy Pass (IETF link) to bypass the need for a Proof-of-Work test at all.
Using a pure CSS solution, possibly with CSS animations.
Traditional image-based CAPTCHAs.

I'm personally hesitant to do image-based CAPTCHAs because locally hosted image models are able to solve them at superhuman speeds. They are also inaccessible to blind or vision-impaired users. There are workarounds such as audio CAPTCHAs, but I think that testing the client is better than testing the user.

Other solutions require data about which bots send which headers, and collecting that will require a bit of coordination (especially as I have been aggressively telling AI companies not to scrape my domains unless they want to pay me). I need to think about this more before I can propose the best way to do this.

unixfox · 2025-01-20T21:46:47Z

I think you have to weigh the solution for non-JS users against what's currently implemented for JS users.

You said that local models can solve images at superhuman speed, but someone can also solve your JS challenge in a temporary browser and then reuse the cookies in a non-browser HTTP client. Since there is no limit on how many requests a session can make, one can also scrape your website at full speed with a valid session cookie.

So put yourself in the position of a bot developer who wants to index your whole website. Which would take less time to implement to circumvent your anti-bot solution: solving the images with a good local model that you have to acquire somewhere, or getting a valid session cookie from the JS challenge?

I think the latter seems easier to implement, because you can do it with a couple of lines of JavaScript using a popular framework like Puppeteer and then extract the cookies to use them through a "fetch" library or similar.

They are also inaccessible to blind or vision-impaired users. There are workarounds such as audio CAPTCHAs, but I think that testing the client is better than testing the user.

In my opinion, and this is the general vibe I get from what you said in #651 ("I will need to think about it a bit more in order to avoid making things too complicated"), you don't have to recreate something like reCAPTCHA. You are already supporting "blind or vision-impaired users" when they have JS enabled, because the challenge is automatic.

Using Privacy Pass (IETF link) to bypass the need for a Proof-of-Work test at all.

I think it still requires you to have JS enabled, at least as far as I understand from "How does the extension work?": https://privacypass.github.io/faq/.

But accessibility-wise, not many people use this extension, so IMO you may cover very few users by implementing this solution.

Using a pure CSS solution, possibly with CSS animations.

I don't have much experience with that one, but I would suggest making sure that it displays properly on mobile screens too.

Other solutions require data about what bots send which headers

You would be surprised how many bots you can block by just filtering based on the headers sent. A lot of bot developers do not bother sending all the correct headers that Chrome may send, for example.
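As a rough illustration of the header-filtering idea, here is a minimal Python sketch. The list of expected headers and the names `missing_browser_headers` and `looks_like_browser` are assumptions made for this example, not rules taken from the project, and a determined bot can of course copy a real browser's headers.

```python
# Headers a mainstream browser normally sends on a page navigation.
# This list is an illustrative assumption, not a rule set from the project.
EXPECTED_BROWSER_HEADERS = (
    "user-agent",
    "accept",
    "accept-language",
    "accept-encoding",
)


def missing_browser_headers(headers: dict[str, str]) -> list[str]:
    """Return the expected headers that are absent or empty.

    Header names are compared case-insensitively, as HTTP requires.
    """
    present = {name.lower() for name, value in headers.items() if value.strip()}
    return [name for name in EXPECTED_BROWSER_HEADERS if name not in present]


def looks_like_browser(headers: dict[str, str], max_missing: int = 0) -> bool:
    """Heuristic: accept the request if at most `max_missing` headers are absent."""
    return len(missing_browser_headers(headers)) <= max_missing
```

A request carrying only `User-Agent` and `Accept` would be flagged here because `accept-language` and `accept-encoding` are missing. This only catches bots that do not bother to imitate a browser, which is the case described above.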

0 replies

kubrickfr · 2025-01-21T09:19:49Z

If someone is using a browser without JS support, they are most likely tech-savvy. You could instead present them with a command line to copy, paste, and execute to solve the challenge, along with a form to submit the answer.

It could be as simple as copy-pasting the output of

factor 14865585630898943913604172180137

Which takes 3s on my laptop.
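The server side of such a challenge could be sketched as follows in Python. This is a sketch under assumptions, not an implementation from the project: the function names, the 52-bit prime size, and the accepted answer format (`N: p q`, the way `factor` prints its result) are all chosen for illustration, and the server would still need to remember which number it issued to which client.

```python
import secrets


def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2  # random base in [2, n - 2]
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True


def random_prime(bits: int) -> int:
    """Pick a random odd number with the top bit set until one is prime."""
    while True:
        candidate = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate


def new_challenge(bits: int = 52) -> int:
    """Return a product of two random primes for the client to factor.

    The bit size is an arbitrary, tunable assumption.
    """
    return random_prime(bits) * random_prime(bits)


def verify_answer(n: int, pasted: str) -> bool:
    """Check pasted output of the form 'N: p q ...' (or just 'p q ...')."""
    head, sep, tail = pasted.strip().partition(":")
    try:
        if sep and int(head) != n:
            return False
        factors = [int(tok) for tok in (tail if sep else head).split()]
    except ValueError:
        return False
    product = 1
    for f in factors:
        if f <= 1 or f >= n:  # reject trivial answers such as "1 N"
            return False
        product *= f
    return len(factors) >= 2 and product == n
```

Verification is a single multiplication, so the cost sits almost entirely on the client. The check does not test that each submitted factor is prime; any non-trivial factorization is accepted.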

1 reply

unixfox Jan 21, 2025

That's a very bad habit to encourage, because the same type of challenge is used on phishing websites to install malware on your computer.

"Unusual Run Prompt"

https://www.reliaquest.com/blog/using-captcha-for-compromise/

Xe · 2025-02-03T15:27:39Z

Xe
Feb 3, 2025
Maintainer Author

I had a terrible idea over the weekend and I think I'm gonna try it. TL;DR: a <textarea> where you ask the user how to make a pipe bomb. If Llama Guard rejects the answer as unsafe for being about weapons, you get through. If the answer is a refusal from the language model trying to fill it out, it does not get through.
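The decision rule is inverted compared with a normal safety filter, which a few lines of Python make explicit. This is only a sketch of the logic as described: `flagged_unsafe` is a hypothetical callable standing in for whatever safety classifier is used, and its interface is an assumption, not a real Llama Guard API.

```python
from typing import Callable

# Hypothetical classifier interface (an assumption for this sketch):
# takes the submitted text and returns True when the safety model
# flags it as unsafe.
SafetyClassifier = Callable[[str], bool]


def passes_challenge(answer: str, flagged_unsafe: SafetyClassifier) -> bool:
    """Let the client through only if the safety model flags the answer.

    A refusal written by a language model is not flagged as unsafe,
    so it fails the challenge, as does an empty submission.
    """
    text = answer.strip()
    if not text:
        return False
    return flagged_unsafe(text)
```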

1 reply

unixfox Feb 3, 2025

How do you validate that someone really wrote coherent text, and not just random characters?
