So, to add to this list, here are my favorite articles that could expand OP's document:
0, 1 - CSS and XPath selector introduction articles, where I built a widget into our articles that lets you test all CSS and XPath selectors right there in the learning material.
3 - introduction to reverse engineering - a quick introduction to using browser devtools for web scraping: how to inspect the network and replicate requests in your program (see the sketch after the links). This is where I point all beginners, as understanding the browser really helps with understanding web scraping!
0 - https://scrapfly.io/blog/parsing-html-with-css/
1 - https://scrapfly.io/blog/parsing-html-with-xpath/
2 - https://scrapfly.io/blog/how-to-scrape-without-getting-block...
3 - https://scrapecrow.com/reverse-engineering-intro.html
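To make the "inspect the network and replicate it in your program" step concrete, here's a minimal sketch; the endpoint, params, and headers are hypothetical stand-ins for whatever XHR the devtools Network tab shows you:

    import requests

    # Hypothetical example: replicate an XHR found in the devtools Network
    # tab. The URL, params, and headers are placeholders for your target site.
    response = requests.get(
        "https://example.com/api/search",
        params={"q": "laptops", "page": 1},
        headers={
            "User-Agent": "Mozilla/5.0",   # match what the browser sent
            "Accept": "application/json",
        },
    )
    print(response.json())  # many such endpoints return JSON directly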
The primary argument _in favor_ of automation (e.g. web scraping) is that it would be unethical to hire hundreds of people to do menial, unfulfilling tasks like mindlessly clicking around a website and saving the pages when it could be done by a program, which is countless times more efficient for everyone involved (the website included) and safer.
The whole point is to avoid unnecessary or excessive crawling by bots that are engineered with no concern for anything other than the owner's motivations, presumably financial in most cases.
Robots.txt allows site owners to restrict such pages from being crawled by bots. Services that allow people to circumvent those restrictions are being rude, to say the least. Many crawling services also use a farm of proxies that spoof their real identity with fake user agents to circumvent rate limiting, etc. All of these "strategies" go far beyond basic automation and are quite shady in reality.
Also, unfortunately, robots.txt is rarely used to indicate non-crawlable endpoints these days; instead it is used as a way to withhold public data. Just take any random big website and look at its robots.txt file.
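For what it's worth, honoring it takes only the standard library; a minimal sketch (the site, path, and user agent are placeholders):

    # Minimal sketch: check robots.txt with Python's standard library
    # before fetching a page.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetches and parses the file
    if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")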
Having said that, if you’re in the web scraping business dealing with anti-scrape shields and whatnot, ignoring robots.txt is the least nefarious of them all.
I mean, we've had a solution to web scrapers since the inception of web authentication, but the value of having the data publicly available clearly outweighs the costs of having your data scraped, to the point where big corporations would rather take web scrapers all the way to the Ninth Circuit (the LinkedIn case) than shut down public access.
That being said, our understanding of information philosophy is still in its infancy, so it's hard to discuss ethics here. Generally, I'm in favor of hackers, individuals, and decentralization over big corporations, and access to web scraping empowers the former and weakens the latter - so I'm rooting for the healthier, better version of the internet above all!
But please stop framing your encouragement of toxic crawling practices as some sort of noble pursuit in a made-up fight against The Man.
Just own it as the "I'm-alright-Jack" approach it is; the honesty will make it a more respectable position intellectually, even if it remains unethical.
In fact, they even stated their reasoning in the document. I don't see why anyone has to blindly follow PEP8, nor do I get why a 4-space indent has to be considered a standard.
A standard is not a set of rules already enforced in a language, otherwise it would not be needed. It's rather a set of practical guidelines that a group of people agrees upon, with the purpose of making each other's lives easier. That's why indenting with tabs is weird.
> I don't see why anyone has to blindly follow PEP8
In the <tabs> vs <spaces> debate there's really no reason not to follow PEP8. The number of people (and editors, and tools, etc.) that abide by it is quite large, and it seems to work well enough for most. The only reason someone would even mention spaces as an inconvenience is that they can somehow perceive a difference when editing code, which may point to badly configured or outdated tooling rather than a faulty standard. Most people who code Python have their editor set to insert 4 spaces when the <Tab> key is pressed and delete 4 consecutive spaces on <Backspace>. If instead you repeatedly press (4 times) the <Space> bar or <Backspace> to insert or remove indentation, you're doing it wrong.
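For what it's worth, this is the kind of thing an EditorConfig file pins down across editors; a minimal sketch, assuming a .editorconfig at the project root (my illustration, not something PEP8 itself mandates):

    # .editorconfig - most editors pick this up natively or via a plugin
    [*.py]
    indent_style = space
    indent_size = 4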
Also, tabs avoid the issue of only 2 or 4 spaces being acceptable, which leaves 1, 3, or 5 spaces as incorrect combinations that lead to issues. With tabs there are no invalid combinations, though of course you can still have more or fewer indents than intended. But the hunting for that extra space you copied in is gone.
I just don't subscribe to the 'tabs are evil' narrative. I like that python supports them but I think it's really annoying that YAML doesn't.
The argument of "each editor does things differently" is also not really valid when you're going to need a special editor that can convert tabs to spaces and delete spaces in bunches to really work with it comfortably. It would have been much easier to just use an editor that handles tabs in the way that was needed. Either way you're going to want specific editor features.
You also seem to advertise many of the sites/datasets you're scraping, which opens you up to litigation. Especially if they're employing anti-scraping tooling and you're brazenly bypassing those. It doesn't matter that it's legal in most jurisdictions of the world, you'll still have to handle cease and desists or potential lawsuits, which is a major cost and distraction.
Is it a done deal now, after the "LinkedIn vs HiQ" case, that public information is only protected by copyright, but you can use the by-product as it fits you for a new business?
I'd like to point out that, while HiQ Labs "won" the case, that company is basically dead. The CEO and CTO are both working for other companies now. So I think the bigger takeaway is: don't get yourself sued while you're a tiny little startup.
Can anyone recommend other resources for understanding anti-bot tech and their workarounds?
Projects like cloudscraper are often linked to point and say "look! they broke Cloudflare!" but CF and the rest of the industry have detections for tools like this, and instead of rolling out blocks for these tools, they give website owners tools like bot score to manage their own risk level on a per-page basis.
Probably need to find an in with bot builders, if that's really a goal I have.
I haven't dug into it enough to know if there's some technical reason it's not currently the case, or just lack of (interest|willpower)
https://www.bing.com/search?q=%22thrill+me%22+%22common+craw... (and DDG similarly, because bing)
ed: I was curious if maybe HN publishes a sitemap, and it seems no. Then again, hnreplies knows about the HN API so maybe it's special-cased by the big crawlers https://github.com/ggerganov/hnreplies#hnreplies
If you rarely make updates to your site, Google crawls it infrequently and new things won't show up very quickly.
But if you do have frequent updates and lots of traffic, like any popular forum style site, you will get lots of crawler traffic. And I would bet the algo does the same for all endpoints on a site. So the "about us" page on a popular site probably ends up not being crawled nearly as much as the new threads page.
I also set up my pipeline to send me a Telegram message with the debug trace in case of errors, so that I have the entire scraping flow to analyze. That's pretty neat.
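Something like this works as a minimal sketch - the bot token, chat id, and scrape() entry point are placeholders for your own setup:

    # Minimal sketch: message yourself the traceback via the Telegram
    # Bot API when the pipeline fails. BOT_TOKEN, CHAT_ID, and scrape()
    # are placeholders.
    import traceback
    import requests

    BOT_TOKEN = "123456:ABC..."   # from @BotFather
    CHAT_ID = "987654321"

    def notify(text: str) -> None:
        requests.post(
            f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
            data={"chat_id": CHAT_ID, "text": text[:4096]},  # Telegram caps message length
        )

    try:
        scrape()  # your pipeline entry point
    except Exception:
        notify(traceback.format_exc())
        raise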
I know it's a very open-ended question.
I would speculate (based on anecdata) that a lot of the actual load placed upon sites is from the discovery phase -- what pages are there, and have any of them changed -- not so much "hit this one endpoint and unpack its data"
Do you use it? What for?
Another tip I've found is to check whether the data is accessible in a mobile app, and proxy the app's traffic to see if there is a JSON API available.
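If you go that route, here's a minimal sketch of a mitmproxy addon that logs the JSON endpoints an app calls; the filename is arbitrary, and you'd point the phone's Wi-Fi proxy at the machine running it:

    # json_sniff.py - run with: mitmdump -s json_sniff.py
    # A minimal sketch that logs every JSON response the app receives.
    from mitmproxy import http

    def response(flow: http.HTTPFlow) -> None:
        content_type = flow.response.headers.get("content-type", "")
        if "application/json" in content_type:
            print(flow.request.method, flow.request.pretty_url)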
XPath is generally more powerful for really gnarly things and for backtracking: 'Show me the 3rd paragraph that's a sibling of the fourth div with id="subhed" and contains the text "starting".'
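Under one possible reading of that query, a sketch with lxml (page_html is assumed to hold the fetched page source):

    # One reading of the query above; page_html is assumed.
    from lxml import html

    doc = html.fromstring(page_html)
    nodes = doc.xpath(
        '((//div[@id="subhed"])[4]'
        '/following-sibling::p[contains(., "starting")])[3]'
    )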
> XPath is generally more powerful...
Hell, even "find id=subhead and _go up one element_" isn't possible in CSS because that's not a problem it was designed to solve
This is in the context of test automation of modern web apps with a virtual DOM. I'm sure things might be different in other areas.
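For contrast, walking up is a single step in XPath; a tiny self-contained sketch with lxml (page_html assumed, as above):

    # XPath's parent axis makes "go up one element" trivial.
    from lxml import html

    doc = html.fromstring(page_html)
    parents = doc.xpath('//*[@id="subhead"]/parent::*')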
Identifying scrapers is actually really easy, but it's not a binary decision. Anti-scraping systems usually keep a score that is compiled from a few measurements, so just applying some commonly known patches can improve your trust score significantly!
We recently published a blog series on all the things that can be done to avoid blocking: request headers, proxies, TLS fingerprints, JS fingerprints, etc. (the header layer is sketched after the link below). But it's quite a bit of work to get there if you're new to web scraping - there's just so much information to get through, and it's growing every day.
1 - https://scrapfly.io/blog/how-to-scrape-without-getting-block...
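The first and easiest layer is headers; a minimal sketch with requests (the header values are illustrative, not guaranteed current):

    # Send browser-like headers instead of the library defaults so the
    # request doesn't stand out.
    import requests

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
    resp = requests.get("https://example.com", headers=headers)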
Largely this taught me to be creative with my data sources. For example, I built a virtual weather vane powered by a Raspberry Pi that would scrape my local airport's website for wind direction data, then turn the vane via a servo to the correct direction. So my takeaway from this project was that scraping isn't as straightforward as one would think; there's more of an art to figuring out where to get the information you want.
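Roughly along these lines, as a sketch - the URL, the regex, and the servo wiring are all my assumptions, and a hobby servo only sweeps ~180 degrees, so a real 360-degree vane needs a different mechanism:

    # Sketch of the idea: scrape a wind direction and aim a servo at it.
    # The page URL, the regex, and GPIO pin 18 are assumptions.
    import re
    import requests
    import RPi.GPIO as GPIO  # Raspberry Pi only

    page = requests.get("https://example-airport.example/weather", timeout=10).text
    match = re.search(r"Wind:\s*(\d{1,3})", page)
    degrees = int(match.group(1)) if match else 0

    GPIO.setmode(GPIO.BCM)
    GPIO.setup(18, GPIO.OUT)
    pwm = GPIO.PWM(18, 50)   # 50 Hz is typical for hobby servos
    pwm.start(0)
    # Map 0-180 degrees onto the common 2-12% duty-cycle range
    pwm.ChangeDutyCycle(2 + (min(degrees, 180) / 180) * 10)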
Am I correct that the examples listed here are (a) www.google.com, (b) mail.google.com and (c) www.google.com/finance/? I have no trouble extracting data from these examples.[FN1] I do not use a graphical web browser to make HTTP requests, nor do I use Python or BeautifulSoup. A cookie is required for mail.google.com, in lieu of a password, but the cookie can be saved and will work for years.
2. Google makes it impossible to sort results by URL, date, or even number of keyword/string hits in the page. Results are ordered according to a secret algorithm, designed for advertising.