class: center, middle # Web Scraping Patterns (and Anti-Patterns) ### Greg Gorlen
Credit:
Wikipedia
#### [Browser Conference](https://www.browserconference.com/talks/web-scraping-patterns-and-antipatterns/), 20 June 2024, 3 PM PDT [GitHub source](https://www.github.com/ggorlen/web-scraping-patterns-talk) --- # Agenda 1. What's this talk about? 1. whoami 1. Defining terms 1. The patterns (and antipatterns) 1. Recap (skip here if you need a 1 slide TL;DR) 1. Questions? --- # What's this talk about? Patterns and antipatterns\* in - Playwright (Node.js) - Puppeteer - Web scraping - Testing - Browser automation Some familiarity with this tech will be assumed. Goal: deeper understanding will make scripts more reliable, faster to write and execute, easier to maintain. Feel free to ask questions throughout, but there's a lot to cover, so I might skim certain parts or loop back for deeper dives as time permits.
\* In the tradition of [this blog post](https://kentcdodds.com/blog/common-mistakes-with-react-testing-library), for example (or the Drake meme). Terms will be defined momentarily.
\** This talk is for educational purposes. Scrape responsibly. --- # `$ whoami` - Full stack software engineer based in San Francisco, CA - currently building [Qualified.io](https://www.qualified.io/) at [Andela](https://www.andela.com/)\* - we also do [Codewars](https://www.codewars.com) - Generalist - freelancing - teaching - writing - experimenting - Avid Q&A participant - [Stack Overflow](https://stackoverflow.com/users/6243352/ggorlen) - [Codementor](https://www.codementor.io/@ggorlen) - ... [Codidact](https://software.codidact.com/users/56855)? - Fan of - quality assurance/testing - web scraping - userscripts - ...
\* Disclaimer: my views are my own
--- # Defining Terms: Patterns ✅ Patterns are common idioms, ways of doing things in a language or framework that are usually good. --- # Defining Terms: Antipatterns ❌ Antipatterns are common idioms, ways of doing things in a language or framework that are usually suboptimal. Not exactly _wrong_, per se. --- class: center, middle # On to the patterns ... --- ## ❌ Sleeping - Never the _right_ amount of time - Sometimes doesn't wait long enough - Sometimes waits too long - "Flaky" (testing lingo) - [Discouraged](https://playwright.dev/docs/api/class-page#page-wait-for-timeout) in Playwright - [Removed](https://stackoverflow.com/a/77078430/6243352) in Puppeteer ## ✅ Using a precise predicate - locators - `page.waitForSelector()` - `page.waitForFunction()` - `page.waitForResponse()` - `page.waitForURL()` - ...
[Read more](https://stackoverflow.com/a/73676564/6243352)
--- ## ❌ Not using `"domcontentloaded"` - `"load"` - Waits for initial set of resources - Can get stuck (see [this weird case](https://stackoverflow.com/questions/78490582/why-is-playwright-screenshot-timing-out-with-waiting-for-fonts-to-load/78490782?noredirect=1#comment138481703_78490782)) - Default predicate (poor design?) - `"networkidle"` - Waits for even more resources, then sleeps for 500ms - Can get stuck - Slow, pessimistic, too general, flaky - OK-ish for PDF generation and screenshots. ## ✅ Using `"domcontentloaded"` - Only waits for the HTML file - Doesn't get stuck - Quickest navigation resolver in Puppeteer - `"commit"` is even faster in Playwright!
[Read more](https://serpapi.com/blog/puppeteer-antipatterns/#never-using-domcontentloaded)
--- ## ❌ Using `{timeout: 0}` on navs - Can hang the script forever - No stack trace - Hard to debug - Sign of desperation ## ✅ Using `{timeout: 120_000}` - Throw, catch, log and retry operation - Or log the error for the maintainer to fix and crash - Caveat: some Pupp/PW stack traces are hard to understand and debug (and the error names change every few versions just because). Next talk...?
[Read more](https://blog.appsignal.com/2023/02/08/puppeteer-in-nodejs-common-mistakes-to-avoid.html#using-infinite-timeouts)
--- ## ❌ Not blocking resources (CSS, fonts, GA, etc) - Don't need 'em if you're scraping text - Slow - Can get stuck - Not nice to the server ## ✅ Blocking resources - Speeds up the code! - If the page is static, disable JS! - A bit more code to write - Less useful in testing
[Read more](https://serpapi.com/blog/puppeteer-antipatterns/#not-blocking-images-and-resources)
--- ### ❌ Element Handles ```js const titleHandles = await page.$$(".title"); const textContents = []; for (const el of titleHandles) { textContents.push(await el.evaluate(el => el.textContent)); } ``` ### ✅ `$$eval` (in Puppeteer) ```js const textContents = await page.$$eval( ".title", elements => elements.map(el => el.textContent) ); ``` ### ✅ `locators` (in Playwright) ```js const textContents = await page.locator(".title").allTextContents(); ```
[Read more](https://stackoverflow.com/a/77597068/6243352)
--- ## ❌ Scraping the DOM - Liable to change - Timing issues - Visibility issues - Other selection issues (Tailwind? Shadow DOM? IFrames?) - Data is often formatted weirdly - ...All right, sometimes you have to ## ✅ Intercepting network requests or scraping JSON - More likely to be stable - Faster code execution - Gets you raw, unformatted data - ...But not always feasible
[Read more](https://stackoverflow.com/a/76023261/6243352)
--- #### ❌ Using browser automation on static sites - Overkill - ...Sometimes necessary if you're detected #### ✅ Using `fetch` + Cheerio on static sites ```js fetch(url) .then(res => res.text()) .then(html => { const $ = cheerio.load(html); const data = [...$("table")].map(e => [...$(e).find("tr:has(td)")].map(e => [...$(e).find("td")].map(e => $(e).text().trim()) ) ); console.table(data); }) .catch(err => console.error(err)); ``` - Lightweight, fast, easier (synchronous) coding, fewer deps - Potentially more reliable - Gets you raw, unformatted data - ...But not always feasible
(I have [dozens](https://stackoverflow.com/a/78376148/6243352) of examples of speedups replacing Puppeteer with Cheerio)
--- #### ❌ Using trusted clicks when scraping ```js const el = await page.waitForSelector(".foo"); await el.click(); ``` - Can get stuck due to visibility checks - Button is under cookie banner? Timeout! - Makes you want to use evil Element Handles in Puppeteer - Good to use in testing, though (we want to act like a user) - Caveat: you've already tested the banner dismissal (or whatever) and want to skip testing it again #### ✅ Using untrusted clicks when scraping ```js const el = await page.waitForSelector(".foo"); await el.evaluate(el => el.click()); ``` - General principle: No obligation to act like a user when scraping - Way more reliable - No visibility/actionability checks - Button is under cookie banner? No prob. - Avoid in testing (we want to act like a user)
[Read more](https://stackoverflow.com/a/78376148/6243352)
--- ### ❌ Clicking to navigate when scraping ```js const navP = page.waitForNavigation({waitUntil: "domcontentloaded"}); await page.click(".btn-that-goes-to-next-page"); await navP; // should be waiting for a selector on the next page anyway... ``` - Avoid the DOM--it's a common source of failure - Avoid element handles - Also avoid `page.back()` ### ✅ Using `page.goto()` when scraping ```js await page.goto("URL for the next page", {waitUntil: "domcontentloaded"}); ``` - General principle: No obligation to act like a user when scraping - General principle: Avoid the DOM when scraping - Way more reliable - No visibility/actionability checks - Can navigate directly to iframe source - Avoid in testing (we want to act like a user)
[Read more](https://blog.appsignal.com/2023/06/14/puppeteer-in-nodejs-more-antipatterns-to-avoid.html#adding-premature-abstractions#underusing-pagegoto)
--- ## ❌ Dismissing cookie banners with `.click()` ```js const btn = await page.waitForSelector(".banner .close-btn"); await btn.click(); // can fail ``` - Timing problems, nondeterminism and flakiness ## ✅ Rip out cookie banners with `.remove()` ```js const el = await page.waitForSelector(".banner"); await el.evaluate(el => el.remove()); ``` - General principle: No obligation to act like a user when scraping - General principle: Avoid the DOM when scraping - Just blast away the cookie banner - No visibility/actionability checks - Avoid annoying animations - Feel free to rip out other annoying parts of the page
[Read more](https://stackoverflow.com/a/78414695/6243352)
--- ### ❌ Using `page.on` to get one response - Easy to [leak listeners](https://stackoverflow.com/a/74660359/6243352) - Callback-based - Useful for monitoring all requests ### ✅ Using `.waitForResponse`/`.waitForRequest` - Promise based, flatter is better, control flow stays in a chain - General principle: nesting bad --- ## ❌ Using `.filter` (Playwright-specific) ```js // from https://stackoverflow.com/a/73216189/6243352 page .locator(`[aria-controls=${table_id}]`) .filter({ has: page.locator(`text="${page_number}"`) }); ``` ## ✅ Chaining locators ```js // from https://stackoverflow.com/a/75553716/6243352 page .locator(`[aria-controls="${tableId}"]`) .getByText(pageNumber, {exact: true}); ``` General principle: nesting bad. --- #### ❌ Using `.locator` - Not specific enough - Makes you want to use CSS or XPath (not user-visible) #### ❌ Using `.getByText` - OK-ish--at least it's user visible - `getByRole` is better because it also specifies the role #### ❌ Using `.getByTestId` - OK-ish--at least it's specific - Pollutes the code with `data-testid="..."` attrs everywhere - Not user-visible #### ✅ Using `.getByRole` - Specify both text/placeholder/value and element with unified syntax - User-visible (This has to do more with testing--use whatever seems most reliable for a particular site if you're scraping... but you're probably using Puppeteer for scraping anyway) --- ## ❌ Using XPath - Nasty syntax, usually unnecessary, hard to maintain - No longer necessary for text content scraping (use `::-p-text()` in Pupp, `textContent()` in PW) ## ✅ Using `::-p-`/CSS selectors (in Puppeteer) - Fine for scraping, if chosen carefully (see next slide) - Pupp locators are still experimental at the moment ## ✅ Using `getByRole` (in Playwright) - Great for testing --- #### ❌ Using long path or CSS chains - Really brittle: ```js "#answer-60796572 > div > div.answercell > div.s-prose.js-post-body > pre" ``` - Also avoid n-th/first/last #### ✅ Using short CSS selectors (in scraping/Pupp) - Better (assuming it's enough to disambiguate the element): ```js "#answer-60796572 .js-post-body pre" ``` - Difficult to choose properly. Find the right balance: - Precise enough that it's unlikely to match the wrong thing (_false positive_) - Loose enough that it doesn't assume a rigid DOM structure that's liable to break (_false negative_) #### ✅ Using `getByRole` (in testing/PW) - ...or one of the many other Playwright location best practices --- #### ❌ Using CSS attribute selectors unnecessarily - Really brittle: ```js '[class="foo"]' ``` - This is too precise (prone to false negatives) - This won't match `
` - Occasionally useful, like matching ```html
``` with `'[class^=foo-]'`. #### ✅ Using CSS `.` syntax - Better: ```js ".foo" ``` - This will match ```html
``` --- ## ❌ Not taking advantage of the library API - Still selecting text with XPath (already covered) - Traversing shadow DOMs by hand - Using `page.$` and family (element handles) ## ✅ Using new features (Puppeteer) - `::-p-text(foobar)` to select text (`text/foobar` was cool but now is "legacy") - `>>>` to traverse the shadow DOM - Maybe non-experimental locators someday? In the meantime, `page.waitForSelector()` with `page.$$eval()` handles 90% of the work. ## ✅ Using new features (Playwright) - Heed the warnings in the docs! --- ## ❌ Mixing `await`/`then` - Not just Pupp/PW but since these libs are heavily async, tends to show up here - [Hall of Shame](https://stackoverflow.com/a/75785234/6243352) of collected problems this has caused, too many to list here ## ❌ Using `.then` instead of `await` - Can be fine outside of Playwright or Puppeteer, but not here ## ✅ Use `await` always! - General principle: nesting bad. --- ### ❌ Overusing `networkidle` - Already mentioned, slow, unreliable, pessimistic and has a sleep ### ✅ Use specific predicates - `waitForResponse`, `waitForSelector`, `waitForFunction`... precise! - See [this example](https://stackoverflow.com/questions/57364320/puppeteers-setcontent-function-not-making-network-requests-for-static-files/78594844#78594844) that gives a 2x speedup for screenshots/PDF gen ```js const loaded = Promise.all([ page.waitForResponse(e => e.url() === "https://unpkg.com/hack@0.8.1/dist/dark-grey.css" ), page.waitForResponse(e => e.url() === "https://picsum.photos/100/100"), page.waitForFunction("!!window.axios"), page.waitForSelector("p::-p-text(Leanne Graham)"), ]); await page.goto(url, {waitUntil: "domcontentloaded"}); await loaded; await page.evaluate(` Promise.all([...document.querySelectorAll("img")].map(e => e.decode())) `); ``` - Downsides: verbose, a bit more work, not as general --- ## ❌ Using overly-general methods ```js await page.evaluate(() => { const elements = document.querySelectorAll(".foo-bar"); return [...elements].map((el) => el.textContent.trim()); }); ``` ## ✅ Using the right tool for the job ```js // Puppeteer await page.$$eval(".foo-bar", (elements) => elements.map((el) => el.textContent.trim()) ); ``` ```js // Playwright await page.locator(".foo-bar").allTextContents(); ``` Might seem trivial, but this applies to so many parts of the API and magnifies.
[Read more](https://blog.appsignal.com/2023/06/14/puppeteer-in-nodejs-more-antipatterns-to-avoid.html#not-using-specific-wait-or-evaluate-methods)
--- ## ❌ Not cleaning up the browser ```js const browser = await puppeteer.launch(); // ... do the scrape await browser.close(); ``` Easy to write, but if an error is thrown, the process won't exit. ## ✅ Cleaning up the browser ```js let browser; try { browser = await puppeteer.launch(); // ... do the scrape } catch (err) { console.error(err); } finally { await browser?.close(); } ``` Verbose but correct.
[Read more](https://blog.appsignal.com/2023/06/14/puppeteer-in-nodejs-more-antipatterns-to-avoid.html#not-cleaning-up-browser-and-page-handles-with-finally)
--- ## ❌ Not using `userDataDir` - Keep having to log in again and again, possibly with 2FA - Not nice to the server ## ✅ Using `userDataDir` - Login (by hand, even), maybe with 2FA and save state - Less code, great for whipping up a scrape quickly --- ## ❌ Trying to rewrite working browser code in Puppeteer when scraping - You usually start in the browser console - Wind up in element handle/async hell trying to rewrite - Code stops being portable (browser DOM can run anywhere and will always be relevant to be able to write) ## ✅ Using Puppeteer as a lightweight wrapper on browser DOM manipulation - If it already works, use it (plop into an `evaluate`) - Throw in waits where necessary - Scraping only! - Caveat: sometimes you need trusted events or better application design than a dump of browser code. - incrementally take advantage of Puppeteer idioms where appropriate - and/or organize the browser code better --- ## ❌ Writing premature abstractions - Applies to both scraping and testing - Often too clever - Tends to make correctness harder to establish--too many problems at once ## ✅ Abstracting only when motivation arrives - Generally good advice - General principle: nesting bad - A bit of duplication can be OK - Abstractions have a cost - Find the "right", clearly-motivated abstraction - [AHA! Testing](https://kentcdodds.com/blog/aha-testing) (Kent Dodds) - [Avoid nesting when testing](https://kentcdodds.com/blog/avoid-nesting-when-youre-testing) (Kent again) - [Factoring your code](https://grugbrain.dev/#grug-on-factring-your-code) (Grug-Brained Dev) --- #### ❌ Assuming there's a silver bullet - Common question: How do I scrape any website? - Many variants like: - [When is the page completely loaded?](https://stackoverflow.com/questions/52497252/puppeteer-wait-until-page-is-completely-loaded/72535642#72535642) - [How to pick a selector?](https://stackoverflow.com/questions/73917113/convenient-way-to-get-input-for-puppeteer-page-click/73928661#73928661) - Answer: it's hard. Websites throw a different curveball every time. - If we could do this, libraries wouldn't bother with all the beating around the bush and would just give us a `scrapeAnyPagePerfectly(url)` function. - LLMs haven't solved the problem. #### ✅ Writing site-specific code - Sorry, generally there's no choice. Just gotta do it the hard way! - Playwright codegen helps a bit but also not a silver bullet. #### ✅ Generalizing based on use case - For example, using [Moz Readability](https://news.ycombinator.com/item?id=39504105) can scrape most text-based pages pretty well (useful for RAGs). - For tables there are one-shots, e.g. `pandas.read_html()`. --- # Recap **When scraping**... - No obligation to act like a user - No obligation to adhere to user-visible selectors or events - Avoid the DOM - "Cheating" is often good **When testing**... - Do the opposite of everything above. **In both cases**: - Auto waiting is good - Flatter is better - Avoid premature abstractions **Puppeteer**: generating PDFs, screenshots, scraping. Simpler API, fewer distractions, less need to be user-visible, less opinionated. **Playwright**: testing, assertions. More opinionated/adheres to best practices. --- # Thanks! - [Browser Conference](https://www.browserconference.com/) (Zach!) - [AppSignal](https://blog.appsignal.com/) (Ana!) - [SerpAPI](https://serpapi.com/blog/) (Julien!) - the Stack Overflow community - My colleagues at [Qualified.io](https://www.qualified.io/) and [Andela](https://www.andela.com) - My mentees at [Codementor](https://www.codementor.io/@ggorlen) - Contributors to open source technologies discussed or used here: - [Playwright](https://github.com/microsoft/playwright/) - [Puppeteer](https://github.com/puppeteer/puppeteer/) - [Cheerio](https://github.com/cheeriojs/cheerio) - [Remark](https://github.com/gnab/remark) (markdown slides) - [Kent Dodds](https://kentcdodds.com/) - [Grug-Brained Dev](https://grugbrain.dev/) - [My partner Emily](https://www.instagram.com/emilyhuston_photography/) - Last, but not least: YOU! --- class: center, middle # Questions?