In the README for monolith (a new Rust CLI tool for archiving HTML pages along with their images and assets) I spotted this tip for using Chrome in headless mode to execute JavaScript and output the resulting DOM:
chromium --headless --incognito --dump-dom https://github.com \
| monolith - -I -b https://github.com -o github.htmlI didn't know about that --headless option, so I had a poke around to see if it works on macOS. And it does!
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless --dump-dom \
https://github.comThat spits out the rendered DOM from the GitHub home page. The --incognito flag doesn't seem to be necessary - it didn't use my existing cookies when I ran it without.
Add > /tmp/github.html to write that output to a file.
I found more documentation in Getting Started with Headless Chrome, a blog entry published when they released the feature in 2017.
Here's how to take a screenshot:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless \
--screenshot=/tmp/shot1.png \
https://simonwillison.netAnd here's a screenshot with a custom width and height:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless \
--window-size=375,667 \
--screenshot=/tmp/shot2.png \
https://simonwillison.netFor a multi-page PDF of the full length page:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless \
--print-to-pdf=/tmp/page.pdf \
https://simonwillison.netHere's the output PDF for that.
The documentation mentioned this option as something that would start a REPL prompt for interacting with the page using JavaScript:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless \
--repl \
https://simonwillison.netThis didn't work for me. Maybe they removed that feature?
@dayson pointed me to Chrome’s Headless mode gets an upgrade: introducing --headless=new providing further documentation of the most recent changes.
Frustratingly that documentation doesn't include a clear date, but it mentions features that are new in Chrome 112 which I found was released in March 2023.
I don't know if you still need to run --headless=new for these new features. The two that caught my eye were --timeout 1000 for capturing the page after the specified number of milliseconds, and the intriguing alternative --virtual-time-budget=5000 which fakes the internal clock in Chrome to behave as if five seconds have passed and take the screenshot then, while not actually waiting those five seconds.
That page also mentions --print-to-pdf --no-pdf-header-footer. I found out elsewhere about --hide-scrollbars for screenshots from @DJDellsperger.
I didn't know about this --headless mode when I built my shot-scraper tool for headless screenshotting and scraping of web pages using Playwright, which drives Chromium (and other browsers) under the hood.
shot-scraper is a lot more ergonomic and has a lot more features, but it's also quite a bit slower if you just want to take a single screenshot.
The shot-scraper equivalent of the above commands would be:
# Full-page screenshot
shot-scraper 'https://simonwillison.net' -o /tmp/shot3.png
# Custom size screenshot
shot-scraper 'https://simonwillison.net' -o /tmp/shot4.png --width 375 --height 667
# HTML snapshot
shot-scraper html 'https://simonwillison.net'
# PDF
shot-scraper pdf 'https://simonwillison.net' -o /tmp/page2.pdfThe more exciting features of shot-scraper are its ability to take multiple screenshots defined in a YAML file:
echo '- output: example.com.png
url: http://www.example.com/
- output: w3c.org.png
url: https://www.w3.org/' | shot-scraper multi -And its ability to scrape data from a page by executing JavaScript and returning the result as JSON:
shot-scraper javascript https://til.simonwillison.net/chrome/headless "
async () => {
const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
return (new readability.Readability(document)).parse();
}"Created 2024-03-24T16:27:03-07:00, updated 2024-03-24T17:06:20-07:00 · History · Edit