In the README for monolith (a new Rust CLI tool for archiving HTML pages along with their images and assets) I spotted this tip for using Chrome in headless mode to execute JavaScript and output the resulting DOM:
chromium --headless --incognito --dump-dom https://github.com \
| monolith - -I -b https://github.com -o github.html
I didn't know about that --headless
option, so I had a poke around to see if it works on macOS. And it does!
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless --dump-dom \
https://github.com
That spits out the rendered DOM from the GitHub home page. The --incognito
flag doesn't seem to be necessary - it didn't use my existing cookies when I ran it without.
Add > /tmp/github.html
to write that output to a file.
I found more documentation in Getting Started with Headless Chrome, a blog entry published when they released the feature in 2017.
Here's how to take a screenshot:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless \
--screenshot=/tmp/shot1.png \
https://simonwillison.net
And here's a screenshot with a custom width and height:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless \
--window-size=375,667 \
--screenshot=/tmp/shot2.png \
https://simonwillison.net
For a multi-page PDF of the full length page:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless \
--print-to-pdf=/tmp/page.pdf \
https://simonwillison.net
Here's the output PDF for that.
The documentation mentioned this option as something that would start a REPL prompt for interacting with the page using JavaScript:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--headless \
--repl \
https://simonwillison.net
This didn't work for me. Maybe they removed that feature?
@dayson pointed me to Chrome’s Headless mode gets an upgrade: introducing --headless=new providing further documentation of the most recent changes.
Frustratingly that documentation doesn't include a clear date, but it mentions features that are new in Chrome 112 which I found was released in March 2023.
I don't know if you still need to run --headless=new
for these new features. The two that caught my eye were --timeout 1000
for capturing the page after the specified number of milliseconds, and the intriguing alternative --virtual-time-budget=5000
which fakes the internal clock in Chrome to behave as if five seconds have passed and take the screenshot then, while not actually waiting those five seconds.
That page also mentions --print-to-pdf --no-pdf-header-footer
. I found out elsewhere about --hide-scrollbars
for screenshots from @DJDellsperger.
I didn't know about this --headless
mode when I built my shot-scraper tool for headless screenshotting and scraping of web pages using Playwright, which drives Chromium (and other browsers) under the hood.
shot-scraper
is a lot more ergonomic and has a lot more features, but it's also quite a bit slower if you just want to take a single screenshot.
The shot-scraper
equivalent of the above commands would be:
# Full-page screenshot
shot-scraper 'https://simonwillison.net' -o /tmp/shot3.png
# Custom size screenshot
shot-scraper 'https://simonwillison.net' -o /tmp/shot4.png --width 375 --height 667
# HTML snapshot
shot-scraper html 'https://simonwillison.net'
# PDF
shot-scraper pdf 'https://simonwillison.net' -o /tmp/page2.pdf
The more exciting features of shot-scraper
are its ability to take multiple screenshots defined in a YAML file:
echo '- output: example.com.png
url: http://www.example.com/
- output: w3c.org.png
url: https://www.w3.org/' | shot-scraper multi -
And its ability to scrape data from a page by executing JavaScript and returning the result as JSON:
shot-scraper javascript https://til.simonwillison.net/chrome/headless "
async () => {
const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
return (new readability.Readability(document)).parse();
}"
Created 2024-03-24T16:27:03-07:00, updated 2024-03-24T17:06:20-07:00 · History · Edit