Google Chrome --headless mode

In the README for monolith (a new Rust CLI tool for archiving HTML pages along with their images and assets) I spotted this tip for using Chrome in headless mode to execute JavaScript and output the resulting DOM:

chromium --headless --incognito --dump-dom https://github.com \
  | monolith - -I -b https://github.com -o github.html

I didn't know about that --headless option, so I had a poke around to see if it works on macOS. And it does!

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --headless --dump-dom \
  https://github.com

That spits out the rendered DOM from the GitHub home page. The --incognito flag doesn't seem to be necessary: it didn't use my existing cookies when I ran it without that flag.

Add > /tmp/github.html to write that output to a file.
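Typing the full macOS path every time gets tedious, so a small wrapper function can help. This is a minimal sketch, assuming the Chrome path shown above; CHROME_BIN and dump_dom are names invented for this example, not anything Chrome provides.

```shell
# CHROME_BIN is a hypothetical variable for this sketch: set it to "chromium"
# (or any other Chrome binary) if yours lives somewhere else.
CHROME_BIN="${CHROME_BIN:-/Applications/Google Chrome.app/Contents/MacOS/Google Chrome}"

# dump_dom URL OUTFILE
# Writes the rendered DOM for URL to OUTFILE, using the same flags as above.
dump_dom() {
  "$CHROME_BIN" --headless --dump-dom "$1" > "$2"
}
```

With that defined, dump_dom https://github.com /tmp/github.html should be equivalent to the command above with the redirect added.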

Screenshots and PDFs

I found more documentation in Getting Started with Headless Chrome, a blog entry the Chrome team published when they released the feature in 2017.

Here's how to take a screenshot:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --headless \
  --screenshot=/tmp/shot1.png \
  https://simonwillison.net

And here's a screenshot with a custom width and height:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --headless \
  --window-size=375,667 \
  --screenshot=/tmp/shot2.png \
  https://simonwillison.net
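The two flags combine naturally into a loop if you want the same page at several viewport sizes. Here is a rough sketch: the shots function, the CHROME_BIN variable and the WIDTHxHEIGHT argument format are all assumptions made for this example, and only the --window-size and --screenshot flags come from the commands above.

```shell
# CHROME_BIN is a hypothetical variable for this sketch; override as needed.
CHROME_BIN="${CHROME_BIN:-/Applications/Google Chrome.app/Contents/MacOS/Google Chrome}"

# shots URL PREFIX SIZE...
# Takes one screenshot of URL per SIZE (written as WIDTHxHEIGHT, e.g. 375x667)
# and saves each one to PREFIX-WIDTHxHEIGHT.png.
shots() {
  url="$1"
  prefix="$2"
  shift 2
  for size in "$@"; do
    width="${size%x*}"   # text before the "x"
    height="${size#*x}"  # text after the "x"
    "$CHROME_BIN" --headless \
      --window-size="$width,$height" \
      --screenshot="$prefix-$size.png" \
      "$url"
  done
}
```

For example, shots https://simonwillison.net /tmp/shot 375x667 1280x800 would run Chrome twice and write /tmp/shot-375x667.png and /tmp/shot-1280x800.png.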

To save the full-length page as a multi-page PDF:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --headless \
  --print-to-pdf=/tmp/page.pdf \
  https://simonwillison.net

Here's the output PDF for that.
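If you script this, it can be worth checking that the file Chrome wrote really is a PDF before using it. PDF files begin with the bytes %PDF-, so a quick check might look like this sketch (is_pdf is a name made up for the example):

```shell
# is_pdf FILE
# Succeeds (exit status 0) if FILE starts with the "%PDF-" signature.
is_pdf() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}
```

You could then run something like is_pdf /tmp/page.pdf && echo "looks like a PDF" after the command above.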

--repl doesn't work for me

The documentation mentioned this option as something that would start a REPL prompt for interacting with the page using JavaScript:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --headless \
  --repl \
  https://simonwillison.net

This didn't work for me. Maybe they removed that feature?

Comparison to shot-scraper

I didn't know about this --headless mode when I built my shot-scraper tool for headless screenshotting and scraping of web pages using Playwright, which drives Chromium (and other browsers) under the hood.

shot-scraper is a lot more ergonomic and has a lot more features, but it's also quite a bit slower if you just want to take a single screenshot.

The shot-scraper equivalent of the above commands would be:

# Full-page screenshot
shot-scraper 'https://simonwillison.net' -o /tmp/shot3.png

# Custom size screenshot
shot-scraper 'https://simonwillison.net' -o /tmp/shot4.png --width 375 --height 667

# HTML snapshot
shot-scraper html 'https://simonwillison.net'

# PDF
shot-scraper pdf 'https://simonwillison.net' -o /tmp/page2.pdf

One of the more exciting features of shot-scraper is its ability to take multiple screenshots defined in a YAML file:

echo '- output: example.com.png
  url: http://www.example.com/
- output: w3c.org.png
  url: https://www.w3.org/' | shot-scraper multi -
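Because that YAML is just a list of output and url pairs, it is easy to generate from a list of URLs. A minimal sketch, assuming you are happy to name each file after the URL's hostname (multi_yaml and that naming scheme are inventions for this example):

```shell
# multi_yaml URL...
# Prints one "- output: / url:" entry per URL, in the format shown above.
# Each output file is named HOSTNAME.png, a choice made for this sketch.
multi_yaml() {
  for url in "$@"; do
    host="${url#*://}"  # drop the scheme, e.g. "https://"
    host="${host%%/*}"  # drop everything from the first "/" onwards
    printf '%s\n' "- output: $host.png" "  url: $url"
  done
}
```

The result can be piped straight into the command above: multi_yaml http://www.example.com/ https://www.w3.org/ | shot-scraper multi -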

Another is its ability to scrape data from a page by executing JavaScript and returning the result as JSON:

shot-scraper javascript https://til.simonwillison.net/chrome/headless "
async () => {
  const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
  return (new readability.Readability(document)).parse();
}"

Created 2024-03-24T16:27:03-07:00, updated 2024-03-24T17:06:20-07:00
