I figured out how to use a JSON API to run a very limited Google search today in a legit, non-screen-scraper way.
Google offer a product called Programmable Search Engine, which used to be called Google Custom Search.
It's intended for creating a search engine for your own site, by restricting results to specific domains - but when you create one you can opt to search the whole web instead.
You can then use their JSON API to run searches.
It's quite limited:
But it works! And it's pretty easy to get running.
First, create a new Programmable Search Engine from the dashboard. The create page is pretty straightforward.
Now get an API key - I used the button in the middle of the API documentation.
You need the "Search engine ID" from the dashboard - mine was 84ec3c54dca9646ff.
And that's it! You can combine the API key and search engine ID to run searches:
https://www.googleapis.com/customsearch/v1?key=API-KEY&cx=84ec3c54dca9646ff&q=SEARCH-TERM
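Here's a minimal Python sketch of assembling that URL, assuming you want the standard library to handle the URL-encoding for you. The function name `build_search_url` and the second example query are made up for illustration; only the endpoint and the `key`, `cx` and `q` parameters come from the URL above.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key: str, search_engine_id: str, query: str) -> str:
    """Return a search URL with key, cx and q URL-encoded (hypothetical helper)."""
    params = {"key": api_key, "cx": search_engine_id, "q": query}
    # urlencode() escapes quotes, colons and spaces, so raw queries are safe to pass in
    return f"{API_ENDPOINT}?{urlencode(params)}"


# The placeholders from the URL above
print(build_search_url("API-KEY", "84ec3c54dca9646ff", "SEARCH-TERM"))

# An illustrative query containing quotes and spaces
print(build_search_url("API-KEY", "84ec3c54dca9646ff", '"custom search" api'))
```

You could then fetch the resulting URL with curl or any HTTP client, swapping in your real API key.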
It seems to support a lot of the same search filters as Google. I tried this query, URL-encoded, and it seemed to return the results I wanted:
"powered by datasette" -site:github.com -site:simonwillison.net -site:datasette.io -site:pypi.org
The results come back as JSON that looks like this (truncated after the first result):
{
  "kind": "customsearch#search",
  "url": {
    "type": "application/json",
    "template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
  },
  "queries": {
    "request": [
      {
        "title": "Google Custom Search - \"powered by datasette\" -site:github.com -site:simonwillison.net -site:datasette.io -site:pypi.org",
        "totalResults": "65200",
        "searchTerms": "\"powered by datasette\" -site:github.com -site:simonwillison.net -site:datasette.io -site:pypi.org",
        "count": 10,
        "startIndex": 1,
        "inputEncoding": "utf8",
        "outputEncoding": "utf8",
        "safe": "off",
        "cx": "84ec3c54dca9646ff"
      }
    ],
    "nextPage": [
      {
        "title": "Google Custom Search - \"powered by datasette\" -site:github.com -site:simonwillison.net -site:datasette.io -site:pypi.org",
        "totalResults": "65200",
        "searchTerms": "\"powered by datasette\" -site:github.com -site:simonwillison.net -site:datasette.io -site:pypi.org",
        "count": 10,
        "startIndex": 11,
        "inputEncoding": "utf8",
        "outputEncoding": "utf8",
        "safe": "off",
        "cx": "84ec3c54dca9646ff"
      }
    ]
  },
  "context": {
    "title": "The whole web"
  },
  "searchInformation": {
    "searchTime": 0.25516,
    "formattedSearchTime": "0.26",
    "totalResults": "65200",
    "formattedTotalResults": "65,200"
  },
  "items": [
    {
      "kind": "customsearch#result",
      "title": "hhs",
      "htmlTitle": "hhs",
      "link": "https://hhscovid.publicaccountability.org/hhs",
      "displayLink": "hhscovid.publicaccountability.org",
      "snippet": "Powered by Datasette · Queries took 5.536ms · Data source: U.S. Department of Health & Human Services · Home · Name Search · Dataset Search · Browse Datasets.",
      "htmlSnippet": "<b>Powered by Datasette</b> · Queries took 5.536ms · Data source: U.S. Department of Health & Human Services · Home · Name Search · Dataset Search · Browse Datasets.",
      "cacheId": "QbpCTHbMliYJ",
      "formattedUrl": "https://hhscovid.publicaccountability.org/hhs",
      "htmlFormattedUrl": "https://hhscovid.publicaccountability.org/hhs",
      "pagemap": {
        "metatags": [
          {
            "viewport": "width=device-width, initial-scale=1, shrink-to-fit=no"
          }
        ]
      }
    }
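To make the shape of that response concrete, here's a rough Python sketch that pulls the title, link and snippet out of `items` and reads the `startIndex` of the next page from `queries.nextPage`. The helper names are hypothetical, and the sample dictionary is a hand-trimmed version of the response above (with the snippet shortened), not a full API reply.

```python
# Hand-trimmed sample using only the keys this sketch reads
sample_response = {
    "queries": {"nextPage": [{"count": 10, "startIndex": 11}]},
    "searchInformation": {"totalResults": "65200"},
    "items": [
        {
            "title": "hhs",
            "link": "https://hhscovid.publicaccountability.org/hhs",
            "snippet": "Powered by Datasette · Queries took 5.536ms",
        }
    ],
}


def summarize_results(response: dict) -> list:
    """Return title, link and snippet for each item (hypothetical helper)."""
    return [
        {
            "title": item.get("title"),
            "link": item.get("link"),
            "snippet": item.get("snippet"),
        }
        for item in response.get("items", [])
    ]


def next_start_index(response: dict):
    """Return the startIndex of the next page, or None if there isn't one."""
    next_page = response.get("queries", {}).get("nextPage", [])
    return next_page[0]["startIndex"] if next_page else None


for result in summarize_results(sample_response):
    print(result["title"], result["link"])

# totalResults arrives as a string, so convert it before doing arithmetic
print(int(sample_response["searchInformation"]["totalResults"]))
print(next_start_index(sample_response))
```

Going by the `start={startIndex?}` slot in the URL template, the next page should be reachable by adding `&start=11` to the request, though that isn't something the original post tried.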
As a bonus, you can pipe results into a SQLite database using sqlite-utils like this:
curl 'https://www.googleapis.com/customsearch...' | \
jq .items | sqlite-utils insert /tmp/search.db search -
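If you'd rather do that step from Python, here's a rough equivalent using only the standard library's `sqlite3` module. This is a stand-in for illustration, not what sqlite-utils does internally: the function name `insert_items` and the fixed four-column schema are assumptions, whereas the command above is handed the complete `items` array.

```python
import json
import sqlite3


def insert_items(db_path: str, items: list, table: str = "search") -> int:
    """Insert selected fields of each result into SQLite; return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS [{table}] "
            "(title TEXT, link TEXT, snippet TEXT, pagemap TEXT)"
        )
        conn.executemany(
            f"INSERT INTO [{table}] (title, link, snippet, pagemap) "
            "VALUES (?, ?, ?, ?)",
            [
                (
                    item.get("title"),
                    item.get("link"),
                    item.get("snippet"),
                    json.dumps(item.get("pagemap")),  # nested data kept as JSON text
                )
                for item in items
            ],
        )
        conn.commit()
        return conn.execute(f"SELECT COUNT(*) FROM [{table}]").fetchone()[0]
    finally:
        conn.close()


# Example with one item shaped like the response above, using an in-memory database
example_items = [
    {
        "title": "hhs",
        "link": "https://hhscovid.publicaccountability.org/hhs",
        "snippet": "Powered by Datasette",
        "pagemap": {"metatags": [{"viewport": "width=device-width"}]},
    }
]
print(insert_items(":memory:", example_items))
```

Pass a file path such as `/tmp/search.db` instead of `:memory:` to keep the rows around.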
Created 2023-09-16T17:00:48-07:00