Running prompts against images, PDFs, audio and video with Google Gemini

I'm still working towards adding multi-modal support to my LLM tool. In the meantime, here are notes on running prompts against images, PDFs, audio and video files from the command line using the Google Gemini family of models.

Update: I integrated the research from this TIL into my LLM tool, which can now run multi-modal prompts against Gemini like this:
llm -m gemini-1.5-flash "describe this image" -a image.jpg
See "You can now run prompts against images, audio and video in your terminal using LLM" for details.

Using curl

Here's the initial recipe I figured out using curl.

The Gemini models take a JSON document sent via POST that looks like this:

{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Extract text from this image"
        },
        {
          "inlineData": {
            "data": "... base 64 encoded image data ...",
            "mimeType": "image/png"
          }
        }
      ]
    }
  ]
}

So the first challenge is to construct that document, including the base64 encoded image.

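On macOS you can encode a file using base64 -i image.png. GNU base64 on Linux wraps its output at 76 characters by default, which would put line breaks inside the JSON string, so there use base64 -w 0 image.png instead.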
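Whichever flags your platform needs, the encoded data has to end up as a single unbroken line before it goes inside the JSON string. Here's a minimal sketch of a wrapper that should behave the same on macOS and Linux, assuming only that base64 reads standard input when given no file argument. The function name b64_oneline is made up for this example.

```bash
# Base64-encode a file as one line, regardless of platform defaults.
# Reading from stdin sidesteps the -i / -w flag differences; tr strips any wrapping.
b64_oneline() {
    base64 < "$1" | tr -d '\n'
}

# Usage:
#   b64_oneline image.png
```

Stripping newlines with tr is harmless when the output is already a single line, so the same function can be used on either platform.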

So we can create the JSON document like this:

cat <<EOF > input.json
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Extract text from this image"
        },
        {
          "inlineData": {
            "data": "$(base64 -i image.png)",
            "mimeType": "image/png"
          }
        }
      ]
    }
  ]
}
EOF

This creates an input.json file containing the base64 encoded image, ready to be sent to the Gemini API.
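The heredoc approach works as long as the prompt text contains no double quotes or backslashes. As an alternative, here's a sketch that lets jq assemble the document so the prompt is escaped properly. It assumes jq is installed, and build_gemini_request is a hypothetical helper name, not part of any Gemini tooling.

```bash
# Print a Gemini request document to stdout.
# Arguments: prompt text, MIME type, path to the file to attach.
build_gemini_request() {
    local prompt="$1" mime_type="$2" file="$3"
    # The base64 data arrives on stdin (-R raw, -s slurped into one string),
    # so large files never have to fit on the command line.
    base64 < "$file" | tr -d '\n' | jq -Rs \
        --arg prompt "$prompt" \
        --arg mime "$mime_type" \
        '{contents: [{role: "user", parts: [
            {text: $prompt},
            {inlineData: {data: ., mimeType: $mime}}
        ]}]}'
}

# Usage:
#   build_gemini_request 'Extract text from this image' image/png image.png > input.json
```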

Now we can send it using curl:

export GOOGLE_API_KEY='... your key here ...'

curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-8b-latest:generateContent?key=$GOOGLE_API_KEY" \
  -H 'Content-Type: application/json' \
  -X POST \
  -d @input.json

The model name goes in the URL - here I'm using gemini-1.5-flash-8b-latest, Google's cheapest and fastest model.
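The key can also travel in a request header instead of the query string, which keeps it out of the URL. Here's a sketch of that variant as a pair of shell functions. The names gemini_url and gemini_generate are invented for illustration, and it assumes the x-goog-api-key header that Google's API documentation uses in its own curl examples.

```bash
# Build the generateContent endpoint URL for a given model name.
gemini_url() {
    printf 'https://generativelanguage.googleapis.com/v1beta/models/%s:generateContent' "$1"
}

# POST a JSON request file to the model and print the raw response.
# Assumes GOOGLE_API_KEY is exported; the key is sent as a header, not in the URL.
gemini_generate() {
    local model="$1" request_file="$2"
    curl -s "$(gemini_url "$model")" \
        -H 'Content-Type: application/json' \
        -H "x-goog-api-key: $GOOGLE_API_KEY" \
        -X POST \
        -d @"$request_file"
}

# Usage:
#   gemini_generate gemini-1.5-flash-8b-latest input.json
```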

Model values you can use are:

gemini-1.5-flash-8b-latest - the cheapest and fastest model, $0.0375/million input tokens, 0.001 cents per image
gemini-1.5-flash-latest - the one in the middle, $0.075/million input tokens, 0.0019 cents per image
gemini-1.5-pro-latest - the most powerful model, $1.25/million input tokens, 0.0323 cents per image

It's hard to overstate how cheap these models are. An input image is charged at 258 tokens, so the price per image is measured in fractions of a cent. Those numbers above really are correct: an image sent through Gemini Pro costs less than 1/30th of a cent, and the other two models are even cheaper.
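The per-image figures are just 258 tokens multiplied by the price per token. Here's a small sketch of that arithmetic using awk. The function name input_cost_cents is made up for this example, and the price you pass in is whatever the current price list says.

```bash
# Cost in cents for a number of input tokens at a given price.
# Arguments: price in dollars per million tokens, token count (default 258, one image).
input_cost_cents() {
    awk -v price="$1" -v tokens="${2:-258}" \
        'BEGIN { printf "%.5f\n", tokens * price / 1000000 * 100 }'
}

# Usage:
#   input_cost_cents 1.25          # one image at $1.25/million input tokens
#   input_cost_cents 1.25 12842    # a larger prompt at the same price
```

For the $1.25/million Pro price this prints 0.03225, which is the 0.0323 cents per image listed above before rounding.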

You get charged for output tokens too, which vary depending on the length of the response. Use my LLM pricing calculator to explore those.

The output of a prompt includes a "usageMetadata" section that shows you exactly how many tokens you spent. Here's example output for the prompt "extract text from this image" against this image:

{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "Example handwriting\nLet's try this out"
          }
        ],
        "role": "model"
      },
      "finishReason": "STOP",
      "safetyRatings": [
        {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HARASSMENT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "probability": "NEGLIGIBLE"
        }
      ],
      "avgLogprobs": -0.000025986179631824296
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 264,
    "candidatesTokenCount": 9,
    "totalTokenCount": 273
  },
  "modelVersion": "gemini-1.5-flash-8b-001"
}

Total cost: 0.0011 cents.
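The 264 prompt tokens in that response are consistent with 258 for the image plus a handful for the text prompt. If you only care about the token counts, a jq filter can pull them out of the response. This is a sketch that assumes the response has the usageMetadata shape shown above, and usage_summary is a hypothetical helper name.

```bash
# Summarize token usage from a Gemini JSON response read on stdin.
usage_summary() {
    jq -r '.usageMetadata
        | "\(.promptTokenCount) input + \(.candidatesTokenCount) output = \(.totalTokenCount) total tokens"'
}

# Usage:
#   curl -s ... -d @input.json | usage_summary
#   usage_summary < response.json
```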

Using a Bash script

I got Claude to write me a script to automate this process. Here's how you can use it:

export GOOGLE_API_KEY='... your key here ...'

prompt-gemini 'extract text from this image' example-handwriting.jpg

It accepts PNG, JPG, GIF, PDF, MP3 or MP4 files, automatically sending the correct mimeType to the API. Note that PDFs with multiple pages are charged differently - I tried a 19-page PDF and it cost 12,842 tokens, suggesting around 676 tokens per page.

You can also add a -m option to specify a different model:

prompt-gemini 'extract text from this image' example-handwriting.jpg -m pro

Shortcuts pro, flash and 8b are supported - it defaults to the cheapest 8b model.

By default it outputs the full JSON response, so you can see things like the "usageMetadata" block. To output just the raw returned text add -r:

prompt-gemini 'extract text from this image' example-handwriting.jpg -r

Example handwriting
Let's try this out
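One catch with a filter like this is that a failed request has no candidates, so it prints null rather than anything useful. Here's a sketch of a more defensive extraction step. It assumes failures come back as a JSON object with an error.message field, which is the usual shape for Google API errors but is not something the script itself relies on. The function name extract_text is made up for this example.

```bash
# Read a Gemini JSON response on stdin and print the first text part.
# If the response carries an "error" object, print its message to stderr
# and return a non-zero status instead.
extract_text() {
    local response
    response=$(cat)
    if printf '%s' "$response" | jq -e '.error' > /dev/null 2>&1; then
        printf '%s' "$response" | jq -r '"API error: " + .error.message' >&2
        return 1
    fi
    printf '%s' "$response" | jq -r '.candidates[0].content.parts[0].text'
}

# Usage:
#   prompt-gemini 'extract text from this image' example-handwriting.jpg | extract_text
```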

Here's the script. It needs curl and jq to be installed. Save it somewhere on your PATH and run chmod 755 prompt-gemini to make it executable:

#!/bin/bash

# Check if GOOGLE_API_KEY is set
if [ -z "$GOOGLE_API_KEY" ]; then
    echo "Error: GOOGLE_API_KEY environment variable is not set" >&2
    exit 1
fi

# Default model and options
model="8b"
prompt=""
image_file=""
jq_filter="."

# Parse arguments
while [[ $# -gt 0 ]]; do
    case $1 in
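        -m)
            # Guard against -m with no value, which would otherwise loop forever
            if [ -z "$2" ]; then
                echo "Error: -m requires a model name" >&2
                exit 1
            fi
            model="$2"
            shift 2
            ;;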
        -r)
            jq_filter=".candidates[0].content.parts[0].text"
            shift
            ;;
        *)
            if [ -z "$prompt" ]; then
                prompt="$1"
            elif [ -z "$image_file" ]; then
                image_file="$1"
            fi
            shift
            ;;
    esac
done

# Validate prompt
if [ -z "$prompt" ]; then
    echo "Error: No prompt provided" >&2
    echo "Usage: prompt-gemini \"prompt\" [image_file] [-m model] [-r]" >&2
    exit 1
fi

# Map model names to full model strings
case $model in
    "8b"|"flash-8b")
        model_string="gemini-1.5-flash-8b-latest"
        ;;
    "flash")
        model_string="gemini-1.5-flash-latest"
        ;;
    "pro")
        model_string="gemini-1.5-pro-latest"
        ;;
    *)
        model_string="gemini-1.5-$model"
        ;;
esac

# Create temporary file
temp_file=$(mktemp)
trap 'rm -f "$temp_file"' EXIT

# Determine mime type if image file is provided
if [ -n "$image_file" ]; then
    if [ ! -f "$image_file" ]; then
        echo "Error: File '$image_file' not found" >&2
        exit 1
    fi

    # Get file extension and convert to lowercase
    ext=$(echo "${image_file##*.}" | tr '[:upper:]' '[:lower:]')
    
    case $ext in
        png)
            mime_type="image/png"
            ;;
        jpg|jpeg)
            mime_type="image/jpeg"
            ;;
        gif)
            mime_type="image/gif"
            ;;
        pdf)
            mime_type="application/pdf"
            ;;
        mp3)
            mime_type="audio/mp3"
            ;;
        mp4)
            mime_type="video/mp4"
            ;;
        *)
            echo "Error: Unsupported file type .$ext" >&2
            exit 1
            ;;
    esac

    # Create JSON with image data
    cat <<EOF > "$temp_file"
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": $(printf '%s' "$prompt" | jq -Rs .)
        },
        {
          "inlineData": {
            "data": "$(base64 -i "$image_file" | tr -d '\n')",
            "mimeType": "$mime_type"
          }
        }
      ]
    }
  ]
}
EOF
else
    # Create JSON without image data
    cat <<EOF > "$temp_file"
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": $(printf '%s' "$prompt" | jq -Rs .)
        }
      ]
    }
  ]
}
EOF
fi

# Make API request with jq filter
curl -s "https://generativelanguage.googleapis.com/v1beta/models/$model_string:generateContent?key=$GOOGLE_API_KEY" \
    -H 'Content-Type: application/json' \
    -X POST \
    -d @"$temp_file" | jq "$jq_filter" -r
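The script checks for the API key but not for the tools it shells out to. Here's a sketch of a check that could sit near the top of it, assuming curl, jq and base64 are the only external commands it needs. The function name require_cmds is made up for this example.

```bash
# Report every missing command on stderr; return non-zero if any are missing.
require_cmds() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" > /dev/null 2>&1; then
            echo "Error: required command '$cmd' not found" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, before any other work is done:
#   require_cmds curl jq base64 || exit 1
```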

How I got Claude to write the Bash script

Here's the prompt I fed to Claude to create this, starting with the Bash + curl recipe I had already figured out:

cat <<EOF > input.json
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Extract text from this imaage"
        },
        {
          "inlineData": {
            "data": "$(base64 -i output_0.png)",
            "mimeType": "image/png"
          }
        }
      ]
    }
  ]
}
EOF

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-8b-latest:generateContent?key=$GOOGLE_API_KEY" \
  -H 'Content-Type: application/json' \
  -X POST \
  -d @input.json | jq

Turn this into a Bash script that runs like this:

prompt-gemini "this is the prompt"
prompt-gemini "This is the prompt" blah.png
prompt-gemini "This is the prompt" blah.pdf
prompt-gemini "this is the prompt" -m pro

It should exit with an error if GOOGLE_API_KEY is not set

It should use a temporary file for input.json which is deleted on completion

If no file was provided it should skip the inlineData bit

It should use the correct mimeType for PNG or PDF or JPG or JPEG or GIF depending on the file extension

The -m option should follow the following rules: it defaults to 8b, or it can be:

8b => gemini-1.5-flash-8b-latest (the default)
flash-8b => gemini-1.5-flash-8b-latest
flash => gemini-1.5-flash-latest
pro => gemini-1.5-pro-latest

Any other value should be passed used directly in the gemini-1.5-flash:generateContent portion of the URL

Here's the full Claude transcript.

Then I added the -r option by pasting in the previous script and prompting:

Modify this script to add an extra -r option which, if present, causes the final line to pipe through jq like this:
... | jq '.candidates[0].content.parts[0].text' -r

Claude transcript here.

Created 2024-10-23T10:49:47-07:00, updated 2024-10-31T07:58:31-07:00