Optimizing large-context models
As expected (see #4 in “Big Post About Big Context”), solutions are beginning to emerge that make working with large contexts more economically viable.
Google has announced Context Caching (https://ai.google.dev/gemini-api/docs/caching) for Gemini. In this mode, a portion of the prompt tokens can be cached (for an hourly storage fee) and reused across repeated requests, for example against a large book or the long history of a previous chatbot conversation. This is cheaper than resending them as input tokens every time.
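For a sense of how this looks in practice, here is a rough sketch based on the Python SDK (google-generativeai) described in the linked docs; treat the exact class and method names as an assumption, since they may differ across SDK versions:

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# Cache a large document once; storage is billed per hour.
# The model name, file, and TTL here are illustrative values.
with open("big_book.txt") as f:
    book_text = f.read()

cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    display_name="big-book",
    contents=[book_text],
    ttl=datetime.timedelta(hours=1),  # storage fee accrues per hour
)

# Subsequent requests reuse the cached tokens at the reduced rate
# instead of resending the whole book as input tokens.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Summarize the third chapter.")
print(response.text)
```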
At the moment, this is relevant only for Gemini 1.5 Pro (which has a context size of 1M-2M, and potentially 10M and beyond), and the cost (https://ai.google.dev/pricing) of cached tokens is reportedly about half that of resending them. On top of that, token prices differ for prompts up to 128k and beyond; I suspect that at this rate we’ll soon see entire pricing grids for LLMs :)
If you put 1M tokens into a prompt, it costs $7.00 per million tokens at the over-128K tier; caching them costs $3.50 per million tokens at that same tier, plus an additional $4.50 per million tokens per hour for storage. It’s not a game changer in terms of cost yet, but it’s still some optimization.
For real use cases, it still seems expensive. If you fill the entire 1M-token prompt with your content, a single request costs $7. With caching, the first request (which creates the cache) still costs that much, and the next one actually costs more ($3.50 for the cached tokens plus $4.50 for an hour of storage, i.e. $8), but if you send many requests within the hour, you save something, approaching but never exceeding half (see the sketch below).
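A minimal back-of-the-envelope sketch of that break-even logic, using the prices quoted above (the over-128K tier, and assuming, as described, that the first request pays the full input rate to create the cache):

```python
INPUT_PRICE = 7.00    # $ per 1M input tokens, resent every request
CACHED_PRICE = 3.50   # $ per 1M cached tokens, per request
STORAGE_PRICE = 4.50  # $ per 1M tokens per hour of cache storage

def cost_resend(n_requests: int, prompt_mtok: float = 1.0) -> float:
    """Every request resends the full prompt as input tokens."""
    return n_requests * prompt_mtok * INPUT_PRICE

def cost_cached(n_requests: int, hours: float = 1.0,
                prompt_mtok: float = 1.0) -> float:
    """First request writes the cache at the full input rate;
    later requests read it at the cached rate; storage bills hourly."""
    return (prompt_mtok * INPUT_PRICE
            + (n_requests - 1) * prompt_mtok * CACHED_PRICE
            + hours * prompt_mtok * STORAGE_PRICE)

for n in (2, 5, 20, 100):
    print(f"{n:>3} requests/hour: resend ${cost_resend(n):7.2f}"
          f"  vs  cached ${cost_cached(n):7.2f}")
# 2 requests: $14 vs $15 -- caching loses; 100 requests: $700 vs $358,
# approaching (but never reaching) the 2x saving.
```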
The cost of output tokens is $21.00 per million (for prompts longer than 128K), but it’s hard to run up a full million there, since the output size is capped at 8k tokens (https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models). So for summaries or answers to questions, the marginal cost per question/summary is small: at worst, less than 20 cents.
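That worst case is easy to check (assuming the 8k cap means 8,192 tokens):

```python
OUTPUT_PRICE = 21.00       # $ per 1M output tokens (over-128K tier)
MAX_OUTPUT_TOKENS = 8_192  # assumed value of the 8k output cap

print(f"${MAX_OUTPUT_TOKENS / 1e6 * OUTPUT_PRICE:.3f} per maximal answer")
# -> $0.172, i.e. under 20 cents even at the cap
```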
What kind of requests you are willing to pay $7 for (or $14, if you fill the new Gemini 1.5 Pro’s 2M-token prompt) is a tricky question. In my original post, I was guided by the prices of Gemini 1.0 Pro, since 1.5 was still experimental and its prices hadn’t been announced. It seems the price of Gemini 1.0 Pro has since increased (I previously used $0.125 per 1M tokens; it’s now $0.5), and 1.5, compared to those estimates, is insanely expensive.
It will be interesting to see in which cases the economics make sense here. This is a very high price threshold, clearly not for the mass market.