Most LLMs come with configuration options that control their output. Effective prompt engineering requires setting these options appropriately for your task.

Output length

An important configuration setting is the number of tokens to generate in a response. Generating more tokens requires more computation from the LLM, leading to higher energy consumption, potentially slower response times, and higher costs.

Reducing the output length doesn’t make the LLM more stylistically or textually succinct in the output it creates; it just causes the LLM to stop predicting tokens once the limit is reached. If you need a short output length, you may also need to engineer your prompt to accommodate it.

Tip

Output length restriction is especially important for some LLM prompting techniques, like ReAct, where the LLM will keep emitting useless tokens after the response you want.
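To make the truncation behavior concrete, here is a minimal sketch of a generation loop with a token limit. The names `generate`, `toy_model`, `max_output_tokens`, and the `<eos>` stop token are illustrative assumptions, not any particular vendor's API; the "model" simply replays a fixed sentence.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_output_tokens: int,
             stop_token: str = "<eos>") -> List[str]:
    """Emit tokens until the stop token appears or the limit is reached."""
    output: List[str] = []
    while len(output) < max_output_tokens:
        token = next_token(prompt + output)
        if token == stop_token:
            break
        output.append(token)
    return output


# Hypothetical stand-in for an LLM: it replays one fixed sentence.
SENTENCE = "the quick brown fox jumps over the lazy dog <eos>".split()


def toy_model(context: List[str]) -> str:
    # The prompt is a single token, so len(context) - 1 tokens were generated so far.
    position = min(len(context) - 1, len(SENTENCE) - 1)
    return SENTENCE[position]


full = generate(toy_model, ["<bos>"], max_output_tokens=50)
cut = generate(toy_model, ["<bos>"], max_output_tokens=4)
print(" ".join(full))
print(" ".join(cut))
```

With the lower limit, the output is just the first four tokens of the longer output. The limit cuts the text off; it does not make the text more concise.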

Sampling controls

LLMs do not formally predict a single token. Rather, LLMs predict probabilities for what the next token could be, with each token in the LLM’s vocabulary getting a probability. Those token probabilities are then sampled to determine what the next produced token will be. Temperature, top-K, and top-P are the most common configuration settings that determine how predicted token probabilities are processed to choose a single output token.
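As a rough illustration of sampling from predicted probabilities, the sketch below draws next tokens from a small, made-up distribution. The four-token vocabulary, its probabilities, and the helper name `sample_token` are assumptions for demonstration only.

```python
import random

# Hypothetical next-token probabilities for a tiny vocabulary.
probs = {"mat": 0.5, "sofa": 0.3, "roof": 0.15, "moon": 0.05}


def sample_token(probs, rng):
    """Draw one token, with each token weighted by its predicted probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {t: 0 for t in probs}
for _ in range(10_000):
    counts[sample_token(probs, rng)] += 1

print(counts)
```

Over many draws, the counts roughly follow the probabilities, but any single draw can produce a less likely token. The settings discussed next change how this draw is made.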

Temperature

Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that expect a more deterministic response, while higher temperatures can lead to more diverse or unexpected results. A temperature of 0 (greedy decoding) is deterministic: the highest-probability token is always selected. Any temperature above 0 introduces randomness.

Ties

If two tokens have the same highest predicted probability, you may not always get the same output with temperature 0, depending on how tiebreaking is implemented.

The Gemini temperature control can be understood in a similar way to the softmax function used in machine learning:

  • A low temperature setting mirrors a low softmax temperature (T): probability concentrates on a single, preferred token, which is then selected with high certainty.
  • A higher Gemini temperature setting is like a high softmax temperature: probability spreads more evenly, so a wider range of tokens becomes acceptable.

This increased uncertainty suits scenarios where a single, precise token choice is not essential, for example when experimenting with creative outputs.
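The softmax analogy can be sketched in a few lines. This is a generic temperature-scaled softmax, not Gemini's actual implementation; the example logits are made up, and treating temperature 0 as "pick the first maximum" is one possible tiebreaking choice among several.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: all probability goes to the top-scoring token.
        # Assumption: ties are broken by taking the first maximum.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
low = softmax_with_temperature(logits, 0.2)
mid = softmax_with_temperature(logits, 1.0)
high = softmax_with_temperature(logits, 5.0)

for name, dist in [("T=0.2", low), ("T=1.0", mid), ("T=5.0", high)]:
    print(name, [round(p, 3) for p in dist])
```

At T=0.2 nearly all of the probability sits on the first token, while at T=5.0 the three probabilities are much closer together.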

Top-K and top-P (nucleus) sampling

Top-K and top-P are two sampling settings used in LLMs to restrict the predicted next token to come from tokens with the top predicted probabilities.

Like temperature, these sampling settings control the randomness and diversity of generated text:

  • Top-K sampling selects the top K most likely tokens from the model’s predicted distribution. The higher the top-K, the more creative and varied the model’s output; the lower the top-K, the more restrictive and factual the model’s output. A top-K of 1 is equivalent to greedy decoding.
  • Top-P sampling selects the top tokens whose cumulative probability does not exceed a certain value (P). Values for P range from 0 (greedy decoding) to 1 (all tokens in the LLM’s vocabulary).

The best way to choose between top-K and top-P is to experiment with both methods (or both together) and see which one produces the results you are looking for.
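The two filters can be sketched as follows. The helper names and the toy distribution are assumptions. The top-P helper also assumes one common convention: it keeps the token that crosses the threshold, so at least one token always survives. Real implementations may treat that boundary differently.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = ranked[:max(1, k)]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # assumption: the crossing token is included
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical next-token probabilities.
probs = {"mat": 0.5, "sofa": 0.3, "roof": 0.15, "moon": 0.05}

print(top_k_filter(probs, 2))    # only "mat" and "sofa" remain
print(top_p_filter(probs, 0.9))  # "mat", "sofa", and "roof" remain
```

Top-K always keeps a fixed number of candidates, while top-P keeps more or fewer depending on how concentrated the distribution is.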

Extreme settings

At extreme settings of one sampling configuration value, that one sampling setting either cancels out other configuration settings or becomes irrelevant.

  • If you set temperature to 0, top-K and top-P become irrelevant: the most probable token becomes the next token predicted. If you set temperature extremely high (above 1, generally into the 10s), temperature becomes irrelevant and whatever tokens make it through the top-K and/or top-P criteria are then randomly sampled to choose a next predicted token.
  • If you set top-K to 1, temperature and top-P become irrelevant. Only one token passes the top-K criteria, and that token is the next predicted token. If you set top-K extremely high, like to the size of the LLM’s vocabulary, any token with a nonzero probability of being the next token will meet the top-K criteria and none are filtered out.
  • If you set top-P to 0 (or a very small value), most LLM sampling implementations will then only consider the most probable token to meet the top-P criteria, making temperature and top-K irrelevant. If you set top-P to 1, any token with a nonzero probability of being the next token will meet the top-P criteria, and none are filtered out.
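These interactions can be checked with a small combined sampler. This is a sketch under stated assumptions, not how any specific LLM implements sampling: it filters by top-K and top-P on the unscaled probabilities first and applies temperature to the survivors afterwards (real implementations differ in ordering), and the function name and logits are hypothetical.

```python
import math
import random


def sample_next_token(logits, temperature, top_k, top_p, rng):
    """Pick a next token from a {token: logit} dict using all three settings."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if temperature == 0:
        return ranked[0][0]  # greedy: top-K and top-P no longer matter

    # Unscaled probabilities, used only to decide which tokens survive.
    top_logit = ranked[0][1]
    base = [math.exp(logit - top_logit) for _, logit in ranked]
    total = sum(base)
    base = [b / total for b in base]

    # Top-K: at most k candidates. Top-P: stop once cumulative mass reaches top_p.
    k = max(1, min(top_k, len(ranked)))
    keep, cumulative = 0, 0.0
    for prob in base[:k]:
        keep += 1
        cumulative += prob
        if cumulative >= top_p:
            break

    # Temperature reshapes the weights of the surviving tokens only.
    survivors = ranked[:keep]
    weights = [math.exp((logit - top_logit) / temperature) for _, logit in survivors]
    tokens = [token for token, _ in survivors]
    return rng.choices(tokens, weights=weights, k=1)[0]


logits = {"mat": 2.0, "sofa": 1.5, "roof": 0.8, "moon": -0.3}  # made-up scores
rng = random.Random(0)

greedy = {sample_next_token(logits, 0, 40, 0.95, rng) for _ in range(200)}
k_one = {sample_next_token(logits, 5.0, 1, 1.0, rng) for _ in range(200)}
p_zero = {sample_next_token(logits, 5.0, 40, 0.0, rng) for _ in range(200)}
hot = {sample_next_token(logits, 100.0, 40, 1.0, rng) for _ in range(200)}
print(greedy, k_one, p_zero, hot)
```

In this sketch, temperature 0, top-K of 1, and top-P of 0 each collapse to the single most probable token regardless of the other two settings, while a very high temperature with permissive filters spreads the draws across the vocabulary.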

Typical values

The following are typical starting values:

  • A temperature of 0.2, top-P of 0.95, and top-K of 30 will give you relatively coherent results that can be creative but not excessively so.
  • If you want especially creative results, try starting with a temperature of 0.9, top-P of 0.99, and top-K of 40.
  • If you want less creative results, try starting with a temperature of 0.1, top-P of 0.9, and top-K of 20.
  • If your task always has a single correct answer (e.g., answering a math problem), start with a temperature of 0.
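One way to keep these starting points handy is a small preset table. The preset names, the key names `temperature`, `top_p`, and `top_k`, and the helper `sampling_config` are illustrative assumptions; map them onto whatever parameter names your LLM client actually uses.

```python
# Starting values taken from the list above; preset names are hypothetical.
PRESETS = {
    "balanced": {"temperature": 0.2, "top_p": 0.95, "top_k": 30},
    "creative": {"temperature": 0.9, "top_p": 0.99, "top_k": 40},
    "conservative": {"temperature": 0.1, "top_p": 0.9, "top_k": 20},
    # With temperature 0, top-P and top-K no longer matter, so they are omitted.
    "single_answer": {"temperature": 0.0},
}


def sampling_config(preset, **overrides):
    """Return a copy of a preset, with any explicitly passed values replacing it."""
    return {**PRESETS[preset], **overrides}


print(sampling_config("balanced"))
print(sampling_config("creative", top_k=50))
```

Treat the presets as a first guess and adjust them after looking at real outputs for your task.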