Add perplexity as metric
Perplexity is a metric that measures how well a language model predicts the next token in a sequence. For tasks other than text categorization, where accuracy does not apply, it can serve as a substitute metric.
Perplexity is calculated from the probabilities of the input tokens: https://www.comet.com/site/blog/perplexity-for-llm-evaluation/
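As a minimal sketch of the standard definition (the exponential of the average negative log-probability per token); the token probabilities below are illustrative placeholders, not real model outputs:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp of the average negative log-probability per token."""
    neg_log_likelihood = -sum(math.log(p) for p in token_probs)
    return math.exp(neg_log_likelihood / len(token_probs))

# Made-up probabilities: a model that assigns 0.25 to each actual token
# behaves like a uniform choice over 4 options, so perplexity is 4.
print(perplexity([0.25, 0.25, 0.25]))  # 4.0
```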
Most frameworks (e.g. dspy, openai, litellm) return probabilities only for the output tokens.
With vllm, the probabilities of the input tokens can also be obtained:
- echo parameter: https://github.com/vllm-project/vllm/issues/6508#issuecomment-2240051396
- prompt_logprobs: https://github.com/vllm-project/vllm/pull/7453
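A sketch of how the `prompt_logprobs` route above could feed into the metric. The vLLM call is shown commented out (it needs a model and GPU); the exact layout of the result object should be checked against the linked PR. The assumption here is that with `prompt_logprobs=0`, each position in `output.prompt_logprobs` holds the logprob entry of the actual prompt token, with `None` at the first position:

```python
import math

def perplexity_from_logprobs(logprobs: list[float]) -> float:
    """Perplexity = exp of the average negative log-probability."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Hypothetical vLLM usage (untested sketch; verify against the linked PR):
#
# from vllm import LLM, SamplingParams
# llm = LLM(model="...")
# params = SamplingParams(max_tokens=1, prompt_logprobs=0)
# output = llm.generate(["some prompt"], params)[0]
# logprobs = [next(iter(d.values())).logprob
#             for d in output.prompt_logprobs if d is not None]
# print(perplexity_from_logprobs(logprobs))

# Standalone demo with made-up per-token logprobs:
print(perplexity_from_logprobs([-1.0, -2.0, -3.0]))
```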