Add perplexity as a metric

Perplexity is a metric that evaluates how well a model predicts the next token in a sequence. It can serve as a substitute for accuracy in tasks other than text categorization, where there is no single correct label to score against.

Perplexity is calculated from the probabilities of the input (prompt) tokens; see https://www.comet.com/site/blog/perplexity-for-llm-evaluation/ for a walkthrough.
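
Concretely, perplexity is the exponential of the mean negative log-probability over the tokens. A minimal helper (the function name is illustrative):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```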

Most frameworks (e.g. dspy, openai, litellm) return probabilities only for the output tokens, not for the input tokens.
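
For instance, with the OpenAI client, `logprobs=True` only covers the generated tokens (a sketch; the model name is illustrative):

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": "Hello"}],
    logprobs=True,
)
# One entry per *generated* token; prompt tokens are not included,
# so the prompt's perplexity cannot be computed from this response.
output_logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
```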

When using vLLM, the probabilities of the input tokens can be returned:
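
A minimal sketch using vLLM's offline `LLM` API (the model name is illustrative): `prompt_logprobs` asks vLLM to return log-probabilities for the prompt tokens themselves, from which perplexity can be computed.

```python
import math
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative

# prompt_logprobs=0 requests the log-probability of each prompt token
# itself, without additional top-k alternatives.
params = SamplingParams(max_tokens=1, prompt_logprobs=0)

output = llm.generate(["The quick brown fox jumps over the lazy dog"], params)[0]

# output.prompt_logprobs is a list aligned with the prompt token ids;
# the first entry is None because the first token has no preceding context.
logprobs = [
    d[tid].logprob
    for tid, d in zip(output.prompt_token_ids, output.prompt_logprobs)
    if d is not None
]
print(math.exp(-sum(logprobs) / len(logprobs)))  # perplexity of the prompt
```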