Add perplexity as a metric

Perplexity is a metric that evaluates how well a model predicts the next token in a sequence. It can serve as a substitute for accuracy in tasks other than text categorization, where there is no single correct label to score against.

Perplexity is calculated from the probabilities of the input (prompt) tokens; see https://www.comet.com/site/blog/perplexity-for-llm-evaluation/ for a walkthrough.
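
Concretely, perplexity is the exponential of the mean negative log-probability over the tokens. A minimal helper (the function name is illustrative):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```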

Most frameworks (e.g. dspy, openai, litellm) return probabilities only for the output tokens, not for the input tokens.
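
For instance, with the OpenAI client, `logprobs=True` only covers the generated tokens (a sketch; the model name is illustrative):

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": "Hello"}],
    logprobs=True,
)
# One entry per *generated* token; prompt tokens are not included,
# so the prompt's perplexity cannot be computed from this response.
output_logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
```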

When using vLLM, the probabilities of the input tokens can be returned:
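
A minimal sketch using vLLM's offline `LLM` API (the model name is illustrative): `prompt_logprobs` asks vLLM to return log-probabilities for the prompt tokens themselves, from which perplexity can be computed.

```python
import math
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative

# prompt_logprobs=0 requests the log-probability of each prompt token
# itself, without additional top-k alternatives.
params = SamplingParams(max_tokens=1, prompt_logprobs=0)

output = llm.generate(["The quick brown fox jumps over the lazy dog"], params)[0]

# output.prompt_logprobs is a list aligned with the prompt token ids;
# the first entry is None because the first token has no preceding context.
logprobs = [
    d[tid].logprob
    for tid, d in zip(output.prompt_token_ids, output.prompt_logprobs)
    if d is not None
]
print(math.exp(-sum(logprobs) / len(logprobs)))  # perplexity of the prompt
```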