Schema

pydantic model retrieval_qa_benchmark.schema.BaseDataset

Base class for datasets. A dataset should always return a QARecord from its __getitem__ method.

Fields:
  • eval_set (List[retrieval_qa_benchmark.schema.datatypes.QARecord])

  • name (str)

field eval_set: List[QARecord] = []

Data to be evaluated. The data is transformed with its built-in transform.

field name: str = 'dataset'

Name of this dataset

class Config
extra = 'forbid'

No extra field allowed

classmethod build(*args: Any, **kwargs: Any) BaseDataset

Build the dataset.

Raises:

NotImplementedError – subclasses should implement this method

Returns:

dataset that iterates over List[QARecord]

Return type:

BaseDataset

iterator() Any
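
As an illustration of the contract described above (build populates eval_set, and __getitem__ returns a QARecord), here is a minimal sketch. It uses standard-library dataclasses as stand-ins rather than the real pydantic models, and ToyDataset and its sample record are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Sequence


@dataclass
class QARecord:
    """Stand-in for retrieval_qa_benchmark.schema.QARecord (subset of fields)."""
    id: str
    question: str
    answer: str
    type: str
    choices: Optional[Sequence[str]] = None


@dataclass
class ToyDataset:
    """Hypothetical dataset that mirrors BaseDataset's two fields."""
    name: str = "dataset"
    eval_set: List[QARecord] = field(default_factory=list)

    @classmethod
    def build(cls, *args: Any, **kwargs: Any) -> "ToyDataset":
        # BaseDataset.build raises NotImplementedError; a subclass fills eval_set here.
        records = [
            QARecord(id="0", question="2 + 2 = ?", answer="4",
                     type="mcsa", choices=["3", "4"]),
        ]
        return cls(name=kwargs.get("name", "toy"), eval_set=records)

    def __getitem__(self, index: int) -> QARecord:
        # Every item handed out is a QARecord.
        return self.eval_set[index]

    def __len__(self) -> int:
        return len(self.eval_set)
```

Calling ToyDataset.build(name="demo") returns a dataset whose items can be indexed like a list of QARecord objects.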
pydantic model retrieval_qa_benchmark.schema.BaseEvaluator

Base class for evaluators

Fields:
  • dataset (retrieval_qa_benchmark.schema.dataset.BaseDataset)

  • llm (retrieval_qa_benchmark.schema.model.BaseLLM)

  • matcher (Callable[[str, retrieval_qa_benchmark.schema.datatypes.QARecord], float])

  • out_file (str | None)

  • transform (retrieval_qa_benchmark.schema.transform.TransformGraph)

field dataset: BaseDataset [Required]
field llm: BaseLLM [Required]
field matcher: Callable[[str, QARecord], float] = <function default_matcher>
field out_file: str | None = None
field transform: TransformGraph [Required]
class Config
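
The matcher field accepts any callable with the signature Callable[[str, QARecord], float]. The sketch below shows one possible custom matcher; exact_matcher is a hypothetical name, the scoring rule is an assumption and not the behaviour of default_matcher, and QARecord is a standard-library stand-in for the real model.

```python
from dataclasses import dataclass


@dataclass
class QARecord:
    """Stand-in for retrieval_qa_benchmark.schema.QARecord (required fields only)."""
    id: str
    question: str
    answer: str
    type: str


def exact_matcher(generated: str, record: QARecord) -> float:
    """Hypothetical matcher: 1.0 on a case-insensitive exact match, else 0.0."""
    same = generated.strip().lower() == record.answer.strip().lower()
    return 1.0 if same else 0.0
```

A function like this could then be passed as matcher=exact_matcher when constructing an evaluator.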
pydantic model retrieval_qa_benchmark.schema.BaseLLM
Fields:
  • context_template (str)

  • name (str)

  • record_template (str)

  • run_args (Dict[str, Any])

field context_template: str = 'Context:\n{context}\n\n'

template to inject contexts

field name: str [Required]

name of the model, like gpt-3.5-turbo or llama2-13b-chat

field record_template: str = 'The following are multiple choice questions (with answers) with context:\n\n{context}Question: {question}\n{choices}Answer: '

template to convert QARecord into string

field run_args: Dict[str, Any] = {}

Runtime keyword arguments

classmethod build(**kwargs: Any) BaseLLM
convert_record(data: QARecord) str
generate(text: str) BaseLLMOutput
property tokenizer_type: str
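
To make the two default templates concrete, the sketch below fills them with str.format. The template strings are copied from the field defaults above; render_prompt is a hypothetical helper, and the way contexts are joined and choices are lettered is an assumption, not necessarily what convert_record does.

```python
from typing import Optional, Sequence

# Defaults copied from BaseLLM.context_template and BaseLLM.record_template.
CONTEXT_TEMPLATE = "Context:\n{context}\n\n"
RECORD_TEMPLATE = (
    "The following are multiple choice questions (with answers) with context:"
    "\n\n{context}Question: {question}\n{choices}Answer: "
)


def render_prompt(
    question: str,
    choices: Optional[Sequence[str]] = None,
    context: Optional[Sequence[str]] = None,
) -> str:
    """Hypothetical rendering of a record into a prompt string."""
    # Assumption: context strings are joined with newlines; empty context adds nothing.
    context_str = CONTEXT_TEMPLATE.format(context="\n".join(context)) if context else ""
    # Assumption: choices are lettered "A. ...", one per line.
    choices_str = "".join(
        f"{chr(ord('A') + i)}. {choice}\n" for i, choice in enumerate(choices or [])
    )
    return RECORD_TEMPLATE.format(
        context=context_str, question=question, choices=choices_str
    )
```

Because the record template ends with "Answer: ", the model's completion is expected to start with the answer itself.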
pydantic model retrieval_qa_benchmark.schema.BaseLLMOutput
Fields:
  • completion_tokens (int)

  • generated (str)

  • prompt_tokens (int)

field completion_tokens: int [Required]
field generated: str [Required]
field prompt_tokens: int [Required]
pydantic model retrieval_qa_benchmark.schema.BaseTransform

Base transform object.

This framework is driven by BaseTransform. A BaseTransform always takes a QARecord as input and outputs a new QARecord.

Design principles:

  1. Make every transform a minimal, atomic operation on a QARecord

  2. Alter only the fields that need to change in a single BaseTransform

Fields:
  • children (List[retrieval_qa_benchmark.schema.transform.BaseTransform | None])

field children: List[BaseTransform | None] = [None, None]

List of candidate next nodes; check_status returns the index of the one to follow

class Config
chain(**kwargs: Any) Any
check_status(current: Dict[str, Any]) int

Check the status after all transform functions have run

Parameters:

current (Dict[str, Any]) – Current transformed QARecord as dictionary

Returns:

the next state ID in BaseTransform.children

Return type:

int

field_targets() Dict[str, Callable[[Dict[str, Any]], Any]]

Get the collection of all transform functions of this transform

Returns:

Dictionary mapping field names to their transform functions

Return type:

Dict[str, Callable[[Dict[str, Any]], Any]]

set_children(children: List[BaseTransform | None]) None

Set children for this transform

Parameters:

children (List[Optional[BaseTransform]]) – the next nodes to execute

transform_choices(data: Dict[str, Any], **params: Any) List[str] | None

Special transform function for the choices field of a QARecord

Parameters:

data (Dict[str, Any]) – input QARecord as dictionary

Returns:

transformed choices

Return type:

Optional[List[str]]
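
The interplay of field_targets, check_status, and children can be sketched as follows. ToyTransform and its apply driver are hypothetical stand-ins written with plain Python classes; they illustrate the documented signatures under the assumption that field_targets keys are QARecord field names, and they are not the library's implementation.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class ToyTransform:
    """Hypothetical stand-in for BaseTransform."""

    def __init__(self) -> None:
        # Same default shape as BaseTransform.children.
        self.children: List[Optional["ToyTransform"]] = [None, None]

    def set_children(self, children: List[Optional["ToyTransform"]]) -> None:
        self.children = children

    def transform_question(self, data: Dict[str, Any]) -> str:
        # Touch only the field this transform is responsible for.
        return data["question"].strip()

    def field_targets(self) -> Dict[str, Callable[[Dict[str, Any]], Any]]:
        # Assumption: each key names the field its function produces.
        return {"question": self.transform_question}

    def check_status(self, current: Dict[str, Any]) -> int:
        # Index into self.children: 0 if a question remains, 1 otherwise.
        return 0 if current["question"] else 1

    def apply(self, data: Dict[str, Any]) -> Tuple[int, Dict[str, Any]]:
        """Hypothetical driver: run every field function, then pick the next child."""
        updated = dict(data)
        for name, fn in self.field_targets().items():
            updated[name] = fn(data)
        return self.check_status(updated), updated
```

Fields without an entry in field_targets pass through unchanged, which matches the second design principle.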

pydantic model retrieval_qa_benchmark.schema.LLMHistory

LLM output history

Fields:
  • comment (str)

  • completion_tokens (int)

  • created_by (str)

  • extra (retrieval_qa_benchmark.schema.datatypes.ToolHistory | None)

  • generated (str)

  • prompt_tokens (int)

field comment: str = ''

Extra comments on this generation

field completion_tokens: int [Required]
field created_by: str = 'default'

Which node creates this

field extra: ToolHistory | None = None
field generated: str [Required]
field prompt_tokens: int [Required]
pydantic model retrieval_qa_benchmark.schema.QAPrediction

Base prediction result for question answering

Fields:
  • answer (str)

  • choices (Sequence[str] | None)

  • completion_tokens (int)

  • context (Sequence[str] | None)

  • generated (str)

  • id (str)

  • matched (float)

  • profile_avg (Dict[str, float] | None)

  • profile_count (Dict[str, int] | None)

  • profile_time (Dict[str, int | float] | None)

  • prompt_tokens (int)

  • question (str)

  • stack (List[retrieval_qa_benchmark.schema.datatypes.LLMHistory] | None)

  • type (str)

field answer: str [Required]

the true answer from the dataset

field choices: Sequence[str] | None = None

Choices the model should choose from. Only present when type is 'mcsa' or 'mcma'.

field completion_tokens: int = 0

number of generated tokens

field context: Sequence[str] | None = None

list of context strings that are retrieved from db or other sources

field generated: str [Required]

Output from the model, which is compared with the true answer in QARecord

field id: str [Required]

identifier for this record

field matched: float = 0.0

Match score that measures how closely this prediction matches the answer

field profile_avg: Dict[str, float] | None = {}

Average time consumption per call. Equal to time / count.

field profile_count: Dict[str, int] | None = {}

Accumulated number of executions of each profiled function

field profile_time: Dict[str, int | float] | None = {}

Accumulated profiled time for each function
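
The relation between the three profiling fields (average = time / count) can be written out as below. average_profile is a hypothetical helper and the function names used as keys in the usage note are made up; how the library itself fills these dictionaries is not shown in this reference.

```python
from typing import Dict, Union


def average_profile(
    profile_time: Dict[str, Union[int, float]],
    profile_count: Dict[str, int],
) -> Dict[str, float]:
    """Hypothetical helper: per-function average time, i.e. time / count."""
    return {
        name: profile_time[name] / count
        for name, count in profile_count.items()
        # Skip functions that never ran or have no recorded time.
        if count and name in profile_time
    }
```

For example, a function profiled for 3.0 seconds over 2 calls gets an average of 1.5 seconds.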

field prompt_tokens: int = 0

number of input tokens

field question: str [Required]

The question to ask, as a string

field stack: List['LLMHistory'] | None = []

stacked intermediate prediction results (for multi-hop qa pipelines)

field type: str [Required]

Type of this question. Can be one of ['mcsa', 'mcma'], where mcsa means multiple choice single answer and mcma means multiple choice multiple answer.

class Config
pydantic model retrieval_qa_benchmark.schema.QARecord

Base data record for question answering

Fields:
  • answer (str)

  • choices (Sequence[str] | None)

  • context (Sequence[str] | None)

  • id (str)

  • question (str)

  • stack (List[retrieval_qa_benchmark.schema.datatypes.LLMHistory] | None)

  • type (str)

field answer: str [Required]

the true answer from the dataset

field choices: Sequence[str] | None = None

Choices the model should choose from. Only present when type is 'mcsa' or 'mcma'.

field context: Sequence[str] | None = None

list of context strings that are retrieved from db or other sources

field id: str [Required]

identifier for this record

field question: str [Required]

The question to ask, as a string

field stack: List[LLMHistory] | None = []

stacked intermediate prediction results (for multi-hop qa pipelines)

field type: str [Required]

Type of this question. Can be one of ['mcsa', 'mcma'], where mcsa means multiple choice single answer and mcma means multiple choice multiple answer.

class Config
pydantic model retrieval_qa_benchmark.schema.ToolHistory

Tool call history

Fields:
  • result (str | None)

  • thought (str)

  • tool (str | None)

  • tool_inputs (str | dict | None)

field result: str | None = None

Output from this function call

field thought: str = ''

rationale step from LLM

field tool: str | None = None

function called in this history

field tool_inputs: str | dict | None = None

Input for this tool call

pydantic model retrieval_qa_benchmark.schema.TransformGraph

Callable graph for BaseTransform

Fields:
  • entry_id (str)

  • nodes (Dict[str, retrieval_qa_benchmark.schema.transform.BaseTransform])

field entry_id: str [Required]
field nodes: Dict[str, BaseTransform] [Required]
classmethod build(nodes: Dict[str, BaseTransform], entry_id: str = '0') TransformGraph
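
A graph described by nodes and entry_id can be walked roughly as in the sketch below: start at the entry node, apply it, and follow the child chosen by the node until None is reached. Node and run_graph are hypothetical stand-ins introduced for illustration; the actual call semantics of TransformGraph are an assumption here.

```python
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]


class Node:
    """Hypothetical node: a record function plus a rule that picks the next child."""

    def __init__(
        self,
        fn: Callable[[Record], Record],
        pick: Callable[[Record], int] = lambda record: 0,
    ) -> None:
        self.fn = fn
        self.pick = pick
        self.children: List[Optional["Node"]] = [None, None]


def run_graph(nodes: Dict[str, Node], record: Record, entry_id: str = "0") -> Record:
    """Apply nodes starting from entry_id until a node selects a None child."""
    node: Optional[Node] = nodes[entry_id]
    while node is not None:
        record = node.fn(record)
        node = node.children[node.pick(record)]
    return record


# Usage: strip the question, then upper-case it.
strip = Node(lambda r: {**r, "question": r["question"].strip()})
upper = Node(lambda r: {**r, "question": r["question"].upper()})
strip.children = [upper, None]
result = run_graph({"0": strip, "1": upper}, {"question": " hi "})
```

Keeping the branching decision in each node keeps the graph itself a plain dictionary keyed by node ID, with entry_id defaulting to '0' as in the build signature.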