Schema

pydantic model retrieval_qa_benchmark.schema.BaseDataset

Base class for datasets. A dataset should always return a QARecord from its __getitem__ method.

Fields:

eval_set (List[retrieval_qa_benchmark.schema.datatypes.QARecord])
name (str)

field eval_set: List[QARecord] = []: Data to be evaluated. The data is transformed with its built-in transform.

field name: str = 'dataset': Name of this dataset

class Config

extra = 'forbid': No extra field allowed

classmethod build(*args: Any, **kwargs: Any) → BaseDataset

Build the dataset.

Raises:: NotImplementedError – subclasses should implement this method
Returns:: dataset that iterates over List[QARecord]
Return type:: BaseDataset

iterator() → Any
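
The contract above (an eval_set of records, a build classmethod, and record-returning __getitem__) can be sketched as follows. This is a minimal illustration using standard-library dataclasses as stand-ins; RecordSketch and ListDatasetSketch are hypothetical names, not the library's pydantic models, and the rows argument of build is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class RecordSketch:
    """Stand-in mirroring the QARecord fields listed on this page."""
    id: str
    question: str
    answer: str
    type: str
    choices: Optional[Sequence[str]] = None


@dataclass
class ListDatasetSketch:
    """Stand-in for a BaseDataset subclass backed by an in-memory list."""
    name: str = "dataset"
    eval_set: List[RecordSketch] = field(default_factory=list)

    @classmethod
    def build(cls, rows: List[Dict[str, Any]], **kwargs: Any) -> "ListDatasetSketch":
        # A concrete dataset supplies build(); the base class raises NotImplementedError.
        return cls(eval_set=[RecordSketch(**row) for row in rows], **kwargs)

    def __len__(self) -> int:
        return len(self.eval_set)

    def __getitem__(self, index: int) -> RecordSketch:
        # Indexing always yields a record object, never a raw dictionary.
        return self.eval_set[index]


dataset = ListDatasetSketch.build(
    [{"id": "0", "question": "2 + 2 = ?", "answer": "B", "type": "mcsa", "choices": ["3", "4"]}],
    name="toy",
)
```

A real subclass would derive from BaseDataset and load its rows from disk or a hub inside build.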

pydantic model retrieval_qa_benchmark.schema.BaseEvaluator

Base class for evaluators

Fields:

dataset (retrieval_qa_benchmark.schema.dataset.BaseDataset)
llm (retrieval_qa_benchmark.schema.model.BaseLLM)
matcher (Callable[[str, retrieval_qa_benchmark.schema.datatypes.QARecord], float])
out_file (str | None)
transform (retrieval_qa_benchmark.schema.transform.TransformGraph)

field dataset: BaseDataset [Required]

field llm: BaseLLM [Required]

field matcher: Callable[[str, QARecord], float] = <function default_matcher>

field out_file: str | None = None

field transform: TransformGraph [Required]

class Config
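
The matcher field accepts any callable with the signature Callable[[str, QARecord], float]. The behavior of default_matcher is not described on this page, so the following is only a hypothetical replacement: a case-insensitive exact match, with RecordSketch standing in for QARecord.

```python
from dataclasses import dataclass


@dataclass
class RecordSketch:
    """Stand-in for QARecord; only the field the matcher reads is included."""
    answer: str


def exact_matcher(generated: str, record: RecordSketch) -> float:
    """Return 1.0 when the generated text equals the reference answer, else 0.0."""
    same = generated.strip().lower() == record.answer.strip().lower()
    return 1.0 if same else 0.0


score = exact_matcher(" b\n", RecordSketch(answer="B"))
```

Returning a float rather than a bool leaves room for partial-credit matchers that fit the same signature.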

pydantic model retrieval_qa_benchmark.schema.BaseLLM

Fields:

context_template (str)
name (str)
record_template (str)
run_args (Dict[str, Any])

field context_template: str = 'Context:\n{context}\n\n': template to inject contexts

field name: str [Required]: name of the model, like gpt-3.5-turbo or llama2-13b-chat

field record_template: str = 'The following are multiple choice questions (with answers) with context:\n\n{context}Question: {question}\n{choices}Answer: ': template to convert QARecord into string

field run_args: Dict[str, Any] = {}: Runtime keyword arguments

classmethod build(**kwargs: Any) → BaseLLM

convert_record(data: QARecord) → str

generate(text: str) → BaseLLMOutput

property tokenizer_type: str
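
To illustrate how the two default templates fit together, the sketch below fills them with str.format. The template strings are copied from the field defaults above; how convert_record actually joins contexts and letters the choices is not documented here, so the joining logic in render_prompt (a hypothetical helper) is an assumption.

```python
from typing import Optional, Sequence

CONTEXT_TEMPLATE = "Context:\n{context}\n\n"
RECORD_TEMPLATE = (
    "The following are multiple choice questions (with answers) with context:\n\n"
    "{context}Question: {question}\n{choices}Answer: "
)


def render_prompt(
    question: str,
    choices: Optional[Sequence[str]] = None,
    context: Optional[Sequence[str]] = None,
) -> str:
    # Assumption: contexts are joined with newlines; an empty context drops the block entirely.
    context_text = CONTEXT_TEMPLATE.format(context="\n".join(context)) if context else ""
    # Assumption: choices are lettered "A. ...", "B. ...", one per line.
    choices_text = "".join(
        f"{chr(ord('A') + i)}. {choice}\n" for i, choice in enumerate(choices or [])
    )
    return RECORD_TEMPLATE.format(
        context=context_text, question=question, choices=choices_text
    )


prompt = render_prompt("2 + 2 = ?", choices=["3", "4"], context=["Basic arithmetic."])
```

Because the prompt ends with "Answer: ", the model's completion is expected to begin with the answer itself.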

pydantic model retrieval_qa_benchmark.schema.BaseLLMOutput

Fields:

completion_tokens (int)
generated (str)
prompt_tokens (int)

field completion_tokens: int [Required]

field generated: str [Required]

field prompt_tokens: int [Required]

pydantic model retrieval_qa_benchmark.schema.BaseTransform

Base transform object.

This framework is driven by BaseTransform. A BaseTransform always takes a QARecord as input and outputs a new QARecord.

Design principles:

Make every transform a minimal, atomic operation on a QARecord
Alter only the fields that need to change in a single BaseTransform

Fields:

children (List[retrieval_qa_benchmark.schema.transform.BaseTransform | None])

field children: List[BaseTransform | None] = [None, None]: list of next nodes, indexed by the state ID returned by check_status

class Config

chain(**kwargs: Any) → Any

check_status(current: Dict[str, Any]) → int

Check the status after all transform functions have run

Parameters:: current (Dict[str, Any]) – Current transformed QARecord as dictionary
Returns:: the next state ID in BaseTransform.children
Return type:: int

field_targets() → Dict[str, Callable[[Dict[str, Any]], Any]]

Get the collection of all transform functions of this transform

Returns:: Dictionary mapping field names to transform functions
Return type:: Dict[str, Callable[[Dict[str, Any]], Any]]

set_children(children: List[BaseTransform | None]) → None

Set children for this transform

Parameters:: children (List[Optional[BaseTransform]]) – the next nodes to execute

transform_choices(data: Dict[str, Any], **params: Any) → List[str] | None

Special transform function for the choices field of QARecord

Parameters:: data (Dict[str, Any]) – input QARecord as dictionary
Returns:: transformed choices
Return type:: Optional[List[str]]
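
The methods above suggest a pattern in which each transform_<field> function rewrites one field and check_status picks the next child. The sketch below mimics that pattern on plain dictionaries; TransformSketch and StripQuestion are hypothetical stand-ins, and the way __call__ combines field_targets and check_status is an assumption rather than the library's implementation.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Record = Dict[str, Any]


class TransformSketch:
    """Stand-in for BaseTransform operating on a record dictionary."""

    def __init__(self) -> None:
        self.children: List[Optional["TransformSketch"]] = [None, None]

    def set_children(self, children: List[Optional["TransformSketch"]]) -> None:
        self.children = children

    def field_targets(self) -> Dict[str, Callable[[Record], Any]]:
        # Assumption: methods named transform_<field> are collected by field name.
        prefix = "transform_"
        return {
            name[len(prefix):]: getattr(self, name)
            for name in dir(self)
            if name.startswith(prefix)
        }

    def check_status(self, current: Record) -> int:
        # Index into self.children; 0 is treated here as the default branch.
        return 0

    def __call__(self, data: Record) -> Tuple[int, Record]:
        current = dict(data)  # copy, so untouched fields pass through unchanged
        for field_name, fn in self.field_targets().items():
            current[field_name] = fn(data)
        return self.check_status(current), current


class StripQuestion(TransformSketch):
    """Atomic transform: only the question field is altered."""

    def transform_question(self, data: Record) -> str:
        return data["question"].strip()


status, out = StripQuestion()({"question": "  hi ", "answer": "A"})
```

Keeping each transform to a single field, as here, follows the design principles stated above and makes branches in check_status easier to reason about.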

pydantic model retrieval_qa_benchmark.schema.LLMHistory

LLM output history

Fields:

comment (str)
completion_tokens (int)
created_by (str)
extra (retrieval_qa_benchmark.schema.datatypes.ToolHistory | None)
generated (str)
prompt_tokens (int)

field comment: str = '': extra comments on this generation

field completion_tokens: int [Required]

field created_by: str = 'default': the node that created this entry

field extra: ToolHistory | None = None

field generated: str [Required]

field prompt_tokens: int [Required]

pydantic model retrieval_qa_benchmark.schema.QAPrediction

Base prediction result for questioning & answering

Fields:

answer (str)
choices (Sequence[str] | None)
completion_tokens (int)
context (Sequence[str] | None)
generated (str)
id (str)
matched (float)
profile_avg (Dict[str, float] | None)
profile_count (Dict[str, int] | None)
profile_time (Dict[str, int | float] | None)
prompt_tokens (int)
question (str)
stack (List[retrieval_qa_benchmark.schema.datatypes.LLMHistory] | None)
type (str)

field answer: str [Required]: the true answer from the dataset

field choices: Sequence[str] | None = None: choices the model should choose from. Only present when type is one of ['mcsa', 'mcma']

field completion_tokens: int = 0: number of generated tokens

field context: Sequence[str] | None = None: list of context strings that are retrieved from db or other sources

field generated: str [Required]: output from the model, which is compared with the true answer in QARecord

field id: str [Required]: identifier for this record

field matched: float = 0.0: match score that measures how accurate this prediction is to the answer

field profile_avg: Dict[str, float] | None = {}: average time consumption per profiled function, equal to time / count

field profile_count: Dict[str, int] | None = {}: accumulated number of executions of each profiled function

field profile_time: Dict[str, int | float] | None = {}: accumulated profiled time for each function

field prompt_tokens: int = 0: number of input tokens

field question: str [Required]: the question to ask, as a string

field stack: List['LLMHistory'] | None = []: stacked intermediate prediction results (for multi-hop qa pipelines)

field type: str [Required]: type of this question. Can be one of ['mcsa', 'mcma'], where mcsa means multiple choice, single answer and mcma means multiple choice, multiple answer

class Config
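
The relationship between the three profile fields (average = time / count) can be worked through with a small example. The function names and numbers below are invented for illustration only.

```python
# Hypothetical accumulated values for two profiled functions.
profile_time = {"retrieve": 1.5, "generate": 6.0}   # total seconds per function
profile_count = {"retrieve": 3, "generate": 2}      # number of calls per function

# Average time per call; functions with a zero or missing count are skipped.
profile_avg = {
    name: profile_time[name] / profile_count[name]
    for name in profile_time
    if profile_count.get(name)
}
```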

pydantic model retrieval_qa_benchmark.schema.QARecord

Base data record for questioning & answering

Fields:

answer (str)
choices (Sequence[str] | None)
context (Sequence[str] | None)
id (str)
question (str)
stack (List[retrieval_qa_benchmark.schema.datatypes.LLMHistory] | None)
type (str)

field answer: str [Required]: the true answer from the dataset

field choices: Sequence[str] | None = None: choices the model should choose from. Only present when type is one of ['mcsa', 'mcma']

field context: Sequence[str] | None = None: list of context strings that are retrieved from db or other sources

field id: str [Required]: identifier for this record

field question: str [Required]: the question to ask, as a string

field stack: List[LLMHistory] | None = []: stacked intermediate prediction results (for multi-hop qa pipelines)

field type: str [Required]: type of this question. Can be one of ['mcsa', 'mcma'], where mcsa means multiple choice, single answer and mcma means multiple choice, multiple answer

class Config

pydantic model retrieval_qa_benchmark.schema.ToolHistory

Tool call history

Fields:

result (str | None)
thought (str)
tool (str | None)
tool_inputs (str | dict | None)

field result: str | None = None: Output from this function call

field thought: str = '': rationale step from the LLM

field tool: str | None = None: function called in this history

field tool_inputs: str | dict | None = None: Input for this tool call

pydantic model retrieval_qa_benchmark.schema.TransformGraph

Callable graph for BaseTransform

Fields:

entry_id (str)
nodes (Dict[str, retrieval_qa_benchmark.schema.transform.BaseTransform])

field entry_id: str [Required]

field nodes: Dict[str, BaseTransform] [Required]

classmethod build(nodes: Dict[str, BaseTransform], entry_id: str = '0') → TransformGraph
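
One plausible reading of a "callable graph" is a walk that starts at nodes[entry_id], applies each node, and follows the child selected by that node's status until it reaches None. The sketch below implements that reading with stand-ins; NodeSketch and run_graph are hypothetical, and the traversal rule is an assumption rather than the library's documented behavior.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Record = Dict[str, Any]


class NodeSketch:
    """Stand-in for a transform node: applies fn, then reports a status index."""

    def __init__(
        self,
        fn: Callable[[Record], Record],
        status: Callable[[Record], int] = lambda record: 0,
    ) -> None:
        self.fn = fn
        self.status = status
        self.children: List[Optional["NodeSketch"]] = [None, None]

    def __call__(self, record: Record) -> Tuple[int, Record]:
        out = self.fn(record)
        return self.status(out), out


def run_graph(nodes: Dict[str, NodeSketch], entry_id: str, record: Record) -> Record:
    # Start at the entry node and follow children until a None child ends the walk.
    node: Optional[NodeSketch] = nodes[entry_id]
    while node is not None:
        status, record = node(record)
        node = node.children[status]
    return record


strip = NodeSketch(lambda r: {**r, "question": r["question"].strip()})
upper = NodeSketch(lambda r: {**r, "question": r["question"].upper()})
strip.children = [upper, None]  # status 0 continues to `upper`; status 1 would stop

result = run_graph({"0": strip, "1": upper}, entry_id="0", record={"question": " hi "})
```

The default entry_id of '0' in build matches the key used for the first node here.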