Schema
- pydantic model retrieval_qa_benchmark.schema.BaseDataset
Base class for datasets. A dataset should always output QARecord with its __getitem__ method.
- Fields:
eval_set (List[retrieval_qa_benchmark.schema.datatypes.QARecord])
name (str)
- field eval_set: List[QARecord] = []
Data to be evaluated. The data is transformed with its built-in transform.
- field name: str = 'dataset'
Name of this dataset
- classmethod build(*args: Any, **kwargs: Any) BaseDataset
Build the dataset.
- Raises:
NotImplementedError – subclasses should implement this
- Returns:
dataset that iterates over List[QARecord]
- Return type:
BaseDataset
- iterator() Any
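As a rough illustration of the contract above, the sketch below uses stdlib dataclasses as stand-ins for the pydantic models, so it runs without the library installed. The real classes are retrieval_qa_benchmark.schema.BaseDataset and QARecord. ToyDataset, its hard-coded record, and the behavior of iterator() are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Optional, Sequence


@dataclass
class QARecord:
    # Stand-in mirroring the fields documented for QARecord.
    id: str
    question: str
    answer: str
    type: str
    choices: Optional[Sequence[str]] = None
    context: Optional[Sequence[str]] = None


@dataclass
class BaseDataset:
    # Stand-in for the pydantic BaseDataset: same field names and defaults.
    name: str = "dataset"
    eval_set: List[QARecord] = field(default_factory=list)

    @classmethod
    def build(cls, *args: Any, **kwargs: Any) -> "BaseDataset":
        raise NotImplementedError("subclasses should implement build()")

    def __getitem__(self, index: int) -> QARecord:
        return self.eval_set[index]

    def iterator(self) -> Iterator[QARecord]:
        # Assumption: iterate over the records in eval_set.
        return iter(self.eval_set)


@dataclass
class ToyDataset(BaseDataset):
    # Hypothetical subclass that fills eval_set with one hard-coded record.
    @classmethod
    def build(cls, *args: Any, **kwargs: Any) -> "ToyDataset":
        record = QARecord(
            id="q0",
            question="Which planet is closest to the Sun?",
            answer="A",
            type="mcsa",
            choices=["Mercury", "Venus", "Earth", "Mars"],
        )
        return cls(name="toy", eval_set=[record])


dataset = ToyDataset.build()
first = dataset[0]  # a QARecord, as the base class contract requires
```

Calling BaseDataset.build() directly raises NotImplementedError in this sketch, matching the documented behavior.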
- pydantic model retrieval_qa_benchmark.schema.BaseEvaluator
Base class for evaluators
- Fields:
dataset (retrieval_qa_benchmark.schema.dataset.BaseDataset)
llm (retrieval_qa_benchmark.schema.model.BaseLLM)
matcher (Callable[[str, retrieval_qa_benchmark.schema.datatypes.QARecord], float])
out_file (str | None)
transform (retrieval_qa_benchmark.schema.transform.TransformGraph)
- field dataset: BaseDataset [Required]
- field out_file: str | None = None
- field transform: TransformGraph [Required]
- class Config
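The matcher field has the signature Callable[[str, QARecord], float]. A minimal sketch of a function with that shape follows. QARecord here is a dataclass stand-in, and exact_match is a hypothetical matcher, not one shipped by the library.

```python
from dataclasses import dataclass


@dataclass
class QARecord:
    # Stand-in with only the fields this sketch needs.
    id: str
    question: str
    answer: str
    type: str


def exact_match(generated: str, record: QARecord) -> float:
    """Hypothetical matcher: 1.0 on a case-insensitive, whitespace-trimmed match, else 0.0."""
    same = generated.strip().lower() == record.answer.strip().lower()
    return 1.0 if same else 0.0


record = QARecord(id="q0", question="2 + 2 = ?", answer="B", type="mcsa")
score_hit = exact_match(" b\n", record)
score_miss = exact_match("C", record)
```

The returned float corresponds to the kind of value stored in QAPrediction.matched.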
- pydantic model retrieval_qa_benchmark.schema.BaseLLM
- Fields:
context_template (str)
name (str)
record_template (str)
run_args (Dict[str, Any])
- field context_template: str = 'Context:\n{context}\n\n'
template to inject contexts
- field name: str [Required]
name of the model, like gpt-3.5-turbo or llama2-13b-chat
- field record_template: str = 'The following are multiple choice questions (with answers) with context:\n\n{context}Question: {question}\n{choices}Answer: '
template to convert QARecord into string
- field run_args: Dict[str, Any] = {}
Runtime keyword arguments
- generate(text: str) BaseLLMOutput
- property tokenizer_type: str
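To see how the two default templates fit together, the sketch below fills them with str.format. The template strings are copied from the defaults above. How the library joins contexts and formats choices is not documented here, so the newline join and the lettered choice lines are assumptions, and render_prompt is a hypothetical helper.

```python
CONTEXT_TEMPLATE = "Context:\n{context}\n\n"
RECORD_TEMPLATE = (
    "The following are multiple choice questions (with answers) with context:"
    "\n\n{context}Question: {question}\n{choices}Answer: "
)


def render_prompt(question, choices, contexts):
    # Assumption: contexts are joined with newlines before injection,
    # and the context block is omitted when nothing was retrieved.
    context_block = ""
    if contexts:
        context_block = CONTEXT_TEMPLATE.format(context="\n".join(contexts))
    # Assumption: choices are lettered "A.", "B.", ... one per line.
    choice_block = "".join(
        f"{chr(ord('A') + i)}. {choice}\n" for i, choice in enumerate(choices)
    )
    return RECORD_TEMPLATE.format(
        context=context_block, question=question, choices=choice_block
    )


prompt = render_prompt(
    question="Which planet is closest to the Sun?",
    choices=["Mercury", "Venus"],
    contexts=["Mercury orbits the Sun at about 0.39 AU."],
)
```

The prompt ends with "Answer: ", so the model's completion is expected to begin with the chosen answer.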
- pydantic model retrieval_qa_benchmark.schema.BaseLLMOutput
- Fields:
completion_tokens (int)
generated (str)
prompt_tokens (int)
- field completion_tokens: int [Required]
- field generated: str [Required]
- field prompt_tokens: int [Required]
- pydantic model retrieval_qa_benchmark.schema.BaseTransform
Base transform object.
This framework is driven by BaseTransform. A BaseTransform always takes a QARecord as input and outputs a new QARecord.
Design principles:
Make every transform a minimal and atomic operation on QARecord.
Only alter the fields it needs to change in a single BaseTransform.
- Fields:
children (List[retrieval_qa_benchmark.schema.transform.BaseTransform | None])
- field children: List[BaseTransform | None] = [None, None]
list of next states (the nodes that may execute after this one)
- class Config
- chain(**kwargs: Any) Any
- check_status(current: Dict[str, Any]) int
Check the status after all transform functions
- Parameters:
current (Dict[str, Any]) – Current transformed QARecord as dictionary
- Returns:
the next state ID in BaseTransform.children
- Return type:
int
- field_targets() Dict[str, Callable[[Dict[str, Any]], Any]]
Get the collection of all transform functions of this transform
- Returns:
Dictionary mapping each target field name to its transform function
- Return type:
Dict[str, Callable[[Dict[str, Any]], Any]]
- set_children(children: List[BaseTransform | None]) None
Set children for this transform
- Parameters:
children (List[Optional[BaseTransform]]) – the next nodes to execute
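The pieces above (field_targets, check_status, children) can be sketched with a plain-Python stand-in. SketchTransform is not the library class, and its apply helper is hypothetical: the way the real BaseTransform invokes its field functions is an assumption here.

```python
from typing import Any, Callable, Dict, List, Optional


class SketchTransform:
    # Stand-in for BaseTransform; not the library class.
    def __init__(self) -> None:
        self.children: List[Optional["SketchTransform"]] = [None, None]

    def set_children(self, children: List[Optional["SketchTransform"]]) -> None:
        self.children = children

    def field_targets(self) -> Dict[str, Callable[[Dict[str, Any]], Any]]:
        return {}

    def check_status(self, current: Dict[str, Any]) -> int:
        return 0

    def apply(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Hypothetical helper: copy the record, then overwrite only targeted fields.
        updated = dict(record)
        for name, fn in self.field_targets().items():
            updated[name] = fn(record)
        return updated


class StripQuestion(SketchTransform):
    # Atomic transform: touches the question field and nothing else.
    def field_targets(self):
        return {"question": lambda data: data["question"].strip()}

    def check_status(self, current):
        # Route to child 1 when the question ended up empty, else child 0.
        return 1 if not current["question"] else 0


record = {"id": "q0", "question": "  What is 2 + 2?  ", "answer": "B", "type": "mcsa"}
transform = StripQuestion()
result = transform.apply(record)
next_index = transform.check_status(result)
```

The input dictionary is left untouched and only the question field differs in the result, which is the "minimal and atomic" behavior the design principles describe.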
- pydantic model retrieval_qa_benchmark.schema.LLMHistory
LLM output history
- Fields:
comment (str)
completion_tokens (int)
created_by (str)
extra (retrieval_qa_benchmark.schema.datatypes.ToolHistory | None)
generated (str)
prompt_tokens (int)
- field comment: str = ''
extra comments to this generation
- field completion_tokens: int [Required]
- field created_by: str = 'default'
Which node creates this
- field extra: ToolHistory | None = None
- field generated: str [Required]
- field prompt_tokens: int [Required]
- pydantic model retrieval_qa_benchmark.schema.QAPrediction
Base prediction result for questioning & answering
- Fields:
answer (str)
choices (Sequence[str] | None)
completion_tokens (int)
context (Sequence[str] | None)
generated (str)
id (str)
matched (float)
profile_avg (Dict[str, float] | None)
profile_count (Dict[str, int] | None)
profile_time (Dict[str, int | float] | None)
prompt_tokens (int)
question (str)
stack (List[retrieval_qa_benchmark.schema.datatypes.LLMHistory] | None)
type (str)
- field answer: str [Required]
the true answer from the dataset
- field choices: Sequence[str] | None = None
choices the model should choose from. Only present when type is one of ['mcsa', 'mcma']
- field completion_tokens: int = 0
number of generated tokens
- field context: Sequence[str] | None = None
list of context strings that are retrieved from db or other sources
- field generated: str [Required]
output from the model, which is compared with the true answer in QARecord
- field id: str [Required]
identifier for this record
- field matched: float = 0.0
match score that measures how accurate this prediction is to the answer
- field profile_avg: Dict[str, float] | None = {}
calculated average time consumption per call, equal to time / count
- field profile_count: Dict[str, int] | None = {}
accumulated number of executions of each profiled function
- field profile_time: Dict[str, int | float] | None = {}
accumulated profiled time for each function
- field prompt_tokens: int = 0
number of input tokens
- field question: str [Required]
question to ask in string
- field stack: List['LLMHistory'] | None = []
stacked intermediate prediction results (for multi-hop qa pipelines)
- field type: str [Required]
type of this question. Can be one of ['mcsa', 'mcma']
mcsa: multiple choice single answer
mcma: multiple choice multiple answer
- class Config
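The relationship between the three profile_* fields (average = time / count) can be written out directly. The function names "retrieve" and "generate" and the numbers are made up for illustration.

```python
# Hypothetical accumulated profiling data, keyed by profiled function name.
profile_time = {"retrieve": 1.5, "generate": 6.0}
profile_count = {"retrieve": 3, "generate": 2}

# Average time per call; skip functions that were never executed
# to avoid dividing by zero.
profile_avg = {
    name: profile_time[name] / profile_count[name]
    for name in profile_time
    if profile_count.get(name)
}
# profile_avg == {"retrieve": 0.5, "generate": 3.0}
```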
- pydantic model retrieval_qa_benchmark.schema.QARecord
Base data record for questioning & answering
- Fields:
answer (str)
choices (Sequence[str] | None)
context (Sequence[str] | None)
id (str)
question (str)
stack (List[retrieval_qa_benchmark.schema.datatypes.LLMHistory] | None)
type (str)
- field answer: str [Required]
the true answer from the dataset
- field choices: Sequence[str] | None = None
choices the model should choose from. Only present when type is one of ['mcsa', 'mcma']
- field context: Sequence[str] | None = None
list of context strings that are retrieved from db or other sources
- field id: str [Required]
identifier for this record
- field question: str [Required]
question to ask in string
- field stack: List[LLMHistory] | None = []
stacked intermediate prediction results (for multi-hop qa pipelines)
- field type: str [Required]
type of this question. Can be one of ['mcsa', 'mcma']
mcsa: multiple choice single answer
mcma: multiple choice multiple answer
- class Config
- pydantic model retrieval_qa_benchmark.schema.ToolHistory
Tool call history
- Fields:
result (str | None)
thought (str)
tool (str | None)
tool_inputs (str | dict | None)
- field result: str | None = None
Output from this function call
- field thought: str = ''
rationale step from LLM
- field tool: str | None = None
function called in this history
- field tool_inputs: str | dict | None = None
Input for this tool call
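To show how LLMHistory and ToolHistory might fit together in a multi-hop stack, here is a small sketch with dataclass stand-ins that mirror the documented fields and defaults. The tool name "search", the node name "agent", and all values are hypothetical, and summing token counts over the stack is just one plausible use, not something the library is documented to do.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ToolHistory:
    # Stand-in mirroring the documented ToolHistory fields.
    thought: str = ""
    tool: Optional[str] = None
    tool_inputs: Union[str, dict, None] = None
    result: Optional[str] = None


@dataclass
class LLMHistory:
    # Stand-in mirroring the documented LLMHistory fields.
    generated: str
    prompt_tokens: int
    completion_tokens: int
    created_by: str = "default"
    comment: str = ""
    extra: Optional[ToolHistory] = None


stack = [
    LLMHistory(
        generated="I should look this up.",
        prompt_tokens=40,
        completion_tokens=8,
        created_by="agent",
        extra=ToolHistory(
            thought="Need the capital of France.",
            tool="search",
            tool_inputs={"query": "capital of France"},
            result="Paris",
        ),
    ),
    LLMHistory(generated="Paris", prompt_tokens=60, completion_tokens=1),
]

# Aggregate token usage across all intermediate steps.
total_prompt_tokens = sum(step.prompt_tokens for step in stack)
total_completion_tokens = sum(step.completion_tokens for step in stack)
```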
- pydantic model retrieval_qa_benchmark.schema.TransformGraph
Callable graph for BaseTransform
- Fields:
entry_id (str)
nodes (Dict[str, retrieval_qa_benchmark.schema.transform.BaseTransform])
- field entry_id: str [Required]
- field nodes: Dict[str, BaseTransform] [Required]
- classmethod build(nodes: Dict[str, BaseTransform], entry_id: str = '0') TransformGraph
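One way to picture a callable graph of transforms is the traversal sketched below: start at entry_id, apply the node, then follow children[check_status(...)] until a None child is reached. This traversal order is an assumption based on the field descriptions, not the library's confirmed implementation. Node and run_graph are hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, List, Optional


class Node:
    # Stand-in for BaseTransform with one targeted field and a routing rule.
    def __init__(self, field_name: str, fn: Callable[[Dict[str, Any]], Any]) -> None:
        self.field_name = field_name
        self.fn = fn
        self.children: List[Optional["Node"]] = [None, None]

    def check_status(self, current: Dict[str, Any]) -> int:
        # Always pick the first child in this sketch.
        return 0

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        updated = dict(record)
        updated[self.field_name] = self.fn(record)
        return updated


def run_graph(
    nodes: Dict[str, Node], entry_id: str, record: Dict[str, Any]
) -> Dict[str, Any]:
    # Assumed traversal: apply each node, then move to the child it selects.
    node: Optional[Node] = nodes[entry_id]
    while node is not None:
        record = node(record)
        node = node.children[node.check_status(record)]
    return record


strip = Node("question", lambda d: d["question"].strip())
add_context = Node("context", lambda d: ["(retrieved passage placeholder)"])
strip.children = [add_context, None]

out = run_graph(
    {"0": strip, "1": add_context},
    entry_id="0",
    record={"question": " Who wrote Hamlet? ", "context": None},
)
```

The entry key "0" matches the default entry_id in the build signature above.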