Searchers

pydantic model retrieval_qa_benchmark.transforms.searchers.ElSearchSearcher

Elasticsearch searcher

Fields:

dataset_name (Sequence[str])
dataset_split (str)
el_auth (Tuple[str, str])
el_host (str)
template (str)
text_preprocess (Callable)

field dataset_name: Sequence[str] = ['Cohere/wikipedia-22-12-en-embeddings']: dataset name for plugin dataset

field dataset_split: str = 'train': split for that dataset

field el_auth: Tuple[str, str] [Required]: auth tuple for elastic search

field el_host: str [Required]: hostname to elastic search backend

field text_preprocess: Callable = <function text_preprocess>

bm25_filter(query_list: List[str], num_selected: int) → Tuple[List[List[float]], List[List[Entry]]]

BM25 search

Parameters:

query_list (List[str]) – list of queries
num_selected (int) – number of contexts to return

Returns:

distances and entries

Return type:

Tuple[List[List[float]], List[List[Entry]]]
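To make the return layout concrete, here is a minimal, self-contained sketch of BM25 ranking over an in-memory corpus. It is not the library's implementation, which presumably queries the Elasticsearch backend at el_host. The function name `bm25_rank`, the toy corpus, and the parameters `k1` and `b` are assumptions for illustration, and plain strings stand in for Entry objects.

```python
import math
from collections import Counter
from typing import List, Tuple


def bm25_rank(
    query_list: List[str],
    corpus: List[str],
    num_selected: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> Tuple[List[List[float]], List[List[str]]]:
    """Score every document against every query with Okapi BM25.

    Returns one list of scores and one list of documents per query,
    mirroring the (distances, entries) layout documented above.
    """
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))

    all_scores, all_entries = [], []
    for query in query_list:
        terms = query.lower().split()
        scored = []
        for doc_id, doc in enumerate(docs):
            tf = Counter(doc)
            score = 0.0
            for term in terms:
                if tf[term] == 0:
                    continue
                idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
                norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
                score += idf * tf[term] * (k1 + 1) / norm
            scored.append((score, doc_id))
        # Highest score first; keep only the top num_selected documents.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        top = scored[:num_selected]
        all_scores.append([s for s, _ in top])
        all_entries.append([corpus[i] for _, i in top])
    return all_scores, all_entries


# Hypothetical toy corpus, one query.
corpus = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "The Eiffel Tower is in Paris",
]
scores, entries = bm25_rank(["capital of France"], corpus, num_selected=2)
```

Both returned lists have one inner list per query, and each inner list holds num_selected items ordered from best to worst match.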

para_id_list_to_entry(para_id_list: List[List[int]]) → List[List[Entry]]

Parse a list of paragraph ID lists into lists of entries

Parameters:

para_id_list (List[List[int]]) – paragraph IDs

Returns:

lists of entries

Return type:

List[List[Entry]]

para_id_to_entry(para_id: int, start_para_list: List[int] | None) → Tuple[str, str]

Parse a paragraph ID into its title and paragraph text

Parameters:

para_id (int) – paragraph ID (row position)
start_para_list (Optional[List[int]]) – list of start paragraphs

Returns:

title and paragraph

Return type:

Tuple[str, str]
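The reference does not spell out how start_para_list is used. One plausible reading, given that dataset_name is a sequence, is that it holds the global row position at which each dataset begins. The sketch below follows that assumption only. `DATASETS`, the row layout, and the extra `start_para_list` argument on the list variant are hypothetical, and (title, text) tuples stand in for Entry objects.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the loaded datasets: two lists of rows.
DATASETS: List[List[Dict[str, str]]] = [
    [
        {"title": "Paris", "text": "Capital of France."},
        {"title": "Berlin", "text": "Capital of Germany."},
    ],
    [{"title": "FAISS", "text": "A similarity search library."}],
]


def para_id_to_entry(
    para_id: int, start_para_list: Optional[List[int]]
) -> Tuple[str, str]:
    """Map a global row position to a (title, paragraph) pair."""
    if start_para_list is None:
        # Assumption: without offsets, the ID indexes the first dataset.
        row = DATASETS[0][para_id]
    else:
        # Find the last dataset whose starting offset is <= para_id.
        ds_idx = bisect_right(start_para_list, para_id) - 1
        row = DATASETS[ds_idx][para_id - start_para_list[ds_idx]]
    return row["title"], row["text"]


def para_id_list_to_entry(
    para_id_list: List[List[int]], start_para_list: Optional[List[int]] = None
) -> List[List[Tuple[str, str]]]:
    """Apply the lookup to every ID of every query, keeping the nesting."""
    return [
        [para_id_to_entry(pid, start_para_list) for pid in ids]
        for ids in para_id_list
    ]


starts = [0, 2]  # dataset 0 holds rows 0-1, dataset 1 starts at row 2
result = para_id_list_to_entry([[2, 0], [1]], starts)
```

The outer list keeps one inner list per query, so the result lines up with the nested ID lists passed in.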

search(query_list: list, num_selected: int, context: List[List[str]] | None = None) → Tuple[List[List[float]], List[List[Entry]]]

Search interface shared by every BaseSearcher

pydantic model retrieval_qa_benchmark.transforms.searchers.FaissElSearchBM25HybridSearcher

Fields:

dataset_name (Sequence[str])
dataset_split (str)
el_auth (Tuple[str, str])
el_host (str)
embedding_name (str)
index_path (str)
is_raw_rank (bool)
nprobe (int)
num_filtered (int)
template (str)

field dataset_name: Sequence[str] = ['Cohere/wikipedia-22-12-en-embeddings']: dataset name for plugin dataset

field dataset_split: str = 'train': split for that dataset

field el_auth: Tuple[str, str] [Required]

field el_host: str [Required]

field embedding_name: str [Required]

field index_path: str [Required]

field is_raw_rank: bool [Required]

field nprobe: int = 128

field num_filtered: int [Required]

bm25_filter(**kwargs: Any) → Any

emb_filter(**kwargs: Any) → Any

faiss_bm25_hybrid_filter(query_list: List[str], num_selected: int, num_filtered: int, is_raw_rank: bool) → Tuple[List[List[float]], List[List[Entry]]]
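The signature suggests a two-stage flow: narrow the corpus to num_filtered candidates, then return the best num_selected of them. The sketch below illustrates one such flow, an embedding filter followed by a lexical re-rank, under stated assumptions. The stage order is a guess, the token-overlap score is a crude stand-in for BM25, the toy data is made up, and the role of is_raw_rank is not documented here, so it is not modelled.

```python
from typing import Dict, List, Tuple

# Toy corpus: each paragraph has text and a 2-d embedding (made-up values).
PARAGRAPHS: List[Dict] = [
    {"text": "Paris is the capital of France", "vec": (0.9, 0.1)},
    {"text": "France borders Spain", "vec": (0.8, 0.2)},
    {"text": "FAISS searches dense vectors", "vec": (0.1, 0.9)},
    {"text": "The capital of Germany is Berlin", "vec": (0.7, 0.3)},
]


def l2(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lexical_score(query: str, text: str) -> float:
    """Crude stand-in for BM25: number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def hybrid_filter(
    query: str,
    query_vec: Tuple[float, float],
    num_selected: int,
    num_filtered: int,
) -> Tuple[List[float], List[str]]:
    # Stage 1: keep the num_filtered paragraphs nearest in embedding space.
    by_distance = sorted(PARAGRAPHS, key=lambda p: l2(query_vec, p["vec"]))
    candidates = by_distance[:num_filtered]
    # Stage 2: re-rank those candidates by lexical score, best first.
    reranked = sorted(candidates, key=lambda p: -lexical_score(query, p["text"]))
    top = reranked[:num_selected]
    return [lexical_score(query, p["text"]) for p in top], [p["text"] for p in top]


scores, texts = hybrid_filter(
    "capital of France", (0.88, 0.12), num_selected=2, num_filtered=3
)
```

In this toy run the second-nearest paragraph by embedding ("France borders Spain") is pushed out of the final two by the re-rank, which is the effect a hybrid filter is meant to have.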

index_search(**kwargs: Any) → Any

para_id_list_to_entry(para_id_list: List[List[int]]) → List[List[Entry]]

Parse a list of paragraph ID lists into lists of entries

Parameters:

para_id_list (List[List[int]]) – paragraph IDs

Returns:

lists of entries

Return type:

List[List[Entry]]

para_id_to_entry(para_id: int, start_para_list: List[int] | None) → Tuple[str, str]

Parse a paragraph ID into its title and paragraph text

Parameters:

para_id (int) – paragraph ID (row position)
start_para_list (Optional[List[int]]) – list of start paragraphs

Returns:

title and paragraph

Return type:

Tuple[str, str]

search(query_list: list, num_selected: int, context: List[List[str]] | None = None) → Tuple[List[List[float]], List[List[Entry]]]

Search interface shared by every BaseSearcher

pydantic model retrieval_qa_benchmark.transforms.searchers.FaissElSearchBM25UnionSearcher

Fields:

dataset_name (Sequence[str])
dataset_split (str)
el_auth (Tuple[str, str])
el_host (str)
embedding_name (str)
index_path (str)
nprobe (int)
template (str)
text_preprocess (Callable)

field dataset_name: Sequence[str] = ['Cohere/wikipedia-22-12-en-embeddings']: dataset name for plugin dataset

field dataset_split: str = 'train': split for that dataset

field el_auth: Tuple[str, str] [Required]

field el_host: str [Required]

field embedding_name: str [Required]

field index_path: str [Required]

field nprobe: int = 128

field text_preprocess: Callable = <function text_preprocess>

bm25_filter(**kwargs: Any) → Any

emb_filter(**kwargs: Any) → Any

faiss_bm25_union_filter(query_list: List[str], num_selected: int) → Tuple[List[List[float]], List[List[Entry]]]
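How the union is formed is not documented here. Because embedding distances and BM25 scores live on different scales, one simple option is to merge by rank rather than by score. The sketch below does that with round-robin interleaving and deduplication. The function `union_results` and the sample hits are hypothetical and are not taken from the library.

```python
from typing import List, Tuple


def union_results(
    emb_hits: List[Tuple[float, int]],
    bm25_hits: List[Tuple[float, int]],
    num_selected: int,
) -> List[int]:
    """Merge two ranked hit lists into one list of unique paragraph IDs.

    Hits are interleaved by rank (best of each list first) so that neither
    retriever dominates; an ID found by both is kept only once.
    """
    merged: List[int] = []
    seen = set()
    for rank in range(max(len(emb_hits), len(bm25_hits))):
        for hits in (emb_hits, bm25_hits):
            if rank < len(hits):
                para_id = hits[rank][1]
                if para_id not in seen:
                    seen.add(para_id)
                    merged.append(para_id)
    return merged[:num_selected]


# (score, paragraph ID) pairs, already sorted best-first by each retriever.
emb_hits = [(0.12, 7), (0.30, 3), (0.41, 9)]
bm25_hits = [(11.2, 3), (9.8, 5), (4.1, 7)]
merged_ids = union_results(emb_hits, bm25_hits, num_selected=4)
```

Paragraphs 3 and 7 appear in both hit lists but only once in the merged result, so the union never returns duplicates.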

index_search(**kwargs: Any) → Any

para_id_list_to_entry(para_id_list: List[List[int]]) → List[List[Entry]]

Parse a list of paragraph ID lists into lists of entries

Parameters:

para_id_list (List[List[int]]) – paragraph IDs

Returns:

lists of entries

Return type:

List[List[Entry]]

para_id_to_entry(para_id: int, start_para_list: List[int] | None) → Tuple[str, str]

Parse a paragraph ID into its title and paragraph text

Parameters:

para_id (int) – paragraph ID (row position)
start_para_list (Optional[List[int]]) – list of start paragraphs

Returns:

title and paragraph

Return type:

Tuple[str, str]

search(query_list: list, num_selected: int, context: List[List[str]] | None = None) → Tuple[List[List[float]], List[List[Entry]]]

Search interface shared by every BaseSearcher

pydantic model retrieval_qa_benchmark.transforms.searchers.FaissSearcher

FAISS searcher

Fields:

dataset_name (Sequence[str])
dataset_split (str)
embedding_name (str)
index_path (str)
nprobe (int)
template (str)

field dataset_name: Sequence[str] = ['Cohere/wikipedia-22-12-en-embeddings']: dataset name for plugin dataset

field dataset_split: str = 'train': split for that dataset

field embedding_name: str [Required]: embedding model name

field index_path: str [Required]: path to faiss dumped index

field nprobe: int = 128: number of clusters to search for IVF indices

emb_filter(**kwargs: Any) → Any

index_search(**kwargs: Any) → Any

para_id_list_to_entry(para_id_list: List[List[int]]) → List[List[Entry]]

Parse a list of paragraph ID lists into lists of entries

Parameters:

para_id_list (List[List[int]]) – paragraph IDs

Returns:

lists of entries

Return type:

List[List[Entry]]

para_id_to_entry(para_id: int, start_para_list: List[int] | None) → Tuple[str, str]

Parse a paragraph ID into its title and paragraph text

Parameters:

para_id (int) – paragraph ID (row position)
start_para_list (Optional[List[int]]) – list of start paragraphs

Returns:

title and paragraph

Return type:

Tuple[str, str]

search(query_list: list, num_selected: int, context: List[List[str]] | None = None) → Tuple[List[List[float]], List[List[Entry]]]

Search interface shared by every BaseSearcher

pydantic model retrieval_qa_benchmark.transforms.searchers.MyScaleSearcher

MyScale searcher

Fields:

embedding_name (str)
host (str)
kw_topk (int)
num_filtered (int)
password (str)
port (int)
table_name (str)
template (str)
two_staged (bool)
username (str)

field embedding_name: str [Required]: embedding model name

field host: str [Required]: hostname to MyScale backend

field kw_topk: int = 10: keyword extraction only extract kw_topk keywords

field num_filtered: int = 100: number sample returned in first stage filter. Does not matter if two_staged is False

field password: str = '': password to connect MyScale

field port: int [Required]: port to MyScale backend

field table_name: str = 'Wikipedia': table name to search on

field two_staged: bool = False: If twostaged search (with keyword) is enabled

field username: str = 'default': user name to connect MyScale

retrieve(**kwargs: Any) → Any

search(**kwargs: Any) → Any

Search interface shared by every BaseSearcher
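The fields kw_topk, num_filtered, and two_staged describe a keyword-first, vector-second search. The sketch below shows how those three settings could interact on an in-memory table. It is an assumption-laden illustration: the real searcher runs against a MyScale backend, and the stopword list, the frequency-based keyword extraction, the function `two_stage_retrieve`, and the toy rows are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

STOPWORDS = {"the", "is", "of", "a", "in", "what", "which"}

# Toy table: text plus a 2-d embedding (made-up values).
ROWS: List[Dict] = [
    {"text": "Paris is the capital of France", "vec": (0.9, 0.1)},
    {"text": "Lyon is a city in France", "vec": (0.6, 0.4)},
    {"text": "Berlin is the capital of Germany", "vec": (0.88, 0.12)},
]


def extract_keywords(query: str, kw_topk: int) -> List[str]:
    """Keep the kw_topk most frequent non-stopword tokens of the query."""
    tokens = [t for t in query.lower().split() if t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(kw_topk)]


def distance(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_stage_retrieve(
    query: str,
    query_vec: Tuple[float, float],
    num_selected: int,
    two_staged: bool = False,
    kw_topk: int = 10,
    num_filtered: int = 100,
) -> Tuple[List[float], List[str]]:
    rows = ROWS
    if two_staged:
        # Stage 1: keep rows that mention any keyword, at most num_filtered.
        keywords = extract_keywords(query, kw_topk)
        rows = [
            r for r in rows
            if any(k in r["text"].lower().split() for k in keywords)
        ]
        rows = rows[:num_filtered]
    # Final stage: rank the remaining rows by embedding distance.
    ranked = sorted(rows, key=lambda r: distance(query_vec, r["vec"]))[:num_selected]
    return [distance(query_vec, r["vec"]) for r in ranked], [r["text"] for r in ranked]


query = "which city is in France"
single = two_stage_retrieve(query, (0.88, 0.12), num_selected=1)
staged = two_stage_retrieve(query, (0.88, 0.12), num_selected=1, two_staged=True)
```

In this toy run the single-stage search returns the Berlin row because its embedding happens to be closest, while the two-staged search drops it in the keyword stage and returns the Paris row. When two_staged is False, kw_topk and num_filtered have no effect, matching the field descriptions above.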