Searchers

pydantic model retrieval_qa_benchmark.transforms.searchers.ElSearchSearcher

Elasticsearch searcher

Fields:
  • dataset_name (Sequence[str])

  • dataset_split (str)

  • el_auth (Tuple[str, str])

  • el_host (str)

  • template (str)

  • text_preprocess (Callable)

field dataset_name: Sequence[str] = ['Cohere/wikipedia-22-12-en-embeddings']

dataset name for plugin dataset

field dataset_split: str = 'train'

split for that dataset

field el_auth: Tuple[str, str] [Required]

auth tuple for Elasticsearch

field el_host: str [Required]

hostname of the Elasticsearch backend

field text_preprocess: Callable = <function text_preprocess>

bm25_filter(query_list: List[str], num_selected: int) -> Tuple[List[List[float]], List[List[Entry]]]

BM25 search

Parameters:
  • query_list (List[str]) – list of queries

  • num_selected (int) – number of contexts to return

Returns:

distances and entries

Return type:

Tuple[List[List[float]], List[List[Entry]]]
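
The following is a minimal, self-contained sketch of Okapi BM25 ranking, included only to illustrate the shape of what bm25_filter returns (one list of scores and one list of hits per query). It is not the library's implementation, which queries an Elasticsearch backend; the names bm25_scores and bm25_top_k, the whitespace tokenizer, and the k1 and b defaults are assumptions made for illustration. The sketch returns similarity scores (higher is better), whereas the reference above describes the first element as distances.

```python
import math
from collections import Counter
from typing import List, Tuple


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def bm25_top_k(
    query_list: List[str], docs: List[str], num_selected: int
) -> Tuple[List[List[float]], List[List[str]]]:
    """Return the top `num_selected` scores and documents for each query."""
    all_scores, all_docs = [], []
    for query in query_list:
        scores = bm25_scores(query, docs)
        order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:num_selected]
        all_scores.append([scores[i] for i in order])
        all_docs.append([docs[i] for i in order])
    return all_scores, all_docs
```

Calling bm25_top_k(["vector indexes"], docs, 2) yields two parallel nested lists, mirroring the Tuple[List[List[float]], List[List[Entry]]] return type with plain strings standing in for Entry.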

para_id_list_to_entry(para_id_list: List[List[int]]) -> List[List[Entry]]

parse lists of paragraph IDs into lists of entries

Parameters:

para_id_list (List[List[int]]) – paragraph ids

Returns:

lists of entries

Return type:

List[List[Entry]]
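
The nested-list mapping described above can be sketched as follows. This is an illustrative stand-in, not the library's code: the Entry dataclass here is a hypothetical substitute for the library's own Entry type (whose fields are not listed in this reference), and CORPUS and ids_to_entries are made-up names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Entry:
    """Hypothetical stand-in for the library's Entry type."""
    rank: int
    paragraph_id: int
    title: str
    paragraph: str


# Hypothetical corpus keyed by paragraph ID (row position).
CORPUS: Dict[int, Tuple[str, str]] = {
    0: ("Alan Turing", "Alan Turing was a mathematician."),
    1: ("Alan Turing", "He worked at Bletchley Park."),
    2: ("Faiss", "Faiss is a similarity search library."),
}


def ids_to_entries(para_id_list: List[List[int]]) -> List[List[Entry]]:
    """Turn one list of paragraph IDs per query into one list of entries per query."""
    result = []
    for ids in para_id_list:
        row = []
        for rank, pid in enumerate(ids):
            title, paragraph = CORPUS[pid]
            row.append(Entry(rank=rank, paragraph_id=pid, title=title, paragraph=paragraph))
        result.append(row)
    return result
```

The outer list keeps one element per query and the inner list preserves the ranking order of the IDs it was given.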

para_id_to_entry(para_id: int, start_para_list: List[int] | None) -> Tuple[str, str]

resolve a paragraph ID to its title and paragraph

Parameters:
  • para_id (int) – paragraph ID (row position)

  • start_para_list (Optional[List[int]]) – list of start paragraphs

Returns:

title and paragraph

Return type:

Tuple[str, str]
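
This reference does not say what start_para_list contains. One plausible reading, given that dataset_name accepts several datasets and para_id is a row position, is that it holds the first global row of each dataset. Under that assumption only, a global row can be mapped to a dataset and a local row with a binary search; locate_row is a hypothetical helper, not part of the library.

```python
from bisect import bisect_right
from typing import List, Tuple


def locate_row(para_id: int, start_para_list: List[int]) -> Tuple[int, int]:
    """Map a global row position to (dataset index, row within that dataset).

    Assumes `start_para_list` is sorted and holds the first global row of
    each dataset, which is a guess about the parameter's meaning.
    """
    dataset_idx = bisect_right(start_para_list, para_id) - 1
    return dataset_idx, para_id - start_para_list[dataset_idx]
```

With starts of [0, 100, 250], row 100 is the first row of the second dataset and row 249 is the last row of it.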

search(query_list: list, num_selected: int, context: List[List[str]] | None = None) -> Tuple[List[List[float]], List[List[Entry]]]

search interface for every BaseSearcher
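
To show how the two parallel nested lists returned by search line up, here is a small sketch built around a toy stand-in. toy_search, format_results, and PARAGRAPHS are hypothetical names; the toy ranks paragraphs by shared words and returns plain strings instead of Entry objects, so it only mirrors the return shape, not the behavior of any real searcher.

```python
from typing import List, Tuple

PARAGRAPHS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Faiss indexes dense vectors.",
]


def toy_search(query_list: List[str], num_selected: int) -> Tuple[List[List[float]], List[List[str]]]:
    """Hypothetical stand-in for search(): rank paragraphs by shared words."""
    distances, entries = [], []
    for query in query_list:
        q_words = set(query.lower().split())
        # Negated overlap, so that smaller values mean closer matches.
        scored = sorted(
            (-float(len(q_words & set(p.lower().rstrip(".").split()))), p)
            for p in PARAGRAPHS
        )[:num_selected]
        distances.append([d for d, _ in scored])
        entries.append([p for _, p in scored])
    return distances, entries


def format_results(
    query_list: List[str], distances: List[List[float]], entries: List[List[str]]
) -> List[str]:
    """Walk the outer lists per query and the inner lists per ranked hit."""
    lines = []
    for query, dists, ents in zip(query_list, distances, entries):
        for rank, (dist, ent) in enumerate(zip(dists, ents)):
            lines.append(f"{query!r} #{rank} ({dist:+.1f}): {ent}")
    return lines
```

The outer index selects the query and the inner index selects the rank, so distances[i][j] always describes entries[i][j].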

pydantic model retrieval_qa_benchmark.transforms.searchers.FaissElSearchBM25HybridSearcher

Fields:
  • dataset_name (Sequence[str])

  • dataset_split (str)

  • el_auth (Tuple[str, str])

  • el_host (str)

  • embedding_name (str)

  • index_path (str)

  • is_raw_rank (bool)

  • nprobe (int)

  • num_filtered (int)

  • template (str)

field dataset_name: Sequence[str] = ['Cohere/wikipedia-22-12-en-embeddings']

dataset name for plugin dataset

field dataset_split: str = 'train'

split for that dataset

field el_auth: Tuple[str, str] [Required]

field el_host: str [Required]

field embedding_name: str [Required]

field index_path: str [Required]

field is_raw_rank: bool [Required]

field nprobe: int = 128

field num_filtered: int [Required]

bm25_filter(**kwargs: Any) -> Any

emb_filter(**kwargs: Any) -> Any

faiss_bm25_hybrid_filter(query_list: List[str], num_selected: int, num_filtered: int, is_raw_rank: bool) -> Tuple[List[List[float]], List[List[Entry]]]

para_id_list_to_entry(para_id_list: List[List[int]]) -> List[List[Entry]]

parse lists of paragraph IDs into lists of entries

Parameters:

para_id_list (List[List[int]]) – paragraph ids

Returns:

lists of entries

Return type:

List[List[Entry]]

para_id_to_entry(para_id: int, start_para_list: List[int] | None) -> Tuple[str, str]

resolve a paragraph ID to its title and paragraph

Parameters:
  • para_id (int) – paragraph ID (row position)

  • start_para_list (Optional[List[int]]) – list of start paragraphs

Returns:

title and paragraph

Return type:

Tuple[str, str]

search(query_list: list, num_selected: int, context: List[List[str]] | None = None) -> Tuple[List[List[float]], List[List[Entry]]]

search interface for every BaseSearcher
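
faiss_bm25_hybrid_filter is not described in this reference. Its num_filtered and num_selected parameters suggest a filter-then-rerank pipeline, and the sketch below shows that general pattern under that assumption: one retriever narrows the corpus to num_filtered candidates, and a second score picks the final num_selected. Which retriever runs first in the library, and what is_raw_rank changes, are not stated here, so neither is modeled. two_stage_rerank is a hypothetical name.

```python
from typing import Callable, Dict, List, Tuple


def two_stage_rerank(
    first_stage: Dict[int, float],
    second_stage_score: Callable[[int], float],
    num_filtered: int,
    num_selected: int,
) -> List[Tuple[int, float]]:
    """Filter by first-stage distance, then re-rank by a second-stage score.

    `first_stage` maps paragraph ID to a distance (lower is better);
    `second_stage_score` returns a relevance score (higher is better).
    """
    # Stage 1: keep the num_filtered candidates with the smallest distance.
    candidates = sorted(first_stage, key=first_stage.get)[:num_filtered]
    # Stage 2: re-rank the survivors by the second score.
    reranked = sorted(candidates, key=second_stage_score, reverse=True)
    return [(pid, second_stage_score(pid)) for pid in reranked[:num_selected]]
```

A candidate that the first stage drops can never reappear, however well it would score in the second stage, which is why num_filtered is usually set well above num_selected in pipelines of this kind.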

pydantic model retrieval_qa_benchmark.transforms.searchers.FaissElSearchBM25UnionSearcher

Fields:
  • dataset_name (Sequence[str])

  • dataset_split (str)

  • el_auth (Tuple[str, str])

  • el_host (str)

  • embedding_name (str)

  • index_path (str)

  • nprobe (int)

  • template (str)

  • text_preprocess (Callable)

field dataset_name: Sequence[str] = ['Cohere/wikipedia-22-12-en-embeddings']

dataset name for plugin dataset

field dataset_split: str = 'train'

split for that dataset

field el_auth: Tuple[str, str] [Required]

field el_host: str [Required]

field embedding_name: str [Required]

field index_path: str [Required]

field nprobe: int = 128

field text_preprocess: Callable = <function text_preprocess>

bm25_filter(**kwargs: Any) -> Any

emb_filter(**kwargs: Any) -> Any

faiss_bm25_union_filter(query_list: List[str], num_selected: int) -> Tuple[List[List[float]], List[List[Entry]]]

para_id_list_to_entry(para_id_list: List[List[int]]) -> List[List[Entry]]

parse lists of paragraph IDs into lists of entries

Parameters:

para_id_list (List[List[int]]) – paragraph ids

Returns:

lists of entries

Return type:

List[List[Entry]]

para_id_to_entry(para_id: int, start_para_list: List[int] | None) -> Tuple[str, str]

resolve a paragraph ID to its title and paragraph

Parameters:
  • para_id (int) – paragraph ID (row position)

  • start_para_list (Optional[List[int]]) – list of start paragraphs

Returns:

title and paragraph

Return type:

Tuple[str, str]

search(query_list: list, num_selected: int, context: List[List[str]] | None = None) -> Tuple[List[List[float]], List[List[Entry]]]

search interface for every BaseSearcher
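
faiss_bm25_union_filter is likewise undocumented here. As a rough illustration of what taking the union of two ranked result lists can look like, the sketch below interleaves the two rankings and drops duplicates. How the library actually orders, scores, or truncates the union is not stated in this reference; union_ranked is a hypothetical helper.

```python
from typing import List


def union_ranked(ids_a: List[int], ids_b: List[int], num_selected: int) -> List[int]:
    """Interleave two ranked ID lists, keeping the first occurrence of each ID."""
    merged, seen = [], set()
    for i in range(max(len(ids_a), len(ids_b))):
        # Take rank i from each list in turn so neither retriever dominates.
        for ids in (ids_a, ids_b):
            if i < len(ids) and ids[i] not in seen:
                seen.add(ids[i])
                merged.append(ids[i])
    return merged[:num_selected]
```

Interleaving sidesteps the problem that BM25 scores and embedding distances are not on a comparable scale, at the cost of ignoring the score magnitudes entirely.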

pydantic model retrieval_qa_benchmark.transforms.searchers.FaissSearcher

FAISS searcher

Fields:
  • dataset_name (Sequence[str])

  • dataset_split (str)

  • embedding_name (str)

  • index_path (str)

  • nprobe (int)

  • template (str)

field dataset_name: Sequence[str] = ['Cohere/wikipedia-22-12-en-embeddings']

dataset name for plugin dataset

field dataset_split: str = 'train'

split for that dataset

field embedding_name: str [Required]

embedding model name

field index_path: str [Required]

path to faiss dumped index

field nprobe: int = 128

number of clusters to search for IVF indices
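
In an IVF (inverted file) index, vectors are grouped by their nearest centroid, and a query scans only the nprobe clusters whose centroids are closest to it. The pure-Python sketch below illustrates that trade-off on 2-D points; it is not Faiss code, and sq_dist, build_ivf, and ivf_search are hypothetical names with hand-picked centroids instead of a trained quantizer.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors: List[Vector], centroids: List[Vector]) -> Dict[int, List[int]]:
    """Assign each vector ID to the inverted list of its nearest centroid."""
    lists: Dict[int, List[int]] = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(
    query: Vector,
    vectors: List[Vector],
    centroids: List[Vector],
    lists: Dict[int, List[int]],
    nprobe: int,
    k: int,
) -> List[int]:
    """Return the k nearest vector IDs found in the nprobe closest clusters."""
    # Visit only the nprobe clusters whose centroids are closest to the query.
    probed = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probed for vid in lists[c]]
    return sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))[:k]
```

A query that sits near a cluster boundary can miss a true neighbor stored in the adjacent cluster when nprobe is small; raising nprobe recovers it at the cost of scanning more vectors.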

emb_filter(**kwargs: Any) -> Any

para_id_list_to_entry(para_id_list: List[List[int]]) -> List[List[Entry]]

parse lists of paragraph IDs into lists of entries

Parameters:

para_id_list (List[List[int]]) – paragraph ids

Returns:

lists of entries

Return type:

List[List[Entry]]

para_id_to_entry(para_id: int, start_para_list: List[int] | None) -> Tuple[str, str]

resolve a paragraph ID to its title and paragraph

Parameters:
  • para_id (int) – paragraph ID (row position)

  • start_para_list (Optional[List[int]]) – list of start paragraphs

Returns:

title and paragraph

Return type:

Tuple[str, str]

search(query_list: list, num_selected: int, context: List[List[str]] | None = None) -> Tuple[List[List[float]], List[List[Entry]]]

search interface for every BaseSearcher

pydantic model retrieval_qa_benchmark.transforms.searchers.MyScaleSearcher

MyScale Searcher

Fields:
  • embedding_name (str)

  • host (str)

  • kw_topk (int)

  • num_filtered (int)

  • password (str)

  • port (int)

  • table_name (str)

  • template (str)

  • two_staged (bool)

  • username (str)

field embedding_name: str [Required]

embedding model name

field host: str [Required]

hostname of the MyScale backend

field kw_topk: int = 10

keyword extraction keeps only the top kw_topk keywords

field num_filtered: int = 100

number of samples returned by the first-stage filter; ignored when two_staged is False

field password: str = ''

password for connecting to MyScale

field port: int [Required]

port of the MyScale backend

field table_name: str = 'Wikipedia'

table name to search on

field two_staged: bool = False

whether two-staged search (with keywords) is enabled

field username: str = 'default'

username for connecting to MyScale

retrieve(**kwargs: Any) -> Any

search(**kwargs: Any) -> Any

search interface for every BaseSearcher
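
The kw_topk, num_filtered, and two_staged fields together describe a keyword-assisted two-stage search. The sketch below is one way such a pipeline could work, assuming the keyword stage runs first and the vector ranking second; the reference does not confirm that order, and the real searcher issues queries against a MyScale table instead of filtering in Python. extract_keywords, two_staged_search, and STOPWORDS are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

STOPWORDS = {"the", "a", "an", "of", "is", "in", "what", "who", "was", "to"}

Row = Tuple[str, Tuple[float, float]]  # (paragraph text, embedding)


def extract_keywords(query: str, kw_topk: int) -> List[str]:
    """Hypothetical keyword extractor: most frequent non-stopword tokens."""
    tokens = [t for t in query.lower().split() if t not in STOPWORDS]
    return [tok for tok, _ in Counter(tokens).most_common(kw_topk)]


def two_staged_search(
    query: str,
    query_vec: Tuple[float, float],
    rows: List[Row],
    kw_topk: int,
    num_filtered: int,
    num_selected: int,
) -> List[Tuple[float, str]]:
    """Keyword filter first, then rank the survivors by vector distance."""
    keywords = extract_keywords(query, kw_topk)
    # Stage 1: keep up to num_filtered rows that mention any keyword.
    filtered = [r for r in rows if any(k in r[0].lower() for k in keywords)][:num_filtered]

    def dist(row: Row) -> float:
        return sum((x - y) ** 2 for x, y in zip(query_vec, row[1]))

    # Stage 2: order the survivors by squared L2 distance to the query vector.
    ranked = sorted(filtered, key=dist)
    return [(dist(r), r[0]) for r in ranked[:num_selected]]
```

In this sketch a row with a perfectly matching embedding is still excluded if it contains none of the extracted keywords, which is the main behavioral difference from a single-stage vector search.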