Cohere Rerank 3.5
Review performance benchmarks for the cohere.rerank.3-5 (Cohere Rerank 3.5) model hosted on one RERANK_COHERE unit of a dedicated AI cluster in OCI Generative AI.
A rerank model takes a query and a list of texts as input and ranks the texts by their relevance score, that is, by how well each text matches the query.
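To make the input and output shape concrete, here is a minimal sketch of a reranker's interface. The function name `toy_rerank`, the result keys, and the word-overlap score are all hypothetical stand-ins for illustration. They are not the Rerank 3.5 model or the OCI API, which compute relevance with a trained model.

```python
def toy_rerank(query, texts, top_n=None):
    """Rank texts by a stand-in relevance score (NOT the real model).

    The score is the fraction of query words that also appear in the text.
    This is an assumption made purely to illustrate the query-plus-texts
    input and the ranked output.
    """
    query_words = set(query.lower().split())
    results = []
    for index, text in enumerate(texts):
        text_words = set(text.lower().split())
        if query_words:
            score = len(query_words & text_words) / len(query_words)
        else:
            score = 0.0
        results.append({"index": index, "relevance_score": score})
    # Highest score first; ties keep their original input order.
    results.sort(key=lambda r: r["relevance_score"], reverse=True)
    return results if top_n is None else results[:top_n]


texts = [
    "How to reset your password",
    "Billing and invoices",
    "Password rules",
]
for result in toy_rerank("reset password", texts):
    print(result["index"], result["relevance_score"], texts[result["index"]])
```

Each result refers back to its input text by position, so the caller can reorder the original list without the texts being copied into the response.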
Rerank 3.5 Benchmark Scenarios
- The query is 100 tokens for all scenarios.
- All scenarios have only one supporting document that's 10,000 tokens long.
- Each scenario chunks this 10,000-token document based on a max_tokens_per_doc parameter. The values are 64, 128, 256, 512, 1024, 2048, and 4096.
- The maximum chunk size is 4096 tokens, which is the maximum number of tokens that a Rerank 3.5 model can process in one pass.
- Because the document is 10,000 tokens long and the model's context length is 4096 tokens, in all the scenarios, the document is broken into chunks.
- Each chunk includes:
  - Padding tokens: To ensure the input fits the model's expected format.
  - The query: 100 tokens.
  - A document section: For example, for a max_tokens_per_doc of 4096 tokens, each chunk includes one of the following document sections:
    - Document section 1: Document tokens 0 to 3,992.
    - Document section 2: Document tokens 3,993 to 7,985.
    - Document section 3: Document tokens 7,986 to 9,999. This section is smaller than the other two sections, because the document is only 10,000 tokens long.
- Each benchmark scenario is defined by R(max_tokens_per_doc, 100).
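The chunking described above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual implementation. The function name `split_document` is hypothetical. The `overhead_tokens=3` default is inferred from the 4096 example (4096 - 100 - 3,993 = 3). Capping the section length at the smaller of max_tokens_per_doc and the room left in the 4096-token context is also an assumption.

```python
def split_document(doc_tokens=10_000, max_tokens_per_doc=4096,
                   query_tokens=100, context_length=4096, overhead_tokens=3):
    """Return inclusive (start, end) token ranges, one per document section.

    Assumption: a section may not exceed max_tokens_per_doc, and the
    section plus the query plus padding must fit in the context length.
    """
    room = context_length - query_tokens - overhead_tokens
    section_len = min(max_tokens_per_doc, room)
    if section_len <= 0:
        raise ValueError("no room left for document tokens")

    sections = []
    start = 0
    while start < doc_tokens:
        # The last section is shorter when the document runs out.
        end = min(start + section_len, doc_tokens) - 1
        sections.append((start, end))
        start = end + 1
    return sections


print(split_document(max_tokens_per_doc=4096))
# [(0, 3992), (3993, 7985), (7986, 9999)]

# Number of sections per scenario under the same assumptions.
for value in (64, 128, 256, 512, 1024, 2048, 4096):
    print(value, len(split_document(max_tokens_per_doc=value)))
```

Under these assumptions, smaller max_tokens_per_doc values produce many more sections for the same 10,000-token document, so the scenarios differ in both section size and section count.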
R(64,100)
Batch Size | Time to First Token (TTFT) (seconds) | Request-level Latency (seconds) | Request-level Throughput (requests per second, RPS) |
---|---|---|---|
1 | 0.13 | 0.13 | 7.64 |
2 | 0.11 | 0.11 | 8.96 |
4 | 0.11 | 0.11 | 9.12 |
8 | 0.11 | 0.11 | 9.06 |
24 | 0.12 | 0.12 | 8.33 |
48 | 0.14 | 0.14 | 7.19 |
96 | 0.17 | 0.17 | 5.86 |
R(128,100)
Batch Size | Time to First Token (TTFT) (seconds) | Request-level Latency (seconds) | Request-level Throughput (requests per second, RPS) |
---|---|---|---|
1 | 0.11 | 0.11 | 9.15 |
2 | 0.11 | 0.11 | 9.12 |
4 | 0.11 | 0.11 | 9.00 |
8 | 0.11 | 0.11 | 8.81 |
24 | 0.13 | 0.13 | 7.71 |
48 | 0.16 | 0.16 | 6.34 |
96 | 0.20 | 0.20 | 4.81 |
R(256,100)
Batch Size | Time to First Token (TTFT) (seconds) | Request-level Latency (seconds) | Request-level Throughput (requests per second, RPS) |
---|---|---|---|
1 | 0.11 | 0.11 | 9.10 |
2 | 0.11 | 0.11 | 9.03 |
4 | 0.11 | 0.11 | 8.73 |
8 | 0.12 | 0.12 | 8.14 |
24 | 0.15 | 0.15 | 6.47 |
48 | 0.20 | 0.20 | 4.91 |
96 | 0.28 | 0.28 | 3.52 |
R(512,100)
Batch Size | Time to First Token (TTFT) (seconds) | Request-level Latency (seconds) | Request-level Throughput (requests per second, RPS) |
---|---|---|---|
1 | 0.11 | 0.11 | 8.94 |
2 | 0.11 | 0.11 | 8.61 |
4 | 0.12 | 0.12 | 7.91 |
8 | 0.14 | 0.14 | 6.85 |
24 | 0.20 | 0.20 | 4.87 |
48 | 0.30 | 0.30 | 3.22 |
96 | 0.54 | 0.54 | 1.83 |
R(1024,100)
Batch Size | Time to First Token (TTFT) (seconds) | Request-level Latency (seconds) | Request-level Throughput (requests per second, RPS) |
---|---|---|---|
1 | 0.12 | 0.12 | 8.11 |
2 | 0.13 | 0.13 | 7.22 |
4 | 0.15 | 0.15 | 6.24 |
8 | 0.19 | 0.19 | 4.99 |
24 | 0.45 | 0.45 | 2.20 |
48 | 0.73 | 0.73 | 1.34 |
96 | 1.38 | 1.38 | 0.72 |
R(2048,100)
Batch Size | Time to First Token (TTFT) (seconds) | Request-level Latency (seconds) | Request-level Throughput (requests per second, RPS) |
---|---|---|---|
1 | 0.15 | 0.15 | 6.13 |
2 | 0.18 | 0.18 | 5.14 |
4 | 0.25 | 0.25 | 3.84 |
8 | 0.38 | 0.38 | 2.52 |
24 | 1.05 | 1.05 | 0.94 |
48 | 2.01 | 2.01 | 0.49 |
96 | 3.77 | 3.77 | 0.26 |
R(4096,100)
Batch Size | Time to First Token (TTFT) (seconds) | Request-level Latency (seconds) | Request-level Throughput (requests per second, RPS) |
---|---|---|---|
1 | 7.35 | 7.35 | 4.65 |
2 | 7.35 | 7.35 | 3.71 |
4 | 7.35 | 7.35 | 2.43 |
8 | 7.35 | 7.35 | 1.24 |
24 | 7.35 | 7.35 | 0.49 |
48 | 7.35 | 7.35 | 0.26 |
96 | 7.35 | 7.35 | 0.14 |