
Evaluation of BeeChat

Executive Summary

BeeChat is a scalable chatbot platform leveraging Ollama and OpenWebUI for higher-education contexts. It combines domain-specific fact retrieval, summarization, and built-in ethical and security safeguards to deliver reliable, context-aware chatbot responses.

We conducted two types of tests to assess system performance:

  1. Quality Evaluation

    We manually reviewed outputs from 15 test cases (n = 15) spanning diverse tasks and simple Q&A prompts, uncovering deficiencies in web-search integration, citation consistency, and inclusive language.

  2. Load Testing

    Using JMeter, we compared two hardware configurations:

    • Single-instance: Ollama 70B Q6 on Mac Studio (M2 Ultra, 192 GB)
    • Dual-instance: MacBook Pro (M3 Max, 128 GB) + Mac Studio (M2 Ultra, 192 GB)

The single-instance setup consistently sustained ~4 tokens/sec with four concurrent users/threads and exhibited stable latency. The dual-instance arrangement increased total throughput but introduced variable latency, especially for longer responses.

From these tests, we established each node’s maximum capacity (Cₘₐₓ) and recorded a 73.3 % compliance rate across realistic scenarios. While 4 tokens/sec represents our lower acceptable bound per node, achieving 10–20 tokens/sec or higher would enable a more responsive user experience.

Recommendations

  • Scaling: Add homogeneous nodes in parallel, using Cₘₐₓ to predict aggregate throughput. Re-run load tests whenever the hardware setup changes, run longer tests, and automate quality evaluations.

  • Upgrade Hardware: For major throughput gains, invest in higher-performance instances, prioritizing GPU compute and memory bandwidth.

  • Optimize Models: When hardware is limited, deploy smaller, specialized models or fine-tune sub-models for targeted use cases to maintain quality and efficiency.

  • Throughput Targets: Aim for 10–20 tokens/sec (or higher) per node as a benchmark for better user experience, compared to our current baseline of ~4 tokens/sec.

These tests are not conclusive but should be seen as part of an ongoing system lifecycle. To ensure long-term quality and scalability, they must be repeated and adapted as hardware, models, or user needs evolve. A full lifecycle should also include automated evaluations, long-term stability checks, failure recovery, load behavior analysis, user feedback integration, and domain-specific quality benchmarking.


1. Use-Case Driven Evaluation of BeeChat’s Answering Quality

A detailed description of the evaluation and findings is available in our paper.

To assess BeeChat’s ability to deliver reliable answers in a higher education context, we developed 15 realistic use cases and corresponding evaluation tests. Each test was manually executed by two independent reviewers, who assessed whether the chatbot's responses aligned with predefined success criteria.

The evaluation focused on seven core dimensions:

  • Knowledge Extraction Accuracy (e.g., extracting relevant details from publicly available documents)
  • Source Attribution (e.g., correctly displaying document sources)
  • General Knowledge Query via Web Search (e.g., retrieving up-to-date factual information)
  • Security Filter Robustness (e.g., rejecting harmful prompts)
  • Ethical Guardrails Enforcement (e.g., preventing generation of destructive or unethical content)
  • Standards Compliance in Text Generation (e.g., ensuring generated texts conform to institutional standards)
  • Content Summarization Accuracy (e.g., effectively summarizing texts and documents)

BeeChat achieved a 73.3 % compliance rate (11 out of 15 test cases). It performed strongly in fact retrieval, summarization, and safeguarding functions. Weaknesses were identified in web-search integration, occasional missing source attributions, and minor lapses in inclusive phrasing.

These findings guide our next steps: improving the web-search module, refining response templates, and preparing for a broader student-facing pilot.


2. Load Testing

2.1 Introduction

To plan horizontal scaling for BeeChat, we measure end-to-end throughput (tokens/sec) as concurrency grows and inspect P95 latency for outliers. From the throughput plateau we derive Cₘₐₓ—the real-world max threads per instance—and then calculate how many nodes are needed (plus a 10 % buffer).

2.2 Method

Setup

We drive automated load tests with JMeter against OpenWebUI’s developer API, which fronts two Ollama inference instances running llama3.3:70B-Q6_K. Although each Ollama engine can theoretically handle up to four parallel sessions, OpenWebUI’s internal queuing means actual concurrency limits must be measured.

Hardware:

  • MacBook Pro (M3 Max, 128 GB)
  • Mac Studio (M2 Ultra, 192 GB)

Deployment Setups:

Setup    OpenWebUI on MacBook    Ollama on MacBook    Ollama on Mac Studio
Single   Yes                     No                   Yes
Dual     Yes                     Yes                  Yes

JMeter Testing

Testing proceeds in two stages: JMeter generates the raw results, which we then visualize in Python.

JMeter Overview:
We used the following JMeter settings:

  • Concurrency sweep: 12 JMX scripts, one for each thread count (1–12)
  • CSV Data Set (config.csv): model,prompt pairs of varying lengths
  • HTTP Header: Authorization: Bearer <token> from OpenWebUI’s Swagger UI
  • Request Body:

    {
      "model": "${model}",
      "messages": [{ "role": "user", "content": "${prompt}" }]
    }

  • Thread Group:

    Parameter       Value
    Threads         1–12
    Ramp-Up         10 s
    Startup Delay   60 s
    Duration        660 s
  • Data Collection:

    • Simple Data Writer (JTL/XML): full response JSON for token counts
    • Summary Report (CSV): elapsed (ms) and allThreads (concurrency)
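The request the samplers issue can be reproduced outside JMeter for a quick smoke test. The sketch below builds the same POST body in Python; the host/port, endpoint path, and model tag are assumptions for illustration and must be adapted to your deployment.

```python
import json
import urllib.request


def build_request(url: str, token: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request that the JMeter HTTP sampler sends."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )


# Against a live deployment you would send it like this (hypothetical host/model):
# req = build_request("http://localhost:3000/api/chat/completions",
#                     "<token>", "llama3.3:70b", "Hello!")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```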

Jupyter Notebook for Visualization:

We set up a Jupyter notebook that performs the following tasks:

  1. Merge all summary CSVs, tag by thread count, compute P95 latency per level and plot “Latency vs. Concurrency.”
  2. Parse JTL XMLs to extract completion_tokens and response_token/s; calculate E2E tokens/sec per request; aggregate mean and percentiles by concurrency and plot “Tokens/sec vs. Concurrency.”
  3. Use the highest thread count with E2E t/s ≥ SLA (e.g. 3 t/s) to set Cₘₐₓ.
  4. Apply instances = ceil(target_users/Cₘₐₓ) + 10% buffer for capacity planning.
  5. Set up monitoring and autoscaling based on live tokens/sec and P95 latency.
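Steps 3 and 4 reduce to a few lines of arithmetic. The sketch below uses illustrative threshold and throughput numbers, not our measured data:

```python
import math


def c_max(throughput_by_threads: dict, sla_tps: float = 3.0) -> int:
    """Step 3: highest concurrency whose E2E tokens/sec still meets the SLA."""
    ok = [n for n, tps in throughput_by_threads.items() if tps >= sla_tps]
    return max(ok) if ok else 0


def instances_needed(target_users: int, cmax: int, buffer: float = 0.10) -> int:
    """Step 4: ceil(users / Cmax), padded with a safety buffer and rounded up."""
    return math.ceil(math.ceil(target_users / cmax) * (1 + buffer))


# Throughput numbers shaped like our single-instance sweep:
measured = {1: 9.0, 2: 5.1, 3: 3.4, 4: 3.1, 5: 2.4, 6: 1.9}
print(c_max(measured))          # 4
print(instances_needed(50, 4))  # 15
```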

We visualized both throughput and latency curves to read off Cₘₐₓ. The analysis runs as a Python Jupyter notebook inside a conda environment.

Note: To reproduce the analysis, first install conda and the required libraries:

conda create -n beechat python=3.10 -y

conda install pandas numpy matplotlib -y

The analysis scripts expect the JMeter-generated files from your load tests as input:

  • JTL logs (e.g. x.jtl)
  • summary CSV reports (e.g. x-summary_report.csv)

They read these from your results directory and write combined CSVs and PNG plots into a timestamped eval_results-_ folder for manual inspection.
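As a minimal, standard-library sketch of the P95 step (assuming JMeter's default summary-report columns elapsed and allThreads):

```python
import csv
import math
from collections import defaultdict


def p95(values):
    """95th percentile via the nearest-rank method."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def p95_by_concurrency(path):
    """Group elapsed times (ms) by concurrency level and take P95 per level."""
    buckets = defaultdict(list)
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            buckets[int(row["allThreads"])].append(float(row["elapsed"]))
    return {n: p95(v) for n, v in sorted(buckets.items())}
```

The notebook applies the same grouping to each merged summary CSV before plotting "Latency vs. Concurrency."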

2.3 Results

We executed a total of 24 load test runs—12 for the single-instance setup and 12 for the dual-instance setup—each sweeping concurrencies from 1 to 12 threads.

We categorized end-to-end throughput (t/s) into the following bands:

E2E Tokens/sec (t/s)   Category
> 20                   Excellent
10–20                  Very Good
5–10                   Good
1–5                    Poor
< 1                    Unacceptable

These thresholds are adapted from the following Reddit discussion: www.reddit.com/r/LocalLLaMA/comments/162pgx9/what_do_yall_consider_acceptable_tokens_per/
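For automated evaluations, these bands can be encoded directly. Note that the table leaves boundary values open, so assigning them to the higher band below is our assumption:

```python
def tps_category(tps: float) -> str:
    """Map an end-to-end tokens/sec value to the report's quality bands.
    Boundary values (e.g. exactly 10 t/s) go to the higher band by assumption."""
    if tps > 20:
        return "Excellent"
    if tps >= 10:
        return "Very Good"
    if tps >= 5:
        return "Good"
    if tps >= 1:
        return "Poor"
    return "Unacceptable"


print(tps_category(3.4))  # Poor
print(tps_category(15))   # Very Good
```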

Single-Instance Throughput t/s

Tokens/s vs. Concurrency (Single)

In the single-instance chart, the E2E tokens/sec (blue) falls off sharply as concurrency rises—from about 9 t/s at 1 thread to ~3.4 t/s at 3 threads, then down to ~1.3 t/s by 12 threads. A small bump at 3–4 threads likely reflects measurement noise, but overall the decline matches the expected behaviour of Ollama’s parallelism mode, which supports 4 concurrent sessions.

By contrast, the backend-reported tokens/sec (orange) levels off around 4 t/s from 4 threads onward, confirming that Ollama’s raw LLM-inference speed remains constant—the end-to-end rate alone degrades. Given our SLA band (1–5 t/s “poor”), we identify Cₘₐₓ(single) = 4 threads: it still meets the lower bound of “acceptable,” but users may experience sluggishness on longer responses (see following timeline by token bin charts).

Single-Instance P95 Latency

P95 Latency vs. Concurrency (Single)

P95 latency starts at ~80 s for a single thread, climbs to ~190 s by 3 threads, dips slightly at 4, then rises steadily—peaking around 270 s at 12 threads. This shows that while typical responses remain under 200 s up to 5–6 users, the worst-case delays grow rapidly under heavier load.

Dual-Instance Throughput t/s

Tokens/s vs. Concurrency (Dual)

In the dual‐instance chart, the E2E tokens/sec curve (blue) drops from about 7.2 t/s at 1 thread to ~4 t/s at 3 threads, then holds above 3 t/s out to 6 threads—suggesting Cₘₐₓ(dual) = 6. The backend‐reported rate (orange) remains around 4.5–4.9 t/s up to 6 threads, showing that two Ollama engines can handle higher aggregate load without per‐request inference slowdowns.

However, the P95-latency vs. concurrency plots and the “timeline by token-size bin” charts (see next sections) reveal that beyond 4–5 users, tail-latencies spike and slow-token bins appear. This indicates that despite a steady average throughput, some users will encounter very sluggish chatbot behavior under heavier load.

Dual-Instance P95 Latency

P95 Latency vs. Concurrency (Dual)

P95 latency begins at ~100 s for a single thread and climbs more steeply than in the single setup—reaching ~380 s at 4 threads, hovering around 400–470 s at 6–9 threads, then spiking to 600 s at 12 threads. While we rely on tokens/sec for Cₘₐₓ (since long prompts naturally extend generation time), these latency curves are useful for spotting outliers and understanding worst-case delays—especially in a mixed-hardware cluster where the slower node amplifies tail latencies.

Timeline by Token-Size Bin

Below we compare how request durations and token-size bins evolve over time in single- and dual-instance setups (for one, three and eight concurrent user-threads). Each image shows a horizontal-bar timeline where white/gray/black bars indicate low/mid/high token-count requests.

Users   Single-Instance Timeline   Dual-Instance Timeline
1       1 User Single              1 User Dual
3       3 Users Single             3 Users Dual
8       8 Users Single             8 Users Dual

For single‐instance timelines, even at 3 users the bars grow noticeably longer and more varied as each request waits its turn, but overall response times remain relatively consistent. In the dual setup, however, we see two distinct behaviors: short‐token requests (white) complete very quickly—likely on the Mac Studio node—while long‐token requests (black) sometimes stall for extended periods when routed to the slower MacBook Pro instance. By 8 users, both setups show increased initial contention (steeper bars at the start), but the single node flattens out more uniformly, whereas the dual cluster remains bimodal: fast for small prompts, but occasionally very slow for large ones. These plots illustrate the trade-off between higher aggregate throughput (dual) and consistent response times (single), especially for long answers.

Cₘₐₓ & Capacity Planning

For our tested system (a 70B Llama model) and our end-to-end tokens/sec SLA (1–5 t/s), the single-instance (Mac Studio) setup achieves Cₘₐₓ = 4 threads before throughput degrades below acceptable levels. The dual-instance cluster (Mac Studio + MacBook Pro) shows higher average throughput but suffers from inconsistent response times on its slower node, so we standardize on the single-instance Cₘₐₓ for scaling.

Target Concurrent Users   Raw Instances = ⌈Users / 4⌉   +10 % Buffer   Final Instances
10                        3                             3.3            4
20                        5                             5.5            6
30                        8                             8.8            9
40                        10                            11.0           11
50                        13                            14.3           15
60                        15                            16.5           17
70                        18                            19.8           20
80                        20                            22.0           22
90                        23                            25.3           26
100                       25                            27.5           28

2.4 Discussion

Our single-instance (Mac Studio) tests show that up to 4 concurrent users can each stream at ca. 4 tokens/sec, an end-to-end throughput that falls into the acceptable but still somewhat “poor” range. Beyond that, user-visible performance might degrade sharply. Although the dual-instance cluster (Mac Studio + MacBook Pro) sustains higher aggregate throughput—up to 6 threads at ≥ 3 t/s—the heterogeneous setup introduces inconsistent waiting times whenever requests land on the slower MacBook node.

Key takeaways:

  • Throughput (E2E t/s) drives UX. We anchor our SLA to user-visible token rates rather than raw inference metrics, since even a fast model can feel sluggish once network, queuing, and framework overhead are included.
  • Homogeneous hardware might scale best. Mixed-node clusters exacerbate tail-latencies: a single weak node can delay a subset of users significantly. For predictable performance, all instances should match the Mac Studio’s memory bandwidth and compute profile.
  • Cₘₐₓ = 4 threads per Mac Studio instance strikes a balance between capacity and responsiveness. At this level, users still receive tokens at the lower bound of “acceptable” speed—albeit with a “poor” classification for longer answers.
  • Trade-offs and optimizations:
    • Consider smaller models (e.g. 30B variants of Llama) or heavier quantization (Q4_K) to boost per-user throughput.
    • Craft system prompts to elicit shorter responses or split long answers into multiple turns.
    • Recommend single-window, single-instance usage rather than multi-window/browser clients to avoid overloading OpenWebUI’s queuing layer.

As a rough rule of thumb, a single Mac Studio can deliver ~9 t/s for one user, dropping to ~4 t/s at four users. This per-user baseline can guide quick capacity estimates for similar hardware or models—but precise Cₘₐₓ values always require empirical load testing under real prompt and concurrency patterns.

Furthermore, one can estimate performance for different hardware options. Erdil et al. (2025, https://arxiv.org/abs/2506.04645) proposed a formula for estimating tokens/sec in autoregressive LLM inference that can be used as a roofline-style model.

For a quick assessment we can simplify the original formula for memory-bound inference. In a roofline view, the time per generated token is bounded by the larger of the compute time and the time to stream the model weights from memory:

    t/s ≈ 1 / max( FLOPs_per_token / F_peak , S_model / B_mem )

Because on hardware where the model (tens of GB) cannot fit entirely in on-chip cache, streaming the full weight set dominates compute time, we drop the compute term and use only:

    t/s ≈ B_mem / S_model

where B_mem is the memory bandwidth (GB/s) and S_model is the model weight footprint (GB).

This gives a quick estimate of the maximum tokens generated per second under memory-bound inference for a single user. For example, here are some calculated model footprints that, combined with the known memory bandwidth, can be used to calculate t/s:

  • 7 B Q6: 7 × 10⁹ × 0.75 B = 5.25 GB
  • 70 B Q6: 70 × 10⁹ × 0.75 B = 52.5 GB
  • 70 B Q4: 70 × 10⁹ × 0.5 B = 35 GB
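These footprints and the bandwidth-over-size estimate can be checked with a few lines of Python (the bytes-per-parameter values are the approximations above: Q6 ≈ 0.75 B, Q4 ≈ 0.5 B):

```python
def footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    """Model weight footprint in GB: parameters (billions) x bytes per parameter."""
    return params_billions * bytes_per_param


def est_tps(bandwidth_gbs: float, size_gb: float) -> float:
    """Memory-bound upper bound: tokens/sec ~ bandwidth / model footprint."""
    return bandwidth_gbs / size_gb


print(footprint_gb(70, 0.75))     # 52.5 GB for 70B Q6
print(round(est_tps(800, 52.5)))  # ~15 t/s on a Mac Studio (M2 Ultra)
```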

Example Estimation - Bandwidth vs. Theoretical Throughput:

Platform                         ca. Bandwidth (GB/s)   7B Q6 (t/s = B/5.25)   70B Q6 (t/s = B/52.5)   70B Q4 (t/s = B/35)
MacBook Pro (M3 Max, 128 GB)     400                    76                     8                       11
Mac Studio (M2 Ultra, 192 GB)    800                    152                    15                      23
Apple M4 Max (546 GB/s)          500                    95                     10                      14
Apple M3 Ultra (819 GB/s)        800                    152                    15                      23
NVIDIA H100 PCIe (80 GB HBM)     2000                   381                    38                      57
2× H100 PCIe (shared for 70 B)   4000                   762                    76                      114

3. Conclusion

This study highlights the trade-offs inherent in deploying LLM-based chat services on fixed hardware. While a homogeneous Mac Studio "cluster" provides predictable throughput (Cₘₐₓ = 4 at ~4 t/s), mixed setups introduce inconsistent tail-latency spikes that degrade user experience. Our preliminary use-case evaluation showed that BeeChat performs acceptably overall—demonstrating strengths in fact retrieval and summarization but revealing gaps in web-search integration, citation consistency, and inclusive language. When considering tokens-per-second as a capacity metric, we confront a clear trade-off: simply adding more hardware would boost performance but is often impractical, so we must instead seek efficiencies through smaller or more aggressively quantized models, even though this may reduce answer quality or limit the generation of long answers.

Short recommendations include enriching the OpenWebUI pipeline with further optimizations, adopting smaller specialist models for mission-critical tasks to ensure speed, and selectively fine-tuning sub-models where high accuracy is paramount. Ultimately, the optimal approach depends strongly on the use case. For scenarios that demand both fast and precise responses—such as detailed educational dialogues—invest in RAG, multi-agent systems, larger or even cloud-hosted models, or fine-tuning.