Compare commits

..

3 Commits

Author SHA1 Message Date
Georgi Gerganov
ff4affb4c1 sync : ggml 2026-02-15 22:24:29 +02:00
Georgi Gerganov
55d58599c8 ggml : bump version to 0.9.7 (ggml/1425) 2026-02-15 22:24:29 +02:00
Georgi Gerganov
1a8c700bfd ggml : bump version to 0.9.6 (ggml/1423) 2026-02-15 22:24:29 +02:00
9 changed files with 2 additions and 2032 deletions

View File

@@ -1,190 +0,0 @@
# llama-eval Codebase Guidelines
## Overview
This directory contains Python evaluation tools for llama.cpp:
- `llama-eval.py` - Main evaluation tool with multiple datasets (AIME, AIME2025, GSM8K, GPQA)
- `llama-server-simulator.py` - Flask-based server simulator for testing
- `test-simulator.sh` - Test script for the simulator
## Build/Run Commands
### Virtual Environment
The project uses a virtual environment located at `venv/`:
```bash
source venv/bin/activate
```
### Running the Main Evaluator
```bash
python llama-eval.py \
--server http://127.0.0.1:8013 \
--model gpt-oss-20b-hf-low \
--dataset aime \
--n_cases 10 \
--grader-type llm \
--seed 42
```
### Running the Simulator (for testing)
```bash
python llama-server-simulator.py --port 8033 --success-rate 0.8
```
### Running Tests
```bash
./test-simulator.sh
```
## Code Style Guidelines
### Imports
- Standard library imports first (argparse, json, os, re, subprocess, sys, time)
- Third-party imports (requests, tqdm, datasets, flask) after standard library
- Relative imports not used
- Group imports by category with blank line between groups
### Formatting
- 4-space indentation
- Max line length: 125 characters (per parent project's .flake8)
- Use double quotes for strings
- Use triple double quotes for docstrings
- Binary operators at the beginning of continued lines
### Naming Conventions
- Classes: PascalCase (e.g., `AimeDataset`, `Grader`, `Processor`)
- Functions: snake_case (e.g., `normalize_number`, `get_prompt`)
- Variables: snake_case (e.g., `question_text`, `correct_count`)
- Constants: UPPER_SNAKE_CASE (e.g., `GRADER_PATTERNS`, `TEMPLATE_REGISTRY`)
- Private methods: prefix with underscore (e.g., `_load_dataset`, `_grade_regex`)
### Types
- Use type hints for all function signatures
- Import from `typing` module: `Dict`, `List`, `Optional`, `Any`, `Tuple`
- Use `@dataclass` for data structures
- Prefer `Optional[T]` over `Union[T, None]`
### Error Handling
- Use try/except for network requests and file operations
- Return `None` or `False` on errors when appropriate
- Use `ValueError` for invalid arguments
- Use `FileNotFoundError` for missing files
- CLI scripts should handle exceptions gracefully
### Dataclasses
- Use `@dataclass` for structured data
- Define fields with explicit types
- Use `Optional[T]` for nullable fields
- Provide default values where appropriate
### String Formatting
- Use f-strings for formatting (Python 3.6+)
- Use triple double quotes for multi-line strings
- Escape backslashes in regex patterns: `r'\\boxed{(\d+)}'`
### File Paths
- Use `pathlib.Path` instead of string paths
- Create directories with `mkdir(parents=True, exist_ok=True)`
- Use `Path.home()` for user home directory
### Logging
- Use `print()` for user-facing output
- Use `sys.stderr` for debug logging
- Simulator writes debug logs to `/tmp/simulator-debug.log`
### Testing
- Test script uses bash with `set -e` for strict error handling
- Simulator runs in background with PID tracking
- Tests verify correct answers, error cases, and edge cases
- Use `curl` for HTTP testing in shell scripts
### Whitespace Cleanup
- Remove trailing whitespace from all lines
- When making edits, do not leave trailing whitespace
## Dataset Support
### AIME Dataset
- 90 questions from 2025 AIME competition
- Answers in `\boxed{answer}` format
- Supports regex, CLI, and LLM grading
### AIME2025 Dataset
- 30 questions from 2025 AIME I & II
- Answers in `\boxed{answer}` format
- Requires loading two config parts
### GSM8K Dataset
- 7473 math word problems
- Answers numeric values with `####` separator
- Supports regex, CLI, and LLM grading
### GPQA Dataset
- 198 questions from GPQA Diamond
- Multiple choice with shuffled options (A, B, C, D)
- **Requires LLM grader** (returns letter A/B/C/D)
## Grading Types
### Regex Grader
- Built-in patterns per dataset
- Prioritizes `\boxed{}` for AIME datasets
- Extracts last number for GSM8K
### CLI Grader
- External script interface
- Call: `grader.sh --answer <pred> --expected <gold>`
- Exit code 0 = correct, non-zero = incorrect
### LLM Grader
- Uses judge model for answer extraction
- Includes few-shot examples
- Case-insensitive comparison
- Required for GPQA
## Configuration
### Sampling Parameters (Optional)
- `--temperature`: Sampling temperature
- `--top-k`: Top K sampling
- `--top-p`: Top P sampling
- `--min-p`: Min P sampling
- Only passed to API if explicitly specified
### Default Values
- `--n_predict`: -1 (infinite)
- `--grader-type`: llm
- `--seed`: 1234
- `--threads`: 32
- `--output`: llama-eval-state.json
## Output Format
### Progress Table
- Shows task ID, dataset, prompt (truncated to 43 chars), expected answer, status
- Uses `tqdm` for progress bars
### Results Summary
- Format: `Results: X/Y correct (Z%)`
- Displayed after all tasks complete
### JSON Output
- Complete eval state saved to output file
- Contains: task IDs, correctness, prompts, extracted answers, sampling config
- Uses `dataclasses.asdict()` for serialization
## HuggingFace Datasets
- Cache directory: `~/.cache/huggingface/datasets`
- Set via `HF_DATASETS_CACHE` environment variable
- Telemetry disabled via `HF_HUB_DISABLE_TELEMETRY=1`
- Datasets loaded with `datasets.load_dataset()`
## Flask Simulator
- Runs on configurable port (default: 5000)
- Endpoint: `/v1/chat/completions` (OpenAI-compatible)
- Uses Dice coefficient for question matching
- Configurable success rate for testing
- Debug logs to `/tmp/simulator-debug.log`

View File

@@ -1,94 +0,0 @@
# llama-eval Implementation Summary
## Overview
Simple evaluation tool for llama.cpp with support for multiple datasets (AIME, GSM8K, GPQA) and flexible grading (regex, CLI, LLM).
## Key Features
- **Multiple Datasets**: AIME, GSM8K, GPQA with proper answer extraction
- **Flexible Grading**: Regex, CLI, or LLM-based grading
- **Parallel Processing**: Configurable thread count for concurrent requests
- **Sampling Parameters**: Temperature, Top K, Top P, Min P (optional)
- **Real-time Feedback**: Progress tracking with detailed output
- **JSON Output**: Complete eval state saved for debugging
- **GPQA Support**: Answer shuffling with reproducible results
## Architecture
### Eval State
```python
@dataclass
class EvalState:
id: str
tasks: List[str]
task_states: Dict[str, Dict[str, Any]]
sampling_config: Dict[str, Any]
```
### Processor
- Handles processing, grading, and state management
- Thread-safe concurrent execution
- Configurable sampling parameters
### Grader
- Abstract grading interface supporting multiple types
- Regex grader with dataset-specific patterns
- CLI grader with external script interface
- LLM grader with configurable server and model
### Datasets
- `AimeDataset`: 90 AIME 2025 questions
- `Aime2025Dataset`: 30 AIME 2025 I & II questions
- `Gsm8kDataset`: 7473 math word problems
- `GpqaDataset`: 198 GPQA Diamond questions with shuffling
## Configuration
### Sampling Parameters (Optional)
- `--temperature`: Sampling temperature
- `--top-k`: Top K sampling
- `--top-p`: Top P sampling
- `--min-p`: Min P sampling
- Only passed if explicitly specified
### Grading Types
- **regex**: Built-in patterns for each dataset
- **cli**: External script with `--answer` and `--expected` args
- **llm**: LLM-based extraction with few-shot examples and configurable server/model
### Dataset Requirements
- **AIME**: Supports regex, CLI, or LLM grader
- **AIME2025**: Supports regex, CLI, or LLM grader
- **GSM8K**: Supports regex, CLI, or LLM grader
- **GPQA**: Requires LLM grader
## Output Format
### Progress Table
```
Task ID Dataset Prompt (first 43 chars) Expected Status
aime_000_001 AIME Complete the following reactions and sel... A pending
```
### Results Summary
```
============================================================
Results: 8/10 correct (80.0%)
============================================================
```
### JSON Output
Complete eval state with task IDs, correctness, prompts, extracted answers, and sampling configuration.
## Technical Details
- Default max tokens: -1 (infinite)
- Default grader type: llm
- Default seed: 1234
- Default threads: 32
- Prompt truncation: First 43 chars + padding + "..."
- Response truncation: Last 10 lines for grading
- GPQA requires LLM grader (returns letter A/B/C/D)
- Judge model defaults to evaluated model if not specified
- Sample answers defined in SAMPLE_ANSWERS dict for few-shot learning

View File

@@ -1,112 +0,0 @@
# llama-eval Evaluation Tool
Simple evaluation tool for llama.cpp with support for multiple datasets.
## Features
- **Multiple Datasets**: AIME, GSM8K, GPQA
- **Flexible Grading**: Regex, CLI, or LLM-based grading
- **Parallel Processing**: Configurable thread count
- **Real-time Feedback**: Progress tracking with detailed output
- **Sampling Parameters**: Temperature, Top K, Top P, Min P
- **JSON Output**: Complete eval state saved for debugging
## Usage
```bash
python llama-eval.py \
--server http://127.0.0.1:8013 \
--model gpt-oss-20b-hf-low \
--judge-model gpt-oss-20b-hf-medium \
--dataset aime \
--n_cases 10 \
--grader-type llm \
--seed 42
```
## CLI Arguments
- `--server`: llama-server URL (default: http://127.0.0.1:8013)
- `--model`: Model name for evaluation (default: llama)
- `--judge-model`: Model name for LLM judge (default: same as main model)
- `--judge-server`: Server URL for LLM judge (default: same as main server)
- `--dataset`: Dataset type (aime, aime2025, gsm8k, gpqa)
- `--n_cases`: Number of cases to evaluate (default: all)
- `--n_predict`: Max tokens to predict per prompt (default: -1, infinite)
- `--temperature`: Sampling temperature (default: not passed)
- `--top-k`: Top K sampling (default: not passed)
- `--top-p`: Top P sampling (default: not passed)
- `--min-p`: Min P sampling (default: not passed)
- `--threads`: Number of threads for parallel requests (default: 32)
- `--verbose`: Show detailed output for each case
- `--output`: Output file for eval state (default: llama-eval-state.json)
- `--grader-type`: Grader type (regex, cli, llm, default: llm)
- `--grader-script`: Path to CLI grader script (required for --grader-type cli)
- `--seed`: Random seed for shuffling (default: 1234)
## Datasets
### AIME
- 90 questions from 2025 AIME competition
- Answers in boxed format: `\boxed{answer}`
- Requires regex grader or LLM grader
### AIME2025
- 30 questions from 2025 AIME I & II competitions
- Answers in boxed format: `\boxed{answer}`
- Supports regex, CLI, or LLM grader
### GSM8K
- 7473 math word problems
- Answers are numeric values
- Requires regex grader or LLM grader
### GPQA
- 198 questions from GPQA Diamond dataset
- Multiple choice with shuffled options
- Requires LLM grader (returns letter A, B, C, or D)
## Grading Types
### Regex Grader
Built-in patterns for different datasets:
- AIME: `\boxed{(\d+)}|\b(\d+)\b`
- AIME2025: `\boxed{(\d+)}|\b(\d+)\b`
- GSM8K: `\b(\d+)\b`
- GPQA: Letter extraction (A, B, C, D)
### CLI Grader
External script interface:
```bash
./grader.sh --answer <pred> --expected <gold>
```
Returns exit code 0 if correct, non-zero if incorrect.
### LLM Grader
Uses LLM to extract and compare answers:
- Configurable server and model
- Includes few-shot examples from sample answers
- Case-insensitive comparison
- Required for GPQA dataset
## Output
### Progress Table
```
Task ID Dataset Prompt (first 43 chars) Expected Status
aime_000_001 AIME Complete the following reactions and sel... A pending
```
### Results
```
============================================================
Results: 8/10 correct (80.0%)
============================================================
```
### JSON Output
Complete eval state saved to output file with:
- Task IDs and correctness status
- Prompts and extracted answers
- Sampling configuration
- Processing metadata

File diff suppressed because it is too large Load Diff

View File

@@ -1,36 +0,0 @@
# llama-server-simulator
Standalone Python script simulating llama-server HTTP endpoint for testing.
## Features
- HTTP Server with OpenAI-compatible `/v1/chat/completions` endpoint
- AIME Dataset Integration - Loads 90 questions from HuggingFace
- Intelligent Question Matching - Uses exact matching, LaTeX removal, and Levenshtein distance
- Configurable Success Rate - Control correct/wrong answer generation (0-1)
- Debug Logging - Troubleshoot matching issues
## Usage
```bash
python llama-server-simulator.py --success-rate 0.8
```
## Arguments
- `--success-rate`: Probability of returning correct answer (0.0-1.0, default: 0.8)
- `--port`: Server port (default: 8033)
- `--debug`: Enable debug logging (default: False)
## Testing
```bash
./test-simulator.sh
```
## Implementation Details
- Uses Levenshtein distance for partial matching (threshold: 0.3)
- Automatic caching via HuggingFace datasets library
- Wrong answers generated by incrementing expected answer
- Debug output written to stderr

View File

@@ -1,283 +0,0 @@
#!/usr/bin/env python3
import argparse
import json
import random
import re
import time
import sys
import os
from typing import Dict, List, Optional
from dataclasses import dataclass, asdict
from pathlib import Path
import datasets
from flask import Flask, request, jsonify
# Set cache directory for HuggingFace datasets
cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"
cache_dir.mkdir(parents=True, exist_ok=True)
os.environ["HF_DATASETS_CACHE"] = str(cache_dir)
def dice(s1: str, s2: str) -> float:
"""Calculate Dice coefficient between two strings based on bigram overlap."""
if not s1 and not s2:
return 1.0
def _bigrams(s: str):
return [s[i : i + 2] for i in range(len(s) - 1)]
bigrams1 = _bigrams(s1)
bigrams2 = _bigrams(s2)
if not bigrams1 and not bigrams2:
return 1.0
from collections import Counter
freq1 = Counter(bigrams1)
freq2 = Counter(bigrams2)
intersection = sum(min(freq1[bg], freq2[bg]) for bg in freq1)
dice_coeff = 2 * intersection / (len(bigrams1) + len(bigrams2))
return dice_coeff
def debug_log(message: str):
"""Log debug messages to both stdout and a file"""
print(message, file=sys.stderr)
with open("/tmp/simulator-debug.log", "a") as f:
f.write(message + "\n")
app = Flask(__name__)
@dataclass
class EvalState:
id: str
tasks: List[str]
task_states: Dict[str, Dict]
sampling_config: Dict
def normalize_number(s: str) -> Optional[int]:
match = re.match(r"\d+", s) # match digits from the start
if not match:
return None
return int(match.group(0))
class AimeDataset:
def __init__(self, split: str = "train"):
self.split = split
self.questions: List[Dict] = []
self._load_dataset()
def _load_dataset(self):
print(f"Loading AIME dataset (split: {self.split})...")
cache_path = Path.home() / ".cache" / "huggingface" / "datasets" / "AI-MO___aimo-validation-aime" / "default" / "0.0.0"
if cache_path.exists():
print(f"Using cached dataset from {cache_path}")
ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split, cache_dir=str(cache_path))
else:
ds = datasets.load_dataset("AI-MO/aimo-validation-aime", split=self.split)
self.questions = list(ds)
print(f"AIME dataset loaded: {len(self.questions)} questions")
def find_question(self, request_text: str) -> Optional[Dict]:
best_match = None
best_distance = -1
best_index = -1
for i, question in enumerate(self.questions):
question_text = question["problem"]
request_lower = request_text.lower()
question_lower = question_text.lower()
# Exact match
if question_lower == request_lower:
debug_log(f"DEBUG: Found exact match at index {i}")
return question
# Remove LaTeX formatting for more flexible matching
question_no_latex = re.sub(r'\$[^$]+\$', '', question_text)
if question_no_latex.lower() == request_lower:
debug_log(f"DEBUG: Found match (no LaTeX) at index {i}")
return question
# Calculate Levenshtein distance for partial matches
# Only consider if request is at least 50% of question length
if len(request_lower) >= len(question_lower) * 0.5:
distance = dice(question_lower, request_lower)
if distance > best_distance:
best_distance = distance
best_match = question
best_index = i
if best_match and best_distance > 0.3: # Threshold for partial match
debug_log(f"DEBUG: Found best partial match at index {best_index} with distance {best_distance:.3f}")
return best_match
debug_log(f"DEBUG: No matching question found for: {request_text[:100]}...")
return None
def get_answer(self, question: Dict) -> str:
answer = question["answer"]
if isinstance(answer, str):
normalized = normalize_number(answer)
return str(normalized) if normalized is not None else answer
return str(answer)
class Simulator:
def __init__(
self,
port: int = 8033,
host: str = "localhost",
success_rate: float = 0.8,
dataset_split: str = "train"
):
self.port = port
self.host = host
self.success_rate = success_rate
self.dataset = AimeDataset(dataset_split)
self.eval_state = EvalState(
id="aime-2025",
tasks=["aime"],
task_states={},
sampling_config={"temperature": 0, "max_tokens": 2048}
)
def _generate_response(
self,
question: Dict,
should_be_correct: bool
) -> Dict:
expected_answer = self.dataset.get_answer(question)
if should_be_correct:
response_text = expected_answer
else:
response_text = self._generate_wrong_answer(question)
return {
"id": f"chatcmpl-{int(time.time())}",
"object": "chat.completion",
"created": int(time.time()),
"model": "llama",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": response_text
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 50,
"total_tokens": 150
}
}
def _generate_wrong_answer(self, question: Dict) -> str:
expected_answer = self.dataset.get_answer(question)
if expected_answer.isdigit():
wrong_answer = str(int(expected_answer) + 1)
else:
wrong_answer = expected_answer + " (wrong)"
return wrong_answer
def _process_request(self, request_data: Dict) -> Dict:
messages = request_data.get("messages", [])
if not messages:
return {"error": "No messages in request"}
request_text = messages[0].get("content", "")
debug_log(f"DEBUG: Received request with content: {request_text[:150]}...")
question = self.dataset.find_question(request_text)
if not question:
debug_log(f"DEBUG: find_question returned None")
return {"error": "No matching question found"}
should_be_correct = random.random() < self.success_rate
response = self._generate_response(question, should_be_correct)
task_id = "aime"
self.eval_state.task_states[task_id] = {
"correct": should_be_correct,
"expected": self.dataset.get_answer(question),
"predicted": response["choices"][0]["message"]["content"]
}
return response
@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
try:
request_data = request.get_json()
if not request_data:
return jsonify({"error": "Invalid JSON"}), 400
response = simulator._process_request(request_data)
return jsonify(response)
except Exception as e:
print(f"Error processing request: {e}")
return jsonify({"error": str(e)}), 500
def main():
parser = argparse.ArgumentParser(
description="llama-server simulator for testing eval scripts"
)
parser.add_argument(
"--port",
type=int,
default=8033,
help="Server port (default: 8033)"
)
parser.add_argument(
"--host",
type=str,
default="localhost",
help="Server host (default: localhost)"
)
parser.add_argument(
"--success-rate",
type=float,
default=0.8,
help="Success rate 0-1 (default: 0.8)"
)
parser.add_argument(
"--dataset-split",
type=str,
default="train",
help="AIME dataset split to use (default: train)"
)
args = parser.parse_args()
global simulator
simulator = Simulator(
port=args.port,
host=args.host,
success_rate=args.success_rate,
dataset_split=args.dataset_split
)
print("\n=== llama-server-simulator ===")
print(f"Server running on http://{args.host}:{args.port}")
print(f"Success rate: {args.success_rate}")
print(f"AIME dataset loaded: {len(simulator.dataset.questions)} questions")
print("\nPress Ctrl+C to stop\n")
app.run(host=args.host, port=args.port, debug=False)
if __name__ == "__main__":
main()

View File

@@ -1,86 +0,0 @@
#!/bin/bash
set -e
# Get the directory where this script is located
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
echo "=== llama-server-simulator Test Script ==="
echo ""
PORT=8033
SUCCESS_RATE=0.8
TEST_PORT=8034
echo "Starting simulator on port $PORT with success rate $SUCCESS_RATE..."
source "$SCRIPT_DIR/venv/bin/activate"
python3 "$SCRIPT_DIR/llama-server-simulator.py" --port $PORT --success-rate $SUCCESS_RATE > /tmp/simulator-test.log 2>&1 &
SIMULATOR_PID=$!
echo "Waiting for simulator to start..."
sleep 5
# Helper function to make a request and extract the answer
make_request() {
local question="$1"
curl -s -X POST http://localhost:$PORT/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"llama\",
\"messages\": [
{\"role\": \"user\", \"content\": \"$question\"}
],
\"temperature\": 0,
\"max_tokens\": 2048
}" | python3 -c "import sys, json; data = json.load(sys.stdin); print(data.get('choices', [{}])[0].get('message', {}).get('content', data.get('error', 'No response')))"
}
# Test question (repeated in multiple tests)
TEST_QUESTION="Quadratic polynomials P(x) and Q(x) have leading coefficients 2 and -2, respectively. The graphs of both polynomials pass through the two points (16,54) and (20,53). Find P(0) + Q(0)."
echo ""
echo "=== Test 1: Correct Answer ==="
echo "Sending request with known question..."
answer=$(make_request "$TEST_QUESTION")
echo "Answer: $answer"
echo "Expected: 116"
echo "Correct: $([ "$answer" == "116" ] && echo "Yes" || echo "No")"
echo ""
echo "=== Test 2: Wrong Answer ==="
echo "Sending request with known question (success rate 0.0)..."
answer=$(make_request "$TEST_QUESTION")
echo "Answer: $answer"
echo "Expected: 116"
echo "Correct: $([ "$answer" == "116" ] && echo "Yes" || echo "No")"
echo ""
echo "=== Test 3: No Matching Question ==="
echo "Sending request with non-matching text..."
response=$(make_request "What is the capital of France?")
echo "Response: $response"
echo "Expected: No matching question found"
echo "Correct: $([ "$response" == "No matching question found" ] && echo "Yes" || echo "No")"
echo ""
echo "=== Test 4: Success Rate Verification ==="
echo "Sending 10 requests to test success rate..."
correct_count=0
for i in {1..10}; do
answer=$(make_request "$TEST_QUESTION")
if [ "$answer" == "116" ]; then
correct_count=$((correct_count + 1))
fi
echo " Request $i: Answer = $answer"
done
echo "Correct answers: $correct_count/10"
echo "Expected: ~8/10 (80% success rate)"
echo "Success rate: $(echo "scale=1; $correct_count * 10" | bc)%"
echo ""
echo "=== Test Complete ==="
echo "Stopping simulator..."
kill $SIMULATOR_PID 2>/dev/null
wait $SIMULATOR_PID 2>/dev/null || true
echo "Simulator stopped."

View File

@@ -4,7 +4,7 @@ project("ggml" C CXX ASM)
### GGML Version
set(GGML_VERSION_MAJOR 0)
set(GGML_VERSION_MINOR 9)
set(GGML_VERSION_PATCH 5)
set(GGML_VERSION_PATCH 7)
set(GGML_VERSION_BASE "${GGML_VERSION_MAJOR}.${GGML_VERSION_MINOR}.${GGML_VERSION_PATCH}")
find_program(GIT_EXE NAMES git git.exe NO_CMAKE_FIND_ROOT_PATH)

View File

@@ -1 +1 @@
a8db410a252c8c8f2d120c6f2e7133ebe032f35d
d6754f3d0e6d0acd21c12442353c9fd2f94188e7