Track 1 Scoring Rules

How submissions are scored and how the final ranking is determined for the Semantic Alignment Evaluation for Autoformalization (FormalRx) track.

Overview

Each test sample gives an informal mathematical statement and a candidate Lean 4 formalization. For every sample the submission produces a four-part diagnostic output: a binary alignment verdict, an error category from the SCI taxonomy, the erroneous code segment (localization), and a corrected formal statement.

The test set 🤗 FormalRx-Test, 7,030 informal and Lean 4 statement pairs annotated under the SCI Error Taxonomy, a hierarchical set of 28 categories covering Semantic, Constraint, and Implementation errors. Of the 7,030 samples, 1,791 (25.5%) are aligned and 5,239 (74.5%) are misaligned across all 28 categories.

The four sub-tasks

Each sub-task is scored on the samples where it applies. For an aligned formalization the three diagnosis fields must be left as N/A, and any filled value is penalized.

Sub-task	Scored on	Metric
Verdict	All samples	Macro F1
Categorization	All (N/A on aligned)	Macro F1 (28 classes)
Localization	All (N/A on aligned)	Accuracy
Correction	All (N/A on aligned)	Accuracy

Overall score: per-sample joint accuracy

The headline metric that determines the ranking is the per-sample joint accuracy. A sample scores 1 only when all of its applicable sub-tasks are answered correctly, and 0 otherwise. The overall score is the fraction of samples that score 1.

For an aligned sample, the verdict must be aligned and the categorization, localization, and correction fields must all be null. Filling any of them is wrong.
For a misaligned sample, the verdict must be misaligned and the category, localization, and correction must all be correct.

The four per-sub-task scores in the table above are reported alongside the overall score as diagnostics, but only the overall joint accuracy is used for ranking.

How each sub-task is judged

Verdict. Binary classification with misaligned as the positive class. Macro F1 is computed over both classes.

Categorization. A 28-class classification over the SCI taxonomy. A misaligned sample must be assigned the exact ground-truth error type. An aligned sample must be predicted as N/A, and assigning any category to it counts as a false positive. The counts feeding the macro F1 are assigned as follows.

Prediction vs ground truth	GT = Misaligned	GT = Aligned
Pred = Misaligned, correct type	TP	FP
Pred = Misaligned, wrong type	FN	FP
Pred = Aligned	FN	TN

Localization and Correction are each scored by a two-stage protocol.

Step 1 · Exact match.The prediction is compared to the reference by exact string match. If they match, the sample is counted correct and stops here.
Step 2 · LLM judge.Non-matching pairs go to a DeepSeek LLM judge (deepseek-v4-flash) that decides semantic equivalence. For localization, it checks whether the predicted and reference segments point to the same error location, allowing different wording or granularity. For correction, it checks whether the two Lean 4 statements express the same mathematical meaning, allowing syntactic variation and logically equivalent reformulations.

"N/A" and JSON null are treated as equivalent in all comparisons.

Final ranking

Overall joint accuracy (descending).The submission that gets all four sub-tasks right on the largest fraction of samples wins.
Submission time (ascending).On the rare exact tie, the earlier submission ranks first.

Why joint accuracy

The point of FormalRx is actionable diagnosis, not a single yes or no. A system that flags a formalization as wrong but mislabels the error, points at the wrong line, or proposes a broken fix has not actually helped. Joint accuracy rewards only the cases where the whole diagnosis holds together, which is the behavior the track is trying to measure. The per-sub-task F1 and accuracy numbers stay on the leaderboard so teams can see where they lose points.

Submission format and limits

Predictions are submitted as a .zip containing a single .jsonl file, one JSON object per test sample, with these fields.

Field	Type	Description
`idx`	string	Sample ID from the test set, e.g. `formalrx_0001`
`verdict`	string	`"aligned"` or `"misaligned"`
`error_category`	string or null	One of the 28 SCI categories, or `null` if aligned
`error_segment`	string or null	The erroneous code fragment, or `null` if aligned
`corrected_statement`	string or null	The corrected Lean 4 statement, or `null` if aligned

When the verdict is "aligned", the other three fields must be JSON null (the string "N/A" is treated as equivalent). All 7,030 test sample IDs must appear in every submission. The limit is 5 submissions per day and 100 in total.

Eligibility and integrity rules

These rules are checked when we review source code and technical reports. A violation leads to disqualification regardless of leaderboard position.

Reproducibility is required. Every submission must come with code or a clear description of the method. For the top 3 finalists, reproducibility is a condition for receiving the ranking and the prize. After the leaderboard closes we will re-run shortlisted submissions on a separate validation set to confirm that your reported results reproduce.
No manual annotation of test samples. Hand labeling test rows is not a valid approach and leads to disqualification.
No test input lookup at inference. Your method must not search for or retrieve the informal or formal statement of any test row, or pull from any source containing a copy of FormalRx or its derivatives. General lookups of Mathlib docs, Lean syntax, theorem references, and general mathematical knowledge are allowed.
Each test row is predicted independently. A row's prediction must be computable from that row's input alone. No placing multiple test rows in one context, using one row to disambiguate another, or running cross row similarity or deduplication on the test set. Standard batched inference with an independent forward pass per row is fine.
No leaderboard overfitting. The per sample results file shows which rows you got right or wrong. Iterating on it is allowed, but technical reports must be transparent about it, and gains explained only by repeated per sample probing rather than genuine method improvement lead to disqualification.

If a rule is unclear for your setup, ask on the forum before the deadline rather than after.

Questions

Post on the Codabench forum for the track, or email ai4mathicml@gmail.com.

Open Track 1 challenge page 🤗 FormalRx-Test dataset