How submissions are scored and how the final ranking is determined for the Semantic Alignment Evaluation for Autoformalization (FormalRx) track.
Each test sample gives an informal mathematical statement and a candidate Lean 4 formalization. For every sample the submission produces a four-part diagnostic output: a binary alignment verdict, an error category from the SCI taxonomy, the erroneous code segment (localization), and a corrected formal statement.
The test set 🤗 FormalRx-Test, 7,030 informal and Lean 4 statement pairs annotated under the SCI Error Taxonomy, a hierarchical set of 28 categories covering Semantic, Constraint, and Implementation errors. Of the 7,030 samples, 1,791 (25.5%) are aligned and 5,239 (74.5%) are misaligned across all 28 categories.
Each sub-task is scored on the samples where it applies. For an aligned formalization the three diagnosis fields must be left as N/A, and any filled value is penalized.
| Sub-task | Scored on | Metric |
|---|---|---|
| Verdict | All samples | Macro F1 |
| Categorization | All (N/A on aligned) | Macro F1 (28 classes) |
| Localization | All (N/A on aligned) | Accuracy |
| Correction | All (N/A on aligned) | Accuracy |
The headline metric that determines the ranking is the per-sample joint accuracy. A sample scores 1 only when all of its applicable sub-tasks are answered correctly, and 0 otherwise. The overall score is the fraction of samples that score 1.
The four per-sub-task scores in the table above are reported alongside the overall score as diagnostics, but only the overall joint accuracy is used for ranking.
Verdict. Binary classification with misaligned as the positive class. Macro F1 is computed over both classes.
Categorization. A 28-class classification over the SCI taxonomy. A misaligned sample must be assigned the exact ground-truth error type. An aligned sample must be predicted as N/A, and assigning any category to it counts as a false positive. The counts feeding the macro F1 are assigned as follows.
| Prediction vs ground truth | GT = Misaligned | GT = Aligned |
|---|---|---|
| Pred = Misaligned, correct type | TP | FP |
| Pred = Misaligned, wrong type | FN | FP |
| Pred = Aligned | FN | TN |
Localization and Correction are each scored by a two-stage protocol.
deepseek-v4-flash) that decides semantic equivalence. For localization, it checks whether the predicted and reference segments point to the same error location, allowing different wording or granularity. For correction, it checks whether the two Lean 4 statements express the same mathematical meaning, allowing syntactic variation and logically equivalent reformulations."N/A" and JSON null are treated as equivalent in all comparisons.
The point of FormalRx is actionable diagnosis, not a single yes or no. A system that flags a formalization as wrong but mislabels the error, points at the wrong line, or proposes a broken fix has not actually helped. Joint accuracy rewards only the cases where the whole diagnosis holds together, which is the behavior the track is trying to measure. The per-sub-task F1 and accuracy numbers stay on the leaderboard so teams can see where they lose points.
Predictions are submitted as a .zip containing a single .jsonl file, one JSON object per test sample, with these fields.
| Field | Type | Description |
|---|---|---|
idx | string | Sample ID from the test set, e.g. formalrx_0001 |
verdict | string | "aligned" or "misaligned" |
error_category | string or null | One of the 28 SCI categories, or null if aligned |
error_segment | string or null | The erroneous code fragment, or null if aligned |
corrected_statement | string or null | The corrected Lean 4 statement, or null if aligned |
When the verdict is "aligned", the other three fields must be JSON null (the string "N/A" is treated as equivalent). All 7,030 test sample IDs must appear in every submission. The limit is 5 submissions per day and 100 in total.
These rules are checked when we review source code and technical reports. A violation leads to disqualification regardless of leaderboard position.
If a rule is unclear for your setup, ask on the forum before the deadline rather than after.
Post on the Codabench forum for the track, or email ai4mathicml@gmail.com.