ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

## The Core Finding

Large language models are often credited with logical reasoning, but that ability degrades when the same logical problem is expressed in Chinese. The authors built ChLogic, a benchmark pairing English and Chinese versions of identical logical structures, and found that leading models such as Qwen3, Ministral, and GLM score measurably lower on the Chinese versions than on the English ones, even though the underlying logic is identical. Translating the Chinese back to English sometimes helps and sometimes hurts, which suggests the problem is not language alone but also how models handle translation artifacts and linguistic variation.

## What ChLogic Tests

The benchmark contains three datasets totaling roughly 500 test items. The **General aligned set** takes 60 logical propositions across nine template families and expresses each in English plus five different Chinese realizations, capturing how the same logical idea can be phrased in varied ways. The **Difficult aligned set** applies the same treatment to 40 harder problems. A third **Chinese-only set** probes 15 language-specific phenomena that don't map neatly to English.

The key design choice is to start from formal logical templates, which ensures the latent logical structure is identical across languages. Poor Chinese performance therefore cannot be attributed to the problems themselves being different; the design isolates the effect of surface realization.
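The aligned design can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the benchmark's actual schema or scoring code: the `AlignedItem` class, the template names, the true/false validity label, the example sentences, and the `toy_predict` stand-in for a model call are all hypothetical. The point it shows is that the gold label belongs to the template, so any accuracy difference between languages comes from wording alone.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class AlignedItem:
    """One logical problem with a shared gold label and several surface forms."""
    template: str              # hypothetical template-family name
    label: bool                # gold answer, fixed by the template, not the wording
    english: str               # the single English realization
    chinese: Tuple[str, ...]   # several Chinese realizations of the same logic


def accuracy_by_language(
    items: Sequence[AlignedItem],
    predict: Callable[[str], bool],
) -> Tuple[float, float]:
    """Return (English accuracy, Chinese accuracy) for one predictor."""
    en_hits = sum(predict(item.english) == item.label for item in items)
    zh_outcomes = [
        predict(text) == item.label
        for item in items
        for text in item.chinese
    ]
    return en_hits / len(items), sum(zh_outcomes) / len(zh_outcomes)


# Toy data: two items with two Chinese variants each (illustrative only).
ITEMS = [
    AlignedItem(
        template="modus_tollens",
        label=True,
        english="If it rains, the ground gets wet. The ground is not wet. So it did not rain.",
        chinese=(
            "如果下雨，地面就会湿。地面没有湿。所以没有下雨。",
            "只要下雨，地面就湿；地面没湿，因此没下雨。",
        ),
    ),
    AlignedItem(
        template="affirming_the_consequent",
        label=False,
        english="If it rains, the ground gets wet. The ground is wet. So it rained.",
        chinese=(
            "如果下雨，地面就会湿。地面湿了。所以下过雨。",
            "只要下雨，地面就湿；地面湿了，因此下雨了。",
        ),
    ),
]


def toy_predict(text: str) -> bool:
    """Stand-in for a model call: keyword-matches English, else answers False."""
    if text.isascii():
        return "not" in text
    return False


en_acc, zh_acc = accuracy_by_language(ITEMS, toy_predict)
print(f"English {en_acc:.2f}, Chinese {zh_acc:.2f}, gap {en_acc - zh_acc:.2f}")
```

Because every realization of an item shares one label, the gap printed at the end can only come from how the predictor handles the surface form, which is the effect the benchmark is designed to isolate.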

## Why This Matters

Logical reasoning is foundational: banks, law firms, and AI systems rely on it. If a model reasons correctly in English but fails on the same logical structure in Chinese, its success likely reflects pattern-matching to English-heavy training data rather than a language-independent grasp of logic. The finding exposes a blind spot in how LLMs are evaluated: most benchmarks test a single language, so the gap goes unnoticed until a Chinese-language user runs into it.

The back-translation results are especially telling. When Chinese text is machine-translated back to English before inference, performance on the General set often improves. On the Difficult set, however, the same trick sometimes makes models perform *worse*, including regressions for Qwen3-32B and GLM-5.1. This suggests that language choice, problem difficulty, and model behavior interact in tangled ways: back-translation may introduce artifacts that confuse models on harder reasoning tasks, or the original Chinese phrasing may carry subtle cues that aid reasoning and that machine translation destroys.
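The back-translation comparison can be sketched as a small Python function that scores the same Chinese items twice: once directly, and once after translating them to English. This is a sketch under stated assumptions, not the authors' pipeline: `backtranslation_effect`, the `FAKE_MT` lookup table standing in for a machine-translation system, and the `toy_predict` stand-in for a model call are all hypothetical.

```python
from typing import Callable, Dict, Sequence


def accuracy(
    texts: Sequence[str],
    labels: Sequence[bool],
    predict: Callable[[str], bool],
) -> float:
    """Fraction of texts for which the predictor matches the gold label."""
    hits = sum(predict(text) == label for text, label in zip(texts, labels))
    return hits / len(labels)


def backtranslation_effect(
    zh_texts: Sequence[str],
    labels: Sequence[bool],
    predict: Callable[[str], bool],
    translate_zh_to_en: Callable[[str], str],
) -> Dict[str, float]:
    """Compare accuracy on Chinese input with accuracy after translation."""
    direct = accuracy(zh_texts, labels, predict)
    translated = [translate_zh_to_en(text) for text in zh_texts]
    back = accuracy(translated, labels, predict)
    # A positive delta means back-translation helped; a negative one, hurt.
    return {"direct": direct, "back_translated": back, "delta": back - direct}


# Illustrative data and stubs (hypothetical, not taken from the benchmark).
ZH_TEXTS = [
    "如果下雨，地面就会湿。地面没有湿。所以没有下雨。",
    "如果下雨，地面就会湿。地面湿了。所以下过雨。",
]
LABELS = [True, False]
FAKE_MT = {
    ZH_TEXTS[0]: "If it rains, the ground gets wet. The ground is not wet. So it did not rain.",
    ZH_TEXTS[1]: "If it rains, the ground gets wet. The ground is wet. So it rained.",
}


def toy_predict(text: str) -> bool:
    """Stand-in for a model call: keyword-matches English, else answers False."""
    if text.isascii():
        return "not" in text
    return False


result = backtranslation_effect(ZH_TEXTS, LABELS, toy_predict, FAKE_MT.__getitem__)
print(result)
```

Swapping in a translator that drops a negation would show the other direction: the translated text no longer carries the original logic, so any benefit from moving to English can shrink or reverse.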

## The Stress Test

ChLogic is deliberately challenging. It is built on formal logic rather than natural examples, so models cannot rely on common-sense shortcuts. By requiring models to show that their logical reasoning holds across languages, it reveals where multilingual robustness actually breaks down.

The takeaway: state-of-the-art performance on English benchmarks does not guarantee robust logical reasoning. Claims of genuine reasoning capability need to be backed by testing across languages and linguistic variations. For practitioners deploying LLMs in multilingual settings, especially in logic-heavy domains, this benchmark provides a practical way to measure, and ideally improve, that robustness.
