No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages

Source · arxiv.org/watch?v=2606.16827 ↗

§02

Snippets

№01

LLMs struggle with code generation in no-resource languages that have virtually no representation in training data, a problem common in proprietary industry systems.

Most code generation research ignores domain-specific and proprietary languages, leaving companies without practical solutions for in-house code recommenders.
№02

Further pre-training on target languages yields the largest performance gains, but directly applying it to instruction-tuned models degrades instruction-following ability.

This reveals a critical trade-off: improving domain knowledge via pre-training conflicts with maintaining the reasoning and instruction-following skills of tuned models.
№03

Weight diff transfer from instruction models enables efficient instruction-following injection after target language pre-training, avoiding expensive instruction fine-tuning.

This technique lets companies deploy specialized models for proprietary languages at low cost without repeating full instruction-tuning pipelines.
№04

Three new code generation benchmarks are introduced for no-resource languages based on recently proposed programming languages with minimal existing training data.

These benchmarks enable future research on genuinely underexplored code generation scenarios, addressing a gap between academic focus and industrial needs.

§03

Synthesis

## The Problem: LLMs Can't Code in Languages They've Never Seen

Most code-generation research focuses on popular languages like Python and Java, which have abundant training data. Proprietary and domain-specific languages that companies build in-house are a different story: LLMs have had virtually no exposure to these "no-resource languages," so they perform poorly on code generation tasks in them. This paper investigates whether and how that gap can be closed, a practical problem for organizations that can't rely on tools like GitHub Copilot.

## The Approach: Three Benchmarks and a Recipe

The authors created three new benchmarks for code generation in no-resource languages, using two real programming languages with minimal training data available. They then tested several strategies to teach LLMs these languages:

- **Prompt-based techniques**: Clever input formatting to nudge the model toward correct behavior.
- **Further pre-training**: Exposing the base model to whatever little data exists for the target language.
- **Fine-tuning**: Standard supervised training on task examples.
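As a rough illustration of the prompt-based strategy listed above, the sketch below assembles a few-shot prompt from solved examples in the target language. The summary does not say which prompting formats the authors tested, so the function name `build_few_shot_prompt`, the section markers, and the language name `ExampleLang` are all hypothetical.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    language: str,
    examples: List[Tuple[str, str]],
    task: str,
) -> str:
    """Assemble a few-shot prompt: solved examples first, then the open task.

    This is an assumed prompt layout, not the format used in the paper.
    """
    parts = [f"You write code in {language}."]
    for description, code in examples:
        parts.append(f"### Task\n{description}\n### Solution\n{code}")
    # The final block is left open so the model completes the solution.
    parts.append(f"### Task\n{task}\n### Solution\n")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    "ExampleLang",  # hypothetical no-resource language
    [("Add two numbers.", "fn add(a, b) = a + b")],  # made-up syntax
    "Multiply two numbers.",
)
print(prompt)
```

The prompt only shows the model what the language looks like at inference time; it does not change the model's weights, which is the difference from the two training-based strategies.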

The key finding: further pre-training yields the largest performance gains. But applying it directly to instruction-tuned models (models already trained to follow natural language instructions) backfires: their instruction-following ability degrades after they absorb the new language.

## The Solution: A Three-Stage Pipeline

To sidestep this trade-off, the authors introduced a clever workaround:

1. Start with a base model (not instruction-tuned).
2. Further pre-train it on the no-resource language data.
3. Transfer instruction-following ability from an instruction-tuned model using **weight diff transfer**, a technique that extracts the "instruction-following" modifications from one model and applies them to another.

This approach preserves both capabilities: the model learns the target language *and* retains the ability to follow instructions, which is essential for practical code generation tasks. Critically, this avoids the computational cost of instruction fine-tuning from scratch, making it feasible for companies with limited resources.
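The weight diff step can be sketched in a few lines. This is a minimal illustration, assuming the diff is a plain per-parameter subtraction (instruction-tuned weights minus base weights) that is then added to the further pre-trained weights; the summary does not give the exact formulation. The function names and the toy three-value "models" are hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def weight_diff(instruct: Weights, base: Weights) -> Weights:
    """What instruction tuning changed: instruct minus base, per parameter."""
    if instruct.keys() != base.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [i - b for i, b in zip(instruct[name], base[name])]
        for name in base
    }


def apply_diff(target: Weights, diff: Weights) -> Weights:
    """Add the instruction diff onto the further pre-trained weights."""
    if target.keys() != diff.keys():
        raise ValueError("target and diff must share the same parameter names")
    return {
        name: [t + d for t, d in zip(target[name], diff[name])]
        for name in target
    }


# Made-up values for illustration only.
base = {"layer.weight": [0.10, -0.20, 0.30]}        # original base model
instruct = {"layer.weight": [0.15, -0.25, 0.30]}    # its instruction-tuned variant
pretrained = {"layer.weight": [0.40, -0.10, 0.20]}  # base after further pre-training

diff = weight_diff(instruct, base)
merged = apply_diff(pretrained, diff)
print(merged)
```

The arithmetic needs no gradient steps, which is why the transfer is cheap compared with running instruction fine-tuning again. It does assume that the base, instruction-tuned, and further pre-trained models share the same architecture and parameter names.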

## Why It Matters

Code generation is increasingly valuable for developer productivity, but its benefits have been locked behind languages with abundant training data. This work opens a path for organizations using proprietary or emerging languages to deploy their own capable code recommenders without either (a) building massive training datasets or (b) spending enormous compute budgets on full instruction fine-tuning.

The benchmarks themselves are also a contribution: standardized evaluation tools for no-resource languages that future work can build on. For practitioners, the weight diff transfer trick offers a practical shortcut: rather than repeating instruction tuning, they can combine capabilities from existing models.

The underlying insight is simple but powerful: you don't need to choose between learning a new language and following instructions. With the right training recipe, you can have both cheaply.
