arxiv:2512.17385

UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models

Published on Dec 19 · Submitted by Yang Jian on Dec 23
Abstract

IPC is an unsupervised framework that uses internal probing of large language models to generate code without labeled datasets, achieving competitive performance with reduced resource dependency.

AI-generated summary

Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness relies heavily on supervised training with extensive labeled datasets (e.g., question-answer pairs) or unlabeled datasets (e.g., code snippets), which are expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, not even unlabeled code snippets. IPC combines four components, problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement, to surface the internal knowledge and confidence patterns already present in LLMs. IPC then identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation, and uses them to train UCoder (a coder trained with unsupervised learning). We validate the approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve performance competitive with supervised approaches while significantly reducing dependence on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation, opening new directions for training code LLMs in resource-constrained scenarios.
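The abstract does not spell out the representation-based quality estimation. As a purely illustrative sketch (all function names and the toy data below are hypothetical, not the paper's implementation), one simple internal-confidence signal is the mean per-token log-probability of each generated candidate:

```python
def sequence_confidence(token_logprobs):
    """Mean per-token log-probability as a crude confidence score.

    Hypothetical stand-in for representation-based quality estimation:
    a higher mean log-prob suggests the model was more 'certain'
    while emitting the candidate.
    """
    if not token_logprobs:
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)


def rank_candidates(candidates):
    """Rank (code, token_logprobs) pairs by confidence, best first."""
    return sorted(candidates, key=lambda c: sequence_confidence(c[1]),
                  reverse=True)


# Toy example: three candidate snippets with made-up per-token log-probs.
cands = [
    ("def add(a, b): return a + b", [-0.1, -0.2, -0.1]),
    ("def add(a, b): return a - b", [-1.5, -2.0, -1.8]),
    ("def add(a, b): return b + a", [-0.3, -0.4, -0.2]),
]
best_code, _ = rank_candidates(cands)[0]
```

In practice such a score would be combined with the self-consistency filtering the abstract mentions; on its own, sequence likelihood is only a weak proxy for correctness.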

Community

Paper submitter

This paper introduces UCoder, an unsupervised framework for training code-generating large language models without requiring any external datasets, including unlabeled code snippets. The approach, called IPC (Internal Probing of LLMs for Code generation), leverages latent programming knowledge already present in pre-trained models through a six-stage self-bootstrapping process:

  • (stages 1-3) problem space probing, which generates diverse algorithmic problems with specifications
  • (stage 4) test understanding probing, which creates comprehensive test suites
  • (stage 5) solution space probing via dense sampling (128 candidates per problem)
  • (stage 6) knowledge consolidation through supervised fine-tuning on high-quality solutions

The key innovation is execution-driven consensus clustering, which identifies correct implementations by finding clusters of behaviorally identical solutions: correct code naturally clusters together, while incorrect solutions fail heterogeneously. Experiments on UCoder models (7B, 14B, and 32B parameters) demonstrate performance competitive with supervised baselines across multiple benchmarks (HumanEval, MBPP, BigCodeBench, LiveCodeBench, FullStackBench), with smaller models showing larger gains (inverse scaling). The work shows that self-generated data retains enough lexical, semantic, and structural diversity for effective learning, opening possibilities for resource-efficient LLM training without human annotation.
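The execution-driven consensus clustering described above can be sketched roughly as follows; the helper names and the toy candidate pool are hypothetical illustrations, not the paper's actual implementation:

```python
from collections import defaultdict


def behavior_signature(fn, test_inputs):
    """Run a candidate on shared inputs; its outputs define its 'behavior'."""
    outs = []
    for args in test_inputs:
        try:
            outs.append(repr(fn(*args)))
        except Exception as e:  # a crash is itself a (distinct) behavior
            outs.append(f"ERR:{type(e).__name__}")
    return tuple(outs)


def consensus_select(candidates, test_inputs):
    """Group candidates by identical behavior; return the largest cluster.

    Sketch of execution-driven consensus clustering: correct solutions
    tend to agree on every input, while buggy ones fail in different
    ways, so the biggest behavioral cluster is the best bet.
    """
    clusters = defaultdict(list)
    for fn in candidates:
        clusters[behavior_signature(fn, test_inputs)].append(fn)
    return max(clusters.values(), key=len)


# Toy pool: two correct variants of max, plus two distinct bugs.
pool = [
    lambda a, b: a if a > b else b,
    lambda a, b: max(a, b),
    lambda a, b: a,         # bug: ignores b
    lambda a, b: a + b,     # bug: wrong operation
]
winners = consensus_select(pool, [(1, 2), (5, 3), (0, 0)])
```

Here the two correct variants produce identical outputs on every test input and form the largest cluster, while each bug lands in its own singleton cluster. A real pipeline would sandbox execution and sample far more candidates (the paper uses 128 per problem).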

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/ucoder-unsupervised-code-generation-by-internal-probing-of-large-language-models-4627-eb4abc20

  • Key Findings
  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

