Open-World Self-Evolution

OpenSkill: Open-World Self-Evolution for LLM Agents

An agent that builds both its skills and its own verification signals from scratch — using only a task prompt and open-world resources, with no target-task supervision.

¹Lehigh University  ·  ²University of Illinois Chicago  ·  ³University of British Columbia  ·  ⁴Vector Institute  ·  ⁵Salesforce AI Research  ·  ⁶Massachusetts General Hospital & Harvard Medical School
* Equal contribution    Corresponding author
Abstract

Can an LLM agent self-evolve in the open world?

Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop — curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of these, offering only a task prompt. We study open-world self-evolution, where an agent must build both its skills and its own verification signals from scratch, using open-world resources but no target-task supervision. We propose OpenSkill, a framework that bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint. Its skills transfer across models without model-specific adaptation, and its self-built verifier aligns with ground-truth outcomes despite never accessing them.

Scalable

Skills are sourced from the open world, not bounded by a human's or model's prior knowledge.

Grounded

Knowledge and verification anchors come from real documentation, repositories, and the web.

Supervision-free

No gold answers, rewards, or verifier outputs during learning — a leakage barrier keeps them out.

The Idea

A new paradigm for self-evolving skills

Unlike human-curated, LLM-generated, or supervised self-evolution, OpenSkill acquires skills from the open world and verifies them with self-built virtual tasks — making it simultaneously scalable, grounded, and supervision-free.

Figure: Four paradigms for self-evolving agent skills (Human-Curated, LLM-Generated, Supervised Self-Evolution, and Ours: Open-World). Prior paradigms each miss at least one property; OpenSkill (right) is the only one that is scalable, grounded, and supervision-free at once.

Method

How OpenSkill works

Given only a task prompt, a base model, tool access, and open-world resources, OpenSkill bootstraps a learning loop from scratch in three stages.

STAGE 01

Open-world knowledge acquisition

Retrieves task-relevant knowledge and independent verification anchors from docs, repos, papers, and the web — then drafts a structured skill plan.

STAGE 02

Leakage-free skill evolution

Drafts skills and refines them in a sandbox against self-built virtual tests grounded in the anchors, fixing bugs and knowledge gaps over up to three rounds.

STAGE 03

Zero-shot target evaluation

Deploys the frozen skill to the target agent. Ground-truth tests are unlocked only here, at final evaluation — never during construction.

Figure: Overview of the OpenSkill framework. A base agent acquires open-world knowledge to build a skill plan, then iteratively generates, executes, and refines the skill in a sandbox using a virtual-task verifier and diagnostic retriever. A leakage barrier keeps target supervision out of skill construction, unlocking it only for final evaluation.

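The three stages can be read as a small control loop. The sketch below is only an illustration of that loop under stated assumptions, not the authors' implementation: the names `Knowledge`, `Report`, `evolve_skill`, and the four callables (`acquire`, `draft`, `verify`, `refine`) are hypothetical, and the toy functions stand in for real retrieval, generation, and sandbox execution.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Knowledge:
    notes: List[str]    # task-relevant knowledge (docs, repos, web)
    anchors: List[str]  # independent verification anchors


@dataclass
class Report:
    passed: bool
    failures: List[str] = field(default_factory=list)


def evolve_skill(
    task_prompt: str,
    acquire: Callable[[str], Knowledge],
    draft: Callable[[str, Knowledge], str],
    verify: Callable[[str, List[str]], Report],
    refine: Callable[[str, Report, Knowledge], str],
    max_rounds: int = 3,  # the page states "up to three rounds"
) -> str:
    """Stages 1 and 2: build a skill without any target-task supervision."""
    knowledge = acquire(task_prompt)       # Stage 1: open-world acquisition
    skill = draft(task_prompt, knowledge)  # Stage 2: first draft
    for _ in range(max_rounds):
        # Virtual tests are grounded in the anchors, never in target answers.
        report = verify(skill, knowledge.anchors)
        if report.passed:
            break
        skill = refine(skill, report, knowledge)
    # The returned skill is frozen; ground-truth tests are only used
    # afterwards, at Stage 3 evaluation (the leakage barrier).
    return skill


# Toy stand-ins (assumptions for illustration only).
def toy_acquire(prompt: str) -> Knowledge:
    return Knowledge(notes=["use ISO dates"], anchors=["ISO"])


def toy_draft(prompt: str, k: Knowledge) -> str:
    return "format dates"


def toy_verify(skill: str, anchors: List[str]) -> Report:
    missing = [a for a in anchors if a not in skill]
    return Report(passed=not missing, failures=missing)


def toy_refine(skill: str, report: Report, k: Knowledge) -> str:
    return skill + " as " + ", ".join(report.failures)


frozen = evolve_skill("Normalize dates", toy_acquire, toy_draft, toy_verify, toy_refine)
print(frozen)  # format dates as ISO
```

Passing the stages in as callables keeps the loop itself free of any reference to ground-truth tests, which is one simple way to make the leakage barrier visible in code.
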
Results

Best automated pass rate on every setting

On SkillsBench (11 domains), OpenSkill beats the strongest closed-world baseline by 8.9 points on Opus 4.6 and 8.8 points on GPT 5.2, and lands within 1–3 points of the human upper bound (0.9 and 2.7 points, respectively), while honoring the no-supervision constraint.

43.6% overall pass rate on Opus 4.6 (+8.9 over best baseline)
42.1% overall pass rate on GPT 5.2 (+8.8 over best baseline)
88.9% of ground-truth test intents covered by the self-built verifier
8 / 11 domains best or tied-best on Opus 4.6

SkillsBench — average pass rate (%) by domain

Best automated method per row in bold, second best underlined. The OpenSkill column is shaded; Human is a reference upper bound, excluded from ranking.

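Opus 4.6 (Claude Code)

| Domain | No Skill | Self-Gen | CoT | Skill-Creator | AutoSkill | Memento | OpenSkill | Human |
|---|---|---|---|---|---|---|---|---|
| Software | 32.6 | 37.9 | 34.9 | 51.3 | 36.0 | 34.4 | 59.9 | 38.8 |
| Office | 17.0 | 16.7 | 17.1 | 21.4 | 25.7 | 31.4 | 50.0 | 50.0 |
| Science | 25.6 | 31.3 | 30.0 | 36.2 | 33.3 | 35.0 | 35.0 | 46.7 |
| Media | 36.1 | 27.9 | 20.4 | 38.5 | 23.6 | 21.8 | 39.6 | 36.4 |
| Cybersecurity | 17.8 | 18.8 | 20.4 | 24.6 | 16.6 | 28.8 | 44.1 | 55.0 |
| Finance | 17.5 | 16.7 | 20.0 | 27.5 | 25.0 | 25.0 | 25.0 | 30.0 |
| Robotics | 27.6 | 13.3 | 16.0 | 36.0 | 4.0 | 32.0 | 36.0 | 36.0 |
| Energy | 41.2 | 11.1 | 40.0 | 60.0 | 33.3 | 60.0 | 60.0 | 66.7 |
| Manufacturing | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 46.7 |
| Health | 24.8 | 19.8 | 19.2 | 31.2 | 14.5 | 25.0 | 69.6 | 80.0 |
| Math | 43.2 | 30.0 | 30.0 | 50.0 | 0.0 | 30.0 | 50.0 | 50.0 |
| Overall | 25.5 | 23.9 | 23.9 | 34.7 | 24.7 | 30.1 | 43.6 | 44.5 |
| Δ vs. No Skill | | −1.6 | −1.6 | +9.2 | −0.8 | +4.6 | +18.1 | +19.0 |

GPT 5.2 (Codex)

| Domain | No Skill | Self-Gen | CoT | Skill-Creator | AutoSkill | Memento | OpenSkill | Human |
|---|---|---|---|---|---|---|---|---|
| Software | 33.2 | 48.4 | 47.2 | 44.4 | 16.7 | 19.5 | 49.1 | 42.5 |
| Office | 32.9 | 31.0 | 26.2 | 26.2 | 9.4 | 14.3 | 44.3 | 48.6 |
| Science | 30.4 | 30.3 | 29.8 | 21.9 | 5.5 | 13.8 | 48.6 | 48.3 |
| Media | 31.3 | 31.0 | 31.8 | 30.9 | 15.2 | 18.2 | 30.4 | 58.2 |
| Cybersecurity | 25.0 | 20.8 | 34.7 | 36.8 | 4.1 | 12.5 | 52.5 | 42.5 |
| Finance | 0.0 | 29.2 | 25.0 | 20.8 | 8.4 | 12.5 | 25.0 | 27.5 |
| Robotics | 16.0 | 26.7 | 40.0 | 20.0 | 13.4 | 26.6 | 40.0 | 40.0 |
| Energy | 0.0 | 33.3 | 55.6 | 22.2 | 11.0 | 22.3 | 80.0 | 53.3 |
| Manufacturing | 0.0 | 11.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Math | 30.0 | 33.3 | 33.3 | 50.0 | 33.5 | 0.0 | 50.0 | 40.0 |
| Health | 29.2 | 30.2 | 30.2 | 24.3 | 20.0 | 16.5 | 27.9 | 90.0 |
| Overall | 25.0 | 32.2 | 33.3 | 29.2 | 11.2 | 15.6 | 42.1 | 44.8 |
| Δ vs. No Skill | | +7.2 | +8.3 | +4.2 | −13.8 | −9.4 | +17.1 | +19.8 |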
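
The summary numbers follow directly from the Overall rows. As a quick check, the sketch below recomputes the Δ vs. No Skill row, the margin over the strongest automated baseline, and the gap to the human reference. The names `COLUMNS`, `OVERALL`, and `summarize` are illustrative and not part of any released code; the values are copied from the Overall rows of the table.

```python
COLUMNS = ["No Skill", "Self-Gen", "CoT", "Skill-Creator",
           "AutoSkill", "Memento", "OpenSkill", "Human"]

# Overall pass rates (%) per target agent, copied from the table.
OVERALL = {
    "Opus 4.6": [25.5, 23.9, 23.9, 34.7, 24.7, 30.1, 43.6, 44.5],
    "GPT 5.2": [25.0, 32.2, 33.3, 29.2, 11.2, 15.6, 42.1, 44.8],
}


def summarize(row):
    scores = dict(zip(COLUMNS, row))
    base = scores["No Skill"]
    # Delta of every other column against the No Skill column.
    deltas = {name: round(s - base, 1)
              for name, s in scores.items() if name != "No Skill"}
    # Automated baselines: everything except OpenSkill and the human reference.
    best_baseline = max(s for name, s in scores.items()
                        if name not in ("OpenSkill", "Human"))
    return {
        "deltas": deltas,
        "margin": round(scores["OpenSkill"] - best_baseline, 1),
        "gap_to_human": round(scores["Human"] - scores["OpenSkill"], 1),
    }


for agent, row in OVERALL.items():
    s = summarize(row)
    print(agent, s["margin"], s["gap_to_human"])
```

Rounding to one decimal place matches the precision of the reported table and avoids floating-point noise in the differences.
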

Beyond SkillsBench, OpenSkill is also the best automated method on SocialMaze (82.7% / 70.7%) and ScienceWorld (90.0% / 85.3%) across both target agents.

Analysis

Skills transfer; the self-built verifier aligns; each component matters

The same skill files produced by Opus 4.6 transfer as-is to weaker models. The virtual verifier covers most hidden test intents without seeing ground-truth tests, and the refinement loop peaks at a few iterations.

Figure: Bar chart of average reward when transferring Opus 4.6-generated skills to four weaker models.

RQ1 — Transferability

OpenSkill-generated skills yield the highest reward across four weaker target models, gaining 5.5 to 14.8 points over the no-skill baseline with no model-specific adaptation.

Figure: Two ablation panels on SocialMaze: reward vs. refinement iterations, and component contributions.

RQ3 — Component Contribution

On SocialMaze, reward peaks at three refinement rounds. Open-world query and the virtual verifier each improve over a parametric-only baseline and are largely complementary.

RQ2 — Virtual Verifier Quality

Without access to ground-truth tests, the virtual verifier still provides a meaningful proxy signal for skill refinement. It aligns with GT outcomes and covers most human-authored test intents.

The remaining gaps mainly come from benchmark-specific anti-cheat checks and deeper semantic-quality tests that require domain expertise beyond the task specification.

| | GT reward > 0 | GT reward = 0 |
|---|---|---|
| Proxy pass | 39.29% | 29.76% |
| Proxy fail | 9.52% | 21.43% |

80.5% recall against GT-positive outcomes
60.7% overall proxy / GT agreement
88.9% GT test intents covered
3.4× median test count vs. GT suite
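
The recall and agreement figures can be recovered from the four cells of the proxy / GT table. The sketch below shows the arithmetic; the function name `proxy_metrics` and its argument names are illustrative, and the inputs are the cell percentages reported above.

```python
def proxy_metrics(pass_pos, pass_zero, fail_pos, fail_zero):
    """Recall and agreement of the proxy verifier against GT outcomes.

    Arguments are shares of all runs (in percent):
      pass_pos  - proxy pass, GT reward > 0
      pass_zero - proxy pass, GT reward = 0
      fail_pos  - proxy fail, GT reward > 0
      fail_zero - proxy fail, GT reward = 0
    """
    total = pass_pos + pass_zero + fail_pos + fail_zero
    # Recall: of the GT-positive runs, how many did the proxy also pass?
    recall = pass_pos / (pass_pos + fail_pos)
    # Agreement: proxy and GT give the same verdict (both pass or both fail).
    agreement = (pass_pos + fail_zero) / total
    return recall, agreement


recall, agreement = proxy_metrics(39.29, 29.76, 9.52, 21.43)
print(f"recall = {recall:.1%}, agreement = {agreement:.1%}")
# recall = 80.5%, agreement = 60.7%
```
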
BibTeX

Citation

@article{openskill2026,
  title = {OpenSkill: Open-World Self-Evolution for LLM Agents},
  author = {Yan, Zhiling and Song, Dingjie and Zhang, Hanrong and Liang, Wei and Zhang, Yuxuan and Dai, Yutong and He, Lifang and Yu, Philip S. and Xu, Ran and Li, Xiang and Sun, Lichao},
  journal = {arXiv preprint},
  year = {2026}
}