Building Agent Lightning: Zero-Token Human Feedback for AI Training
TL;DR: We built a distributed reinforcement learning system that uses emoji reactions (💯❤️👍) as human feedback signals, achieving zero token overhead while training AI agents across phone, laptop, and VM instances. 113 rollouts collected in 7 days, mean reward 0.755, ready for APO training.

---
The Problem: Training AI Agents is Expensive
When you're building autonomous AI agents, you face a dilemma:
Option A: No human feedback
- Agents optimize for metrics (speed, token usage)
- Miss the human perspective (helpfulness, quality)
- Result: Fast but potentially unhelpful agents
Option B: Detailed written feedback
- Humans write detailed evaluations
- Requires dedicated labeling sessions
- Result: High-quality training data, but slow and expensive
---
The Solution: Emoji Reactions as Reward Signals
Instead of asking humans to write evaluations, we let them react with emoji to agent responses, just like you'd react to a message on Telegram or Discord.
Agent completes task → Human reacts 💯 → Reward +0.4
Agent makes mistake → Human reacts 👎 → Reward -0.3
Why emoji work:
- Zero tokens: No LLM calls for human feedback processing
- Zero friction: React in 1 second vs. writing a 30-second evaluation
- Natural: Humans already use emoji to express sentiment
- Multi-dimensional: 10 emotions mapped (💯❤️🔥🧠👍⚡🎯⚠️👎❌)
---
Architecture: Multi-Agent RL Training System
We built Agent Lightning as a distributed system across 3 instances:
Component Breakdown
1. Worker Tier (3 agents)
- AA (Phone): Android/Termux, 107 rollouts collected
- AE (Laptop): Ubuntu/systemd, 6 rollouts from workflows
- VM (Moltbook): Multipass, 1 rollout from research agent
Each worker executes tasks and emits rollouts (task, duration, tokens, errors, outcome).
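To make the shape of a rollout concrete, here is a minimal sketch of the record as a dataclass. The `Rollout` class name and the `outcome` values are assumptions for illustration; only the field list (task, duration, tokens, errors, outcome) comes from the description above.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class Rollout:
    """Hypothetical record for one completed task."""
    task: str
    duration: float  # seconds
    tokens: int      # input + output tokens
    errors: int      # number of errors encountered
    outcome: str     # assumed values: "success" or "failure"

    def to_json(self) -> str:
        # One compact JSON object per line suits an append-only JSONL log.
        return json.dumps(asdict(self))


example = Rollout("Generate blog post draft", 45.2, 3421, 0, "success")
line = example.to_json()
```

Serializing to a single line keeps each rollout independent, so a worker can emit it without knowing anything about the rollouts that came before.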
2. Collection Tier
- HTTP Emitter: Zero-dependency Python script (POST /rollouts)
- LightningStore: HTTP API server at 100.117.177.50:4747
- Storage: JSONL append-only log (113 rollouts)
3. Training Tier
- APO (Actor-Policy Optimization): Policy gradient training
- Input: 113 rollouts with blended rewards
- Output: Optimized agent policy
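The collection side can be pictured as a tiny HTTP server that appends each POSTed rollout to a JSONL file. The sketch below is not the actual LightningStore code; the handler class, the helper names, and the `rollouts.jsonl` default path are assumptions, and it uses only the standard library.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

LOG_PATH = "rollouts.jsonl"  # assumed file name


def append_rollout(rollout: dict, path: str = LOG_PATH) -> None:
    """Append one rollout as a single JSON line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(rollout) + "\n")


def load_rollouts(path: str = LOG_PATH) -> list:
    """Read every rollout back from the log, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


class RolloutHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/rollouts":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            rollout = json.loads(self.rfile.read(length))
        except ValueError:  # malformed JSON or invalid encoding
            self.send_error(400, "invalid JSON")
            return
        append_rollout(rollout)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 4747) -> None:
    """Block forever, accepting rollouts on the given port."""
    HTTPServer(("0.0.0.0", port), RolloutHandler).serve_forever()
```

Calling `serve()` starts the listener. An append-only log means a crash mid-write can damage at most the last line, which is one reason this layout is a common choice for small collectors.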
---
Rollout Lifecycle: From Task to Training
Here's the complete flow from task execution to policy optimization:
Step-by-Step Process
1. Task Execution
Agent performs a task (code generation, file operation, research, etc.)
2. Task Completion
System captures:
- Duration (seconds)
- Tokens used (input + output)
- Errors encountered
- Success/failure status
3. Reward Calculation
This is where it gets interesting. We blend 3 reward sources:
#### Automated Rewards (30%)
def automated_reward(rollout):
    duration_score = 1.0 - min(rollout.duration / 300, 1.0)  # Faster = better
    token_score = 1.0 - min(rollout.tokens / 10000, 1.0)     # Fewer tokens = better
    error_penalty = -0.5 if rollout.errors > 0 else 0.0      # Errors = bad
    return (duration_score + token_score + error_penalty) / 3
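A worked example helps show what this formula produces. The sketch below restates the same arithmetic with plain arguments instead of a rollout object (an assumption made so it runs on its own) and plugs in the duration and token count from the sample payload later in this post.

```python
def automated_reward(duration: float, tokens: int, errors: int) -> float:
    """Same arithmetic as above, with plain arguments for easy experimentation."""
    duration_score = 1.0 - min(duration / 300, 1.0)   # 0 s -> 1.0, 300 s or more -> 0.0
    token_score = 1.0 - min(tokens / 10000, 1.0)      # 0 tokens -> 1.0, 10k or more -> 0.0
    error_penalty = -0.5 if errors > 0 else 0.0
    return (duration_score + token_score + error_penalty) / 3


fast_clean = automated_reward(45.2, 3421, 0)    # about 0.502
slow_failed = automated_reward(300, 10000, 2)   # about -0.167
best_case = automated_reward(0, 0, 0)           # about 0.667
```

Because the error term is at most 0 and the sum is divided by 3, this signal ranges from roughly -0.17 to 0.67 rather than from 0 to 1, which is worth keeping in mind when weights are tuned.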
#### LLM Judge Rewards (30%)
def llm_judge_reward(rollout):
    prompt = f"""
    Evaluate this agent interaction on 3 dimensions (0.0-1.0 each):
    Task: {rollout.task}
    Response: {rollout.response}
    1. Conciseness - Was the response unnecessarily verbose?
    2. Correctness - Did it solve the task properly?
    3. Helpfulness - Was it useful to the user?
    Return JSON: {{"conciseness": X, "correctness": Y, "helpfulness": Z}}
    """
    scores = llm_call(prompt)  # ~350 tokens per evaluation
    return (scores.conciseness + scores.correctness + scores.helpfulness) / 3
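Token cost: ~350 tokens per evaluation. At 10% sampling that would be about 3,500 tokens per 100 rollouts if every sampled rollout were judged in its own call; with the truncation and batching described under Technical Implementation Details, the reported cost is 680 tokens per 100 rollouts (negligible).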
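Since the judge replies with JSON text, the scores need to be parsed before they can be averaged. The following is a minimal sketch of that step; the function name and the clamping to the 0.0-1.0 range are assumptions, added as a guard against a judge that returns out-of-range numbers.

```python
import json

DIMENSIONS = ("conciseness", "correctness", "helpfulness")


def parse_judge_scores(raw: str) -> float:
    """Average the three judge scores, clamping each one to [0.0, 1.0]."""
    data = json.loads(raw)
    clamped = [min(max(float(data[name]), 0.0), 1.0) for name in DIMENSIONS]
    return sum(clamped) / len(clamped)


score = parse_judge_scores(
    '{"conciseness": 0.8, "correctness": 1.0, "helpfulness": 0.9}'
)
```

A missing key raises `KeyError` here, which makes a malformed judge reply fail loudly instead of silently contributing a zero.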
#### Human Emoji Rewards (40%)
EMOJI_REWARDS = {
    "💯": +0.4,                          # Perfect execution
    "❤️": +0.3, "🔥": +0.3, "🧠": +0.3,  # Very positive
    "👍": +0.2, "⚡": +0.2, "🎯": +0.2,  # Positive
    "⚠️": -0.1,                          # Warning
    "👎": -0.3, "❌": -0.4, "💩": -0.5,  # Negative
}
def human_reward(emoji):
    # The Telegram message_id is mapped to a rollout_id elsewhere;
    # when the user reacts, that rollout's reward is updated with this value.
    return EMOJI_REWARDS.get(emoji, 0.0)  # Unmapped emoji count as neutral
Token cost: 0 tokens! Pure local processing.
4. Blended Reward
final_reward = (
    0.30 * automated_reward +
    0.30 * llm_judge_reward +
    0.40 * human_emoji_reward
)
Why 40% for human feedback?
Humans are the ground truth, so a reaction like "💯" gets the largest weight of the three signals, although it is blended with the automated metrics rather than replacing them.
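The blend can be sketched as a small function. One detail the post does not spell out is what happens when a rollout has no LLM judge score (only 10% are sampled) or no emoji reaction; the version below renormalizes the weights over whichever signals are present. That renormalization is an assumption for illustration, not a description of the deployed system.

```python
from typing import Optional

WEIGHTS = {"automated": 0.30, "llm_judge": 0.30, "human": 0.40}


def blended_reward(automated: float,
                   llm_judge: Optional[float] = None,
                   human: Optional[float] = None) -> float:
    """Weighted mean of the available signals (missing ones are skipped)."""
    signals = {"automated": automated, "llm_judge": llm_judge, "human": human}
    present = {name: value for name, value in signals.items() if value is not None}
    total_weight = sum(WEIGHTS[name] for name in present)
    return sum(WEIGHTS[name] * value for name, value in present.items()) / total_weight


full = blended_reward(0.50, 0.90, 0.4)   # 0.15 + 0.27 + 0.16 = 0.58
auto_only = blended_reward(0.50)         # falls back to the automated score
```

With all three signals present the result matches the formula above exactly, since the weights sum to 1.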
5. Store Rollout
POST to LightningStore with complete data:
{
  "rollout_id": "ro-abc123",
  "worker_id": "phone-aa",
  "task": "Generate blog post draft",
  "duration": 45.2,
  "tokens": 3421,
  "errors": 0,
  "reward": 0.78,
  "timestamp": "2026-02-12T18:00:00Z"
}
6. Training Threshold Check
Once we hit 50+ rollouts, trigger APO training.
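A threshold check like this can be as simple as counting lines in the JSONL log. The helper names below are hypothetical, and the sketch only decides whether training should start; it does not launch the training script itself.

```python
import os

TRAINING_THRESHOLD = 50  # from the "50+ rollouts" rule above


def count_rollouts(path: str) -> int:
    """Count non-empty lines in the JSONL log; a missing file counts as zero."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def should_train(path: str, threshold: int = TRAINING_THRESHOLD) -> bool:
    """True once the log holds at least `threshold` rollouts."""
    return count_rollouts(path) >= threshold
```

A scheduler or cron job could call `should_train("rollouts.jsonl")` periodically and start training when it returns `True`.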
7. APO Training
Policy gradient optimization using collected rollouts:
python3 train-apo.py --rollouts rollouts.jsonl --output policy.pt
---
Results: 7 Days of Data Collection
Metrics (Feb 5-12, 2026)
| Metric | Value |
|--------|-------|
| Total Rollouts | 113 |
| Mean Reward | 0.755 |
| Median Reward | 0.849 |
| Worker Distribution | AA: 107, AE: 6, VM: 1 |
| Task Diversity | 15 types across 7 categories |
| Token Budget | 680 tokens per 100 rollouts (LLM judge) |
| Human Feedback Overhead | 0 tokens |
Reward Distribution
Perfect (1.0):    ████████████████ 18%
Excellent (0.8+): ████████████████████████ 32%
Good (0.6-0.8):   ██████████████ 24%
Poor (<0.6):      ████████ 14%
Failed (0.0):     ████ 12%
Key insight: 74% of rollouts scored 0.6+ (good to perfect), so agents are already performing well before training.
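For readers who want to reproduce a breakdown like this from their own log, here is one way the buckets could be computed. The boundary handling is an assumption (for example, 0.8 counts as Excellent and anything at or below 0.0 counts as Failed), since the chart labels do not say which side each edge falls on.

```python
from collections import Counter


def bucket(reward: float) -> str:
    """Assign a reward to one of the five labels used in the chart."""
    if reward >= 1.0:
        return "Perfect"
    if reward >= 0.8:
        return "Excellent"
    if reward >= 0.6:
        return "Good"
    if reward > 0.0:
        return "Poor"
    return "Failed"


def distribution(rewards: list) -> dict:
    """Percentage of rollouts per bucket; an empty input gives an empty dict."""
    if not rewards:
        return {}
    counts = Counter(bucket(r) for r in rewards)
    return {name: 100 * n / len(rewards) for name, n in counts.items()}


shares = distribution([1.0, 0.85, 0.7, 0.3, 0.0])
```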
Worker Specialization Patterns
AA (Phone): Diverse tasks, 0.74 avg reward
- File operations (high success)
- Code generation (variable quality)
- Research tasks (mixed results)
- Code audits (perfect execution)
- Test verification (100% success)
- Fleet operations (reliable)
- Moltbook scraping (flawless)
---
Technical Implementation Details
1. HTTP Emitter (Zero Dependencies)
import json
import os
import uuid
import http.client
from datetime import datetime, timezone

def emit_rollout(task, duration, tokens, errors, reward):
    rollout = {
        "rollout_id": f"ro-{uuid.uuid4().hex[:8]}",  # Short random ID
        "worker_id": os.getenv("WORKER_ID", "unknown"),
        "task": task,
        "duration": duration,
        "tokens": tokens,
        "errors": errors,
        "reward": reward,
        # UTC timestamp in the same "...Z" form as the sample payload
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    conn = http.client.HTTPConnection("100.117.177.50:4747", timeout=10)
    try:
        conn.request(
            "POST",
            "/rollouts",
            body=json.dumps(rollout),
            headers={"Content-Type": "application/json"},
        )
        response = conn.getresponse()
        return response.status == 200
    finally:
        conn.close()  # Always release the socket, even on failure
Why zero dependencies? Works on Termux (Android) without pip install.
2. Message Tracking (Emoji → Rollout Mapping)
# When agent sends a message
def track_sent_message(message_id, rollout_id):
    tracker = load_tracker()  # JSONL file
    tracker.append({
        "message_id": message_id,
        "rollout_id": rollout_id,
        "timestamp": now()
    })
    save_tracker(tracker)

# When user reacts with emoji
def process_emoji_reaction(message_id, emoji):
    tracker = load_tracker()
    # Ignore reactions to messages that were never tracked
    entry = next((e for e in tracker if e["message_id"] == message_id), None)
    if entry is None:
        return
    reward = EMOJI_REWARDS.get(emoji, 0.0)
    # Update rollout in LightningStore
    update_rollout_reward(entry["rollout_id"], reward)
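The tracker helpers are not shown in this post, so here is one possible self-contained version backed by a JSONL file. The function signatures, the `path` parameter, and the "most recent mapping wins" rule are assumptions for illustration.

```python
import json
import time
from typing import Optional


def track_sent_message(path: str, message_id: int, rollout_id: str) -> None:
    """Append one message-to-rollout mapping as a JSON line."""
    entry = {
        "message_id": message_id,
        "rollout_id": rollout_id,
        "timestamp": time.time(),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def find_rollout_id(path: str, message_id: int) -> Optional[str]:
    """Return the rollout_id recorded for a message, or None if it was never tracked."""
    try:
        with open(path, encoding="utf-8") as f:
            entries = [json.loads(line) for line in f if line.strip()]
    except FileNotFoundError:
        return None
    for entry in reversed(entries):  # the most recent mapping wins
        if entry["message_id"] == message_id:
            return entry["rollout_id"]
    return None
```

Appending instead of rewriting the whole file keeps each write small, at the cost of a linear scan on lookup, which is fine at the scale of a few hundred messages.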
3. LLM Judge (Sessions Spawn Integration)
def judge_rollout(rollout):
    prompt = generate_judge_prompt(rollout)
    # Use OpenClaw sessions_spawn (uses subscription token)
    result = sessions_spawn(
        task=prompt,
        model="anthropic/claude-haiku-4-5",  # Fast + cheap
        timeout_seconds=30
    )
    scores = parse_json(result)  # Parsed JSON is a dict keyed by dimension
    return (scores["conciseness"] + scores["correctness"] + scores["helpfulness"]) / 3
Token efficiency:
- Use Haiku (cheapest model)
- Truncate context to 500 chars
- Batch process (5 rollouts per call)
- Result: 680 tokens per 100 rollouts
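The sampling, truncation, and batching steps listed above could look roughly like the sketch below. All three helper names are hypothetical, and the sampling is a simple per-rollout coin flip, which is only one of several ways to get a 10% sample.

```python
import random


def truncate(text: str, limit: int = 500) -> str:
    """Keep only the first `limit` characters of the context."""
    return text[:limit]


def sample_for_judging(rollouts: list, rate: float = 0.10, seed=None) -> list:
    """Keep each rollout with probability `rate` (random sampling)."""
    rng = random.Random(seed)
    return [r for r in rollouts if rng.random() < rate]


def batches(items: list, size: int = 5) -> list:
    """Split items into consecutive groups of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

With 100 rollouts and a 10% rate, about 10 rollouts are expected to be sampled, which at 5 per batch means roughly 2 judge calls.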
---
Lessons Learned
What Worked
✅ Emoji feedback is fast and natural
Users react in about 1 second, versus roughly 30 seconds for written feedback. Engagement rate: 60%+
✅ Multi-dimensional rewards prevent gaming
Agents can't optimize just for speed (automated) or just for quality (LLM judge); they need human approval too
✅ Zero-dependency emitter works everywhere
Runs on Android, Ubuntu, Multipass VMs without setup
✅ Async collection enables scale
Workers emit rollouts independently, LightningStore aggregates centrally
What We'd Change
⚠️ LLM judge sampling could be smarter
Current: Random 10% sampling
Better: Edge case detection (high variance, low confidence, human disagreement)
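As one illustration of the "human disagreement" criterion, the sketch below flags a rollout for judging when the automated score and the emoji reaction point in opposite directions. The function name and the 0.4 cutoff are assumptions, and the other two criteria (high variance, low confidence) are not covered.

```python
from typing import Optional


def needs_judge(automated: float,
                human: Optional[float],
                auto_cutoff: float = 0.4) -> bool:
    """True when a human reacted and the two signals disagree on good vs bad."""
    if human is None:
        return False  # no reaction, so nothing to disagree with
    auto_good = automated >= auto_cutoff
    human_good = human > 0
    return auto_good != human_good
```

For example, `needs_judge(0.6, -0.3)` is `True` (metrics look fine but the human reacted negatively), while `needs_judge(0.6, 0.4)` is `False`.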
⚠️ Emoji mapping needs calibration
Is 💯 worth 2× more than 👍? Should ⚠️ be neutral or negative?
⚠️ Worker specialization needs routing logic
Manual task assignment is fine for 3 workers, but won't scale to 10+
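One simple form of routing logic would send each task type to whichever worker has the best average reward for it so far. The sketch below assumes each rollout carries a `task_type` field, which the payload shown earlier does not include, and the `laptop-ae` worker ID in the example is hypothetical.

```python
from collections import defaultdict


def build_routing_table(rollouts: list) -> dict:
    """Map each task type to the worker with the highest mean reward for it."""
    totals = defaultdict(lambda: [0.0, 0])  # (task_type, worker) -> [sum, count]
    for r in rollouts:
        key = (r["task_type"], r["worker_id"])
        totals[key][0] += r["reward"]
        totals[key][1] += 1

    best = {}  # task_type -> (worker, mean reward)
    for (task_type, worker), (total, count) in totals.items():
        mean = total / count
        if task_type not in best or mean > best[task_type][1]:
            best[task_type] = (worker, mean)
    return {task_type: worker for task_type, (worker, _) in best.items()}


def route(task_type: str, table: dict, default: str = "phone-aa") -> str:
    """Pick a worker for a task, falling back to a default for unseen task types."""
    return table.get(task_type, default)


history = [
    {"task_type": "code_audit", "worker_id": "laptop-ae", "reward": 1.0},
    {"task_type": "code_audit", "worker_id": "phone-aa", "reward": 0.6},
    {"task_type": "research", "worker_id": "phone-aa", "reward": 0.7},
]
table = build_routing_table(history)
```

A purely greedy table like this never explores, so a real router would probably also send an occasional task to a non-preferred worker to keep its estimates fresh.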
---
Next Steps: APO Training & Deployment
Phase 1: Initial Training (This Week)
Collect 50+ rollouts (✅ Done - 113 collected)
Train baseline policy
python3 train-apo.py --rollouts rollouts.jsonl --epochs 100 --output policy-v1.pt
Evaluate on validation set
python3 evaluate-policy.py --policy policy-v1.pt --validation val.jsonl
Phase 2: Live Deployment (Next Week)
Deploy policy to AA phone
scp policy-v1.pt phone:/opt/agent-lightning/policy.pt
Switch to policy-guided mode
export AGENT_MODE=rl_guided
Monitor performance vs baseline
Phase 3: Continuous Learning (Ongoing)
Collect more rollouts with new policy
Retrain every 100 rollouts
Compare v1 → v2 → v3 policies
---
Conclusion: Human-in-the-Loop Without the Overhead
Agent Lightning proves you can have human feedback in AI training without sacrificing speed or burning tokens.
The key innovation: Treat emoji reactions as first-class reward signals, with 40% weight on human sentiment and 0 tokens of overhead.
The result: 113 rollouts in 7 days, ready for policy training, with real human feedback baked in.
Next challenge: Scale from 3 workers to 10+, maintain quality as task diversity increases, and prove that APO training actually improves agent performance.
Stay tuned for part 2: Agent Lightning Training Results, where we share the policy optimization outcomes and measure real-world performance gains.
---
Want to Try Agent Lightning?
GitHub: github.com/openclaw/agent-lightning (coming soon)
Documentation: Full setup guide for multi-agent RL training
Discord: Join our community for questions & discussion
Blog Series:
1. This post: System architecture & implementation
2. Coming next: Training results & performance analysis
3. Coming soon: Scaling to 10+ agents
---
_Built by Andre Frank & Archonic Arbiter | OptinAmpOut.com_