I Built an AI Personal Evolution System: Voice-Driven, Auto-Distilled, Gets Smarter Over Time
TL;DR: I built an AI personal system based on Apple’s built-in Voice Memos, processing 20 to 40 recordings daily. The core isn’t speech-to-text, but a complete pipeline: recording → local transcription → AI classification & execution → trust-level routing → email notification → SSH follow-up. The entire system runs on a $200/month Claude Max subscription. The AI automatically accumulates your behavioral patterns, decision preferences, and methodologies over time. The longer you use it, the better it knows you.
AI is getting cheaper and smarter. But one problem remains unsolved: it doesn’t know you.
Every time you open ChatGPT, Claude, or Gemini, it’s a brand new conversation. The AI doesn’t know what project you’re working on, what your tech stack is, what decisions you made yesterday, or what writing style you hate. All it can give you is an average-level answer designed for everyone.
I’ve spent the past few months solving this problem. What I ended up building is a system that continuously evolves around personal cognition: the AI accumulates context about me through daily use, periodically distills observations into methodology rules, then automatically applies those rules in subsequent interactions. Every conversation makes the next one better.
The daily entry point is Apple’s built-in Voice Memos. iPhone, iPad, Apple Watch, Mac — any Apple device works. A flash of inspiration while running, a to-do while cooking, a technical solution while walking — just open Voice Memos and say it. The AI processes it in the background, and results arrive via email within three minutes. Deep work at the desk and lightweight interaction on the go, two paths running in parallel.
Below, I break down every layer of this system.
Why AI Only Gives You “Correct Nonsense”
The essence of LLM training is Next Token Prediction: output the highest-probability next token. Highest probability means most people would agree — that’s consensus. RLHF stacks another layer on top: safety alignment penalizes controversial, strongly-opinionated outputs and rewards balanced, comprehensive, non-committal answers. Two mechanisms stacked together, and the LLM’s default behavior is regression to the mean.
This means one thing: AI model upgrades solve information asymmetry (things you didn’t know, now you do), but they can’t solve cognitive asymmetry. Facing the same industry report, a twenty-year veteran and a fresh hire see completely different things. The veteran has a judgment system built from years of trial and error, knowing which data is noise and which anomalies signal a trend. The newcomer lacks that filter — even with a report ten times as long, they can’t make the same quality decisions. The AI’s default output is essentially at that newcomer’s level: everything’s correct, but there’s no judgment.
To put it another way: AI has shifted from CPU-bound to memory-bound. Once model intelligence crosses a threshold, further upgrades yield diminishing returns. What determines the nature of the output is no longer model intelligence, but context. Just like in computing history, once CPUs got fast enough, the bottleneck shifted to memory architecture. Every model upgrade makes intelligence cheaper and available to everyone. Your personal context, on the other hand, belongs only to you — model upgrades don’t depreciate it. Continuously investing in a depreciating dimension (model intelligence) yields diminishing returns; investing in a non-depreciating dimension (personal context) compounds.
So the highest-leverage behavior when using AI today is consciously accumulating all your interaction data and periodically distilling it into your own methodology. My system has an automatic distillation pipeline for this: an Observer module monitors behavioral patterns daily, and a Reflector module periodically distills observations into persistent axioms and rules, writing them back into the cognitive framework files. Through continuous accumulation, I now have 45 decision axioms that get automatically loaded during decision-related tasks.
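To make the Observer/Reflector loop concrete, here is a minimal sketch of its storage layer. The file names (`observations.jsonl`, `axioms.md`) and the JSONL schema are my assumptions, not the author’s actual layout; the distillation itself (turning observations into axioms) would be an LLM call and is out of scope here.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

OBS_LOG = Path("observations.jsonl")  # hypothetical path for the Observer's log
AXIOMS = Path("axioms.md")            # hypothetical path for Reflector output

def record_observation(pattern: str, evidence: str) -> dict:
    """Observer side: append one behavioral observation as a JSONL entry."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "pattern": pattern,
        "evidence": evidence,
    }
    with OBS_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def load_axioms() -> list[str]:
    """Reflector side: axioms persisted as one markdown bullet per line,
    ready to be loaded into the prompt during decision-related tasks."""
    if not AXIOMS.exists():
        return []
    return [line[2:].strip()
            for line in AXIOMS.read_text().splitlines()
            if line.startswith("- ")]
```

The point of the append-only log is that the Reflector can re-read the raw observations at any time and re-distill them as the rule set matures.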
Understanding this, the voice pipeline I’ll describe next isn’t just an efficiency tool. It’s the primary input channel for this accumulation process: 20 to 40 voice memos per day, each one providing new observational data to the system.
Why Voice Memos
I tried many approaches for the recording entry point and kept coming back to Apple’s built-in Voice Memos.
The reason is simple: it’s the lowest-friction recording method in the Apple ecosystem, natively integrated into the system. Available on iPhone, iPad, Apple Watch, and Mac out of the box. Recordings sync automatically to Mac via iCloud with zero extra configuration. On Apple Watch, just raise your wrist and tap. On iPhone, launch it straight from Control Center. The ecosystem stability has been proven over more than a decade.
Of course, you could build a custom app to replace it. But Voice Memos is already a 90% solution. Developing and maintaining an app for that last 10% has terrible ROI.
The principle behind this choice: build on top of excellent existing solutions. Save your energy for parts that actually create differentiated value.
Architecture: From Recording to Action
The full pipeline data flow:
Voice Memos recording (10 seconds)
↓ iCloud sync
macOS Voice Memos App
↓ Daemon scans every 60 seconds
Local transcription (mlx_whisper, Apple Silicon accelerated)
↓ Vocabulary cleanup + spelling detection
AI classification + execution (Claude Code headless)
↓ Trust-level routing + approval queue
Email notification → iPhone/iPad
↓ Optional
SSH follow-up conversation
Local Transcription
A LaunchAgent daemon scans the Voice Memos SQLite database every 60 seconds. When it finds a new recording, it calls mlx_whisper for transcription. The model is whisper-large-v3-turbo in MLX format, running at roughly 5 to 10x real-time on an M4 Pro, at zero cost, with data never leaving the machine.
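A minimal sketch of that poll-and-transcribe cycle is below. The recordings directory is an assumption (the real daemon reads the Voice Memos SQLite database; here I watch a folder of .m4a files instead to keep the sketch self-contained). The `mlx_whisper.transcribe(..., path_or_hf_repo=...)` call matches the mlx-community usage, but treat the exact model repo string as something to verify.

```python
import time
from pathlib import Path

RECORDINGS_DIR = Path.home() / "Recordings"  # assumption: wherever the .m4a files land

def find_new_recordings(directory: Path, seen: set[str]) -> list[Path]:
    """One poll cycle: return recordings we have not processed yet."""
    new = [p for p in sorted(directory.glob("*.m4a")) if p.name not in seen]
    seen.update(p.name for p in new)
    return new

def transcribe(path: Path) -> str:
    """Local transcription on Apple Silicon via mlx_whisper."""
    import mlx_whisper  # heavyweight import kept inside the function
    result = mlx_whisper.transcribe(
        str(path), path_or_hf_repo="mlx-community/whisper-large-v3-turbo"
    )
    return result["text"]

def poll_loop(directory: Path = RECORDINGS_DIR, interval: int = 60) -> None:
    """What the LaunchAgent daemon effectively does every 60 seconds."""
    seen: set[str] = set()
    while True:
        for rec in find_new_recordings(directory, seen):
            print(transcribe(rec))
        time.sleep(interval)
```

In the real system the loop body would hand the transcript to the cleanup and classification stages rather than printing it.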
After transcription, there’s a vocabulary cleanup step. I maintain a personal vocabulary file (JSON) with three types of corrections: common typos, spelling detection (saying “spelled A-B-C” during recording is automatically recognized), and proper noun context (preventing the same person’s name from being re-confirmed every day). This vocabulary accumulates through use, and transcription accuracy keeps improving.
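A sketch of what that cleanup pass might look like. The JSON schema and example entries are hypothetical; the “spelled A-B-C” pattern is one plausible way to implement the spelling-detection rule described above.

```python
import json
import re

# Hypothetical vocabulary file contents; proper_nouns would be injected into
# the prompt as context rather than substituted into the text.
VOCAB = json.loads("""
{
  "typos": {"cloud code": "Claude Code"},
  "proper_nouns": ["Resonance", "Fynn"]
}
""")

# "spelled F-y-n-n" -> capture the hyphenated letters
SPELLOUT = re.compile(r"spelled ((?:[A-Za-z]-)+[A-Za-z])")

def clean_transcript(text: str, vocab: dict) -> str:
    # 1. Known typo fixes, case-insensitive.
    for wrong, right in vocab["typos"].items():
        text = re.sub(re.escape(wrong), right, text, flags=re.IGNORECASE)
    # 2. Collapse "spelled A-B-C" into the joined, capitalized word.
    def join(m: re.Match) -> str:
        word = m.group(1).replace("-", "")
        return word[0].upper() + word[1:].lower()
    return SPELLOUT.sub(join, text)
```

Because the vocabulary is just a JSON file, every correction you add once keeps paying off on every future transcript.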
Six Classification Types
Transcribed text is sent to Claude, which classifies and executes:
| Type | Typical Input | Processing |
|---|---|---|
| action | “Check Samsung T9 Mac compatibility” | Search multiple sources, cross-verify, output conclusion |
| task | “Remind me to call Fynn tomorrow at 3pm” | Parse time, create scheduled task, auto-trigger at time |
| idea | “The voice pipeline could be open-sourced” | Write to Brain Dump, auto-fill frontmatter |
| curiosity | “What’s Mars’s atmosphere made of” | Quick concise answer, no deep research |
| log | “Ran 5km today, felt good” | Summarize and persist to monthly log |
| decision | “Approve that Blog draft” | Read approval queue, execute the blocked operation |
Classification only answers “what to do.” An orthogonal dimension, “trust level,” answers “do I need to approve this”:
- T0: fully automatic, with email notification
- T1: executes, but the resulting knowledge entries go to post-review
- T2: external-facing drafts are queued for voice approval
- T3: irreversible operations are marked and never auto-executed
The benefit of orthogonal dimensions: the same action type can be T0 (checking weather) or T2 (writing a Blog draft). They evolve independently without needing special logic for every combination.
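The routing logic stays trivial precisely because the two dimensions are orthogonal. A minimal sketch (the return strings are illustrative stand-ins for the real handlers):

```python
from enum import IntEnum

class Trust(IntEnum):
    T0 = 0  # auto-execute, email notification only
    T1 = 1  # execute, knowledge entry goes to post-review
    T2 = 2  # draft queued for explicit (voice) approval
    T3 = 3  # irreversible: mark, never auto-execute

def route(kind: str, trust: Trust) -> str:
    """Kind says what to do; trust says whether to wait for a human."""
    if trust is Trust.T0:
        return f"execute:{kind}"
    if trust is Trust.T1:
        return f"execute-and-flag:{kind}"
    if trust is Trust.T2:
        return f"queue-for-approval:{kind}"
    return f"block:{kind}"
```

The same `action` kind flows through different branches depending on trust, with no per-combination special cases.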
Context Injection
This is the highest-ROI part of the entire system. Each time the AI processes a recording, six layers of context are injected:
- User identity (core information from the cognitive profile file)
- Communication style preferences (constraining AI expression)
- Concept dictionary (a “private language” between me and the AI, compressing communication cost)
- Proper noun context (names and project names from the vocabulary)
- Today’s TODO (knowing what I’m working on)
- Last 24 hours of recording summaries (short-term thought-stream continuity)
When I say “Did you check that CJK bug in Resonance,” the AI knows Resonance is my open-source project, the CJK bug refers to token counting issues with Chinese characters, and this task is in today’s TODO. It goes straight to checking progress instead of asking “What’s Resonance?”
Same model, with context versus without — two completely different worlds of output quality.
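Assembling the six layers is mechanical. Here is a sketch; the file names are illustrative (USER.md and COMMUNICATION.md come from the cognitive framework mentioned later, the rest are my placeholders), and missing layers are simply skipped.

```python
from pathlib import Path

def build_context(base: Path, transcript: str) -> str:
    """Concatenate the six context layers in front of the transcript."""
    layers = [
        ("Identity", base / "USER.md"),           # who the user is
        ("Style", base / "COMMUNICATION.md"),     # output style constraints
        ("Dictionary", base / "concepts.md"),     # private-language concepts
        ("Proper nouns", base / "vocabulary.md"), # names, project names
        ("Today's TODO", base / "todo.md"),       # current focus
        ("Last 24h", base / "recent.md"),         # short-term thought stream
    ]
    parts = [f"## {title}\n{path.read_text().strip()}"
             for title, path in layers if path.exists()]
    parts.append(f"## Transcript\n{transcript}")
    return "\n\n".join(parts)
```

Everything up to the final section is static or cheaply regenerated, so the per-recording cost of injection is near zero.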
Email Notifications
After each recording is processed, an HTML email is sent to my Gmail. It contains the classification tag, trust level, AI execution results, cleaned transcript, and an SSH command.
Why email? It’s the only channel that can push instantly to iPhone/iPad without any extra app. Gmail App push latency is sub-second. Get it running first — no need to set up a Telegram Bot or WebSocket.
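Building such a notification with the standard library is a few lines. A sketch (sender, recipient, and the HTML layout are placeholders; actual delivery would go through the Gmail API or SMTP):

```python
from email.message import EmailMessage

def build_result_email(kind: str, trust: str, result: str,
                       transcript: str, ssh_cmd: str) -> EmailMessage:
    """Result email: classification tag, trust level, AI output,
    cleaned transcript, and the SSH follow-up command."""
    msg = EmailMessage()
    msg["Subject"] = f"[{kind}/{trust}] {transcript[:40]}"
    msg["From"] = "pipeline@localhost"   # placeholder sender
    msg["To"] = "me@example.com"         # placeholder recipient
    msg.set_content(f"{result}\n\n{transcript}\n\n{ssh_cmd}")
    msg.add_alternative(
        f"<h3>{kind} · {trust}</h3><p>{result}</p>"
        f"<blockquote>{transcript}</blockquote><pre>{ssh_cmd}</pre>",
        subtype="html",
    )
    return msg
```

The plain-text alternative matters: some mail clients preview it, and the SSH command should survive in both parts.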
Scheduled Tasks: From Reactive to Proactive
The pipeline started as reactive: you record one, it processes one. The task classification pushed it toward proactive service.
“Check Karpathy’s Twitter every Sunday and summarize any new content for me.”
This sentence becomes a LaunchAgent plist file through the pipeline, triggering Claude to run the check every Sunday at 10am, with results delivered via email.
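Generating such a plist programmatically is straightforward with `plistlib`. A sketch, assuming the claude binary lives at `/usr/local/bin/claude` (adjust to your install path); in launchd's `StartCalendarInterval`, Weekday 0 is Sunday:

```python
import plistlib

def make_weekly_job(label: str, prompt: str, weekday: int, hour: int) -> bytes:
    """Build a LaunchAgent plist that runs a headless Claude check weekly."""
    job = {
        "Label": label,
        "ProgramArguments": ["/usr/local/bin/claude", "-p", prompt],
        "StartCalendarInterval": {"Weekday": weekday, "Hour": hour, "Minute": 0},
        "StandardOutPath": f"/tmp/{label}.log",  # capture output for the email step
    }
    return plistlib.dumps(job)
```

The bytes would be written to `~/Library/LaunchAgents/<label>.plist` and loaded with `launchctl`.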
The daily recap is also proactive. Every day at 23:00, a scheduled task aggregates all data sources (voice memos, Git commits, Brain Dumps, knowledge base changes, competitor reports), generates a structured daily report, sends it by email, then distills the next day’s TODO list from the day’s activities. The next morning, the inbox has yesterday’s review and today’s suggestions.
Follow-Up Conversations: SSH Remote Operations on iPhone/iPad
I spent significant time personally testing remote terminal solutions on iPhone and iPad. This experience is worth detailing because there are real pitfalls.
Final Solution
Install Termius (an SSH client) on iPad/iPhone and connect directly to the Mac’s SSH service. Every result email has a command at the bottom (~/s 113230-2c811548). Paste and execute it in Termius to enter a Claude Code conversation with full context.
Pitfalls Encountered
Mosh (failed). Initially used Mosh instead of SSH because Mosh supports automatic reconnection, theoretically better for mobile scenarios. In practice, Mosh has serious CJK character rendering bugs (GitHub Issue #1041, unfixed for 7 years). Chinese characters get truncated and misaligned. Claude Code’s TUI interface is essentially unusable under Mosh.
tmux (partially failed). Wanted to use tmux for session persistence, but encountered PTY allocation issues under Termius SSH. tmux new-session repeatedly errored with “open terminal failed: not a terminal.” Two days of debugging revealed it was a PTY compatibility issue between Termius and the tmux version.
Final choice: pure SSH, no Mosh, no tmux. Mosh’s reconnection and tmux’s session persistence are essentially redundant, and Mosh’s CJK problem is an architectural defect (mosh uses its own wcwidth implementation that’s inconsistent with terminal emulators) — unfixable in the short term. SSH disconnection is handled by “command-line conversation recovery”: just re-paste the command, and Claude Code automatically loads previous execution results.
The lesson from this experience: when facing terminal rendering issues, first draw the complete chain (Termius → Mosh → tmux → Claude Code TUI), identify redundant layers, and remove them rather than tuning parameters. The longer the chain, the more likely the weakest link becomes the bottleneck.
Cost: $200/Month Powers the Entire Pipeline
This might be the most common question. The pipeline’s running cost:
- Claude Max subscription: $200/month
- Local transcription: $0 (mlx_whisper runs on Mac)
- Email sending: $0 (Gmail API free tier is sufficient)
- Infrastructure: One Mac + any Apple device + existing iCloud/Gmail
- Software to install: mlx_whisper, Claude CLI
The key: all AI processing runs through Claude Code’s headless mode (claude -p), using Max subscription quota without consuming API tokens. The Max subscription provides virtually unlimited usage — 20 to 40 daily recordings are easily supported.
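A headless invocation from the pipeline boils down to one subprocess call. The `-p` and `--output-format json` flags are documented Claude Code CLI options; my assumption, based on the current CLI behavior, is that the JSON envelope carries the final text in a `result` field — verify against the version you run.

```python
import json
import subprocess

def build_cmd(prompt: str) -> list[str]:
    # -p runs one non-interactive turn; JSON output is easier to parse downstream
    return ["claude", "-p", prompt, "--output-format", "json"]

def headless_claude(prompt: str, timeout: int = 300) -> str:
    """Run one headless Claude Code turn; raises on non-zero exit or timeout."""
    proc = subprocess.run(build_cmd(prompt), capture_output=True,
                          text=True, timeout=timeout)
    proc.check_returncode()
    return json.loads(proc.stdout)["result"]
```

The timeout matters in a daemon: a hung turn should fail loudly and be retried, not silently stall the 60-second scan loop.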
Claude Code Headless Mode Stability
Honestly, using claude -p non-interactive mode for pipeline automation isn’t Anthropic’s officially recommended primary use case. After months of real-world testing, it runs normally in most cases, but occasionally encounters session limits or response timeouts.
My assessment: this approach will work fine within a 3 to 6 month window. If Anthropic later adjusts Max subscription policies or headless mode develops stability issues, switching to direct API or similar modes in other coding tools is straightforward. For now, the $200/month Max subscription is the optimal solution.
Bottleneck Migration: How Trust Levels Emerged
After the pipeline was built, the processing flow looked like this:
Inspiration → Voice capture (~10s) → AI processing (~3min) → Human approval (???) → Execution complete
The first three stages add up to under four minutes. But “human approval” is unbounded. I might not check email for six hours, or not respond until the next day.
The voice pipeline solved the “idea capture” bottleneck, and the constraint immediately migrated to “human approval bandwidth.” This is the Theory of Constraints (TOC) applied to personal systems: optimize one stage, and the bottleneck shifts to the next weakest link.
Trust levels (T0 through T3) are the engineering response to this bottleneck. By letting 80% of operations (T0) skip approval entirely, human bandwidth is reserved for the 20% that actually need judgment (T2/T3).
Going further, the decision classification enables “voice approval.” Say “approve that Blog draft” into Voice Memos, and the pipeline automatically matches the corresponding item in the approval queue and executes it. From discovering a problem to deploying a fix, entirely through voice.
Self-Reference: The Pipeline Improving Itself
This system has an interesting property: I use it to improve itself.
Recently, I noticed through voice that my TODO list was empty every day. I recorded “Why is TODO empty.” After processing, the AI diagnosed the root cause: the daily recap script only generates reviews, not next-day TODOs. It proposed two solutions: mechanical copying (Option A) and AI distillation (Option B). I recorded another memo: “Option B sounds good.” The pipeline classified it as a decision, directly created the distillation script, and integrated it into the daily recap workflow.
This is “using the workflow to improve the workflow itself.” Friction encountered during use gets captured, analyzed, and fixed through the same pipeline.
The System’s Essence
Looking at the entire system, core value concentrates in three points:
Personalized cognitive context. Cognitive profile, communication style constraints, 45 decision axioms, concept dictionary — every AI interaction carries your complete cognitive context. Not just “AI knows your name,” but “AI knows how you think.”
Automated scheduled task orchestration. A LaunchAgent-driven task network: scanning recordings every 60 seconds, recommending writing topics at 16:00, generating a competitor briefing at 18:00, and running the daily recap + TODO distillation at 23:00. These tasks run while you’re away from the computer, pushing AI from “reactive response” to “proactive service.”
Email notification loop. Email is the shortest notification path from AI system to human — no extra apps needed, instantly reachable on all devices. Combined with SSH follow-up capability, it forms a complete loop: voice input → AI processing → email notification → SSH follow-up.
Advice for Builders
Start with the minimum pipeline. Recording → transcription → classification → email notification. Skip trust levels, vocabulary, and context injection first. Verify that the shortest path of “voice in, email out” works.
Choose local transcription. Whisper’s MLX variant performs excellently on Apple Silicon. Zero cost, fast, privacy-friendly. Unless you need speaker diarization (multi-person meetings), there’s no reason to use cloud services.
Context injection is the highest-ROI investment. Even just injecting your role description and today’s TODO into the prompt will dramatically improve AI output quality. This is more effective than switching to a more expensive model.
Pure SSH, skip Mosh. If you plan to operate remotely from iPad/iPhone via Termius, remember that Mosh’s CJK rendering bug is architectural and unfixable in the short term. Pure SSH + Claude Code’s session recovery mechanism is sufficient.
Watch for bottleneck migration. After solving an efficiency problem, ask yourself: where has the constraint migrated to now? Trust levels and voice approval are designs that naturally emerged after bottleneck migration.
Standing on the Shoulders of Giants
The cognitive framework layer of this system wasn’t built from scratch. It’s based on grapeot’s open-source project Context Infrastructure, with the design philosophy detailed in a separate article. The framework provides a complete cognitive file structure (SOUL.md for AI personality, USER.md for user profile, COMMUNICATION.md for output style constraints) and an automatic distillation pipeline (Observer/Reflector). I forked the repository, filled in my own content, and connected it to the voice pipeline.
These template file structures originally came from Peter Steinberger’s OpenClaw project. OpenClaw is an always-on AI companion framework providing real-time messaging, daemon self-healing, and heartbeat monitoring. grapeot distilled a more general cognitive framework from it, focusing on the core problem of “helping AI understand you.”
My contribution was: building on grapeot’s framework, solving the data input problem with the voice pipeline, solving the proactive service problem with scheduled tasks, and solving the mobile scenario problem with the email loop. The cognitive framework is the brain, the voice pipeline is the senses, scheduled tasks are habits, and email is the nervous system.
The material for this article was collected through the voice pipeline described herein.
If you found this helpful, consider buying me a coffee to support more content like this.