I Installed an AI Vision Assistant on My iPhone That Can Control Claude on My Mac
Two days ago, a 366-star open-source project turned my iPhone into an AI vision assistant – point it at anything, talk, and it not only understands what it sees but can also dispatch tasks to Claude running on my Mac.
The project is called VisionClaw, built on the Meta Ray-Ban smart glasses SDK. But even without the glasses, you get the full experience using just the iPhone camera.
What Is This
VisionClaw’s core is a dual-model architecture:
```
iPhone Camera + Microphone
            |
            v
Gemini 2.5 Flash ---- real-time voice + vision ("eyes and mouth")
            |
            | tool call
            v
OpenClaw Gateway (Mac) ---- Claude executes tasks ("hands and brain")
            |
            v
Clawdbot / Claude Code
```
Why two models? Because each excels at different things.
Gemini’s real-time streaming audio-visual capability is currently the best in class – you talk to your phone while it sees the camera feed, understanding what you’re looking at and what you’re asking about. No other model delivers this kind of multimodal real-time interaction as smoothly.
Claude, on the other hand, is stronger at tool execution and reasoning. When Gemini determines your request requires action (searching, sending a message, operating the computer), it passes the task through OpenClaw Gateway to Claude on your Mac.
The combined effect: you point your phone camera at something and say "look up the price of this"; Gemini handles understanding what you're seeing and what you want, and Claude handles actually doing the lookup.
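The split can be sketched in a few lines. Everything below is illustrative: the keyword check is a crude stand-in for Gemini's actual intent detection, and none of these names come from the project's code.

```python
# Toy sketch of the dual-model routing: the "eyes and mouth" model decides
# whether an utterance is plain chat or an actionable request; only actionable
# requests cross the boundary to the gateway, as plain task text.
from dataclasses import dataclass

# Stand-in trigger words; the real model classifies intent, not keywords.
ACTION_KEYWORDS = ("search", "look up", "send", "open", "price")

@dataclass
class Utterance:
    text: str
    frame_jpeg: bytes  # current camera frame accompanying the request

def needs_tool_call(utterance: Utterance) -> bool:
    """Crude stand-in for Gemini's intent detection."""
    return any(kw in utterance.text.lower() for kw in ACTION_KEYWORDS)

def route(utterance: Utterance) -> str:
    if needs_tool_call(utterance):
        # Only the task description crosses the gateway boundary --
        # no conversation history, no raw audio.
        return f"gateway:{utterance.text}"
    return f"gemini:{utterance.text}"

print(route(Utterance("look up the price of this", b"")))
print(route(Utterance("what a nice sunset", b"")))
```

The design choice worth noting is the narrow interface: the gateway sees a task string, nothing else, which is also what makes the security audit below tractable.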
Security Audit
The project has only 366 stars and was created just 2 days ago. Before installing anything that requires camera and microphone permissions, a code audit is non-negotiable.
I ran 6 checks. All passed:
| Check | Result |
|---|---|
| Network requests | Only connects to Google Gemini API + local Mac OpenClaw Gateway |
| Data transmission | Only camera frames (JPEG at 50% quality, 1 fps) + audio + tool call text |
| Third-party dependencies | Only Meta Wearables DAT SDK (official Facebook library) |
| System permissions | Camera + microphone (core functionality), no contacts / location |
| OpenClaw communication | Only passes task text and auth token, no Clawdbot session / memory data leakage |
| System prompt | No instructions guiding AI to access private data |
In short: it sends only the necessary visual and audio data to Gemini and doesn’t touch anything else on your phone. The OpenClaw Gateway side is also isolated – Claude only receives task descriptions and can’t see your full conversation history with Gemini.
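The 1 fps figure from the audit implies a simple throttle between the camera feed and the network. A minimal sketch of how such a throttle behaves (the class and method names are mine, not the project's; only the timing values come from the article):

```python
# Frames arrive from the camera far faster than once per second; the
# throttle lets at most one frame per interval through to be JPEG-encoded
# and sent, which bounds how much visual data ever leaves the phone.
class FrameThrottle:
    def __init__(self, interval_s: float = 1.0):
        self.interval_s = interval_s
        self._last_sent = float("-inf")

    def should_send(self, now: float) -> bool:
        """Return True if enough time has passed to send another frame."""
        if now - self._last_sent >= self.interval_s:
            self._last_sent = now
            return True
        return False

throttle = FrameThrottle()
# Simulate a 30 fps camera over 3 seconds: only 3 frames pass.
sent = sum(throttle.should_send(t / 30) for t in range(90))
print(sent)  # 3
```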
Installation Pitfalls
Here are the main gotchas, so you don’t have to waste time on them.
Xcode signing configuration. You need to replace the original author’s Team ID and Bundle ID with your own. An easy trap: the certificate ID shown in macOS Keychain is not the same thing as the Xcode Personal Team ID. Don’t confuse them.
iPhone Developer Mode. iOS 16+ requires manually enabling Developer Mode before you can sideload apps. Path: Settings > Privacy & Security > Developer Mode. Enabling it requires a restart.
No Meta glasses? No problem – almost. This is where most people get stuck. The project assumes you own a pair of Meta Ray-Bans. Without them, you won’t see the “Start on iPhone” button. The workaround is the Mock Device Kit – execute Power On, Don, and Unfold in sequence to simulate a virtual pair of glasses. Once the registration flow is bypassed, the button appears.
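The ordering matters: each mock step only succeeds after the previous one. A toy state machine capturing that constraint (the class is illustrative; only the step names come from the article):

```python
# The virtual glasses must pass through Power On -> Don -> Unfold, in that
# order, before the registration flow considers them a real device.
class MockGlasses:
    SEQUENCE = ["power_on", "don", "unfold"]

    def __init__(self):
        self._done = []

    def execute(self, step: str) -> None:
        """Run one mock step; out-of-order steps are rejected."""
        expected = self.SEQUENCE[len(self._done)]
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self._done.append(step)

    @property
    def registered(self) -> bool:
        return self._done == self.SEQUENCE

glasses = MockGlasses()
for step in ("power_on", "don", "unfold"):
    glasses.execute(step)
print(glasses.registered)  # True
```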
Gateway configuration. OpenClaw Gateway defaults to binding on loopback (127.0.0.1), which the iPhone can’t reach. Change it to LAN binding (0.0.0.0) and make sure the chatCompletions endpoint is added.
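The fix is a one-line binding change, but it helps to see why it matters. A minimal sketch of the difference using Python's socket module (the Gateway's actual config mechanism is its own; this just demonstrates the two bind addresses):

```python
# A server bound to 127.0.0.1 only accepts connections originating on the
# Mac itself; 0.0.0.0 listens on every interface, including the LAN address
# the iPhone connects to. Port 0 lets the OS pick a free port.
import socket

def make_listener(host: str, port: int = 0) -> socket.socket:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))
    srv.listen(1)
    return srv

loopback = make_listener("127.0.0.1")  # default: unreachable from the phone
lan_wide = make_listener("0.0.0.0")   # all interfaces: phone can connect
lo_addr = loopback.getsockname()[0]
lan_addr = lan_wide.getsockname()[0]
loopback.close()
lan_wide.close()
print(lo_addr, lan_addr)
```

Binding to 0.0.0.0 exposes the Gateway to everyone on your LAN, which is why the auth token it passes alongside task text (noted in the audit above) matters.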
Real-World Experience
Honestly, the pure voice-vision conversation experience isn’t that different from opening the Claude App and turning on the camera. Gemini’s comprehension is solid, but “look at this and tell me about it” isn’t exactly groundbreaking on its own.
The differentiation is in tool calling. You can point at something and say “search this for me” or “send a message to so-and-so.” When Gemini identifies this as an actionable request, it routes the task through OpenClaw to Claude on your Mac. Regular vision-chat apps can’t do this.
In practice, though, this chain – Gemini recognizes intent, passes the task through Gateway, Claude executes, result returns, Gemini speaks it back – is still rough around the edges. Latency is noticeable, and error recovery in the middle of the chain isn’t graceful.
Using it outside your home is even more constrained. The Gateway runs on your Mac, so your iPhone must be on the same Wi-Fi network. You could bridge the networks with Tailscale, but the cost is a persistent VPN on your iPhone: an extra 7-15% daily battery drain and a permanent VPN icon in the status bar. That's too much friction for daily use.
Why I Think It Has Limited Utility Right Now
This is the part I most want to talk about.
For someone like me who has heavily customized their setup and written dozens of Agent Skills, the desktop Claude Code experience – where you see every step of the Agent’s output in real time, which file it’s reading, what tool it called, what intermediate results it generated – is the gold standard.
VisionClaw’s fundamental problem is this: after issuing a task through your phone or glasses, you can’t see the AI Agent’s execution process.
You don’t know what it’s doing. If it goes wrong, you can’t stop it in real time. This leads directly to two problems:
First, wasted tokens. The Agent might run down the wrong path for a long time without you knowing. By the time the result comes back and you realize it’s wrong, all the tokens spent getting there are gone.
Second, safety concerns. If the Agent executes something you didn’t want – sends the wrong message, deletes the wrong file – you can’t intervene in time. On a desktop with Claude Code, you can at least watch every step and hit Ctrl+C if something looks off. With voice commands through a phone, that safety net disappears.
This points to a deeper issue: the combination of voice-only interface + AI Agent currently lacks a critical component – a channel for real-time feedback and human intervention.
If deployed on a Meta Ray-Ban Display ($799, full-color screen) or the future Meta Orion (true AR glasses, 70-degree FOV, consumer version expected 2027), being able to see the Agent’s live output on the lens would dramatically improve the experience. But until then, using an iPhone screen for this feedback loop is inferior to just using a computer.
Add the network constraints (same Wi-Fi or Tailscale) on top, and practical utility is genuinely limited right now.
But I’m Still Bullish on This Direction
After all these caveats, why did I still spend half a day setting this project up?
Because what excites me about VisionClaw isn’t the product’s current polish – it’s the ecosystem shift it represents.
OpenClaw/Clawdbot has opened a door – it frees the AI Agent from the terminal window, making it callable from any device through any interface. This is exactly what I spent last year researching and wanting to build but couldn’t pull off. Seeing someone actually ship it, as open source no less, is a reminder of how fast things move.
From “Claude Code on a desktop” to “Clawdbot on Telegram” to “VisionClaw on glasses” – the entry point changes, but the Agent capability behind it is the same stack. This “one brain, many entry points” architecture is where things are heading.
Today’s VisionClaw is rough. But it validates a key hypothesis: the AI Agent’s interaction surface can break free from the screen. Once display technology catches up (AR glasses) and network issues get solved (local models or better remote solutions), products like this will become genuinely practical.
I hope to see more experiments like this.
Links
- VisionClaw GitHub – the open-source project discussed in this article
- Meta Wearables DAT SDK – Facebook’s official smart glasses development kit
- OpenClaw – the open-source framework providing Gateway capabilities, connecting external devices to local AI Agents
If you found this helpful, consider buying me a coffee to support more content like this.