Vision Tracking
Arkestra uses Apple’s Vision framework to detect faces, hands, and body pose from a live camera feed or any track’s rendered output. All detected values are normalised 0–1 floats that can be mapped to any parameter exactly like an LFO or audio signal.
Detection runs on the Neural Engine at ~15 fps with negligible CPU and GPU overhead.
Opening the Vision Tracking panel
Click the Vision Tracking icon (figure silhouette) in the right sidebar. The panel is titled Vision Tracking.
Input source
At the top of the panel, choose where Vision reads its frames from:
| Mode | What it analyses |
|---|---|
| Camera | A connected camera or webcam |
| Track | The rendered output of a track in the current project |
Click Camera or Track to switch. The choice is saved per project.
Camera mode
Under Camera Source, each available camera appears as a row. Click a row to activate it: the row highlights, and a camera icon appears while the session is running. Click the same row again to stop.
Use the Refresh button if a newly connected camera does not appear in the list.
A live camera Preview appears below the controls once a camera is active. The preview shows a real-time overlay of detected landmarks:
- Face — bounding box + centre dot; eye and mouth dots when Landmarks are enabled
- Left hand — emerald lines from wrist to each fingertip, curl-sized dots at tips, yellow line between thumb and index when pinching
- Right hand — same layout in purple
- Body — orange skeleton connecting neck, shoulders, elbows, wrists, hips, knees, and ankles
Click Hide / Show next to the Preview header to collapse the preview while keeping tracking active.
Track mode
Under Track Source, pick any track in the project from the dropdown. Vision then analyses each frame that track renders instead of a camera feed. Use this to react to your own visuals, for example to drive a shader parameter from the brightness or motion of another track.
The camera preview is hidden in Track mode (there is no separate camera feed).
Active Detectors
Enable only the detectors you need. Each group adds CPU cost:
| Toggle | What it enables | Default |
|---|---|---|
| Face | Face position, size, head angles, presence | On |
| + Landmarks | Eye-open and mouth-open values (requires Face) | Off |
| Hand | Both left and right hand — wrist, index tip, pinch, presence | Off |
| Body | Body centre, both wrists, presence | Off |
Smoothing
The Smoothing slider (0–95%) applies an exponential moving average to all outputs each analysis frame.
- 0% (Off) — raw values, may jitter
- 60% — the default; a good balance for most live use
- Higher values — more inertia, slower response
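The effect of the slider can be sketched as a single exponential-moving-average step per analysis frame. This is a minimal illustration, not Arkestra's actual code: the function name `smooth` and the choice to treat the slider value as the weight of the previous output are assumptions.

```python
def smooth(previous: float, raw: float, smoothing: float) -> float:
    """One exponential-moving-average step.

    `smoothing` is the slider value as a fraction (0.0 to 0.95).
    Assumption: it is the weight given to the previous output.
    """
    smoothing = min(max(smoothing, 0.0), 0.95)  # keep inside the slider range
    return smoothing * previous + (1.0 - smoothing) * raw


# A channel jumps from 0 to 1 while the slider sits at the default 60%.
value = 0.0
history = []
for _ in range(5):
    value = smooth(value, 1.0, 0.60)
    history.append(round(value, 4))

print(history)
```

Under this assumption, each analysis frame closes 40% of the remaining gap at the default setting, which is why higher settings feel slower to respond.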
Flip X
By default, all X values are mirrored so that X = 0 is the left side of the screen in both camera and track modes (consistent with Vision’s natural coordinate system and how shaders expect X). The result mirrors the performer’s movements as seen by the camera.
Enable Flip X to turn the mirroring off. Use this for rear-facing cameras, external cameras that are already oriented correctly, or whenever the unmirrored camera X matches your visual setup.
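As a rough sketch of the toggle's behaviour, mirroring a normalised coordinate amounts to `1 - x`. The function name `output_x` is hypothetical, and the exact point in the pipeline where Arkestra applies the mirror is an assumption.

```python
def output_x(camera_x: float, flip_x: bool = False) -> float:
    """Return the X value a mapping would receive.

    Assumption: with Flip X off (the default) the value is mirrored as 1 - x;
    with Flip X on, the unmirrored camera value passes through unchanged.
    """
    return camera_x if flip_x else 1.0 - camera_x


print(output_x(0.25))               # default: mirrored
print(output_x(0.25, flip_x=True))  # Flip X enabled: unmirrored
```

Y values are unaffected in this sketch, since the toggle only concerns the horizontal axis.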
Live Values
The Live Values section shows a real-time readout of every active channel as a labelled bar chart. Active groups glow when a subject is detected. Use this to understand and tune your mappings without needing to open the parameter editor.
Mapping Vision to a parameter
- Select a track or effect and open the Parameter Inspector.
- Click the source pill on any parameter and choose Vision.
- A three-row picker appears for the mapping type:
| Row | Parent type | Extended sub-type |
|---|---|---|
| Face | Face (position, size, angles) | + Landmarks (eye open, mouth open) |
| Hand | L. Hand (wrist, index, pinch) | L. Fingers (all tips + per-finger curl) |
| Hand | R. Hand | R. Fingers |
| Body | Body (centre, both wrists) | Skeleton (full joint positions) |
Select the sub-type row, then pick a specific landmark from the grid below.
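Because every Vision channel is a 0–1 float, a mapping can be pictured as a linear scale onto the parameter's range. The sketch below is illustrative only: `map_to_range`, its `invert` option, and the example range are hypothetical and do not describe Arkestra's internal mapping code.

```python
def map_to_range(value: float, low: float, high: float, invert: bool = False) -> float:
    """Scale a normalised 0-1 channel value onto [low, high]."""
    value = min(max(value, 0.0), 1.0)  # guard against out-of-range input
    if invert:
        value = 1.0 - value
    return low + value * (high - low)


# Example: a Face X reading of 0.25 driving a parameter that spans 200 to 2000.
print(map_to_range(0.25, 200.0, 2000.0))
print(map_to_range(0.25, 200.0, 2000.0, invert=True))
```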
Available channels
All values are 0–1. Coordinate origin: X=0 is left, X=1 is right, Y=0 is bottom, Y=1 is top.
Face
| Channel | Label | Notes |
|---|---|---|
| X | Horizontal position | 0 = left, 1 = right |
| Y | Vertical position | 0 = bottom, 1 = top |
| Size | Bounding box size, √(width × height) | Larger when closer to camera |
| Roll | Head tilt left/right | 0.5 = upright |
| Yaw | Head turn left/right | 0.5 = facing forward |
| Pitch | Head tilt up/down | 0.5 = level |
| Presence | Confidence of detection | 0 = not detected, 1 = detected; slow decay on loss |
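Two of these channels benefit from a quick worked example. Size is described as √(width × height) of the bounding box, and the three head angles rest at 0.5, so a mapping that needs a signed value has to re-centre them. The helper names below are hypothetical, and the assumption that width and height are themselves normalised 0–1 is not confirmed by the table.

```python
import math


def face_size(width: float, height: float) -> float:
    """Size channel as described in the table: sqrt(width * height)."""
    return math.sqrt(width * height)


def signed(centred: float) -> float:
    """Re-centre a 0-1 channel whose rest position is 0.5 onto -1..1."""
    return (centred - 0.5) * 2.0


print(face_size(0.4, 0.1))  # a wide, short box
print(signed(0.5))          # upright / facing forward / level
print(signed(0.75))         # halfway towards one extreme
```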
Face Landmarks (requires + Landmarks)
| Channel | Label | Notes |
|---|---|---|
| Eye L | Left eye openness | 0 = closed, 1 = wide open |
| Eye R | Right eye openness | |
| Mouth | Mouth openness | 0 = closed, 1 = open |
L. Hand / R. Hand
| Channel | Label |
|---|---|
| Wrist X / Y | Wrist position |
| Index X / Y | Index fingertip position |
| Pinch | Distance between thumb tip and index tip — 0 = pinching, 1 = open |
| Presence | Detection confidence |
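One way to picture the Pinch channel is as the thumb-to-index distance divided by a reference "fully open" distance and clamped to 1. This is a sketch under assumptions: the table does not say how Arkestra normalises the distance, and `open_distance` is a made-up reference value.

```python
import math


def pinch(thumb_tip: tuple[float, float],
          index_tip: tuple[float, float],
          open_distance: float = 0.25) -> float:
    """Return 0 when the tips touch and 1 when they are fully apart.

    `open_distance` is a hypothetical distance (in normalised coordinates)
    at which the hand counts as fully open.
    """
    distance = math.dist(thumb_tip, index_tip)
    return min(distance / open_distance, 1.0)


print(pinch((0.5, 0.5), (0.5, 0.5)))  # tips touching
print(pinch((0.1, 0.1), (0.9, 0.9)))  # far apart, clamped
```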
L. Fingers / R. Fingers (extended)
| Channel | Label |
|---|---|
| Mid X / Y | Middle fingertip |
| Ring X / Y | Ring fingertip |
| Little X / Y | Pinky tip |
| Thumb Curl | 0 = extended, 1 = fully curled |
| Index Curl | |
| Mid Curl | |
| Ring Curl | |
| Little Curl | |
Body
| Channel | Label |
|---|---|
| Center X / Y | Hip midpoint |
| L.Wrist X / Y | Left wrist |
| R.Wrist X / Y | Right wrist |
| Presence | Detection confidence |
Skeleton (extended body)
Full joint set: Neck, Left/Right Shoulder, Left/Right Elbow, Left/Right Knee, Left/Right Ankle, each as an X/Y pair.
Tips
- Start with Face only — it is the cheapest detector and covers most interactive use cases.
- Use Presence as a modulator to drive a parameter to zero when no subject is in frame, avoiding frozen values.
- Pinch (0 = pinching, 1 = open) can gate effects with hand gestures. Invert it in the mapping if you want a pinch to drive a value up.
- In Track mode, try pointing Vision at a feedback loop or particle track to create visual feedback that reacts to its own motion.
- The Live Values panel is the fastest way to verify a mapping before performing.
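The Presence and Pinch suggestions above can be sketched as two small operations on 0–1 values. The function names are hypothetical, and treating the gate as a plain multiplication is an assumption about how a modulator combines with a value.

```python
def gate(value: float, presence: float) -> float:
    """Scale a value by Presence so it falls to zero when nobody is in frame."""
    return value * presence


def invert(value: float) -> float:
    """Flip a 0-1 value, so a pinch (0) drives the parameter up (1)."""
    return 1.0 - value


# A pinch reading of 0.25 (nearly closed), inverted, with the hand detected.
print(gate(invert(0.25), presence=1.0))
# The same gesture value after the hand has left the frame.
print(gate(invert(0.25), presence=0.0))
```

Because Presence decays slowly on loss, a gate like this would fade the parameter out rather than cut it abruptly.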