Vision Tracking

Arkestra uses Apple’s Vision framework to detect faces, hands, and body pose from a live camera feed or any track’s rendered output. All detected values are normalised 0–1 floats that can be mapped to any parameter exactly like an LFO or audio signal.

Detection runs on the Neural Engine at ~15 fps with negligible CPU and GPU overhead.

Opening the Vision Tracking panel

Click the Vision Tracking icon (figure silhouette) in the right sidebar. The panel is titled Vision Tracking.

Input source

At the top of the panel, choose where Vision reads its frames from:

Mode	What it analyses
Camera	A connected camera or webcam
Track	The rendered output of a track in the current project

Click Camera or Track to switch. The choice is saved per project.

Camera mode

Under Camera Source, each available camera appears as a row. Click a row to activate it — the row highlights and a camera icon appears when the session is running. Click the same row again to stop.

Use the Refresh button if a newly connected camera does not appear in the list.

A live camera Preview appears below the controls once a camera is active. The preview shows a real-time overlay of detected landmarks:

Face — bounding box + centre dot; eye and mouth dots when Landmarks are enabled
Left hand — emerald lines from wrist to each fingertip, curl-sized dots at tips, yellow line between thumb and index when pinching
Right hand — same layout in purple
Body — orange skeleton connecting neck, shoulders, elbows, wrists, hips, knees, and ankles

Click Hide / Show next to the Preview header to collapse the preview while keeping tracking active.

Track mode

Under Track Source, pick any track in the project from the dropdown. Vision then analyses the frames that track renders instead of a camera feed. Use this to react to your own visuals: for example, drive a shader parameter from the position of a face or hand that appears in another track.

The camera preview is hidden in Track mode (there is no separate camera feed).

Active Detectors

Enable only the detectors you need. Each enabled group adds processing cost:

Toggle	What it enables	Default
Face	Face position, size, head angles, presence	On
+ Landmarks	Eye-open and mouth-open values (requires Face)	Off
Hand	Both left and right hand — wrist, index tip, pinch, presence	Off
Body	Body centre, both wrists, presence	Off

Smoothing

The Smoothing slider (0–95%) applies an exponential moving average to every output on each analysis frame.

0% (Off) — raw values, may jitter
60% — the default; a good balance for most live use
Higher values — more inertia, slower response
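
The effect of the slider can be pictured with a minimal sketch of one exponential-moving-average step. The function name `smooth` and the reading of the percentage as the weight kept from the previous output are assumptions made for illustration; Arkestra's internal formula may differ.

```python
def smooth(previous: float, raw: float, smoothing: float) -> float:
    """One exponential-moving-average step.

    smoothing is the slider value as a fraction (0.0 to 0.95).
    Assumption: the slider is the weight kept from the previous output.
    """
    if not 0.0 <= smoothing <= 0.95:
        raise ValueError("smoothing must be between 0.0 and 0.95")
    return smoothing * previous + (1.0 - smoothing) * raw


# A raw value that jumps from 0 to 1, smoothed at the default 60%.
value = 0.0
history = []
for _ in range(5):
    value = smooth(value, 1.0, 0.60)
    history.append(round(value, 4))
print(history)
```

Under this assumption, 60% smoothing closes about 92% of a step change within five analysis frames, which is roughly a third of a second at ~15 fps.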

Flip X

By default, all X values are mirrored so that X = 0 is the left side of the screen in both Camera and Track modes. This is consistent with Vision’s natural coordinate system and with how shaders expect X, and it makes the output follow the performer’s movements like a mirror.

Enable Flip X to disable the mirror. Use this for rear-facing cameras, external cameras already mounted correctly, or when the natural camera X matches your visual setup.
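
As a rough sketch of what the toggle does to a normalised X value, assuming the mirror is a simple reflection about the centre of the frame (the function name `apply_flip_x` is hypothetical):

```python
def apply_flip_x(raw_x: float, flip_x: bool = False) -> float:
    """Return the X value delivered to mappings.

    Assumption: the default mirror is 1 - x; enabling Flip X passes
    the unmirrored value through unchanged.
    """
    return raw_x if flip_x else 1.0 - raw_x


print(apply_flip_x(0.25))               # default: mirrored
print(apply_flip_x(0.25, flip_x=True))  # Flip X on: mirror disabled
```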

Live Values

The Live Values section shows a real-time readout of every active channel as a labelled bar chart. Active groups glow when a subject is detected. Use this to understand and tune your mappings without needing to open the parameter editor.

Mapping Vision to a parameter

Select a track or effect and open the Parameter Inspector.
Click the source pill on any parameter and choose Vision.
A three-row picker appears for the mapping type:

Row	Parent type	Extended sub-type
Face	Face (position, size, angles)	+ Landmarks (eye open, mouth open)
Hand	L. Hand (wrist, index, pinch)	L. Fingers (all tips + per-finger curl)
Hand	R. Hand	R. Fingers
Body	Body (centre, both wrists)	Skeleton (full joint positions)

Select the sub-type row, then pick a specific landmark from the grid below.
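
Conceptually, a mapping scales the 0–1 Vision value into the parameter's own range. The sketch below illustrates that idea only; `map_to_range` and the example range are hypothetical and not part of Arkestra.

```python
def map_to_range(value: float, low: float, high: float) -> float:
    """Scale a normalised 0-1 value linearly into [low, high]."""
    value = min(1.0, max(0.0, value))  # guard against out-of-range input
    return low + value * (high - low)


# Example: Face X at the centre of the frame driving a parameter
# whose range is 200 to 2000 (an invented range for illustration).
print(map_to_range(0.5, 200.0, 2000.0))
```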

Available channels

All values are 0–1. Coordinate origin: X=0 is left, X=1 is right, Y=0 is bottom, Y=1 is top.
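
Because Y = 0 is the bottom of the frame, the Y axis has to be inverted when a value is placed on an image whose origin is the top-left corner. A minimal sketch, with a hypothetical helper name and frame size:

```python
def to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Convert normalised coordinates (origin bottom-left) to pixel
    coordinates (origin top-left) for a frame of the given size."""
    px = round(x * (width - 1))
    py = round((1.0 - y) * (height - 1))  # flip Y: 1.0 is the top row
    return px, py


print(to_pixels(0.0, 0.0, 1920, 1080))  # bottom-left corner
print(to_pixels(1.0, 1.0, 1920, 1080))  # top-right corner
```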

Face

Channel	Label	Notes
X	Horizontal position	0 = left, 1 = right
Y	Vertical position	0 = bottom, 1 = top
Size	Bounding box size, √(width × height)	Larger when closer to camera
Roll	Head tilt left/right	0.5 = upright
Yaw	Head turn left/right	0.5 = facing forward
Pitch	Head tilt up/down	0.5 = level
Presence	Confidence of detection	0 = not detected, 1 = detected; slow decay on loss
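
Two of these channels are easier to reason about with a small sketch. It assumes that Size is computed from the normalised width and height of the bounding box, and it shows one way to recentre an angle channel around zero. Both function names are hypothetical, and which direction maps above or below 0.5 is not specified here.

```python
import math


def face_size(width: float, height: float) -> float:
    """Square root of width * height for a normalised bounding box."""
    return math.sqrt(width * height)


def signed_angle(channel: float) -> float:
    """Recentre a 0-1 angle channel so 0.5 becomes 0 and the range is -1 to 1."""
    return (channel - 0.5) * 2.0


print(face_size(0.25, 0.25))  # a square box a quarter of the frame wide
print(signed_angle(0.5))      # upright, facing forward, or level
```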

Face Landmarks (requires + Landmarks)

Channel	Label	Notes
Eye L	Left eye openness	0 = closed, 1 = wide open
Eye R	Right eye openness	Same scale as Eye L
Mouth	Mouth openness	0 = closed, 1 = open

L. Hand / R. Hand

Channel	Label
Wrist X / Y	Wrist position
Index X / Y	Index fingertip position
Pinch	Distance between thumb tip and index tip — 0 = pinching, 1 = open
Presence	Detection confidence
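
The Pinch channel can be pictured as a clamped distance between two fingertip positions. The sketch below is an illustration only: the normalising constant `open_distance` is an invented assumption, and the way Arkestra actually scales the distance is not documented here.

```python
import math


def pinch_value(thumb: tuple[float, float],
                index: tuple[float, float],
                open_distance: float = 0.25) -> float:
    """Distance between thumb tip and index tip, scaled to 0-1.

    open_distance (hypothetical) is the distance treated as fully open.
    """
    if open_distance <= 0.0:
        raise ValueError("open_distance must be positive")
    distance = math.hypot(thumb[0] - index[0], thumb[1] - index[1])
    return min(1.0, distance / open_distance)


print(pinch_value((0.5, 0.5), (0.5, 0.5)))  # tips touching: pinching
print(pinch_value((0.0, 0.0), (0.3, 0.4)))  # tips far apart: open
```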

L. Fingers / R. Fingers (extended)

Channel	Label
Mid X / Y	Middle fingertip
Ring X / Y	Ring fingertip
Little X / Y	Pinky tip
Thumb Curl	0 = extended, 1 = fully curled
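Index Curl	Same scale as Thumb Curl
Mid Curl	Same scale as Thumb Curl
Ring Curl	Same scale as Thumb Curl
Little Curl	Same scale as Thumb Curl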

Body

Channel	Label
Center X / Y	Hip midpoint
L.Wrist X / Y	Left wrist
R.Wrist X / Y	Right wrist
Presence	Detection confidence

Skeleton (extended body)

Full joint set: Neck, Left/Right Shoulder, Left/Right Elbow, Left/Right Knee, Left/Right Ankle — each as X/Y pairs.

Tips

Start with Face only — it is the cheapest detector and covers most interactive use cases.
Use Presence as a modulator to drive a parameter to zero when no subject is in frame, avoiding frozen values.
Pinch (0 = pinching, 1 = open) can gate effects with hand gestures. Invert it in the mapping if you want a pinch to drive a value up.
In Track mode, try pointing Vision at a feedback loop or particle track to create visual feedback that reacts to its own motion.
The Live Values section is the fastest way to verify a mapping before performing.
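
The Presence and Pinch tips above amount to two small pieces of arithmetic, sketched here with hypothetical helper names. In Arkestra these are set up in the mapping rather than written as code.

```python
def gate_with_presence(value: float, presence: float) -> float:
    """Scale a value by Presence so it falls to zero when no subject is detected."""
    return value * presence


def invert(value: float) -> float:
    """Flip a 0-1 value, for example so that a pinch drives a parameter up."""
    return 1.0 - value


print(gate_with_presence(0.8, 1.0))  # subject in frame: value passes through
print(gate_with_presence(0.8, 0.0))  # subject lost: value is zero
print(invert(0.0))                   # Pinch = 0 (pinching) becomes 1
```

Since Presence decays slowly when the subject is lost, a value gated this way fades out instead of snapping to zero.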