Vision Tracking

Arkestra uses Apple’s Vision framework to detect faces, hands, and body pose from a live camera feed or any track’s rendered output. All detected values are normalised 0–1 floats that can be mapped to any parameter exactly like an LFO or audio signal.

Detection runs on the Neural Engine at ~15 fps with negligible CPU and GPU overhead.

To open the panel, click the Vision Tracking icon (figure silhouette) in the right sidebar. The panel is titled Vision Tracking.

At the top of the panel, choose where Vision reads its frames from:

Mode | What it analyses
Camera | A connected camera or webcam
Track | The rendered output of a track in the current project

Click Camera or Track to switch. The choice is saved per project.

Under Camera Source, each available camera appears as a row. Click a row to activate it — the row highlights and a camera icon appears when the session is running. Click the same row again to stop.

Use the Refresh button if a newly connected camera does not appear in the list.

A live camera Preview appears below the controls once a camera is active. The preview shows a real-time overlay of detected landmarks:

  • Face: bounding box and centre dot; eye and mouth dots when Landmarks are enabled
  • Left hand: emerald lines from the wrist to each fingertip, dots at the tips sized by finger curl, and a yellow line between thumb and index when pinching
  • Right hand: same layout in purple
  • Body: orange skeleton connecting neck, shoulders, elbows, wrists, hips, knees, and ankles

Click Hide / Show next to the Preview header to collapse the preview while keeping tracking active.

Under Track Source, pick any track in the project from the dropdown. Vision analyses each frame that track renders, instead of a camera. Use this to react to your own visuals — for example, drive a shader parameter from the brightness or motion of another track.

The camera preview is hidden in Track mode (there is no separate camera feed).

Enable only the detectors you need. Each enabled group adds processing cost:

Toggle | What it enables | Default
Face | Face position, size, head angles, presence | On
+ Landmarks | Eye-open and mouth-open values (requires Face) | Off
Hand | Both left and right hand: wrist, index tip, pinch, presence | Off
Body | Body centre, both wrists, presence | Off

The Smoothing slider (0–95%) applies an exponential moving average to all outputs each analysis frame.

  • 0% (Off) — raw values, may jitter
  • 60% — the default; a good balance for most live use
  • Higher values — more inertia, slower response
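The slider's effect can be pictured with a small exponential moving average. This is a sketch rather than Arkestra's actual implementation: it assumes the slider percentage is used directly as the weight kept from the previous output, and `Smoother` is a hypothetical name.

```python
class Smoother:
    """Exponential moving average over a stream of 0-1 values.

    Assumption: `smoothing` (0.0-0.95) is the weight kept from the
    previous output. The app's real slider-to-weight mapping may differ.
    """

    def __init__(self, smoothing: float = 0.60) -> None:
        if not 0.0 <= smoothing <= 0.95:
            raise ValueError("smoothing must be between 0.0 and 0.95")
        self.smoothing = smoothing
        self.value = None  # no output until the first sample arrives

    def update(self, sample: float) -> float:
        """Feed one analysis frame and return the smoothed value."""
        if self.value is None:
            self.value = sample  # start from the first raw sample
        else:
            a = self.smoothing
            self.value = a * self.value + (1.0 - a) * sample
        return self.value


# A step from 0 to 1 is approached gradually at 60% smoothing.
raw = [0.0, 1.0, 1.0, 1.0]
smoother = Smoother(0.60)
smoothed = [smoother.update(v) for v in raw]
```

With the weight at 0 the output equals the input, which matches the "Off" position; as the weight approaches 0.95 each new frame contributes less, which is the added inertia described above.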

By default, all X values are mirrored so that X = 0 is the left side of the screen in both Camera and Track modes. This is consistent with Vision’s natural coordinate system and with how shaders expect X. The result mirrors the performer’s movements as seen by the camera.

Enable Flip X to disable the mirror. Use this for rear-facing cameras, external cameras already mounted correctly, or when the natural camera X matches your visual setup.
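In normalised coordinates, mirroring is a reflection around the centre of the frame. The helper below only illustrates that arithmetic; `flip_x` is a hypothetical name and does not describe how Arkestra applies the flip internally.

```python
def flip_x(x: float) -> float:
    """Reflect a normalised X value (0-1) around the frame centre."""
    return 1.0 - x


left_edge = flip_x(0.0)   # the left edge maps to the right edge
centre = flip_x(0.5)      # the centre stays where it is
```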

The Live Values section shows a real-time readout of every active channel as a labelled bar chart. Active groups glow when a subject is detected. Use this to understand and tune your mappings without needing to open the parameter editor.


  1. Select a track or effect and open the Parameter Inspector.
  2. Click the source pill on any parameter and choose Vision.
  3. A three-row picker appears for the mapping type:
Row | Parent type | Extended sub-type
Face | Face (position, size, angles) | + Landmarks (eye open, mouth open)
Hand | L. Hand (wrist, index, pinch) | L. Fingers (all tips + per-finger curl)
Hand | R. Hand | R. Fingers
Body | Body (centre, both wrists) | Skeleton (full joint positions)

Select the sub-type row, then pick a specific landmark from the grid below.
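Once a landmark is mapped, its 0–1 value has to be scaled into the parameter's own range. The sketch below shows the usual linear scaling; `map_to_range`, its arguments, and the example range are hypothetical and are not taken from Arkestra's parameter editor.

```python
def map_to_range(value: float, lo: float, hi: float, invert: bool = False) -> float:
    """Scale a normalised 0-1 value linearly into the range lo..hi.

    Out-of-range input is clamped first; `invert` swaps the direction.
    """
    v = min(max(value, 0.0), 1.0)
    if invert:
        v = 1.0 - v
    return lo + v * (hi - lo)


# Example: a Face X value of 0.25 driving a parameter that spans 200-2000.
scaled = map_to_range(0.25, 200.0, 2000.0)
```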


All values are 0–1. Coordinate origin: X=0 is left, X=1 is right, Y=0 is bottom, Y=1 is top.
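Because Y = 0 is the bottom of the frame, converting a position to image-style pixel coordinates (origin at the top-left) needs a vertical flip. A minimal sketch, where `to_pixels` is a hypothetical helper and the frame size is an example value:

```python
def to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Convert normalised (x, y) with a bottom-left origin into pixel
    coordinates with a top-left origin."""
    px = x * (width - 1)
    py = (1.0 - y) * (height - 1)  # flip Y: 1 (top) becomes row 0
    return round(px), round(py)


top_left = to_pixels(0.0, 1.0, 1920, 1080)
bottom_right = to_pixels(1.0, 0.0, 1920, 1080)
```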

Face

Channel | Label | Notes
X | Horizontal position | 0 = left, 1 = right
Y | Vertical position | 0 = bottom, 1 = top
Size | Bounding box area (√(width × height)) | Larger when closer to camera
Roll | Head tilt left/right | 0.5 = upright
Yaw | Head turn left/right | 0.5 = facing forward
Pitch | Head tilt up/down | 0.5 = level
Presence | Confidence of detection | 0 = not detected, 1 = detected; slow decay on loss

+ Landmarks

Channel | Label | Notes
Eye L | Left eye openness | 0 = closed, 1 = wide open
Eye R | Right eye openness |
Mouth | Mouth openness | 0 = closed, 1 = open

L. Hand / R. Hand

Channel | Label
Wrist X / Y | Wrist position
Index X / Y | Index fingertip position
Pinch | Distance between thumb tip and index tip; 0 = pinching, 1 = open
Presence | Detection confidence

L. Fingers / R. Fingers

Channel | Label
Mid X / Y | Middle fingertip
Ring X / Y | Ring fingertip
Little X / Y | Pinky tip
Thumb Curl | 0 = extended, 1 = fully curled
Index Curl |
Mid Curl |
Ring Curl |
Little Curl |

Body

Channel | Label
Center X / Y | Hip midpoint
L.Wrist X / Y | Left wrist
R.Wrist X / Y | Right wrist
Presence | Detection confidence

The Skeleton sub-type exposes the full joint set: Neck, Left/Right Shoulder, Left/Right Elbow, Left/Right Knee, Left/Right Ankle, each as an X/Y pair.
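Two of the Face channels are easier to reason about with a little arithmetic. Roll, Yaw, and Pitch rest at 0.5, so re-centring them gives a signed value around zero, and Size appears to be the square root of the bounding-box width times height. Both helpers below are hypothetical sketches. The reading of the Size formula as √(width × height) is an assumption, and no angle range in degrees is given, so the result stays normalised.

```python
import math


def centred_to_signed(value: float) -> float:
    """Map a 0-1 channel whose neutral point is 0.5 onto -1..1."""
    return (value - 0.5) * 2.0


def size_from_box(width: float, height: float) -> float:
    """Square root of the normalised bounding-box area (assumed formula)."""
    return math.sqrt(width * height)


upright = centred_to_signed(0.5)       # neutral head position
face_size = size_from_box(0.25, 0.25)  # a square box a quarter-frame wide
```

A signed value is convenient when a head turn should push a parameter in either direction from its resting point.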


  • Start with Face only — it is the cheapest detector and covers most interactive use cases.
  • Use Presence as a modulator to drive a parameter to zero when no subject is in frame, avoiding frozen values.
  • Pinch (0 = pinching, 1 = open) can gate effects with hand gestures. Invert it in the mapping if you want a pinch to drive a value up.
  • In Track mode, try pointing Vision at a feedback loop or particle track to create visual feedback that reacts to its own motion.
  • The Live Values panel is the fastest way to verify a mapping before performing.
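The Presence and Pinch tips come down to two small pieces of arithmetic, sketched here under stated assumptions: `gated` and `pinch_strength` are hypothetical helpers, and in practice the same effect would be set up in the mapping rather than in code.

```python
def gated(value: float, presence: float, rest: float = 0.0) -> float:
    """Blend `value` toward `rest` as Presence falls from 1 to 0."""
    p = min(max(presence, 0.0), 1.0)
    return rest + p * (value - rest)


def pinch_strength(pinch: float) -> float:
    """Invert Pinch so that 1 = pinching and 0 = open."""
    return 1.0 - min(max(pinch, 0.0), 1.0)


no_subject = gated(0.8, presence=0.0)    # falls back to the rest value
with_subject = gated(0.8, presence=1.0)  # passes the value through
```

Because Presence decays slowly when the subject is lost, multiplying by it fades the parameter out instead of cutting it abruptly.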