Visual Understanding
Module Description
Perform visual-language understanding on images and videos: answer visual questions, describe scenes, extract structured output, and segment videos into shots and analyze each segment separately.
Module ID: visual_understanding
Supported media: image, video
Supported Models (model_key)
The module supports the following model_key values:
- `qwen2.5_vl_3b_instruct` - Qwen 2.5 VL 3B instruct (default)
- `qwen2.5_vl_7b_instruct` - Qwen 2.5 VL 7B instruct
- `qwen3_vl_2b_instruct` - Qwen 3 VL 2B instruct
- `qwen3_vl_2b_thinking` - Qwen 3 VL 2B thinking
- `qwen3_vl_4b_instruct` - Qwen 3 VL 4B instruct
- `qwen3_vl_4b_thinking` - Qwen 3 VL 4B thinking
- `qwen3_vl_8b_instruct` - Qwen 3 VL 8B instruct
- `qwen3_vl_8b_thinking` - Qwen 3 VL 8B thinking
- `smolvlm_instruct` - SmolVLM instruct
Notes:
- Larger models typically yield stronger results but may increase latency.
- “thinking” variants are optimized for deeper reasoning and may behave differently than “instruct” variants.
Module Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| prompt | string | "Describe the content" | The user prompt applied to the image/video (or to each shot if shot detection is enabled). |
| model_key | string | "qwen2.5_vl_3b_instruct" | Which VLM to use. See Supported Models above. |
| structured_output_schema | object | {} | JSON-schema-like object describing the structure you want back. If provided, the module attempts to return structured_response as JSON matching your schema. |
| temperature | number | 0.7 | Sampling temperature (response randomness). Higher = more creative, lower = more deterministic. |
| top_p | number | 0.95 | Nucleus sampling parameter controlling diversity. Higher = more creative, lower = more deterministic. |
| enable_image_or_video_metadata | boolean | false | If true, the module extracts basic image/video metadata and provides it to the model as extra context. |
| frame_sampling_method | string | "linear" | Frame sampling strategy per shot (video currently supports only linear). Supported: linear, random, motion_topk. |
| enable_shot_detection | boolean | false | If true, the module splits the video into shots and applies the prompt per shot. If false, the prompt is applied to the whole video. |
| fixed_shot_length_in_seconds | integer | 0 | If > 0, splits the video into fixed-length segments (in seconds) and analyzes each segment. Cannot be combined with enable_shot_detection=true. |
| shot_detection_method | string | "content" | Shot detection algorithm. Supported: content, adaptive, threshold, histogram, hash. |
| content_threshold | number | 28.0 | Threshold for the content detector (higher = fewer cuts). |
| adaptive_threshold | number | 3.0 | Threshold for the adaptive detector (higher = fewer cuts). |
| brightness_threshold | integer | 12 | Threshold for the threshold detector (brightness-based fade/cut detection; higher = fewer cuts). |
| histogram_threshold | number | 0.05 | Threshold for the histogram detector (higher = fewer cuts). |
| hash_threshold | number | 0.395 | Threshold for the hash detector (higher = fewer cuts). |
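For deterministic extraction tasks you will usually want to lower the sampling parameters from their defaults. A minimal module configuration sketch (the specific values and prompt are illustrative, not recommendations from this documentation):

```json
{
  "modules": {
    "visual_understanding": {
      "prompt": "Extract any readable text from the image.",
      "temperature": 0.1,
      "top_p": 0.9,
      "structured_output_schema": {
        "visible_text": "All clearly readable text in the scene, or an empty string if none."
      }
    }
  }
}
```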
Validation / Constraints
- fixed_shot_length_in_seconds must be >= 0.
- fixed_shot_length_in_seconds > 0 cannot be combined with enable_shot_detection=true.
- If structured_output_schema is set, results are returned primarily in structured_response (and response may be null).
- Frame sampling is subject to model-specific max frame limits (server-configured).
- When shot detection is enabled:
  - Each detected shot is processed independently.
  - Each shot produces a separate result segment.
- When fixed-length segmentation is enabled:
  - The video is split into equal time segments.
  - Each segment produces a separate result segment.
- Metadata extraction only occurs if enable_image_or_video_metadata = true.
- Extracted metadata is provided to the model as additional context only; it is not guaranteed to appear in the output.
- For full-video processing (shot detection disabled), frames are automatically subsampled to avoid excessive memory usage.
- If structured_output_schema is provided, the module attempts to return results in structured_response; response may be null.
- If the model fails to generate valid JSON matching the schema, the raw model output may be returned in response instead.
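The first two constraints can be checked client-side before submitting a job. A minimal sketch (the helper name is hypothetical, not part of any official client):

```python
def validate_visual_understanding(params: dict) -> None:
    """Client-side sanity checks mirroring the module's documented constraints.

    Hypothetical helper, not part of the API; raises ValueError on violation.
    """
    fixed_len = params.get("fixed_shot_length_in_seconds", 0)
    if fixed_len < 0:
        raise ValueError("fixed_shot_length_in_seconds must be >= 0")
    # Fixed-length segmentation and shot detection are mutually exclusive.
    if fixed_len > 0 and params.get("enable_shot_detection", False):
        raise ValueError(
            "fixed_shot_length_in_seconds > 0 cannot be combined with "
            "enable_shot_detection=true"
        )
```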
Example
Send the following JSON as request body via POST to the /jobs/ endpoint:
{
"sources": [
"https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"
],
"modules": {
"visual_understanding": {
"prompt": "Describe the actions happening in this video scene.",
"model_key": "qwen2.5_vl_7b_instruct"
}
}
}
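The request body above can be assembled programmatically before POSTing it to /jobs/. A sketch, assuming only the request shape shown in this document (the helper name is hypothetical, and authentication/base URL depend on your deployment):

```python
import json

def build_job_request(source_url: str, prompt: str,
                      model_key: str = "qwen2.5_vl_3b_instruct",
                      **module_params) -> str:
    """Assemble the JSON body for a POST to /jobs/ (hypothetical helper).

    Extra keyword arguments are passed through as visual_understanding
    module parameters, e.g. enable_shot_detection=True.
    """
    body = {
        "sources": [source_url],
        "modules": {
            "visual_understanding": {
                "prompt": prompt,
                "model_key": model_key,
                **module_params,
            }
        },
    }
    return json.dumps(body, indent=2)
```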
To retrieve the results, query the /jobs/{JOB_ID}/detailed-results/ endpoint. The response looks like this:
{
"data": [
{
"detections": [],
"frame_end": 300,
"frame_start": 0,
"id": "1d27611a-fc62-4e31-b6a3-cf1df4f3a9e9",
"media_type": "video",
"meta": {
"indexed_identity": null,
"prompt": "Describe the actions happening in this video scene.",
"response": "A tree grows out of a grassy mound with a hole in it.",
"structured_output_schema": {},
"structured_response": null
},
"module": "visual_understanding",
"source": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4",
"tc_end": "00:00:10:00",
"tc_start": "00:00:00:00",
"thumbnail": null,
"time_end": 10,
"time_start": 0
}
],
"limit": 100,
"next": null,
"offset": 0,
"prev": null,
"total": 1
}
Each detailed result element contains the response text inside the meta field.
With enable_shot_detection left at its default (false), the results consist of a single segment describing the whole video. To get one result per segment (shot), set enable_shot_detection = true.
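Iterating over the segments of a detailed-results payload can be sketched as follows, assuming only the response shape shown above (the function name is hypothetical):

```python
def segment_texts(detailed_results: dict) -> list:
    """Extract (time_start, time_end, text) for each result segment.

    Falls back to the structured response when plain `response` is null,
    as happens when structured_output_schema is set.
    """
    out = []
    for item in detailed_results.get("data", []):
        meta = item.get("meta", {})
        text = meta.get("response")
        if text is None:
            text = str(meta.get("structured_response"))
        out.append((item["time_start"], item["time_end"], text))
    return out
```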
Shot Detection Methods
| Method | Description |
|---|---|
| content | Detects shot changes using weighted average of pixel changes in the HSV colorspace. |
| adaptive | Performs rolling average on differences in HSV colorspace. In some cases, this can improve handling of fast motion. |
| threshold | Detects scene transitions based on brightness changes across frames. Useful for detecting fade-in and fade-out transitions. |
| histogram | Detects shot changes by comparing luminance (brightness) histograms between frames. More robust to noise, flashes, and illumination changes. |
| hash | Detects scene changes using perceptual image hashing between frames. Robust to compression artifacts, resizing, watermarks, logos, and encoding changes. |
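Selecting a detector pairs with its matching threshold parameter from the table above. For example, to use the histogram detector with a slightly higher threshold (fewer cuts; the value 0.08 is illustrative, not a recommended setting):

```json
{
  "modules": {
    "visual_understanding": {
      "enable_shot_detection": true,
      "shot_detection_method": "histogram",
      "histogram_threshold": 0.08,
      "prompt": "Describe this shot."
    }
  }
}
```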
Structured Output
You can define a structured output schema to ensure the model returns results in a specific format. The schema should be a JSON object that mirrors the expected output structure. For example:
{
"sources": [
"storage://ubXZFeryA7zoF0N0hDgr"
],
"modules": {
"visual_understanding": {
"enable_shot_detection": true,
"model_key": "qwen2.5_vl_7b_instruct",
"prompt": "You are a scene analysis assistant. Analyze the given image and return a JSON object describing the scene. Populate each field precisely according to the given instructions in the JSON schema below. Use exact categories, avoid vague language, and ensure each entry follows the required format",
"structured_output_schema": {
"camera_setting": "Choose the best fitting camera shot from this list: [ 'Establishing Shot', 'Wide Shot', 'Full Shot', 'Cowboy Shot', 'Medium Shot', 'Medium Close Up Shot', 'Close Up / Extreme Close Up Shot', 'Two / Three / Group Shot', 'OTS / Over The Shoulder Shot', 'POV / Point Of View Shot', 'Dutch Angle Shot', 'Low Angle Shot / High Angle Shot' ]",
"daytime": "What time of day is shown? Choose one: [ 'Night', 'Day', 'Morning', 'Evening', 'Midday' ]",
"location": "Is the scene indoors or outdoors? Choose one: [ 'Interior', 'Exterior' ]",
"persons_appearing": "Are people clearly visible and central in the scene? Answer 'Yes' or 'No'.",
"scene_description": "Describe the scene in one complete sentence. Include key actions, visible people, setting, important objects, atmosphere, and inferred context (e.g., event type, disaster, location). Do not mention the viewer or video itself.",
"scene_tags": "List relevant, lowercase tags describing people, actions, objects, setting, mood, time of day, and inferred context. Include readable text if meaningful. No vague or redundant tags.",
"text_appearing": "Transcribe any clearly readable text in the scene. Leave blank if none.",
"weather": "Describe the weather condition or say 'interior' if indoors. Choose one: [ 'Sun', 'Cloudy', 'Storm', 'Fog', 'Snow', 'Rain', 'Mist', 'Sandstorm', 'Overcast', 'interior' ]"
}
}
}
}
Example response from the /jobs/{JOB_ID}/detailed-results/ endpoint:
{
"data": [
{
"detections": [],
"frame_end": 62,
"frame_start": 0,
"id": "5ea16764-35f9-4799-ade5-8facb10dbd3e",
"media_type": "video",
"meta": {
"indexed_identity": null,
"prompt": "You are a scene analysis assistant. Analyze the given image and return a JSON object describing the scene. Populate each field precisely according to the given instructions in the JSON schema below. Use exact categories, avoid vague language, and ensure each entry follows the required format",
"response": null,
"structured_output_schema": {
"camera_setting": "Choose the best fitting camera shot from this list: [ 'Establishing Shot', 'Wide Shot', 'Full Shot', 'Cowboy Shot', 'Medium Shot', 'Medium Close Up Shot', 'Close Up / Extreme Close Up Shot', 'Two / Three / Group Shot', 'OTS / Over The Shoulder Shot', 'POV / Point Of View Shot', 'Dutch Angle Shot', 'Low Angle Shot / High Angle Shot' ]",
"daytime": "What time of day is shown? Choose one: [ 'Night', 'Day', 'Morning', 'Evening', 'Midday' ]",
"location": "Is the scene indoors or outdoors? Choose one: [ 'Interior', 'Exterior' ]",
"persons_appearing": "Are people clearly visible and central in the scene? Answer 'Yes' or 'No'.",
"scene_description": "Describe the scene in one complete sentence. Include key actions, visible people, setting, important objects, atmosphere, and inferred context (e.g., event type, disaster, location). Do not mention the viewer or video itself.",
"scene_tags": "List 5\u201315 relevant, lowercase tags describing people, actions, objects, setting, mood, time of day, and inferred context. Include readable text if meaningful. No vague or redundant tags.",
"text_appearing": "Transcribe any clearly readable text in the scene. Leave blank if none.",
"weather": "Describe the weather condition or say 'interior' if indoors. Choose one: [ 'Sun', 'Cloudy', 'Storm', 'Fog', 'Snow', 'Rain', 'Mist', 'Sandstorm', 'Overcast', 'interior' ]"
},
"structured_response": {
"camera_setting": "Close Up / Extreme Close Up Shot",
"daytime": "Day",
"location": "Interior",
"persons_appearing": "No",
"scene_description": "A modern kitchen with an open refrigerator displaying neatly arranged food items, illuminated by overhead lights.",
"scene_tags": [
"kitchen",
"refrigerator",
"food storage",
"modern design",
"overhead lighting",
"organized",
"interior",
"daylight",
"clean",
"contemporary"
],
"text_appearing": "LIEBHERR",
"weather": "interior"
}
},
"module": "visual_understanding",
"source": "storage://ubXZFeryA7zoF0N0hDgr",
"tc_end": "00:00:02:12",
"tc_start": "00:00:00:00",
"thumbnail": null,
"time_end": 2.48,
"time_start": 0.0
},
...
]
}
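Because each segment carries its own structured_response, fields can be aggregated across shots. A sketch that counts scene tags over all segments, assuming only the response shape shown above (the helper name is hypothetical):

```python
from collections import Counter

def top_scene_tags(detailed_results: dict, n: int = 5) -> list:
    """Return the n most frequent scene_tags across all segments.

    Hypothetical helper; segments without a structured_response
    (e.g. failed JSON generation) are skipped.
    """
    counts = Counter()
    for item in detailed_results.get("data", []):
        sr = item.get("meta", {}).get("structured_response") or {}
        counts.update(sr.get("scene_tags", []))
    return [tag for tag, _ in counts.most_common(n)]
```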