Visual Understanding
Module Description
Perform visual-language understanding on images and videos: answer visual questions, describe scenes, extract structured output, and segment videos into shots and analyze each segment separately.
Module ID: visual_understanding
Supported media: image, video
Supported Models (model_key)
The module supports the following model_key values:
- `qwen2.5_vl_3b_instruct` - Qwen 2.5 VL 3B instruct (default)
- `qwen2.5_vl_7b_instruct` - Qwen 2.5 VL 7B instruct
- `qwen3_vl_2b_instruct` - Qwen 3 VL 2B instruct
- `qwen3_vl_2b_thinking` - Qwen 3 VL 2B thinking
- `qwen3_vl_4b_instruct` - Qwen 3 VL 4B instruct
- `qwen3_vl_4b_thinking` - Qwen 3 VL 4B thinking
- `qwen3_vl_8b_instruct` - Qwen 3 VL 8B instruct
- `qwen3_vl_8b_thinking` - Qwen 3 VL 8B thinking
- `smolvlm_instruct` - SmolVLM instruct
Notes:
- Larger models typically yield stronger results but may increase latency.
- “thinking” variants are optimized for deeper reasoning and may behave differently than “instruct” variants.
Module Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| prompt | string | "Describe the content" | The user prompt applied to the image/video (or to each shot if shot detection is enabled). |
| model_key | string | "qwen2.5_vl_3b_instruct" | Which VLM to use. See Supported Models above. |
| structured_output_schema | object | {} | JSON-schema-like object describing the structure you want back. If provided, the module attempts to return structured_response as JSON matching your schema. |
| temperature | number | 0.7 | Sampling temperature (response randomness). Higher = more creative, lower = more deterministic. |
| top_p | number | 0.95 | Nucleus sampling parameter controlling diversity. Higher = more creative, lower = more deterministic. |
| enable_image_or_video_metadata | boolean | false | If true, the module extracts basic image/video metadata and provides it to the model as extra context. |
| frame_sampling_method | string | "linear" | Frame sampling strategy per shot (video currently supports only linear). Supported: linear, random, motion_topk. |
| enable_shot_detection | boolean | false | If true, the module splits the video into shots and applies the prompt per shot. If false, the prompt is applied to the whole video. |
| fixed_shot_length_in_seconds | integer | 0 | If > 0, splits the video into fixed-length segments (in seconds) and analyzes each segment. Cannot be combined with enable_shot_detection=true. |
| shot_detection_method | string | "content" | Shot detection algorithm. Supported: content, adaptive, threshold, histogram, hash. |
| content_threshold | number | 28.0 | Threshold for the content detector (higher = fewer cuts). |
| adaptive_threshold | number | 3.0 | Threshold for the adaptive detector (higher = fewer cuts). |
| brightness_threshold | integer | 12 | Threshold for the threshold detector (brightness-based fade/cut detection; higher = fewer cuts). |
| histogram_threshold | number | 0.05 | Threshold for the histogram detector (higher = fewer cuts). |
| hash_threshold | number | 0.395 | Threshold for the hash detector (higher = fewer cuts). |
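For deterministic extraction tasks you will usually want to lower the sampling parameters from their defaults. A minimal module configuration sketch (the specific values and prompt are illustrative, not recommendations from this documentation):

```json
{
  "modules": {
    "visual_understanding": {
      "prompt": "Extract any readable text from the image.",
      "temperature": 0.1,
      "top_p": 0.9,
      "structured_output_schema": {
        "visible_text": "All clearly readable text in the scene, or an empty string if none."
      }
    }
  }
}
```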
Validation / Constraints
- fixed_shot_length_in_seconds must be >= 0.
- fixed_shot_length_in_seconds > 0 cannot be combined with enable_shot_detection=true.
- If structured_output_schema is set, results are returned primarily in structured_response (and response may be null).
- Frame sampling is subject to model-specific max frame limits (server-configured).
- When shot detection is enabled:
  - Each detected shot is processed independently.
  - Each shot produces a separate result segment.
- When fixed-length segmentation is enabled:
  - The video is split into equal time segments.
  - Each segment produces a separate result segment.
- Metadata extraction only occurs if enable_image_or_video_metadata = true.
- Extracted metadata is provided to the model as additional context only; it is not guaranteed to appear in the output.
- For full-video processing (shot detection disabled), frames are automatically subsampled to avoid excessive memory usage.
- If structured_output_schema is provided, the module attempts to return results in structured_response; response may be null.
- If the model fails to generate valid JSON matching the schema, the raw model output may be returned in response instead.
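The first two constraints can be checked client-side before submitting a job. A minimal sketch (the helper name is hypothetical, not part of any official client):

```python
def validate_visual_understanding(params: dict) -> None:
    """Client-side sanity checks mirroring the module's documented constraints.

    Hypothetical helper, not part of the API; raises ValueError on violation.
    """
    fixed_len = params.get("fixed_shot_length_in_seconds", 0)
    if fixed_len < 0:
        raise ValueError("fixed_shot_length_in_seconds must be >= 0")
    # Fixed-length segmentation and shot detection are mutually exclusive.
    if fixed_len > 0 and params.get("enable_shot_detection", False):
        raise ValueError(
            "fixed_shot_length_in_seconds > 0 cannot be combined with "
            "enable_shot_detection=true"
        )
```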
Example
Send the following JSON as request body via POST to the /jobs/ endpoint:
{
"sources": [
"https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"
],
"modules": {
"visual_understanding": {
"prompt": "Describe the actions happening in this video scene.",
"model_key": "qwen2.5_vl_7b_instruct"
}
}
}
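The request body above can be assembled programmatically before POSTing it to /jobs/. A sketch, assuming only the request shape shown in this document (the helper name is hypothetical, and authentication/base URL depend on your deployment):

```python
import json

def build_job_request(source_url: str, prompt: str,
                      model_key: str = "qwen2.5_vl_3b_instruct",
                      **module_params) -> str:
    """Assemble the JSON body for a POST to /jobs/ (hypothetical helper).

    Extra keyword arguments are passed through as visual_understanding
    module parameters, e.g. enable_shot_detection=True.
    """
    body = {
        "sources": [source_url],
        "modules": {
            "visual_understanding": {
                "prompt": prompt,
                "model_key": model_key,
                **module_params,
            }
        },
    }
    return json.dumps(body, indent=2)
```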
To retrieve the results, query the /jobs/{JOB_ID}/detailed-results/ endpoint. The response looks like this:
{
"data": [
{
"detections": [],
"frame_end": 300,
"frame_start": 0,
"id": "1d27611a-fc62-4e31-b6a3-cf1df4f3a9e9",
"media_type": "video",
"meta": {
"indexed_identity": null,
"prompt": "Describe the actions happening in this video scene.",
"response": "A tree grows out of a grassy mound with a hole in it.",
"structured_output_schema": {},
"structured_response": null
},
"module": "visual_understanding",
"source": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4",
"tc_end": "00:00:10:00",
"tc_start": "00:00:00:00",
"thumbnail": null,
"time_end": 10,
"time_start": 0
}
],
"limit": 100,
"next": null,
"offset": 0,
"prev": null,
"total": 1
}
Each detailed result element contains the response text inside the meta field.
With enable_shot_detection left at its default (false), the results consist of a single segment describing the whole video. To get one result per segment (shot), set enable_shot_detection = true.
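Iterating over the segments of a detailed-results payload can be sketched as follows, assuming only the response shape shown above (the function name is hypothetical):

```python
def segment_texts(detailed_results: dict) -> list:
    """Extract (time_start, time_end, text) for each result segment.

    Falls back to the structured response when plain `response` is null,
    as happens when structured_output_schema is set.
    """
    out = []
    for item in detailed_results.get("data", []):
        meta = item.get("meta", {})
        text = meta.get("response")
        if text is None:
            text = str(meta.get("structured_response"))
        out.append((item["time_start"], item["time_end"], text))
    return out
```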
Shot Detection Methods
| Method | Description |
|---|---|
| content | Detects shot changes using weighted average of pixel changes in the HSV colorspace. |
| adaptive | Performs rolling average on differences in HSV colorspace. In some cases, this can improve handling of fast motion. |
| threshold | Detects scene transitions based on brightness changes across frames. Useful for detecting fade-in and fade-out transitions. |
| histogram | Detects shot changes by comparing luminance (brightness) histograms between frames. More robust to noise, flashes, and illumination changes. |
| hash | Detects scene changes using perceptual image hashing between frames. Robust to compression artifacts, resizing, watermarks, logos, and encoding changes. |
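Selecting a detector pairs with its matching threshold parameter from the table above. For example, to use the histogram detector with a slightly higher threshold (fewer cuts; the value 0.08 is illustrative, not a recommended setting):

```json
{
  "modules": {
    "visual_understanding": {
      "enable_shot_detection": true,
      "shot_detection_method": "histogram",
      "histogram_threshold": 0.08,
      "prompt": "Describe this shot."
    }
  }
}
```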
Structured Output
You can define a structured output schema to ensure the model returns results in a specific format. The schema should be a JSON object that mirrors the expected output structure. For example:
{
"sources": [
"storage://ubXZFeryA7zoF0N0hDgr"
],
"modules": {
"visual_understanding": {
"enable_shot_detection": true,
"model_key": "qwen2.5_vl_7b_instruct",
"prompt": "You are a scene analysis assistant. Analyze the given image and return a JSON object describing the scene. Populate each field precisely according to the given instructions in the JSON schema below. Use exact categories, avoid vague language, and ensure each entry follows the required format",
"structured_output_schema": {
"camera_setting": "Choose the best fitting camera shot from this list: [ 'Establishing Shot', 'Wide Shot', 'Full Shot', 'Cowboy Shot', 'Medium Shot', 'Medium Close Up Shot', 'Close Up / Extreme Close Up Shot', 'Two / Three / Group Shot', 'OTS / Over The Shoulder Shot', 'POV / Point Of View Shot', 'Dutch Angle Shot', 'Low Angle Shot / High Angle Shot' ]",
"daytime": "What time of day is shown? Choose one: [ 'Night', 'Day', 'Morning', 'Evening', 'Midday' ]",
"location": "Is the scene indoors or outdoors? Choose one: [ 'Interior', 'Exterior' ]",
"persons_appearing": "Are people clearly visible and central in the scene? Answer 'Yes' or 'No'.",
"scene_description": "Describe the scene in one complete sentence. Include key actions, visible people, setting, important objects, atmosphere, and inferred context (e.g., event type, disaster, location). Do not mention the viewer or video itself.",
"scene_tags": "List relevant, lowercase tags describing people, actions, objects, setting, mood, time of day, and inferred context. Include readable text if meaningful. No vague or redundant tags.",
"text_appearing": "Transcribe any clearly readable text in the scene. Leave blank if none.",
"weather": "Describe the weather condition or say 'interior' if indoors. Choose one: [ 'Sun', 'Cloudy', 'Storm', 'Fog', 'Snow', 'Rain', 'Mist', 'Sandstorm', 'Overcast', 'interior' ]"
}
}
}
}
Example response from the /jobs/{JOB_ID}/detailed-results/ endpoint:
{
"data": [
{
"detections": [],
"frame_end": 62,
"frame_start": 0,
"id": "5ea16764-35f9-4799-ade5-8facb10dbd3e",
"media_type": "video",
"meta": {
"indexed_identity": null,
"prompt": "You are a scene analysis assistant. Analyze the given image and return a JSON object describing the scene. Populate each field precisely according to the given instructions in the JSON schema below. Use exact categories, avoid vague language, and ensure each entry follows the required format",
"response": null,
"structured_output_schema": {
"camera_setting": "Choose the best fitting camera shot from this list: [ 'Establishing Shot', 'Wide Shot', 'Full Shot', 'Cowboy Shot', 'Medium Shot', 'Medium Close Up Shot', 'Close Up / Extreme Close Up Shot', 'Two / Three / Group Shot', 'OTS / Over The Shoulder Shot', 'POV / Point Of View Shot', 'Dutch Angle Shot', 'Low Angle Shot / High Angle Shot' ]",
"daytime": "What time of day is shown? Choose one: [ 'Night', 'Day', 'Morning', 'Evening', 'Midday' ]",
"location": "Is the scene indoors or outdoors? Choose one: [ 'Interior', 'Exterior' ]",
"persons_appearing": "Are people clearly visible and central in the scene? Answer 'Yes' or 'No'.",
"scene_description": "Describe the scene in one complete sentence. Include key actions, visible people, setting, important objects, atmosphere, and inferred context (e.g., event type, disaster, location). Do not mention the viewer or video itself.",
"scene_tags": "List 5\u201315 relevant, lowercase tags describing people, actions, objects, setting, mood, time of day, and inferred context. Include readable text if meaningful. No vague or redundant tags.",
"text_appearing": "Transcribe any clearly readable text in the scene. Leave blank if none.",
"weather": "Describe the weather condition or say 'interior' if indoors. Choose one: [ 'Sun', 'Cloudy', 'Storm', 'Fog', 'Snow', 'Rain', 'Mist', 'Sandstorm', 'Overcast', 'interior' ]"
},
"structured_response": {
"camera_setting": "Close Up / Extreme Close Up Shot",
"daytime": "Day",
"location": "Interior",
"persons_appearing": "No",
"scene_description": "A modern kitchen with an open refrigerator displaying neatly arranged food items, illuminated by overhead lights.",
"scene_tags": [
"kitchen",
"refrigerator",
"food storage",
"modern design",
"overhead lighting",
"organized",
"interior",
"daylight",
"clean",
"contemporary"
],
"text_appearing": "LIEBHERR",
"weather": "interior"
}
},
"module": "visual_understanding",
"source": "storage://ubXZFeryA7zoF0N0hDgr",
"tc_end": "00:00:02:12",
"tc_start": "00:00:00:00",
"thumbnail": null,
"time_end": 2.48,
"time_start": 0.0
},
...
]
}
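Because each segment carries its own structured_response, fields can be aggregated across shots. A sketch that counts scene tags over all segments, assuming only the response shape shown above (the helper name is hypothetical):

```python
from collections import Counter

def top_scene_tags(detailed_results: dict, n: int = 5) -> list:
    """Return the n most frequent scene_tags across all segments.

    Hypothetical helper; segments without a structured_response
    (e.g. failed JSON generation) are skipped.
    """
    counts = Counter()
    for item in detailed_results.get("data", []):
        sr = item.get("meta", {}).get("structured_response") or {}
        counts.update(sr.get("scene_tags", []))
    return [tag for tag, _ in counts.most_common(n)]
```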