Visual Understanding

Module Description

Perform visual-language understanding on images and videos: answer visual questions, describe scenes, extract structured output, and segment videos into shots and analyze each segment separately.

Module ID: visual_understanding

Supported media: image, video


Supported Models (model_key)

The module supports the following model_key values:

  • qwen2.5_vl_3b_instruct - Qwen 2.5 VL 3B instruct (default)
  • qwen2.5_vl_7b_instruct - Qwen 2.5 VL 7B instruct
  • qwen3_vl_2b_instruct - Qwen 3 VL 2B instruct
  • qwen3_vl_2b_thinking - Qwen 3 VL 2B thinking
  • qwen3_vl_4b_instruct - Qwen 3 VL 4B instruct
  • qwen3_vl_4b_thinking - Qwen 3 VL 4B thinking
  • qwen3_vl_8b_instruct - Qwen 3 VL 8B instruct
  • qwen3_vl_8b_thinking - Qwen 3 VL 8B thinking
  • smolvlm_instruct - SmolVLM instruct


Notes:

  • Larger models typically yield stronger results but may increase latency.
  • “thinking” variants are optimized for deeper reasoning and may behave differently than “instruct” variants.


Module Parameters

  • prompt (string, default "Describe the content") - The user prompt to apply to the image/video (or to each shot if shot detection is enabled).
  • model_key (string, default "qwen2.5_vl_3b_instruct") - Which VLM to use. See Supported Models above.
  • structured_output_schema (object, default {}) - JSON-schema-like object describing the structure you want back. If provided, the module attempts to return structured_response as JSON matching your schema.
  • temperature (number, default 0.7) - Sampling temperature (response randomness). Higher = more creative, lower = more deterministic.
  • top_p (number, default 0.95) - Nucleus sampling parameter controlling diversity. Higher = more creative, lower = more deterministic.
  • enable_image_or_video_metadata (boolean, default false) - If true, the module extracts basic image/video metadata and provides it to the model as extra context.
  • frame_sampling_method (string, default "linear") - Frame sampling strategy applied per shot; whole-video processing supports only linear. Supported: linear, random, motion_topk.
  • enable_shot_detection (boolean, default false) - If true, the module splits the video into shots and applies the prompt per shot. If false, the prompt is applied to the whole video.
  • fixed_shot_length_in_seconds (integer, default 0) - If > 0, splits the video into fixed-length segments (in seconds) and analyzes each segment. Cannot be combined with enable_shot_detection=true.
  • shot_detection_method (string, default "content") - Shot detection algorithm. Supported: content, adaptive, threshold, histogram, hash.
  • content_threshold (number, default 28.0) - Threshold for the content detector (higher = fewer cuts).
  • adaptive_threshold (number, default 3.0) - Threshold for the adaptive detector (higher = fewer cuts).
  • brightness_threshold (integer, default 12) - Threshold for the threshold detector (brightness-based fade/cut detection; higher = fewer cuts).
  • histogram_threshold (number, default 0.05) - Threshold for the histogram detector (higher = fewer cuts).
  • hash_threshold (number, default 0.395) - Threshold for the hash detector (higher = fewer cuts).

Validation / Constraints

  • fixed_shot_length_in_seconds must be >= 0.
  • fixed_shot_length_in_seconds > 0 cannot be combined with enable_shot_detection=true.
  • If structured_output_schema is set, results are returned primarily in structured_response (and response may be null).
  • Frame sampling is subject to model-specific max frame limits (server-configured).
  • When shot-detection is enabled:
    • Each detected shot is processed independently.
    • Each shot produces a separate result segment.
  • When fixed-length segmentation is enabled:
    • The video is split into equal time segments.
    • Each segment produces a separate result segment.
  • Metadata extraction only occurs if:
    • enable_image_or_video_metadata = true
  • Extracted metadata is provided to the model as additional context only:
    • It does not guarantee the metadata will appear in the output.
  • For full-video processing (shot-detection disabled):
    • Frames are automatically subsampled to avoid excessive memory usage.
  • If structured_output_schema is provided:
    • The module will attempt to return results in structured_response.
    • response may be null.
  • If the model fails to generate valid JSON matching the schema:
    • The raw model output may be returned in response instead.

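The constraints above can be sketched as a client-side pre-flight check. The helper below is hypothetical (not part of the API; the server performs the authoritative validation):

```python
# Client-side sketch of the validation rules above. Hypothetical helper,
# not part of the API -- the server performs the authoritative checks.

ALLOWED_SAMPLING = {"linear", "random", "motion_topk"}


def validate_params(params: dict) -> list[str]:
    """Return a list of constraint violations for a visual_understanding config."""
    errors = []

    # fixed_shot_length_in_seconds must be >= 0
    fixed_len = params.get("fixed_shot_length_in_seconds", 0)
    if fixed_len < 0:
        errors.append("fixed_shot_length_in_seconds must be >= 0")

    # fixed-length segmentation and shot detection are mutually exclusive
    if fixed_len > 0 and params.get("enable_shot_detection", False):
        errors.append(
            "fixed_shot_length_in_seconds > 0 cannot be combined with "
            "enable_shot_detection=true"
        )

    # frame_sampling_method must be one of the supported strategies
    method = params.get("frame_sampling_method", "linear")
    if method not in ALLOWED_SAMPLING:
        errors.append("unsupported frame_sampling_method: " + method)

    return errors
```

For example, `validate_params({"enable_shot_detection": True, "fixed_shot_length_in_seconds": 5})` reports the mutual-exclusion violation, while an empty config (all defaults) passes.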
Example

Send the following JSON as request body via POST to the /jobs/ endpoint:

{
  "sources": [
    "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"
  ],
  "modules": {
    "visual_understanding": {
      "prompt": "Describe the actions happening in this video scene.",
      "model_key": "qwen2.5_vl_7b_instruct"
    }
  }
}
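Submitting the request above can be sketched with the Python standard library; the base URL and any authentication are assumptions to adapt to your deployment:

```python
# Sketch of submitting the job above via POST /jobs/ using only the stdlib.
# BASE_URL is hypothetical -- substitute your deployment's endpoint, and add
# whatever auth headers your installation requires.
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical


def build_job(source_url: str, prompt: str, model_key: str) -> dict:
    """Assemble the request body shown above."""
    return {
        "sources": [source_url],
        "modules": {
            "visual_understanding": {
                "prompt": prompt,
                "model_key": model_key,
            }
        },
    }


def submit_job(body: dict) -> dict:
    """POST the job to /jobs/ and return the parsed JSON response."""
    req = urllib.request.Request(
        BASE_URL + "/jobs/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```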

To retrieve the results, request the /jobs/{JOB_ID}/detailed-results/ endpoint. The response looks like this:

{
  "data": [
    {
      "detections": [],
      "frame_end": 300,
      "frame_start": 0,
      "id": "1d27611a-fc62-4e31-b6a3-cf1df4f3a9e9",
      "media_type": "video",
      "meta": {
        "indexed_identity": null,
        "prompt": "Describe the actions happening in this video scene.",
        "response": "A tree grows out of a grassy mound with a hole in it.",
        "structured_output_schema": {},
        "structured_response": null
      },
      "module": "visual_understanding",
      "source": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4",
      "tc_end": "00:00:10:00",
      "tc_start": "00:00:00:00",
      "thumbnail": null,
      "time_end": 10,
      "time_start": 0
    }
  ],
  "limit": 100,
  "next": null,
  "offset": 0,
  "prev": null,
  "total": 1
}

Each detailed result element contains the response text inside the meta field.

With enable_shot_detection disabled (the default), the results consist of one segment describing the whole video. To get results per segment (shot), set enable_shot_detection = true.
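Pulling the per-segment text out of a detailed-results payload is a matter of walking data and reading meta, as in this sketch:

```python
# Sketch: extract (time_start, time_end, response) per segment from a
# /jobs/{JOB_ID}/detailed-results/ payload with the structure shown above.

def segment_responses(payload: dict) -> list[tuple[float, float, str]]:
    """Return (time_start, time_end, response text) for each result segment."""
    out = []
    for item in payload.get("data", []):
        meta = item.get("meta", {})
        # response may be null when a structured_output_schema was supplied
        text = meta.get("response") or ""
        out.append((item["time_start"], item["time_end"], text))
    return out
```

With shot detection enabled, the same loop yields one tuple per shot instead of one for the whole video.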

Shot Detection Methods

  • content - Detects shot changes using a weighted average of pixel changes in the HSV colorspace.
  • adaptive - Performs a rolling average on differences in the HSV colorspace. In some cases, this can improve handling of fast motion.
  • threshold - Detects scene transitions based on brightness changes across frames. Useful for detecting fade-in and fade-out transitions.
  • histogram - Detects shot changes by comparing luminance (brightness) histograms between frames. More robust to noise, flashes, and illumination changes.
  • hash - Detects scene changes using perceptual image hashing between frames. Robust to compression artifacts, resizing, watermarks, logos, and encoding changes.
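All of these detectors share one mechanism: a per-frame change score compared against a threshold, which is why raising any *_threshold parameter yields fewer cuts. The toy sketch below illustrates that relationship on synthetic scores; it is not the server's actual implementation:

```python
# Conceptual sketch: mark a cut wherever the per-frame change score exceeds
# the threshold. Illustrates why higher thresholds produce fewer cuts; the
# real detectors compute their scores differently (HSV diffs, histograms,
# perceptual hashes, ...).

def detect_cuts(frame_scores: list[float], threshold: float) -> list[int]:
    """Return indices of frames whose change score crosses the threshold."""
    return [i for i, score in enumerate(frame_scores) if score > threshold]


scores = [2.0, 3.1, 45.0, 2.5, 1.9, 30.0, 2.2]
detect_cuts(scores, threshold=28.0)  # cuts at frames 2 and 5
detect_cuts(scores, threshold=40.0)  # higher threshold: only frame 2
```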

Structured Output

You can define a structured output schema to ensure the model returns results in a specific format. The schema should be a JSON object that mirrors the expected output structure. For example:

{
  "sources": [
    "storage://ubXZFeryA7zoF0N0hDgr"
  ],
  "modules": {
    "visual_understanding": {
      "enable_shot_detection": true,
      "model_key": "qwen2.5_vl_7b_instruct",
      "prompt": "You are a scene analysis assistant. Analyze the given image and return a JSON object describing the scene. Populate each field precisely according to the given instructions in the JSON schema below. Use exact categories, avoid vague language, and ensure each entry follows the required format",
      "structured_output_schema": {
        "camera_setting": "Choose the best fitting camera shot from this list: [ 'Establishing Shot', 'Wide Shot', 'Full Shot', 'Cowboy Shot', 'Medium Shot', 'Medium Close Up Shot', 'Close Up / Extreme Close Up Shot', 'Two / Three / Group Shot', 'OTS / Over The Shoulder Shot', 'POV / Point Of View Shot', 'Dutch Angle Shot', 'Low Angle Shot / High Angle Shot' ]",
        "daytime": "What time of day is shown? Choose one: [ 'Night', 'Day', 'Morning', 'Evening', 'Midday' ]",
        "location": "Is the scene indoors or outdoors? Choose one: [ 'Interior', 'Exterior' ]",
        "persons_appearing": "Are people clearly visible and central in the scene? Answer 'Yes' or 'No'.",
        "scene_description": "Describe the scene in one complete sentence. Include key actions, visible people, setting, important objects, atmosphere, and inferred context (e.g., event type, disaster, location). Do not mention the viewer or video itself.",
        "scene_tags": "List relevant, lowercase tags describing people, actions, objects, setting, mood, time of day, and inferred context. Include readable text if meaningful. No vague or redundant tags.",
        "text_appearing": "Transcribe any clearly readable text in the scene. Leave blank if none.",
        "weather": "Describe the weather condition or say 'interior' if indoors. Choose one: [ 'Sun', 'Cloudy', 'Storm', 'Fog', 'Snow', 'Rain', 'Mist', 'Sandstorm', 'Overcast', 'interior' ]"
      }
    }
  }
}

Example response from the /jobs/{JOB_ID}/detailed-results/ endpoint:

{
    "data": [
        {
            "detections": [],
            "frame_end": 62,
            "frame_start": 0,
            "id": "5ea16764-35f9-4799-ade5-8facb10dbd3e",
            "media_type": "video",
            "meta": {
                "indexed_identity": null,
                "prompt": "You are a scene analysis assistant. Analyze the given image and return a JSON object describing the scene. Populate each field precisely according to the given instructions in the JSON schema below. Use exact categories, avoid vague language, and ensure each entry follows the required format",
                "response": null,
                "structured_output_schema": {
                    "camera_setting": "Choose the best fitting camera shot from this list: [ 'Establishing Shot', 'Wide Shot', 'Full Shot', 'Cowboy Shot', 'Medium Shot', 'Medium Close Up Shot', 'Close Up / Extreme Close Up Shot', 'Two / Three / Group Shot', 'OTS / Over The Shoulder Shot', 'POV / Point Of View Shot', 'Dutch Angle Shot', 'Low Angle Shot / High Angle Shot' ]",
                    "daytime": "What time of day is shown? Choose one: [ 'Night', 'Day', 'Morning', 'Evening', 'Midday' ]",
                    "location": "Is the scene indoors or outdoors? Choose one: [ 'Interior', 'Exterior' ]",
                    "persons_appearing": "Are people clearly visible and central in the scene? Answer 'Yes' or 'No'.",
                    "scene_description": "Describe the scene in one complete sentence. Include key actions, visible people, setting, important objects, atmosphere, and inferred context (e.g., event type, disaster, location). Do not mention the viewer or video itself.",
                    "scene_tags": "List 5\u201315 relevant, lowercase tags describing people, actions, objects, setting, mood, time of day, and inferred context. Include readable text if meaningful. No vague or redundant tags.",
                    "text_appearing": "Transcribe any clearly readable text in the scene. Leave blank if none.",
                    "weather": "Describe the weather condition or say 'interior' if indoors. Choose one: [ 'Sun', 'Cloudy', 'Storm', 'Fog', 'Snow', 'Rain', 'Mist', 'Sandstorm', 'Overcast', 'interior' ]"
                },
                "structured_response": {
                    "camera_setting": "Close Up / Extreme Close Up Shot",
                    "daytime": "Day",
                    "location": "Interior",
                    "persons_appearing": "No",
                    "scene_description": "A modern kitchen with an open refrigerator displaying neatly arranged food items, illuminated by overhead lights.",
                    "scene_tags": [
                        "kitchen",
                        "refrigerator",
                        "food storage",
                        "modern design",
                        "overhead lighting",
                        "organized",
                        "interior",
                        "daylight",
                        "clean",
                        "contemporary"
                    ],
                    "text_appearing": "LIEBHERR",
                    "weather": "interior"
                }
            },
            "module": "visual_understanding",
            "source": "storage://ubXZFeryA7zoF0N0hDgr",
            "tc_end": "00:00:02:12",
            "tc_start": "00:00:00:00",
            "thumbnail": null,
            "time_end": 2.48,
            "time_start": 0.0
        },
        ...
    ]
}
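Since response may be null when a schema is supplied, and the raw text may come back in response when the model fails to produce valid JSON, a consumer typically prefers structured_response with a fallback. A hypothetical helper:

```python
# Sketch: prefer structured_response when a schema was supplied, falling back
# to the raw text in response when the model could not produce valid JSON
# (per the constraints above). Hypothetical helper, not part of the API.

def extract_result(meta: dict):
    """Return the structured dict if present, else the raw response string."""
    structured = meta.get("structured_response")
    if structured is not None:
        return structured
    return meta.get("response")
```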