SAM 2: Segment Anything in Images and Videos

by Nikhila Ravi et al.

Audio version created with Paper2Audio.

We present Segment Anything Model 2 (sam 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. sam 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3× fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6× faster than the Segment Anything Model (sam). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, dataset, as well as code for model training and our demo.

Demo: sam2 dot metademolab dot com U.R.L U.R.L

Code: github dot com U.R.L U.R.L

Website: ai dot meta dot com U.R.L U.R.L

1 Introduction

Meta

Segment Anything (S.A) introduced a foundation model for promptable segmentation in images. However, an image is only a static snapshot of the real world in which visual segments can exhibit complex motion, and with the rapid growth of multimedia content, a significant portion is now recorded with a temporal dimension, particularly in video data. Many important applications in A.R/V.R, robotics, autonomous vehicles, and video editing require temporal localization beyond image-level segmentation. We believe a universal visual segmentation system should be applicable to both images and videos.

Segmentation in video aims to determine the spatio-temporal extent of entities, which presents unique challenges beyond those in images. Entities can undergo significant changes in appearance due to motion, deformation, occlusion, lighting changes, and other factors. Videos often have lower quality than images due to camera motion, blur, and lower resolution. Further, efficient processing of a large number of frames is a key challenge. While S.A successfully addresses segmentation in images, existing video segmentation models and datasets fall short in providing a comparable capability to "segment anything in videos".

We introduce the Segment Anything Model 2 (sam 2), a unified model for video and image segmentation (we consider an image as a single-frame video). Our work includes a task, model, and dataset (see figure 1).

Figure 1 summary: This figure is a conceptual diagram illustrating a system architecture and data pipeline. The content depicts a promptable visual segmentation task where a model processes video frames and prompts, such as boxes, points, or masks, to generate segmentation masks across multiple frames. The model architecture consists of an image encoder, a prompt encoder, a memory attention mechanism, a memory bank, and a mask decoder. Additionally, the figure shows a data engine loop where a model is used to annotate data, which is then used to train the model, resulting in the creation of the SA-V dataset containing a vast amount of masklets, masks, and video hours. It can be inferred that the system utilizes a streaming memory approach to maintain consistency across video frames by storing previous prompts and predictions. The overall workflow demonstrates a self-improving cycle where the model and the large-scale dataset mutually enhance the capability of the system to perform interactive visual segmentation.

We focus on the Promptable Visual Segmentation (P.V.S) task that generalizes image segmentation to the video domain. The task takes as input points, boxes, or masks on any frame of the video to define a segment of interest for which the spatio-temporal mask (i.e., a 'masklet') is to be predicted. Once a masklet is predicted, it can be iteratively refined by providing prompts in additional frames.

Our model (§4) produces segmentation masks of the object of interest, in single images and across video frames. sam 2 is equipped with a memory that stores information about the object and previous interactions, which allows it to generate masklet predictions throughout the video, and also effectively correct these based on the stored memory context of the object from previously observed frames. Our streaming architecture is a natural generalization of sam to the video domain, processing video frames one at a time, equipped with a memory attention module to attend to the previous memories of the target object. When applied to images, the memory is empty and the model behaves like sam.

We employ a data engine (§5) to generate training data by using our model in the loop with annotators to interactively annotate new and challenging data. Different from most existing video segmentation datasets, our data engine is not restricted to objects of specific categories, but is instead targeted at providing training data for segmenting any object with a valid boundary, including parts and subparts. Compared to existing model-assisted approaches, our data engine with sam 2 in the loop is 8.4× faster at comparable quality.

Our final Segment Anything Video (S.A-V) dataset (§5.2) consists of 35.5 million masks across 50.9 thousand videos, 53× more masks than any existing video segmentation dataset. S.A-V is challenging with small objects and parts that get occluded and re-appear throughout the video. Our S.A-V dataset is geographically diverse, and a fairness evaluation of sam 2 indicates minimal performance discrepancy in video segmentation based on perceived gender, and little variance among the three perceived age groups we evaluated.

Our experiments (§6) show that sam 2 delivers a step-change in the video segmentation experience. sam 2 can produce better segmentation accuracy while using 3× fewer interactions than prior approaches. Further, sam 2 outperforms prior work in established video object segmentation benchmarks, under multiple evaluation settings, and delivers better performance compared to sam on image segmentation benchmarks, while being 6× faster. sam 2 is shown to be effective across a variety of video and image distributions as observed through numerous zero-shot benchmarks including 17 for video segmentation and 37 for single-image segmentation.

We are releasing our work under permissive open licences, including the S.A-V dataset (C.C by 4.0), the sam 2 model checkpoints, training code (Apache 2.0), and code for our interactive online demo (Apache 2.0).

2 Related work

Image segmentation. Segment Anything introduces a promptable image segmentation task where the goal is to output a valid segmentation mask given an input prompt such as a bounding box or a point that refers to the object of interest. sam trained on the S.A-1B dataset allows for zero-shot segmentation, which enabled its adoption in a wide range of applications. Recent work has extended sam, for example, by introducing a High-Quality output token to train on fine-grained masks, or by improving sam's efficiency. More broadly, sam is used in a wide range of applications, including medical imaging, remote sensing, motion segmentation, and camouflaged object detection.

Interactive Video Object Segmentation (ivoss). Interactive video object segmentation has emerged as a crucial task to efficiently obtain object segmentations in videos (masklets) with user guidance, often in the form of scribbles, clicks, or bounding boxes. A few early approaches deploy graph-based optimization to guide the segmentation annotation process. More recent approaches often adopt a modular design, converting user inputs into a mask representation on a single frame and then propagating it to other frames.

Click-based input is easier to collect for interactive video segmentation. Recent works have used a combination of sam on images with video trackers based on masks or points. However, these approaches have limitations: the tracker may not work for all objects, sam may not perform well on video frames, and there is no mechanism to interactively refine a model's mistakes, other than re-annotating using sam in each frame and restarting the tracking from there.

Our work shares a similar goal to these works to segment objects across videos interactively, and we build a strong unified model that directly takes prompts for interactive video segmentation, along with a large and diverse dataset in pursuit of solving this goal.

Video Object Segmentation (voss). The voss task begins with an object mask as input in the first frame, which must be accurately tracked throughout the video. The task is referred to as "semi-supervised voss" since the input mask can be seen as a supervision signal of the object which is available only in the first frame. This task has drawn significant attention due to its relevance in applications, including video editing or robotics.

Early deep learning based approaches have often used online fine-tuning on the first video frame or on all frames to adapt the model to the target object. Faster inference has been achieved with offline-trained models, conditioned either only on the first frame, or also integrating the previous frame. This multi-conditioning has been extended to all frames with R.N.N's and transformers.

Semi-supervised voss can be seen as a special case of our Promptable Visual Segmentation (P.V.S) task, with only a mask prompt in the first video frame. Notably, annotating the required high-quality object mask in the first frame in voss is practically challenging and time-consuming for inference.

Video segmentation datasets. Many datasets have been proposed to support the voss task. Early voss datasets, such as Davis, include high-quality annotations but their size limits deep-learning based approaches. YouTube-voss is the first large-scale dataset for voss. As algorithms became better and benchmark performance started to saturate, researchers have looked at increasing the difficulty of the voss task by specifically focusing on occlusions, long videos, extreme transformations, object diversity or scene diversity.

We find that current video segmentation datasets lack sufficient coverage to achieve the capability of “segmenting anything in videos”. Their annotations typically cover entire objects (not parts) and datasets are often centered around specific object classes, such as people, vehicles, and animals. In comparison to these datasets, our released S.A-V dataset not only focuses on whole objects but also extensively covers object parts and contains over an order of magnitude more masks.

3 Task: promptable visual segmentation

Our P.V.S task allows providing prompts to the model on any frame of a video. Prompts can be positive/negative clicks, boxes, or masks, either to define an object to segment or to refine a model-predicted one. To provide an interactive experience, upon receiving a prompt on a specific frame, the model should immediately respond with a valid segmentation mask of the object on this frame. After receiving initial prompts (either on the same frame or different frames), the model should propagate these prompts to obtain the masklet of the object across the entire video, localizing the segmentation mask of the target on every video frame. Additional prompts can be provided to the model on any frame to refine the segment throughout the video (example in figure 2). For details on the task, see Appendix B.
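To make the task inputs concrete, here is a minimal sketch of how prompts on arbitrary frames could be recorded. The class and method names (`Click`, `FramePrompts`, `PVSSession`, `add_click`) are hypothetical and are not taken from the released code; mask prompts are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Click:
    x: int
    y: int
    positive: bool  # True marks the object, False marks background


@dataclass
class FramePrompts:
    clicks: List[Click] = field(default_factory=list)
    # Boxes as (x0, y0, x1, y1); this layout is an assumption.
    boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)


@dataclass
class PVSSession:
    """Hypothetical container for the prompts given on a single video."""
    num_frames: int
    prompts: Dict[int, FramePrompts] = field(default_factory=dict)

    def add_click(self, frame_idx: int, x: int, y: int, positive: bool = True) -> None:
        # Prompts may be placed on any frame of the video.
        if not 0 <= frame_idx < self.num_frames:
            raise IndexError(f"frame {frame_idx} is outside the video")
        self.prompts.setdefault(frame_idx, FramePrompts()).clicks.append(
            Click(x, y, positive)
        )

    def prompted_frames(self) -> List[int]:
        return sorted(self.prompts)


session = PVSSession(num_frames=10)
session.add_click(3, x=120, y=80)                  # define the object on frame 3
session.add_click(3, x=40, y=40, positive=False)   # exclude a background region
session.add_click(0, x=118, y=76)                  # later refinement on frame 0
```

Keeping prompts keyed by frame index mirrors the task definition: initial and refinement prompts are the same kind of input and differ only in when they are supplied.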

Figure 2 summary: This figure is a sequential diagram illustrating a two-step interactive segmentation process. The content depicts a series of video frames featuring a dog, showing how a target object, specifically the tongue, is identified and tracked across time. In the first step, positive and negative prompts are used in an initial frame to define the object, and the segmentation is automatically propagated to subsequent frames. When the tracking fails in later frames, a second step shows a refinement process where a single additional positive prompt is used to recover the object and correct the tracking for the remaining sequence. The figure demonstrates that the system can maintain memory of the object, allowing for efficient correction with minimal user input compared to methods that would require restarting the annotation process from scratch.

sam 2 (§4) is applied as a data collection tool to the P.V.S task for building our S.A-V dataset (§5). We evaluate the model (§6) by simulating interactive video segmentation scenarios across multiple frames, in the conventional semi-supervised voss setting where annotations are limited to the first frame, and for image segmentation on the S.A benchmarks.

4 Model

sam 2 (Fig. 3) can be seen as a generalization of sam to the video (and image) domain, taking point, box, and mask prompts on individual frames to define the spatial extent of the object to be segmented spatio-temporally. Spatially, the model behaves similarly to sam. A promptable and light-weight mask decoder takes an image embedding and prompts (if any) and outputs a segmentation mask for the frame. Prompts can be iteratively added on a frame in order to refine the masks.

Figure 3 summary: This figure is a schematic architectural diagram. It illustrates the workflow of the SAM 2 system, showing a sequential process where images are passed through an image encoder, a memory attention module, and a mask decoder that incorporates inputs from a prompt encoder. The output is then processed by a memory encoder and stored in a memory bank, which feeds back into the memory attention module for subsequent frames. The architecture demonstrates a streaming approach to video processing where current segmentation predictions are informed by both immediate prompts and historical object memories stored from previous frames.

The frame embedding used by the sam 2 decoder is not directly from an image encoder and is instead conditioned on memories of past predictions and prompted frames. It is possible for prompted frames to also come “from the future” relative to the current frame. Memories of frames are created by the memory encoder based on the current prediction and placed in a memory bank for use in subsequent frames. The memory attention operation takes the per-frame embedding from the image encoder and conditions it on the memory bank, before the mask decoder ingests it to form a prediction.

We describe individual components and training below and provide more details in Appendix D.

Image encoder. For real-time processing of arbitrarily long videos, we take a streaming approach, consuming video frames as they become available. The image encoder is only run once for the entire interaction and its role is to provide unconditioned tokens (feature embeddings) representing each frame. We use an mae pre-trained Hiera image encoder, which is hierarchical, allowing us to use multiscale features during decoding.

Memory attention. The role of memory attention is to condition the current frame features on the past frames' features and predictions as well as on any new prompts. We stack L transformer blocks, the first one taking the image encoding from the current frame as input. Each block performs self-attention, followed by cross-attention to memories of (prompted/unprompted) frames and object pointers (see below), stored in a memory bank (see below), followed by an M.L.P. We use vanilla attention operations for self- and cross-attention, allowing us to benefit from recent developments in efficient attention kernels.
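The block structure described above (self-attention, then cross-attention to memory, then an M.L.P) can be illustrated with a toy, pure-Python sketch. It is not the actual implementation: it assumes a single attention head, no learned projections, residual connections without normalization, and a ReLU as a stand-in for the learned M.L.P. All function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return out


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def toy_mlp(tokens):
    # Stand-in for a learned M.L.P (assumption): element-wise ReLU.
    return [[max(0.0, x) for x in t] for t in tokens]


def memory_attention_block(frame_tokens, memory_tokens):
    # 1) Self-attention among the current frame's tokens.
    x = add(frame_tokens, attend(frame_tokens, frame_tokens, frame_tokens))
    # 2) Cross-attention to memory; skipped when the memory is empty,
    #    as is the case for a single image.
    if memory_tokens:
        x = add(x, attend(x, memory_tokens, memory_tokens))
    # 3) M.L.P with a residual connection.
    return add(x, toy_mlp(x))


def memory_attention(frame_tokens, memory_tokens, num_blocks=2):
    x = frame_tokens
    for _ in range(num_blocks):
        x = memory_attention_block(x, memory_tokens)
    return x


conditioned = memory_attention([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]])
```

The output has one token per input frame token, so the mask decoder can consume the conditioned embedding in place of the unconditioned one.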

Prompt encoder and mask decoder. Our prompt encoder is identical to sam's and can be prompted by clicks (positive or negative), boxes, or masks to define the extent of the object in a given frame. Sparse prompts are represented by positional encodings summed with learned embeddings for each prompt type, while masks are embedded using convolutions and summed with the frame embedding.

Our decoder design largely follows sam. We stack “two-way” transformer blocks that update prompt and frame embeddings. As in sam, for ambiguous prompts (i.e., a single click) where there may be multiple compatible target masks, we predict multiple masks. This design is important to ensure that the model outputs valid masks.

In video, where ambiguity can extend across video frames, the model predicts multiple masks on each frame. If no follow-up prompts resolve the ambiguity, the model only propagates the mask with the highest predicted IoU for the current frame.
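The selection rule for unresolved ambiguity amounts to an argmax over predicted IoU scores. The sketch below assumes each candidate is a `(mask, predicted_iou)` pair; the function name `select_mask_to_propagate` is hypothetical.

```python
def select_mask_to_propagate(candidates):
    """Return the (mask, predicted_iou) pair with the highest predicted IoU.

    `candidates` is a non-empty list of (mask, predicted_iou) pairs for the
    current frame; the mask itself can be any object.
    """
    if not candidates:
        raise ValueError("at least one candidate mask is required")
    return max(candidates, key=lambda pair: pair[1])


best_mask, best_score = select_mask_to_propagate(
    [("whole object", 0.62), ("object part", 0.91), ("sub-part", 0.74)]
)
```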

Unlike sam, where there is always a valid object to segment given a positive prompt, in the P.V.S task it is possible for no valid object to exist on some frames (e.g. due to occlusion). To support this new output mode, we add an additional head that predicts whether the object of interest is present on the current frame. Another novelty is the use of skip connections from our hierarchical image encoder (bypassing the memory attention) to incorporate high-resolution embeddings for mask decoding (see Appendix D).

Memory encoder. The memory encoder generates a memory by downsampling the output mask using a convolutional module and summing it element-wise with the unconditioned frame embedding from the image-encoder (not shown in figure 3), followed by light-weight convolutional layers to fuse the information.

Memory bank. The memory bank retains information about past predictions for the target object in the video by maintaining a fifo queue of memories of up to N recent frames and stores information from prompts in a fifo queue of up to M prompted frames. For instance, in the voss task where the initial mask is the only prompt, the memory bank consistently retains the first frame's memory along with memories of up to N recent (unprompted) frames. Both sets of memories are stored as spatial feature maps.
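The two FIFO queues can be sketched with bounded deques. This is an illustrative structure only: the class name `MemoryBank`, its methods, and the example sizes are assumptions, and memories are represented by placeholder strings rather than spatial feature maps.

```python
from collections import deque


class MemoryBank:
    """Toy memory bank with separate FIFO queues for recent and prompted frames."""

    def __init__(self, max_recent, max_prompted):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.recent = deque(maxlen=max_recent)      # up to N recent frames
        self.prompted = deque(maxlen=max_prompted)  # up to M prompted frames

    def add(self, frame_idx, memory, prompted=False):
        queue = self.prompted if prompted else self.recent
        queue.append((frame_idx, memory))

    def frame_indices(self):
        # Prompted memories first, then recent ones, oldest to newest.
        return [i for i, _ in self.prompted] + [i for i, _ in self.recent]


# Semi-supervised setting: the first frame carries the only prompt.
bank = MemoryBank(max_recent=3, max_prompted=2)
bank.add(0, "memory of frame 0", prompted=True)
for t in range(1, 6):
    bank.add(t, f"memory of frame {t}")
```

After five unprompted frames the bank still holds frame 0 alongside the three most recent frames, which matches the behavior described for the voss task.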

In addition to the spatial memory, we store a list of object pointers as lightweight vectors for high-level semantic information of the object to segment, based on mask decoder output tokens of each frame. Our memory attention cross-attends to both spatial memory features and these object pointers.

We embed temporal position information into the memories of N recent frames, allowing the model to represent short-term object motion, but not into those of prompted frames, because the training signal from prompted frames is sparser and it is more difficult to generalize to the inference setting where prompted frames may come from a very different temporal range than seen during training.

Training. The model is trained jointly on image and video data. Similar to previous work, we simulate interactive prompting of the model. We sample sequences of 8 frames and randomly select up to 2 frames to prompt and probabilistically receive corrective clicks which are sampled using the ground-truth masklet and model predictions during training.

The training task is to sequentially (and “interactively”) predict the ground-truth masklet. Initial prompts to the model can be the ground-truth mask with probability 0.5, a positive click sampled from the ground-truth mask with probability 0.25, or a bounding box input with probability 0.25. See Appendix D for more details.
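The sampling scheme for simulated prompts can be sketched as follows. The probabilities and the 8-frame, up-to-2-prompted-frames setup come from the text; the function names are hypothetical, and prompting at least one frame per sequence is an assumption.

```python
import random

PROMPT_TYPES = ("mask", "click", "box")
PROMPT_PROBS = (0.5, 0.25, 0.25)


def sample_initial_prompt_type(rng):
    """Draw the type of the initial prompt with the stated probabilities."""
    return rng.choices(PROMPT_TYPES, weights=PROMPT_PROBS, k=1)[0]


def sample_prompted_frames(rng, num_frames=8, max_prompted=2):
    """Pick between 1 and max_prompted distinct frames to prompt.

    Prompting at least one frame is an assumption of this sketch.
    """
    count = rng.randint(1, max_prompted)
    return sorted(rng.sample(range(num_frames), count))


rng = random.Random(0)
frames = sample_prompted_frames(rng)
prompt_types = {idx: sample_initial_prompt_type(rng) for idx in frames}
```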

5 Data

To develop the capability to “segment anything” in video, we built a data engine to collect a large and diverse video segmentation dataset. We employ an interactive model in the loop setup with human annotators. Similar to Kirillov et al. (2023), we do not impose semantic constraints on the annotated masklets, and focus on both whole objects (e.g., a person) and parts (e.g., a person's hat). Our data engine went through three phases, each categorized based on the level of model assistance provided to annotators. Next, we describe each data engine phase and our S.A-V dataset.

5.1 Data engine

Phase 1: sam per frame. The initial phase used the image-based interactive sam to assist human annotation. Annotators are tasked with annotating the mask of a target object in every frame of the video at 6 frames per second (F.P.S) using sam, and pixel-precise manual editing tools such as a "brush" and "eraser".

There is no tracking model involved to assist with the temporal propagation of masks to other frames. As this is a per-frame method, and all frames require mask annotation from scratch, the process is slow, with an average annotation time of 37.8 seconds per frame in our experiment. However, this yields high-quality spatial annotations per frame.

In this phase, we collected 16 thousand masklets across 1.4 thousand videos. We further use this approach to annotate our S.A-V val and test sets to mitigate potential biases of sam 2 during evaluation.

Phase 2: sam + sam 2 Mask. The second phase added sam 2 into the loop, where sam 2 only accepted masks as prompts. We refer to this version as sam 2 Mask. Annotators used sam and other tools as in Phase 1 to generate spatial masks in the first frame, and then used sam 2 Mask to temporally propagate the annotated mask to other frames to get the full spatio-temporal masklets. At any subsequent video frame, annotators can spatially modify the predictions made by sam 2 Mask by annotating a mask from scratch with sam, a "brush" and/or "eraser", and re-propagate with sam 2 Mask, repeating this process until the masklet is correct. sam 2 Mask was initially trained on the Phase 1 data and publicly available datasets. During Phase 2, we re-trained and updated sam 2 Mask in the annotation loop twice using the collected data. In Phase 2, we collected 63.5 thousand masklets. The annotation time went down to 7.4 s/frame, a 5.1× speedup over Phase 1.

Despite an improvement in annotation time, this approach requires annotating masks in intermediate frames from scratch without previous memory. We then advanced to develop the fully-featured sam 2, capable of both interactive segmentation and mask propagation in a unified model.

Phase 3: sam 2. In the final phase, we utilize the fully-featured sam 2, which accepts various types of prompts, including points and masks. sam 2 benefits from memories of objects across the temporal dimension to generate mask predictions. This means annotators only need to provide occasional refinement clicks to sam 2 to edit the predicted masklets in intermediate frames, as opposed to annotating from scratch with a spatial sam which has no such memory context.

During Phase 3, we re-trained and updated sam 2 using the collected annotations five times. With sam 2 in the loop, the annotation time per frame went down to 4.5 seconds, an 8.4× speedup over Phase 1. In Phase 3, we collected 197.0 thousand masklets.
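The reported speedups follow directly from the per-frame annotation times of the three phases (37.8, 7.4, and 4.5 seconds). A short check, with hypothetical variable names:

```python
# Average annotation time per frame, in seconds, as reported for each phase.
SECONDS_PER_FRAME = {"Phase 1": 37.8, "Phase 2": 7.4, "Phase 3": 4.5}


def speedup(baseline_seconds, new_seconds):
    """How many times faster the new protocol is than the baseline."""
    return baseline_seconds / new_seconds


baseline = SECONDS_PER_FRAME["Phase 1"]
speedups = {
    phase: round(speedup(baseline, seconds), 1)
    for phase, seconds in SECONDS_PER_FRAME.items()
}
for phase, factor in speedups.items():
    print(f"{phase}: {factor}x relative to Phase 1")
```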

Quality verification. To uphold a high standard for annotation, we introduce a verification step. A separate set of annotators is tasked with verifying the quality of each annotated masklet as “satisfactory” (correctly and consistently tracking the target object across all frames) or “unsatisfactory” (target object is well defined with a clear boundary but the masklet is not correct or consistent). Unsatisfactory masklets were sent back to the annotation pipeline for refinement. Any masklets tracking objects that were not well defined were rejected entirely.

Auto masklet generation. Ensuring diversity in annotation is important to enable the anything capability of our model. As human annotators might typically focus more on salient objects, we augment the annotations with automatically generated masklets (referred to as “Auto”). This serves a dual purpose of increasing the coverage of annotations and helping identify model failure cases.

To generate auto masklets, we prompt sam 2 with a regular grid of points in the first frame and generate candidate masklets. These are then sent to the masklet verification step for filtering. Automatic masklets tagged as “satisfactory” are added to the S.A-V dataset. Masklets identified as “unsatisfactory” (i.e., model failure cases) are sampled and presented to annotators to refine with sam 2 in the loop (Phase 3 of the data engine).
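A regular grid of point prompts for the first frame could be generated as in the sketch below. The grid density and the choice to place points at cell centers are assumptions; the text only states that a regular grid is used. The function name `grid_points` is hypothetical.

```python
def grid_points(width, height, points_per_side):
    """Return (x, y) centers of a points_per_side x points_per_side grid.

    Points are placed at cell centers so that none falls on the image border.
    """
    xs = [(i + 0.5) * width / points_per_side for i in range(points_per_side)]
    ys = [(j + 0.5) * height / points_per_side for j in range(points_per_side)]
    return [(x, y) for y in ys for x in xs]


# Each point would serve as one positive click prompt on the first frame.
points = grid_points(width=100, height=50, points_per_side=4)
```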

These automatic masklets cover large salient central objects but also objects of varying sizes and positions in the background.

Analysis. Table 1 shows a comparison of the annotation protocol in each data engine phase through a controlled experiment (details in section E.2.2). We compare the average annotation time per frame, the average percentage of manually edited frames per masklet, and the average number of clicks per clicked frame. For quality evaluation, we define the Phase 1 Mask Alignment Score as the percentage of masks whose IoU compared to the corresponding masks in Phase 1 exceeds 0.75. Phase 1 data is chosen as a reference as it has per-frame high quality manual annotations. Phase 3 with sam 2 in the loop leads to increased efficiency and comparable quality: it is 8.4× faster than Phase 1, has the lowest edited frame percentage and clicks per frame, and results in better alignment.
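The Phase 1 Mask Alignment Score defined above can be computed as in this sketch. Masks are represented as sets of pixel coordinates for simplicity, and treating two empty masks as a perfect match is an assumption; the function names are hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # both masks empty; counted as a match (assumption)
    return len(mask_a & mask_b) / len(union)


def mask_alignment_score(masks, phase1_masks, threshold=0.75):
    """Percentage of masks whose IoU with the Phase 1 reference exceeds threshold."""
    pairs = list(zip(masks, phase1_masks))
    if not pairs:
        raise ValueError("no masks to compare")
    hits = sum(1 for mask, ref in pairs if iou(mask, ref) > threshold)
    return 100.0 * hits / len(pairs)


reference = {(0, 0), (0, 1), (1, 0), (1, 1)}
good = {(0, 0), (0, 1), (1, 0), (1, 1)}   # IoU 1.0 with the reference
poor = {(0, 0)}                           # IoU 0.25 with the reference
score = mask_alignment_score([good, poor], [reference, reference])
```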

Table 1 summary: The evolution across the three data engine phases demonstrates a consistent improvement in annotation efficiency and accuracy. As the system transitioned from using only SAM to integrating SAM 2, there was a substantial reduction in the time required per frame, the percentage of frames needing manual edits, and the number of clicks required for adjustments. Furthermore, the later phases achieved high mask alignment scores across all object sizes compared to the initial phase, with the best performance observed for large objects.

In Table 2, we show the performance comparison of sam 2 trained on the available data at the end of each phase keeping the number of iterations fixed, therefore measuring solely the impact of the additional data. We evaluate on our own S.A-V val set and also on 9 zero-shot benchmarks (see section F.1 for details) using the standard J&F accuracy metric (the higher the better) when prompting with 3-clicks on the first frame. We note a consistent improvement after iteratively including the data from each phase, not only on the in-domain S.A-V val set, but also on the 9 zero-shot benchmarks.

Table 2 summary: The segmentation accuracy improves steadily as data from successive phases of the data engine are added to the initial training set. The most substantial gains occur during the early phases, with the performance continuing to increase slightly through the final automated phase across both the validation set and zero-shot evaluations.

5.2 S.A-V Dataset

The S.A-V dataset collected with our data engine comprises 50.9 thousand videos with 642.6 thousand masklets. In Table 3 we compare the S.A-V composition to common voss datasets across the number of videos, masklets, and masks. Notably, the number of annotated masks is 53× (15× without auto) larger than any existing voss dataset, providing a substantial resource for future work. We are releasing S.A-V under a permissive license.

Table 3 summary: A comparison of various video object segmentation datasets reveals that the SA-V datasets, particularly the version combining manual and automatic labels, are substantially larger than open source alternatives across nearly every metric, including the total number of videos, duration, masklets, masks, and frames. While some open source datasets exhibit lower disappearance rates, the SA-V datasets maintain a high volume of data with disappearance rates that are generally comparable to other large-scale datasets like VOST and MOSE.

Videos. We collected a new set of 50.9 thousand videos captured by crowdworkers. Videos comprise 54% indoor and 46% outdoor scenes with an average duration of 14 seconds. Videos feature “in-the-wild” diverse environments, and cover various everyday scenarios.

Masklets. The annotations comprise 190.9 thousand manual masklet annotations and 451.7 thousand automatic masklets collected using our data engine. Example videos with masklets overlaid (manual and automatic) are shown in figure 4. S.A-V has 53× (15× without auto annotations) more masks than the largest voss dataset. The disappearance rate in S.A-V Manual (the percentage of annotated masklets that disappear in at least one frame and then re-appear) is 42.5%, competitive among existing datasets.
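The disappearance rate can be computed from per-frame visibility flags, as in the sketch below. It assumes that a masklet counts only if the object is visible, then absent for at least one frame, then visible again; frames before the first appearance are not treated as a disappearance. The function names are hypothetical.

```python
def disappears_and_reappears(visible):
    """True if the object is seen, then missing, then seen again.

    `visible` is a sequence of booleans, one per annotated frame.
    """
    seen = False
    missing_after_seen = False
    for is_visible in visible:
        if is_visible:
            if missing_after_seen:
                return True
            seen = True
        elif seen:
            missing_after_seen = True
    return False


def disappearance_rate(masklets):
    """Percentage of masklets that disappear and later re-appear."""
    if not masklets:
        raise ValueError("no masklets given")
    hits = sum(1 for visible in masklets if disappears_and_reappears(visible))
    return 100.0 * hits / len(masklets)


rate = disappearance_rate([
    [True, False, True],          # occluded, then back: counts
    [True, True, False],          # leaves and never returns: does not count
    [False, True, True],          # late first appearance: does not count
    [True, False, False, True],   # long occlusion, then back: counts
])
```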

Figure 4 summary: This figure consists of a series of image sequences organized in rows. Each row displays a temporal progression of frames from different videos within the SA-V dataset, featuring overlaid masklets that track specific objects or individuals over time. The content illustrates various scenes, including aquatic life in an aquarium, people moving within indoor and outdoor environments, and animals in a natural setting. The masklets are applied to these subjects to delineate their boundaries across consecutive frames. From these sequences, it can be inferred that the dataset provides consistent tracking of multiple distinct entities across time. The alignment of the masks suggests that the segmentation process effectively maintains the identity of individual objects as they move or change orientation throughout the video clips.

S.A-V training, validation and test splits. We split S.A-V based on the video authors (and their geographic locations) to ensure minimal overlap of similar objects. To create S.A-V val and S.A-V test sets, we focus on challenging scenarios in selecting videos, and ask annotators to identify challenging targets that are fast-moving, have complex occlusions by other objects as well as disappearance/re-appearance patterns. These targets were annotated at 6 F.P.S using the data engine Phase 1 setup in §5.1. There are 293 masklets and 155 videos in the S.A-V val split, and 278 masklets and 150 videos in the S.A-V test split.
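Splitting by author means that all videos from one author land in the same split. The sketch below illustrates only that grouping property, using a deterministic hash of a hypothetical author identifier; the split fractions are placeholders, and the actual selection of val and test videos (which also considers geography and how challenging the video is) is not modeled.

```python
import hashlib
from collections import defaultdict


def split_for_author(author_id, val_frac=0.1, test_frac=0.1):
    """Map an author to one split; the fractions are illustrative placeholders."""
    digest = hashlib.sha256(author_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if bucket < val_frac:
        return "val"
    if bucket < val_frac + test_frac:
        return "test"
    return "train"


def split_videos(videos):
    """Group (video_id, author_id) pairs into splits keyed by the author."""
    splits = defaultdict(list)
    for video_id, author_id in videos:
        splits[split_for_author(author_id)].append(video_id)
    return dict(splits)


videos = [(f"video_{a}_{k}", f"author_{a}") for a in range(20) for k in range(3)]
splits = split_videos(videos)
```

Because the split is a function of the author alone, no author can contribute videos to more than one split.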

Internal dataset. We also used internally available licensed video data to further augment our training set. Our internal dataset is comprised of 62.9 thousand videos and 69.6 thousand masklets annotated in Phase 2 and Phase 3 (see §5.1) for training, and 96 videos and 189 masklets annotated using Phase 1 for testing (Internal-test).

See Appendix E for more details on the data engine and S.A-V dataset, including a fairness evaluation.

6 Zero-shot experiments

Here, we compare sam 2 with previous work on zero-shot video and image tasks. We report the standard J&F metric for video and mIoU metric for image tasks. Unless otherwise mentioned, the results in this section follow our default setup using Hiera-B+ image encoder with a resolution of 1024 and trained on the full combination of datasets, that is, sam 2 (Hiera-B+) in Table 6 (see also §D.2 for details).

Table 6 summary: SAM 2 demonstrates superior performance in video object segmentation compared to previous methods across multiple datasets. It achieves the highest accuracy scores for both the J&F and G metrics. Notably, SAM 2 shows a substantial improvement over other models on the SA-V validation and test sets, while maintaining a consistent lead across the MOSE, DAVIS, LVOS, and YTVOS benchmarks.

6.1 Promptable video segmentation

We first evaluate promptable video segmentation, which involves simulating an interactive setting that resembles the user experience. We have two settings: offline evaluation, where multiple passes are made through a video to select frames to interact with based on the largest model error, and online evaluation, where the frames are annotated in a single forward pass through the video. These evaluations are conducted on 9 densely annotated zero-shot video datasets using N_click = 3 clicks per frame (see section F.1 for details).
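The frame selection step of the offline setting can be sketched as picking the not-yet-prompted frame where the prediction is worst. Measuring model error as low IoU against the ground truth is an assumption of this sketch, and the function name is hypothetical.

```python
def next_frame_to_prompt(per_frame_iou, already_prompted):
    """Return the index of the unprompted frame with the lowest IoU.

    Returns None once every frame has been prompted. Ties are broken in
    favor of the earliest frame.
    """
    candidates = [
        (score, idx)
        for idx, score in enumerate(per_frame_iou)
        if idx not in already_prompted
    ]
    if not candidates:
        return None
    return min(candidates)[1]


# One simulated pass: frame 1 has the largest error, so it is prompted next.
ious = [0.9, 0.2, 0.5]
first_pick = next_frame_to_prompt(ious, already_prompted=set())
second_pick = next_frame_to_prompt(ious, already_prompted={first_pick})
```

In a full simulation, the model would be re-run after each new prompt and the per-frame scores recomputed before the next selection.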

We create two strong baselines, sam+X.Mem++ and sam+Cutie, based on two state-of-the-art models for video object segmentation, X.Mem++ and Cutie.

We use X.Mem++ to generate a video segmentation based on mask inputs on one or multiple frames. sam is used to provide an initial mask or to refine an output (by feeding the current segmentation as a mask prompt to sam). For the sam+Cutie baseline, we modify Cutie to allow taking mask inputs on multiple frames.

In Figure 5, we report the average J&F accuracy over N_frame = 1 through 8 interacted frames. sam 2 outperforms sam+X.Mem++ and sam+Cutie for both offline and online evaluation settings. Across all 9 datasets (see per-dataset results in §F.1), sam 2 dominates both methods, generating high-quality video segmentation from a few clicks while allowing continued refinement with prompts. Overall, sam 2 can generate better segmentation accuracy, with more than 3× fewer interactions.

Figure 5 summary: This figure is a line chart. It illustrates the average J&F score across multiple datasets as a function of the number of annotated frames using a three-click interaction method, comparing the performance of SAM 2 against SAM combined with XMem++ and SAM combined with Cutie. The data indicates that for all evaluated methods, performance improves as the number of annotated frames increases. SAM 2 consistently achieves the highest accuracy across all numbers of annotated frames, maintaining a significant lead over the other two baseline combinations. While both SAM + Cutie and SAM + XMem++ show similar upward trends, SAM 2 exhibits a more pronounced increase in accuracy, demonstrating superior zero-shot performance in the interactive evaluation setting.

6.2 Semi-supervised video object segmentation

Table 4 summary: SAM2 consistently outperforms SAM+XMem++ and SAM+Cutie in zero-shot accuracy across all prompt types. For all methods, performance improves as the number of clicks increases, with ground-truth masks yielding the highest accuracy and single clicks resulting in the lowest.

We evaluate the semi-supervised video object segmentation (voss) setting with click, box, or mask prompts only on the first frame of the video. When using click prompts, we interactively sample either 1, 3 or 5 clicks on the first video frame.

Similar to the interactive setting in §6.1, we compare to X.Mem++ and Cutie, using sam for click and box prompts, and in their default setting when using mask prompts. We report the standard J&F accuracy, except on vost, where we report the J metric following its protocol. The results are in Table 4. sam 2 outperforms both methods on the 17 datasets. The results underline that sam 2 also excels at the conventional, non-interactive voss task with mask input, for which these other works are specifically designed. Details are in §F.1.3.

6.3 Image segmentation

We evaluate sam 2 on the Segment Anything task across 37 zero-shot datasets, including 23 datasets previously used by sam for evaluation. 1-click and 5-click mIoUs are reported in Table 5 and we show the average mIoU by dataset domain and model speed in frames per second (F.P.S) on a single A.100 GPU.

Table 5 summary: The results demonstrate that SAM 2 outperforms SAM in zero-shot accuracy across various image and video datasets. Specifically, SAM 2 trained on a custom data mix achieves the highest performance across all evaluated domains and datasets compared to both the original SAM and SAM 2 trained on the standard SA-1B dataset. Additionally, SAM 2 provides a substantial increase in processing speed measured in frames per second.

The first column (S.A-23 All) shows accuracy on the 23 datasets from sam. sam 2 achieves higher accuracy (58.9 mIoU with 1 click) than sam (58.1 mIoU with 1 click), without using any extra data and while being 6 times faster. This can be mainly attributed to the smaller but more effective Hiera image encoder in sam 2.

The bottom row shows how training on our S.A-1B and video data mix can further improve accuracy to 61.4% on average on the 23 datasets. We also see exceptional gains on the video benchmarks from S.A-23 (video datasets are evaluated as images, identical to Kirillov et al. (2023)), and the 14 new video datasets we added. More detailed results including a breakdown by dataset are in §F.3.

7 Comparison to State-of-the-Art in Semi-Supervised voss

Our primary focus is on the general, interactive P.V.S task, but we also address the specific semi-supervised voss setting (where the prompt is a ground-truth mask on the first frame), as it is a historically common protocol. We evaluate two versions of sam 2 with varying image encoder sizes (Hiera-B+/-L) with different speed-vs-accuracy tradeoffs. We measure frames per second (F.P.S) on a single A.100 GPU using a batch-size of one. sam 2 based on Hiera-B+ and Hiera-L runs at real-time speeds of 43.8 and 30.2 F.P.S, respectively.

We present a comparison with existing state-of-the-art in Table 6, reporting accuracy using standard protocols. sam 2 shows significant improvement over the best existing methods. We observe that using a larger image encoder brings significant accuracy gains across the board.

We also evaluate existing work on the S.A-V val and test sets, which measure performance for open-world segments of “any” object class. When comparing on this benchmark, we see that most previous methods peak at around the same accuracy. The best performance on S.A-V val and S.A-V test for prior work is significantly lower, demonstrating the gap to a “segment anything in videos” capability.

Finally, we see that sam 2 also brings notable gains in long-term video object segmentation as observed in the L.V.O.S benchmark result. For data and model ablations, see §A.

8 Conclusion

We present a natural evolution of Segment Anything into the video domain, based on three key aspects: (1) extending the promptable segmentation task to video, (2) equipping the sam architecture to use memory when applied to video, and (3) the diverse S.A-V dataset for training and benchmarking video segmentation. We believe sam 2 marks a significant advancement in visual perception, positioning our contributions as milestones that will propel further research and applications.

Acknowledgements. We thank Alexander Kirillov and Jitendra Malik for discussions on project direction. Thanks to Andrew Huang, Sahir Gomez, Miguel Martin, Devansh Kukreja, and Somya Jain for work on the demo, and to Aohan Lin and Meng Wang for creating the dataset visualizer. We thank Shoubhik Debnath and Sagar Vaze for their work on dataset preparation.

Thanks also to William Ngan and Sasha Mitts for their design expertise and to Grant Gardner and George Orlin for leading product management. We are grateful to Joelle Pineau, Daniel Bolya, Kate Saenko, Pengchuan Zhang, and Christopher Chedeau, for valuable discussions. Thanks to Rene Martinez Doehner and Baishan Guo for data support, and to our annotation engineering and management partners: Robert Kuo, Rishi Godugu, Bob Kamma, Ida Cheng, Claudette Ward, Kai Brown, Jake Kinney, Jenny Truong, and Kayren Bergan. Thanks to Vispi Cassod, Parth Malani, Shiva Koduvayur, Alexander Miller, and Caleb Ho for their support with compute and infra.

Finally, we thank Azita Shokrpour, Mallika Malhotra, Rodrick Shepard, Jonathan Torres, Luc Dahlin, David Soofian, Alex Bosenberg, and Amanda Kallet for project-level support.

Table of contents:

• §A: Data and Model Ablations

• §B: Task Details

• §C: Limitations

• §D: Model Details

• §E: Dataset Details

- §E.2.1: Annotation Guidelines

• §F: Zero-shot Experiments Details

• §H: Dataset, Annotation, and Model Cards

A Data and model ablations

This section presents ablations that informed the design decisions for sam 2. We evaluate on S.A-V val, Internal-test, our mose development set (“mose dev”), which contains 200 randomly-sampled videos from the mose training split excluded from our training data, and the average over 9 zero-shot video datasets. As the metric for comparison, we report J&F under 3-click input on the first frame as a balance between the 1-click regime and the voss-style mask prompts. Additionally, we report the average 1-click mIoU on the 23-dataset benchmark used by sam for the S.A task on images. Unless otherwise specified, we perform our ablations at 512×512 spatial resolution, trained with S.A-V manual and a 10% subset of S.A-1B. Additional details are in §D.2.

A.1 Data ablations

Data mix ablation. In Table 7, we compare the accuracy of sam 2 when trained on different data mixtures. We pre-train on S.A-1B and then train a separate model for each setting.

Table 7 summary: The results demonstrate that including SA-V training data consistently leads to the highest performance across most evaluation benchmarks, including SA-V validation, Internal-test, and zero-shot datasets. Combining multiple data sources, particularly the inclusion of SA-1B alongside SA-V and Internal training sets, further enhances results, especially for the SA-23 mIoU metric. In contrast, training solely on VOS data yields the lowest overall accuracy across nearly all tested metrics.

We fix the number of iterations (200k) and batch size (128) with only the training data changing between experiments. We report accuracy on our S.A-V val and Internal set, mose, 9 zero-shot video benchmarks, and the S.A-23 tasks (§6.3).

Row 1 shows that a model purely trained on existing voss datasets (Davis, mose, YouTubeVOS) performs well on the in-domain mose dev, but poorly on all the others including the 9 zero-shot voss datasets (59.7 J&F). We observe tremendous benefit from adding our data engine data into the training mix, including +12.1% average performance improvement on 9 zero-shot datasets (row 11 vs 1). This can be attributed to the limited coverage and size of voss datasets. Adding S.A-1B images improves the performance on the image segmentation task (rows 3 vs 4, 5 vs 6, 9 vs 10, 11 vs 12) without degrading the voss capability. Training only on S.A-V and S.A-1B (row 4) is enough to obtain strong performance on all benchmarks except for mose (specific object categories). Overall, we obtain the best results when mixing all datasets: voss, S.A-1B, and our data engine data (row 12).

Data quantity ablation. Next, we study the effect of scaling training data. sam 2 is pre-trained on S.A-1B before training on varying sizes of S.A-V. We report average J&F score (when prompted with 3 clicks in the first frame) over 3 benchmarks: S.A-V val, zero-shot, and mose dev. Figure 6 shows a consistent power law relationship between the quantity of training data and the video segmentation accuracy on all benchmarks.

Figure 6 summary: This figure consists of three scatter plots with linear regression lines and confidence intervals. Each plot illustrates the relationship between the quantity of SA-V masklets used for training and the resulting J&F accuracy for SAM 2 across different evaluation sets: the SA-V validation set, a collection of zero-shot datasets, and the MOSE development set. Across all three scenarios, there is a consistent positive correlation between the number of masklets and model accuracy. The data indicates that increasing the volume of training masklets leads to a steady improvement in accuracy, suggesting that the model's performance scales effectively with more training data across both seen and unseen datasets.
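A power law relationship of the form y = a · x^b appears as a straight line on log-log axes, so it can be checked with an ordinary least-squares fit on the logarithms. The sketch below shows that procedure on synthetic points; the data, the function name, and the choice of fitted quantity are assumptions for illustration and are not values from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b using a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log-log space = log(a)
    return a, b


# Synthetic points generated from y = 2 * x**0.5 (not results from the paper).
xs = [1.0, 4.0, 16.0, 64.0]
ys = [2.0, 4.0, 8.0, 16.0]
a, b = fit_power_law(xs, ys)
```

On real measurements the fit would not be exact, and the residuals in log space give a quick sense of how well the power law holds.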

Data quality ablation. In Table 8, we experiment with filtering strategies for quality. We subsample 50k masklets from S.A-V, either randomly or by taking the masklets that have been edited the most by annotators. Filtering based on the number of edited frames leads to strong performance using just 25% of the data, and outperforms random sampling, but is worse than using all 190k S.A-V masklets.

Table 8 summary: The results demonstrate that training on the full SA-V dataset yields the best performance across most evaluation benchmarks. Comparing the smaller subsets, using masklets with the most edited frames provides a slight improvement over a random sample, though both are outperformed by the complete dataset.
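The two subsampling strategies compared here can be sketched as follows. The masklet records and the `edited_frames` field are hypothetical stand-ins for whatever metadata the annotation tool stores.

```python
import random


def most_edited(masklets, k):
    """Keep the k masklets with the largest number of annotator-edited frames."""
    return sorted(masklets, key=lambda m: m["edited_frames"], reverse=True)[:k]


def random_subset(masklets, k, seed=0):
    """Uniform random baseline of the same size (seeded for reproducibility)."""
    return random.Random(seed).sample(masklets, k)
```

Both functions return the same number of masklets, so any accuracy difference between the resulting models reflects the selection rule rather than the amount of data.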

A.2 Model architecture ablations

In this section, we present model ablations that guided design decisions, conducted under a smaller model setup with 512 input resolution by default. For each ablation setting, we report segmentation accuracy for video (J&F) and image (mIoU) tasks, and its relative video segmentation speed (the maximum inference throughput relative to the ablation default setup in gray). We find design choices for image and video components to be largely decoupled – this can be attributed to our modular design and training strategy.

A.2.1 Capacity ablations

Input size. During training, we sample sequences of frames of fixed resolution and fixed length (here denoted by # frames). We ablate their impact in Tables 9a, 9b. A higher resolution leads to significant improvements across image and video tasks, and we use a spatial input resolution of 1024×1024 in our final model. Increasing the number of frames brings notable gains on video benchmarks and we use a default of 8 to balance speed and accuracy.

Table 9a summary: Increasing the input resolution generally improves the mean Intersection over Union across most datasets, though the performance gains diminish at the highest resolution. However, these improvements in accuracy come at the cost of a significant reduction in processing speed.

Figure 9 summary: This figure is a data table. It presents an ablation study on the impact of image encoder size, categorized as small, base plus, and large, across several performance metrics including J and F scores for different datasets and mIoU for SA-23, while also recording the relative processing speed. The data indicates that increasing the size of the image encoder generally leads to higher accuracy and better segmentation performance, with the largest encoder achieving the best results in most categories. Conversely, there is a clear trade-off between model capacity and efficiency, as the larger encoders result in a decrease in processing speed compared to the smaller versions.

Memory size. Increasing the (maximum) number of memories, N, generally helps the performance although there could be some variance, as in Table 9c. We use a default value of 6 past frames to strike a balance between temporal context length and computational cost. Using fewer channels for memories does not cause much performance regression as in Table 9d, while making the memory required for storage 4 times smaller.

Table 9c summary: The ablation study on memory size reveals that varying the number of memories has a negligible impact on performance across multiple evaluation benchmarks and processing speed, suggesting that the model is robust to changes in memory capacity within the tested range.

Table 9d summary: Increasing the memory channel dimension leads to marginal changes in performance across most evaluation benchmarks, while resulting in a slight decrease in processing speed.
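A bounded memory of the N most recent frames, as discussed in the memory size ablation, can be sketched with a fixed-length queue. The class and method names are hypothetical, and dropping the oldest entry first is an assumption of this sketch rather than something stated in this section.

```python
from collections import deque


class MemoryBank:
    """Keeps at most `max_memories` recent (frame index, features) entries."""

    def __init__(self, max_memories=6):
        # A deque with maxlen discards the oldest entry when a new one is appended.
        self.frames = deque(maxlen=max_memories)

    def add(self, frame_idx, features):
        self.frames.append((frame_idx, features))

    def frame_indices(self):
        return [idx for idx, _ in self.frames]
```

With the default of 6, processing frames 0 through 9 leaves frames 4 through 9 in the bank, so storage and attention cost stay constant regardless of video length.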

Model size. More capacity in the image encoder or memory-attention (#self-/#cross-attention blocks) generally leads to improved results, as shown in Tables 9e, 9f. Scaling the image encoder brings gains on both image and video metrics, while scaling the memory-attention only improves video metrics. We default to using a B+ image encoder, which provides a reasonable balance for speed and accuracy.

Table 9e summary: The ablation study on memory attention dimensions shows that varying the number of source and channel attributes has a negligible impact on performance across multiple evaluation benchmarks, while resulting in slight changes to processing speed.

Table 9 summary: The results indicate that increasing the number of input frames generally leads to an improvement in J&F scores across various datasets, including MOSE dev, SA-V val, and zero-shot evaluations. However, this increase in frame count does not significantly impact the processing speed or the mIoU on the SA-23 dataset, where performance remains relatively stable.

A.2.2 Relative positional encoding

By default, we always use absolute positional encoding in both the image encoder as well as memory attention. In Table 10, we study relative positional encoding design choices. Here we also evaluate on L.V.O.S.v2 with 3 clicks on the 1st frame as a benchmark for long-term video object segmentation.

Table 10 summary: The results demonstrate that incorporating 2d-RoPE positional encoding and removing RPB generally leads to slight improvements in J&F metrics and mIoU across various datasets. While removing RPB enables a significant increase in processing speed through FlashAttention-2, the performance difference between using 2d-RoPE and the baseline without it is minimal.

While sam follows prior work in adding relative positional biases (R.P.B) to all image encoder layers, Bolya et al. (2023) improve upon this by removing R.P.B in all but the global attention layers while adopting “absolute-win” positional encoding, which brings large speed gains. We improve upon this further by removing all R.P.B from the image encoder, with no performance regression on S.A-23 and minimal regression on video benchmarks (see Table 10), while giving a significant speed boost at 1024 resolution. We also find it is beneficial to use 2d-RoPE in the memory attention.

A.2.3 Memory architecture ablations

Recurrent memory. We investigate the effectiveness of feeding the memory features to a GRU before adding them to the memory bank. Similar to §A.2.2, we also evaluate on L.V.O.S.v2 as an additional benchmark for long-term object segmentation. While prior works have commonly employed GRU states as a means of incorporating memory into the tracking process, our findings in Table 11 suggest that this approach does not provide an improvement (except slightly on L.V.O.S.v2). Instead, we find it sufficient to directly store the memory features in the memory bank, which is both simpler and more efficient.

Table 11 summary: The ablation study on memory design indicates that using object pointers generally leads to better performance across several validation datasets compared to the baseline, while the addition of a recurrent GRU memory provides marginal changes in accuracy and a slight decrease in processing speed.

Object pointers. We ablate the impact of cross-attending to the object pointer vectors from the mask decoder output in other frames (see §4). The results presented in Table 11 show that while cross-attending to object pointers does not enhance average performance across the 9 zero-shot datasets, it significantly boosts performance on S.A-V val dataset as well as on the challenging L.V.O.S.v2 benchmark (validation split). Hence, we default to cross-attending to object pointers together with the memory bank embeddings from the memory encoder.

B Details on the P.V.S Task

The Promptable Visual Segmentation (P.V.S) task can be seen as an extension of the Segment Anything (S.A) task from static images to videos. In the P.V.S setting, given an input video, the model can be interactively prompted with different types of inputs (including clicks, boxes, or masks) on any frame in the video, with the goal of segmenting (and tracking) a valid object throughout the video. When interacting with a video, the model provides an instant response on the frame being prompted (similar to the interactive experience of sam on images), and also returns the segmentation of the object throughout the entire video in near real-time. Similar to sam, the focus is on valid objects which have a clearly defined boundary, and we do not consider regions without visual boundaries. Figure 7 illustrates the task.

Figure 7 summary: This figure is a conceptual flow diagram. It illustrates the Promptable Visual Segmentation task by comparing it to existing methodologies. The diagram shows two parallel paths: one where Segment Anything leads into semi-supervised Video Object Segmentation, and another where a promptable interface directly feeds into the Promptable Visual Segmentation process, both ultimately resulting in a sequence of segmented object masks. The figure demonstrates that Promptable Visual Segmentation serves as a generalized framework, positioning both Segment Anything and semi-supervised Video Object Segmentation as specific instances or subsets of this broader task.

P.V.S is related to tasks in the image and video domains. For images, the S.A task can be considered a subset of P.V.S with the video reduced to a single frame. Similarly, traditional semi-supervised and interactive voss tasks are special cases of P.V.S, limited to mask prompts provided only on the first frame and scribbles on multiple frames to segment objects throughout a video, respectively. In P.V.S, prompts can either be clicks, masks, or boxes, and the focus is on enhancing the interactive experience, enabling refinement of a segmentation with minimal interaction.

C Limitations

sam 2 demonstrates strong performance in both static image and video domains, yet it encounters difficulties in certain scenarios. The model may fail to segment objects across shot changes and can lose track of or confuse objects in crowded scenes, after long occlusions or in extended videos. To alleviate this issue, we designed the ability to prompt sam 2 in any frame: if the model loses the object or makes an error, refinement clicks on additional frames can quickly recover the correct prediction in most cases.

sam 2 also struggles with accurately tracking objects with very thin or fine details especially when they are fast-moving. Another challenging scenario occurs when there are nearby objects with similar appearance (e.g., multiple identical juggling balls). Incorporating more explicit motion modeling into sam 2 could mitigate errors in such cases.

While sam 2 can track multiple objects in a video simultaneously, it processes each object separately, utilizing only shared per-frame embeddings without inter-object communication. While this approach is simple, incorporating shared object-level contextual information could aid in improving efficiency.

Our data engine relies on human annotators to verify masklet quality and select frames that require correction. Future developments could include automating this process to enhance efficiency.

D sam 2 Details

D.1 Architecture

Here we discuss further architecture details, expanding on the model description in §4.

Image encoder. We use a feature pyramid network to fuse the stride 16 and 32 features from Stages 3 and 4 of the Hiera image encoder respectively to produce the image embeddings for each frame. In addition, the stride 4 and 8 features from Stages 1 and 2 are not used in the memory attention but are added to the upsampling layers in the mask decoder as shown in Figure 8, which helps produce high-resolution segmentation details. We follow prior work in using windowed absolute positional embeddings in the Hiera image encoder.

Figure 8 summary: This figure is a schematic architectural diagram. It illustrates the structure of a mask decoder, which processes image embeddings and a combination of output and prompt tokens through a series of attention mechanisms and multi-layer perceptrons. The architecture incorporates a token-to-image attention block and convolutional transformations that integrate features from the image encoder to produce final masks. The system further utilizes mask tokens as object pointers and employs separate multi-layer perceptrons to derive intersection-over-union scores and occlusion scores. The design indicates that the decoder is capable of not only segmenting objects but also tracking them via object pointers and assessing their visibility through occlusion scoring.

In Bolya et al. (2023), R.P.B provided positional information spanning across windows in the image encoder. In lieu of this, we adopt a simpler approach of interpolating the global positional embedding to span across windows. We do not use any relative positional encoding. We train models with varying image encoder sizes: T, S, B+ and L. We follow prior work and use global attention in only a subset of the image encoder layers (see Table 12).

Memory attention. In addition to sinusoidal absolute positional embeddings, we use 2d spatial Rotary Positional Embedding (RoPE) in self-attention and cross-attention layers. The object pointer tokens are excluded from RoPE as they do not have specific spatial correspondence. By default, the memory attention uses L = 4 layers.

Prompt encoder and mask decoder. The prompt encoder design follows sam, and we next discuss additional details on design changes in the mask decoder. We use the mask token corresponding to the output mask as the object pointer token for the frame, which is placed in the memory bank.

As discussed in §4, we also introduce an occlusion prediction head. This is accomplished by including an additional token along with the mask and IoU output tokens. An additional M.L.P head is applied to this new token to produce a score indicating the likelihood of the object of interest being visible in the current frame (as shown in Figure 8). In the memory bank, we also add a learned occlusion embedding to the memory features of those frames that are predicted to be occluded (invisible) by the occlusion prediction head.

sam introduced the ability to output multiple valid masks when faced with ambiguity about the object being segmented in an image. For example, when a person clicks on the tire of a bike, the model can interpret this click as referring to only the tire or the entire bike and output multiple predictions. In videos, this ambiguity can extend across video frames. For example, if in one frame only the tire is visible, a click on the tire might relate to just the tire, or as more of the bike becomes visible in subsequent frames, this click could have been intended for the entire bike. To handle this ambiguity, sam 2 predicts multiple masks at each step of the video. If further prompts do not resolve the ambiguity, the model selects the mask with the highest predicted IoU for the current frame for further propagation in the video.
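The fallback rule at the end of this paragraph, keeping the candidate with the highest predicted IoU, amounts to an argmax over the IoU head's outputs. The sketch below uses strings as stand-ins for masks and made-up scores; the function name is hypothetical.

```python
def select_mask(masks, predicted_ious):
    """Return (index, mask) of the candidate with the highest predicted IoU."""
    best = max(range(len(masks)), key=lambda i: predicted_ious[i])
    return best, masks[best]
```

For example, with candidates `["tire", "wheel", "bike"]` and predicted IoUs `[0.61, 0.58, 0.83]`, the third candidate is the one propagated to later frames.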

Memory encoder and memory bank. Our memory encoder does not use an additional image encoder and instead reuses the image embeddings produced by the Hiera encoder, which are fused with the predicted mask information to produce memory features (as discussed in §4). This design allows the memory features to benefit from the strong representations produced by the image encoder (especially when we scale the image encoder to a larger size). Further, we project the memory features in our memory bank to a dimension of 64, and split the 256-dim object pointer into 4 tokens of 64-dim for cross-attention to the memory bank.
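The pointer split described above can be pictured as cutting one 256-dimensional vector into four 64-dimensional tokens so that it matches the channel width of the memory features. The sketch assumes contiguous chunks (a plain reshape), which is an assumption of this example.

```python
def split_pointer(pointer, token_dim=64):
    """Split a flat pointer vector into consecutive tokens of length `token_dim`."""
    if len(pointer) % token_dim != 0:
        raise ValueError("pointer length must be a multiple of token_dim")
    return [pointer[i:i + token_dim] for i in range(0, len(pointer), token_dim)]
```

Calling `split_pointer` on a list of 256 values yields 4 tokens of 64 values each.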

Handling multiple objects in a video. When applying sam 2 to segment multiple objects in the same video (such as multi-object tracking in the semi-supervised voss evaluation), we perform inference on each object independently. More specifically, we share the visual features from the image encoder between all the objects in the video but run all the other model components (such as the memory bank and the mask decoder) separately for each object.
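The sharing pattern described above can be sketched as a loop that encodes each frame once and then runs the per-object components separately. Here `encode_image` and `track_object` are hypothetical callables supplied by the caller, and the memory is modeled as a simple per-object list.

```python
def segment_objects(frames, object_ids, encode_image, track_object):
    """Run per-object tracking while computing image features once per frame."""
    memories = {obj: [] for obj in object_ids}   # one memory per object
    results = {obj: [] for obj in object_ids}
    for frame in frames:
        features = encode_image(frame)           # shared across all objects
        for obj in object_ids:
            # track_object returns (mask, memory_entry) for this object only.
            mask, memory_entry = track_object(features, memories[obj], obj)
            memories[obj].append(memory_entry)
            results[obj].append(mask)
    return results
```

With this structure the image encoder cost is independent of the number of objects, while the cost of the remaining components grows linearly with it.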

D.2 Training

D.2.1 Pre-training

We first pre-train sam 2 on static images on the S.A-1B dataset. Table 12a details the settings used during pre-training on S.A-1B – other settings not mentioned here follow prior work. The image encoder is initialized from mae pre-trained Hiera. Similar to sam, we filter masks covering more than 90% of the image and restrict training to 64 randomly sampled masks per image.

Unlike sam, we found it beneficial to use an ℓ1 loss to more aggressively supervise the IoU predictions and to apply a sigmoid activation to the IoU logits to restrict the output into the range between 0 and 1. For multi-mask predictions (on the first click), we supervise the IoU predictions of all masks to encourage better learning of when a mask might be bad, but only supervise the mask logits with the lowest segmentation loss (linear combination of focal and dice loss). In sam, during iterative sampling of points, two iterations were inserted with no additional prompts (only feeding the previous mask logits) – we do not add such iterations during our training and use 7 correction clicks (instead of 8 in sam). We also employ horizontal flip augmentation during training and resize the image to a square size of 1024×1024.
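The two supervision rules in this paragraph can be sketched in a few lines: the IoU head is trained with an absolute-error loss on sigmoid-activated logits for every candidate mask, while only the candidate with the lowest segmentation loss receives mask supervision. Function names are hypothetical and the numbers in the usage note are made up.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def iou_l1_loss(iou_logits, true_ious):
    """Mean absolute error between sigmoid(logit) and the actual IoU, over all masks."""
    preds = [sigmoid(z) for z in iou_logits]
    return sum(abs(p - t) for p, t in zip(preds, true_ious)) / len(preds)


def mask_to_supervise(seg_losses):
    """Index of the candidate mask with the lowest segmentation loss."""
    return min(range(len(seg_losses)), key=lambda i: seg_losses[i])
```

For instance, with segmentation losses `[0.9, 0.2, 0.5]` only the second mask is supervised, but `iou_l1_loss` still covers all three IoU predictions.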

We use AdamW and apply layer decay on the image encoder and follow a reciprocal square-root schedule. See Table 12 (a) for the hyperparameters in our pre-training stage.

D.2.2 Full training

After pre-training, we train sam 2 on our introduced datasets S.A-V + Internal (§5.2), a 10% subset of S.A-1B, and a mixture of open-source video datasets including Davis, mose, and YouTubeVOS. Our released model is trained on S.A-V manual + Internal and S.A-1B.

sam 2 is designed for two tasks; the P.V.S task (on videos) and the S.A task (on images). Training is done jointly on image and video data. To optimize our data usage and computational resources during training, we adopt an alternating training strategy between video data (multiple frames) and static images (one single frame).

Specifically, in each training iteration, we sample a full batch either from the image or video dataset, with their sampling probabilities proportional to the size of each data source. This approach allows for a balanced exposure to both tasks and a different batch size for each data source to maximize compute utilization. Settings not explicitly mentioned here for the image task follow settings from the pre-training phase.

See Table 12 (b) for the hyperparameters in our full training stage. The training data mixture consists of approximately 15.2% S.A-1B, approximately 70% S.A-V and approximately 14.8% Internal. The same settings are used when open-source datasets are included, except that the mixture changes to approximately 1.3% Davis, approximately 9.4% mose, approximately 9.2% YouTubeVOS, approximately 15.5% S.A-1B, approximately 49.5% S.A-V, and approximately 15.1% Internal. When training on S.A-V and other video datasets, we only use those manually annotated masklets (without adding automatically generated ones), which are sufficient to achieve strong performance based on our analyses.
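One way to picture the alternating strategy is to draw the data source for each full batch at random, weighted by the mixture above. Treating the stated percentages directly as per-batch sampling weights is an assumption of this sketch, and the dictionary and function names are hypothetical.

```python
import random

# Mixture percentages as stated in the text, used here as sampling weights.
MIX = {"SA-1B": 15.2, "SA-V": 70.0, "Internal": 14.8}
MIX_WITH_OSS = {
    "DAVIS": 1.3, "MOSE": 9.4, "YouTubeVOS": 9.2,
    "SA-1B": 15.5, "SA-V": 49.5, "Internal": 15.1,
}


def sample_source(mix, rng):
    """Pick the data source for the next full batch, weighted by the mixture."""
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)
schedule = [sample_source(MIX, rng) for _ in range(8)]  # sources for 8 batches
```

Because each batch comes from a single source, image batches and video batches can use different batch sizes without mixing the two in one step.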

We apply a series of data augmentations to the training videos (detailed in Table 12), including random horizontal flips, random affine transforms, random color jittering, and random grayscale transforms. We also adopt a mosaic transform to simulate challenging scenarios with multiple similar-looking objects – with 10% probability, we tile the same training video into a 2×2 grid and select a masklet from one of the 4 quadrants as the target object to segment. In this case, the model must focus on other cues like motion or temporal continuity to distinguish the target object from its identical-looking counterparts in other quadrants. In addition, the videos and objects in each quadrant are smaller in size (only half the original width and height) after this mosaic transform, which also facilitates learning to segment small objects.
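The mosaic transform can be sketched on a single frame stored as a nested list. This is an illustrative version under stated assumptions: even height and width, nearest-neighbour downsampling by taking every other pixel, and row-major quadrant numbering from 0 to 3. None of these details come from the text above.

```python
def mosaic_2x2(frame):
    """Tile a half-resolution copy of `frame` into a 2x2 grid of the original size."""
    small = [row[::2] for row in frame[::2]]   # downsample by 2 in each dimension
    top = [row + row for row in small]         # two copies side by side
    return top + [list(row) for row in top]    # and two copies stacked vertically


def quadrant_mask(mask, quadrant):
    """Keep the downsampled target mask in one quadrant; all other quadrants are 0."""
    small = [row[::2] for row in mask[::2]]
    h, w = len(small), len(small[0])
    out = [[0] * (2 * w) for _ in range(2 * h)]
    r0, c0 = (quadrant // 2) * h, (quadrant % 2) * w
    for r in range(h):
        for c in range(w):
            out[r0 + r][c0 + c] = small[r][c]
    return out
```

The image shows four identical copies of the object, but the target mask marks only one of them, so appearance alone cannot identify the target.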

We train by simulating an interactive setting, sampling 8-frame sequences and randomly selecting up to 2 frames (including the first) for corrective clicks. During training, we use ground-truth masklets and model predictions to sample prompts, with initial prompts being the ground-truth mask (50% probability), a positive click from the ground-truth mask (25%), or a bounding box input (25%).

We restrict the maximum number of masklets for each sequence of 8 frames to 3 randomly chosen ones. We reverse the temporal order with a probability of 50% to help generalization to bi-directional propagation. When we sample corrective clicks, with a small probability of 10%, we randomly sample clicks from the ground truth mask, irrespective of the model prediction, to allow additional flexibility in mask refinement.
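The sampling probabilities stated above (50% mask, 25% click, 25% box for the initial prompt, and a 50% chance of reversing the clip) can be sketched as below. The function names are hypothetical, and frames are represented by any Python list.

```python
import random


def sample_initial_prompt(rng):
    """Draw the initial prompt type: mask 50%, positive click 25%, box 25%."""
    r = rng.random()
    if r < 0.5:
        return "mask"
    if r < 0.75:
        return "click"
    return "box"


def maybe_reverse(frames, rng):
    """Reverse the temporal order of a clip with probability 0.5."""
    return frames[::-1] if rng.random() < 0.5 else frames
```

Reversing half of the clips means the model sees objects propagated both forward and backward in time, which matches the stated goal of bi-directional propagation.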

Fine-tuning using 16-frame sequences. A potential shortcoming of the procedure above is that the model only sees sampled 8-frame sequences during training, which is relatively short compared to the full video length during inference. To alleviate this issue and further boost the segmentation quality on long videos, we introduce an extra fine-tuning stage where we sample 16-frame sequences on challenging videos (those videos with the highest number of edited frames, as described in §E.2.1). More specifically, we sort our masklets by number of edited frames and only consider the top 50% most edited masklets for training, for both S.A-V and Internal datasets. We still keep the complete versions of the O.S.S datasets (Davis, mose, and YouTubeVOS) in the training mix. We fine-tune for 50k iterations (1/3 of the original schedule) using half of the original learning rate and freeze the image encoder to fit the 16-frame sequence into the 80 G.B memory of A.100 GPUs.

Losses and optimization. We supervise the model's predictions using a linear combination of focal and dice losses for the mask prediction, mean-absolute-error (mae) loss for the IoU prediction, and cross-entropy loss for object prediction with a ratio of 20:1:1 respectively. As during pre-training, for multi-mask predictions, we only supervise the mask with the lowest segmentation loss. If the ground-truth does not contain a mask for a frame, we do not supervise any of the mask outputs (but always supervise the occlusion prediction head that predicts whether there should exist a mask in the frame).
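The supervision rule above can be sketched in plain Python on flattened masks. Several details here are assumptions rather than facts from the text: the weight 20 is applied to the focal term and weight 1 to each remaining term, the focal parameters gamma and alpha take common default values, and the IoU target is computed from the mask thresholded at 0.5.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=0.25, eps=1e-6):
    """Mean binary focal loss over pixels (gamma and alpha are assumed)."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)
        p_t = p if t == 1 else 1.0 - p
        a_t = alpha if t == 1 else 1.0 - alpha
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, target, eps=1.0):
    """Soft dice loss with additive smoothing."""
    inter = sum(p * t for p, t in zip(probs, target))
    return 1.0 - (2.0 * inter + eps) / (sum(probs) + sum(target) + eps)


def hard_iou(probs, target, thresh=0.5):
    """IoU between the thresholded prediction and the binary target."""
    pred = [1 if p > thresh else 0 for p in probs]
    inter = sum(1 for a, b in zip(pred, target) if a and b)
    union = sum(1 for a, b in zip(pred, target) if a or b)
    return inter / union if union else 1.0


def frame_loss(candidates, pred_ious, obj_prob, target,
               w_focal=20.0, w_dice=1.0, w_iou=1.0, w_obj=1.0):
    """Loss for one frame; `target` is None when no ground-truth mask exists.

    Returns (loss, index_of_supervised_mask_or_None).
    """
    has_obj = target is not None
    p = min(max(obj_prob, 1e-6), 1.0 - 1e-6)
    obj_loss = -math.log(p if has_obj else 1.0 - p)
    if not has_obj:
        # No mask outputs are supervised; the object head still is.
        return w_obj * obj_loss, None
    seg = [w_focal * focal_loss(c, target) + w_dice * dice_loss(c, target)
           for c in candidates]
    # Only the candidate with the lowest segmentation loss is supervised.
    best = min(range(len(seg)), key=seg.__getitem__)
    iou_loss = abs(pred_ious[best] - hard_iou(candidates[best], target))
    return seg[best] + w_iou * iou_loss + w_obj * obj_loss, best
```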

Table 12 summary: This table outlines the hyperparameters and configuration settings used for the pre-training and full training phases of the SAM 2 model. It details the input data sources, image resolution, and precision, as well as the optimization settings including the learning rate schedules, weight decay, and gradient clipping. The configuration specifies different drop path rates and global attention blocks based on the image encoder size. The output of these settings is a tuned model trained using a combination of focal, dice, and IoU losses for mask prediction, with additional occlusion loss applied during full training.

D.3 Speed benchmarking

We conduct all benchmarking experiments on a single A.100 GPU using PyTorch 2.3.1 and cuda 12.1, under automatic mixed precision with bfloat16. We compile the image encoder with torch.compile for all sam 2 models and do the same for sam and HQ-sam for direct comparison on the S.A task (Tables 5 and 15). The F.P.S measurements for the S.A task were conducted using a batch size of 10 images, which was found to yield the highest F.P.S across all three model types. For video tasks, we use a batch size of 1 following the common protocol in video segmentation.

Table 15 summary: The table compares the zero-shot performance of SAM 2 against SAM and HQ-SAM across various datasets and domains. SAM 2 models, particularly those trained on the authors' data mix, consistently achieve higher mIoU scores across image and video benchmarks compared to the baselines. While larger model variants generally show improved accuracy, SAM 2 (Hiera-B+) provides a significant increase in inference speed (FPS) while maintaining competitive performance. The results indicate that the proposed data mix further enhances the zero-shot capabilities of SAM 2 across all tested categories.

E Data details

E.1 S.A-V Dataset Details

Videos. Resolutions range from 240p to 4K with 1,401 times 1,037 on average. Duration ranges from 4 seconds to 2.3 minutes, with an average of 13.8 seconds, totaling 4.2 million frames and 196 hours.

Dataset diversity. As shown in figure 10, S.A-V videos were recorded across 47 countries (figure 10b), by diverse participants (self-reported demographics in figure 10c). Figure 10a shows a comparison of mask size distribution (normalized by video resolution) with Davis, mose, and YouTubeVOS. More than 88% of S.A-V masks have a normalized mask area less than 0.1.

Figure 10 summary: This figure consists of a bar chart, a world map, and two pie charts. The bar chart illustrates the distribution of normalized masklet sizes across several datasets, the world map displays the geographic distribution of videos by count, and the pie charts show the self-reported gender and age demographics of the crowdworkers. The data indicates that the vast majority of masklets across all datasets are small in size. Geographically, the videos are sourced globally, with the highest concentrations found in Asia. Regarding crowdworker demographics, there is a relatively balanced distribution between males and females, with the largest age group being young to middle-aged adults.

Figure 9 summary: This figure consists of two side-by-side image panels showing segmentation overlays on a scene with people and packages. The first panel displays manual labels, while the second panel shows the result of incorporating automatic labels. The comparison demonstrates that the addition of automatic labeling significantly enhances the diversity and coverage of the annotations, capturing more objects and background elements than the manual approach alone.

Automatic masklets. Similar to the approach described by Kirillov et al. (2023), automatic masklets are generated by prompting the model with regular grids. We prompt the model with a 32 times 32 grid on the first frame, and additionally we use a 16 times 16 grid on 4 zoomed image crops of the first frame (derived from a 2 times 2 overlapped window) and a 4 times 4 grid on 16 zoomed image crops of the first frame (derived from a 4 times 4 overlapped window). We apply two post-processing steps across all frames. First, we remove tiny disconnected components with areas smaller than 200 pixels. Second, we fill in holes in segmentation masks if the area of the hole is less than 200 pixels. By combining these automatically generated masklets with manually created ones, we enhance the coverage of annotations in the S.A-V dataset, as illustrated in figure 9.
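The two post-processing steps for automatic masklets can be sketched with a breadth-first connected-component search. This is an illustration under stated assumptions: masks are nested lists of 0 and 1, components use 4-connectivity, every foreground component below the threshold is removed, and a hole is a background component that does not touch the image border. None of these choices is confirmed by the text.

```python
from collections import deque


def _components(grid, value):
    """Yield 4-connected components of cells equal to `value`."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    for r0 in range(h):
        for c0 in range(w):
            if grid[r0][c0] != value or seen[r0][c0]:
                continue
            seen[r0][c0] = True
            queue, comp = deque([(r0, c0)]), []
            while queue:
                r, c = queue.popleft()
                comp.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < h and 0 <= nc < w
                            and not seen[nr][nc] and grid[nr][nc] == value):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            yield comp


def clean_mask(mask, min_area=200):
    """Remove small foreground islands, then fill small enclosed holes."""
    out = [list(row) for row in mask]
    h, w = len(out), len(out[0])
    # Step 1: drop foreground components smaller than min_area pixels.
    for comp in list(_components(out, 1)):
        if len(comp) < min_area:
            for r, c in comp:
                out[r][c] = 0
    # Step 2: fill background components that are enclosed and small.
    for comp in list(_components(out, 0)):
        on_border = any(r in (0, h - 1) or c in (0, w - 1) for r, c in comp)
        if not on_border and len(comp) < min_area:
            for r, c in comp:
                out[r][c] = 1
    return out
```

The component lists are materialized with `list(...)` before any pixel is modified, so the search never sees a partially edited mask.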

E.1.1 Fairness evaluation

We evaluate sam 2 for fairness across demographic groups. We collect annotations for the people category in the EgoExo4D dataset, which contains self-reported demographic information supplied by the subject of the video. We employ the same annotation setup as for S.A-V val and test sets and apply this to 20-second clips from the third-person (exo) videos. We evaluate sam 2 on this data using 1-, 3-clicks, and ground-truth mask on the first frame.

Table 13 shows the comparison in J&F accuracy of sam 2 for segmenting people across gender and age. At 3 clicks and with ground-truth mask prompts there is minimal discrepancy. We manually inspect 1-click predictions, and find the model frequently predicts the mask for a part instead of the person. When limiting the comparison to clips where the person is correctly segmented, the gap at 1 click shrinks substantially (J&F male 94.3, female 92.7), suggesting the discrepancy can be partially attributed to ambiguity in the prompt.

Table 13 summary: The fairness evaluation of SAM 2 across different demographic groups indicates that performance is consistently highest when using the mask prompt, followed closely by the 3-click prompt. In contrast, the 1-click prompt results in notably lower performance across all gender and age categories. Overall, the results show minimal variance between different demographic groups, suggesting that the model maintains a consistent level of performance regardless of gender or age.

In Appendix H, we provide model, data and annotation cards for S.A-V.

E.2 Data engine details

E.2.1 Annotation protocol

A diagram of the annotation protocol used in our data engine is shown in figure 11. The annotation task was separated into steps each carried out by a different annotator: Steps 1 and 2 focus on object selection, Steps 3 and 4 on masklet tracking, and Step 5 on quality verification. sam 2 was deployed on GPU as an A.P.I and built into the annotation tool to enable interactive use.

Figure 11 summary: This figure is a flow chart illustrating a multi-stage annotation process. The content describes a sequential workflow divided into masklet selection, masklet tracking, and masklet verification, involving three different annotators. The process begins with a first annotator watching a video to select objects and identifying tracking failures. A second annotator then iteratively corrects these predictions by adding refinement prompts and re-running the propagation model until the masklets are finished. Finally, a third annotator verifies the quality of the masklets, resulting in a classification of the work as satisfactory, not satisfactory, or rejected. It can be inferred that the annotation pipeline is designed to be iterative and collaborative, utilizing a human-in-the-loop approach to correct model errors. The separation of roles suggests a quality control mechanism where selection, correction, and final verification are handled by different individuals to ensure higher accuracy and consistency in the final dataset.

Compared to image segmentation annotation, large-scale video segmentation annotation presents unique challenges which require innovations in the annotation task design and protocol. To improve our model's ability to "segment anything", it was important to focus annotation on challenging objects where sam 2 struggled. We leveraged our online model in the loop setup to enable this, requesting annotators to use sam 2 interactively to identify failure modes and then correct them.

We found the number of edited frames to be a proxy for the “challengingness” of an object as shown in Table 8. Therefore, we asked annotators to annotate objects that required at least 2 edited frames with sam 2 in the loop. To focus annotation on less prominent and more challenging cases, annotators were presented with videos pre-filled with verified satisfactory automatic masklets and asked to find un-annotated challenging objects. We further decouple the object selection task from the annotation task: in the selection task annotators focus on choosing the challenging objects in one frame, while in the annotation task annotators are presented with a challenging target object and requested to annotate the masklet consistently throughout the video.

E.2.2 Data engine phase comparison

The comparison of data engine phases shown in Table 1 was conducted as a controlled experiment using 169 videos and 452 masklets. We ask three subsets of annotators to annotate the same set of objects with the annotation protocol from each phase. We categorize masklets into three buckets based on the mask area in the first frame (small: 1 to 32^{2}, medium: 32^{2} to 96^{2}, and large: equal or greater than 96^{2}). Phase 1 data is used as the quality reference, due to the high quality masks from frame-by-frame manual annotation with sam.
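The size buckets can be written as a small helper. The text does not say which bucket a mask of exactly 32^{2} pixels belongs to; this sketch assumes half-open intervals, so that value counts as medium.

```python
def size_bucket(first_frame_area):
    """Bucket a masklet by its mask area (in pixels) in the first frame."""
    if first_frame_area < 32 ** 2:
        return "small"
    if first_frame_area < 96 ** 2:
        return "medium"
    return "large"
```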

F Details on zero-shot transfer experiments

In this section, we describe further details of our zero-shot experiments (§6). Unless otherwise noted, the results reported in this section follow our default setup using Hiera-B+ image encoder with a resolution of 1024 and trained on the full combination of datasets, that is, sam 2 (Hiera-B+) in Table 6.

F.1 Zero-shot video tasks

F.1.1 Video dataset details

We evaluate sam 2 on a diverse benchmark of 17 zero-shot datasets: EndoVis 2018 contains medical surgery videos with robotic instruments. E.S.D contains videos from a robot manipulator camera often with motion blur. L.V.O.S.v 2 is a benchmark for long-term video object segmentation.

L.V-V.I.S contains videos from a diverse set of open-vocabulary object categories. U.V.O contains videos for open-world object segmentation, and vost contains videos with objects undergoing large transformations such as egg broken or paper torn. PumaVOS contains videos with segments around object parts such as a person's cheek.

Virtual Kitti 2 is a synthetic video dataset with driving scenes. VIPSeg provides object segmentation in panoptic videos. Wildfires contains wildfire videos under different conditions from the Corsican Fire Database. visor contains egocentric videos in kitchen scenes with segments around hands and active objects.

F.B.M.S provides motion segmentation over moving objects in videos. Ego-Exo4D is a large dataset with egocentric videos around various human activities. Cityscapes contains videos for urban driving scenes. Lindenthal Camera contains videos in a wildlife park with segments around observed animals such as birds and mammals.

H.T.1080.W.T Cells contains microscopy videos with cell segments. Drosophila Heart contains microscopy videos for the heart of fruit flies.

Among the 17 zero-shot video datasets above, 9 of them (EndoVis, E.S.D, L.V.O.S.v 2, L.V-V.I.S, U.V.O, vost, PumaVOS, Virtual Kitti 2, and VIPSeg) have dense object segments annotated for every video frame. In the remaining 8 datasets (Wildfires, visor, F.B.M.S, Ego-Exo4D, Cityscapes, Lindenthal Camera, H.T.1080.W.T Cells, and Drosophila Heart), the object segments are sparsely annotated over only a subset of video frames, and we compute the metrics on those frames where the ground-truth segmentation masks are available. In most evaluations of the paper, we only evaluate zero-shot performance on the 9 densely annotated datasets, while in our semi-supervised voss evaluation (§6.2), we evaluate on all 17 datasets listed above.

F.1.2 Interactive offline and online evaluation details

Offline evaluation involves multiple passes over the entire video. We start with click prompts on the first frame, segment the object throughout the entire video, and then in the next pass, we select the frame with the lowest segmentation IoU w.r.t. the ground-truth as the new frame for prompting. The model then segments the object again throughout the video based on all prompts received previously, until reaching a maximum of N frame passes (with one new prompted frame in each pass).
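The offline protocol can be sketched as a loop around two injected callables. Both `propagate` and `gt_iou` are hypothetical stand-ins for the model and the metric, and excluding already prompted frames from the search for the worst frame is our assumption.

```python
def offline_eval(num_frames, propagate, gt_iou, max_prompted=8):
    """Simulate multi-pass offline interaction.

    propagate(prompted_frames) -> list of masks, one per frame, computed
        from all prompts received so far (hypothetical model wrapper).
    gt_iou(t, mask) -> IoU of `mask` against the ground truth of frame t.
    Returns (final_masks, prompted_frames).
    """
    prompted = [0]  # the first pass always prompts frame 0
    masks = propagate(prompted)
    while len(prompted) < max_prompted:
        candidates = [t for t in range(num_frames) if t not in prompted]
        if not candidates:
            break
        # Next pass: prompt the frame with the lowest IoU.
        worst = min(candidates, key=lambda t: gt_iou(t, masks[t]))
        prompted.append(worst)
        masks = propagate(prompted)
    return masks, prompted
```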

Online evaluation involves only one pass over the entire video. We start with click prompts on the first frame and propagate the prompts across the video, pausing propagation when encountering a frame with a low-quality prediction (IoU less than 0.75 with ground-truth). We then add additional click prompts on the paused frame to correct the segment on this frame and resume the propagation forward until reaching another low quality frame with IoU less than 0.75. This is repeated while the number of prompted frames is less than the maximum N frame . Unlike the previous offline evaluation, in this setting, the new prompts only affect the frames after the current paused frame but not the frames before it.
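The single-pass online protocol can be sketched in the same style. The callables `predict`, `correct`, and `gt_iou` are hypothetical; the 0.75 threshold and the cap on prompted frames come from the text.

```python
def online_eval(num_frames, predict, correct, gt_iou,
                max_prompted=8, iou_thresh=0.75):
    """Simulate single-pass online interaction.

    predict(t) -> mask for frame t from the memory accumulated so far.
    correct(t) -> mask for frame t after adding click prompts on it.
    gt_iou(t, mask) -> IoU against the ground truth of frame t.
    Returns (masks, prompted_frames). Earlier frames are never revisited.
    """
    masks = [None] * num_frames
    prompted = [0]
    masks[0] = correct(0)  # initial click prompts on the first frame
    for t in range(1, num_frames):
        mask = predict(t)
        if gt_iou(t, mask) < iou_thresh and len(prompted) < max_prompted:
            # Pause, add clicks on this frame, then resume forward.
            mask = correct(t)
            prompted.append(t)
        masks[t] = mask
    return masks, prompted
```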

In both settings, we evaluate on the 9 densely annotated datasets in §F.1.1 (EndoVis, E.S.D, L.V.O.S.v 2, L.V-V.I.S, U.V.O, vost, PumaVOS, Virtual Kitti 2, and VIPSeg). If a video contains multiple objects to segment in its ground-truth annotations, we perform inference on each object independently. We simulate interactive video segmentation with N click = 3 clicks per frame, assuming that the user would visually locate the object to label it (with initial clicks) or to refine the current segmentation prediction of it (with correction clicks). Specifically, when starting the first pass (where there are not any existing predictions yet), we place an initial click on the first frame at the center of the object's ground-truth mask and then interactively add two more clicks based on the center of the error region (between the ground-truth mask and the predicted segments on the first frame). Then in subsequent passes (where there are already predicted segments), we interactively add three clicks based on the center of the error region (between the ground-truth mask and the predicted segments on the frame being prompted).

We report the average J and F metric over N frame equals 1 through 8 interacted frames and the J and F metrics under different annotation time on a video based on the following assumptions:

- On each frame, it takes T loc equals 1 sec for the annotator to visually locate an object in the frame, and T click equals 1.5 sec to add each click, following Delatolas et al. (2024).

- In offline mode, it takes T exam = 30 sec on a 300-frame video to examine the results throughout the video in each round, including finding the frame with the worst segmentation quality to add corrections (and for longer or shorter videos, this time is proportional to the video length L, assuming the annotator could examine the results at 10 F.P.S).

- In online mode, it takes T exam = 30 sec on a 300-frame video to follow the results throughout the video in total, including pausing at a frame with low quality for further corrections (and this time is proportional to the video length L similar to the offline mode).

- The annotation time for an object is (T exam times (L divided by 300) plus T loc plus T click times N click) times N frame in offline mode and T exam times (L divided by 300) plus (T loc plus T click times N click) times N frame in online mode, where L is the total frame number in the video, N frame equals 1 through 8 is the number of frames annotated (i.e., the number of interactive rounds), and N click equals 3 is the number of clicks per frame.
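The annotation-time model in the list above translates directly into a small function. The default constants are the values stated in the text; only the function and argument names are ours.

```python
def annotation_time(mode, video_len, n_frame, n_click=3,
                    t_exam=30.0, t_loc=1.0, t_click=1.5):
    """Estimated annotation time in seconds for one object.

    mode: "offline" (the video is examined once per round) or
          "online" (the video is examined once in total).
    video_len: total number of frames L in the video.
    n_frame: number of interacted frames (1 through 8 in the text).
    """
    exam = t_exam * video_len / 300.0
    per_frame = t_loc + t_click * n_click
    if mode == "offline":
        return (exam + per_frame) * n_frame
    if mode == "online":
        return exam + per_frame * n_frame
    raise ValueError("mode must be 'offline' or 'online'")
```

For a 300-frame video with 8 interacted frames, this gives 284 seconds offline and 74 seconds online, which shows how much of the offline cost comes from re-examining the video in every round.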

We show per-dataset results of sam 2 and the two baselines (sam+X.Mem++ and sam+Cutie, see their details below) for interactive offline and online evaluation in figure 12 and figure 13. sam 2 outperforms both baselines with a notable margin on all datasets and settings.

Figure 13 summary: This figure consists of several line charts and one bar chart. The line charts display the J&F performance metric relative to the annotation time for various datasets, comparing SAM 2 against SAM + XMem++ and SAM + Cutie. The bar chart shows the average J&F performance across multiple datasets for these three models. Across all datasets and annotation times, SAM 2 consistently achieves higher performance than the baseline models. The line charts indicate that as annotation time increases, performance generally improves for all models, though SAM 2 maintains a significant lead. The bar chart confirms that SAM 2 outperforms the other methods on almost every dataset evaluated.

F.1.3 Semi-Supervised voss Evaluation Details

In §6.2, we also compare with previous video tracking methods under the semi-supervised voss setting, where prompts (which can be foreground/background clicks, bounding boxes, or ground-truth object masks) are provided only on the first frame of the video. When using click prompts, we interactively sample either 1, 3 or 5 clicks on the first video frame, and then track the object based on these clicks. Following the click-based evaluation in prior work, the initial click is placed on the object center and subsequent clicks are obtained from the center of the error region.

Figure 12 summary: This figure consists of a series of line charts showing the zero-shot performance of SAM 2 compared to two baseline models, SAM + XMem++ and SAM + Cutie, across various datasets. Each chart plots the J&F performance metric against the total annotation time spent using a three-click interaction method per frame. The content displays how segmentation accuracy improves as more time is invested in annotation across multiple datasets, including EndoVis 2018, LV-VIS, VIPSeg, ESD, PUMaVOS, Virtual KITTI 2, LVOSv2, UVO, and VOST. Each plot tracks the performance trajectory of the three models as the number of interacted frames increases. Across all tested datasets, SAM 2 consistently achieves higher J&F scores than both baseline models. While all models generally show an upward trend in performance with increased annotation time, SAM 2 maintains a significant performance lead, demonstrating superior zero-shot capabilities and better efficiency in leveraging interactive annotations for video segmentation.

Table 4 summary: SAM 2 consistently outperforms the SAM + XMem++ and SAM + Cutie baselines across all evaluated datasets and in overall average performance. The improvements are observed across every individual benchmark, demonstrating the superior zero-shot interactive offline performance of SAM 2 compared to the other methods.

Similar to the interactive setting, here we also use sam+X.Mem++ and sam+Cutie as two baselines. For click or box prompts, sam is first used to handle click or bounding box inputs, and its output mask is then used as input to X.Mem++ or Cutie. For mask prompts, the ground-truth object masks on the first frame are directly used as input to X.Mem++ and Cutie – this is the standard semi-supervised voss setting and evaluates X.Mem++ and Cutie without using sam.

In this setting, we evaluate on all 17 zero-shot video datasets in §F.1.1. If a dataset does not follow the standard voss format, we preprocess it into a format similar to mose. During processing, we ensure that all objects in each video have a valid non-empty segmentation mask on the first frame to be compatible with semi-supervised voss evaluation. In case an object doesn't appear in the first frame, we create a separate video for it starting from the first frame where the object appears.
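The first-frame requirement can be illustrated with a helper that finds where each object's clip should start. It assumes, hypothetically, that annotations are available as per-frame mask areas for each object; objects that never appear are skipped.

```python
def clip_start_frames(areas_per_object):
    """Map each object id to the first frame where its mask is non-empty.

    areas_per_object: {object_id: [mask area in pixels for each frame]}.
    An object with start frame 0 stays in the original video; any other
    object gets its own clip beginning at the returned frame index.
    """
    starts = {}
    for obj_id, areas in areas_per_object.items():
        first = next((t for t, a in enumerate(areas) if a > 0), None)
        if first is not None:
            starts[obj_id] = first
    return starts
```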

We report the standard J&F metric for this evaluation. If a dataset provides an official evaluation toolkit, we use it for evaluation (on the vost dataset, we report the J metric instead, following its official protocol (Tokmakov et al., 2022)). The results are shown in Table 4, where sam 2 performs better than both baselines on the majority of the 17 datasets across different types of prompts.

We show per-dataset results of sam 2 and the two baselines (sam+X.Mem++ and sam+Cutie, see their details below) for semi-supervised voss evaluation in figure 14. sam 2 outperforms both baselines on the majority of these datasets across different types of prompts.

Figure 14 summary: This figure consists of a combination of line charts and grouped bar charts. The line charts illustrate the relationship between annotation time and performance for three different models across two specific datasets, while the bar charts compare the performance of these models across a wide variety of datasets under different interaction settings. In all cases, SAM 2 consistently outperforms the baseline models, SAM + XMem++ and SAM + Cutie. The line charts show that as more time is invested in annotation, performance increases for all models, with SAM 2 maintaining a lead throughout. The bar charts further demonstrate that SAM 2 achieves superior results across nearly all tested datasets, regardless of the number of interacted frames used during evaluation.

Table 13(b) summary: SAM 2 consistently outperforms both SAM + XMem++ and SAM + Cutie across all evaluated datasets and in the overall average. The performance gains are observed uniformly across all benchmarks, indicating that SAM 2 provides superior zero-shot interactive segmentation capabilities compared to the baseline methods.

F.1.4 sam+X.Mem++ and sam+Cutie Baseline Details

We adopt sam+X.Mem++ and sam+Cutie as two baselines for promptable video segmentation, where the click (or box) prompts are first processed by sam to obtain an object mask, and then X.Mem++ / Cutie models track this sam mask across the video to obtain the final masklet. In these two baselines, sam can be used to provide both an initial object mask on the first frame, or to correct an existing object mask output by X.Mem++ or Cutie. This is used for subsequent interacted frames during interactive offline and online evaluation, where new positive and negative clicks are provided as corrections over an existing mask.

When using sam to apply a correction over an existing mask prediction in a given frame, we follow the strategy in EVA-voss to first initialize sam with the X.Mem++ or Cutie output mask before incorporating the new correction clicks. Specifically, we first reconstruct the X.Mem++ or Cutie output mask in sam by sampling clicks from them and feeding them as inputs to sam until the reconstructed mask in sam reaches IoU greater than 0.8 with the X.Mem++ or Cutie output mask. Then, to incorporate new positive and negative clicks for correction, we concatenate these additional correction clicks with the initial clicks sampled during mask construction, and feed the joint concatenated list as input into sam to obtain the final corrected masks. We find that this strategy works better than several alternatives (such as feeding the X.Mem++ or Cutie output mask as a mask prompt together with new correction clicks into sam, or taking only the correction clicks as inputs to sam while ignoring the X.Mem++ or Cutie output mask).
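The correction strategy in this paragraph can be sketched as a loop over injected callables. `sam_predict`, `sample_click`, and `mask_iou` are hypothetical stand-ins (the text does not specify how reconstruction clicks are sampled), and the `max_clicks` safety cap is our addition.

```python
def correct_with_sam(tracker_mask, correction_clicks, sam_predict,
                     sample_click, mask_iou, iou_thresh=0.8, max_clicks=16):
    """Re-create a tracker mask inside sam, then apply correction clicks.

    sam_predict(clicks) -> mask predicted by sam from a list of clicks.
    sample_click(tracker_mask, current_mask) -> next reconstruction click.
    mask_iou(a, b) -> IoU between two masks.
    Returns (corrected_mask, all_clicks).
    """
    clicks, recon = [], None
    # Step 1: add clicks until sam reproduces the tracker output closely.
    while len(clicks) < max_clicks:
        clicks.append(sample_click(tracker_mask, recon))
        recon = sam_predict(clicks)
        if mask_iou(recon, tracker_mask) > iou_thresh:
            break
    # Step 2: concatenate the correction clicks and predict once more.
    all_clicks = clicks + list(correction_clicks)
    return sam_predict(all_clicks), all_clicks
```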

F.2 Davis Interactive Benchmark

We also evaluate on the Davis interactive benchmark, which resembles our interactive offline evaluation in §6.1, where in each round of interaction, the evaluation server would provide new annotations on frames with the worst segmentation performance. The official Davis eval toolkit provides scribble prompts during interactions, while other work such as CiVOS has also extended this to cover click prompts.

Here we follow CiVOS to use positive and negative clicks as input prompts and adopt the same strategy for click sampling. We report the J and F at 60 seconds and A.U.C J and F metrics on this benchmark as provided by its evaluator, and compare to two baselines: MiVOS, which directly uses the provided scribbles via a scribble-to-mask module (and is also extended to click prompts in Vujasinović et al. (2022)), and CiVOS, which samples clicks from the provided scribbles. The results are shown in Table 14, where sam 2 (based on click inputs) outperforms both baselines under click inputs. We note that sam 2 often tends to segment object parts (e.g. a person's arm) on the first click while the Davis dataset mainly contains whole objects (e.g. an entire person), which could penalize sam 2's J&F performance on this benchmark.

Table 14 summary: When evaluated on the DAVIS interactive benchmark using click-based inputs, SAM 2 achieves the highest performance across both measured metrics compared to other models. While MiVOS shows strong results when using scribbles, SAM 2 outperforms both the click-based versions of MiVOS and CiVOS.

F.3 Zero-shot image tasks

F.3.1 Dataset details

For the interactive segmentation task, we evaluated sam 2 on a comprehensive suite of 37 datasets. This suite includes the 23 datasets previously used by sam for zero-shot evaluation. For completeness, we list the 23 datasets: L.V.I.S, A.D.E.20.K, Hypersim, Cityscapes, B.B.B.C.038.v1, Doors, dram, EgoHOS, G.T.E.A, iShape, N.D.D 20, NDISPark, ovis, P.P.D.L.S, Plittersdorf, Streets, TimberSeg, TrashCan, visor, WoodScape, PIDRay, ZeroWaste-f, and I.B.D. For more detailed information about these datasets, we refer the reader to Kirillov et al. (2023). In addition to these 23 datasets, we evaluated on frames sampled from 14 video datasets to assess sam 2's performance on images from the video domain. The video datasets used are listed as follows: Lindenthal Camera Traps (L.C.T), vost, L.V-V.I.S, F.B.M.S, Virtual Kitti 2, Corsican Fire Database (C.F.D), VIPSeg, Drosophila Heart O.C.M (D.H O.C.M), EndoVis 2018, E.S.D, U.V.O, Ego-Exo4D, L.V.O.S.v 2, and H.T.1080.W.T. Table 16 has a more detailed description of these datasets. (Some of these datasets are obtained from the same data source as the zero-shot video datasets in §F.1.1.)

Table 16 summary: This table details a diverse collection of video segmentation datasets used for zero-shot evaluation, spanning various domains such as microscopy, driving, egocentric views, medical surgery, and wildlife. The datasets vary significantly in scale and characteristics, with some containing a very small number of videos and others comprising thousands of samples. There is a mix of dense and sparse annotation types, and the volume of sampled frames and masks fluctuates greatly across the different video types, reflecting a wide range of data densities and complexities.

F.3.2 Detailed zero-shot experiments

In this section, we include a more detailed version of the experiments in §6.3. We compare sam 2 to sam and HQ-sam with different model sizes in Table 15. The main metrics we use for evaluation are the 1- and 5-click mIoU and we categorize the results by the dataset domain.

Table 15 first shows a comparison of the models trained only on images (for the S.A task) with different image encoder sizes on both the S.A-23 benchmark as well as the 14 newly introduced video datasets. sam 2 (Hiera-B+) trained only on S.A-1B outperforms sam (ViT-H) on 1-click accuracy, and both sam (ViT-H) and HQ-sam (ViT-H) on 5-click accuracy while being 6x faster. sam 2 (Hiera-L) further improves the 1-click accuracy by 1 point on average, at the cost of speed. Despite being slower than Hiera-B+, it is still 3.4x faster than sam (ViT-H) and 1.5x faster than sam (ViT-B).

The last two rows in Table 15 illustrate the benefits of training with our mix of image and video data, which boosts the average accuracy to 61.4% across the 23 datasets with the Hiera-B+ image encoder. Additionally, we observe substantial improvements on the video benchmarks of S.A-23 as well as the 14 newly introduced video datasets. We note that we do not scale beyond Hiera-L, but expect better performance for a larger model.

A breakdown of the accuracy across datasets is presented in figure 15, where the per-dataset delta in 1-click mIoU relative to sam is color-coded to indicate the data type (image or video). Notably, sam 2 (Hiera-B+) surpasses sam on 29 datasets by up to 53.9 mIoU, despite using a smaller Hiera-B+ image encoder.

Figure 15 summary: This figure is a horizontal bar chart. It displays the difference in mean intersection over union for a single center click when comparing the zero-shot performance of SAM 2 against SAM across several dozen datasets, categorized by whether the data originates from a video or image domain. The majority of the datasets show a positive delta, indicating that SAM 2 generally outperforms SAM. The most substantial improvements are observed in datasets from the video domain, while some image-based datasets show minimal change or a slight decrease in performance.

G Details on Comparison to State-of-the-Art in Semi-Supervised voss

We provide additional details on the comparison to the previous state-of-the-art in semi-supervised voss (§7). We include results from sam 2 trained only on S.A-1B, S.A-V and Internal data, for different encoder sizes.

Qualitative comparison: In figure 16, we show a comparison between our baseline (Cutie-base+, top row) and our model (sam 2, bottom row) when prompted with a mask in the first frame. While the mask prompt in the first frame only covers the shirt of the person, the masklet predicted by the baseline wrongfully propagates to the whole person. Our model, however, is able to restrict the masklet to the target object.

Figure 16 summary: This figure consists of two rows of image sequences showing object segmentation over time. The top row displays results from a baseline model, while the bottom row shows results from the proposed model, both initiated with a mask in the first frame to track a person on a swing. The sequences illustrate the ability of each model to maintain a segmentation mask on a moving subject across multiple frames. The baseline model shows significant mask drift and expansion, eventually covering unrelated parts of the background and other people in the scene. In contrast, the proposed model maintains a tight and consistent mask around the target subject throughout the entire sequence. It can be inferred that the proposed model is significantly more robust and accurate at temporal object tracking than the baseline. While the baseline fails to preserve the object boundaries as the motion increases, the proposed model demonstrates superior stability and precision in segmenting the target person across the video frames.

Quantitative comparison: In Table 17, we compare the performance of our model to previous approaches on additional semi-supervised voss metrics. sam 2 outperforms prior work on all evaluated benchmarks, in all metrics. Note that unlike these previous approaches, sam 2 is not specialized in the semi-supervised voss task but is capable of more general promptable segmentation. sam 2 is also not restricted to a specific set of object classes. The performance of our model on the S.A-V benchmark (Table 17a) demonstrates its capability to segment anything in a video.

Table 17 summary: SAM 2 consistently outperforms previous state-of-the-art methods across multiple semi-supervised video object segmentation benchmarks, including SA-V, LVOS, LVOSv2, MOSE, DAVIS17, and YTVOS19. Across all evaluation metrics such as J&F, J, and F, the various SAM 2 model scales generally achieve higher results than prior works like Cutie, DEVA, and XMem. Within the SAM 2 family, the Hiera-L variant typically yields the best performance, while the versions trained on additional datasets (marked with a double dagger) show further improvements in several benchmarks.

H Model, data and annotation cards

H.1 Model card

{Model Overview}

Name | sam 2 (Segment Anything Model 2)

Version | 1.0

Date | 2024

Organization | Meta Fair

Model type | Promptable segmentation model

Architecture | See Section 4

Repository | github dot com U.R.L U.R.L

License | Apache 2.0

Intended Use

Primary intended users: sam 2 was designed as a unified model for promptable video and image segmentation tasks. The model was primarily developed for research use cases. sam 2 is released under an Apache 2.0 license.

Out-of-scope use cases: See Ethical considerations and license for restrictions.

Caveats and recommendations: See Appendix C for limitations.

Relevant Factors

Groups: sam 2 is class agnostic and was designed for promptable image and video segmentation. It can segment and track any object.

Instrumentation and environment: sam 2 was evaluated across a variety of types of video and image data. The video benchmark suite included domains such as driving data, microscopy, egocentric video, and robotic surgery. See Table 16 for descriptions of the benchmarks and Figure 17 for example frames. sam 2 was evaluated on the same suite of image benchmarks as Kirillov et al. (2023), which covers domains including underwater images, paintings, and fish-eye images.

Figure 17 summary: This figure is a collection of image frames. It displays a variety of video sequences from the SAM 2 zero-shot video benchmark suite, showcasing segmentation masks applied to different objects across diverse environments. The examples demonstrate the model's ability to track and segment objects in various contexts, including indoor scenes, underwater environments, medical imaging, outdoor landscapes, and urban traffic. The results indicate that the model can effectively handle a wide range of object scales and categories without prior training on these specific datasets, suggesting strong generalization capabilities for zero-shot video object segmentation.

{Metrics}

We evaluate the performance of sam 2 using the following metrics:

J&F: We evaluate performance using J&F for the promptable video segmentation and semi-supervised voss tasks.

G: We use G for evaluation on Y.T.V.O.S 2019 for the semi-supervised voss task.

mIoU: We evaluate performance using mIoU for the promptable image segmentation task.

Evaluation Data

{Training Data}

Data source: sam 2 was trained on the S.A-V dataset alongside internally available licensed video data. See Section 5 of the main text for more details and Appendix H.2 for the S.A-V dataset data card.
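As an illustration of the metrics named in the model card, the following minimal sketch computes the region similarity J (the Jaccard index, or IoU, between a predicted and a ground-truth mask), averages it over several pairs to obtain mIoU, and combines J with a boundary score F by taking their mean, which is the usual J&F convention. The function names are hypothetical, the masks are plain nested lists for self-containment, the boundary F-measure itself is not implemented, and treating two empty masks as a perfect match is an assumption of this sketch rather than something stated in the paper.

```python
from typing import Iterable, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def region_iou(pred: Mask, gt: Mask) -> float:
    """Jaccard index (the J in J&F): |pred AND gt| / |pred OR gt|."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumption of this sketch: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """mIoU: average IoU over (prediction, ground truth) pairs."""
    ious = [region_iou(pred, gt) for pred, gt in pairs]
    return sum(ious) / len(ious)


def j_and_f(j: float, f: float) -> float:
    """J&F: mean of region similarity J and boundary accuracy F.

    F must be computed separately; it is not implemented here.
    """
    return (j + f) / 2.0


# Tiny worked example: 2 overlapping pixels, 4 pixels in the union.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
gt_mask = [[1, 0, 0],
           [0, 1, 1]]
example_iou = region_iou(pred_mask, gt_mask)
print(example_iou)
```

In practice these quantities are computed per object and per frame and then averaged over a benchmark; the sketch only shows the per-mask arithmetic.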

Ethical Considerations

Data: See Section 5 for more details about the sam 2 training data. In Section E.1 we show a geographic distribution of the videos and demographic distribution of the crowdworkers who collected the videos in the S.A-V dataset.

Cost and impact of compute: The released sam 2 was trained on 256 A.100 GPUs for 108 hours. This corresponds to 12165.12 kWh and estimated emissions of 3.89 metric tons of C.O.2.e. The emissions from training the released sam 2 are equivalent to approximately 10k miles driven by an average gasoline-powered passenger vehicle.
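The compute figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated in the model card (GPU count, training hours, energy, emissions, and the mileage equivalence) and derives the per-GPU power draw, carbon intensity, and per-mile emissions that those numbers imply. The derived values are back-calculated for illustration; the paper does not state the power or carbon-intensity assumptions it used.

```python
# Figures stated in the model card.
NUM_GPUS = 256
TRAINING_HOURS = 108
ENERGY_KWH = 12165.12
EMISSIONS_TONNES_CO2E = 3.89
EQUIVALENT_MILES = 10_000  # "approximately 10k miles"

# Total accelerator time.
gpu_hours = NUM_GPUS * TRAINING_HOURS

# Values implied by the stated figures (derived here, not quoted from the paper).
implied_kw_per_gpu = ENERGY_KWH / gpu_hours
implied_kg_co2e_per_kwh = EMISSIONS_TONNES_CO2E * 1000 / ENERGY_KWH
implied_g_co2e_per_mile = EMISSIONS_TONNES_CO2E * 1_000_000 / EQUIVALENT_MILES

print(f"GPU-hours: {gpu_hours}")
print(f"Implied average draw per GPU: {implied_kw_per_gpu:.2f} kW")
print(f"Implied carbon intensity: {implied_kg_co2e_per_kwh:.3f} kg CO2e/kWh")
print(f"Implied vehicle emissions: {implied_g_co2e_per_mile:.0f} g CO2e/mile")
```

The stated energy works out to 27,648 GPU-hours at about 0.44 kW per GPU, so the reported numbers are internally consistent.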

Risks and harms: In Section E.1.1 of the main text we analyze sam 2 performance on people across demographic groups. When using sam 2 in new settings, we suggest that researchers perform their own fairness evaluation for sam 2 specific to their use case.

Use cases: We implore users to use their best judgement.

H.2 Dataset Card for S.A-V Dataset

Motivation

1. For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description. The dataset was designed for the P.V.S task. The contributions of our dataset to the vision community are: (1) The dataset, composed of 50.9 thousand videos and 642.6 thousand masklets, is the largest video segmentation dataset publicly available today (see Section 5.2 for comparisons to current voss datasets). (2) The dataset is available under a Creative Commons Attribution 4.0 International Public License at ai dot meta dot com U.R.L. (3) The dataset is a more geographically diverse, publicly available video segmentation dataset than its predecessors.

2. Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? The dataset was created by Meta Fair. The underlying videos were collected via a contracted third party company.

3. Who funded the creation of the dataset? The dataset was funded by Meta Fair.

4. Any other comments? No.

Composition

1. What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description. All of the instances in the dataset are videos. Subject matter diversity was encouraged and no specific themes were applied during video collection. Common themes of the video include: locations, objects, scenes. All the videos are distinct, however there are some sets of videos that were taken of the same subject matter.

2. How many instances are there in total (of each type, if appropriate)? There are 50.9 thousand videos.

3. Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable). While the dataset contains all possible instances, reviewers were advised to refuse to annotate content containing explicit imagery.

4. What data does each instance consist of? "Raw" data (e.g., unprocessed text or images) or features? In either case, please provide a description. Each instance is a video.

5. Is there a label or target associated with each instance? If so, please provide a description. Each video is annotated with masklets that track objects throughout the video. There are no categories or text associated with the masklets. The data was annotated at 6 F.P.S. There are on average 3.8 manual masklets and 8.9 auto masklets per video, and there are 642.6 thousand masklets in total.
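The per-video averages and the total masklet count quoted above can be checked against each other. The sketch below multiplies the stated video count by the stated averages and asks whether the reported total falls inside the range allowed by rounding each average to one decimal place. The rounding half-width of 0.05 is an assumption about how the averages were reported.

```python
# Figures stated in the dataset card.
NUM_VIDEOS = 50_900
AVG_MANUAL_PER_VIDEO = 3.8
AVG_AUTO_PER_VIDEO = 8.9
REPORTED_TOTAL_MASKLETS = 642_600

# Assumption: each average was rounded to one decimal, so it may be off by 0.05.
ROUNDING_HALF_WIDTH = 0.05

avg_total = AVG_MANUAL_PER_VIDEO + AVG_AUTO_PER_VIDEO
estimated_total = NUM_VIDEOS * avg_total

# Range of totals compatible with the rounded averages.
low_total = NUM_VIDEOS * (avg_total - 2 * ROUNDING_HALF_WIDTH)
high_total = NUM_VIDEOS * (avg_total + 2 * ROUNDING_HALF_WIDTH)

consistent = low_total <= REPORTED_TOTAL_MASKLETS <= high_total

print(f"Estimated total from averages: {estimated_total:,.0f}")
print(f"Compatible range: {low_total:,.0f} to {high_total:,.0f}")
print(f"Reported total within range: {consistent}")
```

The estimate from the rounded averages is about 646 thousand, within 1% of the reported 642.6 thousand, so the figures agree up to rounding.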

6. Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, for example, redacted text. No.

7. Are relationships between individual instances made explicit (e.g., users' movie ratings, social network links)? If so, please describe how these relationships are made explicit. No.

8. Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. For manual masklets, human errors may exist; for example, annotators may miss a frame to check or fix when needed. For auto masklets, as sam 2 is used to generate them, model errors such as inconsistencies in the masklets may exist.

9. Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (e.g., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate. The dataset is self contained.

10. Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? If so, please provide a description. No.

11. Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why. We have three safety measures to prevent objectionable content: (1) The crowdworkers who collected the videos were instructed not to record videos that might contain objectionable content (e.g., graphic content, nudity, or other inappropriate content). (2) The expert annotators who annotated the videos were instructed to flag and reject videos if objectionable content was present. (3) Reports about video(s) in the dataset can be submitted to segment-anything@meta.com.

12. Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset. The dataset does not identify any subpopulations of the people in the videos. The demographics of the crowdworkers who collected the videos in the dataset are presented in Section 5.2.

13. Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how. Videos were subjected to a face blurring model. Reports about videos in the dataset can be submitted to segment-anything@meta.com.

14. Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description. The dataset is not focused on data that may be considered sensitive. Reports about videos in the dataset can be submitted to segment-anything@meta.com.

15. Any other comments? No.

Collection Process

1. How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If the data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how. The released masklets associated with each video were collected using two methods: (1) manual annotation assisted by sam 2, and (2) automatic generation by sam 2 followed by annotator verification.

2. What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation, software programs, software A.P.I's)? How were these mechanisms or procedures validated? The videos in the dataset were collected via a contracted third-party vendor. They are videos taken by crowdworkers with unknown equipment.

3. If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? N/A.

4. Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? (1) The videos in the dataset were collected via a contracted third-party vendor. They are videos taken by crowdworkers who were compensated with an hourly wage set by the vendor. (2) The manually collected masklets in the dataset were collected by annotators via another third-party vendor. Annotators were compensated with an hourly wage set by the vendor.

5. Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created. The videos were filmed between November 2023 and March 2024. The masklet annotations were collected between April 2024 and July 2024.

6. Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation. If the dataset does not relate to people, you may skip the remaining questions in this section. The project underwent an internal review process.

7. Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g. websites)? We contracted with third-party vendors to collect the videos and to generate or review annotations.

8. Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself. The videos were collected by crowdworkers via a contracted third-party vendor. The crowdworkers agreed to consent forms.

9. Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented. The videos were collected via a contracted third-party who provided appropriate representations regarding the collection of any notices and consents as required from individuals.

10. If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate). Pursuant to the contract, the contracted third-party collected consents and provided opportunity for consent revocation.

11. Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation. See Section E.1.1 for details.

12. Any other comments? No.

Preprocessing / Cleaning / Labeling

1. Was any preprocessing / cleaning / labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, sift feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remaining questions in this section. The videos were re-sampled to 24 fps and converted to mp4 format.
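Combined with the 6 F.P.S annotation rate stated in the Composition section, the 24 fps resampling implies that one in every four video frames carries an annotation. The sketch below lists which frame indices would be annotated under that ratio. The function name is hypothetical, and the choice to start at frame 0 with even spacing is an assumption of this sketch, not something the dataset card specifies.

```python
VIDEO_FPS = 24       # resampled video frame rate stated above
ANNOTATION_FPS = 6   # annotation rate stated in the Composition section


def annotated_frame_indices(num_frames: int,
                            video_fps: int = VIDEO_FPS,
                            annotation_fps: int = ANNOTATION_FPS) -> list:
    """Return indices of frames that carry annotations.

    Assumes annotations start at frame 0 and are evenly spaced.
    """
    if video_fps % annotation_fps != 0:
        raise ValueError("video_fps must be a multiple of annotation_fps")
    stride = video_fps // annotation_fps  # 24 // 6 == 4
    return list(range(0, num_frames, stride))


# One second of video (24 frames) yields 6 annotated frames.
one_second = annotated_frame_indices(24)
print(one_second)
```

Anyone sampling frames for image segmentation training would need to confirm the actual alignment against the released annotation files before relying on this indexing.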

2. Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the "raw" data. No.

Uses

1. Has the dataset been used for any tasks already? If so, please provide a description. The dataset has been used to train and evaluate sam 2.

2. What (other) tasks could the dataset be used for? The data could be used for voss, iVOS, or P.V.S tasks. If frames are sampled from the videos, the dataset can be used for the image segmentation task.

3. Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other risks or harms (e.g., legal risks, financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or harms? We provide an analysis of the geography and crowdworker demographics of our dataset in Section 5.2. While we believe our dataset to be more representative on these factors than most publicly available datasets of its kind at this time, we acknowledge that we do not have parity across all geographic and demographic groups, and we encourage users of the dataset to be mindful of any potential biases that models may learn from this dataset.

4. Are there tasks for which the dataset should not be used? If so, please provide a description. No. Full terms of use for the dataset can be found at ai dot meta dot com U.R.L.

5. Any other comments? No.

Distribution

1. Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description. The dataset will be available under the permissive Creative Commons Attribution 4.0 International Public License.

2. How will the dataset be distributed (e.g., tarball on website, A.P.I, GitHub)? Does the dataset have a digital object identifier (D.O.I)? The dataset is available at ai dot meta dot com U.R.L.

3. When will the dataset be distributed? The dataset will be distributed in July 2024.

4. Will the dataset be distributed under a copyright or other intellectual property (I.P) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions. Yes, the dataset will be available under the Creative Commons Attribution 4.0 International Public License. The license agreement and terms of use for the dataset can be found at ai dot meta dot com U.R.L. Users must agree to the terms of use before downloading or using the dataset.

5. Have any third parties imposed I.P-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions. Full terms of use and restrictions on use of the S.A-V dataset can be found at ai dot meta dot com U.R.L.

6. Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation. The license and restrictions on use of the S.A-V dataset can be found at ai dot meta dot com U.R.L.

7. Any other comments? No.

Maintenance

1. Who will be supporting/hosting/maintaining the dataset? The dataset will be hosted at ai dot meta dot com U.R.L and maintained by Meta Fair.

2. How can the owner/curator/manager of the dataset be contacted (e.g., email address)? Please email segment-anything@meta.com.

3. Is there an erratum? If so, please provide a link or other access point. No.

4. Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (e.g., mailing list, GitHub)? Updates may be made in response to reports received at segment-anything@meta.com.

5. If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced. There are no limits on data retention.

6. Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers. No. If updates are made to the dataset, previous versions will not continue to be hosted.

7. If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description. We encourage further annotations for S.A-V, but these will not be validated/verified or supported/hosted/maintained by Meta.

8. Any other comments? No.

H.3 Data annotation card

Task Formulation

1. At a high level, what are the subjective aspects of your task? Selecting objects to mask and track in a video is inherently a subjective task, and annotators might differ in their decision to mask objects.

2. What assumptions do you make about annotators? We assume our annotators understand the P.V.S task and are well trained on video related tasks. Our annotators worked full time on our annotation task. This made it possible to train the annotators by sharing feedback on a regular basis.

3. How did you choose the specific wording of your task instructions? What steps, if any, were taken to verify the clarity of task instructions and wording for annotators? (1) The task instructions included visual examples (images and videos) to provide clarity. (2) Annotators were well trained before working on production queues. (3) The research team shared feedback daily and met with the annotators weekly for Q&A sessions.

4. What, if any, risks did your task pose for annotators and were they informed of the risks prior to engagement with the task? Annotators were instructed to reject objectionable videos.

5. What are the precise instructions that were provided to annotators? See detail in 11 for annotation instructions.

Selecting Annotations

1. Are there certain perspectives that should be privileged? If so, how did you seek these perspectives out? We chose to work with annotators with previous video annotation experience.

2. Are there certain perspectives that would be harmful to include? If so, how did you screen these perspectives out? No.

3. Were sociodemographic characteristics used to select annotators for your task? If so, please detail the process. For masklet annotations, sociodemographic characteristics were not used to select the annotators. For video collection, we emphasized the importance of diversity among the crowdworkers to our third-party vendor. While it was not a strict requirement, we encouraged the inclusion of a diverse group of crowdworkers to enrich the data collection process with a wide range of perspectives. This approach aimed to naturally incorporate diversity without imposing strict selection based on sociodemographic factors.

4. If you have any aggregated socio-demographic statistics about your annotator pool, please describe. Do you have reason to believe that sociodemographic characteristics of annotators may have impacted how they annotated the data? Why or why not? Aggregated socio-demographic statistics about the crowdworkers who collected the videos are presented in Section 5.2.

5. Consider the intended context of use of the dataset and the individuals and communities that may be impacted by a model trained on this dataset. Are these communities represented in your annotator pool? The S.A-V dataset is a geographically diverse, publicly available video segmentation dataset, as discussed in Section 5.2. In addition, we analyze the responsible A.I axes of a model trained on the dataset, as discussed in Section E.1.1.

Platform and Infrastructure Choices

1. What annotation platform did you utilize? At a high level, what considerations informed your decision to choose this platform? Did the chosen platform sufficiently meet the requirements you outlined for annotator pools? Are any aspects not covered? We used an internal annotation platform.

2. What, if any, communication channels did your chosen platform offer to facilitate communication with annotators? How did this channel of communication influence the annotation process and/or resulting annotations? The research team shared feedback daily and met with the annotators weekly to align on the task instructions and expectations and to hold Q&A sessions. Outside of those sessions, annotators had access to a spreadsheet and chat group to facilitate communication with the research team.

3. How much were annotators compensated? Did you consider any particular pay standards, when determining their compensation? If so, please describe. The video collecting crowdworkers were compensated with an hourly wage set by the vendor. Annotators were compensated with an hourly wage set by the vendor.

Dataset Analysis and Evaluation

1. How do you define the quality of annotations in your context, and how did you assess the quality in the dataset you constructed? Annotators were required to complete training before moving to production queues. Annotators followed a 2-day training session led by the vendor and were then asked to annotate jobs from a training queue. Annotators were able to move from training to production after the vendor Q&A team or the research team reviewed their work and assessed quality. On average, annotators spent 1 to 2 weeks in training before moving to production. Similarly, the vendor Q&A team and the research team manually reviewed the production queues' annotations and shared feedback daily.

2. Have you conducted any analysis on disagreement patterns? If so, what analyses did you use and what were the major findings? Did you analyze potential sources of disagreement? The disagreement patterns were shared daily and weekly during feedback and Q&A sessions.

3. How do the individual annotator responses relate to the final labels released in the dataset? The final labels are the individual annotator responses after data cleaning and post-processing.

Dataset Release and Maintenance

1. Do you have reason to believe the annotations in this dataset may change over time? Do you plan to update your dataset? No.

2. Are there any conditions or definitions that, if changed, could impact the utility of your dataset? No.

3. Will you attempt to track, impose limitations on, or otherwise influence how your dataset is used? If so, how? The S.A-V dataset is released under a permissive C.C by 4.0 license.

4. Were annotators informed about how the data is externalized? If changes to the dataset are made, will they be informed? No.

5. Is there a process by which annotators can later choose to withdraw their data from the dataset? If so, please detail. No.
