Introduction

In April 2023, Meta AI Research released Segment Anything (SAM). It generates segmentation masks for all objects it can detect, or masks derived from input prompts such as points or boxes. The paper mentions text prompts as well, but as of 2023-04-30 they do not seem to be released yet. However, there is Grounded-Segment-Anything by IDEA (International Digital Economy Academy), which seems to include text and voice prompts. Additionally, it includes Stable Diffusion.

Here are some examples using the interactive demo.

Hovering over a pixel essentially provides that pixel as the input prompt. Even slight shifts can yield different segmentation results. This becomes very apparent when hovering over the tail section of the aircraft.

Providing single or multiple point prompts works great until it suddenly breaks and yields something unexpected:

The box prompt seems to handle such cases better than multiple points:

The auto mask approach applies a regular grid over the full image as prompt input and returns the masks:

Using the web demo it becomes apparent that the prompt influences segmentation results heavily. In a way, asking the right questions and providing the right input is key to getting useful results. This makes SAM super interesting to adapt for annotation tasks that involve segmentation.

Post-processing Results

Such an interactive web demo is of little use beyond demonstration purposes, though. Building anything on top of the model requires knowledge of its inputs and outputs, and of how to parse them.

The simplest way to get started is the automatic mask generator (SamAutomaticMaskGenerator), which allows some fine-tuning of thresholds at initialization.

Init signature:
SamAutomaticMaskGenerator(
    model: segment_anything.modeling.sam.Sam,
    points_per_side: Optional[int] = 32,
    points_per_batch: int = 64,
    pred_iou_thresh: float = 0.88,
    stability_score_thresh: float = 0.95,
    stability_score_offset: float = 1.0,
    box_nms_thresh: float = 0.7,
    crop_n_layers: int = 0,
    crop_nms_thresh: float = 0.7,
    crop_overlap_ratio: float = 0.3413333333333333,
    crop_n_points_downscale_factor: int = 1,
    point_grids: Optional[List[numpy.ndarray]] = None,
    min_mask_region_area: int = 0,
    output_mode: str = 'binary_mask',
) -> None
mask_auto_generator = SamAutomaticMaskGenerator(sam)
masks = mask_auto_generator.generate(image)

masks is of type list and len(masks) returns the number of masks detected/generated. Each predicted mask is stored as a dict, which allows us to check what keys are available:

dict_keys(['segmentation', 'area', 'bbox', 'predicted_iou', 'point_coords', 'stability_score', 'crop_box'])

The masks do not seem to be sorted by size or score. If any filtering by size or score is desired, all masks need to be iterated over.
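Since the output is just a list of dicts, such filtering and sorting is straightforward. A minimal sketch using stand-in dicts (the keys mirror the output format above, but the values and the 0.9 threshold are made up for illustration):

```python
# Stand-in mask dicts mimicking SAM's automatic mask generator output.
masks = [
    {"area": 120, "predicted_iou": 0.91},
    {"area": 5400, "predicted_iou": 0.97},
    {"area": 800, "predicted_iou": 0.62},
]

# Keep only confident masks, then sort from largest to smallest area.
confident = [m for m in masks if m["predicted_iou"] >= 0.9]
confident_sorted = sorted(confident, key=lambda m: m["area"], reverse=True)
```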

area indicates the number of active pixels in a mask, not the area of bbox. segmentation contains a 2D boolean mask (values True/False) which can be applied to the input image.
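For instance, the boolean segmentation array can be used to blank out everything outside a mask via NumPy broadcasting. A sketch on a tiny synthetic image (not actual SAM output):

```python
import numpy as np

# A tiny 2x2 RGB "image" and a boolean mask of the same height/width,
# mimicking the 'segmentation' entry of a SAM mask dict.
image = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)
segmentation = np.array([[True, False], [False, True]])

# Zero out all pixels outside the mask to visualize the segment.
masked = np.where(segmentation[..., None], image, 0)

# 'area' corresponds to the number of True pixels in the mask.
area = int(segmentation.sum())
```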

The output format when using SamPredictor is slightly different. First of all, it comes with two predict functions: predict expects numpy.ndarray inputs, whereas predict_torch expects torch.Tensor inputs. Furthermore, predict cannot simply be called with the image as one of its arguments. The image needs to be set up front using .set_image(image). The idea behind this is that image encoding and prompting are independent of each other. The predictor accepts various prompt inputs and returns three numpy arrays. For additional refinement, an existing mask can be passed to the predictor (mask_input) along with the other prompts.

predictor.predict(
    point_coords: Optional[numpy.ndarray] = None,
    point_labels: Optional[numpy.ndarray] = None,
    box: Optional[numpy.ndarray] = None,
    mask_input: Optional[numpy.ndarray] = None,
    multimask_output: bool = True,
    return_logits: bool = False,
) -> Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

The arrays returned are “masks”, “scores” and “logits”. By default (multimask_output=True) three masks are returned, and this seems to be the recommended setting to improve quality even if only one mask is desired. Scores are provided in a separate numpy array. As with automatic mask generation, the masks are not sorted by score.
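Because the masks are not sorted, picking the best one means taking the argmax over the scores. A sketch with mock arrays shaped like predict's return values (the mask contents and score values here are invented):

```python
import numpy as np

# Mock outputs shaped like SamPredictor.predict with multimask_output=True:
# three boolean masks plus one score per mask.
masks = np.zeros((3, 4, 4), dtype=bool)
masks[1, 1:3, 1:3] = True  # pretend mask 1 covers a 2x2 region
scores = np.array([0.70, 0.95, 0.88])

# Select the highest-scoring mask, since the array is not sorted by score.
best_idx = int(np.argmax(scores))
best_mask = masks[best_idx]
```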