Introduction
In April 2023, Meta AI Research released Segment Anything (SAM). It generates segmentation masks for all objects it can detect, or masks derived from input prompts such as points or boxes. The paper mentions text prompts as well, but as of 2023-04-30 they do not seem to be released yet. However, there is Grounded-Segment-Anything by IDEA (International Digital Economy Academy), which seems to include text and voice prompts. Additionally, it includes Stable Diffusion.
Here are some examples using the interactive demo.
Hovering above a pixel essentially provides that pixel as an input prompt. Even slight shifts can yield different segmentation results. This becomes very apparent when hovering over the tail section of the aircraft.
Providing single or multiple point prompts works great until it suddenly breaks and yields something unexpected:
The box prompt seems to handle such cases better than multiple points:
The auto mask approach applies a regular grid of points over the full image as prompt input and returns the masks:
Using the web demo, it becomes apparent that the prompt heavily influences segmentation results. In a way, asking the right questions/providing the right input is key to getting useful results. This makes SAM super interesting to adapt for annotation tasks that involve segmentation.
Post-processing Results
However, such an interactive web demo is of little use beyond demonstration purposes. Building such a demo requires knowledge of the model's inputs and outputs, and how to parse them.
The simplest way to get started is the automatic mask generator (SamAutomaticMaskGenerator), which allows some fine-tuning of thresholds at initialization.
Init signature:
SamAutomaticMaskGenerator(
model: segment_anything.modeling.sam.Sam,
points_per_side: Optional[int] = 32,
points_per_batch: int = 64,
pred_iou_thresh: float = 0.88,
stability_score_thresh: float = 0.95,
stability_score_offset: float = 1.0,
box_nms_thresh: float = 0.7,
crop_n_layers: int = 0,
crop_nms_thresh: float = 0.7,
crop_overlap_ratio: float = 0.3413333333333333,
crop_n_points_downscale_factor: int = 1,
point_grids: Optional[List[numpy.ndarray]] = None,
min_mask_region_area: int = 0,
output_mode: str = 'binary_mask',
) -> None
mask_auto_generator = SamAutomaticMaskGenerator(sam)
masks = mask_auto_generator.generate(image)
masks is of type list, and len(masks) returns the number of masks detected/generated. Each predicted mask is stored as a dict, which allows us to check what keys are available.
dict_keys(['segmentation', 'area', 'bbox', 'predicted_iou', 'point_coords', 'stability_score', 'crop_box'])
The masks do not appear to be sorted by size or score. If any filtering by size or score is desired, all masks need to be iterated over.
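Since no ordering is guaranteed, a small sorting/filtering step is useful. A minimal sketch, using stand-in dicts that carry the same keys as the generator output (the values here are invented for illustration):

```python
# Stand-ins mimicking entries returned by SamAutomaticMaskGenerator.generate();
# the numbers are made up for illustration.
masks = [
    {"area": 1200, "predicted_iou": 0.91, "stability_score": 0.97},
    {"area": 50000, "predicted_iou": 0.88, "stability_score": 0.96},
    {"area": 300, "predicted_iou": 0.97, "stability_score": 0.99},
]

# Sort by mask size (number of active pixels), largest first.
masks_by_area = sorted(masks, key=lambda m: m["area"], reverse=True)

# Keep only masks with a high predicted IoU.
confident = [m for m in masks if m["predicted_iou"] >= 0.90]
```

The same pattern works for `stability_score` or any other key in the dict.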
area indicates the number of active pixels in a mask, not the area of bbox. segmentation contains a 2D boolean mask (values True, False) which can be applied to the input image.
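Applying such a boolean mask to an image is a one-liner with numpy broadcasting. A small sketch using a dummy image in place of a real input:

```python
import numpy as np

# Dummy all-white RGB image standing in for the real input image.
image = np.full((4, 4, 3), 255, dtype=np.uint8)

# A boolean mask like the "segmentation" entry; here a 2x2 square.
segmentation = np.zeros((4, 4), dtype=bool)
segmentation[1:3, 1:3] = True

# Broadcast the mask over the channel axis: pixels outside the mask become 0.
masked = image * segmentation[..., None]
```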
The output format when using SamPredictor is slightly different. First of all, it comes with two predict functions: predict expects numpy.ndarray inputs, whereas predict_torch expects torch.Tensor inputs. Furthermore, predict can't simply be called with the image as one of the arguments. The image needs to be set up front using .set_image(image). The idea behind this is that image encoding and prompting are independent of each other. The predictor accepts various prompt inputs and returns three numpy arrays. For additional fine-tuning, an existing mask can be passed to the predictor (mask_input) along with the other prompts.
predictor.predict(
point_coords: Optional[numpy.ndarray] = None,
point_labels: Optional[numpy.ndarray] = None,
box: Optional[numpy.ndarray] = None,
mask_input: Optional[numpy.ndarray] = None,
multimask_output: bool = True,
return_logits: bool = False,
) -> Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]
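The split between set_image and predict can be illustrated with a toy class (explicitly not the real SamPredictor): the expensive image encoding happens once, and every subsequent prompt reuses the cached embedding.

```python
# Toy illustration only -- not the real SamPredictor API.
class ToyPredictor:
    def __init__(self):
        self.embedding = None
        self.encode_calls = 0

    def set_image(self, image):
        # In SAM, this is the expensive image-encoder forward pass.
        self.encode_calls += 1
        self.embedding = sum(image)  # stand-in for the real embedding

    def predict(self, point):
        if self.embedding is None:
            raise RuntimeError("call set_image() first")
        # Stand-in for returning (masks, scores, logits).
        return self.embedding, point

predictor = ToyPredictor()
predictor.set_image([1, 2, 3])
predictor.predict((10, 20))
predictor.predict((30, 40))  # second prompt reuses the cached embedding
```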
The arrays returned are “masks”, “scores” and “logits”. By default (multimask_output=True) 3 masks are returned; this seems to be the recommended setting to improve quality even if only one mask is desired. Scores are provided in a separate numpy array. Similar to the automatic mask generation, the masks are not sorted by score.
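Because the three masks are not sorted, picking the best one means looking at the scores array yourself. A sketch with stand-in arrays shaped like the predictor output:

```python
import numpy as np

# Stand-ins shaped like predict() output: 3 masks plus their scores.
masks = np.zeros((3, 8, 8), dtype=bool)
scores = np.array([0.72, 0.95, 0.88])

# Select the mask with the highest score.
best_idx = int(np.argmax(scores))
best_mask = masks[best_idx]
```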