Computer Vision · GeoAI

Getting SAM 2 running on Windows without losing your mind

A step-by-step walkthrough of installing Meta’s Segment Anything Model 2.1 (SAM 2.1) on Windows and wiring it into segment-geospatial for zero-label segmentation of aerial imagery.

SAM 2.1 · Hiera · Windows · CUDA · PyTorch · Dec 2025 · ~14 min read

For the last few weeks I’ve been living inside Meta’s Segment Anything Model 2.1 (SAM 2.1) and a pile of aerial imagery.

This post is my own breadcrumb trail: how I installed SAM 2.1 on Windows, what’s actually going on under the hood, and how I plugged it into segment-geospatial to slice up a big aerial scene into ~300 segments with zero labeled data.

If future-me forgets how any of this works, this is where I come back.

1. Setting Up SAM 2.1 on Windows

I’m running SAM 2.1 inside a dedicated conda environment with Python 3.11 and GPU-enabled PyTorch.

1.1. Create a clean environment

I started with:

Terminal · Conda env
conda create -n sam2 python=3.11
conda activate sam2

The SAM 2 repo asks for Python ≥ 3.10, so I locked in 3.11 from the beginning.

1.2. Install PyTorch with CUDA

Inside that environment, I installed PyTorch + CUDA that matches my system. The exact command will depend on your GPU & CUDA version, but the pattern looks like:

Terminal · PyTorch + CUDA
# Example – replace with the command from PyTorch's website for your setup
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

I double-checked that CUDA was visible from Python:

Python
import torch
print(torch.cuda.is_available())  # True is what you want

If this prints False, you’re not really using the GPU yet—time to fix your drivers / CUDA install.
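
If you want more than a bare True/False, a couple of extra lines (plain PyTorch, nothing SAM-specific) show which build and which GPU you’re actually on:

Python · GPU details
import torch

# Versions help spot a mismatch between the wheel you installed and your driver
print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)

if torch.cuda.is_available():
    # The GPU PyTorch will use by default
    print("device:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible - check drivers and the install command")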

1.3. Clone and install SAM 2

Then I pulled the official SAM 2 repo and installed it in editable mode:

Terminal · SAM 2 repo
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e ".[demo]"

On Windows, that last line was a little flaky for me, so when it failed I just installed the important pieces manually:

Terminal · extra deps
pip install opencv-python matplotlib notebook

1.4. Download the SAM 2.1 checkpoints

Meta ships several model sizes (Hiera tiny, small, base+, and large). The repo has a helper script called download_ckpts.sh, but that’s a Bash script and Windows doesn’t love .sh by default.

So I had two options:

  • run the script in WSL / Git Bash, or
  • manually download the .pt files from the links in the README / model card and drop them into a checkpoints/ folder.

Either way, I ended up with something like:

Project layout
sam2/
  checkpoints/
    sam2_hiera_large.pt
    sam2_hiera_base.pt
  configs/
    sam2_hiera_l.yaml
    sam2_hiera_b.yaml
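
If you’d rather skip WSL and Git Bash entirely, the manual download is also just a few lines of Python. This is only a sketch: the URL below is a placeholder, so paste the real .pt link from the README / model card yourself.

Python · checkpoint download (sketch)
import os
import urllib.request

# Placeholder - replace with the actual checkpoint link from the SAM 2 README
CKPT_URL = "<checkpoint URL from the README>"
CKPT_PATH = os.path.join("checkpoints", "sam2_hiera_large.pt")

os.makedirs("checkpoints", exist_ok=True)
urllib.request.urlretrieve(CKPT_URL, CKPT_PATH)
print("saved to", CKPT_PATH)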

1.5. Sanity check: can the model actually segment something?

Once the code and checkpoints were in place, I did a tiny inference test on a single image.

Python · quick smoke test
import cv2
import numpy as np
import torch

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "checkpoints/sam2_hiera_large.pt"
config = "configs/sam2_hiera_l.yaml"

# build model
sam2_model = build_sam2(config, checkpoint, device=device)
predictor = SAM2ImagePredictor(sam2_model)

# load an image
image = cv2.imread("test.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# set the image for the predictor
predictor.set_image(image)

# point prompt at (100, 150)
point_coords = np.array([[100, 150]])
point_labels = np.array([1])   # 1 = foreground, 0 = background

masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)

print("Got", len(masks), "candidate masks")

If this runs without errors and prints a few masks, congrats: SAM 2.1 is alive and talking to your GPU.
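
To actually look at what came back, I overlaid the best-scoring candidate on the image. A minimal matplotlib sketch, assuming the variables from the smoke test above are still in scope:

Python · visualize the best mask
import matplotlib.pyplot as plt
import numpy as np

# multimask_output=True returns several candidates; keep the highest-scoring one
best = masks[np.argmax(scores)]

plt.imshow(image)
plt.imshow(best, alpha=0.5, cmap="jet")                       # semi-transparent overlay
plt.scatter(point_coords[:, 0], point_coords[:, 1], c="red")  # the prompt point
plt.axis("off")
plt.show()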

1.6. Little Windows gotchas

A couple of small potholes I hit and fixed:

  • CUDA mismatch: the CUDA version baked into the PyTorch wheel has to be one your driver supports. If things break during import, re-install PyTorch with the exact command from their website.
  • CUDA_HOME: some builds want CUDA_HOME set. On Windows, I pointed it at my toolkit folder, e.g.:
Command Prompt
set "CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4"

Once all of that was stable, I moved from “cool demo” to “how does this thing actually work?”

2. How SAM 2.1 Thinks: A Quick Mental Model

High-level architecture of Segment Anything Model 2 (SAM 2), showing how it processes video frames with memory: in each frame, SAM 2’s image encoder processes the frame, memory from previous frames informs the mask decoder, and prompts (points / boxes / masks) can be applied at any frame.

SAM 2 is a big model with several moving parts. Here’s my mental model of how it works under the hood.

2.1. Image encoder – Hiera

SAM 2 uses a hierarchical Vision Transformer called Hiera as its image encoder.

It processes each frame (or image) and produces a stack of multi-scale feature maps.

For a single image, it feels very similar to SAM 1’s ViT encoder: big image in → dense embedding out.

Think of this as: “I’ve turned your pixels into a language I can reason about.”

2.2. Prompt encoder – your way of pointing

Just like SAM 1, SAM 2 listens to prompts:

  • Points / boxes → turned into “sparse” embeddings with positional encodings.
  • Masks as prompts → run through a small conv net to become a “dense” embedding.

This is how you tell the model: “I care about this thing, not the whole scene.”

2.3. Memory bank – SAM’s long-term attention span

This is where SAM 2 goes beyond SAM 1.

For videos, after you segment an object in a frame:

  • SAM 2 stores feature maps for that frame
  • plus compact object pointers that summarize what that object looked like.

When a new frame arrives, SAM 2:

  • runs the image encoder again,
  • then uses memory attention to cross-attend the current features with memories from previous frames.

Result: it can track the same object across time without you re-prompting every frame.

For a single still image, there is no past, so the memory machinery is basically idle and SAM 2 behaves like a very strong SAM 1.
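
For video, that memory machinery is driven through a separate video predictor API. A rough sketch adapted from the repo’s video notebook (the frames folder and the point coordinates are placeholders):

Python · video predictor (sketch)
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"
)

# "frames/" is a placeholder: a folder of JPEG frames extracted from the video
state = predictor.init_state(video_path="frames/")

# prompt once, on the first frame only
_, obj_ids, mask_logits = predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[100, 150]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)

# the memory bank takes over from here: no re-prompting per frame
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    pass  # e.g. threshold mask_logits and save a mask per frame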

2.4. Mask decoder – turning understanding into pixels

The mask decoder:

  • takes the image features (possibly enhanced with memory)
  • and the prompt embeddings
  • and runs a few rounds of transformer-style two-way attention to spit out segmentation masks.

Just like SAM 1, if a prompt is ambiguous, it can output multiple candidates. SAM 2 adds an extra trick: an occlusion head that can say “the object is not in this frame,” which matters a lot in video when things move off-screen.

2.5. Memory encoder – writing back to the brain

After predicting a mask for the current frame, SAM 2 has a memory encoder that:

  • compresses that mask + image features
  • and writes them back into the memory bank (usually in some FIFO style).

Over time, it builds a rolling memory of what the object has looked like in recent frames.

SAM 2.1 is basically an upgraded checkpoint of this architecture: same structure, just trained more and tuned better, especially for small / tricky / occluded things.

Once I felt like I had a handle on that mental model, I moved to the fun part: geo stuff.

3. Bringing SAM into the Geospatial World with segment-geospatial

Next stop: segment-geospatial (a.k.a. samgeo) – a Python package that wraps SAM for working with GeoTIFFs and other spatial imagery.

The dream here is simple: “Give me a satellite or aerial image and let SAM carve it into meaningful regions without me drawing polygons for hours.”

3.1. Installing segment-geospatial

Inside the same sam2 environment:

Terminal · samgeo
pip install -U segment-geospatial

The author (Dr. Qiusheng Wu) has already wired in support for SAM 2, usually exposed as SamGeo2 in the API.
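
A quick way to confirm your install actually has that wrapper (assuming a reasonably recent samgeo release):

Python · quick check
import samgeo
from samgeo import SamGeo2

print(samgeo.__version__)  # if the SamGeo2 import above worked, SAM 2 support is wired in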

3.2. Test image

For my tests I used a high-resolution aerial image of a residential area:

  • houses
  • roads
  • yards
  • trees

Basically the kind of scene you’d normally digitize by hand in GIS.

4. First Attempt: Prompt-based Segmentation (and a Facepalm)

My first instinct was: “Let’s click on a building and get just that building back.”

I tried to use SamGeo directly in predictor mode and ran into:

Python traceback
AttributeError: 'SamGeo' object has no attribute 'predictor'

Which was the library gently saying: “You’re using me in the wrong mode, my dude.”

The important realization:

  • For interactive prompts with SAM 2, you want SamGeo2
  • and you want automatic=False so it acts like a predictor instead of running full automatic mask generation.

Something like:

Python · predictor mode
from samgeo import SamGeo2

sam = SamGeo2(
    model_id="sam2-hiera-large",
    automatic=False,
)

sam.set_image("sample_aerial.tif")

mask = sam.predict(
    point_coords=[[x, y]],  # pixel coords
    point_labels=[1],       # 1 = foreground
)

sam.show_mask(mask)

Once I re-wired my code that way, prompt-based prediction started making sense. But before I went too deep on that, I decided to see what full auto could do.

5. Letting SAM Go Wild: Automatic Mask Generation

To really stress test SAM 2.1 on the aerial image, I switched to automatic segmentation.

Python · auto masks
from samgeo import SamGeo2

sam = SamGeo2(
    model_id="sam2-hiera-large",
    automatic=True,
)

sam.generate(
    source="sample_aerial.tif",
    output="output_mask.tif",
)

This tells segment-geospatial:

  • load the image (GeoTIFF or similar)
  • run SAM’s Automatic Mask Generator
  • save the result as a GeoTIFF mask, with each segment encoded as a unique integer.

5.1. Memory + bit depth issues

My test image was pretty large, so:

  • the run was GPU + RAM heavy, but still finished
  • the output contained ~294 segments (!)

The first time I tried saving the mask, I hit:

Python error
OverflowError: Python int too large to convert to uint8

Totally fair – 294 segment IDs won’t fit in uint8’s 0–255 range. The fix was to bump the output to a higher bit depth (e.g. uint16) so all IDs are preserved. Once I adjusted the save settings, output_mask.tif wrote cleanly.
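
If you end up writing the label array yourself rather than going through samgeo’s save path, the fix is just a dtype cast before writing. A rough rasterio sketch, where seg is assumed to be the (H, W) integer label array from the mask generator:

Python · save labels as uint16 (sketch)
import numpy as np
import rasterio

# Reuse the georeferencing of the source image for the output mask
with rasterio.open("sample_aerial.tif") as src:
    profile = src.profile

profile.update(dtype="uint16", count=1, nodata=0)

with rasterio.open("output_mask.tif", "w", **profile) as dst:
    dst.write(seg.astype(np.uint16), 1)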

5.2. What the result looked like

Visualizing that mask with a random colormap gave me a very colorful picture:

  • each house = its own blob of color
  • long continuous stretches of road = distinct segments
  • trees and vegetation = separate patches
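
That “random colormap” is nothing fancy, just one random color per segment ID. The quick sketch I used to render it (assuming background is ID 0):

Python · random colormap view
import matplotlib.pyplot as plt
import numpy as np
import rasterio

with rasterio.open("output_mask.tif") as src:
    seg = src.read(1)

n_ids = int(seg.max()) + 1
colors = np.random.rand(n_ids, 3)
colors[0] = 0  # keep background (ID 0) black

plt.imshow(colors[seg])  # map each segment ID to its random color
plt.axis("off")
plt.show()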

The crazy part is: SAM 2.1 was never trained specifically on satellite imagery, but still partitioned the scene into reasonable units just by visual distinctiveness.

Zero labeled data. One pass. Nearly 300 regions.

6. What I’ve Actually Achieved (So Far)

Here’s what I can do now, end-to-end:

  • Spin up a Windows + CUDA + PyTorch environment and run SAM 2.1 on GPU.
  • Understand enough of the architecture (image encoder, prompts, memory, decoder) to reason about what it’s doing, not just treat it like magic.
  • Use segment-geospatial to:
    • run automatic segmentation on an aerial image and write out a georeferenced mask
    • get ~300 meaningful segments in one go, saved as a GeoTIFF

That’s already useful: I can convert that mask into polygons and feed them into QGIS / ArcGIS for measuring areas, filtering by size, or intersecting with other layers.
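
The raster-to-polygon step is standard rasterio / geopandas fare (segment-geospatial has helpers for this too). A minimal sketch, assuming the output_mask.tif from section 5:

Python · mask → polygons (sketch)
import geopandas as gpd
import rasterio
from rasterio.features import shapes
from shapely.geometry import shape

with rasterio.open("output_mask.tif") as src:
    seg = src.read(1)
    records = [
        {"geometry": shape(geom), "segment_id": int(value)}
        for geom, value in shapes(seg, transform=src.transform)
        if value != 0  # skip background
    ]
    crs = src.crs

gdf = gpd.GeoDataFrame(records, crs=crs)
gdf.to_file("segments.gpkg", driver="GPKG")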

7. Where I Want to Take This Next

I’m nowhere near done with this pipeline. The to-do list looks like this:

7.1. Filter and label the segments

SAM doesn’t know what the segments are, only that they’re visually distinct.

Next step:

  • identify which segments are buildings, roads, trees, water, etc.
  • experiment with text prompts + CLIP-style filtering (segment-geospatial supports some of this)
  • or bring in a simple classifier that works on crops of each segment (see the sketch after this list).
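
That last bullet is mostly numpy once the label raster exists. A sketch of pulling a bounding-box crop per segment, where img and seg are assumed to be the RGB image and label mask as arrays:

Python · per-segment crops (sketch)
import numpy as np

def segment_crops(img, seg):
    """Yield (segment_id, RGB crop) pairs for every non-background segment."""
    for seg_id in np.unique(seg):
        if seg_id == 0:
            continue  # skip background
        ys, xs = np.where(seg == seg_id)
        crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        yield int(seg_id), crop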

7.2. Go back to interactive prompts

Now that I know how to correctly initialize SamGeo2 in predictor mode, I want to:

  • click once inside each building and instantly get precise building footprints
  • use this to speed up manual mapping instead of replacing it entirely

For that, the pattern will look like:

Python · building masks
sam = SamGeo2(
    model_id="sam2-hiera-large",
    automatic=False,
)

sam.set_image("sample_aerial.tif")
mask = sam.predict(point_coords=[[x, y]], point_labels=[1])
sam.save_mask(mask, "building_mask.tif")

7.3. HQ-SAM and geospatial fine-tunes

The base SAM 2.1 masks are good, but not perfect:

  • edges can bleed across shadows
  • roofs might be split into multiple small segments

I want to try:

  • HQ-SAM integrations in segment-geospatial for sharper boundaries
  • dedicated geospatial fine-tunes like GeoSAM for road and infrastructure mapping

7.4. Post-processing and simplification

With ~294 segments, some are:

  • too tiny
  • or too fragmented to be useful.

The plan:

  • convert the raster mask → polygons
  • dissolve tiny regions or adjacent pieces that clearly belong together
  • drop noise segments with very small area

Classic GIS cleanup.
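
In geopandas terms, that cleanup is only a few lines. A sketch, assuming the segments.gpkg produced earlier and a projected CRS so areas come out in square metres:

Python · GIS cleanup (sketch)
import geopandas as gpd

gdf = gpd.read_file("segments.gpkg")

# drop noise segments below a minimum area (threshold is a guess, tune it)
MIN_AREA = 25.0  # square metres, assuming a projected CRS
gdf = gdf[gdf.geometry.area >= MIN_AREA]

# merge pieces that share the same segment ID into one feature
gdf = gdf.dissolve(by="segment_id").reset_index()

gdf.to_file("segments_clean.gpkg", driver="GPKG")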

7.5. Vectorization and full GIS workflows

Because the outputs are georeferenced, I can:

  • export to Shapefile / GeoJSON
  • pull them into QGIS / ArcGIS
  • overlay with parcels, zoning, risk layers, etc.

There’s even a QGIS plugin (GeoOSAM) using SAM 2.1 directly, which tells me I’m not the only one walking this path.

7.6. Time series and change detection

This is the really fun idea:

  • use SAM 2.1’s video abilities on multi-temporal imagery
  • treat each timestamp as a frame in a “video”
  • track objects across time to see things like:
    • deforestation
    • new buildings
    • changing water bodies

Haven’t built this yet, but with the memory mechanism, SAM 2.1 is basically begging to be used on time series.

If you made it this far: thanks. This post is mostly here so I don’t forget how any of this works six months from now, but if it helps you get SAM 2.1 running on your own machine, or convinces you to throw it at satellite images, then it did its job.