Spatial Databases (PostGIS) & AI-Assisted GIS for Public Health Research

Welcome! This interactive application explores the concepts and benefits of using PostGIS spatial databases in public health research. It's designed for researchers who may have limited prior experience with databases or programming but are familiar with public health data.

Navigate through the sections using the sidebar to learn about why spatial databases are crucial, understand PostGIS basics, see how to connect using R and Python, get oriented to the modern (2026) geospatial stack, explore real-world use cases, learn how to use generative AI to study GIS concepts and run spatial analyses safely, test your knowledge with a pop quiz, and find resources for further learning.

Overall Goal: To introduce the concept and benefits of using PostGIS spatial databases in public health research, demonstrate basic data access using R and Python, and equip you to learn and perform GIS work with the help of generative AI — responsibly.

📅 2026 revamp: This edition updates the material for the current geospatial ecosystem (PostGIS 3.6, PostgreSQL 18, GeoParquet, DuckDB, psycopg v3, GeoPandas 1.x) and adds a hands-on guide to learning and doing GIS with generative AI. New and updated content is marked with a New badge.

🗺️ New — Live data explorer Hands-on: a companion app at postgis.son.do/interactive wired to a real, read-only PostGIS of Korean public-health GIS — 128,471 hospitals & clinics, district/province boundaries, air-quality sites, and more. Toggle layers on a live map, drop a buffer to count clinics within 5 km, and run the lecture's ST_* spatial SQL in an in-browser DuckDB console — no install. Open the explorer →

How to follow along

This is a hands-on lecture: we alternate concept → demo → you try it. When you see a Lab box, pause and run it yourself — solutions follow. Code is labelled by track so you can pick the one you know:

SQL Python R DuckDB Docker

Two ways to practice — set up either (or both) in Part III

A zero-setup DuckDB track that runs on any laptop, and a full Docker PostGIS track for the real multi-user server experience. The spatial SQL is nearly identical across both.

Learning objectives

By the end of today, you should be able to:

Explain why spatial databases beat CSVs and shapefiles for research
Stand up a working PostGIS (Docker) and DuckDB spatial environment
Describe geometry types, SRIDs/CRS, geometry vs. geography
Create and use a spatial index — and read EXPLAIN ANALYZE

Write spatial SQL: ST_DWithin, ST_Intersects, spatial joins, buffers
Connect from Python (geopandas) and R (sf)
Carry one public-health analysis end-to-end (clinic accessibility)
Use generative AI to draft, translate, and debug GIS work — and verify it

The throughline: the spatial SQL you learn today transfers across PostGIS, DuckDB, and AI-assisted workflows alike. Learn it once; use it everywhere.

Today's agenda & timing

IOverview & goals~5 min

IIWhy spatial databases~10 min

IIISetup — your lab Hands-on~15 min

IVPostGIS fundamentals (deep)~25 min

VConnecting & querying~15 min

VIModern stack · then a break~10 min

VIICase study end-to-end Hands-on~25 min

VIIIGIS with generative AI Hands-on~20 min

IXPop quiz~10 min

XSummary & resources~5 min

Before we go further: if you can, start the Docker download now (Part III, the docker run command) — it pulls in the background while we cover concepts.

Why Spatial Databases?

This section delves into the importance of spatial data in public health and the limitations of traditional data storage methods like CSVs and Shapefiles, highlighting the need for robust database solutions like PostGIS.

What is Spatial Data in Public Health?

Spatial data links information to a specific location on Earth. In public health, understanding "where" health events occur, vulnerable populations reside, and services are located is fundamental.

Examples include:

Mapping disease outbreaks (e.g., influenza spread, cancer mortality rates by state).
Locations of hospitals/clinics (assessing geographic accessibility).
Patient addresses (anonymized/aggregated, for understanding distribution).
Environmental exposure sources (e.g., pollution sites near communities).
Demographic data aggregated to geographic areas (e.g., census tracts, zip codes).

Visualizing such data on maps can reveal important characteristics or trends not apparent from tabular data alone. This enables targeted, geographically-focused interventions.

Key takeaway: Spatial context is crucial in understanding and addressing many public health issues.

Limitations of Traditional Data Storage & The Need for Databases

While CSVs and Shapefiles are common, they have significant limitations, especially for large, complex datasets or collaborative work.

Challenges with CSV Files

No inherent spatial understanding: Coordinates are plain numbers; system must be explicitly defined.
Data type ambiguity: No enforced data types, can lead to precision loss (e.g., ZIP codes losing leading zeros).
Inconsistent formatting: Issues with parsing, delimiters, special characters.
Limited querying: No built-in spatial query capabilities.
Scalability issues: Poor performance with very large datasets.
Data integrity/security: Lacks encryption, password protection, validation.

Challenges with Shapefiles

File proliferation: A single "shapefile" is multiple files (.shp, .shx, .dbf, .prj).
Data duplication/inconsistency: Easy to create multiple versions, leading to redundancy.
Attribute limitations: Field names limited to 10 chars; no null values (often uses arbitrary numbers); dates without time; 2GB file size limit.
Limited querying/analysis: Complex spatial queries require external GIS and convoluted workflows.
No concurrent access: Not designed for simultaneous multi-user access, leading to versioning issues.

These limitations can undermine research quality, reproducibility, and collaboration. Relational databases (like PostgreSQL) and specialized spatial databases (like PostGIS, an extension for PostgreSQL) address these challenges.

Summary of Challenges: Traditional vs. Database Approach

Issue	Example with CSVs	Example with Shapefiles	Impact on Public Health Research
Data Integrity	Latitude/Longitude stored as text, potential precision loss.	Field name "ENVIRONMENTAL_EXPOSURE_LEVEL" truncated. Nulls replaced with 0.	Inaccurate analysis, misinterpretation, inability to distinguish missing from true zero.
Scalability	Slow processing of 1 million patient address records.	2GB file size limit reached with detailed national boundary data.	Inability to analyze large-scale or high-resolution datasets efficiently.
Spatial Querying	Cannot directly query for clinics within 1km of a patient address.	Requires external GIS and complex workflows for proximity analysis.	Slow, inefficient, and error-prone execution of complex spatial questions.
Collaboration	High risk of overwriting data if multiple users edit a shared CSV.	Difficult to merge changes from different team members.	Hinders teamwork, potential data loss, outdated analyses, reproducibility issues.
Coordinate System Info	No embedded coordinate system information.	`.prj` file stores system, but can be missing/incorrect.	Risk of incorrect spatial analysis if data misaligned.

Key takeaway: Spatial databases offer a more robust, efficient, and powerful way to manage and analyze geographic health data, overcoming critical limitations of traditional file-based approaches.

The question we'll answer all day

To make this concrete, we'll carry one realistic analysis through the whole lecture.

Clinic accessibility in Riverbend County

"How many people live within 5 km of a primary-care clinic, and which districts are underserved relative to their population?"

Our (synthetic) data — three tables we'll build in Part III:

Table	Geometry	Key columns
`clinics`	Point (4326)	`clinic_id`, `name`
`districts`	MultiPolygon (4326)	`district_id`, `name`, `population`
`blocks`	Point (4326)	`block_id`, `pop`

Note Why synthetic data?

We use synthetic data on purpose — never real patient locations in a teaching or AI context. More on that in Part VIII.

Setup — Your Lab Environment Hands-on

You'll get the most out of today if you can run queries yourself. Pick a track — or do both. The two engines share nearly identical spatial SQL, so the skill transfers either way.

Two tracks, one set of skills

DuckDB Zero-setup, local

One binary or pip install — no server
Runs spatial SQL inside Python / R / CLI
Reads GeoParquet & files directly
Perfect for labs on any laptop

Best if you want to start immediately.

Docker Real PostGIS server

A genuine multi-user PostgreSQL + PostGIS
Concurrency, roles, indexes, EXPLAIN
Mirrors a production research database
Needs Docker Desktop installed

Best if you want the authentic workflow.

The SQL is nearly identical. DuckDB's spatial extension implements the same ST_* functions. Learn the query once; run it in either engine.

Track A — DuckDB, step by step DuckDB

The fastest path to running spatial SQL: no server, no accounts, one extension. Pick the Python flavour (used in our labs) or the standalone CLI — the SQL is identical. Targets DuckDB 1.1+, which reads and writes GeoParquet natively.

1

Install DuckDB

Either the Python package (recommended for labs) or the standalone CLI binary. A virtual environment keeps the install isolated.

Option 1 — Python Python

# create an isolated environment (recommended)
python -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate

pip install "duckdb>=1.1" "geopandas>=1.0" pandas

# sanity check the version
python -c "import duckdb; print(duckdb.__version__)"

Option 2 — CLI binary

# macOS (Homebrew)
brew install duckdb

# Linux / macOS (official install script)
curl https://install.duckdb.org | sh

# Windows: winget install DuckDB.cli
duckdb --version           # expect v1.1.x or newer

2

Open a persistent database & load the spatial extension

Passing a file name (not :memory:) means your tables survive a restart. INSTALL downloads the extension once; LOAD activates it in every new session.

Python

import duckdb

# a file on disk → data persists between runs
con = duckdb.connect("riverbend.duckdb")

con.sql("INSTALL spatial;")   # one-time download
con.sql("LOAD spatial;")      # every session
con.sql("SELECT ST_AsText(ST_Point(-117.4, 33.9));").show()

CLI

duckdb riverbend.duckdb       # opens (or creates) the file

-- inside the DuckDB shell:
INSTALL spatial;
LOAD spatial;
SELECT ST_AsText(ST_Point(-117.4, 33.9));
-- POINT (-117.4 33.9)

3

(Optional) read remote files directly DuckDB

The httpfs extension lets DuckDB stream GeoParquet/Parquet straight from a URL or S3 — no download step. We use this for the open-data labs further down.

INSTALL httpfs; LOAD httpfs;
-- now any https:// or s3:// path works as a table source, e.g.
-- SELECT * FROM 'https://example.org/clinics.parquet' LIMIT 5;

Lab 0a Verify DuckDB

Run steps 1–2. If SELECT ST_AsText(ST_Point(-117.4, 33.9)) prints POINT (-117.4 33.9), your spatial engine works. ✅

Track B — Docker PostGIS, step by step Docker

The official postgis/postgis image bundles PostgreSQL 18 + PostGIS 3.6 — a genuine multi-user server, in a few commands. Install Docker Desktop (macOS/Windows) or Docker Engine (Linux) first, then check it runs: docker --version.

1

Run a PostGIS server with a persistent volume

The named volume almanac-pgdata keeps your database even if the container is removed and recreated. Downloads ~600 MB the first time.

docker run -d --name almanac-pg \
  -e POSTGRES_PASSWORD=almanac \
  -e POSTGRES_DB=riverbend \
  -p 5432:5432 \
  -v almanac-pgdata:/var/lib/postgresql/data \
  postgis/postgis:18-3.6

Apple Silicon: the image is multi-arch and runs natively — no --platform flag needed. If port 5432 is busy, map a different host port, e.g. -p 5433:5432.

2

Wait until the server is ready, then connect with psql

pg_isready confirms the database has finished starting before you connect.

# is it accepting connections yet?
docker exec almanac-pg pg_isready -U postgres
# → /var/run/postgresql:5432 - accepting connections

# open an interactive SQL shell inside the container
docker exec -it almanac-pg psql -U postgres -d riverbend

3

Enable the PostGIS extension (once per database)

CREATE EXTENSION IF NOT EXISTS postgis;
SELECT postgis_full_version();
-- POSTGIS="3.6.0" ... GEOS="3.13" PROJ="9.x" ...

Lifecycle: -d runs it in the background. Pause/resume with docker stop almanac-pg / docker start almanac-pg. docker logs -f almanac-pg tails the server log. Because data lives in the volume, even docker rm won't erase it — only docker volume rm almanac-pgdata does.

Teaching only: a hard-coded password and an open port are fine on your laptop — never on a shared or internet-facing host. Use secrets and restricted roles in production.

Connect from your own tools (host side) Docker

Because we published port 5432, any client on your laptop can reach the server at localhost — a host psql, a Python/R script, or a GUI like DBeaver / pgAdmin / psql.

Host psql via a connection URL

psql "postgresql://postgres:almanac@localhost:5432/riverbend"

GUI client fields

Host:     localhost
Port:     5432
Database: riverbend
User:     postgres
Password: almanac

Troubleshooting Docker

Symptom	Likely cause	Fix
`bind: address already in use` on 5432	A local PostgreSQL (or an old container) already owns the port.	Re-run with a different host port: `-p 5433:5432`, then connect on `5433`.
`Cannot connect to the Docker daemon`	Docker Desktop / Engine isn't running.	Start Docker Desktop (or `sudo systemctl start docker`) and retry.
`password authentication failed`	Wrong password, or a stale volume from an earlier run with a different one.	Use `almanac`, or reset: `docker rm -f almanac-pg && docker volume rm almanac-pgdata`, then re-run step 1.
`The container name "/almanac-pg" is already in use`	A container with that name exists.	Reuse it with `docker start almanac-pg`, or remove it: `docker rm -f almanac-pg`.
`FATAL: the database system is starting up`	You connected before initialization finished.	Wait for `docker exec almanac-pg pg_isready -U postgres` to report accepting connections.

Lab 0b Prove PostGIS is alive SQL

Run this in psql (Docker) or the DuckDB shell. Both should return a version string and a valid point.

Docker / PostGIS

SELECT postgis_version();
-- 3.6 USE_GEOS=1 USE_PROJ=1 ...

SELECT ST_AsText(
  ST_SetSRID(ST_MakePoint(-117.4, 33.9), 4326)
);
-- POINT(-117.4 33.9)

DuckDB

LOAD spatial;
SELECT version();        -- v1.1+

SELECT ST_AsText(
  ST_Point(-117.4, 33.9)
);
-- POINT (-117.4 33.9)

Loading the sample data SQL

We'll generate the synthetic Riverbend tables directly in SQL, so everyone has identical data.

Clinics — a handful of points

CREATE TABLE clinics (
  clinic_id int PRIMARY KEY,
  name      text,
  geom      geometry(Point, 4326)
);

INSERT INTO clinics VALUES
 (1,'Riverbend Family Health',
    ST_SetSRID(ST_MakePoint(-117.40, 33.95),4326)),
 (2,'Eastside Community Clinic',
    ST_SetSRID(ST_MakePoint(-117.31, 33.92),4326)),
 (3,'North Valley Health Center',
    ST_SetSRID(ST_MakePoint(-117.36, 34.02),4326));

Blocks — population points (a grid)

CREATE TABLE blocks AS
SELECT
  row_number() OVER () AS block_id,
  (50 + (random()*450))::int AS pop,
  ST_SetSRID(ST_MakePoint(
    -117.45 + gx*0.01,
     33.88 + gy*0.01), 4326) AS geom
FROM generate_series(0,18) gx,
     generate_series(0,16) gy;

~300 population points across the county. (DuckDB: use range() instead of generate_series in a cross join — the AI can translate it; see Part VIII.)

Districts — simple square cells acting as administrative areas

CREATE TABLE districts AS
SELECT
  d AS district_id,
  'District ' || d AS name,
  (8000 + (random()*40000))::int AS population,
  ST_SetSRID(
    ST_MakeEnvelope(                        -- xmin, ymin, xmax, ymax
      -117.45 + (d%3)*0.10,  33.88 + (d/3)*0.08,
      -117.35 + (d%3)*0.10,  33.96 + (d/3)*0.08
    ), 4326) AS geom
FROM generate_series(0,5) d;

You now have three layers: clinics (points), blocks (population points), and districts (polygons) — all in SRID 4326. This is the data for our case study in Part VII.

Prefer real files? The same tables ship as GeoParquet in the course pack. DuckDB can read them straight from a URL or disk — SELECT * FROM 'clinics.parquet'; — no import step needed.

Lab 0c Create the tables

Run all three blocks. Then: SELECT count(*) FROM blocks; — expect ~323 rows.

Real sample datasets to explore interactively New

The synthetic Riverbend tables above are perfect for the case study because everyone has identical data. But once you want to play, nothing beats real spatial data. Below are battle-tested, openly-licensed datasets used across the GIS world — each with a copy-and-run recipe for PostGIS or DuckDB.

Dataset	What's in it	Format / size	Best engine
PostGIS "Intro" NYC data	Census blocks, streets, subway stations, neighborhoods + sociodemographics. The canonical teaching set.	Shapefiles, ~25 MB	PostGIS
Natural Earth	Global countries, states, cities, rivers, coastlines at 1:10m / 1:50m / 1:110m.	Shapefile / GeoPackage, public domain	PostGIS or DuckDB
Overture Maps	Global places (POIs), buildings, transportation, admin boundaries.	GeoParquet on S3, billions of rows	DuckDB
US Census TIGER/Line	US states, counties, tracts, roads — the backbone of US public-health geography.	Shapefile / GeoParquet	PostGIS or DuckDB
OpenStreetMap extracts	Roads, buildings, amenities (clinics, pharmacies) for any region.	PBF (via Geofabrik)	PostGIS (`osm2pgsql`)

Recipe 1 — Load the official PostGIS NYC workshop data PostGIS

The Introduction to PostGIS workshop ships four NYC shapefiles. Load them with shp2pgsql (bundled with PostGIS, so it already exists inside the container).

# 1. Download & unzip the workshop data bundle (on your laptop)
curl -L -o nyc_data.zip \
  https://s3.cleverelephant.ca/postgis-workshop-2020.zip
unzip nyc_data.zip            # → data/nyc_*.shp

# 2. Copy the shapefiles into the running container
docker cp data almanac-pg:/tmp/nyc

# 3. Convert one shapefile to SQL and pipe it straight into the DB
#    -s 26918 = NY State Plane (UTM 18N);  -I builds a spatial index
docker exec -i almanac-pg bash -c \
  "shp2pgsql -s 26918 -I /tmp/nyc/nyc_neighborhoods nyc_neighborhoods \
   | psql -U postgres -d riverbend"

Repeat step 3 for nyc_census_blocks, nyc_streets, and nyc_subway_stations. Then SELECT name FROM nyc_neighborhoods LIMIT 5; to confirm.

Recipe 2 — Stream Overture Maps places straight into DuckDB DuckDB

No download, no import: DuckDB reads the GeoParquet on S3 and the bbox filter means only the bytes for your area of interest are fetched. Great for pulling real clinics/pharmacies as context for a health analysis.

INSTALL spatial; LOAD spatial;
INSTALL httpfs;  LOAD httpfs;

-- Health-related POIs near downtown Los Angeles.
-- Check docs.overturemaps.org for the current release tag.
SELECT id,
       names.primary           AS name,
       categories.primary      AS category,
       ST_AsText(geometry)     AS location
FROM read_parquet(
  's3://overturemaps-us-west-2/release/2025-04-23.0/theme=places/type=place/*',
  filename = true, hive_partitioning = 1)
WHERE categories.primary IN ('hospital','pharmacy','clinic')
  AND bbox.xmin BETWEEN -118.30 AND -118.20
  AND bbox.ymin BETWEEN   34.00 AND   34.10
LIMIT 20;

Overture publishes a new release roughly monthly; swap 2025-04-23.0 for the latest tag from the Overture → DuckDB docs.

Recipe 3 — Import Natural Earth with `ogr2ogr` PostGIS

ogr2ogr (part of GDAL) is the universal converter — it reads almost any spatial format and writes directly to PostGIS. Here we load global country polygons.

# Needs GDAL on your host:  brew install gdal  /  apt install gdal-bin
ogr2ogr -f PostgreSQL \
  PG:"host=localhost port=5432 dbname=riverbend user=postgres password=almanac" \
  /vsizip/vsicurl/https://naturalearth.s3.amazonaws.com/110m_cultural/ne_110m_admin_0_countries.zip \
  -nln ne_countries -nlt PROMOTE_TO_MULTI -lco GEOMETRY_NAME=geom -overwrite

/vsizip/vsicurl/ is GDAL's virtual filesystem: it streams and unzips the remote shapefile in one shot — no manual download. The same trick loads Natural Earth straight into DuckDB via ST_Read().

Don't have GDAL or the right URL handy? This is exactly where the AICopilot from Part VIII shines — paste the dataset name and ask it to write the ogr2ogr/shp2pgsql/read_parquet command for your engine, then verify the row count.

Lab 0d Bring in real data SQL

Pick one recipe and load it. Then run SELECT count(*) on the new table and SELECT ST_SRID(geom) FROM <table> LIMIT 1; — note which SRID it arrived in (you'll often need to ST_Transform it to match your other layers).

PostGIS Fundamentals

This section introduces PostgreSQL, the PostGIS extension, and core spatial concepts essential for using PostGIS effectively in public health research.

What is PostgreSQL and PostGIS?

PostgreSQL: A powerful, open-source object-relational database system (ORDBMS). Known for reliability, data integrity, scalability, and extensibility. It's free and community-supported.

PostGIS: An extension to PostgreSQL that "teaches" it to understand, store, manage, and analyze geographic objects. It adds:

Spatial Data Types (e.g., `geometry`, `geography`).
Spatial Indexes (e.g., R-Tree based on GiST for fast spatial searches).
Spatial Functions (hundreds for analysis, e.g., distance, intersection, buffers).

Analogy: PostgreSQL is the filing cabinet; PostGIS provides special folders and tools for maps within it.

Current versions (mid-2026): PostgreSQL 18 and PostGIS 3.6.x are the current stable releases. Recent PostGIS versions keep expanding raster, vector-tile (`ST_AsMVT`), and 3D support, and modern PostgreSQL releases bring faster parallel queries and better JSON handling — all useful for large public-health datasets. You don't need the newest version to learn, but knowing what's current helps when reading docs and asking AI assistants for help.

Key takeaway: PostGIS turns a regular (but very powerful) database (PostgreSQL) into a spatial powerhouse, specifically designed to handle and analyze geographic information.

Core Spatial Concepts in PostGIS

1. Geometry Types

PostGIS stores geographic features as specific geometry types. Click each type to see examples:

Point 📍

Represents a single location (e.g., X,Y coordinates).

Public Health Examples: Patient residence, clinic location, disease case site, water sample point.

LineString 〰️

Represents linear features (sequence of connected points).

Public Health Examples: Roads, rivers, public transport routes, disease vector paths.

Polygon 🏞️

Represents an area (closed sequence of points).

Public Health Examples: Administrative boundaries (counties, states), service areas, exposure zones, park boundaries.

PostGIS also supports Multi-geometries (MultiPoint, MultiLineString, MultiPolygon) which are collections of the same geometry type, plus GeometryCollections.

How geometries are written — WKT & WKB

Every geometry has a human-readable and a binary form. You'll see both constantly.

WKT — Well-Known Text (readable)

POINT(-117.4 33.9)
LINESTRING(0 0, 1 1, 2 1)
POLYGON((0 0, 4 0, 4 4, 0 4, 0 0))

-- Constructing & inspecting
SELECT ST_AsText(            -- geometry → WKT
  ST_GeomFromText('POINT(-117.4 33.9)', 4326)
);
SELECT ST_AsGeoJSON(geom)    -- → GeoJSON for web maps
FROM clinics LIMIT 1;

WKB (Well-Known Binary) is the compact form actually stored on disk. You rarely write it by hand — functions convert for you.

Constructor cheat-sheet:
ST_MakePoint(x, y) — fast point
ST_SetSRID(geom, 4326) — stamp a CRS
ST_GeomFromText(wkt, srid) — from WKT
ST_AsText / ST_AsGeoJSON — read it back

Order matters: WKT and ST_MakePoint are (longitude, latitude) — X then Y. Reversing them silently puts your data in the wrong hemisphere.

2. Spatial Reference Systems (SRS) / SRID

An SRS (or CRS) defines how coordinate values relate to actual locations on Earth. It's the "language" or "map projection" your data speaks.

Crucial for: Aligning different data layers correctly and ensuring meaningful measurements (distance, area).
SRID (Spatial Reference Identifier): A unique integer ID for each SRS in PostGIS (e.g., stored in `spatial_ref_sys` table).
Common Examples:
- WGS84 (EPSG:4326): Geographic (lat/lon in degrees), used by GPS.
- Projected Systems (e.g., UTM, State Plane): Transform Earth's surface to a flat plane (units in meters/feet), better for accurate local distance/area calculations.
Rule: All spatial data in PostGIS MUST have a defined SRID. For operations between layers, SRIDs should match or be transformable (using `ST_Transform()`).

Conceptual Visualization: SRID Alignment

Layer A (SRID: X)

Layer B (SRID: Y)

Misaligned! 🚫

➡️

Layer A (Transformed to SRID: Z)

Layer B (Transformed to SRID: Z)

Correctly Aligned! ✅

Illustrates that layers with different SRIDs won't overlay correctly unless transformed to a common system.

The classic trap — degrees are not meters

Distance on lat/lon (SRID 4326) is measured in degrees, which is almost never what you want.

-- WRONG: "0.045" — that's degrees, meaningless
SELECT ST_Distance(
  ST_SetSRID(ST_MakePoint(-117.40,33.95),4326),
  ST_SetSRID(ST_MakePoint(-117.36,34.02),4326)
);

-- RIGHT (option 1): transform to a metric CRS
SELECT ST_Distance(
  ST_Transform(a.geom, 32611),   -- UTM 11N, meters
  ST_Transform(b.geom, 32611)
) FROM clinics a, clinics b
WHERE a.clinic_id=1 AND b.clinic_id=3;
-- ≈ 8100  (meters)

-- RIGHT (option 2): cast to geography
-- (meters on a sphere)
SELECT ST_Distance(
  a.geom::geography,
  b.geom::geography
) FROM clinics a, clinics b
WHERE a.clinic_id=1 AND b.clinic_id=3;
-- ≈ 8120  (meters)

If a distance looks like 0.0x, you're in degrees. Transform to a projected CRS (e.g. the right UTM zone) or cast to geography.

geometry vs. geography — which to use

PostGIS offers two spatial types. Choosing well saves you from the degrees-vs-meters trap.

`geometry` — Cartesian (flat)

Math on a flat plane — fast
Units are the CRS's units (degrees for 4326, meters for UTM)
Hundreds of functions; the default for analysis
Best when your data is in a projected CRS, or covers a small area

`geography` — geodetic (round Earth)

Math on the spheroid — distances/areas in meters, always
Correct over long distances & across UTM zones
Fewer functions, somewhat slower
Best when data spans large areas in lat/lon and you want true meters

Rule of thumb: national or continental lat/lon data and you want meters → geography. Local analysis or you've projected to UTM/State Plane → geometry. You can cast between them: geom::geography.

3. Spatial indexes

Like a book's index — but for locations. They dramatically speed up spatial queries ("what's near X?", "what's inside Y?").

Analogy: a GPS that lets the database rapidly find spatial data.
How (simplified): PostGIS builds a GiST index over each geometry's bounding box (an R-Tree), so the planner skips most rows.
Two-phase search: a fast index filter on bounding boxes, then an exact check on the survivors.

-- Build a spatial index (do this on every geometry column you query)
CREATE INDEX blocks_geom_idx ON blocks USING GIST (geom);
ANALYZE blocks;            -- refresh planner statistics

Functions that use the index automatically: ST_DWithin, ST_Intersects, ST_Contains, && — because they begin with a bounding-box test. ST_Distance alone does not; prefer ST_DWithin for proximity.

Proving the index works — `EXPLAIN ANALYZE`

The planner tells you whether it used your index. Reading this is a core skill.

EXPLAIN ANALYZE
SELECT b.block_id
FROM blocks b, clinics c
WHERE c.clinic_id = 1
  AND ST_DWithin(
        b.geom::geography,
        c.geom::geography, 5000);

What to look for:

✅ Index Scan using blocks_geom_idx — good, the index is used
⚠️ Seq Scan on a big table — the index was skipped (missing index, wrong type, or a function wrapping the column)
actual time= — real milliseconds; compare before/after adding the index
rows= — estimate vs. actual; large gaps mean stale stats → run ANALYZE

Lab 1 Index & explain SQL

Create blocks_geom_idx, run the EXPLAIN ANALYZE above, and note the time. Drop the index, re-run, and compare. (DuckDB users: skip — it auto-indexes; focus on the query.)

Spatial relationships — the vocabulary

Most spatial questions reduce to a handful of relationship predicates. They return true/false.

Function	Asks…
`ST_Intersects(a,b)`	do they touch/overlap at all?
`ST_Contains(a,b)`	is b fully inside a?
`ST_Within(a,b)`	is a fully inside b?
`ST_DWithin(a,b,d)`	are they within distance d?
`ST_Touches(a,b)`	do they share only a boundary?
`ST_Crosses(a,b)`	does a line cross a polygon?

Under the hood these implement the DE-9IM model — a 3×3 matrix describing how two geometries' interiors, boundaries, and exteriors relate. You rarely write DE-9IM directly, but it's why the named predicates are precise and composable.

Measurement companions: ST_Distance, ST_Length, ST_Area, ST_Buffer, ST_Intersection, ST_Union, ST_Centroid.

The workhorse — spatial joins

A spatial join connects two tables by a spatial relationship instead of a shared key. This is where PostGIS earns its keep.

-- Count population points inside each district
SELECT d.name,
       count(b.*)      AS n_blocks,
       sum(b.pop)      AS people
FROM districts d
JOIN blocks b
  ON ST_Contains(d.geom, b.geom)   -- the join!
GROUP BY d.name
ORDER BY people DESC;

How it works:

The ON clause uses a spatial predicate, not a.id = b.id
The planner uses the GiST index to avoid comparing every pair
JOIN keeps matches; LEFT JOIN keeps districts with zero points too (watch for those!)

Pattern to memorize: points-in-polygons via JOIN … ON ST_Contains(poly, pt) + GROUP BY. It answers a huge share of public-health questions.

Part IV key takeaway: PostGIS stores shapes (geometries) with a real-world coordinate system (SRID), lets you choose geometry (flat, fast) or geography (true meters), uses GiST spatial indexes to stay fast, and answers questions through relationship predicates and spatial joins. Watch the two classic traps: degrees vs. meters, and queries that skip the index (EXPLAIN ANALYZE is your friend).

Connecting to & Querying PostGIS

This section covers general principles for database connections and demonstrates how to connect to PostGIS and retrieve data using R and Python, along with a brief note on SQL.

General Connection Principles

To access any database, you need standard parameters:

Hostname (Server Address): Network address of the database server (e.g., `localhost`, IP address, domain name).
Port: Communication endpoint (default for PostgreSQL is `5432`).
Database Name (dbname): The specific database you want to connect to.
Username: Your database username.
Password: Password for the username.

For demos, a read-only account is used to prevent accidental data modification.

Key takeaway: These five pieces of information are the standard keys needed to unlock access to almost any database.

Using R with PostGIS

Key R Packages: `sf` (Simple Features for spatial data), `DBI` (DataBase Interface), `RPostgres` (PostgreSQL driver).

The following snippet shows how to connect, read data, and disconnect. Replace placeholder details with your actual database credentials.


# 1. Load necessary libraries
# install.packages(c("sf", "DBI", "RPostgres")) # Run once if not installed
library(sf)
library(DBI)
library(RPostgres)

# 2. Database connection parameters (REPLACE with actual details)
db_host <- "your_host_address"
db_port <- 5432
db_name <- "your_sample_db_name"
db_user <- "readonly_user"
db_password <- "readonly_password"

# 3. Establish connection
con <- dbConnect(RPostgres::Postgres(),
                 dbname = db_name,
                 host = db_host,
                 port = db_port,
                 user = db_user,
                 password = db_password)

# 4. Read a spatial table into an sf object
# Example: Read 5 features from 'health_clinics' table
sql_query_clinics <- "SELECT objectid, clinic_name, address, geom FROM health_clinics LIMIT 5;"
health_data_sf <- st_read(con, query = sql_query_clinics)

# 5. View the loaded data
print(health_data_sf)
# plot(st_geometry(health_data_sf)) # Simple plot

# 6. Close the connection
dbDisconnect(con)

Key takeaway: R, with packages like `sf` and `DBI`, can easily connect to PostGIS and bring spatial data directly into a familiar analysis environment.

Brief Note on SQL

The `query` parameter in `st_read()` (R) and `read_postgis()` (Python) uses SQL (Structured Query Language). PostGIS extends SQL with many spatial functions (e.g., `ST_Distance`, `ST_Intersects`, `ST_Contains`, `ST_Buffer`, `ST_DWithin`).

Performing spatial operations directly in the database using these SQL functions is often very efficient, as it processes data on the server before transferring it to R/Python.

Key takeaway: Learning basic SQL, augmented with PostGIS's spatial functions, unlocks the ability to perform powerful and efficient data selection and manipulation directly within the database.

Same skills, no server — DuckDB from Python DuckDB Python

The identical analysis, running in-process over your local tables or files — no database server required.

import duckdb, geopandas as gpd

con = duckdb.connect("riverbend.duckdb")
con.sql("LOAD spatial;")

# The SAME spatial SQL you'd run in PostGIS
df = con.sql("""
  SELECT d.name, sum(b.pop) AS people
  FROM districts d
  JOIN blocks b ON ST_Contains(d.geom, b.geom)
  GROUP BY d.name ORDER BY people DESC
""").df()

# Read big open data with zero import:
con.sql("SELECT count(*) FROM 'overture_places.parquet'")

Notice: the ST_Contains spatial join is byte-for-byte the same as the PostGIS version. That's the payoff of learning spatial SQL once.

Lab 2 Your first end-to-end query Python SQL

Connect from Python (PostGIS engine or DuckDB connection).
Run the population-per-district spatial join.
Print the busiest district. Bonus: gdf.explore() to see it on a map.

# Works against either engine — just change how you read:
sql = """SELECT d.name, sum(b.pop) AS people
         FROM districts d
         JOIN blocks b ON ST_Contains(d.geom, b.geom)
         GROUP BY d.name ORDER BY people DESC"""

# PostGIS:  df = gpd.read_postgis(sql, engine)
# DuckDB :  df = con.sql(sql).df()
print(df.head(1))     # -> the most-populated district

Checkpoint: you've now connected, run a spatial join, and pulled results into a dataframe — the core loop of every analysis to come. (For our Docker server: host localhost, port 5432, db riverbend, user postgres, password almanac.)

The Modern Geospatial Stack (2026) New

PostGIS remains a cornerstone of spatial data work, but the surrounding ecosystem has evolved quickly. This section is a map of the landscape so you know what tools exist, what they're good at, and how they fit alongside PostGIS in a public-health workflow.

PostGIS is no longer the only place to run spatial SQL

A major shift in recent years is that you can now run powerful spatial SQL without a database server at all. The biggest example is DuckDB with its spatial extension — an in-process analytical engine (think "SQLite for analytics") that speaks much of the same ST_* spatial SQL you'll learn for PostGIS.

DuckDB + spatial — the lightweight analytics powerhouse

No server to install or manage — runs inside R, Python, or the command line.
Reads and writes GeoParquet, GeoJSON, Shapefiles, and can query files directly on your disk or in cloud storage (S3/GCS/Azure) without loading them first.
Extremely fast for analytical scans and joins over tens-to-hundreds of millions of features on a laptop.
When to reach for it: ad-hoc analysis, one-off transformations, reading big open-data files, reproducible notebook pipelines.
When PostGIS is still better: a shared, central, multi-user database; enforced data integrity; transactional editing; a long-lived authoritative dataset many people query.

You can even use them together: pg_duckdb and DuckDB's `postgres` extension let DuckDB query a live PostGIS database directly.

PostGIS vs. DuckDB — how they really differ New

Because both speak ST_* spatial SQL, it's tempting to treat them as interchangeable. They aren't. They are architecturally opposite tools that happen to share a query language — and knowing the difference tells you which one to reach for. The one-line version: PostGIS is a database you connect to; DuckDB is a library you run inside your script.

PostGIS A spatial system of record

Client–server: a long-running PostgreSQL process many people connect to at once.
Row-oriented + MVCC: built for lots of concurrent reads and writes, transactions, and live editing.
Durable & governed: roles, permissions, constraints, backups — a single authoritative source of truth.
The deepest spatial toolbox: geography type, raster, topology, ST_AsMVT tiles, hundreds of mature functions.

Shines when the data is shared, edited, and long-lived.

DuckDB A spatial analytics engine

In-process / embedded: a library inside your Python/R/CLI session — "SQLite for analytics," no server.
Columnar + vectorized: tears through analytical scans, aggregations, and joins over huge tables.
Files-first: queries GeoParquet/CSV/GeoJSON in place on disk or in S3 — often no import step at all.
Lean & fast to start: pip install, load spatial, go; auto-builds spatial indexes at query time.

Shines for ad-hoc, single-analyst, big-file analysis.

At a glance: the same query language, opposite engines

Dimension	PostGIS	DuckDB + spatial
Architecture	Client–server; persistent database process	Embedded, in-process library (like SQLite)
Workload sweet spot	Transactional (OLTP) + serving apps & many users	Analytical (OLAP) scans, aggregations, joins
Storage model	Row-oriented heap tables	Columnar, vectorized execution
Concurrency	Many concurrent readers and writers (MVCC)	Single process; great parallel reads, one writer
Persistence	Durable, server-managed database	One `.duckdb` file or purely in-memory
Getting data in	Load/import into tables first	Query files in place (disk, URL, S3) — no import
Spatial index	Persistent GiST R-Tree you `CREATE INDEX`	R-Tree built automatically at query time
`geography` type	Yes — true geodetic metres on the spheroid	No native type — project, or use sphere helpers
Function library	Hundreds; + raster, topology, 3D, vector tiles	Growing GEOS-backed subset; vector-focused
Transactions	Full ACID, multi-statement, constraints, roles	ACID on its own file; analytics-first, lighter
Setup cost	Install/run a server (or Docker)	`pip install duckdb` — seconds
Role in the stack	The durable, shared source of truth	The fast analytical scratchpad / ETL engine

Same spatial SQL, a few dialect differences to watch SQL

The core predicates (ST_Intersects, ST_Contains, ST_DWithin, ST_Buffer) are byte-for-byte the same. The differences are mostly at the edges — point construction, the geography type, and how you produce rows.

PostGIS

-- Stamp an explicit SRID on a point
ST_SetSRID(ST_MakePoint(-117.4, 33.9), 4326)

-- True metres via the geography type
ST_DWithin(a.geom::geography,
           b.geom::geography, 5000)

-- Generate a grid of rows
SELECT * FROM generate_series(0,18) gx,
              generate_series(0,16) gy;

-- Persistent spatial index (you create it)
CREATE INDEX ON blocks USING GIST (geom);

DuckDB + spatial

-- ST_Point already takes (x, y); SRID is metadata-light
ST_Point(-117.4, 33.9)

-- No geography type: project to metres first…
ST_DWithin(ST_Transform(a.geom,'EPSG:4326','EPSG:32611'),
           ST_Transform(b.geom,'EPSG:4326','EPSG:32611'), 5000)
-- …or measure on a sphere: ST_Distance_Sphere(a, b)

-- range() in a cross join instead of generate_series
SELECT * FROM range(0,19) t(gx),
              range(0,17) u(gy);

-- No CREATE INDEX needed — R-Tree is automatic
SELECT count(*) FROM blocks;            -- read files too:
SELECT count(*) FROM 'blocks.parquet';

The two gotchas that bite most often: (1) DuckDB has no geography type — for real metres you must ST_Transform to a projected CRS (e.g. the right UTM zone) or use ST_Distance_Sphere; a bare distance on lon/lat is in degrees. (2) generate_series as a row source behaves differently — reach for range() in DuckDB. When in doubt, ask an AI to translate a query between the two and then verify (Part VIII).

Better together, not either/or: a common 2026 pattern is DuckDB for the heavy lifting — scan a 50‑million‑row GeoParquet file, filter and aggregate to a tidy result — then write that result into PostGIS as the shared, governed table everyone queries and maps. With pg_duckdb (a Postgres extension) or DuckDB's postgres extension, one can even reach into the other live.

Cloud-native geospatial formats

The "shapefile era" is fading. Modern formats are designed for the cloud, for big data, and for streaming only the bytes you need.

📦 GeoParquet

Columnar, compressed vector format. Fast, tiny, and the emerging standard for sharing large vector datasets. Read/written by GeoPandas, DuckDB, QGIS, and more. A modern replacement for shapefiles/CSV exports.

🗺️ PMTiles & vector tiles

A single-file format for serving interactive maps cheaply (even from static hosting). PostGIS can generate vector tiles directly with ST_AsMVT.

🛰️ COG & STAC (raster)

Cloud-Optimized GeoTIFFs let you stream just the pixels you need from satellite/environmental imagery; STAC is the catalog standard for finding it — relevant for environmental-exposure work.

🌐 Open base data: Overture Maps

A large, openly-licensed dataset (places, buildings, transportation, administrative areas) distributed as GeoParquet — useful free context for health analyses.

Spatial indexing beyond R-Trees: H3

Discrete global grid systems like H3 (hexagonal cells) have become popular for aggregating and joining point data at scale. The h3-pg extension brings H3 functions into PostGIS, which is handy for privacy-preserving aggregation of patient locations into uniform cells.

Choosing a tool: a quick guide

Your situation	Good default tool	Why
Shared, authoritative dataset, many analysts, ongoing edits	PostGIS	Concurrency, integrity, security, central source of truth.
Ad-hoc analysis of large files on your laptop / in a notebook	DuckDB + spatial	Zero setup, fast, reads GeoParquet & cloud files directly.
Interactive analysis in a familiar dataframe workflow	GeoPandas (Py) / sf (R)	Ergonomic, great for modeling, plotting, and small-to-medium data.
Sharing a dataset with collaborators or for archiving	GeoParquet	Compact, typed, portable; far better than shapefiles/CSV.
Publishing an interactive web map	PMTiles / ST_AsMVT	Cheap, scalable, no heavy map server required.

Key takeaway: Think in terms of a stack, not a single tool. PostGIS is your durable, shared spatial database; DuckDB and GeoParquet make ad-hoc, large-scale, and cloud analysis fast and cheap; GeoPandas/sf remain your interactive analysis home. The spatial SQL you learn for PostGIS transfers across much of this ecosystem.

Break ☕ 5 minutes

When we return: we build the full clinic-accessibility analysis end-to-end on your data (Part VII).

Case Study — Clinic Accessibility Hands-on

One question, carried end-to-end on your own Riverbend database. We'll answer it in five steps, each a short lab. By the end you'll have a reusable accessibility pipeline — every step runnable in PostGIS or DuckDB.

1Inventory & sanity-check the data — counts, SRIDs, validity

2Proximity — population within 5 km of any clinic

3Service areas — buffers and what they cover

4Aggregate by district — access rate per population

5Validate — sanity checks before we trust it

Question: "How many people live within 5 km of a clinic, and which districts are underserved relative to population?"

Step 1 — inventory & sanity-check SQL

Before any analysis: confirm what you have. Counts, coordinate systems, and validity.

-- How much data, and what SRID?
SELECT 'clinics' t, count(*), ST_SRID(geom) FROM clinics GROUP BY 3
UNION ALL
SELECT 'blocks', count(*), ST_SRID(geom) FROM blocks GROUP BY 3
UNION ALL
SELECT 'districts', count(*), ST_SRID(geom) FROM districts GROUP BY 3;

-- Any invalid polygons? (real data often has them)
SELECT district_id
FROM districts
WHERE NOT ST_IsValid(geom);

What we're checking:

All three layers share SRID 4326 → they'll align without transforming
Row counts match what we loaded (~3 / ~323 / 6)
No invalid geometries — if any, fix with ST_MakeValid(geom)

Lab 3.1

Run both queries. Confirm matching SRIDs and zero invalid rows before continuing.

Step 2 — population within 5 km of a clinic SQL

The core accessibility measure: which population points are "served"?

-- Mark each block as served / unserved, using true meters via geography
SELECT
  sum(pop)                                   AS total_pop,
  sum(pop) FILTER (WHERE served)             AS served_pop,
  round(100.0 * sum(pop) FILTER (WHERE served) / sum(pop), 1) AS pct_served
FROM (
  SELECT b.pop,
         EXISTS (
           SELECT 1 FROM clinics c
           WHERE ST_DWithin(b.geom::geography, c.geom::geography, 5000)
         ) AS served
  FROM blocks b
) s;

Why ST_DWithin + ::geography: it uses the spatial index and measures real meters — both traps from Part IV, avoided in one line.

Lab 3.2

Run it. What % of the county is within 5 km of a clinic? Try 3 km and 10 km — how sensitive is access to the threshold?

Step 3 — service areas as buffers SQL

Sometimes you want the zone itself — to map it, or intersect it with other layers.

-- A 5 km service area around each clinic
SELECT clinic_id,
       ST_Buffer(geom::geography, 5000)::geometry AS svc
FROM clinics;

-- Union them into one combined coverage polygon
SELECT ST_Union(
         ST_Buffer(geom::geography, 5000)::geometry
       ) AS coverage
FROM clinics;

ST_Buffer on geography grows a true-meter zone
ST_Union dissolves overlapping buffers into one coverage shape
Export coverage as GeoJSON to drop onto a web map, or intersect it with districts (next step)

Buffers are approximations of circles (segmented polygons) and ignore roads/terrain. For travel-time access you'd use a routing service — note the assumption.

Step 4 — access rate by district SQL

Now the decision-relevant output: a rate per district, so we compare fairly across populations.

SELECT
  d.name,
  d.population,
  sum(b.pop)                                   AS pop_in_blocks,
  sum(b.pop) FILTER (WHERE served)             AS served_pop,
  round(100.0 * sum(b.pop) FILTER (WHERE served)
        / nullif(sum(b.pop),0), 1)             AS pct_served
FROM districts d
JOIN blocks b ON ST_Contains(d.geom, b.geom)        -- points-in-polygons
CROSS JOIN LATERAL (
  SELECT EXISTS (SELECT 1 FROM clinics c
    WHERE ST_DWithin(b.geom::geography, c.geom::geography, 5000)) AS served
) x
GROUP BY d.name, d.population
ORDER BY pct_served ASC;          -- worst-served first

Lab 3.3

Run it. Which two districts are least served? Those are where you'd target a new clinic.

Step 5 — validate before you trust it SQL

A number is not an answer until you've checked it. Build the habit now.

Cross-checks:

Totals reconcile — does sum(served) ≤ total_pop? Does district pop roughly match population?
Spot a known case — pick one block next to a clinic; is it flagged served?
Boundary effects — blocks near the county edge may have clinics just outside your data → undercount
Units — distances in meters? geography confirms it
Sensitivity — does the ranking hold at 3 km vs 5 km?

-- One block you can reason about by hand
SELECT b.block_id,
  ST_Distance(b.geom::geography,
              c.geom::geography)::int AS m
FROM blocks b
JOIN clinics c ON c.clinic_id = 1
ORDER BY m
LIMIT 3;       -- nearest blocks to clinic 1

Validation is the analyst's job — not the database's, and certainly not the AI's.

Extending the pattern — more public-health questions

The same building blocks answer many questions. Notice the recurring verbs: contain, within-distance, buffer, intersect, aggregate.

🦠 Disease surveillance & clusters

Cases (points) in districts (polygons), normalized by population.

SELECT d.name, count(c.*) AS cases
FROM districts d
JOIN cases c ON ST_Contains(d.geom,c.geom)
GROUP BY d.name;     -- ÷ population → rate

🏭 Environmental exposure

Who lives inside a pollutant buffer?

SELECT sum(b.pop) AS exposed
FROM blocks b, sources s
WHERE ST_DWithin(
  b.geom::geography,
  s.geom::geography, 2000);   -- 2 km

Key takeaway: PostGIS answers sophisticated spatial questions efficiently — and once you know the handful of core predicates, new questions are recombinations of the same moves.

Learning GIS with Generative AI 🤖 New

Generative AI assistants (such as Claude, ChatGPT, Gemini, and coding tools like GitHub Copilot, Cursor, and Claude Code) have become some of the most effective ways to learn spatial concepts and to accelerate real GIS analysis. They are especially valuable for public-health researchers who know their domain deeply but are newer to databases and code. This section shows you how to use them well — and, just as importantly, how to use them safely.

Mindset: Treat the AI as a tireless, fast, occasionally-overconfident teaching assistant and pair-programmer — not as an oracle. It is excellent at drafting, explaining, and translating. It is you who remains the analyst responsible for correctness, ethics, and interpretation.

Four ways AI helps with GIS

Click each card to expand. These map to the whole learning-to-doing journey.

1. Tutor & explainer 📖

Ask it to explain concepts at your level, with public-health examples and analogies. Great for SRIDs, geometry vs. geography, spatial joins, projections, and "why is my distance in degrees?".

Try: "Explain the difference between geometry and geography in PostGIS to someone who knows statistics but not GIS, using a clinic-access example."

2. Code generator & translator 🔁

Turn a plain-English question into spatial SQL, or translate the same analysis between SQL, R (sf), and Python (geopandas). Excellent for getting unstuck on syntax.

Try: "Write a PostGIS query that counts disease cases per district, then show the equivalent in R sf and Python geopandas."

3. Debugger & reviewer 🐞

Paste an error message and your query; ask what's wrong and how to fix it. Ask it to review a query for correctness and performance (e.g., "is this using the spatial index?").

Try: "This query returns 0 rows but I expected matches. Here's the SQL and the SRIDs of each table — what's likely wrong?"

4. Analysis copilot 🚀

Plan a whole analysis with it: what data you need, which functions to use, how to validate results. Coding-agent tools (Claude Code, Cursor) can even run code, inspect data, and iterate with you.

Try: "I want to assess healthcare accessibility for a county. Help me plan the data, the PostGIS steps, and how I'd sanity-check the output."

How to write a good GIS prompt

The quality of the answer depends heavily on the context you give. A reliable recipe:

Role & level: "You are a PostGIS expert helping a public-health researcher new to SQL."
The environment: tool and version ("PostGIS 3.6, PostgreSQL 18"), language (SQL / R sf / Python geopandas).
The data: table names, key columns, geometry column name, geometry type, and SRID of each layer.
The goal: the precise question and the units you want (meters? rate per 1,000?).
The ask: "explain each step", "use a spatial index", "keep it to one query".

Copy-ready prompt templates

Fill in the [brackets] and paste into your AI assistant of choice.

Explain a concept

You are a patient GIS tutor. Explain [CONCEPT, e.g. "what an SRID is and why ST_Transform matters"] to a public-health researcher who understands statistics but is new to spatial data. Use a concrete clinic/disease-mapping example, avoid jargon (or define it), and finish with one common mistake to avoid.

Generate spatial SQL

You are a PostGIS expert (PostGIS 3.6 / PostgreSQL 18). I have: - Table "cases": columns id, diagnosis_date, geom (Point, SRID 4326) - Table "districts": columns district_id, name, population, geom (MultiPolygon, SRID 4326) Write one SQL query that returns each district's name, case count, and cases per 1,000 population. Make sure it can use a spatial index, and explain each line in plain English.

Translate between languages

Here is a PostGIS SQL query: [PASTE QUERY]. Show the equivalent analysis (a) in R using the sf package and (b) in Python using geopandas, reading from the same PostGIS database. Note any behavioural differences I should watch for.

Debug an error / empty result

My PostGIS query returns [ERROR MESSAGE / "0 rows but I expected matches"]. Here is the query: [PASTE QUERY]. The SRID of each table is: [LIST SRIDs]. The geometry columns are: [LIST]. What are the most likely causes, ranked, and how do I check and fix each one?

Plan & sanity-check an analysis

I want to [RESEARCH GOAL, e.g. "estimate how many people live within 5 km of a clinic in County X"]. I have [DESCRIBE DATA]. Propose a step-by-step PostGIS workflow, flag the assumptions and CRS/units decisions, and tell me three checks I can run to confirm the results are sensible before I trust them.

Lab 4 Prompt → run → verify Python SQL

Using your own Riverbend schema, ask an AI assistant to write the 5 km access rate by district query (our Step 4). Then verify it against the answer you already have.

Your prompt

You are a PostGIS 3.6 expert helping a public-health researcher. Tables (all SRID 4326): - clinics(clinic_id, name, geom Point) - blocks(block_id, pop, geom Point) - districts(district_id, name, population, geom Polygon) Write ONE query: per district, the % of block population within 5 km of any clinic. Use meters, use the spatial index, and explain each line.

Then check the AI's output against reality:

Does it use ::geography or ST_Transform (meters, not degrees)?
Does it use ST_DWithin (index) — not bare ST_Distance?
Run it. Does the ranking match your Step 4 result?
Ask a follow-up: "add EXPLAIN ANALYZE and confirm the index is used."

If its numbers differ from yours, you find out why. That gap-hunting is the skill.

A safe AI-assisted workflow

Use AI as a loop, not a one-shot. A dependable pattern:

1

Frame the question yourself. Know what answer would make sense (rough magnitude, units, expected shape of output). This is your guardrail against confident-but-wrong output.

2

Give rich context. Tool/version, table and column names, geometry types, and SRIDs (see the templates above).

3

Ask for an explanation, not just code. "Explain each step" turns the answer into a learning opportunity and surfaces flawed assumptions.

4

Run it on a small sample first. Use LIMIT, a single district, or a known case where you can predict the answer.

5

Verify against reality. Check counts, totals, and a couple of records by hand or with an independent method. Cross-check function names in the official docs.

6

Iterate. Feed back what you saw ("the totals are 10× too high — could it be a unit issue?") and refine.

Trust, but verify: common AI pitfalls in GIS

⚠️ AI can be confidently wrong. Watch for these GIS-specific failure modes:

Invented functions/arguments: AI sometimes "hallucinates" a plausible-sounding ST_* function or option that doesn't exist. Confirm names in the PostGIS reference.
SRID / units mistakes: measuring distance on SRID 4326 gives degrees, not meters. AI may forget to ST_Transform to a projected CRS or to use the geography type. Always check the units of your answer.
Invalid geometries: real-world data has self-intersections and nulls. Suggested queries may fail or mislead until you run ST_IsValid / ST_MakeValid.
Performance blind spots: a "correct" query can be unusably slow without a spatial index or with the wrong join. Ask it to EXPLAIN ANALYZE and to confirm index use.
Outdated patterns: training data may reference old versions (e.g., psycopg2 instead of psycopg v3). State your versions explicitly.
Statistical / epidemiological subtlety: spatial clustering, the modifiable areal unit problem (MAUP), edge effects, and ecological fallacy are easy for AI to gloss over. Apply your domain expertise.

🔒 Data privacy & ethics (read this twice)

Never paste protected health information (PHI/PII) or precise patient locations into a consumer AI chat. Patient addresses and coordinates are among the most sensitive and re-identifiable data that exist. Pasting them into a public tool may violate HIPAA, GDPR, IRB protocols, and your institution's data-use agreements.

Share schema, not data. Give the AI your table and column structure, types, and SRIDs — never the actual rows.
Use synthetic or fake examples when you need sample values ("a clinic at lng/lat 0,0").
De-identify and aggregate before any sharing — e.g., aggregate points to census tracts or H3 cells; apply your jurisdiction's small-cell suppression rules.
Prefer approved/enterprise AI tools with a zero-data-retention or no-training agreement, or self-hosted/local models, when working near sensitive data — and follow your institution's policy first.
Keep a human in the loop for decisions. AI output informs analysis; it does not make public-health decisions.

Good practice: describe your problem in terms of structure and goals. For example — "I have a points table of de-identified case locations (SRID 4326) and a polygons table of census tracts; help me aggregate cases per tract" — gives the AI everything it needs without exposing a single real record.

The AI tooling landscape for GIS

💬 Chat assistants

Claude, ChatGPT, Gemini. Best for explaining concepts, drafting queries, translating between languages, and debugging from pasted errors.

⌨️ In-editor assistants

GitHub Copilot, Cursor, and notebook AI features autocomplete and refactor code where you work — handy in RStudio, VS Code, and Jupyter.

🛠️ Coding agents

Tools like Claude Code can read your files, run queries, inspect results, and iterate on a whole analysis with you — closer to a true copilot than a chatbot.

🔌 MCP & text-to-SQL

The Model Context Protocol (MCP) lets AI assistants connect directly to a (read-only!) PostGIS database to inspect the schema and answer questions in natural language. Powerful — but scope permissions tightly, especially with health data.

Emerging generative-AI applications in the field New

Beyond chat assistants, generative AI is showing up inside the geospatial workflow. These are real, active areas in 2026 — useful to know about even if you don't use all of them yet. Click a card to expand.

🛰️ Geospatial foundation models for imagery

Models like Prithvi (NASA–IBM, trained on Landsat/Sentinel-2 and now even running in orbit) and Clay turn satellite/aerial imagery into ready-to-use features for flood mapping, burn-scar detection, land-use and cloud-gap reconstruction — with little to no manual feature engineering.

Public-health relevance: environmental-exposure assessment, disaster response, monitoring flooding/heat/air-quality drivers of disease.

✂️ Promptable feature extraction (SAM-for-geo)

Segment-Anything-style models (e.g., segment-geospatial/samgeo) let you extract buildings, water bodies, fields, or roads from imagery with a click or a text prompt — no hand-digitizing.

Public-health relevance: rapidly mapping settlements/buildings for population denominators, sanitation, or access studies in data-poor regions.

💬 "Chat with your map / database"

QGIS LLM plugins (GeoAgent, Promptly, IntelliGeo, Whitebox AI Agent) and MCP/text-to-SQL tools let you describe an operation in plain language; the AI inspects your layers/CRS/schema and generates the geoprocessing or SQL to run it.

Public-health relevance: lowers the barrier for epidemiologists to run spatial analyses without deep GIS/SQL skills (still verify the output!).

🧪 Synthetic & privacy-preserving data

Generative models can produce synthetic populations or case locations that preserve realistic spatial distributions without exposing real individuals — for sharing, teaching, or method development.

Public-health relevance: collaborate and publish reproducible spatial methods while protecting PHI; build demos without real patient data.

📍 Geocoding, parsing & record linkage

LLMs are strong at cleaning messy addresses, standardizing place names, parsing free-text locations from clinical notes, and matching records across datasets that lack a shared key.

Public-health relevance: turning inconsistent surveillance/EHR address fields into mappable, linkable data (do this on de-identified data in approved environments).

🔎 Semantic + spatial search (embeddings)

With pgvector inside PostgreSQL/PostGIS you can combine vector embeddings (semantic similarity) with spatial filters — e.g., "facilities similar to X and within 10 km" — in a single database.

Public-health relevance: retrieval-augmented analysis over reports/facilities, smarter matching of services to needs.

🤝 Agentic analysis copilots

Coding agents (e.g., Claude Code, Cursor) now read your files, write and run spatial code, inspect results, and iterate end-to-end — turning a research question into a reproducible pipeline with you supervising.

Public-health relevance: faster, more reproducible analyses; great for prototyping a study workflow before scaling it.

🗺️ AI cartography & narrative reporting

Generative tools assist with map styling, choosing classifications/color schemes, and drafting plain-language summaries of spatial results for briefs and dashboards.

Public-health relevance: communicating findings to decision-makers and the public quickly — with a human checking accuracy and avoiding misleading maps.

⚠️ Same rules apply: these applications are powerful but not infallible. Foundation-model outputs need ground-truthing, LLM-generated code and SQL need verification, and none of them excuse putting real PHI into unapproved tools. Treat AI output as a draft to validate, not a result to publish.

Key takeaway: Generative AI dramatically lowers the barrier to learning and doing GIS — as a tutor, code translator, debugger, and analysis copilot. Get the most from it by giving rich context (versions, schema, SRIDs), asking for explanations, and verifying every result. And never trade your participants' privacy for convenience: share structure, not sensitive data.

Pop Quiz: Test Your Knowledge!

Let's see what you've learned. For each question, consider the options and then click "Show Answer" to see the correct choice and explanation.

Question 1: What is a key advantage of using a spatial database like PostGIS over traditional shapefiles for collaborative public health research?

A) Shapefiles support more advanced SQL queries.
B) PostGIS allows for robust concurrent access and data integrity.
C) Shapefiles are better for handling very large datasets.
D) PostGIS does not require understanding coordinate systems.

Question 2: Which PostGIS geometry type would be most suitable for representing the boundaries of administrative districts (e.g., counties)?

A) Point
B) LineString
C) Polygon
D) MultiPoint

Question 3: What is the primary purpose of an SRID (Spatial Reference Identifier) in PostGIS?

A) To uniquely identify each record in a table.
B) To define the visual style of map features.
C) To define the coordinate system and give real-world meaning to geometric data.
D) To speed up non-spatial attribute queries.

Question 4: When reading spatial data from PostGIS into R using `sf::st_read()` or Python using `geopandas.read_postgis()`, what is a benefit of using a SQL query within these functions?

A) It's the only way to load any data.
B) It allows for server-side filtering and processing, which can be more efficient for large datasets.
C) It bypasses the need for database connection parameters.
D) It automatically converts data to a local file format like a shapefile.

Question 5: To find all health clinics within a 2km radius of a patient's address in PostGIS, which spatial function is most directly applicable and efficient for this proximity query?

A) `ST_Area()`
B) `ST_DWithin()`
C) `ST_Union()`
D) `ST_Centroid()`

Question 6: You need to do fast, one-off analysis of a 50-million-row GeoParquet file on your laptop, with no database server to set up. Which modern tool is the best fit?

A) A shapefile opened in a spreadsheet
B) DuckDB with the spatial extension
C) Emailing the file to a colleague
D) Converting everything to CSV first

Question 7: You're using a generative AI assistant to help with a spatial analysis of de-identified patient case locations. Which practice is most appropriate?

A) Paste the full table of patient coordinates so the AI has complete context.
B) Describe the schema (table/column names, geometry types, SRIDs) and use synthetic example values — never real PHI.
C) Trust the generated query without checking, since AI rarely makes mistakes.
D) Skip stating your PostGIS version because it never matters.

Summary & Resources

This final section summarizes the key benefits of PostGIS for public health research and provides a list of resources for continued learning.

Quick Summary & Benefits Recap

Overcomes Flat File Limitations: Structured, secure, scalable data storage; better data integrity, attribute handling, concurrent access.
Enables Complex Spatial Analysis: Rich library of spatial SQL functions for proximity, containment, overlay, buffering directly in the database.
Promotes Data Integrity and Scalability: Enforces data types, relationships; efficiently manages very large datasets.
Facilitates Collaboration: Multi-user environment for centralized, consistent data access.
Integrates with Familiar Analysis Environments: Easy access from R (sf, DBI) and Python (geopandas, SQLAlchemy, psycopg v3).
Fits a Modern Stack: Complements cloud-native formats (GeoParquet), serverless engines (DuckDB + spatial), and open base data (Overture Maps) — the spatial SQL you learn transfers across them.
Pairs Well with Generative AI: AI assistants accelerate learning and analysis — when you supply rich context (versions, schema, SRIDs), verify results, and protect sensitive data.

Q&A and Further Resources

To continue your learning journey:

Official Websites:

PostGIS: postgis.net
PostgreSQL: postgresql.org

Key R Packages Documentation:

`sf` (Simple Features): r-spatial.github.io/sf/
`DBI` (DataBase Interface): dbi.r-dbi.org
`RPostgres`: rpostgres.r-dbi.org

Key Python Packages Documentation:

`geopandas`: geopandas.org
`SQLAlchemy`: sqlalchemy.org
`psycopg2` (PostgreSQL adapter): psycopg.org/docs/ (psycopg2/3)

Modern Geospatial Stack: New

DuckDB Spatial: duckdb.org → spatial
GeoParquet: geoparquet.org
Overture Maps (open base data): overturemaps.org
Cloud-Native Geospatial Forum: cloudnativegeo.org
Spatial SQL reference (PostGIS functions): postgis.net/docs/reference.html

Sample & test datasets (to practice on): New

PostGIS "Intro" NYC workshop data: postgis.net → workshops
Natural Earth (public-domain global vectors): naturalearthdata.com
Overture Maps via DuckDB: docs.overturemaps.org
US Census TIGER/Line (US geography): census.gov → TIGER/Line
OpenStreetMap regional extracts: download.geofabrik.de
Source GeoJSON for quick tests (US states, counties): geojson.org

Learning & Doing GIS with AI: New

Use the copy-ready prompt templates in the "Learning GIS with AI" section as starting points.
Always state your tool + version (e.g., "PostGIS 3.6, PostgreSQL 18") and your schema + SRIDs in prompts.
Verify AI-suggested function names against the official PostGIS reference, and check the units of any distance/area result.
Privacy: never paste PHI/PII or precise patient coordinates into AI tools — share structure and synthetic examples, follow your institution's data policy, and prefer approved/enterprise or local models near sensitive data.

Emerging GenAI-in-geospatial to explore:

Prithvi geospatial foundation model: ibm-nasa-geospatial (Hugging Face)
Clay foundation model: madewithclay.org
segment-geospatial (SAM for imagery): samgeo.gishub.org
QGIS LLM plugins (GeoAgent, Promptly, IntelliGeo): QGIS natural-language plugins
pgvector (embeddings in PostgreSQL): github.com/pgvector/pgvector

General Tutorial Search Terms:

"PostGIS Introduction tutorial"
"GeoPandas PostGIS tutorial"
"R sf PostGIS tutorial"
"DuckDB spatial GeoParquet tutorial"
"spatial SQL ST_DWithin example"

Hands-on practice with the provided sample database details and code snippets is highly encouraged — and try working through one of the use cases with an AI assistant as your tutor!

The PostGIS Almanac