Spatial Databases (PostGIS) & AI-Assisted GIS for Public Health Research

Welcome! This interactive application explores the concepts and benefits of using PostGIS spatial databases in public health research. It's designed for researchers who may have limited prior experience with databases or programming but are familiar with public health data.

Navigate through the sections using the sidebar to learn about why spatial databases are crucial, understand PostGIS basics, see how to connect using R and Python, get oriented to the modern (2026) geospatial stack, explore real-world use cases, learn how to use generative AI to study GIS concepts and run spatial analyses safely, test your knowledge with a pop quiz, and find resources for further learning.

Overall Goal: To introduce the concept and benefits of using PostGIS spatial databases in public health research, demonstrate basic data access using R and Python, and equip you to learn and perform GIS work with the help of generative AI — responsibly.

📅 2026 revamp: This edition updates the material for the current geospatial ecosystem (PostGIS 3.6, PostgreSQL 18, GeoParquet, DuckDB, psycopg v3, GeoPandas 1.x) and adds a hands-on guide to learning and doing GIS with generative AI. New and updated content is marked with a New badge.

How to follow along

This is a hands-on lecture: we alternate concept → demo → you try it. When you see a Lab box, pause and run it yourself — solutions follow. Code is labelled by track so you can pick the one you know:

SQL Python R DuckDB Docker

Two ways to practice — set up either (or both) in Part III

A zero-setup DuckDB track that runs on any laptop, and a full Docker PostGIS track for the real multi-user server experience. The spatial SQL is nearly identical across both.

Learning objectives

By the end of today, you should be able to:

  • Explain why spatial databases beat CSVs and shapefiles for research
  • Stand up a working PostGIS (Docker) and DuckDB spatial environment
  • Describe geometry types, SRIDs/CRS, geometry vs. geography
  • Create and use a spatial index — and read EXPLAIN ANALYZE
  • Write spatial SQL: ST_DWithin, ST_Intersects, spatial joins, buffers
  • Connect from Python (geopandas) and R (sf)
  • Carry one public-health analysis end-to-end (clinic accessibility)
  • Use generative AI to draft, translate, and debug GIS work — and verify it

The throughline: the spatial SQL you learn today transfers across PostGIS, DuckDB, and AI-assisted workflows alike. Learn it once; use it everywhere.

Today's agenda & timing

IOverview & goals~5 min
IIWhy spatial databases~10 min
IIISetup — your lab Hands-on~15 min
IVPostGIS fundamentals (deep)~25 min
VConnecting & querying~15 min
VIModern stack · then a break~10 min
VIICase study end-to-end Hands-on~25 min
VIIIGIS with generative AI Hands-on~20 min
IXPop quiz~10 min
XSummary & resources~5 min

Before we go further: if you can, start the Docker download now (Part III, the docker run command) — it pulls in the background while we cover concepts.

Why Spatial Databases?

This section delves into the importance of spatial data in public health and the limitations of traditional data storage methods like CSVs and Shapefiles, highlighting the need for robust database solutions like PostGIS.

What is Spatial Data in Public Health?

Spatial data links information to a specific location on Earth. In public health, understanding "where" health events occur, vulnerable populations reside, and services are located is fundamental.

Examples include:

Visualizing such data on maps can reveal important characteristics or trends not apparent from tabular data alone. This enables targeted, geographically-focused interventions.

Key takeaway: Spatial context is crucial in understanding and addressing many public health issues.

Limitations of Traditional Data Storage & The Need for Databases

While CSVs and Shapefiles are common, they have significant limitations, especially for large, complex datasets or collaborative work.

Challenges with CSV Files

  • No inherent spatial understanding: Coordinates are plain numbers; system must be explicitly defined.
  • Data type ambiguity: No enforced data types, can lead to precision loss (e.g., ZIP codes losing leading zeros).
  • Inconsistent formatting: Issues with parsing, delimiters, special characters.
  • Limited querying: No built-in spatial query capabilities.
  • Scalability issues: Poor performance with very large datasets.
  • Data integrity/security: Lacks encryption, password protection, validation.

Challenges with Shapefiles

  • File proliferation: A single "shapefile" is multiple files (.shp, .shx, .dbf, .prj).
  • Data duplication/inconsistency: Easy to create multiple versions, leading to redundancy.
  • Attribute limitations: Field names limited to 10 chars; no null values (often uses arbitrary numbers); dates without time; 2GB file size limit.
  • Limited querying/analysis: Complex spatial queries require external GIS and convoluted workflows.
  • No concurrent access: Not designed for simultaneous multi-user access, leading to versioning issues.

These limitations can undermine research quality, reproducibility, and collaboration. Relational databases (like PostgreSQL) and specialized spatial databases (like PostGIS, an extension for PostgreSQL) address these challenges.

Summary of Challenges: Traditional vs. Database Approach

Issue Example with CSVs Example with Shapefiles Impact on Public Health Research
Data Integrity Latitude/Longitude stored as text, potential precision loss. Field name "ENVIRONMENTAL_EXPOSURE_LEVEL" truncated. Nulls replaced with 0. Inaccurate analysis, misinterpretation, inability to distinguish missing from true zero.
Scalability Slow processing of 1 million patient address records. 2GB file size limit reached with detailed national boundary data. Inability to analyze large-scale or high-resolution datasets efficiently.
Spatial Querying Cannot directly query for clinics within 1km of a patient address. Requires external GIS and complex workflows for proximity analysis. Slow, inefficient, and error-prone execution of complex spatial questions.
Collaboration High risk of overwriting data if multiple users edit a shared CSV. Difficult to merge changes from different team members. Hinders teamwork, potential data loss, outdated analyses, reproducibility issues.
Coordinate System Info No embedded coordinate system information. `.prj` file stores system, but can be missing/incorrect. Risk of incorrect spatial analysis if data misaligned.

Key takeaway: Spatial databases offer a more robust, efficient, and powerful way to manage and analyze geographic health data, overcoming critical limitations of traditional file-based approaches.

The question we'll answer all day

To make this concrete, we'll carry one realistic analysis through the whole lecture.

Running case study

Clinic accessibility in Riverbend County

"How many people live within 5 km of a primary-care clinic, and which districts are underserved relative to their population?"

Our (synthetic) data — three tables we'll build in Part III:

TableGeometryKey columns
clinicsPoint (4326)clinic_id, name
districtsMultiPolygon (4326)district_id, name, population
blocksPoint (4326)block_id, pop
Note Why synthetic data?
We use synthetic data on purpose — never real patient locations in a teaching or AI context. More on that in Part VIII.

Setup — Your Lab Environment Hands-on

You'll get the most out of today if you can run queries yourself. Pick a track — or do both. The two engines share nearly identical spatial SQL, so the skill transfers either way.

Two tracks, one set of skills

DuckDB Zero-setup, local

  • One binary or pip installno server
  • Runs spatial SQL inside Python / R / CLI
  • Reads GeoParquet & files directly
  • Perfect for labs on any laptop

Best if you want to start immediately.

Docker Real PostGIS server

  • A genuine multi-user PostgreSQL + PostGIS
  • Concurrency, roles, indexes, EXPLAIN
  • Mirrors a production research database
  • Needs Docker Desktop installed

Best if you want the authentic workflow.

The SQL is nearly identical. DuckDB's spatial extension implements the same ST_* functions. Learn the query once; run it in either engine.

Track A — DuckDB in 60 seconds DuckDB

The fastest path to running spatial SQL. No server, no accounts.

Option 1 — Python (what we'll use in labs)

pip install "duckdb>=1.1" "geopandas>=1.0"

# then, in Python:
import duckdb
con = duckdb.connect("riverbend.duckdb")
con.sql("INSTALL spatial; LOAD spatial;")
con.sql("SELECT ST_Point(0, 0) AS p;").show()

Option 2 — CLI (great for quick checks)

# macOS / Linux
brew install duckdb        # or download the binary
duckdb riverbend.duckdb

-- inside the DuckDB shell:
INSTALL spatial; LOAD spatial;
SELECT ST_AsText(ST_Point(-117.4, 33.9));
-- POINT(-117.4 33.9)
Lab 0a Verify DuckDB
Run either snippet. If you see a POINT printed, your spatial engine works. ✅

Track B — Docker PostGIS in one command Docker

The official image bundles PostgreSQL 18 + PostGIS 3.6. One command, a real server.

# Pull & run a PostGIS server (downloads ~600 MB the first time)
docker run -d --name almanac-pg \
  -e POSTGRES_PASSWORD=almanac \
  -e POSTGRES_DB=riverbend \
  -p 5432:5432 \
  postgis/postgis:18-3.6

Connect with psql (inside the container)

docker exec -it almanac-pg \
  psql -U postgres -d riverbend

-- Turn on PostGIS in this database (once)
CREATE EXTENSION IF NOT EXISTS postgis;
SELECT postgis_full_version();

Tip: -d runs it in the background. Stop/restart with docker stop almanac-pg / docker start almanac-pg. Your data persists in the container until you docker rm it.

Teaching only: a hard-coded password and open port are fine on your laptop — never on a shared or internet-facing host.

Lab 0b Prove PostGIS is alive SQL
Run this in psql (Docker) or the DuckDB shell. Both should return a version string and a valid point.

Docker / PostGIS

SELECT postgis_version();
-- 3.6 USE_GEOS=1 USE_PROJ=1 ...

SELECT ST_AsText(
  ST_SetSRID(ST_MakePoint(-117.4, 33.9), 4326)
);
-- POINT(-117.4 33.9)

DuckDB

LOAD spatial;
SELECT version();        -- v1.1+

SELECT ST_AsText(
  ST_Point(-117.4, 33.9)
);
-- POINT (-117.4 33.9)

Loading the sample data SQL

We'll generate the synthetic Riverbend tables directly in SQL, so everyone has identical data.

Clinics — a handful of points

CREATE TABLE clinics (
  clinic_id int PRIMARY KEY,
  name      text,
  geom      geometry(Point, 4326)
);

INSERT INTO clinics VALUES
 (1,'Riverbend Family Health',
    ST_SetSRID(ST_MakePoint(-117.40, 33.95),4326)),
 (2,'Eastside Community Clinic',
    ST_SetSRID(ST_MakePoint(-117.31, 33.92),4326)),
 (3,'North Valley Health Center',
    ST_SetSRID(ST_MakePoint(-117.36, 34.02),4326));

Blocks — population points (a grid)

CREATE TABLE blocks AS
SELECT
  row_number() OVER () AS block_id,
  (50 + (random()*450))::int AS pop,
  ST_SetSRID(ST_MakePoint(
    -117.45 + gx*0.01,
     33.88 + gy*0.01), 4326) AS geom
FROM generate_series(0,18) gx,
     generate_series(0,16) gy;

~300 population points across the county. (DuckDB: use range() instead of generate_series in a cross join — the AI can translate it; see Part VIII.)

Districts — simple square cells acting as administrative areas

CREATE TABLE districts AS
SELECT
  d AS district_id,
  'District ' || d AS name,
  (8000 + (random()*40000))::int AS population,
  ST_SetSRID(
    ST_MakeEnvelope(                        -- xmin, ymin, xmax, ymax
      -117.45 + (d%3)*0.10,  33.88 + (d/3)*0.08,
      -117.35 + (d%3)*0.10,  33.96 + (d/3)*0.08
    ), 4326) AS geom
FROM generate_series(0,5) d;

You now have three layers: clinics (points), blocks (population points), and districts (polygons) — all in SRID 4326. This is the data for our case study in Part VII.

Prefer real files? The same tables ship as GeoParquet in the course pack. DuckDB can read them straight from a URL or disk — SELECT * FROM 'clinics.parquet'; — no import step needed.

Lab 0c Create the tables
Run all three blocks. Then: SELECT count(*) FROM blocks; — expect ~323 rows.

PostGIS Fundamentals

This section introduces PostgreSQL, the PostGIS extension, and core spatial concepts essential for using PostGIS effectively in public health research.

What is PostgreSQL and PostGIS?

PostgreSQL: A powerful, open-source object-relational database system (ORDBMS). Known for reliability, data integrity, scalability, and extensibility. It's free and community-supported.

PostGIS: An extension to PostgreSQL that "teaches" it to understand, store, manage, and analyze geographic objects. It adds:

Analogy: PostgreSQL is the filing cabinet; PostGIS provides special folders and tools for maps within it.

Current versions (mid-2026): PostgreSQL 18 and PostGIS 3.6.x are the current stable releases. Recent PostGIS versions keep expanding raster, vector-tile (`ST_AsMVT`), and 3D support, and modern PostgreSQL releases bring faster parallel queries and better JSON handling — all useful for large public-health datasets. You don't need the newest version to learn, but knowing what's current helps when reading docs and asking AI assistants for help.

Key takeaway: PostGIS turns a regular (but very powerful) database (PostgreSQL) into a spatial powerhouse, specifically designed to handle and analyze geographic information.

Core Spatial Concepts in PostGIS

1. Geometry Types

PostGIS stores geographic features as specific geometry types. Click each type to see examples:

Point 📍

Represents a single location (e.g., X,Y coordinates).

Public Health Examples: Patient residence, clinic location, disease case site, water sample point.

LineString 〰️

Represents linear features (sequence of connected points).

Public Health Examples: Roads, rivers, public transport routes, disease vector paths.

Polygon 🏞️

Represents an area (closed sequence of points).

Public Health Examples: Administrative boundaries (counties, states), service areas, exposure zones, park boundaries.

PostGIS also supports Multi-geometries (MultiPoint, MultiLineString, MultiPolygon) which are collections of the same geometry type, plus GeometryCollections.

How geometries are written — WKT & WKB

Every geometry has a human-readable and a binary form. You'll see both constantly.

WKT — Well-Known Text (readable)

POINT(-117.4 33.9)
LINESTRING(0 0, 1 1, 2 1)
POLYGON((0 0, 4 0, 4 4, 0 4, 0 0))

-- Constructing & inspecting
SELECT ST_AsText(            -- geometry → WKT
  ST_GeomFromText('POINT(-117.4 33.9)', 4326)
);
SELECT ST_AsGeoJSON(geom)    -- → GeoJSON for web maps
FROM clinics LIMIT 1;

WKB (Well-Known Binary) is the compact form actually stored on disk. You rarely write it by hand — functions convert for you.

Constructor cheat-sheet:
ST_MakePoint(x, y) — fast point
ST_SetSRID(geom, 4326) — stamp a CRS
ST_GeomFromText(wkt, srid) — from WKT
ST_AsText / ST_AsGeoJSON — read it back

Order matters: WKT and ST_MakePoint are (longitude, latitude) — X then Y. Reversing them silently puts your data in the wrong hemisphere.

2. Spatial Reference Systems (SRS) / SRID

An SRS (or CRS) defines how coordinate values relate to actual locations on Earth. It's the "language" or "map projection" your data speaks.

Conceptual Visualization: SRID Alignment

Layer A (SRID: X)

Layer B (SRID: Y)

Misaligned! 🚫

➡️

Layer A (Transformed to SRID: Z)

Layer B (Transformed to SRID: Z)

Correctly Aligned!

Illustrates that layers with different SRIDs won't overlay correctly unless transformed to a common system.

The classic trap — degrees are not meters

Distance on lat/lon (SRID 4326) is measured in degrees, which is almost never what you want.

-- WRONG: "0.045" — that's degrees, meaningless
SELECT ST_Distance(
  ST_SetSRID(ST_MakePoint(-117.40,33.95),4326),
  ST_SetSRID(ST_MakePoint(-117.36,34.02),4326)
);

-- RIGHT (option 1): transform to a metric CRS
SELECT ST_Distance(
  ST_Transform(a.geom, 32611),   -- UTM 11N, meters
  ST_Transform(b.geom, 32611)
) FROM clinics a, clinics b
WHERE a.clinic_id=1 AND b.clinic_id=3;
-- ≈ 8100  (meters)
-- RIGHT (option 2): cast to geography
-- (meters on a sphere)
SELECT ST_Distance(
  a.geom::geography,
  b.geom::geography
) FROM clinics a, clinics b
WHERE a.clinic_id=1 AND b.clinic_id=3;
-- ≈ 8120  (meters)

If a distance looks like 0.0x, you're in degrees. Transform to a projected CRS (e.g. the right UTM zone) or cast to geography.

geometry vs. geography — which to use

PostGIS offers two spatial types. Choosing well saves you from the degrees-vs-meters trap.

geometry — Cartesian (flat)

  • Math on a flat plane — fast
  • Units are the CRS's units (degrees for 4326, meters for UTM)
  • Hundreds of functions; the default for analysis
  • Best when your data is in a projected CRS, or covers a small area

geography — geodetic (round Earth)

  • Math on the spheroid — distances/areas in meters, always
  • Correct over long distances & across UTM zones
  • Fewer functions, somewhat slower
  • Best when data spans large areas in lat/lon and you want true meters

Rule of thumb: national or continental lat/lon data and you want meters → geography. Local analysis or you've projected to UTM/State Plane → geometry. You can cast between them: geom::geography.

3. Spatial indexes

Like a book's index — but for locations. They dramatically speed up spatial queries ("what's near X?", "what's inside Y?").

-- Build a spatial index (do this on every geometry column you query)
CREATE INDEX blocks_geom_idx ON blocks USING GIST (geom);
ANALYZE blocks;            -- refresh planner statistics

Functions that use the index automatically: ST_DWithin, ST_Intersects, ST_Contains, && — because they begin with a bounding-box test. ST_Distance alone does not; prefer ST_DWithin for proximity.

Proving the index works — EXPLAIN ANALYZE

The planner tells you whether it used your index. Reading this is a core skill.

EXPLAIN ANALYZE
SELECT b.block_id
FROM blocks b, clinics c
WHERE c.clinic_id = 1
  AND ST_DWithin(
        b.geom::geography,
        c.geom::geography, 5000);

What to look for:

  • Index Scan using blocks_geom_idx — good, the index is used
  • ⚠️ Seq Scan on a big table — the index was skipped (missing index, wrong type, or a function wrapping the column)
  • actual time= — real milliseconds; compare before/after adding the index
  • rows= — estimate vs. actual; large gaps mean stale stats → run ANALYZE
Lab 1 Index & explain SQL
Create blocks_geom_idx, run the EXPLAIN ANALYZE above, and note the time. Drop the index, re-run, and compare. (DuckDB users: skip — it auto-indexes; focus on the query.)

Spatial relationships — the vocabulary

Most spatial questions reduce to a handful of relationship predicates. They return true/false.

FunctionAsks…
ST_Intersects(a,b)do they touch/overlap at all?
ST_Contains(a,b)is b fully inside a?
ST_Within(a,b)is a fully inside b?
ST_DWithin(a,b,d)are they within distance d?
ST_Touches(a,b)do they share only a boundary?
ST_Crosses(a,b)does a line cross a polygon?

Under the hood these implement the DE-9IM model — a 3×3 matrix describing how two geometries' interiors, boundaries, and exteriors relate. You rarely write DE-9IM directly, but it's why the named predicates are precise and composable.

Measurement companions: ST_Distance, ST_Length, ST_Area, ST_Buffer, ST_Intersection, ST_Union, ST_Centroid.

The workhorse — spatial joins

A spatial join connects two tables by a spatial relationship instead of a shared key. This is where PostGIS earns its keep.

-- Count population points inside each district
SELECT d.name,
       count(b.*)      AS n_blocks,
       sum(b.pop)      AS people
FROM districts d
JOIN blocks b
  ON ST_Contains(d.geom, b.geom)   -- the join!
GROUP BY d.name
ORDER BY people DESC;

How it works:

  • The ON clause uses a spatial predicate, not a.id = b.id
  • The planner uses the GiST index to avoid comparing every pair
  • JOIN keeps matches; LEFT JOIN keeps districts with zero points too (watch for those!)

Pattern to memorize: points-in-polygons via JOIN … ON ST_Contains(poly, pt) + GROUP BY. It answers a huge share of public-health questions.

Part IV key takeaway: PostGIS stores shapes (geometries) with a real-world coordinate system (SRID), lets you choose geometry (flat, fast) or geography (true meters), uses GiST spatial indexes to stay fast, and answers questions through relationship predicates and spatial joins. Watch the two classic traps: degrees vs. meters, and queries that skip the index (EXPLAIN ANALYZE is your friend).

Connecting to & Querying PostGIS

This section covers general principles for database connections and demonstrates how to connect to PostGIS and retrieve data using R and Python, along with a brief note on SQL.

General Connection Principles

To access any database, you need standard parameters:

For demos, a read-only account is used to prevent accidental data modification.

Key takeaway: These five pieces of information are the standard keys needed to unlock access to almost any database.

Using R with PostGIS

Key R Packages: `sf` (Simple Features for spatial data), `DBI` (DataBase Interface), `RPostgres` (PostgreSQL driver).

The following snippet shows how to connect, read data, and disconnect. Replace placeholder details with your actual database credentials.


# 1. Load necessary libraries
# install.packages(c("sf", "DBI", "RPostgres")) # Run once if not installed
library(sf)
library(DBI)
library(RPostgres)

# 2. Database connection parameters (REPLACE with actual details)
db_host <- "your_host_address"
db_port <- 5432
db_name <- "your_sample_db_name"
db_user <- "readonly_user"
db_password <- "readonly_password"

# 3. Establish connection
con <- dbConnect(RPostgres::Postgres(),
                 dbname = db_name,
                 host = db_host,
                 port = db_port,
                 user = db_user,
                 password = db_password)

# 4. Read a spatial table into an sf object
# Example: Read 5 features from 'health_clinics' table
sql_query_clinics <- "SELECT objectid, clinic_name, address, geom FROM health_clinics LIMIT 5;"
health_data_sf <- st_read(con, query = sql_query_clinics)

# 5. View the loaded data
print(health_data_sf)
# plot(st_geometry(health_data_sf)) # Simple plot

# 6. Close the connection
dbDisconnect(con)
                        

Key takeaway: R, with packages like `sf` and `DBI`, can easily connect to PostGIS and bring spatial data directly into a familiar analysis environment.

Brief Note on SQL

The `query` parameter in `st_read()` (R) and `read_postgis()` (Python) uses SQL (Structured Query Language). PostGIS extends SQL with many spatial functions (e.g., `ST_Distance`, `ST_Intersects`, `ST_Contains`, `ST_Buffer`, `ST_DWithin`).

Performing spatial operations directly in the database using these SQL functions is often very efficient, as it processes data on the server before transferring it to R/Python.

Key takeaway: Learning basic SQL, augmented with PostGIS's spatial functions, unlocks the ability to perform powerful and efficient data selection and manipulation directly within the database.

Same skills, no server — DuckDB from Python DuckDB Python

The identical analysis, running in-process over your local tables or files — no database server required.

import duckdb, geopandas as gpd

con = duckdb.connect("riverbend.duckdb")
con.sql("LOAD spatial;")

# The SAME spatial SQL you'd run in PostGIS
df = con.sql("""
  SELECT d.name, sum(b.pop) AS people
  FROM districts d
  JOIN blocks b ON ST_Contains(d.geom, b.geom)
  GROUP BY d.name ORDER BY people DESC
""").df()

# Read big open data with zero import:
con.sql("SELECT count(*) FROM 'overture_places.parquet'")

Notice: the ST_Contains spatial join is byte-for-byte the same as the PostGIS version. That's the payoff of learning spatial SQL once.

Lab 2 Your first end-to-end query Python SQL
  1. Connect from Python (PostGIS engine or DuckDB connection).
  2. Run the population-per-district spatial join.
  3. Print the busiest district. Bonus: gdf.explore() to see it on a map.
# Works against either engine — just change how you read:
sql = """SELECT d.name, sum(b.pop) AS people
         FROM districts d
         JOIN blocks b ON ST_Contains(d.geom, b.geom)
         GROUP BY d.name ORDER BY people DESC"""

# PostGIS:  df = gpd.read_postgis(sql, engine)
# DuckDB :  df = con.sql(sql).df()
print(df.head(1))     # -> the most-populated district

Checkpoint: you've now connected, run a spatial join, and pulled results into a dataframe — the core loop of every analysis to come. (For our Docker server: host localhost, port 5432, db riverbend, user postgres, password almanac.)

The Modern Geospatial Stack (2026) New

PostGIS remains a cornerstone of spatial data work, but the surrounding ecosystem has evolved quickly. This section is a map of the landscape so you know what tools exist, what they're good at, and how they fit alongside PostGIS in a public-health workflow.

PostGIS is no longer the only place to run spatial SQL

A major shift in recent years is that you can now run powerful spatial SQL without a database server at all. The biggest example is DuckDB with its spatial extension — an in-process analytical engine (think "SQLite for analytics") that speaks much of the same ST_* spatial SQL you'll learn for PostGIS.

DuckDB + spatial — the lightweight analytics powerhouse

  • No server to install or manage — runs inside R, Python, or the command line.
  • Reads and writes GeoParquet, GeoJSON, Shapefiles, and can query files directly on your disk or in cloud storage (S3/GCS/Azure) without loading them first.
  • Extremely fast for analytical scans and joins over tens-to-hundreds of millions of features on a laptop.
  • When to reach for it: ad-hoc analysis, one-off transformations, reading big open-data files, reproducible notebook pipelines.
  • When PostGIS is still better: a shared, central, multi-user database; enforced data integrity; transactional editing; a long-lived authoritative dataset many people query.

You can even use them together: pg_duckdb and DuckDB's `postgres` extension let DuckDB query a live PostGIS database directly.

Cloud-native geospatial formats

The "shapefile era" is fading. Modern formats are designed for the cloud, for big data, and for streaming only the bytes you need.

📦 GeoParquet

Columnar, compressed vector format. Fast, tiny, and the emerging standard for sharing large vector datasets. Read/written by GeoPandas, DuckDB, QGIS, and more. A modern replacement for shapefiles/CSV exports.

🗺️ PMTiles & vector tiles

A single-file format for serving interactive maps cheaply (even from static hosting). PostGIS can generate vector tiles directly with ST_AsMVT.

🛰️ COG & STAC (raster)

Cloud-Optimized GeoTIFFs let you stream just the pixels you need from satellite/environmental imagery; STAC is the catalog standard for finding it — relevant for environmental-exposure work.

🌐 Open base data: Overture Maps

A large, openly-licensed dataset (places, buildings, transportation, administrative areas) distributed as GeoParquet — useful free context for health analyses.

Spatial indexing beyond R-Trees: H3

Discrete global grid systems like H3 (hexagonal cells) have become popular for aggregating and joining point data at scale. The h3-pg extension brings H3 functions into PostGIS, which is handy for privacy-preserving aggregation of patient locations into uniform cells.

Choosing a tool: a quick guide

Your situation Good default tool Why
Shared, authoritative dataset, many analysts, ongoing edits PostGIS Concurrency, integrity, security, central source of truth.
Ad-hoc analysis of large files on your laptop / in a notebook DuckDB + spatial Zero setup, fast, reads GeoParquet & cloud files directly.
Interactive analysis in a familiar dataframe workflow GeoPandas (Py) / sf (R) Ergonomic, great for modeling, plotting, and small-to-medium data.
Sharing a dataset with collaborators or for archiving GeoParquet Compact, typed, portable; far better than shapefiles/CSV.
Publishing an interactive web map PMTiles / ST_AsMVT Cheap, scalable, no heavy map server required.

Key takeaway: Think in terms of a stack, not a single tool. PostGIS is your durable, shared spatial database; DuckDB and GeoParquet make ad-hoc, large-scale, and cloud analysis fast and cheap; GeoPandas/sf remain your interactive analysis home. The spatial SQL you learn for PostGIS transfers across much of this ecosystem.

Break ☕ 5 minutes
When we return: we build the full clinic-accessibility analysis end-to-end on your data (Part VII).

Case Study — Clinic Accessibility Hands-on

One question, carried end-to-end on your own Riverbend database. We'll answer it in five steps, each a short lab. By the end you'll have a reusable accessibility pipeline — every step runnable in PostGIS or DuckDB.

1Inventory & sanity-check the data — counts, SRIDs, validity
2Proximity — population within 5 km of any clinic
3Service areas — buffers and what they cover
4Aggregate by district — access rate per population
5Validate — sanity checks before we trust it

Question: "How many people live within 5 km of a clinic, and which districts are underserved relative to population?"

Step 1 — inventory & sanity-check SQL

Before any analysis: confirm what you have. Counts, coordinate systems, and validity.

-- How much data, and what SRID?
SELECT 'clinics' t, count(*), ST_SRID(geom) FROM clinics GROUP BY 3
UNION ALL
SELECT 'blocks', count(*), ST_SRID(geom) FROM blocks GROUP BY 3
UNION ALL
SELECT 'districts', count(*), ST_SRID(geom) FROM districts GROUP BY 3;

-- Any invalid polygons? (real data often has them)
SELECT district_id
FROM districts
WHERE NOT ST_IsValid(geom);

What we're checking:

  • All three layers share SRID 4326 → they'll align without transforming
  • Row counts match what we loaded (~3 / ~323 / 6)
  • No invalid geometries — if any, fix with ST_MakeValid(geom)
Lab 3.1
Run both queries. Confirm matching SRIDs and zero invalid rows before continuing.

Step 2 — population within 5 km of a clinic SQL

The core accessibility measure: which population points are "served"?

-- Mark each block as served / unserved, using true meters via geography
SELECT
  sum(pop)                                   AS total_pop,
  sum(pop) FILTER (WHERE served)             AS served_pop,
  round(100.0 * sum(pop) FILTER (WHERE served) / sum(pop), 1) AS pct_served
FROM (
  SELECT b.pop,
         EXISTS (
           SELECT 1 FROM clinics c
           WHERE ST_DWithin(b.geom::geography, c.geom::geography, 5000)
         ) AS served
  FROM blocks b
) s;

Why ST_DWithin + ::geography: it uses the spatial index and measures real meters — both traps from Part IV, avoided in one line.

Lab 3.2
Run it. What % of the county is within 5 km of a clinic? Try 3 km and 10 km — how sensitive is access to the threshold?

Step 3 — service areas as buffers SQL

Sometimes you want the zone itself — to map it, or intersect it with other layers.

-- A 5 km service area around each clinic
SELECT clinic_id,
       ST_Buffer(geom::geography, 5000)::geometry AS svc
FROM clinics;

-- Union them into one combined coverage polygon
SELECT ST_Union(
         ST_Buffer(geom::geography, 5000)::geometry
       ) AS coverage
FROM clinics;
  • ST_Buffer on geography grows a true-meter zone
  • ST_Union dissolves overlapping buffers into one coverage shape
  • Export coverage as GeoJSON to drop onto a web map, or intersect it with districts (next step)

Buffers are approximations of circles (segmented polygons) and ignore roads/terrain. For travel-time access you'd use a routing service — note the assumption.

Step 4 — access rate by district SQL

Now the decision-relevant output: a rate per district, so we compare fairly across populations.

SELECT
  d.name,
  d.population,
  sum(b.pop)                                   AS pop_in_blocks,
  sum(b.pop) FILTER (WHERE served)             AS served_pop,
  round(100.0 * sum(b.pop) FILTER (WHERE served)
        / nullif(sum(b.pop),0), 1)             AS pct_served
FROM districts d
JOIN blocks b ON ST_Contains(d.geom, b.geom)        -- points-in-polygons
CROSS JOIN LATERAL (
  SELECT EXISTS (SELECT 1 FROM clinics c
    WHERE ST_DWithin(b.geom::geography, c.geom::geography, 5000)) AS served
) x
GROUP BY d.name, d.population
ORDER BY pct_served ASC;          -- worst-served first
Lab 3.3
Run it. Which two districts are least served? Those are where you'd target a new clinic.

Step 5 — validate before you trust it SQL

A number is not an answer until you've checked it. Build the habit now.

Cross-checks:

  • Totals reconcile — does sum(served) ≤ total_pop? Does district pop roughly match population?
  • Spot a known case — pick one block next to a clinic; is it flagged served?
  • Boundary effects — blocks near the county edge may have clinics just outside your data → undercount
  • Units — distances in meters? geography confirms it
  • Sensitivity — does the ranking hold at 3 km vs 5 km?
-- One block you can reason about by hand
SELECT b.block_id,
  ST_Distance(b.geom::geography,
              c.geom::geography)::int AS m
FROM blocks b
JOIN clinics c ON c.clinic_id = 1
ORDER BY m
LIMIT 3;       -- nearest blocks to clinic 1

Validation is the analyst's job — not the database's, and certainly not the AI's.

Extending the pattern — more public-health questions

The same building blocks answer many questions. Notice the recurring verbs: contain, within-distance, buffer, intersect, aggregate.

🦠 Disease surveillance & clusters

Cases (points) in districts (polygons), normalized by population.

SELECT d.name, count(c.*) AS cases
FROM districts d
JOIN cases c ON ST_Contains(d.geom,c.geom)
GROUP BY d.name;     -- ÷ population → rate

🏭 Environmental exposure

Who lives inside a pollutant buffer?

SELECT sum(b.pop) AS exposed
FROM blocks b, sources s
WHERE ST_DWithin(
  b.geom::geography,
  s.geom::geography, 2000);   -- 2 km

Key takeaway: PostGIS answers sophisticated spatial questions efficiently — and once you know the handful of core predicates, new questions are recombinations of the same moves.

Learning GIS with Generative AI 🤖 New

Generative AI assistants (such as Claude, ChatGPT, Gemini, and coding tools like GitHub Copilot, Cursor, and Claude Code) have become some of the most effective ways to learn spatial concepts and to accelerate real GIS analysis. They are especially valuable for public-health researchers who know their domain deeply but are newer to databases and code. This section shows you how to use them well — and, just as importantly, how to use them safely.

Mindset: Treat the AI as a tireless, fast, occasionally-overconfident teaching assistant and pair-programmer — not as an oracle. It is excellent at drafting, explaining, and translating. It is you who remains the analyst responsible for correctness, ethics, and interpretation.

Four ways AI helps with GIS

Click each card to expand. These map to the whole learning-to-doing journey.

1. Tutor & explainer 📖

Ask it to explain concepts at your level, with public-health examples and analogies. Great for SRIDs, geometry vs. geography, spatial joins, projections, and "why is my distance in degrees?".

Try: "Explain the difference between geometry and geography in PostGIS to someone who knows statistics but not GIS, using a clinic-access example."

2. Code generator & translator 🔁

Turn a plain-English question into spatial SQL, or translate the same analysis between SQL, R (sf), and Python (geopandas). Excellent for getting unstuck on syntax.

Try: "Write a PostGIS query that counts disease cases per district, then show the equivalent in R sf and Python geopandas."

3. Debugger & reviewer 🐞

Paste an error message and your query; ask what's wrong and how to fix it. Ask it to review a query for correctness and performance (e.g., "is this using the spatial index?").

Try: "This query returns 0 rows but I expected matches. Here's the SQL and the SRIDs of each table — what's likely wrong?"

4. Analysis copilot 🚀

Plan a whole analysis with it: what data you need, which functions to use, how to validate results. Coding-agent tools (Claude Code, Cursor) can even run code, inspect data, and iterate with you.

Try: "I want to assess healthcare accessibility for a county. Help me plan the data, the PostGIS steps, and how I'd sanity-check the output."

How to write a good GIS prompt

The quality of the answer depends heavily on the context you give. A reliable recipe:

Copy-ready prompt templates

Fill in the [brackets] and paste into your AI assistant of choice.

Explain a concept
You are a patient GIS tutor. Explain [CONCEPT, e.g. "what an SRID is and why ST_Transform matters"] to a public-health researcher who understands statistics but is new to spatial data. Use a concrete clinic/disease-mapping example, avoid jargon (or define it), and finish with one common mistake to avoid.
Generate spatial SQL
You are a PostGIS expert (PostGIS 3.6 / PostgreSQL 18). I have: - Table "cases": columns id, diagnosis_date, geom (Point, SRID 4326) - Table "districts": columns district_id, name, population, geom (MultiPolygon, SRID 4326) Write one SQL query that returns each district's name, case count, and cases per 1,000 population. Make sure it can use a spatial index, and explain each line in plain English.
Translate between languages
Here is a PostGIS SQL query: [PASTE QUERY]. Show the equivalent analysis (a) in R using the sf package and (b) in Python using geopandas, reading from the same PostGIS database. Note any behavioural differences I should watch for.
Debug an error / empty result
My PostGIS query returns [ERROR MESSAGE / "0 rows but I expected matches"]. Here is the query: [PASTE QUERY]. The SRID of each table is: [LIST SRIDs]. The geometry columns are: [LIST]. What are the most likely causes, ranked, and how do I check and fix each one?
Plan & sanity-check an analysis
I want to [RESEARCH GOAL, e.g. "estimate how many people live within 5 km of a clinic in County X"]. I have [DESCRIBE DATA]. Propose a step-by-step PostGIS workflow, flag the assumptions and CRS/units decisions, and tell me three checks I can run to confirm the results are sensible before I trust them.
Lab 4 Prompt → run → verify Python SQL

Using your own Riverbend schema, ask an AI assistant to write the 5 km access rate by district query (our Step 4). Then verify it against the answer you already have.

Your prompt
You are a PostGIS 3.6 expert helping a public-health researcher. Tables (all SRID 4326): - clinics(clinic_id, name, geom Point) - blocks(block_id, pop, geom Point) - districts(district_id, name, population, geom Polygon) Write ONE query: per district, the % of block population within 5 km of any clinic. Use meters, use the spatial index, and explain each line.

Then check the AI's output against reality:

  1. Does it use ::geography or ST_Transform (meters, not degrees)?
  2. Does it use ST_DWithin (index) — not bare ST_Distance?
  3. Run it. Does the ranking match your Step 4 result?
  4. Ask a follow-up: "add EXPLAIN ANALYZE and confirm the index is used."

If its numbers differ from yours, you find out why. That gap-hunting is the skill.

A safe AI-assisted workflow

Use AI as a loop, not a one-shot. A dependable pattern:

1
Frame the question yourself. Know what answer would make sense (rough magnitude, units, expected shape of output). This is your guardrail against confident-but-wrong output.
2
Give rich context. Tool/version, table and column names, geometry types, and SRIDs (see the templates above).
3
Ask for an explanation, not just code. "Explain each step" turns the answer into a learning opportunity and surfaces flawed assumptions.
4
Run it on a small sample first. Use LIMIT, a single district, or a known case where you can predict the answer.
5
Verify against reality. Check counts, totals, and a couple of records by hand or with an independent method. Cross-check function names in the official docs.
6
Iterate. Feed back what you saw ("the totals are 10× too high — could it be a unit issue?") and refine.

Trust, but verify: common AI pitfalls in GIS

⚠️ AI can be confidently wrong. Watch for these GIS-specific failure modes:

  • Invented functions/arguments: AI sometimes "hallucinates" a plausible-sounding ST_* function or option that doesn't exist. Confirm names in the PostGIS reference.
  • SRID / units mistakes: measuring distance on SRID 4326 gives degrees, not meters. AI may forget to ST_Transform to a projected CRS or to use the geography type. Always check the units of your answer.
  • Invalid geometries: real-world data has self-intersections and nulls. Suggested queries may fail or mislead until you run ST_IsValid / ST_MakeValid.
  • Performance blind spots: a "correct" query can be unusably slow without a spatial index or with the wrong join. Ask it to EXPLAIN ANALYZE and to confirm index use.
  • Outdated patterns: training data may reference old versions (e.g., psycopg2 instead of psycopg v3). State your versions explicitly.
  • Statistical / epidemiological subtlety: spatial clustering, the modifiable areal unit problem (MAUP), edge effects, and ecological fallacy are easy for AI to gloss over. Apply your domain expertise.

🔒 Data privacy & ethics (read this twice)

Never paste protected health information (PHI/PII) or precise patient locations into a consumer AI chat. Patient addresses and coordinates are among the most sensitive and re-identifiable data that exist. Pasting them into a public tool may violate HIPAA, GDPR, IRB protocols, and your institution's data-use agreements.

  • Share schema, not data. Give the AI your table and column structure, types, and SRIDs — never the actual rows.
  • Use synthetic or fake examples when you need sample values ("a clinic at lng/lat 0,0").
  • De-identify and aggregate before any sharing — e.g., aggregate points to census tracts or H3 cells; apply your jurisdiction's small-cell suppression rules.
  • Prefer approved/enterprise AI tools with a zero-data-retention or no-training agreement, or self-hosted/local models, when working near sensitive data — and follow your institution's policy first.
  • Keep a human in the loop for decisions. AI output informs analysis; it does not make public-health decisions.

Good practice: describe your problem in terms of structure and goals. For example — "I have a points table of de-identified case locations (SRID 4326) and a polygons table of census tracts; help me aggregate cases per tract" — gives the AI everything it needs without exposing a single real record.

The AI tooling landscape for GIS

💬 Chat assistants

Claude, ChatGPT, Gemini. Best for explaining concepts, drafting queries, translating between languages, and debugging from pasted errors.

⌨️ In-editor assistants

GitHub Copilot, Cursor, and notebook AI features autocomplete and refactor code where you work — handy in RStudio, VS Code, and Jupyter.

🛠️ Coding agents

Tools like Claude Code can read your files, run queries, inspect results, and iterate on a whole analysis with you — closer to a true copilot than a chatbot.

🔌 MCP & text-to-SQL

The Model Context Protocol (MCP) lets AI assistants connect directly to a (read-only!) PostGIS database to inspect the schema and answer questions in natural language. Powerful — but scope permissions tightly, especially with health data.

Emerging generative-AI applications in the field New

Beyond chat assistants, generative AI is showing up inside the geospatial workflow. These are real, active areas in 2026 — useful to know about even if you don't use all of them yet. Click a card to expand.

🛰️ Geospatial foundation models for imagery

Models like Prithvi (NASA–IBM, trained on Landsat/Sentinel-2 and now even running in orbit) and Clay turn satellite/aerial imagery into ready-to-use features for flood mapping, burn-scar detection, land-use and cloud-gap reconstruction — with little to no manual feature engineering.

Public-health relevance: environmental-exposure assessment, disaster response, monitoring flooding/heat/air-quality drivers of disease.

✂️ Promptable feature extraction (SAM-for-geo)

Segment-Anything-style models (e.g., segment-geospatial/samgeo) let you extract buildings, water bodies, fields, or roads from imagery with a click or a text prompt — no hand-digitizing.

Public-health relevance: rapidly mapping settlements/buildings for population denominators, sanitation, or access studies in data-poor regions.

💬 "Chat with your map / database"

QGIS LLM plugins (GeoAgent, Promptly, IntelliGeo, Whitebox AI Agent) and MCP/text-to-SQL tools let you describe an operation in plain language; the AI inspects your layers/CRS/schema and generates the geoprocessing or SQL to run it.

Public-health relevance: lowers the barrier for epidemiologists to run spatial analyses without deep GIS/SQL skills (still verify the output!).

🧪 Synthetic & privacy-preserving data

Generative models can produce synthetic populations or case locations that preserve realistic spatial distributions without exposing real individuals — for sharing, teaching, or method development.

Public-health relevance: collaborate and publish reproducible spatial methods while protecting PHI; build demos without real patient data.

📍 Geocoding, parsing & record linkage

LLMs are strong at cleaning messy addresses, standardizing place names, parsing free-text locations from clinical notes, and matching records across datasets that lack a shared key.

Public-health relevance: turning inconsistent surveillance/EHR address fields into mappable, linkable data (do this on de-identified data in approved environments).

🔎 Semantic + spatial search (embeddings)

With pgvector inside PostgreSQL/PostGIS you can combine vector embeddings (semantic similarity) with spatial filters — e.g., "facilities similar to X and within 10 km" — in a single database.

Public-health relevance: retrieval-augmented analysis over reports/facilities, smarter matching of services to needs.

🤝 Agentic analysis copilots

Coding agents (e.g., Claude Code, Cursor) now read your files, write and run spatial code, inspect results, and iterate end-to-end — turning a research question into a reproducible pipeline with you supervising.

Public-health relevance: faster, more reproducible analyses; great for prototyping a study workflow before scaling it.

🗺️ AI cartography & narrative reporting

Generative tools assist with map styling, choosing classifications/color schemes, and drafting plain-language summaries of spatial results for briefs and dashboards.

Public-health relevance: communicating findings to decision-makers and the public quickly — with a human checking accuracy and avoiding misleading maps.

⚠️ Same rules apply: these applications are powerful but not infallible. Foundation-model outputs need ground-truthing, LLM-generated code and SQL need verification, and none of them excuse putting real PHI into unapproved tools. Treat AI output as a draft to validate, not a result to publish.

Key takeaway: Generative AI dramatically lowers the barrier to learning and doing GIS — as a tutor, code translator, debugger, and analysis copilot. Get the most from it by giving rich context (versions, schema, SRIDs), asking for explanations, and verifying every result. And never trade your participants' privacy for convenience: share structure, not sensitive data.

Pop Quiz: Test Your Knowledge!

Let's see what you've learned. For each question, consider the options and then click "Show Answer" to see the correct choice and explanation.

Question 1: What is a key advantage of using a spatial database like PostGIS over traditional shapefiles for collaborative public health research?

  • A) Shapefiles support more advanced SQL queries.
  • B) PostGIS allows for robust concurrent access and data integrity.
  • C) Shapefiles are better for handling very large datasets.
  • D) PostGIS does not require understanding coordinate systems.

Question 2: Which PostGIS geometry type would be most suitable for representing the boundaries of administrative districts (e.g., counties)?

  • A) Point
  • B) LineString
  • C) Polygon
  • D) MultiPoint

Question 3: What is the primary purpose of an SRID (Spatial Reference Identifier) in PostGIS?

  • A) To uniquely identify each record in a table.
  • B) To define the visual style of map features.
  • C) To define the coordinate system and give real-world meaning to geometric data.
  • D) To speed up non-spatial attribute queries.

Question 4: When reading spatial data from PostGIS into R using `sf::st_read()` or Python using `geopandas.read_postgis()`, what is a benefit of using a SQL query within these functions?

  • A) It's the only way to load any data.
  • B) It allows for server-side filtering and processing, which can be more efficient for large datasets.
  • C) It bypasses the need for database connection parameters.
  • D) It automatically converts data to a local file format like a shapefile.

Question 5: To find all health clinics within a 2km radius of a patient's address in PostGIS, which spatial function is most directly applicable and efficient for this proximity query?

  • A) `ST_Area()`
  • B) `ST_DWithin()`
  • C) `ST_Union()`
  • D) `ST_Centroid()`

Question 6: You need to do fast, one-off analysis of a 50-million-row GeoParquet file on your laptop, with no database server to set up. Which modern tool is the best fit?

  • A) A shapefile opened in a spreadsheet
  • B) DuckDB with the spatial extension
  • C) Emailing the file to a colleague
  • D) Converting everything to CSV first

Question 7: You're using a generative AI assistant to help with a spatial analysis of de-identified patient case locations. Which practice is most appropriate?

  • A) Paste the full table of patient coordinates so the AI has complete context.
  • B) Describe the schema (table/column names, geometry types, SRIDs) and use synthetic example values — never real PHI.
  • C) Trust the generated query without checking, since AI rarely makes mistakes.
  • D) Skip stating your PostGIS version because it never matters.

Summary & Resources

This final section summarizes the key benefits of PostGIS for public health research and provides a list of resources for continued learning.

Quick Summary & Benefits Recap

Q&A and Further Resources

To continue your learning journey:

Official Websites:

Key R Packages Documentation:

Key Python Packages Documentation:

Modern Geospatial Stack: New

Learning & Doing GIS with AI: New

Emerging GenAI-in-geospatial to explore:

General Tutorial Search Terms:

Hands-on practice with the provided sample database details and code snippets is highly encouraged — and try working through one of the use cases with an AI assistant as your tutor!