Spatial Databases (PostGIS) & AI-Assisted GIS for Public Health Research
Welcome! This interactive application explores the concepts and benefits of using PostGIS spatial databases in public health research. It's designed for researchers who may have limited prior experience with databases or programming but are familiar with public health data.
Navigate through the sections using the sidebar to learn about why spatial databases are crucial, understand PostGIS basics, see how to connect using R and Python, get oriented to the modern (2026) geospatial stack, explore real-world use cases, learn how to use generative AI to study GIS concepts and run spatial analyses safely, test your knowledge with a pop quiz, and find resources for further learning.
Overall Goal: To introduce the concept and benefits of using PostGIS spatial databases in public health research, demonstrate basic data access using R and Python, and equip you to learn and perform GIS work with the help of generative AI — responsibly.
📅 2026 revamp: This edition updates the material for the current geospatial ecosystem (PostGIS 3.6, PostgreSQL 18, GeoParquet, DuckDB, psycopg v3, GeoPandas 1.x) and adds a hands-on guide to learning and doing GIS with generative AI. New and updated content is marked with a New badge.
How to follow along
This is a hands-on lecture: we alternate concept → demo → you try it. When you see a Lab box, pause and run it yourself — solutions follow. Code is labelled by track so you can pick the one you know:
SQL Python R DuckDB Docker
Two ways to practice — set up either (or both) in Part III
A zero-setup DuckDB track that runs on any laptop, and a full Docker PostGIS track for the real multi-user server experience. The spatial SQL is nearly identical across both.
Learning objectives
By the end of today, you should be able to:
- Explain why spatial databases beat CSVs and shapefiles for research
- Stand up a working PostGIS (Docker) and DuckDB spatial environment
- Describe geometry types, SRIDs/CRS, geometry vs. geography
- Create and use a spatial index — and read
EXPLAIN ANALYZE
- Write spatial SQL:
ST_DWithin,ST_Intersects, spatial joins, buffers - Connect from Python (
geopandas) and R (sf) - Carry one public-health analysis end-to-end (clinic accessibility)
- Use generative AI to draft, translate, and debug GIS work — and verify it
The throughline: the spatial SQL you learn today transfers across PostGIS, DuckDB, and AI-assisted workflows alike. Learn it once; use it everywhere.
Today's agenda & timing
Before we go further: if you can, start the Docker download now (Part III, the docker run command) — it pulls in the background while we cover concepts.
Why Spatial Databases?
This section delves into the importance of spatial data in public health and the limitations of traditional data storage methods like CSVs and Shapefiles, highlighting the need for robust database solutions like PostGIS.
What is Spatial Data in Public Health?
Spatial data links information to a specific location on Earth. In public health, understanding "where" health events occur, vulnerable populations reside, and services are located is fundamental.
Examples include:
- Mapping disease outbreaks (e.g., influenza spread, cancer mortality rates by state).
- Locations of hospitals/clinics (assessing geographic accessibility).
- Patient addresses (anonymized/aggregated, for understanding distribution).
- Environmental exposure sources (e.g., pollution sites near communities).
- Demographic data aggregated to geographic areas (e.g., census tracts, zip codes).
Visualizing such data on maps can reveal important characteristics or trends not apparent from tabular data alone. This enables targeted, geographically-focused interventions.
Key takeaway: Spatial context is crucial in understanding and addressing many public health issues.
Limitations of Traditional Data Storage & The Need for Databases
While CSVs and Shapefiles are common, they have significant limitations, especially for large, complex datasets or collaborative work.
Challenges with CSV Files
- No inherent spatial understanding: Coordinates are plain numbers; system must be explicitly defined.
- Data type ambiguity: No enforced data types, can lead to precision loss (e.g., ZIP codes losing leading zeros).
- Inconsistent formatting: Issues with parsing, delimiters, special characters.
- Limited querying: No built-in spatial query capabilities.
- Scalability issues: Poor performance with very large datasets.
- Data integrity/security: Lacks encryption, password protection, validation.
Challenges with Shapefiles
- File proliferation: A single "shapefile" is multiple files (.shp, .shx, .dbf, .prj).
- Data duplication/inconsistency: Easy to create multiple versions, leading to redundancy.
- Attribute limitations: Field names limited to 10 chars; no null values (often uses arbitrary numbers); dates without time; 2GB file size limit.
- Limited querying/analysis: Complex spatial queries require external GIS and convoluted workflows.
- No concurrent access: Not designed for simultaneous multi-user access, leading to versioning issues.
These limitations can undermine research quality, reproducibility, and collaboration. Relational databases (like PostgreSQL) and specialized spatial databases (like PostGIS, an extension for PostgreSQL) address these challenges.
Summary of Challenges: Traditional vs. Database Approach
| Issue | Example with CSVs | Example with Shapefiles | Impact on Public Health Research |
|---|---|---|---|
| Data Integrity | Latitude/Longitude stored as text, potential precision loss. | Field name "ENVIRONMENTAL_EXPOSURE_LEVEL" truncated. Nulls replaced with 0. | Inaccurate analysis, misinterpretation, inability to distinguish missing from true zero. |
| Scalability | Slow processing of 1 million patient address records. | 2GB file size limit reached with detailed national boundary data. | Inability to analyze large-scale or high-resolution datasets efficiently. |
| Spatial Querying | Cannot directly query for clinics within 1km of a patient address. | Requires external GIS and complex workflows for proximity analysis. | Slow, inefficient, and error-prone execution of complex spatial questions. |
| Collaboration | High risk of overwriting data if multiple users edit a shared CSV. | Difficult to merge changes from different team members. | Hinders teamwork, potential data loss, outdated analyses, reproducibility issues. |
| Coordinate System Info | No embedded coordinate system information. | `.prj` file stores system, but can be missing/incorrect. | Risk of incorrect spatial analysis if data misaligned. |
Key takeaway: Spatial databases offer a more robust, efficient, and powerful way to manage and analyze geographic health data, overcoming critical limitations of traditional file-based approaches.
The question we'll answer all day
To make this concrete, we'll carry one realistic analysis through the whole lecture.
Running case study
Clinic accessibility in Riverbend County
"How many people live within 5 km of a primary-care clinic, and which districts are underserved relative to their population?"
Our (synthetic) data — three tables we'll build in Part III:
| Table | Geometry | Key columns |
|---|---|---|
clinics | Point (4326) | clinic_id, name |
districts | MultiPolygon (4326) | district_id, name, population |
blocks | Point (4326) | block_id, pop |
Setup — Your Lab Environment Hands-on
You'll get the most out of today if you can run queries yourself. Pick a track — or do both. The two engines share nearly identical spatial SQL, so the skill transfers either way.
Two tracks, one set of skills
DuckDB Zero-setup, local
- One binary or
pip install— no server - Runs spatial SQL inside Python / R / CLI
- Reads GeoParquet & files directly
- Perfect for labs on any laptop
Best if you want to start immediately.
Docker Real PostGIS server
- A genuine multi-user PostgreSQL + PostGIS
- Concurrency, roles, indexes,
EXPLAIN - Mirrors a production research database
- Needs Docker Desktop installed
Best if you want the authentic workflow.
The SQL is nearly identical. DuckDB's spatial extension implements the same ST_* functions. Learn the query once; run it in either engine.
Track A — DuckDB in 60 seconds DuckDB
The fastest path to running spatial SQL. No server, no accounts.
Option 1 — Python (what we'll use in labs)
pip install "duckdb>=1.1" "geopandas>=1.0"
# then, in Python:
import duckdb
con = duckdb.connect("riverbend.duckdb")
con.sql("INSTALL spatial; LOAD spatial;")
con.sql("SELECT ST_Point(0, 0) AS p;").show()
Option 2 — CLI (great for quick checks)
# macOS / Linux
brew install duckdb # or download the binary
duckdb riverbend.duckdb
-- inside the DuckDB shell:
INSTALL spatial; LOAD spatial;
SELECT ST_AsText(ST_Point(-117.4, 33.9));
-- POINT(-117.4 33.9)
POINT printed, your spatial engine works. ✅
Track B — Docker PostGIS in one command Docker
The official image bundles PostgreSQL 18 + PostGIS 3.6. One command, a real server.
# Pull & run a PostGIS server (downloads ~600 MB the first time)
docker run -d --name almanac-pg \
-e POSTGRES_PASSWORD=almanac \
-e POSTGRES_DB=riverbend \
-p 5432:5432 \
postgis/postgis:18-3.6
Connect with psql (inside the container)
docker exec -it almanac-pg \
psql -U postgres -d riverbend
-- Turn on PostGIS in this database (once)
CREATE EXTENSION IF NOT EXISTS postgis;
SELECT postgis_full_version();
Tip: -d runs it in the background. Stop/restart with docker stop almanac-pg / docker start almanac-pg. Your data persists in the container until you docker rm it.
Teaching only: a hard-coded password and open port are fine on your laptop — never on a shared or internet-facing host.
psql (Docker) or the DuckDB shell. Both should return a version string and a valid point.
Docker / PostGIS
SELECT postgis_version();
-- 3.6 USE_GEOS=1 USE_PROJ=1 ...
SELECT ST_AsText(
ST_SetSRID(ST_MakePoint(-117.4, 33.9), 4326)
);
-- POINT(-117.4 33.9)
DuckDB
LOAD spatial;
SELECT version(); -- v1.1+
SELECT ST_AsText(
ST_Point(-117.4, 33.9)
);
-- POINT (-117.4 33.9)
Loading the sample data SQL
We'll generate the synthetic Riverbend tables directly in SQL, so everyone has identical data.
Clinics — a handful of points
CREATE TABLE clinics (
clinic_id int PRIMARY KEY,
name text,
geom geometry(Point, 4326)
);
INSERT INTO clinics VALUES
(1,'Riverbend Family Health',
ST_SetSRID(ST_MakePoint(-117.40, 33.95),4326)),
(2,'Eastside Community Clinic',
ST_SetSRID(ST_MakePoint(-117.31, 33.92),4326)),
(3,'North Valley Health Center',
ST_SetSRID(ST_MakePoint(-117.36, 34.02),4326));
Blocks — population points (a grid)
CREATE TABLE blocks AS
SELECT
row_number() OVER () AS block_id,
(50 + (random()*450))::int AS pop,
ST_SetSRID(ST_MakePoint(
-117.45 + gx*0.01,
33.88 + gy*0.01), 4326) AS geom
FROM generate_series(0,18) gx,
generate_series(0,16) gy;
~300 population points across the county. (DuckDB: use range() instead of generate_series in a cross join — the AI can translate it; see Part VIII.)
Districts — simple square cells acting as administrative areas
CREATE TABLE districts AS
SELECT
d AS district_id,
'District ' || d AS name,
(8000 + (random()*40000))::int AS population,
ST_SetSRID(
ST_MakeEnvelope( -- xmin, ymin, xmax, ymax
-117.45 + (d%3)*0.10, 33.88 + (d/3)*0.08,
-117.35 + (d%3)*0.10, 33.96 + (d/3)*0.08
), 4326) AS geom
FROM generate_series(0,5) d;
You now have three layers: clinics (points), blocks (population points), and districts (polygons) — all in SRID 4326. This is the data for our case study in Part VII.
Prefer real files? The same tables ship as GeoParquet in the course pack. DuckDB can read them straight from a URL or disk — SELECT * FROM 'clinics.parquet'; — no import step needed.
SELECT count(*) FROM blocks; — expect ~323 rows.
PostGIS Fundamentals
This section introduces PostgreSQL, the PostGIS extension, and core spatial concepts essential for using PostGIS effectively in public health research.
What is PostgreSQL and PostGIS?
PostgreSQL: A powerful, open-source object-relational database system (ORDBMS). Known for reliability, data integrity, scalability, and extensibility. It's free and community-supported.
PostGIS: An extension to PostgreSQL that "teaches" it to understand, store, manage, and analyze geographic objects. It adds:
- Spatial Data Types (e.g., `geometry`, `geography`).
- Spatial Indexes (e.g., R-Tree based on GiST for fast spatial searches).
- Spatial Functions (hundreds for analysis, e.g., distance, intersection, buffers).
Analogy: PostgreSQL is the filing cabinet; PostGIS provides special folders and tools for maps within it.
Current versions (mid-2026): PostgreSQL 18 and PostGIS 3.6.x are the current stable releases. Recent PostGIS versions keep expanding raster, vector-tile (`ST_AsMVT`), and 3D support, and modern PostgreSQL releases bring faster parallel queries and better JSON handling — all useful for large public-health datasets. You don't need the newest version to learn, but knowing what's current helps when reading docs and asking AI assistants for help.
Key takeaway: PostGIS turns a regular (but very powerful) database (PostgreSQL) into a spatial powerhouse, specifically designed to handle and analyze geographic information.
Core Spatial Concepts in PostGIS
1. Geometry Types
PostGIS stores geographic features as specific geometry types. Click each type to see examples:
Point 📍
Represents a single location (e.g., X,Y coordinates).
Public Health Examples: Patient residence, clinic location, disease case site, water sample point.
LineString 〰️
Represents linear features (sequence of connected points).
Public Health Examples: Roads, rivers, public transport routes, disease vector paths.
Polygon 🏞️
Represents an area (closed sequence of points).
Public Health Examples: Administrative boundaries (counties, states), service areas, exposure zones, park boundaries.
PostGIS also supports Multi-geometries (MultiPoint, MultiLineString, MultiPolygon) which are collections of the same geometry type, plus GeometryCollections.
How geometries are written — WKT & WKB
Every geometry has a human-readable and a binary form. You'll see both constantly.
WKT — Well-Known Text (readable)
POINT(-117.4 33.9)
LINESTRING(0 0, 1 1, 2 1)
POLYGON((0 0, 4 0, 4 4, 0 4, 0 0))
-- Constructing & inspecting
SELECT ST_AsText( -- geometry → WKT
ST_GeomFromText('POINT(-117.4 33.9)', 4326)
);
SELECT ST_AsGeoJSON(geom) -- → GeoJSON for web maps
FROM clinics LIMIT 1;
WKB (Well-Known Binary) is the compact form actually stored on disk. You rarely write it by hand — functions convert for you.
Constructor cheat-sheet:
ST_MakePoint(x, y) — fast point
ST_SetSRID(geom, 4326) — stamp a CRS
ST_GeomFromText(wkt, srid) — from WKT
ST_AsText / ST_AsGeoJSON — read it back
Order matters: WKT and ST_MakePoint are (longitude, latitude) — X then Y. Reversing them silently puts your data in the wrong hemisphere.
2. Spatial Reference Systems (SRS) / SRID
An SRS (or CRS) defines how coordinate values relate to actual locations on Earth. It's the "language" or "map projection" your data speaks.
- Crucial for: Aligning different data layers correctly and ensuring meaningful measurements (distance, area).
- SRID (Spatial Reference Identifier): A unique integer ID for each SRS in PostGIS (e.g., stored in `spatial_ref_sys` table).
- Common Examples:
- WGS84 (EPSG:4326): Geographic (lat/lon in degrees), used by GPS.
- Projected Systems (e.g., UTM, State Plane): Transform Earth's surface to a flat plane (units in meters/feet), better for accurate local distance/area calculations.
- Rule: All spatial data in PostGIS MUST have a defined SRID. For operations between layers, SRIDs should match or be transformable (using `ST_Transform()`).
Conceptual Visualization: SRID Alignment
Layer A (SRID: X)
Layer B (SRID: Y)
Misaligned! 🚫
Layer A (Transformed to SRID: Z)
Layer B (Transformed to SRID: Z)
Correctly Aligned! ✅
Illustrates that layers with different SRIDs won't overlay correctly unless transformed to a common system.
The classic trap — degrees are not meters
Distance on lat/lon (SRID 4326) is measured in degrees, which is almost never what you want.
-- WRONG: "0.045" — that's degrees, meaningless
SELECT ST_Distance(
ST_SetSRID(ST_MakePoint(-117.40,33.95),4326),
ST_SetSRID(ST_MakePoint(-117.36,34.02),4326)
);
-- RIGHT (option 1): transform to a metric CRS
SELECT ST_Distance(
ST_Transform(a.geom, 32611), -- UTM 11N, meters
ST_Transform(b.geom, 32611)
) FROM clinics a, clinics b
WHERE a.clinic_id=1 AND b.clinic_id=3;
-- ≈ 8100 (meters)
-- RIGHT (option 2): cast to geography
-- (meters on a sphere)
SELECT ST_Distance(
a.geom::geography,
b.geom::geography
) FROM clinics a, clinics b
WHERE a.clinic_id=1 AND b.clinic_id=3;
-- ≈ 8120 (meters)
If a distance looks like 0.0x, you're in degrees. Transform to a projected CRS (e.g. the right UTM zone) or cast to geography.
geometry vs. geography — which to use
PostGIS offers two spatial types. Choosing well saves you from the degrees-vs-meters trap.
geometry — Cartesian (flat)
- Math on a flat plane — fast
- Units are the CRS's units (degrees for 4326, meters for UTM)
- Hundreds of functions; the default for analysis
- Best when your data is in a projected CRS, or covers a small area
geography — geodetic (round Earth)
- Math on the spheroid — distances/areas in meters, always
- Correct over long distances & across UTM zones
- Fewer functions, somewhat slower
- Best when data spans large areas in lat/lon and you want true meters
Rule of thumb: national or continental lat/lon data and you want meters → geography. Local analysis or you've projected to UTM/State Plane → geometry. You can cast between them: geom::geography.
3. Spatial indexes
Like a book's index — but for locations. They dramatically speed up spatial queries ("what's near X?", "what's inside Y?").
- Analogy: a GPS that lets the database rapidly find spatial data.
- How (simplified): PostGIS builds a GiST index over each geometry's bounding box (an R-Tree), so the planner skips most rows.
- Two-phase search: a fast index filter on bounding boxes, then an exact check on the survivors.
-- Build a spatial index (do this on every geometry column you query)
CREATE INDEX blocks_geom_idx ON blocks USING GIST (geom);
ANALYZE blocks; -- refresh planner statistics
Functions that use the index automatically: ST_DWithin, ST_Intersects, ST_Contains, && — because they begin with a bounding-box test. ST_Distance alone does not; prefer ST_DWithin for proximity.
Proving the index works — EXPLAIN ANALYZE
The planner tells you whether it used your index. Reading this is a core skill.
EXPLAIN ANALYZE
SELECT b.block_id
FROM blocks b, clinics c
WHERE c.clinic_id = 1
AND ST_DWithin(
b.geom::geography,
c.geom::geography, 5000);
What to look for:
- ✅
Index Scan using blocks_geom_idx— good, the index is used - ⚠️
Seq Scanon a big table — the index was skipped (missing index, wrong type, or a function wrapping the column) actual time=— real milliseconds; compare before/after adding the indexrows=— estimate vs. actual; large gaps mean stale stats → runANALYZE
blocks_geom_idx, run the EXPLAIN ANALYZE above, and note the time. Drop the index, re-run, and compare. (DuckDB users: skip — it auto-indexes; focus on the query.)
Spatial relationships — the vocabulary
Most spatial questions reduce to a handful of relationship predicates. They return true/false.
| Function | Asks… |
|---|---|
ST_Intersects(a,b) | do they touch/overlap at all? |
ST_Contains(a,b) | is b fully inside a? |
ST_Within(a,b) | is a fully inside b? |
ST_DWithin(a,b,d) | are they within distance d? |
ST_Touches(a,b) | do they share only a boundary? |
ST_Crosses(a,b) | does a line cross a polygon? |
Under the hood these implement the DE-9IM model — a 3×3 matrix describing how two geometries' interiors, boundaries, and exteriors relate. You rarely write DE-9IM directly, but it's why the named predicates are precise and composable.
Measurement companions: ST_Distance, ST_Length, ST_Area, ST_Buffer, ST_Intersection, ST_Union, ST_Centroid.
The workhorse — spatial joins
A spatial join connects two tables by a spatial relationship instead of a shared key. This is where PostGIS earns its keep.
-- Count population points inside each district
SELECT d.name,
count(b.*) AS n_blocks,
sum(b.pop) AS people
FROM districts d
JOIN blocks b
ON ST_Contains(d.geom, b.geom) -- the join!
GROUP BY d.name
ORDER BY people DESC;
How it works:
- The
ONclause uses a spatial predicate, nota.id = b.id - The planner uses the GiST index to avoid comparing every pair
JOINkeeps matches;LEFT JOINkeeps districts with zero points too (watch for those!)
Pattern to memorize: points-in-polygons via JOIN … ON ST_Contains(poly, pt) + GROUP BY. It answers a huge share of public-health questions.
Part IV key takeaway: PostGIS stores shapes (geometries) with a real-world coordinate system (SRID), lets you choose geometry (flat, fast) or geography (true meters), uses GiST spatial indexes to stay fast, and answers questions through relationship predicates and spatial joins. Watch the two classic traps: degrees vs. meters, and queries that skip the index (EXPLAIN ANALYZE is your friend).
Connecting to & Querying PostGIS
This section covers general principles for database connections and demonstrates how to connect to PostGIS and retrieve data using R and Python, along with a brief note on SQL.
General Connection Principles
To access any database, you need standard parameters:
- Hostname (Server Address): Network address of the database server (e.g., `localhost`, IP address, domain name).
- Port: Communication endpoint (default for PostgreSQL is `5432`).
- Database Name (dbname): The specific database you want to connect to.
- Username: Your database username.
- Password: Password for the username.
For demos, a read-only account is used to prevent accidental data modification.
Key takeaway: These five pieces of information are the standard keys needed to unlock access to almost any database.
Using R with PostGIS
Key R Packages: `sf` (Simple Features for spatial data), `DBI` (DataBase Interface), `RPostgres` (PostgreSQL driver).
The following snippet shows how to connect, read data, and disconnect. Replace placeholder details with your actual database credentials.
# 1. Load necessary libraries
# install.packages(c("sf", "DBI", "RPostgres")) # Run once if not installed
library(sf)
library(DBI)
library(RPostgres)
# 2. Database connection parameters (REPLACE with actual details)
db_host <- "your_host_address"
db_port <- 5432
db_name <- "your_sample_db_name"
db_user <- "readonly_user"
db_password <- "readonly_password"
# 3. Establish connection
con <- dbConnect(RPostgres::Postgres(),
dbname = db_name,
host = db_host,
port = db_port,
user = db_user,
password = db_password)
# 4. Read a spatial table into an sf object
# Example: Read 5 features from 'health_clinics' table
sql_query_clinics <- "SELECT objectid, clinic_name, address, geom FROM health_clinics LIMIT 5;"
health_data_sf <- st_read(con, query = sql_query_clinics)
# 5. View the loaded data
print(health_data_sf)
# plot(st_geometry(health_data_sf)) # Simple plot
# 6. Close the connection
dbDisconnect(con)
Key takeaway: R, with packages like `sf` and `DBI`, can easily connect to PostGIS and bring spatial data directly into a familiar analysis environment.
Brief Note on SQL
The `query` parameter in `st_read()` (R) and `read_postgis()` (Python) uses SQL (Structured Query Language). PostGIS extends SQL with many spatial functions (e.g., `ST_Distance`, `ST_Intersects`, `ST_Contains`, `ST_Buffer`, `ST_DWithin`).
Performing spatial operations directly in the database using these SQL functions is often very efficient, as it processes data on the server before transferring it to R/Python.
Key takeaway: Learning basic SQL, augmented with PostGIS's spatial functions, unlocks the ability to perform powerful and efficient data selection and manipulation directly within the database.
Same skills, no server — DuckDB from Python DuckDB Python
The identical analysis, running in-process over your local tables or files — no database server required.
import duckdb, geopandas as gpd
con = duckdb.connect("riverbend.duckdb")
con.sql("LOAD spatial;")
# The SAME spatial SQL you'd run in PostGIS
df = con.sql("""
SELECT d.name, sum(b.pop) AS people
FROM districts d
JOIN blocks b ON ST_Contains(d.geom, b.geom)
GROUP BY d.name ORDER BY people DESC
""").df()
# Read big open data with zero import:
con.sql("SELECT count(*) FROM 'overture_places.parquet'")
Notice: the ST_Contains spatial join is byte-for-byte the same as the PostGIS version. That's the payoff of learning spatial SQL once.
- Connect from Python (PostGIS engine or DuckDB connection).
- Run the population-per-district spatial join.
- Print the busiest district. Bonus:
gdf.explore()to see it on a map.
# Works against either engine — just change how you read:
sql = """SELECT d.name, sum(b.pop) AS people
FROM districts d
JOIN blocks b ON ST_Contains(d.geom, b.geom)
GROUP BY d.name ORDER BY people DESC"""
# PostGIS: df = gpd.read_postgis(sql, engine)
# DuckDB : df = con.sql(sql).df()
print(df.head(1)) # -> the most-populated district
Checkpoint: you've now connected, run a spatial join, and pulled results into a dataframe — the core loop of every analysis to come. (For our Docker server: host localhost, port 5432, db riverbend, user postgres, password almanac.)
The Modern Geospatial Stack (2026) New
PostGIS remains a cornerstone of spatial data work, but the surrounding ecosystem has evolved quickly. This section is a map of the landscape so you know what tools exist, what they're good at, and how they fit alongside PostGIS in a public-health workflow.
PostGIS is no longer the only place to run spatial SQL
A major shift in recent years is that you can now run powerful spatial SQL without a database server at all. The biggest example is DuckDB with its spatial extension — an in-process analytical engine (think "SQLite for analytics") that speaks much of the same ST_* spatial SQL you'll learn for PostGIS.
DuckDB + spatial — the lightweight analytics powerhouse
- No server to install or manage — runs inside R, Python, or the command line.
- Reads and writes GeoParquet, GeoJSON, Shapefiles, and can query files directly on your disk or in cloud storage (S3/GCS/Azure) without loading them first.
- Extremely fast for analytical scans and joins over tens-to-hundreds of millions of features on a laptop.
- When to reach for it: ad-hoc analysis, one-off transformations, reading big open-data files, reproducible notebook pipelines.
- When PostGIS is still better: a shared, central, multi-user database; enforced data integrity; transactional editing; a long-lived authoritative dataset many people query.
You can even use them together: pg_duckdb and DuckDB's `postgres` extension let DuckDB query a live PostGIS database directly.
Cloud-native geospatial formats
The "shapefile era" is fading. Modern formats are designed for the cloud, for big data, and for streaming only the bytes you need.
📦 GeoParquet
Columnar, compressed vector format. Fast, tiny, and the emerging standard for sharing large vector datasets. Read/written by GeoPandas, DuckDB, QGIS, and more. A modern replacement for shapefiles/CSV exports.
🗺️ PMTiles & vector tiles
A single-file format for serving interactive maps cheaply (even from static hosting). PostGIS can generate vector tiles directly with ST_AsMVT.
🛰️ COG & STAC (raster)
Cloud-Optimized GeoTIFFs let you stream just the pixels you need from satellite/environmental imagery; STAC is the catalog standard for finding it — relevant for environmental-exposure work.
🌐 Open base data: Overture Maps
A large, openly-licensed dataset (places, buildings, transportation, administrative areas) distributed as GeoParquet — useful free context for health analyses.
Spatial indexing beyond R-Trees: H3
Discrete global grid systems like H3 (hexagonal cells) have become popular for aggregating and joining point data at scale. The h3-pg extension brings H3 functions into PostGIS, which is handy for privacy-preserving aggregation of patient locations into uniform cells.
Choosing a tool: a quick guide
| Your situation | Good default tool | Why |
|---|---|---|
| Shared, authoritative dataset, many analysts, ongoing edits | PostGIS | Concurrency, integrity, security, central source of truth. |
| Ad-hoc analysis of large files on your laptop / in a notebook | DuckDB + spatial | Zero setup, fast, reads GeoParquet & cloud files directly. |
| Interactive analysis in a familiar dataframe workflow | GeoPandas (Py) / sf (R) | Ergonomic, great for modeling, plotting, and small-to-medium data. |
| Sharing a dataset with collaborators or for archiving | GeoParquet | Compact, typed, portable; far better than shapefiles/CSV. |
| Publishing an interactive web map | PMTiles / ST_AsMVT | Cheap, scalable, no heavy map server required. |
Key takeaway: Think in terms of a stack, not a single tool. PostGIS is your durable, shared spatial database; DuckDB and GeoParquet make ad-hoc, large-scale, and cloud analysis fast and cheap; GeoPandas/sf remain your interactive analysis home. The spatial SQL you learn for PostGIS transfers across much of this ecosystem.
Case Study — Clinic Accessibility Hands-on
One question, carried end-to-end on your own Riverbend database. We'll answer it in five steps, each a short lab. By the end you'll have a reusable accessibility pipeline — every step runnable in PostGIS or DuckDB.
Question: "How many people live within 5 km of a clinic, and which districts are underserved relative to population?"
Step 1 — inventory & sanity-check SQL
Before any analysis: confirm what you have. Counts, coordinate systems, and validity.
-- How much data, and what SRID?
SELECT 'clinics' t, count(*), ST_SRID(geom) FROM clinics GROUP BY 3
UNION ALL
SELECT 'blocks', count(*), ST_SRID(geom) FROM blocks GROUP BY 3
UNION ALL
SELECT 'districts', count(*), ST_SRID(geom) FROM districts GROUP BY 3;
-- Any invalid polygons? (real data often has them)
SELECT district_id
FROM districts
WHERE NOT ST_IsValid(geom);
What we're checking:
- All three layers share SRID 4326 → they'll align without transforming
- Row counts match what we loaded (~3 / ~323 / 6)
- No invalid geometries — if any, fix with
ST_MakeValid(geom)
Step 2 — population within 5 km of a clinic SQL
The core accessibility measure: which population points are "served"?
-- Mark each block as served / unserved, using true meters via geography
SELECT
sum(pop) AS total_pop,
sum(pop) FILTER (WHERE served) AS served_pop,
round(100.0 * sum(pop) FILTER (WHERE served) / sum(pop), 1) AS pct_served
FROM (
SELECT b.pop,
EXISTS (
SELECT 1 FROM clinics c
WHERE ST_DWithin(b.geom::geography, c.geom::geography, 5000)
) AS served
FROM blocks b
) s;
Why ST_DWithin + ::geography: it uses the spatial index and measures real meters — both traps from Part IV, avoided in one line.
Step 3 — service areas as buffers SQL
Sometimes you want the zone itself — to map it, or intersect it with other layers.
-- A 5 km service area around each clinic
SELECT clinic_id,
ST_Buffer(geom::geography, 5000)::geometry AS svc
FROM clinics;
-- Union them into one combined coverage polygon
SELECT ST_Union(
ST_Buffer(geom::geography, 5000)::geometry
) AS coverage
FROM clinics;
ST_Bufferongeographygrows a true-meter zoneST_Uniondissolves overlapping buffers into one coverage shape- Export
coverageas GeoJSON to drop onto a web map, or intersect it with districts (next step)
Buffers are approximations of circles (segmented polygons) and ignore roads/terrain. For travel-time access you'd use a routing service — note the assumption.
Step 4 — access rate by district SQL
Now the decision-relevant output: a rate per district, so we compare fairly across populations.
SELECT
d.name,
d.population,
sum(b.pop) AS pop_in_blocks,
sum(b.pop) FILTER (WHERE served) AS served_pop,
round(100.0 * sum(b.pop) FILTER (WHERE served)
/ nullif(sum(b.pop),0), 1) AS pct_served
FROM districts d
JOIN blocks b ON ST_Contains(d.geom, b.geom) -- points-in-polygons
CROSS JOIN LATERAL (
SELECT EXISTS (SELECT 1 FROM clinics c
WHERE ST_DWithin(b.geom::geography, c.geom::geography, 5000)) AS served
) x
GROUP BY d.name, d.population
ORDER BY pct_served ASC; -- worst-served first
Step 5 — validate before you trust it SQL
A number is not an answer until you've checked it. Build the habit now.
Cross-checks:
- Totals reconcile — does
sum(served) ≤ total_pop? Does district pop roughly matchpopulation? - Spot a known case — pick one block next to a clinic; is it flagged served?
- Boundary effects — blocks near the county edge may have clinics just outside your data → undercount
- Units — distances in meters?
geographyconfirms it - Sensitivity — does the ranking hold at 3 km vs 5 km?
-- One block you can reason about by hand
SELECT b.block_id,
ST_Distance(b.geom::geography,
c.geom::geography)::int AS m
FROM blocks b
JOIN clinics c ON c.clinic_id = 1
ORDER BY m
LIMIT 3; -- nearest blocks to clinic 1
Validation is the analyst's job — not the database's, and certainly not the AI's.
Extending the pattern — more public-health questions
The same building blocks answer many questions. Notice the recurring verbs: contain, within-distance, buffer, intersect, aggregate.
🦠 Disease surveillance & clusters
Cases (points) in districts (polygons), normalized by population.
SELECT d.name, count(c.*) AS cases
FROM districts d
JOIN cases c ON ST_Contains(d.geom,c.geom)
GROUP BY d.name; -- ÷ population → rate
🏭 Environmental exposure
Who lives inside a pollutant buffer?
SELECT sum(b.pop) AS exposed
FROM blocks b, sources s
WHERE ST_DWithin(
b.geom::geography,
s.geom::geography, 2000); -- 2 km
Key takeaway: PostGIS answers sophisticated spatial questions efficiently — and once you know the handful of core predicates, new questions are recombinations of the same moves.
Learning GIS with Generative AI 🤖 New
Generative AI assistants (such as Claude, ChatGPT, Gemini, and coding tools like GitHub Copilot, Cursor, and Claude Code) have become some of the most effective ways to learn spatial concepts and to accelerate real GIS analysis. They are especially valuable for public-health researchers who know their domain deeply but are newer to databases and code. This section shows you how to use them well — and, just as importantly, how to use them safely.
Mindset: Treat the AI as a tireless, fast, occasionally-overconfident teaching assistant and pair-programmer — not as an oracle. It is excellent at drafting, explaining, and translating. It is you who remains the analyst responsible for correctness, ethics, and interpretation.
Four ways AI helps with GIS
Click each card to expand. These map to the whole learning-to-doing journey.
1. Tutor & explainer 📖
Ask it to explain concepts at your level, with public-health examples and analogies. Great for SRIDs, geometry vs. geography, spatial joins, projections, and "why is my distance in degrees?".
Try: "Explain the difference between geometry and geography in PostGIS to someone who knows statistics but not GIS, using a clinic-access example."
2. Code generator & translator 🔁
Turn a plain-English question into spatial SQL, or translate the same analysis between SQL, R (sf), and Python (geopandas). Excellent for getting unstuck on syntax.
Try: "Write a PostGIS query that counts disease cases per district, then show the equivalent in R sf and Python geopandas."
3. Debugger & reviewer 🐞
Paste an error message and your query; ask what's wrong and how to fix it. Ask it to review a query for correctness and performance (e.g., "is this using the spatial index?").
Try: "This query returns 0 rows but I expected matches. Here's the SQL and the SRIDs of each table — what's likely wrong?"
4. Analysis copilot 🚀
Plan a whole analysis with it: what data you need, which functions to use, how to validate results. Coding-agent tools (Claude Code, Cursor) can even run code, inspect data, and iterate with you.
Try: "I want to assess healthcare accessibility for a county. Help me plan the data, the PostGIS steps, and how I'd sanity-check the output."
How to write a good GIS prompt
The quality of the answer depends heavily on the context you give. A reliable recipe:
- Role & level: "You are a PostGIS expert helping a public-health researcher new to SQL."
- The environment: tool and version ("PostGIS 3.6, PostgreSQL 18"), language (SQL / R
sf/ Pythongeopandas). - The data: table names, key columns, geometry column name, geometry type, and SRID of each layer.
- The goal: the precise question and the units you want (meters? rate per 1,000?).
- The ask: "explain each step", "use a spatial index", "keep it to one query".
Copy-ready prompt templates
Fill in the [brackets] and paste into your AI assistant of choice.
Using your own Riverbend schema, ask an AI assistant to write the 5 km access rate by district query (our Step 4). Then verify it against the answer you already have.
Then check the AI's output against reality:
- Does it use
::geographyorST_Transform(meters, not degrees)? - Does it use
ST_DWithin(index) — not bareST_Distance? - Run it. Does the ranking match your Step 4 result?
- Ask a follow-up: "add
EXPLAIN ANALYZEand confirm the index is used."
If its numbers differ from yours, you find out why. That gap-hunting is the skill.
A safe AI-assisted workflow
Use AI as a loop, not a one-shot. A dependable pattern:
LIMIT, a single district, or a known case where you can predict the answer.Trust, but verify: common AI pitfalls in GIS
⚠️ AI can be confidently wrong. Watch for these GIS-specific failure modes:
- Invented functions/arguments: AI sometimes "hallucinates" a plausible-sounding
ST_*function or option that doesn't exist. Confirm names in the PostGIS reference. - SRID / units mistakes: measuring distance on SRID 4326 gives degrees, not meters. AI may forget to
ST_Transformto a projected CRS or to use thegeographytype. Always check the units of your answer. - Invalid geometries: real-world data has self-intersections and nulls. Suggested queries may fail or mislead until you run
ST_IsValid/ST_MakeValid. - Performance blind spots: a "correct" query can be unusably slow without a spatial index or with the wrong join. Ask it to
EXPLAIN ANALYZEand to confirm index use. - Outdated patterns: training data may reference old versions (e.g.,
psycopg2instead ofpsycopgv3). State your versions explicitly. - Statistical / epidemiological subtlety: spatial clustering, the modifiable areal unit problem (MAUP), edge effects, and ecological fallacy are easy for AI to gloss over. Apply your domain expertise.
🔒 Data privacy & ethics (read this twice)
Never paste protected health information (PHI/PII) or precise patient locations into a consumer AI chat. Patient addresses and coordinates are among the most sensitive and re-identifiable data that exist. Pasting them into a public tool may violate HIPAA, GDPR, IRB protocols, and your institution's data-use agreements.
- Share schema, not data. Give the AI your table and column structure, types, and SRIDs — never the actual rows.
- Use synthetic or fake examples when you need sample values ("a clinic at lng/lat 0,0").
- De-identify and aggregate before any sharing — e.g., aggregate points to census tracts or H3 cells; apply your jurisdiction's small-cell suppression rules.
- Prefer approved/enterprise AI tools with a zero-data-retention or no-training agreement, or self-hosted/local models, when working near sensitive data — and follow your institution's policy first.
- Keep a human in the loop for decisions. AI output informs analysis; it does not make public-health decisions.
Good practice: describe your problem in terms of structure and goals. For example — "I have a points table of de-identified case locations (SRID 4326) and a polygons table of census tracts; help me aggregate cases per tract" — gives the AI everything it needs without exposing a single real record.
The AI tooling landscape for GIS
💬 Chat assistants
Claude, ChatGPT, Gemini. Best for explaining concepts, drafting queries, translating between languages, and debugging from pasted errors.
⌨️ In-editor assistants
GitHub Copilot, Cursor, and notebook AI features autocomplete and refactor code where you work — handy in RStudio, VS Code, and Jupyter.
🛠️ Coding agents
Tools like Claude Code can read your files, run queries, inspect results, and iterate on a whole analysis with you — closer to a true copilot than a chatbot.
🔌 MCP & text-to-SQL
The Model Context Protocol (MCP) lets AI assistants connect directly to a (read-only!) PostGIS database to inspect the schema and answer questions in natural language. Powerful — but scope permissions tightly, especially with health data.
Emerging generative-AI applications in the field New
Beyond chat assistants, generative AI is showing up inside the geospatial workflow. These are real, active areas in 2026 — useful to know about even if you don't use all of them yet. Click a card to expand.
🛰️ Geospatial foundation models for imagery
Models like Prithvi (NASA–IBM, trained on Landsat/Sentinel-2 and now even running in orbit) and Clay turn satellite/aerial imagery into ready-to-use features for flood mapping, burn-scar detection, land-use and cloud-gap reconstruction — with little to no manual feature engineering.
Public-health relevance: environmental-exposure assessment, disaster response, monitoring flooding/heat/air-quality drivers of disease.
✂️ Promptable feature extraction (SAM-for-geo)
Segment-Anything-style models (e.g., segment-geospatial/samgeo) let you extract buildings, water bodies, fields, or roads from imagery with a click or a text prompt — no hand-digitizing.
Public-health relevance: rapidly mapping settlements/buildings for population denominators, sanitation, or access studies in data-poor regions.
💬 "Chat with your map / database"
QGIS LLM plugins (GeoAgent, Promptly, IntelliGeo, Whitebox AI Agent) and MCP/text-to-SQL tools let you describe an operation in plain language; the AI inspects your layers/CRS/schema and generates the geoprocessing or SQL to run it.
Public-health relevance: lowers the barrier for epidemiologists to run spatial analyses without deep GIS/SQL skills (still verify the output!).
🧪 Synthetic & privacy-preserving data
Generative models can produce synthetic populations or case locations that preserve realistic spatial distributions without exposing real individuals — for sharing, teaching, or method development.
Public-health relevance: collaborate and publish reproducible spatial methods while protecting PHI; build demos without real patient data.
📍 Geocoding, parsing & record linkage
LLMs are strong at cleaning messy addresses, standardizing place names, parsing free-text locations from clinical notes, and matching records across datasets that lack a shared key.
Public-health relevance: turning inconsistent surveillance/EHR address fields into mappable, linkable data (do this on de-identified data in approved environments).
🔎 Semantic + spatial search (embeddings)
With pgvector inside PostgreSQL/PostGIS you can combine vector embeddings (semantic similarity) with spatial filters — e.g., "facilities similar to X and within 10 km" — in a single database.
Public-health relevance: retrieval-augmented analysis over reports/facilities, smarter matching of services to needs.
🤝 Agentic analysis copilots
Coding agents (e.g., Claude Code, Cursor) now read your files, write and run spatial code, inspect results, and iterate end-to-end — turning a research question into a reproducible pipeline with you supervising.
Public-health relevance: faster, more reproducible analyses; great for prototyping a study workflow before scaling it.
🗺️ AI cartography & narrative reporting
Generative tools assist with map styling, choosing classifications/color schemes, and drafting plain-language summaries of spatial results for briefs and dashboards.
Public-health relevance: communicating findings to decision-makers and the public quickly — with a human checking accuracy and avoiding misleading maps.
⚠️ Same rules apply: these applications are powerful but not infallible. Foundation-model outputs need ground-truthing, LLM-generated code and SQL need verification, and none of them excuse putting real PHI into unapproved tools. Treat AI output as a draft to validate, not a result to publish.
Key takeaway: Generative AI dramatically lowers the barrier to learning and doing GIS — as a tutor, code translator, debugger, and analysis copilot. Get the most from it by giving rich context (versions, schema, SRIDs), asking for explanations, and verifying every result. And never trade your participants' privacy for convenience: share structure, not sensitive data.
Pop Quiz: Test Your Knowledge!
Let's see what you've learned. For each question, consider the options and then click "Show Answer" to see the correct choice and explanation.
Question 1: What is a key advantage of using a spatial database like PostGIS over traditional shapefiles for collaborative public health research?
Question 2: Which PostGIS geometry type would be most suitable for representing the boundaries of administrative districts (e.g., counties)?
Question 3: What is the primary purpose of an SRID (Spatial Reference Identifier) in PostGIS?
Question 4: When reading spatial data from PostGIS into R using `sf::st_read()` or Python using `geopandas.read_postgis()`, what is a benefit of using a SQL query within these functions?
Question 5: To find all health clinics within a 2km radius of a patient's address in PostGIS, which spatial function is most directly applicable and efficient for this proximity query?
Question 6: You need to do fast, one-off analysis of a 50-million-row GeoParquet file on your laptop, with no database server to set up. Which modern tool is the best fit?
Question 7: You're using a generative AI assistant to help with a spatial analysis of de-identified patient case locations. Which practice is most appropriate?
Summary & Resources
This final section summarizes the key benefits of PostGIS for public health research and provides a list of resources for continued learning.
Quick Summary & Benefits Recap
- Overcomes Flat File Limitations: Structured, secure, scalable data storage; better data integrity, attribute handling, concurrent access.
- Enables Complex Spatial Analysis: Rich library of spatial SQL functions for proximity, containment, overlay, buffering directly in the database.
- Promotes Data Integrity and Scalability: Enforces data types, relationships; efficiently manages very large datasets.
- Facilitates Collaboration: Multi-user environment for centralized, consistent data access.
- Integrates with Familiar Analysis Environments: Easy access from R (sf, DBI) and Python (geopandas, SQLAlchemy, psycopg v3).
- Fits a Modern Stack: Complements cloud-native formats (GeoParquet), serverless engines (DuckDB + spatial), and open base data (Overture Maps) — the spatial SQL you learn transfers across them.
- Pairs Well with Generative AI: AI assistants accelerate learning and analysis — when you supply rich context (versions, schema, SRIDs), verify results, and protect sensitive data.
Q&A and Further Resources
To continue your learning journey:
Official Websites:
- PostGIS: postgis.net
- PostgreSQL: postgresql.org
Key R Packages Documentation:
- `sf` (Simple Features): r-spatial.github.io/sf/
- `DBI` (DataBase Interface): dbi.r-dbi.org
- `RPostgres`: rpostgres.r-dbi.org
Key Python Packages Documentation:
- `geopandas`: geopandas.org
- `SQLAlchemy`: sqlalchemy.org
- `psycopg2` (PostgreSQL adapter): psycopg.org/docs/ (psycopg2/3)
Modern Geospatial Stack: New
- DuckDB Spatial: duckdb.org → spatial
- GeoParquet: geoparquet.org
- Overture Maps (open base data): overturemaps.org
- Cloud-Native Geospatial Forum: cloudnativegeo.org
- Spatial SQL reference (PostGIS functions): postgis.net/docs/reference.html
Learning & Doing GIS with AI: New
- Use the copy-ready prompt templates in the "Learning GIS with AI" section as starting points.
- Always state your tool + version (e.g., "PostGIS 3.6, PostgreSQL 18") and your schema + SRIDs in prompts.
- Verify AI-suggested function names against the official PostGIS reference, and check the units of any distance/area result.
- Privacy: never paste PHI/PII or precise patient coordinates into AI tools — share structure and synthetic examples, follow your institution's data policy, and prefer approved/enterprise or local models near sensitive data.
Emerging GenAI-in-geospatial to explore:
- Prithvi geospatial foundation model: ibm-nasa-geospatial (Hugging Face)
- Clay foundation model: madewithclay.org
- segment-geospatial (SAM for imagery): samgeo.gishub.org
- QGIS LLM plugins (GeoAgent, Promptly, IntelliGeo): QGIS natural-language plugins
- pgvector (embeddings in PostgreSQL): github.com/pgvector/pgvector
General Tutorial Search Terms:
- "PostGIS Introduction tutorial"
- "GeoPandas PostGIS tutorial"
- "R sf PostGIS tutorial"
- "DuckDB spatial GeoParquet tutorial"
- "spatial SQL ST_DWithin example"
Hands-on practice with the provided sample database details and code snippets is highly encouraged — and try working through one of the use cases with an AI assistant as your tutor!