Skip to main content

PostGIS

What is PostGIS?

PostGIS is a powerful spatial extension for the PostgreSQL database. It adds support for geographic objects, allowing you to store, query, and analyze spatial data like points, lines, and polygons directly within your database. Think of it as a tool that lets PostgreSQL understand geographical concepts, similar to how it handles standard data types like text or numbers.

Imagine you have a regular spreadsheet. You can store a list of addresses in it, but you can't ask it, "Which addresses are within 1 kilometer of this cafe?" PostGIS is like an upgrade for your database (PostgreSQL) that gives it a brain for geography.

It lets you store and work with spatial data, like points (a specific building), lines (a road), or polygons (a city boundary). This allows you to ask and answer complex geographical questions directly in your database.


Core Concepts

To use PostGIS effectively, we'll focus on three main concepts:

Points and Polygons (Data Types)

  • Point: A single location, defined by its longitude and latitude.

  • Polygon: A closed shape made up of points. The boundaries of our geo_areas (subzones or planning area) are polygons.

Coordinate System (SRID 4326)

  • A coordinate system is the method we use to precisely define a location on Earth. The most common system, used by GPS and Google Maps, is WGS 84(World Geodetic System 1984).

  • The SRID (Spatial Reference System Identifier) is a unique number that identifies a specific projection.

  • We will use SRID 4326. This is the most common system, used by GPS and Google Maps. It's based on longitude and latitude. It's crucial that all our spatial data uses this same SRID to ensure accurate comparisons.

https://www.cockroachlabs.com/docs/stable/srid-4326

Power-up (The Spatial Index)

  • A spatial index is a special kind of index that makes geographical queries incredibly fast.

  • Without it, if we want to find a job within a specific subzone, the database would have to check the boundaries of every single subzone or planning area like searching for a book by reading every single book in a library.

  • With a spatial index, the database can instantly narrow down the search to just a few likely subzones, making the query almost instantaneous. We will use a GiST(Generalized Search Tree) index for this.

https://postgis.net/workshops/postgis-intro/indexing.html


Key Functions We'll Use

While PostGIS has hundreds of functions, our migration will rely on just a few simple ones.

  • ST_GeomFromGeoJSON(geojson_string): This is the key function used in our data population Rake task. It takes a GeoJSON string (the boundary data from our file) and converts it into a PostGIS geometry object that can be stored in our boundary column.

  • ST_MakePoint(longitude, latitude): This function takes our regular longitude and latitude numbers and turns them into a PostGIS Point geometry.

  • ST_SetSRID(geometry, srid): This function is a tagger. It takes a geometry (like a point) and tags it with the correct SRID. We'll use this to tag our job location points with SRID 4326.

  • ST_Contains(geometry_A, geometry_B): This is the magic function we'll use. It asks the question, "Does geometry_A completely contain geometry_B?" For us, this will be: "Does the geo_areas polygon contain the job location point?" This is how we'll match jobs to their correct geographical area.


Implementing with Rails

Local Requirements

Before you can run the migration and populate the data, you need to ensure your local environment is set up correctly. This involves having a PostgreSQL database with the PostGIS extension installed and adding the necessary gems to your Rails application.

  • Database: You must have a PostgreSQL database running locally.

  • PostGIS Extension: The PostGIS extension needs to be installed on your local PostgreSQL server.

### macOS using brew
brew install postgresql postgis

### Linux (Debian/Ubuntu)
sudo apt-get install postgresql postgresql-contrib postgis

boundary column in geo_area

The datatype of the boundary column must be a PostGIS geometry type. Because our data may contain both single Polygon features and multi-part MultiPolygon features, we will use the more generic geometry type to ensure compatibility.

  • geometry: The base PostGIS data type for all spatial objects. It is flexible enough to store various shapes, including points, lines, polygons, and multi-part geometries.

  • 4326: This is the SRID(Spatial Reference System Identifier). It explicitly tells PostGIS that the coordinates are in the WGS 84 system, which uses latitude and longitude.

Implementation

We will implement this by creating a database migration to modify the schema and a separate Rake task or seeder for data population. In this case, we'll use a Rake task.

Migration

First, generate a Rails migration to add the boundary column to the geo_areas table and enable the PostGIS extension. This migration ensures the database schema is correctly set up.

class AddPostgisBoundaryToGeoAreas < ActiveRecord::Migration[7.0]
def change
# Enables the PostGIS extension
enable_extension 'postgis'

add_column :geo_areas, :boundary, :geometry, limit: { srid: 4326 }
add_index :geo_areas, :boundary, using: :gist # Creates a spatial index for fast lookups
end
end

After running rails db:migrate, our geo_areas table will have the new boundary column.

Data Population

We'll need to create a Rake task to read the GEOJSON file and fill the boundary column. The task will use the RGeo gem to parse the GEOJSON and the ST_GeomFromGeoJSON function in SQL to convert the data into a PostGIS-compatible format.

# lib/tasks/geo.rake
require 'json'

namespace :geo do
desc 'Populates geo_areas with boundary data from GEOJSON'
task populate_boundaries: :environment do
puts 'Loading GEOJSON data...'
geojson_file = File.read(Rails.root.join('storage', 'subzone_boundary.geojson'))
geojson_data = JSON.parse(geojson_file)
updated_count = 0

Geo::Area.transaction do
geojson_data['features'].each do |feature|
properties = feature['properties']
geometry_geojson_string = feature['geometry'].to_json

# Extract the SUBZONE_C code from the Description HTML string using a regex
subzone_code_match = properties['Description'].match(/<th>SUBZONE_C<\/th> <td>(.*?)<\/td>/)

if subzone_code_match && subzone_code_match[1]
subzone_code = subzone_code_match[1]
geo_area = Geo::Area.find_by(code: subzone_code)

if geo_area
# Use raw SQL to update the boundary to prevent ActiveRecord from
# treating the ST_GeomFromGeoJSON function as a literal string.
sql = Geo::Area.sanitize_sql_array([
"UPDATE #{Geo::Area.table_name} SET boundary = ST_GeomFromGeoJSON(?) WHERE id = ?",
geometry_geojson_string,
geo_area.id
])
Geo::Area.connection.execute(sql)
updated_count += 1
else
puts "Warning: No GeoArea found for code #{subzone_code}. Skipping."
end
else
puts 'Warning: Could not find SUBZONE_C in properties for a feature. Skipping.'
end
end
end
puts "Done. Total records updated: #{updated_count}"
end
end

We would then run this task with rails geo:populate_boundaries. This approach is fundamental to a clean, maintainable Rails application, as it avoids external gem dependencies for a task that can be handled with standard libraries and database functions.


Using gems vs Raw SQL

Using Gems(RGeo)

RGeo is the best gem for handling spatial data in Ruby and excellent for daily application tasks and single-record lookups because they prioritize developer experience and safety

Pros

  • Readability: The code is clean, concise, and easy to read. It fits seamlessly into a Rails application's conventions.

  • Safety: The gem handles SQL injection protection automatically. It properly quotes and escapes values, preventing security vulnerabilities.

  • Abstraction: It abstracts away complex PostGIS function names and syntax, reducing the chance of developer error.

  • Object-Oriented: work with familiar Ruby objects instead of raw data hashes. (e.g., RGeo::Point)

Cons

  • Inefficiency for Bulk Operations: A major drawback is the N+1 query pattern that results from looping through records to perform a lookup for each one. This is extremely slow for large datasets due to multiple database round trips.

  • Limited Optimization: The high-level API prevents you from easily implementing advanced database optimizations like the VALUES clause.

Using Raw SQL

Raw SQL is the optimal choice when absolute performance is the primary concern, such as in data migrations or background jobs.

Pros

  • Peak Performance: Using the VALUES clause, you can send a massive batch of data to the database in a single query. The database processes this efficiently, eliminating network latency and overhead.

  • Full Control: You have complete control over the query. You can fine-tune it for maximum efficiency by leveraging specific PostGIS functions and query structures.

  • No Abstraction: There are no gem layers to add overhead or obscure the underlying database operations.

Cons

  • Security Risks: You are fully responsible for preventing SQL injection. You must meticulously sanitize all input using methods like ActiveRecord::Base.sanitize_sql_array.

  • Complexity: Raw SQL strings can be long, complex, and difficult to read or debug. This can make the code harder to maintain.

  • Lack of Portability: The code is specific to your database (PostgreSQL with PostGIS) and may not work with others without significant modification.

Conclusion

  • For our specific task of migrating Gig::TempJobs to the Jobs table, using a raw SQL query is the optimal solution. It's the fastest way to move large amounts of data at once because it lets the database do all the heavy lifting in a single, efficient action. This approach avoids the slow N+1 query problem that gems can sometimes create with bulk operations. We can use hybrid of using raw sql and the RGeo library, but since we are not performing any complex queries involving spatial data, then we can use raw sql for now.

  • For future functionality involving spatial data, we will use the RGeo gem. RGeo is a better fit for complex queries and single-record lookups because it provides a safer, more readable, and more maintainable way to work with geographic data directly in our code.

To summarize:

  • Use raw SQL for bulk data migration or processing a large dataset to achieve peak performance
  • Use RGeo for all future application functionality that will involve spatial data queries, as it prioritizes safety and developer experience