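BigQuery’s GIS functions perform well on large tables because the GEOGRAPHY type is built on the S2 geometry library. Shapes are approximated by a hierarchy of S2 cells, which lets BigQuery discard most non-matching candidates cheaply before it runs exact geometric checks. Well-Known Text (WKT) is only one of the text formats used to load and display geographies; it is not what the spatial indexing is built on.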
Let’s see this in action. Imagine you have a table called my_dataset.earthquakes with columns latitude (FLOAT64), longitude (FLOAT64), and magnitude (FLOAT64). You want to find all earthquakes that occurred within a specific geographic area, say, a bounding box around California.
First, define your bounding box. For California, a rough bounding box could be:
- Min Latitude: 32.5
- Max Latitude: 42.0
- Min Longitude: -124.5
- Max Longitude: -114.0
Now, let’s construct a polygon representing this bounding box and query for earthquakes within it.
WITH california_bbox AS (
  -- Closed ring in longitude-latitude order: SW, SE, NE, NW, back to SW.
  SELECT ST_GEOGFROMTEXT(
    'POLYGON((-124.5 32.5, -114.0 32.5, -114.0 42.0, -124.5 42.0, -124.5 32.5))'
  ) AS bbox
),
earthquake_locations AS (
  -- Convert the raw coordinates to GEOGRAPHY points (longitude first).
  SELECT
    magnitude,
    ST_GEOGPOINT(longitude, latitude) AS earthquake_point
  FROM
    `my_dataset.earthquakes`
)
SELECT
  e.magnitude
FROM
  earthquake_locations AS e
CROSS JOIN
  california_bbox AS b
WHERE
  ST_CONTAINS(b.bbox, e.earthquake_point)
This query first creates a Common Table Expression (CTE) california_bbox that builds the bounding box as a polygon with ST_GEOGFROMTEXT, listing the four corners in longitude-latitude order and repeating the first corner to close the ring. Another CTE, earthquake_locations, converts your latitude and longitude into BigQuery’s GEOGRAPHY type using ST_GEOGPOINT. The main SELECT statement then uses ST_CONTAINS to check whether each earthquake_point lies inside the polygon.
The speed comes from how BigQuery represents GEOGRAPHY values rather than from any single function. Geographies are handled with the S2 library, which covers the sphere with a hierarchy of cells. For a predicate such as ST_CONTAINS, each shape can be approximated by the cells that cover it, so a cheap cell comparison rules out most pairs and the exact geometric test runs only on the remaining candidates. This matters most in spatial joins, where it avoids comparing every point to every polygon. In the query above the points are computed on the fly, so BigQuery still reads the latitude and longitude of every row; skipping data altogether requires a stored GEOGRAPHY column in a table clustered on that column.
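One caveat with expressing the box as a polygon: GEOGRAPHY edges are geodesics, so the northern and southern edges do not follow lines of constant latitude exactly. If a true latitude/longitude rectangle is what you want, ST_INTERSECTSBOX may be a closer fit. A minimal sketch, assuming the my_dataset.earthquakes table described earlier:

```sql
-- Sketch: filter earthquakes to a latitude/longitude rectangle.
-- Argument order is (geography, lng1, lat1, lng2, lat2).
SELECT
  magnitude
FROM
  `my_dataset.earthquakes`
WHERE
  ST_INTERSECTSBOX(
    ST_GEOGPOINT(longitude, latitude),
    -124.5, 32.5,   -- west longitude, south latitude
    -114.0, 42.0    -- east longitude, north latitude
  );
```

For a box of this size the two approaches differ only near the northern and southern edges, so the choice mostly matters when points close to the boundary are important.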
The core problem BigQuery’s GIS functions solve is performing complex spatial analysis on massive datasets efficiently. Traditionally, this required specialized databases or significant engineering effort. BigQuery brings this capability directly into a scalable data warehouse. You can perform operations like:
- Proximity Analysis: Finding points within a certain distance of another point or polygon (ST_DWITHIN).
- Spatial Joins: Joining tables based on spatial relationships (e.g., finding all stores within a specific city’s boundaries).
- Area Calculations: Determining the area of polygons (ST_AREA).
- Intersections and Unions: Combining or finding overlapping areas of geometries (ST_INTERSECTION, ST_UNION).
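The operations in this list can be tried without any table by using literal geographies. The sketch below is illustrative only: the two polygons and the point are made-up shapes near San Francisco, and the column aliases are hypothetical names chosen for readability.

```sql
-- Sketch: proximity, area, intersection, and union on literal geographies.
WITH shapes AS (
  SELECT
    ST_GEOGFROMTEXT(
      'POLYGON((-122.6 37.6, -122.2 37.6, -122.2 37.9, -122.6 37.9, -122.6 37.6))'
    ) AS area_a,
    ST_GEOGFROMTEXT(
      'POLYGON((-122.4 37.7, -122.0 37.7, -122.0 38.0, -122.4 38.0, -122.4 37.7))'
    ) AS area_b,
    ST_GEOGPOINT(-122.4194, 37.7749) AS city_center
)
SELECT
  -- Distances are expressed in meters.
  ST_DWITHIN(city_center, area_b, 5000) AS center_within_5km_of_b,
  -- Areas are returned in square meters.
  ST_AREA(area_a) AS area_a_square_meters,
  ST_AREA(ST_INTERSECTION(area_a, area_b)) AS overlap_square_meters,
  ST_AREA(ST_UNION(area_a, area_b)) AS combined_square_meters
FROM
  shapes;
```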
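The GEOGRAPHY data type is fundamental. Coordinates are longitudes and latitudes on the WGS84 reference ellipsoid, but BigQuery performs its calculations on a sphere, with edges treated as geodesics. That is what keeps distance and area calculations meaningful across large parts of Earth’s surface. BigQuery has no planar GEOMETRY type; that type belongs to systems such as PostGIS, where it assumes a flat plane and suits projected maps. For anything involving latitude and longitude in BigQuery, GEOGRAPHY is your go-to.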
When you call ST_GEOGPOINT(longitude, latitude), BigQuery turns those two numbers into a GEOGRAPHY point; note that longitude comes first. For ST_CONTAINS(polygon, point), evaluation is conceptually two-stage: a coarse comparison of the S2 cells covering each shape discards pairs that cannot match, and the exact geometric calculation runs only on the pairs that survive. This two-stage process is what keeps spatial joins fast.
A common pitfall is leaving coordinates as plain latitude/longitude FLOAT64 columns. Every query then has to rebuild the points with ST_GEOGPOINT, and BigQuery cannot cluster the table spatially. Prefer converting your raw coordinates to GEOGRAPHY once and storing the result, ideally in a table clustered on that column. Failing that, make sure every spatial query builds its GEOGRAPHY objects consistently with ST_GEOGPOINT or a similar function.
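Converting once and storing the result could look like the sketch below. The destination table name my_dataset.earthquakes_geo and the column name location are hypothetical; the source table is the one described earlier. Clustering on the GEOGRAPHY column is what gives BigQuery a chance to skip data for spatial filters.

```sql
-- Sketch: materialize the points once and cluster the table on them.
CREATE TABLE `my_dataset.earthquakes_geo`
CLUSTER BY location AS
SELECT
  magnitude,
  -- SAFE. yields NULL instead of an error for out-of-range coordinates.
  SAFE.ST_GEOGPOINT(longitude, latitude) AS location
FROM
  `my_dataset.earthquakes`;
```

Later queries can then filter on location directly, for example with ST_CONTAINS or ST_DWITHIN, without rebuilding the points.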
Another aspect to consider is the choice of spatial function. ST_COVERS and ST_CONTAINS are similar but differ on boundaries: ST_COVERS(a, b) is true if no point of b lies outside a, while ST_CONTAINS(a, b) additionally requires that the interiors of a and b intersect. In practice, a point lying exactly on a polygon’s boundary is covered by the polygon but not contained in it. For point-in-polygon checks, ST_CONTAINS is usually what you want, and it is also appropriate for checking whether one polygon lies entirely within another. If shapes that sit on the boundary should count as matches, use ST_COVERS instead. To test whether two shapes merely touch or overlap, ST_TOUCHES and ST_INTERSECTS are the relevant predicates.
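The boundary difference is easiest to see on a tiny example. In the sketch below, the triangle and the test points are made up for illustration; the point (1, 1) is a vertex of the triangle, so the two predicates should disagree for it and agree for the other two points.

```sql
-- Sketch: compare ST_CONTAINS and ST_COVERS for points outside,
-- on the boundary of, and inside a triangle.
SELECT
  ST_GEOGPOINT(i, i) AS p,
  ST_CONTAINS(
    ST_GEOGFROMTEXT('POLYGON((1 1, 20 1, 10 20, 1 1))'),
    ST_GEOGPOINT(i, i)
  ) AS is_contained,
  ST_COVERS(
    ST_GEOGFROMTEXT('POLYGON((1 1, 20 1, 10 20, 1 1))'),
    ST_GEOGPOINT(i, i)
  ) AS is_covered
FROM
  UNNEST([0, 1, 10]) AS i;  -- (0,0) outside, (1,1) vertex, (10,10) inside
```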
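The most surprising thing is how BigQuery handles invalid geometries. A GEOGRAPHY value is always valid, so problems surface when the value is constructed rather than when it is measured. By default, constructors such as ST_GEOGFROMTEXT and ST_GEOGFROMGEOJSON raise an error on invalid input, for example a polygon with self-intersections. Two options let your query complete anyway: passing make_valid => TRUE asks BigQuery to attempt a repair, and the SAFE. prefix (as in SAFE.ST_GEOGFROMTEXT) returns NULL instead of failing. Both are incredibly useful when dealing with user-generated or imperfect spatial data, and because the repair happens at construction, functions like ST_AREA only ever see valid input.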
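A small sketch of both options for tolerating bad input follows. The two WKT strings and the column aliases are illustrative; the second polygon is a self-intersecting "bowtie" that strict parsing is expected to reject.

```sql
-- Sketch: parse possibly invalid WKT without aborting the query.
WITH raw_shapes AS (
  SELECT
    'valid square' AS label,
    'POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))' AS wkt
  UNION ALL
  SELECT
    'self-intersecting bowtie',
    'POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))'
)
SELECT
  label,
  -- Strict parse; SAFE. turns a parse error into NULL.
  SAFE.ST_GEOGFROMTEXT(wkt) AS strict_or_null,
  -- Lenient parse; BigQuery attempts to repair the shape.
  ST_GEOGFROMTEXT(wkt, make_valid => TRUE) AS repaired
FROM
  raw_shapes;
```

Keeping both columns side by side while cleaning data makes it easy to count how many rows needed repair.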
The next step after mastering basic spatial queries is often performing spatial joins, where you combine data from two tables based on their geographic relationships.