Introduction
Geospatial analytics involves analysing data with a focus on location awareness, managing relationships between different locations, and measuring quantities at various sites. Location data poses additional challenges: the Earth is not flat, nor even a sphere or a perfect ellipsoid. Point locations are usually grouped into regions, which themselves tend to be irregularly shaped. Geographic reachability is also important, as not all points are equally accessible.
Spotfire offers rich, multi-layered map charts to represent geospatial data. These charts can include points, lines, polygons, and TMS and WMS layers that display geology, live weather, custom images, terrain, and other relevant information.
Spotfire data functions are scripts that can interact with these maps to dynamically generate or update geospatial calculations.
In this article, we will describe how the Spotfire DSML Python Toolkit can be used to create and integrate new data functions, enhancing Spotfire maps with advanced geo-analytics. We will reference the toolkit as spotfire-dsml, which is the name of the Python module. This article specifically covers the geo-analytics sub-module of spotfire-dsml. The current content refers to version 1.1.2.
Many of the examples shown here are drawn from the Spotfire demo that accompanies the geo-analytics sub-module. You can download all demos from here.
Before you begin
Coordinate reference systems
A coordinate reference system (CRS) is a framework to locate points on Earth. While latitude and longitude are the most commonly known, there are hundreds of coordinate systems designed to represent global or local regions. Each CRS has specific applicability, advantages, and limitations.
Datasets can be expressed in various CRS, and certain systems are more suitable for specific calculations. Therefore, it is essential to have the capability to transform coordinates between different CRS to ensure accuracy and compatibility in geospatial analyses. This capability enables analysts to select the most suitable CRS for their specific requirements and to combine data from various sources, thereby improving the overall accuracy and dependability of their analyses.
Specifying a coordinate reference system
There are different ways to specify a CRS.
- One of the most common and effective ways is through an EPSG code. EPSG codes are unique numeric identifiers assigned by the European Petroleum Survey Group. Examples: 4326 or “EPSG:4326” for WGS84; 3857 or “EPSG:3857” for Web Mercator.
- Alternatively, a CRS can be specified using a PROJ.4 string, which is a text-based representation of projection parameters. PROJ.4 is a library and standard for performing conversions between different CRS. An example of a PROJ.4 string for the Web Mercator projection is:
  “+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs”
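If you want to inspect or compare CRS definitions outside Spotfire, the pyproj library (used under the hood by most Python geospatial stacks) accepts both forms. A minimal sketch:

# Minimal sketch: building the same CRS (Web Mercator) from an EPSG code and from a PROJ.4 string
from pyproj import CRS

crs_from_epsg = CRS.from_epsg(3857)
crs_from_proj4 = CRS.from_proj4(
    "+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 "
    "+x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs")

print(crs_from_epsg.name)          # e.g. 'WGS 84 / Pseudo-Mercator'
print(crs_from_epsg.is_projected)  # True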
Types of coordinate systems
Geographic (GCS) and projected (PCS) coordinate systems are two fundamental ways to represent locations on the Earth's surface. They serve different purposes and are used in various applications depending on the needs of the task at hand.
- WGS84, the World Geodetic System of latitude and longitude used by the GPS satellite navigation system, is geographic. Data is represented on an ellipsoid, and its coordinates are angles, expressed in decimal degrees.
- A popular projection is the Web Mercator (or WGS84 / Pseudo-Mercator), which is widely used across web applications. This is also the projection used within Spotfire maps. Data is represented on a flat two-dimensional surface (plane), and its coordinates are Cartesian coordinates, expressed in metres.
Strictly speaking, reprojecting a CRS involves transforming coordinates from one reference system to another. The term "projected", however, specifically refers to a CRS that has been defined by projecting angular coordinates onto a plane.
Projected CRS are useful because they make display and calculations easier, with no need for complex trigonometry. However, these projections come with limitations, as they introduce distortions and errors due to the process of flattening a curved surface.
Projected CRS can be tailored to different analyses. For instance, Equal Area projections preserve area, making them ideal for spatial distribution analyses; Equidistant projections maintain distances, useful for transportation and communication studies; and Equidirectional (or Conformal) projections preserve angles, critical for navigation and meteorology.
Locally projected CRS can be better than their global counterpart at preserving properties, as Earth’s curvature plays an increasingly diminishing role the smaller the region we are dealing with. One such notable class of CRS is the Universal Transverse Mercator (UTM). This set of projections divides the Earth into strips of 6 degrees of longitude in width. It can provide very accurate local distance calculations.
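As a side note, the usual rule for finding a point's UTM zone is simple enough to sketch directly. The snippet below is a simplified illustration (it ignores the special zones around Norway and Svalbard); the spotfire-dsml helpers that perform this lookup are described later in this article.

import math

def utm_epsg(latitude, longitude):
    # Zone number from longitude: 6-degree strips starting at -180
    zone = int(math.floor((longitude + 180.0) / 6.0)) + 1
    zone = min(max(zone, 1), 60)  # clamp, e.g. for longitude == 180
    # EPSG codes 32601-32660 cover the northern hemisphere, 32701-32760 the southern
    return (32600 if latitude >= 0 else 32700) + zone

print(utm_epsg(40.71, -74.01))  # New York City -> 32618 (UTM zone 18N)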
Additional reading
- Coordinate Reference System
- Cartesian coordinates
- EPSG Codes and PROJ strings
- Universal Transverse Mercator (UTM)
Geographic data representations
Maps in Spotfire can handle two types of data (as opposed to reference) layers.
- A marker layer is a point representation of the data locations. A point is represented by two numeric coordinate columns, for instance its latitude and longitude.
- A feature layer is used to represent shapes. These shapes are held by Spotfire in a special column, the geometry column, which is internally stored in binary (Well Known Binary, or WKB) form. This is known as a vector representation. Although points too can be represented as vectors, we mostly use feature layers for lines or polygons. We will refer to this type of data generically as “shapes”.
Example described in this article
The Spotfire example for geospatial analytics is based on data for crash events in New York City (NYC). Each row in the Motor Vehicle Collisions crash table represents a crash event. In addition to the motor collisions, we loaded a number of datasets that we used to support our analysis.
- Airports in NYC
- Emergency departments providing mandatory care in NYC
- NYC Boroughs.
For the purpose of demonstrating the capabilities of spotfire-dsml, we made some initial alterations to the data - we changed the original coordinate systems of NYC Boroughs and Airports, and we kept only motor crashes with defined locations from the year 2024.
Additional reading
- WKT (Well Known Text) and WKB (Well Known Binary) formats
- Shapefiles
- Motor vehicle collisions - crashes
Use cases
Transform coordinates
In our example, we are starting with two datasets that are expressed in a different CRS than the main accidents dataset.
- NYC Boroughs is a vector dataset of polygon shapes, expressed in the New York Long Island projected CRS, with code EPSG:2263.
- NYC Airports is a dataset with point coordinates, expressed in the Web Mercator projected CRS, with code EPSG:3857.
To analyse all the datasets together, it is essential to reproject both the NYC Boroughs and NYC Airports datasets to the standard geographic coordinate system (WGS84, identified by EPSG code 4326). By using a common CRS, we ensure that all datasets are compatible, enabling accurate and meaningful spatial analysis.
Reproject a vector (shape) dataset
For shapes, the location information is held in the geometry column. We need to specify the CRS we are starting from (source_crs) and the CRS we want to reproject to (target_crs).
The default name for a geometry column is “geometry”. Otherwise, we need to pass the name of the column in an optional parameter (geo_column_name).
If we want to carry forward additional columns, so that they are incorporated in the resulting table, we add them to a separate input parameter (extra_columns) which is a data table. The result is a new data table (projected_df) containing the location information in the new CRS, together with any additional columns carried forward.
# Change CRS of a shape dataset
from spotfire_dsml.geo_analytics import crs_utilities

projected_df = crs_utilities.change_crs_geometry(geometry, target_crs, source_crs,
                                                 geo_column_name = geometry_column_name,
                                                 extra_columns = extra_columns,
                                                 to_spotfire = True)
Original (EPSG:2263) and transformed (EPSG:4326) NYC Boroughs.
Advanced: import shapes from (export shapes to) Spotfire
In the NYC Accidents example Spotfire dashboard, all the I/O interactions between the data functions and Spotfire are already taken care of.
However, for Python developers, it is important to provide a brief description of how inputting and outputting shape data works in Spotfire.
Spotfire stores shapes in a data table that has a few notable characteristics.
- It contains a single geometry column, stored as a binary data type. In order for it to be rendered as a feature layer, it needs to be recognized as a Location and geo-coded as a geometry.
- In most cases, you will also see calculated columns for the shape bounds (XMin, XMax, YMin, YMax) and centroid (XCenter, YCenter). These columns are also of Location type.
- In addition, any number of standard columns can be added to the data table.
When passing a geometry column into a Spotfire data function, it must first be turned into a representation that can be used by geospatial modules in Python (or R, for that matter). Conversely, when exporting a shape back into Spotfire, the geometry must be turned into a binary column. Furthermore, the correct Spotfire metadata must be specified for the data table to be geo-coded and correctly positioned.
All of the above is taken care of by spotfire-dsml. Inputting shapes is handled automatically, whereas outputting shapes is implemented in some functions via an optional Boolean parameter (to_spotfire) or by directly calling prepare_geo_dataframe_for_spotfire.
For instance, in the function change_crs_geometry, the input parameter to_spotfire is True by default. If we wanted to continue using the projected_df data table within the data function, we would set it to False, then invoke prepare_geo_dataframe_for_spotfire at the end of the script, when we are ready to output to Spotfire.
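As a rough sketch of that pattern (the module hosting prepare_geo_dataframe_for_spotfire and its exact arguments are not spelled out here, so treat the final step as illustrative and check the spotfire-dsml API documentation):

from spotfire_dsml.geo_analytics import crs_utilities

# Keep the intermediate result inside the data function
projected_df = crs_utilities.change_crs_geometry(geometry, target_crs, source_crs,
                                                 to_spotfire = False)

# ... further geospatial processing on projected_df ...

# When ready to output to Spotfire, call prepare_geo_dataframe_for_spotfire on
# projected_df (the exact module location and signature are not shown in this
# article; check the spotfire-dsml API documentation).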
Advanced: centroid calculation
The centroid of a shape is a point representing its centre of gravity. There are multiple ways in which a centroid can be calculated in spotfire-dsml.
- When centroid_style = ‘representative’, we will return a representative point that is guaranteed to be inside the shape. This is the quickest calculation and works well in most cases.
- Otherwise, the full centroid will be calculated. If centroid_style = ‘original’, the centroid is calculated on the original data, regardless of the CRS. There is a potential for introducing distortions if the CRS is geographic, but in practice these tend to be small. Setting centroid_style = ‘projected’ would temporarily re-project the data to EPSG:6933 (Lambert Cylindrical Equal Area) to calculate the centroid.
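To see why the ‘representative’ style differs from a true centroid, here is a small illustration using shapely directly (not the spotfire-dsml implementation). For a C-shaped polygon, the centroid falls inside the notch, outside the shape, while the representative point is guaranteed to lie inside:

from shapely.geometry import Polygon

# A C-shaped polygon: a 4 x 4 square with a notch cut out of its right-hand side
c_shape = Polygon([(0, 0), (4, 0), (4, 1), (1, 1), (1, 3), (4, 3), (4, 4), (0, 4)])

print(c_shape.centroid)                                   # lies in the notch, outside the shape
print(c_shape.contains(c_shape.centroid))                 # False
print(c_shape.representative_point())                     # always inside the shape
print(c_shape.contains(c_shape.representative_point()))   # True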
Reproject a dataset of point-coordinates
For marker layers, the location information is contained in two separate columns. The corresponding input parameters for CRS transformations are called first_coordinate and second_coordinate. The reason for these column names is that their nature depends on the original coordinate system. In a GCS, the first coordinate represents latitude, indicating the north-south position, while the second coordinate represents longitude, indicating the east-west position. Conversely, in a PCS, the first coordinate denotes the x-coordinate (Easting), and the second coordinate denotes the y-coordinate (Northing). The order of these coordinates is important for accurate geospatial data representation and analysis.
As before, if we want to carry forward additional columns, so that they are incorporated in the resulting table, we add them to a separate input parameter (extra_columns) which is a data table.
The result is a new data table (projected_coords) containing the coordinates in the new CRS, together with any additional columns carried forward.
# Change CRS of a set of point coordinates
from spotfire_dsml.geo_analytics import crs_utilities

projected_coords = crs_utilities.change_crs_coordinates(source_crs, target_crs,
                                                        first_coordinate, second_coordinate,
                                                        extra_columns = extra_columns)
Display the CRS
A coordinate reference system is more than just a set of coordinates; it is a mathematical model that relates data to the Earth's surface. The definition of a CRS encompasses various crucial details beyond simple coordinates. Many of these details can be accessed by using functions such as serialize_crs, which displays the relevant information for any given CRS.
# Optional step: generate info on the CRS
from spotfire_dsml.geo_analytics import crs_utilities

info_crs = crs_utilities.serialize_crs(crs)
The serialize_crs function outputs a formatted string that provides detailed information about the CRS. For example, the Web Mercator CRS is described as a global projected system that uses metres as its unit of measurement. This CRS is not accurate near the poles, as its validity extends only to latitudes approximately between -85 and 85 degrees. This level of detail helps in understanding the applicability and limitations of different CRS for geospatial analysis.
Other CRS utility functions are:
- crs_coordinate_units: return the CRS units.
- crs_is_projected: return True if the CRS is projected.
- get_crs_code: return the EPSG code of an input CRS, if defined.
- get_crs: create and validate a CRS object from an input integer or string.
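A minimal usage sketch of these helpers (the accepted input forms are assumptions, except for get_crs, which the article states takes an integer or string):

from spotfire_dsml.geo_analytics import crs_utilities

crs = crs_utilities.get_crs(3857)                # create and validate a CRS object
print(crs_utilities.get_crs_code(crs))           # its EPSG code, if defined
print(crs_utilities.crs_is_projected(crs))       # True for Web Mercator
print(crs_utilities.crs_coordinate_units(crs))   # e.g. metres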
Advanced: calculating the UTM zone CRS
Using the Universal Transverse Mercator can be a useful way to perform accurate calculations on a small geographic scale. In spotfire-dsml we provide a way of returning the UTM zone CRS of a given location, expressed as latitude and longitude (calculate_utm) or the median UTM zone CRS for a vector of locations (calculate_median_utm).
Provided the set of locations sits primarily within a single UTM zone, this method can be used to determine a general distance-preserving local projection.
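For reference, deriving the median UTM CRS for a set of points (as used in the worked distance example later in this article) looks like this, where latitude and longitude are columns of point coordinates:

from spotfire_dsml.geo_analytics import crs_utilities

# Median UTM zone CRS for a whole set of points
utm_crs = crs_utilities.calculate_median_utm(latitude, longitude)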
Advanced: shapes touching the 180th meridian
The 180th meridian is a line where longitude values transition from 180 degrees East to -180 degrees West. Calculations near this line are particularly challenging. Although shape datasets typically do not cross the 180th meridian, they may include regions that touch it, such as when the XMax value is 180 or the XMin value is -180. In such cases, certain reprojections can produce incorrect results, like shapes that artificially span half the globe. This is avoided in change_crs_geometry by automatically identifying such regions and shifting them away from the discontinuity by a very small value (adjust_dateline: default 1.E-5). This behaviour can be turned off by setting the input parameter fix_dateline to False.
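If this adjustment is not wanted, it can be switched off in the reprojection call (a minimal sketch, reusing the call shown earlier):

projected_df = crs_utilities.change_crs_geometry(geometry, target_crs, source_crs,
                                                 fix_dateline = False)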
Note: missing data handling
Any rows where the geometry (for a shape data table) or either input coordinate (for a table with point coordinates) is missing will be removed when processing the dataset.
Create shapes
Given an initial set of seed points on a map, buffers of a specified size can be created around them. Circular buffers are a straightforward method to define areas around locations for proximity analysis, such as calculating population density for urban planning purposes.
In spotfire-dsml, you can easily create both circular and rectangular buffers. For example, to analyse accidents near the two NYC airports, you can create buffers around these airports and later use spatial joins (Points in Polygons, see next section) to identify accidents within these buffered areas.
You can specify the radius for a circular buffer or the height and width for a rectangular buffer, along with the desired units (metres, miles, kilometres, or feet). Optionally, a column containing the ids of the seed points can be returned; it is called “circle_id” for a circular buffer and “box_id” for a rectangular one.
Currently, the buffer creation functionality is implemented only for geographic latitude and longitude coordinates.
# Creating a circular buffer:
from spotfire_dsml.geo_analytics import shape_utilities

buffer = shape_utilities.create_circle_buffer(radius, unit, first_coordinate, second_coordinate,
                                              circle_id = point_id)

# Creating a rectangular buffer:
buffer = shape_utilities.create_bounding_box(width, height, unit, first_coordinate, second_coordinate,
                                             box_id = point_id)
Rectangular buffers with a height of 80 km and a width of 50 km.
Advanced: split buffers across the 180th meridian
Whenever the seed points are close to the (180, -180) discontinuity, the generated buffer might end up lying on both sides of it. In this case, the buffer is automatically split into two polygons, one on each side of the discontinuity.
Calculate shape properties
The surface area of each region can be calculated using a choice of two methods:
- Geodesic. This is the default method, known for its accuracy but slower processing time. If the CRS of the data is not geographic, it will be temporarily reprojected to EPSG:4326 (WGS84) for the calculation. The geodesic method accounts for the Earth's curvature, providing precise area measurements.
- Projection-based. This method is typically faster but can be less accurate, especially for large or irregularly shaped areas. If the CRS of the data is geographic, it will be temporarily reprojected to EPSG:6933 (area-preserving PCS) for the calculation.
# Calculate area of geometry column
from spotfire_dsml.geo_analytics import shape_utilities

area = shape_utilities.calculate_area(geometry, crs=crs, unit=unit, method='geodesic')
NYC Boroughs coloured and labelled by their area in square KM.
Additional reading
EPSG:6933 (Lambert cylindrical equal area projection)
Export to file
Once we have modified a shape dataset, or generated a new one, we might want to export it to the file system. This can be done in spotfire-dsml by writing to either a shapefile or a geojson format. The following call will export a file called file_name in the path_folder sub-directory, which must already exist. Any file type extension in file_name is ignored. The type of export is governed instead by the driver input parameter: “shapefile” (the default) or “geojson”.
The bounds_and_centroid parameter is set to “remove” by default. In this case, columns representing bounds and centroid in the dataset will not be written to file. This is because, when importing a shape dataset back into Spotfire, bounds and centroid are automatically re-calculated by Spotfire and, if they were already present, we would see duplicated columns. The bounds and centroid can be kept in the export by setting bounds_and_centroid to a different value.
# Export data to path_folder/file_name
from spotfire_dsml.geo_analytics import io_utilities

io_utilities.export_to_file(path_folder, file_name, driver, data,
                            crs = 'EPSG:4326', geo_column_name = 'geometry',
                            bounds_and_centroid = 'remove')
The default CRS is the standard WGS84 one.
Join spatial datasets
Spatial Joins
A spatial join combines information from two datasets based on their spatial relationships, often used to integrate diverse types of data associated with geographic features. In spotfire-dsml, we have currently implemented Points in Polygons, which works, as the name suggests, by determining which shape (polygon) a point falls within. This method has various applications, such as detecting intruders, identifying the regions where crimes occur, or assigning utility meters to specific zones.
Points in polygons
In our demo example, once we have created the buffers around NYC airports, we would like to see which motor crashes occur within these buffers. A simple call to the points_in_polygons function looks like this:
from spotfire_dsml.geo_analytics import spatial_joins

polygon_id, distance_metres = spatial_joins.points_in_polygons(
    polygon_geometry, polygon_id,
    point_first_coordinate, point_second_coordinate, point_id,
    return_near_matches = False, handle_multiple_matches = 'join',
    source_crs = source_crs)
Parameters polygon_geometry and polygon_id are respectively the geometry and the id of the shape dataset we match the points against. Parameters point_first_coordinate, point_second_coordinate and point_id are the corresponding information for the points.
In this case, we only match points that fall exactly inside a polygon (return_near_matches = False). The polygon_id column (returned as a pandas Series) represents the id of the matched polygon. The distance_metres column will always be equal to zero and can be ignored in this particular scenario.
Crash sites within 3 KM of La Guardia Airport.
See next sections for handling near or multiple matches.
Advanced: near matches
When rendering geographical regions as polygons, some irregularities, such as a rugged coastline, might be smoothed out. As a result, points that are geographically within the boundaries could end up slightly outside the polygon representation. Conversely, users might be interested in identifying points that are closest to a polygon without necessarily being inside it.
We have introduced a parameter called return_near_matches that, when set to True, allows us to mark points that fall within a tolerance (max_distance) of the polygon. This maximum distance is expressed in distance_unit. For the purpose of calculating near matches, a projected CRS is required; by default this is set to Web Mercator (EPSG:3857).
For points falling within max_distance of a polygon shape, the distance_metres column will return this distance. This enables you to tell the difference between exact matches (distance_metres = 0) and near matches (distance_metres > 0).
Crash sites within 3 km of La Guardia Airport, with near matches within 0.5 km coloured black.
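The near-match scenario above corresponds to a call along these lines (a sketch: the parameter controlling the projected CRS is left at its Web Mercator default, and spellings should be checked against the spotfire-dsml API documentation):

from spotfire_dsml.geo_analytics import spatial_joins

polygon_id, distance_metres = spatial_joins.points_in_polygons(
    polygon_geometry, polygon_id,
    point_first_coordinate, point_second_coordinate, point_id,
    return_near_matches = True, max_distance = 0.5, distance_unit = 'km',
    handle_multiple_matches = 'join', source_crs = source_crs)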
Advanced: multiple matches
If the shapes we are testing the points against overlap, a point might end up matching multiple polygons. In this case we have two options:
- When handle_multiple_matches = 'join', a point matching multiple polygons will be marked with the concatenated id of all the matching polygons (in alphabetical order).
- When handle_multiple_matches = 'random_select', a point matching multiple polygons will be marked with a polygon id randomly sampled from the id of the matching polygons.
Example of handling multiple matches as a third category (‘join’). The buffer radius was set to 10 KM to obtain an overlap. Multiple matches are shown in brown.
Calculate distances
This section focuses on calculating distances between pairs of points on the Earth's surface without considering road infrastructure. Example applications include finding the nearest production wells to a water injection well, determining well spacing, managing the spread of infectious diseases, or pest control.
In our demonstration, we use distance calculations to identify which emergency rooms are closest to each accident site. Distance calculations can be constrained within a maximum distance buffer and/or used to determine the nearest neighbours. Distances can be calculated using points from one or two datasets.
We refer to distances calculated on a geographic CRS as "angular distances" and Euclidean distances calculated on Cartesian coordinates as "Cartesian distances." The Spotfire DSML functions can compute either angular or Cartesian distances, providing flexibility for various geospatial analysis needs.
Distance calculation methods
The way distances can be calculated depends on the CRS of the data points.
- Geographic CRS must use angular distances.
- Projected CRS must use Cartesian distances.
Choosing a distance method (and a CRS) is a trade-off and depends on:
- Size of the area to cover.
- Precision requirements.
- Number of points vs speed requirements.
In general, angular distances are more precise, but also more complex and time-consuming, given the amount of trigonometry involved in the calculations.
In spotfire-dsml, we can choose between three angular distance methods:
- Geodesic. The most accurate but also the slowest.
- Haversine (Great Circle). The shortest path between two points on a sphere. This is very fast, and results tend to be within ~0.5% of Geodesic.
- Haversine with local radius. This method uses Haversine with a local approximation for the Earth radius. It is much faster than Geodesic and nearly as accurate at most latitudes.
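For reference, a minimal standalone implementation of the haversine distance (not the spotfire-dsml code) looks like this; passing a local Earth radius instead of the mean radius gives the ‘Haversine with local radius’ variant:

import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Great-circle distance between two latitude/longitude points, in kilometres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2.0 * radius_km * math.asin(math.sqrt(a))

print(haversine_km(40.7128, -74.0060, 40.7769, -73.8740))  # Manhattan to La Guardia, roughly 13 km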
When calculating distances between faraway points, it is best to choose angular distances (and therefore a GCS), as all global PCS distort distances away from the Equator. The ranking of nearest neighbours also depends on the CRS. For instance, although Web Mercator mostly returns distorted distances on large scales, its ranking of nearest neighbours is better than other projected CRS, at least for the first few neighbours (using the ranking from geodesic as a benchmark).
Example of relative timings for angular distance calculations as a function of the number of pairs. Red: Geodesic; Green: Haversine with local Radius; Blue: Haversine.
When the scale of distances is small, we can calculate accurate Cartesian distances using a locally projected CRS. One such example is a local UTM.
Nearest neighbour
In our demo, we calculate the nearest ER departments to each accident site. Here lat1, lon1, id1 are the columns containing the coordinates and id of the motor crashes, and lat2, lon2, id2 those of the ER departments.
from spotfire_dsml.geo_analytics import distances

nearest_neighbor, lines = distances.calculate_nearest_neighbors(crs, unit, distance_metric,
                                                                lat1, lon1, id1,
                                                                lat2, lon2, id2)
Our crs is a geographic one (in this case, it is WGS84). The unit is specified as ‘km’ in this example. The distance metric is a choice between ‘haversine’, ‘haversine_r’ (i.e. Haversine with local radius) and ‘geodesic’.
The columns representing coordinates and id string of each dataset need to be specified. If we wanted to calculate nearest neighbours within the first dataset only, we would be able to omit lat2, lon2, id2.
The resulting nearest_neighbor data table contains the distance matrix between the points of the first dataset (motor crashes) and their nearest neighbour from the second dataset (ER departments). A second data table containing the lines between nearest neighbours is also returned, for compatibility with a pre-existing TERR data function, and can be used to display the connections on the map, although it is not utilised in the demo.
Distance matrix with buffer
When we want to limit the results to only the points that lie within a specified maximum distance of each other, we use a different function. This is not used in the current demo; however, the call is very similar to the one in the previous section, with the addition of a distance buffer.
from spotfire_dsml.geo_analytics import distances

distance_matrix = distances.calculate_distance_matrix(crs, unit, buffer, distance_metric,
                                                      lat1, lon1, id1,
                                                      lat2, lon2, id2)
The resulting distance matrix also contains the nearest neighbour ranking for the pair in each row.
Advanced: full functionality to compare distance methods and timings
The functions calculate_nearest_neighbors and calculate_distance_matrix are both wrappers around a function called calculate_earth_distance. This function accepts a list of multiple distance methods and can help compare accuracy and timings across different choices of method and CRS.
For instance, we might want to calculate and compare the distances within all pairs of points in a dataset. A potential sequence of calls would be as follows:
- Start with a dataset of points expressed in standard latitude and longitude.
- Calculate all three angular distances. Evaluate nearest neighbours using the geodesic distance.
- Project the points to their UTM zone CRS
- Calculate the Cartesian distances of the projected points
- Combine resulting distances and timings.
Example code
import pandas as pd
from spotfire_dsml.geo_analytics import distances, crs_utilities

start_crs = 'EPSG:4326'
distance_unit = 'km'

# Calculate angular distances with latitude and longitude
distance_matrix1, times1 = distances.calculate_earth_distance(
    start_crs, distance_unit, ['haversine', 'haversine_r', 'geodesic'], 'geodesic',
    latitude, longitude, id)

# Calculate UTM zone for this set of points
target_crs = crs_utilities.calculate_median_utm(latitude, longitude)

# Project the points
projected_df = crs_utilities.change_crs_coordinates(start_crs, target_crs, latitude, longitude)

# Calculate the Cartesian distances for the projected points
distance_matrix2, times2 = distances.calculate_earth_distance(
    target_crs, distance_unit, 'cartesian', 'cartesian',
    projected_df['x'], projected_df['y'], id)

# Combine the timing dictionaries
times1.update(times2)
times_df = pd.DataFrame([times1])

# Combine the distance matrices
combo = pd.merge(distance_matrix1, distance_matrix2, on=['id1', 'id2'])
Distance distribution comparison for a dataset of 300 points within 10 km of Madrid. Projected CRS is Web Mercator (EPSG:3857).
As above, but using UTM zone CRS (EPSG:32630) for Cartesian distances.
Additional reading
- Geodesic distances
- Haversine (great circle) distances
- Local Earth radius
- Euclidean (Cartesian) distances
Combining functions
The distance calculations in the previous section exemplify how multiple calls to spotfire-dsml can be integrated into a single data function. This approach leverages the toolkit's modularity, eliminating the need to create multiple data functions for a single use case. In our demo, we illustrate another example of this technique, to help streamline the process and enhance efficiency in geospatial analysis tasks.
The data function [Geospatial] Combine Demo first creates a buffer around the airports, then calculates the crashes that fall within this buffer, and finally exports the buffer to the file system. It effectively replaces three data functions: [Geospatial] Generate Buffer, [Geospatial] Points in Polygons and [Geospatial] Export Shape to File. A sketch of how such a combined function might chain the calls shown earlier in this article follows.
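A minimal sketch of such a combined data function, chaining the calls already introduced above (variable names and parameter values are placeholders, and the buffer output columns are assumed to be named geometry and circle_id):

from spotfire_dsml.geo_analytics import shape_utilities, spatial_joins, io_utilities

# 1. Circular buffers around the airports (3 km radius)
buffer = shape_utilities.create_circle_buffer(3, 'km', airport_lat, airport_lon,
                                              circle_id = airport_id)

# 2. Match crash sites to the buffers
polygon_id, distance_metres = spatial_joins.points_in_polygons(
    buffer['geometry'], buffer['circle_id'], crash_lat, crash_lon, crash_id,
    return_near_matches = False, handle_multiple_matches = 'join',
    source_crs = 'EPSG:4326')

# 3. Export the buffer shapes to the file system
io_utilities.export_to_file(path_folder, 'airport_buffers', 'shapefile', buffer,
                            crs = 'EPSG:4326')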
Additional Resources
More details about the spotfire-dsml package can be found in the Community article Python toolkit for data science and machine learning in Spotfire. Example Spotfire applications can be downloaded from the Exchange page DSML Toolkit for Python - Documentation and Spotfire® Examples.