How We Think About Data

Why Spatiotemporal?

If you’re coming from a Geographic Information Systems (GIS) background, you’re probably familiar with the term "spatiotemporal". Spatiotemporal data are data that have both a location and a time. Virtually all data in the world are derived from a location and a time. If you consider the attributes that data can have, they essentially boil down to a list of properties or measurements, the location and time that these properties or measurements correspond to, and any connections to other data. Take, for example, a cell phone picture of a family on a beach.

If we look at this photo from a data perspective, it potentially contains the following information:

The GPS location of where it was taken
The time it was taken
The RGB pixel content of the image
The make/model of the camera
The focal length of the camera
Numerous other pieces of metadata related to the photo

Imagine this photo was uploaded to a social media site - it’s now part of a graph; it’s connected to other data. The subjects of the photo have social relationships to other people, who in turn have their own photos. Even something that’s “just” a photo contains much more information than we typically think about. The location data can associate it with a landmark, or add critical context for understanding the photo’s content. The time information adds yet another dimension. This simple example illustrates a core tenet of our philosophy at SeerAI: all data are spatiotemporal data.

Sometimes the location and/or time are not relevant to the question we are trying to answer, but that’s okay - in that case they can just be ignored. If you start with spatiotemporal, you’re starting with almost everything.
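As a rough illustration of the idea above, the photo can be modeled as a record holding properties, an optional location, an optional time, and links to other records. This is only a sketch: the `Observation` class, its field names, and the camera values are made up for illustration and are not part of Geodesic.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class Observation:
    """Hypothetical record: measurements plus optional where/when/links."""

    properties: dict[str, Any]
    location: Optional[tuple[float, float]] = None  # (longitude, latitude)
    time: Optional[datetime] = None
    links: list[str] = field(default_factory=list)  # ids of related records


# Illustrative values only; the camera details are placeholders.
photo = Observation(
    properties={"camera": "example-phone", "focal_length_mm": 4.2},
    location=(0.0, 0.0),
    time=datetime(2022, 7, 1, 7, 1, 31, tzinfo=timezone.utc),
    links=["person:alice", "person:bob"],
)

# A question that does not involve place or time simply reads the properties
# and leaves the location and time fields untouched.
focal_length = photo.properties["focal_length_mm"]
```

Because location and time are optional fields rather than separate data types, the same record serves both spatial and non-spatial questions.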

Another reason we focus so strongly on spatiotemporal data is that it’s a hard problem. Dealing with spatiotemporal data appropriately is not as simple as adding the latitude and longitude. Spatiotemporal data take many forms and formats. They come from many different APIs and sources. Working with spatiotemporal data becomes significantly more challenging when those data are large in volume - Spatiotemporal Big Data is a whole other ball game.

But that’s where Geodesic comes in. When it comes to working with connected spatiotemporal data, Geodesic makes hard things easy, and impossible things possible. Geodesic integrates with other tools in your data science arsenal to make solving spatiotemporal data challenges - accessing data sources of varying formats and from various APIs, integrating disparate data easily, and scaling up to planetary-scale workloads - much simpler.

Geospatial data

Let’s forget about time for a moment - if we ignore the time dimension, geospatial data can be broken down into two main categories: raster data and vector, or feature, data. With Geodesic, we focus on these two types of data, although other types, such as triangulated irregular networks (TIN), network datasets such as road graphs, and point clouds, are possible.

Raster Data

Raster data is data that is represented by values defined on a uniform grid, such as an image. Raster data can be composed of multiple bands, which represent different measurements. The photo above is an example of raster data with three bands: red, green, and blue (RGB). The RGB pixel values visually represent a picture, but in general, raster data doesn’t need to be visualized in order to be useful. Other examples of raster data include satellite images (which may include non-visible light, such as infrared), land cover/land use maps, weather data, digital elevation/terrain models, cost surfaces, and many more. If the information you are after is easily represented as a grid of numbers, raster may be the best representation. In Geodesic, raster data is typically handled in one of a few ways:

NumPy Arrays - these are a great way to represent raster data and our preferred way of working in Python-land
STAC Items - Rather than working with the raw data directly (which is often very large), a STAC Item references external data, such as a GeoTIFF in cloud storage. We will discuss STAC in more detail later.
GDAL Datasets - Geodesic makes heavy use of GDAL under the hood. Geodesic all but eliminates the need for users to work with GDAL directly, while exposing much of its functionality. GDAL Datasets can be read directly into NumPy arrays with Geodesic.
PyTorch/TensorFlow/Other - Geodesic is designed to get your data to your AI model more quickly and easily than ever before
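To make the "grid of numbers with bands" idea concrete, here is a minimal NumPy sketch of a three-band raster. It uses only NumPy and synthetic values; the band-first layout `(bands, rows, cols)` is an assumption chosen for illustration and is not necessarily the layout Geodesic returns.

```python
import numpy as np

# A tiny synthetic "image": 3 bands (red, green, blue), 4 rows, 5 columns.
# Band-first layout is one common convention, assumed here for illustration.
rgb = np.zeros((3, 4, 5), dtype=np.uint8)
rgb[0] = 200  # red band
rgb[1] = 100  # green band
rgb[2] = 50   # blue band

# Per-pixel math across bands: a simple unweighted brightness.
# Cast to float first so the arithmetic is not done in 8-bit integers.
brightness = rgb.astype(np.float32).mean(axis=0)

# Boolean mask of pixels where the red band exceeds the green band.
red_dominant = rgb[0] > rgb[1]
```

The same indexing and band arithmetic applies to rasters that are never displayed, such as elevation models or infrared bands.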

Feature Data

Feature data is data represented by a planar geometry such as a point, linestring, polygon, or a combination of these, optionally combined with other information called properties. A geospatial feature can represent many different things, such as an event that occurred on the earth, the footprint of a building and metadata about that building, a detection from a satellite image such as a field boundary, or the location of a cell phone ping. Most commonly, feature data is represented as GeoJSON, a popular format for geospatial vector data.

{
    "id": "null-island",
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [0.0, 0.0]
    },
    "bbox": [0.0, 0.0, 0.0, 0.0],
    "properties": {
        "name": "Null Island",
        "population": 0
    }
}

The above shows a basic example of a geospatial feature in GeoJSON. We can make this geospatial feature a spatiotemporal feature by adding either a datetime field or start_datetime/end_datetime fields to represent when this feature exists in time. Geodesic has tools that reduce some of the burden of working with the time dimension. In the Python API, a new feature can be created/instantiated in many different ways:

Direct Creation

import geodesic

feature = geodesic.Feature(
    id="null-island",
    geometry='POINT(0.0 0.0)',
    properties={'name': 'Null Island', 'population': 0}
)

returns:

{
    'type': 'Feature',
    'geometry':
        {'type': 'Point',
        'coordinates': (0.0, 0.0)},
    'bbox': (0.0, 0.0, 0.0, 0.0),
    'properties': {'name': 'Null Island', 'population': 0}
}

Note that, as a convenience to the user, the geometry can be provided in a few different ways:

Well Known Text/Binary (WKT/B) as above.
A Python dict of GeoJSON
A shapely geometry
Anything that implements the __geo_interface__ convention.
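The `__geo_interface__` convention in the last item is simply an attribute that returns a GeoJSON-like mapping. The sketch below shows a hypothetical class implementing it, plus a small helper that accepts either a plain dict or such an object. `PhotoLocation` and `to_geojson_geometry` are illustrative names, not Geodesic functions, and WKT input is left out because parsing it needs a library such as shapely.

```python
class PhotoLocation:
    """Hypothetical class exposing its position via __geo_interface__."""

    def __init__(self, lon: float, lat: float) -> None:
        self.lon = lon
        self.lat = lat

    @property
    def __geo_interface__(self) -> dict:
        # GeoJSON-style mapping; coordinates are (longitude, latitude).
        return {"type": "Point", "coordinates": (self.lon, self.lat)}


def to_geojson_geometry(obj) -> dict:
    """Return a GeoJSON-like geometry from a dict or a __geo_interface__ object."""
    if isinstance(obj, dict):
        return obj
    if hasattr(obj, "__geo_interface__"):
        return dict(obj.__geo_interface__)
    raise TypeError(f"cannot interpret {type(obj).__name__} as a geometry")


as_dict = {"type": "Point", "coordinates": (0.0, 0.0)}
as_object = PhotoLocation(0.0, 0.0)

# Both inputs normalize to the same geometry mapping.
same = to_geojson_geometry(as_object) == to_geojson_geometry(as_dict)
```

Shapely geometries expose the same attribute, which is why libraries that follow the convention can exchange geometries without depending on one another.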

From Various Geospatial Formats

Feature data from a variety of sources are easy to add using Geodesic. Some examples are below:

geojson = geodesic.FeatureCollection.from_geojson('geojson.geojson')
shp = geodesic.FeatureCollection.from_shapefile('shapefile.shp')
fgdb = geodesic.FeatureCollection.from_file_geodatabase('features.fgdb', layer="my_layer")
gpx = geodesic.FeatureCollection.from_gpx('my_strava_ride.gpx')
other = geodesic.FeatureCollection.from_file('...', layer="my_layer")

Spatiotemporal Asset Catalog (STAC)

While we primarily deal with rasters and features, we don’t like to restrict things to just these types. We need a truly data-agnostic way to look at spatiotemporal data. There are many approaches to this, but we have settled on the Spatiotemporal Asset Catalog (STAC) spec. Above we mentioned how geospatial feature data consists of a geometry and optional properties/attributes. STAC extends this concept by adding assets. An asset is some remote (or local) file or service that shares the same spatiotemporal extent as the feature. For a concrete example of this, consider the beach image above. If we represent the location of the image extracted from the EXIF metadata as a geospatial feature, it might look something like this:

{
    "id": "null-island-beach",
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [0.0, 0.0]
    },
    "bbox": [0.0, 0.0, 0.0, 0.0],
    "properties": {
        "name": "Family Photo",
        "aperture": "f/1.8",
        "datetime": "2022-07-01T07:01:31Z"
    }
}

This is a good way to represent the metadata associated with that image - as a geospatial feature. We can do better though and represent the entire thing as a STAC Item:

{
    "id": "null-island-beach",
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [0.0, 0.0]
    },
    "bbox": [0.0, 0.0, 0.0, 0.0],
    "properties": {
        "name": "Family Photo",
        "aperture": "f/1.8",
        "datetime": "2022-07-01T07:01:31Z"
    },
    "stac_version": "1.0.0",
    "assets": {
        "image": {
            "title": "Family Photo",
            "description": "a photo of a family at a beach",
            "type": "image/jpeg",
            "href": "https://images.unsplash.com/photo-1612392987205-c53f0200a175?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=2670&q=80"
        }
    }
}
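The step from feature to STAC Item can be sketched with plain Python dictionaries. The helper name `feature_to_stac_item` and the `example.com` URL are placeholders for illustration and not part of Geodesic. The STAC Item specification also lists a `links` array among its required fields, so the sketch adds an empty one.

```python
import copy
import json


def feature_to_stac_item(feature: dict, assets: dict) -> dict:
    """Return a STAC-Item-shaped copy of a GeoJSON feature (illustrative helper)."""
    item = copy.deepcopy(feature)  # leave the caller's feature untouched
    item["stac_version"] = "1.0.0"
    item.setdefault("links", [])   # required by the STAC Item spec; empty here
    item["assets"] = copy.deepcopy(assets)
    return item


feature = {
    "id": "null-island-beach",
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
    "bbox": [0.0, 0.0, 0.0, 0.0],
    "properties": {
        "name": "Family Photo",
        "aperture": "f/1.8",
        "datetime": "2022-07-01T07:01:31Z",
    },
}

assets = {
    "image": {
        "title": "Family Photo",
        "type": "image/jpeg",
        "href": "https://example.com/family-photo.jpg",  # placeholder URL
    }
}

item = feature_to_stac_item(feature, assets)
print(json.dumps(item, indent=2))
```

The geometry, bounding box, and properties are unchanged. The Item only adds a pointer to where the actual pixels live.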

Boson: The Geospatial Service Mesh

Boson can be thought of as a horizontally scalable geospatial service proxy. That is a mouthful, so let's break it down. "Horizontally scalable" simply means that Boson can handle a massive number of requests simultaneously. "Geospatial service proxy" means that Boson serves as an intermediary between various geospatial services, such as Google Earth Engine or ArcGIS Online. What this means is that a user can request data from just about any source and get it in just about whatever form is desired.

Boson operates on a three-part paradigm: there is a Servicer, the Boson middle representation, and a Provider. The Provider is the source of the data. It could be an API, or even static files in a cloud bucket. The Servicer is the vehicle by which the user requests the data, and the way it is served out to the world at large. The Boson middle representation is an intermediate form that allows Boson to connect ANY Provider to ANY Servicer. For example, the user could seamlessly put Google Earth Engine data on an ArcGIS map, or pull ArcGIS data into a Python environment. And all the while, the user doesn't need to know where the data come from, only how they wish to receive them.
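The Provider / middle representation / Servicer split is a form of the adapter pattern, and a toy version may help fix the idea. Everything below is hypothetical: the class names, the `read`/`serve` methods, and the choice of GeoJSON-like dicts as the middle representation are assumptions made for this sketch and say nothing about how Boson is actually implemented.

```python
import json
from typing import Iterable, Protocol


class Provider(Protocol):
    """Hypothetical source side: produces the middle representation."""

    def read(self) -> Iterable[dict]: ...


class Servicer(Protocol):
    """Hypothetical serving side: consumes the middle representation."""

    def serve(self, features: Iterable[dict]) -> str: ...


class StaticPointsProvider:
    """Pretend data source holding (name, lon, lat) rows."""

    def __init__(self, rows: list[tuple[str, float, float]]) -> None:
        self.rows = rows

    def read(self) -> Iterable[dict]:
        # Middle representation assumed here: GeoJSON-like feature dicts.
        for name, lon, lat in self.rows:
            yield {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"name": name},
            }


class GeoJSONServicer:
    def serve(self, features: Iterable[dict]) -> str:
        return json.dumps({"type": "FeatureCollection", "features": list(features)})


class CSVServicer:
    def serve(self, features: Iterable[dict]) -> str:
        lines = ["name,lon,lat"]
        for feat in features:
            lon, lat = feat["geometry"]["coordinates"]
            lines.append(f"{feat['properties']['name']},{lon},{lat}")
        return "\n".join(lines)


def connect(provider: Provider, servicer: Servicer) -> str:
    """Any provider can feed any servicer because both speak the middle form."""
    return servicer.serve(provider.read())


provider = StaticPointsProvider([("Null Island", 0.0, 0.0)])
csv_text = connect(provider, CSVServicer())
geojson_text = connect(provider, GeoJSONServicer())
```

With N providers and M servicers, this arrangement needs N + M adapters rather than N × M direct converters, which is the practical payoff of a shared middle representation.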

If we put everything above together, we can now describe the concept of a Dataset in Geodesic.

Datasets

Datasets are the way that any data is represented in the Geodesic Platform. A Dataset is simply a collection of metadata that describe the remote data source as well as a small configuration object that tells Boson how to access it, including any secure credentials required to access a downstream API. A Dataset can have zero or more assets, just as a STAC Item/Collection can. You can, in some ways, think of our Dataset as a superset of a STAC Collection. You’ll see references to assets scattered through the documentation. When you query data from Boson, you are referencing a Dataset and one or more assets to return some raster or feature data. For more information about Boson and Datasets see the next section.
