Running Tesseract Jobs#

Tesseract is a spatiotemporal computation engine that is designed to run arbitrary processing on any data that is in the Geodesic Platform. Tesseract jobs are defined either by a JSON document or using the Geodesic Python API. In this tutorial we will go over how to make a basic Tesseract job and run it on the Geodesic Platform.

Note

This tutorial assumes that you have already installed the Geodesic Platform and have a running instance of the platform. If you have not done so, please see the Getting Started guide.

Creating a Tesseract Job#

To create a tesseract job we use the Geodesic Python API to create and empty geodesic.tesseract.job.Job. The top level properties in the job define some of the basic information needed in the job such as the name and the output spatial reference. The job also must live in a project. When the job is run it will create a node in the knowledge graph in the particular project you specify. Lets create a tesseract-tutorial project and then a job with the name tutorial-job.

import geodesic
import geodesic.tesseract as T
from datetime import datetime
import numpy as np

project = geodesic.create_project(
    name='tesseract-tutorial',
    alias='tesseract tutorial',
    description='A project for the Tesseract jobs tutorial',
    keywords=['tesseract', 'tutorial'],
)

job = T.Job(
    name='tutorial-job',
    alias='Tutorial Job',
    description='A tutorial job to demonstrate the use of Tesseract',
    project=project,
    bbox=(-94.691162,44.414164,-94.218750,44.676466), # Southern Minnesota
    bbox_epsg=4326,
    output_epsg=3857,
)

Here we are using a bounding box that covers a portion of southern Minnesota. We are also setting the output spatial reference to 3857 which is the Web Mercator projection. This means that all outputs of the job will be transformed in to this projection. This is the default setting and doesn’t need to be specified unless you want a different output projection. We also need to tell Tesseract which project the job should be run in. This will ensure that the output is stored in the correct project.

We need an input dataset to use in the job. This is the data that we want to prepare, then feed to a model to do some sort of processing. In this case we will use the Landsat-8 dataset. We can use the Dataset class to create a dataset in the knowledge graph that will then be available to use in any of our jobs. We will add this dataset from Google Earth Engine. You will need a Google Earth Engine credential added to Geodesic to use this data source. For more details on how to add datasets to the knowledge graph, see Boson Overview and Boson Examples.

landsat = geodesic.Dataset.from_google_earth_engine(
    name='landsat-8',
    asset='LANDSAT/LC08/C02/T1_L2',
    credential='gee-service-account',
    alias='Landsat 8 Surface Reflectance',
    domain='earth-bservation',
    category='satellite',
    type='electro-optical',
)
landsat.save()

Next we need to add a step to the job. A step is a block of work to be completed by Tesseract. Steps can be chained together to create processing pipelines and DAGs (Directed Acyclic Graphs). The first step of any job should be to create input assets. In this case we will create an input asset from the Landsat-8 dataset that we just created.

job.add_create_assets_step(
    name="add-landsat",
    asset_name="landsat",
    dataset="landsat-8",
    dataset_project=project,
    asset_bands=[
        {"asset": "SR_B4", "bands": [0]},
        {"asset": "SR_B5", "bands": [0]},
    ],
    output_time_bins=dict(
        user=T.User(
            bins=[[datetime(2021, 6, 18), datetime(2021, 6, 19)]],
        )
    ),
    output_bands=["red", "nir"],
    chip_size=1024,
    pixel_dtype=np.uint16,
    pixels_options=dict(
        pixel_size=(30.0, 30.0),
    ),
    fill_value=0.0,
    workers=1,
)

Let’s go over each of the parameters in this step.

name is what you want to call this step in the job. If this is not specified, the step will be called “add-<asset_name>”

asset_name is the name of the asset in the output Dataset that will be created. This means that any subsequent steps in the Tesseract job can use the asset_name to reference this asset.

dataset is the name of the dataset from the Geodesic Platform to use as input in this step. In this case this is Landsat-8. You can also provide a geodesic.entanglement.dataset.Dataset instead of the name.

dataset_project is the project that the dataset is in. In this case we are using the same project as the job. This can also be the global project for datasets that are available to all Geodesic platform users. This argument does not need to be specified if a geodesic.entanglement.dataset.Dataset object is provided for the dataset argument.

asset_bands is a list of bands to include in the asset. This parameter will depend on how the input dataset is structured. In this case we are using the SR_B4 and SR_B5 assets from the Landsat-8 dataset. The SR_B4 asset is red and the SR_B5 asset is near infrared. We also specify that we want to use the first band of each asset. This is because the Landsat-8 dataset has every band split into different assets. However, some datasets have all of their bands combined into a single asset. In that case you must specify the asset name and a list of band indices to use.

output_time_bins is used to specify how you would like the data to be binned in time. There are several different binning options available. For this job we want to use the User time binning. This allows completely user specified time bins as a list of lists. The outer list defines how many time bins there are and each inner list are the start and end of each individual bin. In this case we are using a single bin that spans from June 18st, 2021 to June 19th, 2021. This will get us a single image from the Landsat-8 dataset for the area specified in the job.

The time binning allowed in Tesseract is very flexible. You can specify any number of bins using a number of different methods. For example, the StridedBinning option which allows us to define a bin width and stide to evenly divide the time range. Here you can see that we are using 8 day bin widths with an 8 day stride. This means that each bin will contain 8 days of data and the bin start edges will be 8 days apart. This is shown in the figure below.

../../_images/8dayStride.png

As another example, let’s imagine that for this step we wanted to have 1 day bins with a 3 day stride. This would mean that each of our bins would contain 1 day of data and the bin start edges would be 3 days apart. The result is 2 empyt bins in between the ones we requested. Any data that falls into these bins will not be included in the Tesseract job. This is shown in the figure below.

../../_images/3dayStride.png

output_bands is a list of band names to use in the output asset. In this case we are using the red and nir bands. These bands will be created in the output asset.

chip_size is the size of the chips that will be created in the output data. This size is in pixels and will determine how the data is chunked. In this case we are using a chip size of 512x512 pixels.

pixel_dtype is the data type of the pixels in the output asset. This does not need to be the same as the input data type. The conversion will be done automatically in case a different dtype is required by later steps. In this case we are using uint16. Be careful not to use dtypes that could cause numerical underflow or overflow.

pixels_options is a dictionary of options that control other aspects of how the pixels are created. In this case we are setting the pixel size to 30x30 meters. The units of this parameter will always be in the units of the output spatial reference. 30 meters is the native resolution of the Landsat-8 dataset, but different values can be specified here and Tesseract will automatically resample the data. For other options available here, see geodesic.tesseract.components.PixelsOptions.

fill_value is the value to use for pixels that are outside of the input data. This should no be confused with a no_data value. Fill values are used by Zarr when reading output data from this job, but do not affect the operation of the job. In contrast, No Data values are values that are ignored during certain processing steps, such as aggregations.

workers will determine how many workers are created to complete this step of the job.

This is all that is needed to create a basic Tesseract job. This job will simpley gather the data for the requested dataset, bands, times and area which can then be retrieved as a zarr file in cloud storage. This is a great option if you are just building a dataset either for later analysis, or a training dataset to use in a deep learning model. However, we can also add more steps to the job to create a processing pipeline.

Let’s add a step to the job that will calculate the NDVI (Normalized Difference Vegetation Index) from the 2 bands that we defined on the input. NDVI is used often in vegetation health analysis and defined as:

../../_images/ndvi_equation.png

To add a modeling step that will perform this calculation we can use the add_model_step method. This allows us to define a docker image built with the Tesseract Python SDK (details in the next tutorial) that will be used to run the model. We can also define the inputs and outputs of the model. In this case we will use the landsat asset that we created in the previous step as the input. We will also define the ndvi asset as the output. This will be the name of the asset that is created.

job.add_model_step(
    name="calculate-ndvi",
    container=T.Container(
        repository="docker.io/seerai",
        image="calculate_ndvi",
        tag="v0.0.2",
    ),
    inputs=[T.StepInput(
        asset_name="landsat",
        dataset_project=project,
        spatial_chunk_shape=(1024, 1024),
        type="tensor",
        time_bin_selection=T.BinSelection(all=True),
    )],
    outputs=[T.StepOutput(
        asset_name="ndvi",
        chunk_shape=(1, 1, 1024, 1024),
        type="tensor",
        pixel_dtype=np.float32,
        fill_value="nan"
    )],
    workers=1,
)

Let’s go over each of the parameters in this step. We are specifically telling Tesseract that we want to create a models step which uses a container to do some sort of processing. This is different than the first step we created which just gathers, aggregates and prepares data.

name is what you want to call this step in the job.

container is a geodesic.tesseract.job.Container object that defines the docker image to use for this step. In this case we are using the seerai/tesseract-tutorial image which is available on Docker Hub. This image is built with the Tesseract Python SDK and contains a script that will calculate the NDVI from the input data. The next tutorial goes over how to build this container using the SDK.

inputs is a list of geodesic.tesseract.job.StepInput objects that define the inputs to the model. In this case we are using the landsat asset that we created in the previous step. We are also defining the time_bin_selection to be all. This means that we want to use all of the time bins in the asset. In this case since we are just calculating a band combination that doesnt depend on time, we dont actually need all of the time steps to be passed but since there are only 10, it will fit in memory easily and we can do all of them at once. We are also defining the spatial_chunk_shape to be (1024, 1024). This means that we want to process the data in chunks of 1024x1024 pixels. This shape corresponds to the chip_size parameter of the input dataset which allows for efficient processing. The model will read one chunk of input data per chunk of input to the model. There can be reasons for these two chunk sizes to be different, but in general it will be much more efficient if the chunk shapes match. This is a good size for the NDVI calculation since it is a simple calculation. However, if you are doing a more complex calculation or processing a large number of time steps, you may want to use a smaller chunk size to avoid running out of memory.

outputs is a list of geodesic.tesseract.job.StepOutput objects that define the outputs of the model. In this case we are defining the ndvi asset as the output. This will be the name of the asset that is created. We are also defining the chunk_shape to be (10, 1, 1024, 1024). This means that we want to create chunks of 10 time bins, 1 band and 1024x1024 pixels. This is the same as the input chunk shape. This means that the output chunks will be the same size as the input chunks except that we are going from 2 bands down to 1. We are also telling Tesseract that we would like the output to have the float32 data type and use nan as the fill value.

workers just like in the previous step, this will determine how many workers are created to complete this step of the job.

We have now defined a job that has two steps in sequence. The first step will create an asset from the Landsat-8 dataset and the second step will calculate the NDVI from the input data.

../../_images/tesseract-job-chart.png

With the job defined we can now run it on the Geodesic Platform. We can use the submit method on the job.

job.submit(dry_run=True)

With the dry_run parameter set to True this will send the job description to the Tesseract service and run validation on it. This is a good way to make sure that you have defined the job correctly before submitting it. Running again with dry_run=False will submit the job and begin to spin up machines to run it. The job is first split into small units of work we call “quarks” then each quark is processed and reassembled into the output. To monitor the job as its running we can use the watch method.

job.watch()

This will bring up a map that shows the chunks of data that will be processed. They start with a white outline and turn yellow when that chunk is being processed. When the chunk is done processing it will turn green.

../../_images/tesseract-job-widget.png

You can also check the status of the job using the geodesic.tesseract.Job.status() function. This will give a simple text report with the state the job is in as well as how many quarks there are total and how many have been completed.

As soon as the job begins running, the output dataset will be available, however, until at least one quark has been processed there will be no data in the output to view. As work is completed, the dataset will be filled in. The output dataset is accessible in a few different ways. First, the output is always added to Entanglement as a new node in the knowledge graph in whatever project the job was run in. This operates exactly the same as any other dataset meaning it can be displayed on a map, queried using all available methods through Boson, or used as the input dataset in another Tesseract job.

To get the dataset that was created by the Tesseract job, you can use the function geodesic.get_objects().

objects = geodesic.get_objects()
objects

This will return a list of all of the objects in the project. You should see something like:

[dataset:earth-bservation:satellite:electro-optical:landsat-8,
dataset:*:*:*:tutorial-job]

The second object in the list is the output dataset from the Tesseract job. You can get the dataset object by grabbing the [1] index of the list. Lets take a look at what assets are in the dataset.

ndvi = objects[1]
ndvi.item['item_assets']

You should see all of the assets available in the dataset including ‘ndvi’ which is the output or our model:

{'landsat': {'description': 'Zarr formatted dataset',
     'eo:bands': [{'name': 'red'}, {'name': 'nir'}],
     'href': 'gs://tesseract-zarr/2023/12/07/eac3410e47d595abcce1269336e0174a089cee24/tensors.zarr/landsat',
     'roles': ['dataset'],
     'title': 'landsat',
     'type': 'application/zarr'},
 'logs': {'description': 'GeoParquet formatted dataset',
     'href': 'gs://tesseract-zarr/2023/12/07/eac3410e47d595abcce1269336e0174a089cee24/features/logs/logs.parquet',
     'roles': ['dataset'],
     'title': 'logs',
     'type': 'application/parquet'},
 'ndvi': {'description': 'Zarr formatted dataset',
     'href': 'gs://tesseract-zarr/2023/12/07/eac3410e47d595abcce1269336e0174a089cee24/tensors.zarr/ndvi',
     'roles': ['dataset'],
     'title': 'ndvi',
     'type': 'application/zarr'},
 'zarr-root': {'description': "The root group for this job's tensor outputs",
     'href': 'gs://tesseract-zarr/2023/12/07/eac3410e47d595abcce1269336e0174a089cee24/tensors.zarr',
     'roles': ['dataset'],
     'title': 'zarr-root',
     'type': 'application/zarr'}}

The other assets are also results of running the Tesseract Job. landsat is the input asset that was used to run the model. logs is a log of the job that was run. This can be useful for debugging jobs. zarr-root is the root of the Zarr file that contains all of the output data from the job. This zarr file will not be directly accessible unless a user specified location was used (this will be covered in an advanced Tesseract usage tutorial).

Lets use the geodesic.Dataset.get_pixels() method to read the ndvi asset from the dataset using Boson. To query this dataset we must specify at least the bounding box to be read as well as the asset that we want to read.

ndvi_pixels = ndvi.get_pixels(
    bbox=(-94.691162,44.414164,-94.218750,44.676466),
    asset_bands=[{"asset": "ndvi", "bands": [0]}],
)

This will return a numpy array with the requested data. Let take a look at the data:

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10,10))
plt.imshow(ndvi_pixels[0])
../../_images/ndvi_results.png

Accessing the Zarr File Directly (Advanced)#

Another way is to directly access the output files in cloud storage. Tesseract jobs that output raster data are stored as Zarr files in cloud storage. This option will only be available if you specify a location that the job should be written out to that you have credentials to access. See the output parameter of :class:`geodesic.tesseract.Job` for more information. The location of the files themselves is listed in the Entanglement dataset itself, and when using the python API the zarr() method of the job object can be used. To access a particular asset in the zarr file, you can do:

zarr_asset = job.zarr('ndvi')

This will get that asset as a Zarr group that contains a few arrays in it that contain the output data as well as metadata associated with it. Zarr arrays are accessed much like python dictionaries with the key in square brackets. The arrays you will find in the group are:

tesseract: This is the actual data array output by the model. In this example this will be the calculated NDVI value for each pixel in the input. The ‘tesseract’ asset is always a 4-dimensional array with indices (time, band, y, x). If the dataset is a non-temporal dataset then the size of dim 0 will be 1.

times: These are the timestamps (numpy.datetime64[ms]). This will be a 2-dimensional array with size equal to the number of time steps in the output dataset by 2 (n time steps, 2). The first dimension just corresponds to each time step in the output dataset and the second dimension are the bin edges for the time bin as defined in the job.

y: These are the Y (typically North-South) coordinates of each pixel in the array in the output spatial reference as defined in the job (EPSG:3857 in this example).

x: Same as above but for the X (typically East-West) coordinate.

Each of these arrays operates much the same as a numpy array. If you want to read the entire array into memory as a numpy array you can simply access all of the data using the numpy array format: np_array=zarr_asset['tesseract'][:] for example would read the entire ‘tesseract’ output into local memory. See the Zarr Documentation for more details on usage.

The assets created by Tesseract are always 4-dimensional arrays with indices (time, band, y, x). If the dataset is a non-temporal dataset, then the size of dim 0 will be 1.

Lets check the results of the job.

import matplotlib.pyplot as plt

ndvi = job.zarr('ndvi')['tesseract']
plt.imshow(ndvi[0, 0])
../../_images/ndvi_results.png

If at any time you restart your python kernel and lose reference to the job, you can get it again by searching all of the jobs in that project. Just use the Geodesic function geodesic.tesseract.get_jobs().

What’s Next?#

In this tutorial we covered how to build a basic tesseract job and run in it on the Geodesic Platform. We created an input data asset from the Landsat-8 dataset and then ran a simple model on it giving us the output asset NDVI. In the next tutorial we will look at how to build that model using the Tesseract Python SDK.