Job

class geodesic.tesseract.job.Job(job_id=None, **spec)

Bases: geodesic.bases._APIObject

Represents a Tesseract Job.

The class can be initialized either with a dictionary (**spec) that represents the request for the particular type, or with a job ID. If a job ID is provided, it queries the Tesseract service for that job and then updates this object with the specifics of that job.

Parameters

**spec – A dictionary representing the job request.
job_id – The job ID string. If provided, the job will be initialized with information retrieved by making a request to Tesseract.

name

(str) - a unique name for the dataset created by this job.

Descriptor: _StringDescr

alias

(str) - a human readable name for the dataset created by this job

Descriptor: _StringDescr

description

(str) - a longer description for the dataset created by this job

Descriptor: _StringDescr

bbox

(tuple, list, dict, str, bytes, BaseGeometry) - the rectangular extent of this job. Can be further filtered by a geometry

Descriptor: _BBoxDescr

bbox_epsg

(int) - the EPSG code of the bounding box spatial reference.

Descriptor: _IntDescr

output_epsg

(int) - the EPSG code of the output spatial reference. Pixel size will be with respect to this.

Descriptor: _IntDescr

geometry

(str, dict, bytes, BaseGeometry) - A geometry to filter the job with; only assets intersecting this geometry will be processed. Inputs can be WKT, WKB, GeoJSON, or anything that implements __geo_interface__.

Descriptor: _GeometryDescr
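
For illustration, the __geo_interface__ protocol mentioned above is a property that returns a GeoJSON-like dictionary. The sketch below defines a hypothetical Box class (not part of geodesic) that exposes a rectangle this way. Whether Job accepts such an object exactly as written should be checked against the library.

```python
class Box:
    """Hypothetical rectangle type exposing the __geo_interface__ protocol."""

    def __init__(self, xmin, ymin, xmax, ymax):
        self.bounds = (xmin, ymin, xmax, ymax)

    @property
    def __geo_interface__(self):
        xmin, ymin, xmax, ymax = self.bounds
        # GeoJSON polygon rings are closed: the first and last points match.
        ring = [
            (xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin),
        ]
        return {"type": "Polygon", "coordinates": [ring]}


# Illustrative coordinates only (longitude/latitude order).
box = Box(-109.05, 36.99, -102.04, 41.0)
print(box.__geo_interface__["type"])
```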

global_properties

(GlobalProperties, dict) - DEPRECATED. Will be removed in v1.0.0. Properties applied to unspecified fields in an asset spec

Descriptor: _TypeConstrainedDescr

asset_specs

the initial assets to compute in the job

Descriptor: AssetSpecListDescr

workers

(int) - Number of workers to use for each step in the job. Can also be specified on each step individually.

Descriptor: _IntDescr

steps

(Step, dict) - A list of steps to execute

Descriptor: _ListDescr

hooks

(Webhook, dict) - NOT YET IMPLEMENTED. A list of webhooks to execute when job is complete

Descriptor: _ListDescr

output

(Bucket, dict) - the output, other than default storage

Descriptor: _TypeConstrainedDescr

project

the project that this job will be assigned to

Descriptor: _ProjectDescr

load(job_id, dry_run=False)

Loads job information for job_id if the job exists

Parameters

job_id (str) – The job ID to load
dry_run (bool) – If True, only loads the job information, not the dataset or item.

submit(overwrite=False, dry_run=False, timeout_seconds=30.0)

Submits a job to be processed by tesseract

This function takes the job defined by this class and submits it to the Tesseract API for processing. Once submitted, the dataset and items fields will be populated with the SeerAI dataset and STAC item, respectively. Keep in mind that even though the links to files in the STAC item will be populated, the job may not yet be complete, so some of the chunks may not be finished.

Parameters

overwrite – if the job exists, deletes it and creates a new one
dry_run – runs this as a dry run (no work submitted, only estimated.)
timeout_seconds – how long to wait for the job to be submitted before timing out.

zarr(asset_name=None)

Returns the Zarr group for the corresponding asset name

Parameters: asset_name – name of the asset to open and return
Returns: zarr file pointing to the results.

ndarray(asset_name)

Returns a numpy.ndarray for specified asset name.

USE WITH CAUTION! RETURNS ALL OF WHAT COULD BE A HUGE ARRAY

Parameters: asset_name – name of the asset to open and return
Returns: numpy array of all the results.
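
To make the caution above concrete, the following sketch estimates how much memory a dense array would need before loading it. It uses only the standard library. The shape and item size are hypothetical values chosen for illustration, not taken from any real job.

```python
from math import prod


def estimated_nbytes(shape, itemsize):
    """Return the number of bytes a dense array of this shape would occupy."""
    return prod(shape) * itemsize


# Hypothetical result: 12 time steps, 10 bands, 20000 x 20000 pixels,
# stored as 8-byte floats.
shape = (12, 10, 20000, 20000)
size_gib = estimated_nbytes(shape, itemsize=8) / 2**30
print(f"{size_gib:.1f} GiB")
```

If the estimate exceeds available memory, working with the chunked result returned by zarr() is likely the safer route.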

status(return_quark_geoms=False, return_quark_status=False, return_alerts=False, warn=False)

Queries the Tesseract service for the job's status.

Parameters

return_quark_geoms (bool) – Whether the query to the service should ask for all of the quark geometries. If True, it will populate the geometry in this class.
return_quark_status (bool) – If True will query for the status of each individual quark associated with the job.
return_alerts (bool) – If True, will return all alerts (planning errors, warnings, etc) for the job.
warn (bool) – If any alerts are returned, warns the user with a Python warning

Returns

A dictionary with the response from the Tesseract service

add_create_assets_step(name, asset_name, workers=1, dataset=None, dataset_project=None, stac_items=None, asset_bands=None, output_time_bins=None, pixels_options=None, warp_options=None, rasterize_options=None, no_data=None, pixel_dtype=None, fill_value=None, ids=None, filter=None, datetime=None, chip_size=512, output_bands=None, compression='blosc', page_size=1000)

Add a data input to this Tesseract Job

This adds a data input to this Job. Although there are many arguments, many of them don’t need to be specified. The following rules apply:

You MUST specify either a dataset or stac_items, but not both. Specifying both is undefined and will raise an Exception.
You MUST specify pixels_options, or rasterize_options, or leave both as None. Specifying both is undefined and will raise an exception. If you specify neither, the Features/Items will be added in vector/GeoJSON format.
If pixels_options is specified, you must specify asset_bands as there is no general way to know what asset and band list from the dataset is desired.
You do not need to specify the dataset_project unless the dataset’s project is ambiguous based on the name alone. This will check the active_project first, followed by global, and raise an exception if the dataset is not in either. If you specify a Dataset object, the dataset_project will be pulled from that.

This method returns self so that it can be chained together with other methods.

Parameters

name – the name of the step. This must be unique across the whole job.
asset_name – the name of the output asset in Tesseract that this will create. This name can be referenced by future Steps in the job.
workers – the number of workers to use for this step. (default=1)
dataset – the name of the Dataset or a Dataset object that has been saved in Entanglement.
dataset_project – the project that the Dataset belongs to. This is to resolve ambiguity between Datasets that have the same name as each other.
stac_items – A list of Features or STAC Items that are used in lieu of a Dataset as this step’s inputs. Do not specify more than a handful of features via this method as the job performance may suffer or the Job may fail to submit successfully.
asset_bands – a list of asset/bands combinations. The combination of the asset and the list of bands will be extracted from the dataset, if available. It’s not always possible for Tesseract to guarantee that the asset/bands are available without starting the job. Double check your arguments to avoid Job failure after the job has been submitted.
output_time_bins – a specification of how to create the output time bins for the job.
pixels_options – If this is set, Tesseract will assume that this step will create a tensor output from either the specified dataset or the stac_items provided.
warp_options (deprecated) – If this is set, Tesseract will assume that this step will create a tensor output from either the specified dataset or the stac_items provided.
rasterize_options – If this is set, Tesseract will assume that this step will create a tensor output by rasterizing either the feature outputs from querying the dataset or using the provided stac_items.
no_data – For pixels jobs, this will be used as the no_data value for the input rasters.
pixel_dtype – The data type of tensor outputs. Not needed for features.
fill_value – The value to set as the no data value for tensor output. This will be set as the “fill_value” in the resulting zarr output file.
ids – A list of IDs to filter the dataset to. Useful if you know exactly what data you wish for Tesseract to use.
filter – a CQL2 JSON filter as a Python dict. This will be used to filter the data if the dataset supports filtering.
datetime – The range to query for input items from the dataset. This may be specified either as a tuple of datetimes/rfc3339 strings or as a STAC style range, ‘YYYY-mm-ddTHH:MM:SS[Z][+-HH:MM]/YYYY-mm-ddTHH:MM:SS[Z][+-HH:MM]’, ‘YYYY-mm-ddTHH:MM:SS[Z][+-HH:MM]/..’ or ‘../YYYY-mm-ddTHH:MM:SS[Z][+-HH:MM]’
chip_size – for tensor outputs, the size in pixels of each chip. Must satisfy 256 <= chip_size <= 2048.
output_bands – a list of string names of what the output bands should be called. Length must match the total number of bands in asset_bands.
compression – what compression algorithm to use on compressed tensor chunks. ‘blosc’ is default and usually very effective.
page_size – how many items to query at a time from Boson. For complex features, this may need to be a smaller value (default 1000 is usually fine), but for simpler features using a large value will speed up the processing.

Returns

This Job after this step has been added. This is so that these can be chained. If you want to suppress the output, call like so: _ = job.add_data_input(…)

Examples

Add an asset from the “srtm-gl1” dataset. This will use the pixels functionality to reproject/resample

>>> job = Job()
>>> _ = job.add_data_input(
...         name="add-srtm",
...         asset_name="elevation",
...         dataset="srtm-gl1",
...         asset_bands=[{"asset": "elevation", "bands": [0]}],
...         pixels_options={
...             "pixel_size": 30.0
...         },
...         chip_size=2048
...     )

Add an asset from a feature dataset. This will use the rasterize functionality to rasterize the features

>>> job = Job()
>>> _ = job.add_data_input(
...         name="add-usa-counties",
...         asset_name="counties",
...         dataset="usa-counties",
...         rasterize_options={
...             "pixel_size": [500.0, 500.0],
...             "value": "FIPS"
...         },
...         chip_size=1024
...     )

Add the same as the previous step, but do not rasterize

>>> job = Job()
>>> _ = job.add_data_input(
...         name="add-usa-counties",
...         asset_name="counties",
...         dataset="usa-counties",
...     )
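
The argument rules listed for this method can be summarized as a small pre-flight check. This is a sketch only: check_data_input_args is a hypothetical helper, not part of geodesic, and it reads the documented chip_size bound as 256 to 2048 inclusive. The library's own validation may differ.

```python
def check_data_input_args(dataset=None, stac_items=None, pixels_options=None,
                          rasterize_options=None, asset_bands=None,
                          chip_size=512):
    """Raise ValueError if the arguments break one of the documented rules."""
    # Exactly one of dataset / stac_items must be given.
    if (dataset is None) == (stac_items is None):
        raise ValueError("specify exactly one of dataset or stac_items")
    # pixels_options and rasterize_options are mutually exclusive.
    if pixels_options is not None and rasterize_options is not None:
        raise ValueError("specify at most one of pixels_options or rasterize_options")
    # Pixel inputs need an explicit asset/band selection.
    if pixels_options is not None and not asset_bands:
        raise ValueError("asset_bands is required when pixels_options is set")
    # Assumed reading of the chip_size constraint.
    if not 256 <= chip_size <= 2048:
        raise ValueError("chip_size must be between 256 and 2048")


# Mirrors the first example above; passes silently.
check_data_input_args(
    dataset="srtm-gl1",
    asset_bands=[{"asset": "elevation", "bands": [0]}],
    pixels_options={"pixel_size": 30.0},
    chip_size=2048,
)
```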

add_data_input(name, asset_name, workers=1, dataset=None, dataset_project=None, stac_items=None, asset_bands=None, output_time_bins=None, pixels_options=None, warp_options=None, rasterize_options=None, no_data=None, pixel_dtype=None, fill_value=None, ids=None, filter=None, datetime=None, chip_size=512, output_bands=None, compression='blosc', page_size=1000)

Add a data input to this Tesseract Job

This adds a data input to this Job. Although there are many arguments, many of them don’t need to be specified. The following rules apply:

You MUST specify either a dataset or stac_items, but not both. Specifying both is undefined and will raise an Exception.
You MUST specify pixels_options, or rasterize_options, or leave both as None. Specifying both is undefined and will raise an exception. If you specify neither, the Features/Items will be added in vector/GeoJSON format.
If pixels_options is specified, you must specify asset_bands as there is no general way to know what asset and band list from the dataset is desired.
You do not need to specify the dataset_project unless the dataset’s project is ambiguous based on the name alone. This will check the active_project first, followed by global, and raise an exception if the dataset is not in either. If you specify a Dataset object, the dataset_project will be pulled from that.

This method returns self so that it can be chained together with other methods.

Parameters

name – the name of the step. This must be unique across the whole job.
asset_name – the name of the output asset in Tesseract that this will create. This name can be referenced by future Steps in the job.
workers – the number of workers to use for this step. (default=1)
dataset – the name of the Dataset or a Dataset object that has been saved in Entanglement.
dataset_project – the project that the Dataset belongs to. This is to resolve ambiguity between Datasets that have the same name as each other.
stac_items – A list of Features or STAC Items that are used in lieu of a Dataset as this step’s inputs. Do not specify more than a handful of features via this method as the job performance may suffer or the Job may fail to submit successfully.
asset_bands – a list of asset/bands combinations. The combination of the asset and the list of bands will be extracted from the dataset, if available. It’s not always possible for Tesseract to guarantee that the asset/bands are available without starting the job. Double check your arguments to avoid Job failure after the job has been submitted.
output_time_bins – a specification of how to create the output time bins for the job.
pixels_options – If this is set, Tesseract will assume that this step will create a tensor output from either the specified dataset or the stac_items provided.
warp_options (deprecated) – If this is set, Tesseract will assume that this step will create a tensor output from either the specified dataset or the stac_items provided.
rasterize_options – If this is set, Tesseract will assume that this step will create a tensor output by rasterizing either the feature outputs from querying the dataset or using the provided stac_items.
no_data – For pixels jobs, this will be used as the no_data value for the input rasters.
pixel_dtype – The data type of tensor outputs. Not needed for features.
fill_value – The value to set as the no data value for tensor output. This will be set as the “fill_value” in the resulting zarr output file.
ids – A list of IDs to filter the dataset to. Useful if you know exactly what data you wish for Tesseract to use.
filter – a CQL2 JSON filter as a Python dict. This will be used to filter the data if the dataset supports filtering.
datetime – The range to query for input items from the dataset. This may be specified either as a tuple of datetimes/rfc3339 strings or as a STAC style range, ‘YYYY-mm-ddTHH:MM:SS[Z][+-HH:MM]/YYYY-mm-ddTHH:MM:SS[Z][+-HH:MM]’, ‘YYYY-mm-ddTHH:MM:SS[Z][+-HH:MM]/..’ or ‘../YYYY-mm-ddTHH:MM:SS[Z][+-HH:MM]’
chip_size – for tensor outputs, the size in pixels of each chip. Must satisfy 256 <= chip_size <= 2048.
output_bands – a list of string names of what the output bands should be called. Length must match the total number of bands in asset_bands.
compression – what compression algorithm to use on compressed tensor chunks. ‘blosc’ is default and usually very effective.
page_size – how many items to query at a time from Boson. For complex features, this may need to be a smaller value (default 1000 is usually fine), but for simpler features using a large value will speed up the processing.

Returns

This Job after this step has been added. This is so that these can be chained. If you want to suppress the output, call like so: _ = job.add_data_input(…)

Examples

Add an asset from the “srtm-gl1” dataset. This will use the pixels functionality to reproject/resample

>>> job = Job()
>>> _ = job.add_data_input(
...         name="add-srtm",
...         asset_name="elevation",
...         dataset="srtm-gl1",
...         asset_bands=[{"asset": "elevation", "bands": [0]}],
...         pixels_options={
...             "pixel_size": 30.0
...         },
...         chip_size=2048
...     )

Add an asset from a feature dataset. This will use the rasterize functionality to rasterize the features

>>> job = Job()
>>> _ = job.add_data_input(
...         name="add-usa-counties",
...         asset_name="counties",
...         dataset="usa-counties",
...         rasterize_options={
...             "pixel_size": [500.0, 500.0],
...             "value": "FIPS"
...         },
...         chip_size=1024
...     )

Add the same as the previous step, but do not rasterize

>>> job = Job()
>>> _ = job.add_data_input(
...         name="add-usa-counties",
...         asset_name="counties",
...         dataset="usa-counties",
...     )
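
The STAC-style range strings accepted by the datetime parameter can be built with the standard library. The helper below, stac_range, is a hypothetical name introduced for illustration. It formats a closed or half-open range, using ".." for the open end as described in the parameter list.

```python
from datetime import datetime, timezone


def stac_range(start=None, end=None):
    """Format a closed or half-open STAC-style datetime range string."""
    if start is None and end is None:
        raise ValueError("at least one of start or end is required")

    def fmt(dt):
        # ".." marks an open end; otherwise emit an RFC 3339 timestamp.
        return ".." if dt is None else dt.isoformat(timespec="seconds")

    return f"{fmt(start)}/{fmt(end)}"


start = datetime(2020, 1, 1, tzinfo=timezone.utc)
end = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(stac_range(start, end))  # 2020-01-01T00:00:00+00:00/2021-01-01T00:00:00+00:00
print(stac_range(start))       # 2020-01-01T00:00:00+00:00/..
```

The resulting string could then be passed as the datetime argument. A tuple of datetimes is also accepted according to the parameter description.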

add_model_step(name, container, inputs, outputs, args={}, gpu=False, workers=1)

add a Model step to this Tesseract Job

This adds a model step to this Job and runs some validation.

This method returns self so that it can be chained together with other methods.

Parameters

name – the name to give this step
container – either a Container object or the image tag of the container for the model
inputs – a list of StepInputs. Must refer to previous steps in the model
outputs – a list of StepOutputs detailing the output of this model
args – an optional list of arguments for this container at runtime. These will be provided to the user inference function if it is written so as to accept arguments
gpu – if this model requires a GPU to run, set to True. Unless your code is specifically configured for an NVIDIA GPU and your image has the appropriate drivers, this will not be necessary or improve performance of non-GPU optimized code.
workers – How many workers to split this step over

Returns

self - this Job

Examples

Add a step that runs a harmonic regression model using a previous asset step called ‘landsat’.

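>>> # T is assumed to be the geodesic.tesseract module; proj is assumed
>>> # to be a project defined earlier in the session.
>>> import geodesic.tesseract as T
>>> from geodesic.tesseract import Job, Container, StepInput, StepOutput
>>> job = Job()
>>> job.add_model_step(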
...     name="run-har-reg",
...     container=Container(
...         repository="us-central1-docker.pkg.dev/double-catfish-291717/seerai-docker/images/",
...         image="har-reg",
...         tag="v0.0.7",
...         args={"forder": 4}
...     ),
...     inputs=[StepInput(
...         asset_name="landsat",
...         dataset_project=proj,
...         spatial_chunk_shape=(512, 512),
...         type="tensor",
...         time_bin_selection=T.BinSelection(all=True),
...     )],
...     outputs=[
...         StepOutput(
...             asset_name="brightness-params",
...             chunk_shape=(1, 10, 512, 512),
...             type="tensor",
...             pixel_dtype="<f8",
...             fill_value="nan",
...         ),
...         StepOutput(
...             asset_name="greenness-params",
...             chunk_shape=(1, 10, 512, 512),
...             type="tensor",
...             pixel_dtype="<f8",
...             fill_value="nan",
...         ),
...         StepOutput(
...             asset_name="wetness-params",
...             chunk_shape=(1, 10, 512, 512),
...             type="tensor",
...             pixel_dtype="<f8",
...             fill_value="nan",
...         )
...     ],
...     workers=10
... )

update_step_params(step_name, input_index=None, output_index=None, **params)

updates the parameters for an existing step by looking up the step by name and then applying parameters

This method can be used to update the info in a step that’s already been added to a job. You can modify parameters at either the top level of the Step or in any of the inputs or outputs by specifying an input_index or an output_index.

Parameters

step_name – must match one of the steps in the job. This is the step that will be updated
input_index – the index of the input you would like to modify
output_index – the index of the output you would like to modify
**params – key/values to set on the Step, StepInput, or StepOutput selected

Returns

True if the step passes DAG validation, False otherwise

Examples

Rename the step:

>>> job.update_step_params('old_name', name='new_name')

Change the input dataset for the 0th input:

>>> job.update_step_params('step', input_index=0, dataset="new-dataset")

Change the pixel_dtype of the output at index 3:

>>> import numpy as np
>>> job.update_step_params('step', output_index=3, pixel_dtype=np.float32)
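
As a rough picture of what DAG validation might involve, the sketch below checks that every step only consumes assets produced by an earlier step. The dict layout and the inputs_resolve helper are hypothetical simplifications. The validation performed by update_step_params may check more than this.

```python
def inputs_resolve(steps):
    """Return True if every step only consumes assets produced by earlier steps."""
    available = set()
    for step in steps:
        # A step is invalid if it needs an asset nothing before it produced.
        if not set(step["inputs"]) <= available:
            return False
        available.update(step["outputs"])
    return True


steps = [
    {"name": "add-landsat", "inputs": [], "outputs": ["landsat"]},
    {"name": "run-har-reg", "inputs": ["landsat"], "outputs": ["brightness-params"]},
]
print(inputs_resolve(steps))  # True

# Pointing the model step at an asset that no step creates breaks the graph.
steps[1]["inputs"] = ["sentinel"]
print(inputs_resolve(steps))  # False
```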

delete(remove_data=False)

Deletes a job in the Tesseract service.

Unless specified, data created by this job will remain in the underlying storage. Set remove_data to True to remove created asset data.

Parameters: remove_data – Delete underlying data created by this job

watch()

Monitor the tesseract job with the SeerAI widget.

Will create a jupyter widget that will watch the progress of this tesseract job.

add_rechunk_step(step_name, asset_name, chunk_shape, workers=1)

adds a rechunk step to the job

A rechunk step will create a new zarr array for a given asset called “rechunk”, which will be a copy of the tesseract array with the new given chunk shape. Note that it is best not to decrease any dimension of the chunk shape too much. For example, going from (1,1,1,1000) to (1,1,1000,1) would be an extremely inefficient operation (not to mention impractical).

Parameters

step_name – name for the rechunking step
asset_name – output asset name from a previous step to rechunk
chunk_shape – a list of four integers giving the new chunk shape
workers – the number of workers to use for this step. (default=1)

Example

Rechunk an asset from (1,1,1000,1000) to (4,2,1000,1000):

>>> job.add_rechunk_step(
...    step_name='rechunk_sentinel',
...    asset_name="sentinel-out",
...    chunk_shape=[4,2,1000,1000],
...    workers=3
...    )
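
One way to see why some rechunk operations are costly is to count how many source chunks must be read to fill a single target chunk. The sketch below is a simplified model that assumes both chunk grids are aligned at the array origin and that the array is large enough to hold the target chunk. The helper name is hypothetical.

```python
from math import prod


def source_chunks_per_target_chunk(source_chunk, target_chunk):
    """Count source chunks overlapped by the target chunk at the array origin."""
    # -(-t // s) is integer ceiling division.
    return prod(-(-t // s) for s, t in zip(source_chunk, target_chunk))


# The rechunk from the example above: 8 source chunks, each used in full.
print(source_chunks_per_target_chunk((1, 1, 1000, 1000), (4, 2, 1000, 1000)))  # 8

# The discouraged case: 1000 source chunks are read, and only one of the
# 1000 values in each of them lands in the target chunk.
print(source_chunks_per_target_chunk((1, 1, 1, 1000), (1, 1, 1000, 1)))  # 1000
```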

add_multiscale_step(step_name, asset_name, min_zoom=0, workers=1)

adds a multiscales step to the job

Parameters

step_name – name for the multiscale step
asset_name – output asset name from a previous step to create multiscales for
min_zoom – minimum zoom level for the multiscale step to generate. This might be useful if the asset is large with a small pixel size and all 20 zoom levels are not desired.
workers – the number of workers to use for this step. (default=1)

Example

Add a multiscale step for asset sentinel

>>> job.add_multiscale_step(
...    step_name='multiscale_sentinel',
...    asset_name="sentinel-out",
...    min_zoom=10,
...    workers=3
... )
