Job
- class geodesic.tesseract.job.Job(job_id=None, **spec)
Bases: geodesic.bases._APIObject
Represents a Tesseract Job.
The class can be initialized either with a dictionary (**spec) that represents the request for the particular type, or with a job ID. If a job ID is provided, it queries the Tesseract service for that job and then updates this instance with the specifics of that job.
- Parameters
**spec – A dictionary representing the job request.
job_id – The job ID string. If provided the job will be initialized with info by making a request to tesseract.
- name
(str) - a unique name for the dataset created by this job.
Descriptor: _StringDescr
- alias
(str) - a human-readable name for the dataset created by this job.
Descriptor: _StringDescr
- description
(str) - a longer description for the dataset created by this job.
Descriptor: _StringDescr
- bbox
(tuple, list, dict, str, bytes, BaseGeometry) - the rectangular extent of this job. Can be further filtered by a geometry.
Descriptor: _BBoxDescr
- output_epsg
(int) - the EPSG code of the output spatial reference. Pixel size will be with respect to this.
Descriptor: _IntDescr
- geometry
(str, dict, bytes, BaseGeometry) - a geometry to filter the job with; only assets intersecting it will be processed. Inputs can be WKT, WKB, GeoJSON, or anything that implements __geo_interface__.
Descriptor: _GeometryDescr
- global_properties
(GlobalProperties, dict) - DEPRECATED. Will be removed in v1.0.0. Properties applied to unspecified fields in an asset spec.
Descriptor: _TypeConstrainedDescr
- asset_specs
The initial assets to compute in the job.
Descriptor: AssetSpecListDescr
- workers
(int) - the number of workers to use for each step in the job. Can also be specified on each step individually.
Descriptor: _IntDescr
- steps
(Step, dict) - a list of steps to execute.
Descriptor: _ListDescr
- hooks
(Webhook, dict) - NOT YET IMPLEMENTED. A list of webhooks to execute when the job is complete.
Descriptor: _ListDescr
- output
(Bucket, dict) - the output, other than default storage.
Descriptor: _TypeConstrainedDescr
- project
The project that this job will be assigned to.
Descriptor: _ProjectDescr
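As a rough illustration of the **spec form of the constructor, the sketch below assembles a request dictionary from the attribute names listed above and checks it for misspelled keys before it would be passed to Job(**spec). All values are hypothetical, the (xmin, ymin, xmax, ymax) ordering of bbox is an assumption, and check_spec_keys is a local helper written for this example, not part of geodesic.

```python
# Attribute names taken from the list above; nothing else is assumed about Job.
DOCUMENTED_FIELDS = {
    "name", "alias", "description", "bbox", "output_epsg", "geometry",
    "global_properties", "asset_specs", "workers", "steps", "hooks",
    "output", "project",
}


def check_spec_keys(spec):
    """Return a sorted list of keys that are not documented Job attributes."""
    return sorted(set(spec) - DOCUMENTED_FIELDS)


# Hypothetical values, for illustration only.
spec = {
    "name": "elevation-demo",
    "alias": "Elevation demo",
    "description": "Resampled elevation over a small test area",
    "bbox": (-109.05, 36.99, -102.04, 41.0),  # assumed (xmin, ymin, xmax, ymax)
    "output_epsg": 3857,
    "workers": 2,
}

print(check_spec_keys(spec))             # no unknown keys expected
print(check_spec_keys({"nmae": "oops"}))  # a typo is reported

# With geodesic installed, the dictionary would then be used as:
# job = Job(**spec)
```

Catching a misspelled key locally is cheaper than discovering it after a request has been made to the service.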
- submit(overwrite=False, dry_run=False, timeout_seconds=30.0)
Submits a job to be processed by Tesseract.
This function takes the job defined by this class and submits it to the Tesseract API for processing. Once submitted, the dataset and items fields will be populated with the SeerAI dataset and STAC item respectively. Keep in mind that even though the links to files in the STAC item will be populated, the job may not yet be complete, so some of the chunks may not be finished.
- Parameters
overwrite – if the job exists, deletes it and creates a new one
dry_run – runs this as a dry run (no work submitted, only estimated.)
timeout_seconds – how long to wait for the job to be submitted before timing out.
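One way the dry_run flag might be used is to request an estimate first and submit for real only afterwards. The helper below is a sketch under assumptions: it relies only on the submit signature documented above, does not inspect the return value (which is not documented here), and passes overwrite only to the real submission because the interaction between overwrite and dry_run is not specified. RecordingJob is a stand-in so the snippet runs without the Tesseract service.

```python
def submit_with_dry_run(job, overwrite=False, timeout_seconds=30.0):
    """Call job.submit as a dry run, then submit for real.

    `job` is any object exposing the documented
    submit(overwrite=..., dry_run=..., timeout_seconds=...) method.
    """
    estimate = job.submit(dry_run=True, timeout_seconds=timeout_seconds)
    result = job.submit(
        overwrite=overwrite, dry_run=False, timeout_seconds=timeout_seconds
    )
    return estimate, result


class RecordingJob:
    """Stand-in that records calls instead of contacting a service."""

    def __init__(self):
        self.calls = []

    def submit(self, overwrite=False, dry_run=False, timeout_seconds=30.0):
        self.calls.append(
            {
                "overwrite": overwrite,
                "dry_run": dry_run,
                "timeout_seconds": timeout_seconds,
            }
        )
        return None


demo = RecordingJob()
submit_with_dry_run(demo, overwrite=True)
print([call["dry_run"] for call in demo.calls])
```

In practice you would likely look at the dry-run result before deciding whether to continue; that step is omitted because its shape is not described here.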
- zarr(asset_name=None)
Returns the Zarr group for the corresponding asset name
- Parameters
asset_name – name of the asset to open and return
- Returns
zarr file pointing to the results.
- ndarray(asset_name)
Returns a numpy.ndarray for specified asset name.
USE WITH CAUTION! RETURNS ALL OF WHAT COULD BE A HUGE ARRAY
- Parameters
asset_name – name of the asset to open and return
- Returns
numpy array of all the results.
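Because ndarray loads the entire result, it can help to estimate the in-memory size first. The sketch below is plain arithmetic (number of elements times bytes per element) and does not call geodesic. The shape and the 8-byte item size are taken from the "<f8" outputs used in the add_model_step example as a convenient illustration, and the memory budget is a hypothetical figure.

```python
import math


def estimated_nbytes(shape, itemsize):
    """Bytes needed to hold a dense array of the given shape and item size."""
    return math.prod(shape) * itemsize


def fits_in_budget(shape, itemsize, budget_bytes):
    """True if the dense array would fit within budget_bytes."""
    return estimated_nbytes(shape, itemsize) <= budget_bytes


# A (1, 10, 512, 512) array of 8-byte floats.
print(estimated_nbytes((1, 10, 512, 512), 8))  # 20971520

# Hypothetical budget of 1 GiB.
print(fits_in_budget((1, 10, 512, 512), 8, 1024 ** 3))  # True
```

If the estimate exceeds what the machine can hold, opening the Zarr group with zarr() and reading slices is the safer route.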
- status(return_quark_geoms=False, return_quark_status=False, return_alerts=False, warn=False)
Queries the Tesseract service for the job's status.
- Parameters
return_quark_geoms (bool) – If True, the query asks the service for all of the quark geometries and populates the geometry in this class.
return_quark_status (bool) – If True, queries for the status of each individual quark associated with the job.
return_alerts (bool) – If True, returns all alerts (planning errors, warnings, etc.) for the job.
warn (bool) – If True and any alerts are returned, warns the user with a Python warning.
- Returns
A dictionary with the response from the Tesseract service
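A common pattern is to call status repeatedly until the job finishes. The layout of the returned dictionary is not described here, so this sketch leaves the completion test to a caller-supplied is_done function instead of guessing at field names. The "state" key used in the stand-in below is purely hypothetical.

```python
import time


def poll_status(job, is_done, interval_seconds=5.0, max_polls=60, sleep=time.sleep):
    """Call job.status() until is_done(response) is true or polls run out."""
    for attempt in range(max_polls):
        response = job.status(return_alerts=True, warn=True)
        if is_done(response):
            return response
        if attempt < max_polls - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"job not done after {max_polls} polls")


class FakeJob:
    """Stand-in that replays canned responses; the 'state' key is hypothetical."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def status(self, return_quark_geoms=False, return_quark_status=False,
               return_alerts=False, warn=False):
        return next(self._responses)


fake = FakeJob([{"state": "running"}, {"state": "running"}, {"state": "done"}])
final = poll_status(
    fake,
    is_done=lambda r: r.get("state") == "done",
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(final)
```

Passing sleep as a parameter keeps the helper easy to test; real use would leave it at the default.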
- add_create_assets_step(name, asset_name, workers=1, dataset=None, dataset_project=None, stac_items=None, asset_bands=None, output_time_bins=None, pixels_options=None, warp_options=None, rasterize_options=None, no_data=None, pixel_dtype=None, fill_value=None, ids=None, filter=None, datetime=None, chip_size=512, output_bands=None, compression='blosc', page_size=1000)
Adds a data input to this Tesseract Job.
This adds a data input to this Job. Although there are many arguments, most of them don't need to be specified. The following rules apply:
- You MUST specify either a dataset or stac_items, but not both. Specifying both is not supported and will raise an exception.
- You may specify pixels_options or rasterize_options, or leave both as None. Specifying both is not supported and will raise an exception. If you specify neither, the Features/Items will be added in vector/GeoJSON format.
- If pixels_options is specified, you must also specify asset_bands, as there is no general way to know which asset and band list from the dataset is desired.
- You do not need to specify the dataset_project unless the dataset's project is ambiguous based on the name alone. The active_project is checked first, followed by global, and an exception is raised if the dataset is in neither. If you specify a Dataset object, the dataset_project will be pulled from that.
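The argument rules above can be checked locally before a step is added. The function below is a minimal sketch of those rules only: it is not the library's own validation, the use of ValueError is an assumption (the exception type raised by geodesic is not stated), and the returned labels "pixels", "rasterize", and "features" are names chosen for this example.

```python
def check_data_input_args(dataset=None, stac_items=None, pixels_options=None,
                          rasterize_options=None, asset_bands=None):
    """Apply the documented argument rules and report the resulting output mode."""
    # Exactly one of dataset / stac_items must be given.
    if (dataset is None) == (stac_items is None):
        raise ValueError("specify exactly one of dataset or stac_items")
    # pixels_options and rasterize_options are mutually exclusive.
    if pixels_options is not None and rasterize_options is not None:
        raise ValueError("specify pixels_options or rasterize_options, not both")
    # pixels_options needs asset_bands to know what to extract.
    if pixels_options is not None and asset_bands is None:
        raise ValueError("asset_bands is required when pixels_options is set")
    if pixels_options is not None:
        return "pixels"
    if rasterize_options is not None:
        return "rasterize"
    return "features"  # neither option: vector/GeoJSON output


print(check_data_input_args(
    dataset="srtm-gl1",
    pixels_options={"pixel_size": 30.0},
    asset_bands=[{"asset": "elevation", "bands": [0]}],
))
print(check_data_input_args(dataset="usa-counties"))
```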
This method returns self so that it can be chained together with other methods.
- Parameters
name – the name of the step. This must be unique across the whole job.
asset_name – the name of the output asset in Tesseract that this will create. This name can be referenced by future Steps in the job.
workers – the number of workers to use for this step. (default=1)
dataset – the name of the Dataset, or a Dataset object, that has been saved in Entanglement.
dataset_project – the project that the Dataset belongs to. This resolves ambiguity between Datasets that share the same name.
stac_items – a list of Features or STAC Items that are used in lieu of a Dataset as this step's inputs. Do not specify more than a handful of features via this method, as job performance may suffer or the Job may fail to submit successfully.
asset_bands – a list of asset/bands combinations. The combination of the asset and the list of bands will be extracted from the dataset, if available. It's not always possible for Tesseract to guarantee that the asset/bands are available without starting the job. Double-check your arguments to avoid Job failure after the job has been submitted.
output_time_bins – a specification of how to create the output time bins for the job.
pixels_options – if this is set, Tesseract will assume that this step will create a tensor output from either the specified dataset or the stac_items provided.
warp_options (deprecated) – if this is set, Tesseract will assume that this step will create a tensor output from either the specified dataset or the stac_items provided.
rasterize_options – if this is set, Tesseract will assume that this step will create a tensor output by rasterizing either the feature outputs from querying the dataset or the provided stac_items.
no_data – for pixels jobs, this will be used as the no_data value for the input rasters.
pixel_dtype – the data type of tensor outputs. Not needed for features.
fill_value – the value to set as the no data value for tensor output. This will be set as the "fill_value" in the resulting zarr output file.
ids – a list of IDs to filter the dataset to. Useful if you know exactly what data you wish for Tesseract to use.
filter – a CQL2 JSON filter as a Python dict. This will be used to filter the data if the dataset supports filtering.
datetime – the range to query for input items from the dataset. This may be specified either as a tuple of datetimes/rfc3339 strings or as a STAC style range: 'YYYY-mm-ddTHH:MM:SS[Z][+-HH:MM]/YYYY-mm-ddTHH:MM:SS[Z][+-HH:MM]', 'YYYY-mm-ddTHH:MM:SS[Z][+-HH:MM]/..' or '../YYYY-mm-ddTHH:MM:SS[Z][+-HH:MM]'.
chip_size – for tensor outputs, the size in pixels of each chip. Must satisfy 256 <= chip_size <= 2048.
output_bands – a list of string names for the output bands. Length must match the total number of bands in asset_bands.
compression – the compression algorithm to use on compressed tensor chunks. 'blosc' is the default and usually very effective.
page_size – how many items to query at a time from Boson. For complex features, this may need to be a smaller value (the default of 1000 is usually fine), but for simpler features a larger value will speed up processing.
- Returns
This Job after this step has been added, so that calls can be chained. To suppress the output, call it like so: _ = job.add_data_input(...)
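Two of the parameters above have formats that are easy to get wrong: the STAC style datetime range and the chip_size bounds. The helpers below are a sketch using only the standard library. stac_range builds the three range forms from timezone-aware datetimes, and check_chip_size reads the documented bounds as 256 to 2048 inclusive, which is an interpretation of the text. Both names are local to this example.

```python
from datetime import datetime, timezone


def stac_range(start=None, end=None):
    """Build 'start/end', 'start/..' or '../end' from datetimes."""
    if start is None and end is None:
        raise ValueError("at least one of start or end is required")

    def fmt(value):
        # '..' marks an open end of the interval.
        return ".." if value is None else value.isoformat(timespec="seconds")

    return f"{fmt(start)}/{fmt(end)}"


def check_chip_size(chip_size):
    """Return chip_size if it lies within the assumed 256..2048 range."""
    if not 256 <= chip_size <= 2048:
        raise ValueError(f"chip_size must be between 256 and 2048, got {chip_size}")
    return chip_size


begin = datetime(2021, 1, 1, tzinfo=timezone.utc)
finish = datetime(2021, 6, 30, 12, 30, tzinfo=timezone.utc)

print(stac_range(begin, finish))  # 2021-01-01T00:00:00+00:00/2021-06-30T12:30:00+00:00
print(stac_range(begin, None))    # 2021-01-01T00:00:00+00:00/..
print(stac_range(None, finish))   # ../2021-06-30T12:30:00+00:00
print(check_chip_size(512))       # 512
```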
Examples
Add an asset from the "srtm-gl1" dataset. This will use the pixels functionality to reproject/resample.
>>> job = Job()
>>> _ = job.add_data_input(
...     name="add-srtm",
...     asset_name="elevation",
...     dataset="srtm-gl1",
...     asset_bands=[{"asset": "elevation", "bands": [0]}],
...     pixels_options={
...         "pixel_size": 30.0
...     },
...     chip_size=2048
... )
Add an asset from a feature dataset. This will use the rasterize functionality to rasterize the features.
>>> job = Job()
>>> _ = job.add_data_input(
...     name="add-usa-counties",
...     asset_name="counties",
...     dataset="usa-counties",
...     rasterize_options={
...         "pixel_size": [500.0, 500.0],
...         "value": "FIPS"
...     },
...     chip_size=1024
... )
Add the same as the previous step, but do not rasterize.
>>> job = Job()
>>> _ = job.add_data_input(
...     name="add-usa-counties",
...     asset_name="counties",
...     dataset="usa-counties",
... )
- add_data_input(name, asset_name, workers=1, dataset=None, dataset_project=None, stac_items=None, asset_bands=None, output_time_bins=None, pixels_options=None, warp_options=None, rasterize_options=None, no_data=None, pixel_dtype=None, fill_value=None, ids=None, filter=None, datetime=None, chip_size=512, output_bands=None, compression='blosc', page_size=1000)
Adds a data input to this Tesseract Job.
The rules, parameters, return value, and examples for this method are identical to those documented for add_create_assets_step above.
- add_model_step(name, container, inputs, outputs, args={}, gpu=False, workers=1)
Adds a model step to this Tesseract Job.
This adds a model step to this Job and runs some validation.
This method returns self so that it can be chained together with other methods.
- Parameters
name – the name to give this step
container – either a Container object or the image tag of the container for the model
inputs – a list of StepInputs. Each must refer to a previous step in the job
outputs – a list of StepOutputs detailing the output of this model
args – optional arguments for this container at runtime. These will be provided to the user's inference function if it is written to accept arguments
gpu – set to True if this model requires a GPU to run. This is only useful if your code is specifically configured for an NVIDIA GPU and your image has the appropriate drivers; it will not improve the performance of code that is not GPU-optimized.
workers – how many workers to split this step over
- Returns
self - this Job
Examples
Add a step that runs a harmonic regression model using a previous asset step called 'landsat'.
>>> from geodesic.tesseract import Job, Container, StepInput, StepOutput
>>> # proj (a project) and T (used for T.BinSelection) are not defined in
>>> # this snippet; they are assumed to have been set up beforehand.
>>> job = Job()
>>> job.add_model_step(
...     name="run-har-reg",
...     container=Container(
...         repository="us-central1-docker.pkg.dev/double-catfish-291717/seerai-docker/images/",
...         image="har-reg",
...         tag="v0.0.7",
...         args={"forder": 4}
...     ),
...     inputs=[StepInput(
...         asset_name="landsat",
...         dataset_project=proj,
...         spatial_chunk_shape=(512, 512),
...         type="tensor",
...         time_bin_selection=T.BinSelection(all=True),
...     )],
...     outputs=[
...         StepOutput(
...             asset_name="brightness-params",
...             chunk_shape=(1, 10, 512, 512),
...             type="tensor",
...             pixel_dtype="<f8",
...             fill_value="nan",
...         ),
...         StepOutput(
...             asset_name="greenness-params",
...             chunk_shape=(1, 10, 512, 512),
...             type="tensor",
...             pixel_dtype="<f8",
...             fill_value="nan",
...         ),
...         StepOutput(
...             asset_name="wetness-params",
...             chunk_shape=(1, 10, 512, 512),
...             type="tensor",
...             pixel_dtype="<f8",
...             fill_value="nan",
...         )
...     ],
...     workers=10
... )
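The requirement that inputs refer to earlier steps can be pictured as a simple ordering check. The sketch below models steps as plain dictionaries with hypothetical "name", "inputs", and "outputs" keys holding asset names. It is an illustration of the idea only and says nothing about how geodesic performs its own validation.

```python
def find_unresolved_inputs(steps):
    """Return (step_name, asset_name) pairs whose asset is not produced earlier."""
    available = set()
    problems = []
    for step in steps:
        for asset in step.get("inputs", []):
            if asset not in available:
                problems.append((step["name"], asset))
        # Outputs only become available to the steps that follow.
        available.update(step.get("outputs", []))
    return problems


# Hypothetical job layout mirroring the example above.
steps = [
    {"name": "add-landsat", "inputs": [], "outputs": ["landsat"]},
    {
        "name": "run-har-reg",
        "inputs": ["landsat"],
        "outputs": ["brightness-params", "greenness-params", "wetness-params"],
    },
]

print(find_unresolved_inputs(steps))        # []
print(find_unresolved_inputs(steps[::-1]))  # [('run-har-reg', 'landsat')]
```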
- update_step_params(step_name, input_index=None, output_index=None, **params)
Updates the parameters of an existing step by looking up the step by name and then applying the parameters.
This method can be used to update the info in a step that's already been added to a job. You can modify parameters at either the top level of the Step or in any of its inputs or outputs by specifying an input_index or an output_index.
- Parameters
step_name – must match one of the steps in the job. This is the step that will be updated
input_index – the index of the input you would like to modify
output_index – the index of the output you would like to modify
**params – key/values to set on the Step, StepInput, or StepOutput selected
- Returns
True if the step passes DAG validation, False otherwise
Examples
Rename the step:
>>> job.update_step_params('old_name', name='new_name')
Change the input dataset for the 0th input:
>>> job.update_step_params('step', input_index=0, dataset="new-dataset")
Change the pixel_dtype of the output at index 3 (assumes numpy has been imported as np):
>>> job.update_step_params('step', output_index=3, pixel_dtype=np.float32)
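To make the targeting behaviour concrete, here is a toy model of the lookup described above, using plain dictionaries in place of Step, StepInput, and StepOutput objects. It is a sketch of the described semantics and not the library's code: the dictionary layout is hypothetical, it does not perform DAG validation, and rejecting a call that sets both indexes is an assumption because that case is not documented.

```python
def apply_step_params(steps, step_name, input_index=None, output_index=None, **params):
    """Find a step by name and set params on it, or on one input or output."""
    if input_index is not None and output_index is not None:
        raise ValueError("set input_index or output_index, not both")  # assumption
    for step in steps:
        if step["name"] == step_name:
            break
    else:
        raise KeyError(f"no step named {step_name!r}")
    if input_index is not None:
        target = step["inputs"][input_index]
    elif output_index is not None:
        target = step["outputs"][output_index]
    else:
        target = step  # top-level parameters of the step itself
    target.update(params)
    return step


# Hypothetical step layout.
steps = [
    {
        "name": "step",
        "inputs": [{"dataset": "old-dataset"}],
        "outputs": [{"pixel_dtype": "<f8"}],
    }
]

apply_step_params(steps, "step", input_index=0, dataset="new-dataset")
apply_step_params(steps, "step", output_index=0, pixel_dtype="<f4")
apply_step_params(steps, "step", name="renamed")
print(steps[0])
```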
- delete(remove_data=False)
Deletes a job in the Tesseract service.
Unless specified, data created by this job will remain in the underlying storage. Set remove_data to True to remove created asset data.
- Parameters
remove_data – Delete underlying data created by this job
- watch()
Monitors the Tesseract job with the SeerAI widget.
Creates a Jupyter widget that watches the progress of this Tesseract job.
- add_rechunk_step(step_name, asset_name, chunk_shape, workers=1)
Adds a rechunk step to the job.
A rechunk step creates a new zarr array for the given asset, called "rechunk", containing a copy of the Tesseract array with the new chunk shape. Note that it is best not to decrease any dimension of the chunk shape too much. For example, going from (1,1,1,1000) to (1,1,1000,1) would be an extremely inefficient operation (not to mention impractical).
- Parameters
step_name – name for the rechunking step
asset_name – output asset name from a previous step to rechunk
chunk_shape – a list of four integers giving the new chunk shape
workers – the number of workers to use for this step. (default=1)
Example
Rechunk an asset from (1,1,1000,1000) to (4,2,1000,1000):
>>> job.add_rechunk_step(
...     step_name='rechunk_sentinel',
...     asset_name="sentinel-out",
...     chunk_shape=[4,2,1000,1000],
...     workers=3
... )
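The cost of a rechunk can be estimated with simple counting. The sketch below computes how many chunks an array has under a given chunk shape, and how many source chunks must be read to fill one target chunk. It assumes the chunk grids start at the array origin and that the shapes divide evenly; the array shape (4, 2, 2000, 2000) is a hypothetical example.

```python
import math


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def chunk_count(array_shape, chunk_shape):
    """Number of chunks needed to tile array_shape with chunk_shape."""
    return math.prod(ceil_div(a, c) for a, c in zip(array_shape, chunk_shape))


def source_chunks_per_target_chunk(old_chunks, new_chunks):
    """Source chunks overlapped by one target chunk (aligned, evenly dividing)."""
    return math.prod(
        max(1, ceil_div(new, old)) for old, new in zip(old_chunks, new_chunks)
    )


array_shape = (4, 2, 2000, 2000)  # hypothetical array
print(chunk_count(array_shape, (1, 1, 1000, 1000)))  # 32
print(chunk_count(array_shape, (4, 2, 1000, 1000)))  # 4

# The example above: each target chunk gathers 8 source chunks.
print(source_chunks_per_target_chunk((1, 1, 1000, 1000), (4, 2, 1000, 1000)))  # 8

# The inefficient case: 1000 source chunks are read to fill a single
# 1000-element target chunk.
print(source_chunks_per_target_chunk((1, 1, 1, 1000), (1, 1, 1000, 1)))  # 1000
```

In the last case each source chunk contributes only one of its 1000 elements to the target chunk, which is why shrinking a dimension sharply is so wasteful.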
- add_multiscale_step(step_name, asset_name, min_zoom=0, workers=1)
Adds a multiscale step to the job.
- Parameters
step_name – name for the multiscale step
asset_name – output asset name from a previous step to create multiscales for
min_zoom – minimum zoom level for the multiscale step to generate. This might be useful if the asset is large with a small pixel size and not all 20 zoom levels are desired.
workers – the number of workers to use for this step. (default=1)
Example
Add a multiscale step for the asset "sentinel-out":
>>> job.add_multiscale_step(
...     step_name='multiscale_sentinel',
...     asset_name="sentinel-out",
...     min_zoom=10,
...     workers=3
... )
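When choosing min_zoom it can help to know roughly what ground resolution each zoom level corresponds to. The sketch below uses the standard Web Mercator tile pyramid with 256-pixel tiles. Whether Tesseract's multiscale levels follow exactly this scheme is an assumption, so treat the numbers as a guide only.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis used by Web Mercator
TILE_SIZE_PX = 256          # assumed tile size


def zoom_resolution(zoom):
    """Metres per pixel at the equator for a Web Mercator zoom level."""
    return 2 * math.pi * EARTH_RADIUS_M / TILE_SIZE_PX / (2 ** zoom)


def coarsest_zoom_at_or_below(target_resolution, max_zoom=19):
    """Smallest zoom whose resolution is at or below target_resolution."""
    for zoom in range(max_zoom + 1):
        if zoom_resolution(zoom) <= target_resolution:
            return zoom
    return max_zoom


for zoom in (0, 5, 10, 15):
    print(zoom, round(zoom_resolution(zoom), 3))

# Hypothetical use: the coarsest level at 500 m/pixel or finer.
print(coarsest_zoom_at_or_below(500.0))
```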