End-to-end bulk update
Running example
To walk through this using an example, and to compare and contrast the approaches, imagine you want to:
- Mark all views (including materialized views) in a particular schema as verified, unless they already have some certificate.
- Change the owner of the same views.
Step-by-step
The usual end-to-end pattern for updating many assets efficiently involves three steps:
- Finding the assets you want to update.
- Applying your updates to each asset (in-memory).
- Sending those changes to Atlan (in batches).
You can do each of these steps in sequence, for example:
1. Find the assets
Start by finding the assets you want to update. This is usually best done through a search. (For other common examples, have a look at the search snippets.)
Example: get all views in a schema

- The `qualifiedName` of every view starts with the `qualifiedName` of its parent (schema), so we can limit the results to a particular schema by using the `qualifiedName`.
- To start building up a query with multiple conditions, you can use the `select()` helper on any client's `assets` member.
- You can chain `where()` methods to define all the conditions the search results must match. You can use the static constants within any given type to select a particular attribute (like `QUALIFIED_NAME` in this example), and then limit results to only those assets whose `qualifiedName` starts with the `qualifiedName` of the schema (by using the `startsWith()` predicate). In this example, that means only assets that are within this particular schema will be returned as results.
- Since there could be tables, views, materialized views and columns in this schema, but you only want views and materialized views, you can use the `Asset.TYPE_NAME.in()` method to restrict results to only views and materialized views.
- Since you only want to update views that do not already have a certificate, you can further limit the results using the `whereNot()` method. This will exclude any assets where a certificate already `hasAnyValue()`.
- Here you can play around with different page sizes, to further limit API calls by retrieving more results per page.
- Add as many attributes as needed. Each attribute you add here will ensure that detail is included in each search result. So in this example, every view will include its description, certificate, and individual owners. (Limit these attributes to the minimum you need about each view to do your intended work.)
- You can translate the object you've built up into various outputs, for example immediately calculating a count of how many results match or streaming them directly for processing. In this case, the `toRequest()` method will give us the resulting set of criteria back as a complete index search request.
- You can then execute the search based on the request.
Example: get all views in a schema

- The `qualified_name` of every view starts with the `qualified_name` of its parent (schema), so we can limit the results to a particular schema by using the `qualified_name`.
- To start building up a query with multiple conditions, you can use a `FluentSearch()` object.
- You can chain `where()` methods to define all the conditions the search results must match. You can use the class variables within any given type to select a particular attribute (like `QUALIFIED_NAME` in this example), and then limit results to only those assets whose `qualified_name` starts with the `qualified_name` of the schema (by using the `startswith()` predicate). In this example, that means only assets that are within this particular schema will be returned as results.
- Since there could be tables, views, materialized views and columns in this schema, but you only want views and materialized views, you can use the `CompoundQuery.asset_types()` helper method to restrict results to only views and materialized views.
- Since you only want to update views that do not already have a certificate, you can further limit the results using the `where_not()` method. This will exclude any assets where a certificate already `has_any_value()`.
- Here you can play around with different page sizes, to further limit API calls by retrieving more results per page.
- Add as many attributes as needed. Each attribute you add here will ensure that detail is included in each search result. So in this example, every view will include its description, certificate, and individual owners. (Limit these attributes to the minimum you need about each view to do your intended work.)
- You can translate the object you've built up into various outputs, for example immediately calculating a count of how many results match or executing the query to start processing results directly. In this case, the `to_request()` method will give us the resulting set of criteria back as a complete index search request.
- You can then execute the search based on the request.
POST /api/meta/search/indexsearch

- Run a search to find the views and materialized views.
- To start building up a query with multiple conditions, you can use a `bool` query in Elasticsearch.
- You can use the `filter` criteria to define all the conditions the search results must match in a binary way (either matches or doesn't). This avoids the need to calculate a score for each result.
- The `qualifiedName` of every view starts with the `qualifiedName` of its parent (schema), so we can limit the results to a particular schema by using the `qualifiedName`.
- Since there could be tables, views, materialized views and columns in this schema, but you only want views and materialized views, you can use an exact match on multiple types to restrict results to only views and materialized views.
- Searches by default will return all assets that are found, whether active or archived (soft-deleted). In most cases, you probably only want the active ones.
- Since you only want to update views that do not already have a certificate, you can further limit the results using the `must_not` clause. This will exclude any assets that already have a certificate present.
- When paging through results, you should specify a sort to give a stable set of results across pages. The most reliable sort will be by GUID of the asset, as this is guaranteed to be unique for every asset.
- Here you can play around with different page sizes, to further limit API calls by retrieving more results per page.
- Add as many attributes as needed. Each attribute you add here will ensure that detail is included in each search result. So in this example, every view will include its description, certificate, and individual owners. (Limit these attributes to the minimum you need about each view to do your intended work.)
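The annotations above can be assembled into a request body along the following lines. This is a minimal sketch in Python rather than the original listing: the search field names (`__typeName.keyword`, `__state`, `__guid`, `certificateStatus`), the type name `MaterialisedView`, the sample schema `qualifiedName` and the helper name `build_view_search` are assumptions made for illustration, so verify them against your own tenant before relying on them.

```python
import json


def build_view_search(schema_qn: str, page_size: int = 50, offset: int = 0) -> dict:
    """Build an index search body for uncertified views in one schema.

    Hypothetical helper: field and type names are assumptions, not confirmed API.
    """
    return {
        "dsl": {
            "from": offset,
            "size": page_size,  # larger pages mean fewer API calls
            "query": {
                "bool": {
                    # filter clauses match in a binary way, so no scoring is needed
                    "filter": [
                        {"prefix": {"qualifiedName": {"value": schema_qn}}},
                        {"terms": {"__typeName.keyword": ["View", "MaterialisedView"]}},
                        {"term": {"__state": {"value": "ACTIVE"}}},
                    ],
                    # exclude anything that already has a certificate
                    "must_not": [{"exists": {"field": "certificateStatus"}}],
                }
            },
            # a unique sort key keeps results stable across pages
            "sort": [{"__guid": {"order": "asc"}}],
        },
        # only the details needed for the intended update
        "attributes": ["description", "certificateStatus", "ownerUsers"],
    }


body = build_view_search("default/snowflake/1234567890/DB/SCHEMA")
print(json.dumps(body, indent=2))
```

To fetch the next page under this sketch, call the helper again with `offset` increased by `page_size`, keeping the same sort.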
2. Build up your changes
Next, you iterate through those results and make the changes you want to each one. Use the multiple operations pattern to make multiple changes to each asset.
Example: iterate through results and make changes

- Create a batch of assets to build up the changes across multiple assets before applying those changes in Atlan itself.
  - The first parameter defines the Atlan tenant on which the batch will be processed.
  - The second specifies the maximum number of assets to build up before sending them across to Atlan.

  Additional parameters

  By default (using only the options above) no classifications or custom metadata will be added or changed on the assets in each batch. To also include classifications and custom metadata, you need to use these additional parameters:

  - A third parameter of `true` to replace all classifications on the assets in the batch, which would include removing classifications if none are provided for the assets in the batch itself (or `false` if you still want to ignore classifications).
  - A fourth parameter to control how custom metadata should be handled for the assets: `IGNORE` any custom metadata changes in the batch, `OVERWRITE` to replace all custom metadata with what's provided in the batch (including removing custom metadata that already exists on an asset), or `MERGE` to only add or update custom metadata based on what's in the batch (leaving other existing custom metadata unchanged).
  - A fifth parameter to control whether failures should be captured across batches (`true`) or ignored (`false`).
  - A sixth parameter to control whether the batch should only attempt to update assets that already exist (`true`) or also create assets if they do not yet exist (`false`).
  - A seventh parameter to control whether details about each created and updated asset across batches should be tracked (`true`) or ignored (`false`); counts will always be kept.
  - An eighth parameter to control whether the matching for determining whether an asset already exists should be done in a case-insensitive way (`true`) or case-sensitively (`false`).
  - A ninth parameter to control what kind of assets to create, if not running in `updateOnly` mode: partial assets (only available in lineage), or full assets.
  - A tenth parameter to control whether the matching for determining whether an asset already exists should be done strictly according to the data type specified (`false`), or if tables, views and materialized views should be treated interchangeably (`true`).
- This is the pattern for iterating through all results (across pages) covered in the Searching for assets portion of the SDK documentation.
- Every asset implements the `trimToRequired()` method, which gives you a builder containing only the bare minimum information needed to update that asset.

  Limit your asset to only what you intend to update

  When you send an update to Atlan, it will only attempt to change the information you send in your request, leaving any information not in your request as-is (unchanged) on the asset in Atlan. By using `trimToRequired()` you can remove all information you do not want to update, and then chain on only the details you do want to update.
- In this running example, you are updating the certificate to verified and setting a new owner, so you simply chain those updates onto the trimmed builder.
- You can then add your (in-memory) modified asset to the batch.

  Auto-saves as it goes

  As long as the number of assets built up is below the maximum batch size specified when creating the batch, this will simply continue to build up the batch. As soon as you hit the size limit for the batch, though, this same method will call the `save()` operation to batch-update all of those assets in a single API call.

  Remember to flush

  Since your loop could finish before you reach another full batch, you must always remember to `flush()` the batch. This will send any remaining assets that were queued up, when the size of the queue did not yet reach the full batch size.
Example: iterate through results and make changes

- Create a batch of assets to accumulate changes across multiple assets before applying those changes in Atlan itself. `Batch()` takes the following parameters:
  - `client`: an instance of `AssetClient`.
  - `max_size`: the maximum size of each batch to be processed (per API call).

  Additional optional parameters

  By default (using only the options above) no classifications or custom metadata will be added or changed on the assets in each batch. To also include classifications and custom metadata, you need to use these additional parameters:

  - `replace_atlan_tags` (default: `False`): if `True`, replace all classifications (tags) on the assets in the batch, which would include removing classifications (tags) if none are provided for the assets in the batch itself (or `False` if you still want to ignore classifications).
  - `custom_metadata_handling` (default: `CustomMetadataHandling.IGNORE`): control how custom metadata should be handled for the assets: `IGNORE` any custom metadata changes in the batch, `OVERWRITE` to replace all custom metadata with what's provided in the batch (including removing custom metadata that already exists on an asset), or `MERGE` to only add or update custom metadata based on what's in the batch (leaving other existing custom metadata unchanged).
  - `capture_failures` (default: `False`): control whether failures should be captured across batches (`True`) or ignored (`False`).
  - `update_only` (default: `False`): control whether the batch should only attempt to update assets that already exist (`True`) or also create assets if they do not yet exist (`False`).
  - `track` (default: `False`): control whether details about each created and updated asset across batches should be tracked (`True`) or ignored (`False`); counts will always be kept.
  - `case_insensitive` (default: `False`): control whether the matching for determining whether an asset already exists should be done in a case-insensitive way (`True`) or case-sensitively (`False`).
  - `creation_handling` (default: `AssetCreationHandling.FULL`): control what kind of assets to create, if not running in `update_only` mode: `PARTIAL` assets (only available in lineage), or `FULL` assets.
  - `table_view_agnostic` (default: `False`): control whether the matching for determining whether an asset already exists should be done strictly according to the data type specified (`False`), or if tables, views and materialized views should be treated interchangeably (`True`).
- This is the pattern for iterating through all results (across pages) covered in the Searching for assets portion of the SDK documentation.
- Every asset implements the `trim_to_required()` method, which gives you an object containing only the bare minimum information needed to update that asset.

  Limit your asset to only what you intend to update

  When you send an update to Atlan, it will only attempt to change the information you send in your request, leaving any information not in your request as-is (unchanged) on the asset in Atlan. By using `trim_to_required()` you can remove all information you do not want to update, and then add on only the details you do want to update.
- In this running example, you are updating the certificate to verified and setting a new owner, so you simply add those updates onto the trimmed object.
- You can then add your (in-memory) modified asset to the batch.

  Auto-saves as it goes

  As long as the number of assets built up is below the maximum batch size specified when creating the batch, this will simply continue to build up the batch. As soon as you hit the size limit for the batch, though, this same method will call the `save()` operation to batch-update all of those assets in a single API call.

  Remember to flush

  Since your loop could finish before you reach another full batch, you must always remember to `flush()` the batch. This will send any remaining assets that were queued up, when the size of the queue did not yet reach the full batch size.
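To make the auto-save and flush behavior concrete, here is a minimal sketch of the queueing logic in plain Python. It does not use the SDK: `SimpleBatch` and its `save` callback are hypothetical stand-ins that only mimic the behavior described above, assuming a save is triggered once the queue reaches `max_size`.

```python
from typing import Callable, Dict, List


class SimpleBatch:
    """Hypothetical stand-in for a batch: queues assets and saves them in groups."""

    def __init__(self, save: Callable[[List[Dict]], None], max_size: int):
        self._save = save          # called once per bulk save
        self._max_size = max_size  # maximum assets per bulk save
        self._queue: List[Dict] = []

    def add(self, asset: Dict) -> None:
        """Queue an asset, saving automatically once the queue is full."""
        self._queue.append(asset)
        if len(self._queue) >= self._max_size:
            self.flush()

    def flush(self) -> None:
        """Save whatever is still queued, even if it is less than a full batch."""
        if self._queue:
            self._save(list(self._queue))
            self._queue.clear()


calls: List[List[Dict]] = []  # records each simulated bulk save
batch = SimpleBatch(save=calls.append, max_size=2)
for i in range(5):
    batch.add({"name": f"view_{i}"})

sizes_before_flush = [len(c) for c in calls]
batch.flush()  # without this, the fifth asset would never be sent
sizes_after_flush = [len(c) for c in calls]
print(sizes_before_flush, sizes_after_flush)
```

With five assets and a batch size of two, the loop alone triggers two saves, and only the final `flush()` sends the remaining asset.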
Up to your own code
There are no API calls to make to change the results in-memory. How you implement this will be entirely up to how you are writing your code.
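As one possible way to structure that in-memory step, the sketch below trims each search result down to its identifying details and then adds only the changes from the running example. The shape of the result dictionary, the attribute names `certificateStatus` and `ownerUsers`, and the helper names are assumptions for illustration.

```python
from typing import Dict


def trim_to_required(result: Dict) -> Dict:
    """Keep only the details assumed necessary to identify the asset for an update."""
    attrs = result["attributes"]
    return {
        "typeName": result["typeName"],
        "attributes": {
            "name": attrs["name"],
            "qualifiedName": attrs["qualifiedName"],
        },
    }


def with_updates(result: Dict, owner: str) -> Dict:
    """Return a trimmed copy carrying only the changes we intend to make."""
    trimmed = trim_to_required(result)
    trimmed["attributes"]["certificateStatus"] = "VERIFIED"
    trimmed["attributes"]["ownerUsers"] = [owner]
    return trimmed


# A made-up search result, including a description we do not want to resend
sample = {
    "typeName": "View",
    "guid": "abc-123",
    "attributes": {
        "name": "ORDERS_V",
        "qualifiedName": "default/snowflake/1234567890/DB/SCHEMA/ORDERS_V",
        "description": "Existing description that should stay untouched",
    },
}

updated = with_updates(sample, owner="jsmith")
print(updated)
```

Because the description is left out of the trimmed copy, it is not part of the update and stays as-is on the asset.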
3. Save them in batches
Finally, send the changes you have queued up in batches. Use the multiple assets pattern to update multiple assets at the same time.
Example: save the changes in batches

- The `AssetBatch`'s `add()` method used in the previous step will automatically save as its internal queue of assets reaches a full batch size.

  Remember to flush

  However, since your loop could finish before you reach another full batch, you must always remember to `flush()` the batch. This will send any remaining assets that were queued up.
- Both the `.add()` and `.flush()` operations of the `AssetBatch` could send a request over to Atlan. Either can therefore also run into trouble and raise an error through an `AtlanException`. It is up to you to handle such potential errors as you see fit.
Example: save the changes in batches

- The `Batch`'s `add()` method used in the previous step will automatically save as its internal queue of assets reaches a full batch size.

  Remember to flush

  However, since your loop could finish before you reach another full batch, you must always remember to `flush()` the batch. This will send any remaining assets that were queued up.
- Both the `.add()` and `.flush()` operations of the `Batch` could send a request over to Atlan. Either can therefore also run into trouble and raise an error through an `AtlanError`. It is up to you to handle such potential errors as you see fit.
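One way to handle such errors is to guard both calls, since either may trigger a save. The sketch below is self-contained and does not use the SDK: `SaveError`, `TinyBatch` and `fake_save` are hypothetical stand-ins, and the choice to record the error and carry on with later batches is only one possible policy.

```python
from typing import Dict, List


class SaveError(Exception):
    """Hypothetical error type standing in for the SDK's exception."""


saved: List[Dict] = []


def fake_save(assets: List[Dict]) -> None:
    """Simulated bulk save that rejects a batch containing an invalid asset."""
    for asset in assets:
        if not asset.get("qualifiedName"):
            raise SaveError(f"missing qualifiedName on {asset.get('name')}")
    saved.extend(assets)


class TinyBatch:
    """Hypothetical batch: saves when full, and on an explicit flush."""

    def __init__(self, max_size: int):
        self._max_size = max_size
        self._queue: List[Dict] = []

    def add(self, asset: Dict) -> None:
        self._queue.append(asset)
        if len(self._queue) >= self._max_size:
            self.flush()

    def flush(self) -> None:
        # Empty the queue first so one failed batch does not block later ones
        pending, self._queue = self._queue, []
        if pending:
            fake_save(pending)


assets = [
    {"name": "a", "qualifiedName": "schema/a"},
    {"name": "bad", "qualifiedName": ""},
    {"name": "c", "qualifiedName": "schema/c"},
]

errors: List[str] = []
batch = TinyBatch(max_size=2)
for asset in assets:
    try:
        batch.add(asset)  # may trigger a save when the queue fills up
    except SaveError as e:
        errors.append(str(e))
try:
    batch.flush()  # may also trigger a save
except SaveError as e:
    errors.append(str(e))

print(errors, [a["name"] for a in saved])
```

In this simulation the whole first batch is rejected because it contains the invalid asset, which is why asset `a` is not saved either. Whether to retry, skip or abort in that situation is a decision for your own code.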
POST /api/meta/entity/bulk

- All details must still be included in an outer `entities` array.
- You need to specify the type for each asset you are updating.
- You need to specify other required attributes for each asset, such as its `name` and `qualifiedName`.
- Add on any other attributes or relationships you want to set on the asset, such as in the running example a verified certificate and new individual owner.
- Add another object to the payload to represent another asset that should be updated by the same API call. Once again specify all the required information for that kind of asset, and any of the details for attributes or relationships you want to set.
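A payload following those points could be assembled as in the sketch below. It is written in Python for illustration only: the attribute names `certificateStatus` and `ownerUsers`, the type name `MaterialisedView`, the sample values and the helper name `to_bulk_payload` are assumptions, not taken from the original listing.

```python
import json
from typing import Dict, Iterable, List


def to_bulk_payload(views: Iterable[Dict[str, str]], owner: str) -> Dict[str, List[Dict]]:
    """Wrap one entry per asset in the outer `entities` array (hypothetical helper)."""
    entities = []
    for view in views:
        entities.append(
            {
                "typeName": view["typeName"],  # type of the asset being updated
                "attributes": {
                    # required attributes for the asset
                    "name": view["name"],
                    "qualifiedName": view["qualifiedName"],
                    # the changes from the running example
                    "certificateStatus": "VERIFIED",
                    "ownerUsers": [owner],
                },
            }
        )
    return {"entities": entities}


views = [
    {"typeName": "View", "name": "ORDERS_V", "qualifiedName": "default/snowflake/1234567890/DB/SCHEMA/ORDERS_V"},
    {"typeName": "MaterialisedView", "name": "SALES_MV", "qualifiedName": "default/snowflake/1234567890/DB/SCHEMA/SALES_MV"},
]
payload = to_bulk_payload(views, owner="jsmith")
print(json.dumps(payload, indent=2))
```

Each additional asset simply becomes another object in the same `entities` array, so one API call can carry a whole batch.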
Pipelining
Alternatively, when using an SDK, you can pipeline these operations together. The pipeline will run just as efficiently as the step-by-step approach above:
- Pushing down the criteria to run as a search on Atlan
- Lazily-fetching each page of results
- Batching up and bulk-saving changes
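The lazy fetching in particular can be pictured with a plain Python generator. This is a sketch of the general idea, not of the SDK's implementation: `stream_results`, `fake_fetch` and the in-memory `CATALOG` are hypothetical, and the stop condition (a page shorter than the page size) is an assumption of this example.

```python
from itertools import islice
from typing import Callable, Dict, Iterator, List


def stream_results(
    fetch_page: Callable[[int, int], List[Dict]], page_size: int
) -> Iterator[Dict]:
    """Yield results one at a time, fetching the next page only when needed."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:  # a short page means there is nothing left
            return
        offset += page_size


# In-memory stand-in for the search API: seven views in a stable order
CATALOG = [{"guid": f"g{i:02d}", "name": f"view_{i}"} for i in range(7)]
fetch_calls: List[int] = []  # records the offset of every page request


def fake_fetch(offset: int, size: int) -> List[Dict]:
    fetch_calls.append(offset)
    return CATALOG[offset:offset + size]


# Taking only four results stops the stream before the third page is requested
first_four = list(islice(stream_results(fake_fetch, page_size=3), 4))
print([r["name"] for r in first_four], fetch_calls)
```

Only two page requests are made here even though three pages exist, which is the benefit of fetching lazily when you stop early.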
Example: pipelining

- The `qualifiedName` of every view starts with the `qualifiedName` of its parent (schema), so we can limit the results to a particular schema by using the `qualifiedName`.
- Create a batch of assets to build up the changes across multiple assets before applying those changes in Atlan itself. When parallel-processing (see further notes on the `stream(true)`) you need to use a parallel-capable `ParallelBatch`:
  - The first parameter defines the Atlan tenant on which the batch will be processed.
  - The second specifies the maximum number of assets to build up before sending them across to Atlan.

  Additional parameters

  By default (using only the options above) no classifications or custom metadata will be added or changed on the assets in each batch. To also include classifications and custom metadata, you need to use these additional parameters:

  - A third parameter of `true` to replace all classifications on the assets in the batch, which would include removing classifications if none are provided for the assets in the batch itself (or `false` if you still want to ignore classifications).
  - A fourth parameter to control how custom metadata should be handled for the assets: `IGNORE` any custom metadata changes in the batch, `OVERWRITE` to replace all custom metadata with what's provided in the batch (including removing custom metadata that already exists on an asset), or `MERGE` to only add or update custom metadata based on what's in the batch (leaving other existing custom metadata unchanged).
  - A fifth parameter to control whether failures should be captured across batches (`true`) or ignored (`false`).
  - A sixth parameter to control whether the batch should only attempt to update assets that already exist (`true`) or also create assets if they do not yet exist (`false`).
  - A seventh parameter to control whether details about each created and updated asset across batches should be tracked (`true`) or ignored (`false`); counts will always be kept.
  - An eighth parameter to control whether the matching for determining whether an asset already exists should be done in a case-insensitive way (`true`) or case-sensitively (`false`).
  - A ninth parameter to control what kind of assets to create, if not running in `updateOnly` mode: partial assets (only available in lineage), or full assets.
  - A tenth parameter to control whether the matching for determining whether an asset already exists should be done strictly according to the data type specified (`false`), or if tables, views and materialized views should be treated interchangeably (`true`).
- You can then start defining a pipeline directly against the client's `assets` by using the `select()` method.

  Including archived (soft-deleted) assets

  A raw search returns all assets that are found, whether active or archived (soft-deleted). In most cases you probably only want the active ones, so `select()` returns only active assets by default. Passing `true` to this `select()` method starts a pipeline that also includes any archived (soft-deleted) assets in the results, if you do want them.
- You can chain as many `where()` methods as you want to define all the conditions the search results must match. You can use the static constants within any given type to select a particular attribute (like `QUALIFIED_NAME` in this example), and then limit results to only those assets whose `qualifiedName` starts with the `qualifiedName` of the schema (by using the `startsWith()` predicate). In this example, that means only assets that are within this particular schema will be returned as results.
- Since there could be tables, views, materialized views and columns in this schema, but you only want views and materialized views, you can use the `Asset.TYPE_NAME.in()` method to restrict results to only views and materialized views.
- Since you only want to update views that do not already have a certificate, you can further limit the results using the `whereNot()` method. This will exclude any assets where a certificate already `hasAnyValue()`.
- (Optional) You can play around with different page sizes, to further limit API calls by retrieving more results per page.
- Add as many attributes as needed. Each attribute you add here will ensure that detail is included in each search result. So in this example, every view will include its description, certificate, and individual owners. (Limit these attributes to the minimum you need about each view to do your intended work.)
- Once you have defined the criteria for your pipeline, call the `stream()` method to push down the pipeline to Atlan. This will:
  - Create a search that combines all the criteria you have specified.
  - Run that search against Atlan to produce the first page of results.
  - Page through the results by lazily fetching each subsequent page as you iterate through them. (So if you use a `limit()` on the stream, for example, you can break out before retrieving all pages.)

  Can also run in parallel threads

  You can also parallel-stream the results by passing `true` to the `stream()` method. This will spawn multiple threads that each independently process a page of results and combine the results in parallel. While this can be significantly faster for processing many results, keep in mind that if you are collecting the results into any structure, that structure must be thread-safe. (For example, you'll need to use things like `ConcurrentHashMap` rather than just `HashMap`, and to use `ParallelBatch` rather than `AssetBatch` if making changes.)
- For each result, you can then carry out your changes and submit them into the batch.
- Every asset implements the `trimToRequired()` method, which gives you a builder containing only the bare minimum information needed to update that asset.

  Limit your asset to only what you intend to update

  When you send an update to Atlan, it will only attempt to change the information you send in your request, leaving any information not in your request as-is (unchanged) on the asset in Atlan. By using `trimToRequired()` you can remove all information you do not want to update, and then chain on only the details you do want to update.
- In this running example, you are updating the certificate to verified and setting a new owner, so you simply chain those updates onto the trimmed builder.
- You can then add your (in-memory) modified asset to the batch.

  Auto-saves as it goes

  As long as the number of assets built up is below the maximum batch size specified when creating the batch, this will simply continue to build up the batch. As soon as you hit the size limit for the batch, though, this same method will call the `save()` operation to batch-update all of those assets in a single API call.

  Remember to flush

  Since your loop could finish before you reach another full batch, you must always remember to `flush()` the batch. This will send any remaining assets that were queued up, when the size of the queue did not yet reach the full batch size.
- Both the `.add()` and `.flush()` operations of the `AssetBatch` could send a request over to Atlan. Either can therefore also run into trouble and raise an error through an `AtlanException`. It is up to you to handle such potential errors as you see fit.
- The `AssetBatch`'s `add()` method used in the previous step will automatically save as its internal queue of assets reaches a full batch size.

  Remember to flush

  However, since your loop could finish before you reach another full batch, you must always remember to `flush()` the batch. This will send any remaining assets that were queued up.
Example: pipelining | |
---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 |
|
-
The
qualifiedName
of every view starts with thequalifiedName
of its parent (schema), so we can limit the results to a particular schema by using thequalifiedName
. -
Create a batch of assets to accumulate changes across multiple assets before applying those changes in Atlan itself. Batch() takes the following parameters:
- client: an instance of AssetClient.
- max_size: the maximum size of each batch to be processed (per API call).
Additional optional parameters
By default (using only the options above) no classifications or custom metadata will be added or changed on the assets in each batch. To also include classifications and custom metadata, you need to use these additional parameters:
- replace_atlan_tags (default: False): if True, replace all classifications (tags) on the assets in the batch, which would include removing classifications (tags) if none are provided for the assets in the batch itself (or False if you still want to ignore classifications).
- custom_metadata_handling (default: CustomMetadataHandling.IGNORE): control how custom metadata should be handled for the assets: IGNORE any custom metadata changes in the batch, OVERWRITE to replace all custom metadata with what's provided in the batch (including removing custom metadata that already exists on an asset), or MERGE to only add or update custom metadata based on what's in the batch (leaving other existing custom metadata unchanged).
- capture_failures (default: False): control whether failures should be captured across batches (True) or ignored (False).
- update_only (default: False): control whether the batch should only attempt to update assets that already exist (True) or also create assets if they do not yet exist (False).
- track (default: False): control whether details about each created and updated asset across batches should be tracked (True) or ignored (False); counts will always be kept.
- case_insensitive (default: False): control whether the matching for determining whether an asset already exists should be done in a case-insensitive way (True) or case-sensitively (False).
- creation_handling (default: AssetCreationHandling.FULL): control what kind of assets to create, if not running in update_only mode: PARTIAL assets (only available in lineage), or FULL assets.
- table_view_agnostic (default: False): control whether the matching for determining whether an asset already exists should be done strictly according to the data type specified (False), or if tables, views and materialized views should be treated interchangeably (True).
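The three custom metadata modes described above can be pictured with plain dictionaries. This is only a minimal sketch of the semantics as worded in this list, not the SDK's implementation: the `Handling` enum, the `apply_custom_metadata` function, and the attribute names are all hypothetical.

```python
from enum import Enum


class Handling(Enum):
    """Hypothetical stand-in for the three custom metadata modes."""
    IGNORE = "ignore"
    OVERWRITE = "overwrite"
    MERGE = "merge"


def apply_custom_metadata(existing: dict, incoming: dict, mode: Handling) -> dict:
    """Return the custom metadata an asset would end up with under each mode."""
    if mode is Handling.IGNORE:
        # Changes in the batch are dropped; the asset keeps what it had.
        return dict(existing)
    if mode is Handling.OVERWRITE:
        # Everything is replaced, so attributes absent from the batch are removed.
        return dict(incoming)
    # MERGE: add or update what the batch provides, leave the rest unchanged.
    return {**existing, **incoming}


# Made-up attribute names and values, purely for illustration.
existing = {"owner_team": "finance", "pii": "no"}
incoming = {"pii": "yes", "retention": "7y"}

ignored = apply_custom_metadata(existing, incoming, Handling.IGNORE)
overwritten = apply_custom_metadata(existing, incoming, Handling.OVERWRITE)
merged = apply_custom_metadata(existing, incoming, Handling.MERGE)
```

In this model, only OVERWRITE removes the existing `owner_team` value, which is why it deserves the most care on assets that already carry custom metadata.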
-
You can then start defining a pipeline directly using a
FluentSearch()
object. -
You can chain as many
where()
methods as you want to define all the conditions the search results must match. You can use the class variables within any given type to select a particular attribute (likeQUALIFIED_NAME
in this example), and then limit results to only those assets whosequalified_name
starts with thequalified_name
of the schema (by using thestartswith()
predicate). In this example, that means only assets that are within this particular schema will be returned as results. -
Since there could be tables, views, materialized views and columns in this schema — but you only want views and materialized views — you can use the
CompoundQuery.asset_types()
helper method to restrict results to only views and materialized views. -
Since you only want to update views that do not already have a certificate, you can further limit the results using the
where_not()
method. This will exclude any assets where a certificate alreadyhas_any_value()
. -
(Optional) You can play around with different page sizes, to further limit API calls by retrieving more results per page.
-
Add as many attributes as needed. Each attribute you add here will ensure that detail is included in each search result. So in this example, every view will include its description, certificate, and individual owners. (Limit these attributes to the minimum you need about each view to do your intended work.)
-
You can translate the object you've built up into various outputs, for example immediately calculating a count of how many results match or streaming them directly for processing. In this case, the to_request() method will give us the resulting set of criteria back as a complete index search request.
-
You can then execute the search based on the request, which returns all of those details in a response object.
-
For each result, you can then carry out your changes and submit them into the batch.
-
Every asset implements the
trim_to_required()
method, which gives you a builder containing only the bare minimum information needed to update that asset.
Limit your asset to only what you intend to update
When you send an update to Atlan, it will only attempt to change the information you send in your request — leaving any information not in your request as-is (unchanged) on the asset in Atlan. By using
trim_to_required()
you can remove all information you do not want to update, and then chain on only the details you do want to update. -
In this running example, you are updating the certificate to verified and setting a new owner — so you simply set those updates on the trimmed object.
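The effect of trimming before updating can be modelled with dictionaries. The sketch below assumes a partial-update rule in which only the keys present in the request change. Both `trim_to_required_model` and `apply_update` are hypothetical helpers, and the set of "required" fields and the qualified name are illustrative assumptions rather than what the SDK actually retains.

```python
def trim_to_required_model(asset: dict) -> dict:
    """Hypothetical analogue of trimming: keep only the identifying fields."""
    required = ("type_name", "qualified_name", "name")  # assumed field set
    return {key: asset[key] for key in required}


def apply_update(stored: dict, request: dict) -> dict:
    """Model of a partial update: only keys present in the request change."""
    return {**stored, **request}


# A made-up asset as it might look when returned by the search.
stored = {
    "type_name": "View",
    "qualified_name": "default/snowflake/123/DB/SCHEMA/V1",
    "name": "V1",
    "description": "Daily revenue view",
    "certificate_status": None,
    "owner_users": ["old.owner"],
}

# Trim first, then set only the details that should change.
update = trim_to_required_model(stored)
update["certificate_status"] = "VERIFIED"
update["owner_users"] = ["new.owner"]

result = apply_update(stored, update)
```

Because the trimmed request never mentions `description`, the stored description is untouched in this model, while the certificate and owner are replaced.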
-
You can then add your (in-memory) modified asset to the batch.
-
The Batch's add() method used in the previous step will automatically save once its internal queue of assets reaches a full batch size.
Remember to flush
However, since your loop could finish before you reach another full batch, you must always remember to
flush()
the batch. This will send any remaining assets that were queued up.
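To see why the final flush matters, here is a toy model of the queueing behavior described above. `ToyBatch` is a hypothetical stand-in written for illustration, not the SDK's `Batch`; each entry in `saved` represents one simulated API call.

```python
class ToyBatch:
    """Hypothetical stand-in for a batch; `saved` records each simulated API call."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._queue = []
        self.saved = []

    def add(self, asset):
        self._queue.append(asset)
        # Auto-save as soon as the queue reaches a full batch.
        if len(self._queue) >= self.max_size:
            self.flush()

    def flush(self):
        # Send whatever is queued, even if it is less than a full batch.
        if self._queue:
            self.saved.append(list(self._queue))
            self._queue.clear()


batch = ToyBatch(max_size=20)
for i in range(45):
    batch.add(f"view_{i}")

# Two full batches have gone out; five assets are still waiting in the queue.
sizes_before_flush = [len(call) for call in batch.saved]

batch.flush()
sizes_after_flush = [len(call) for call in batch.saved]
```

With 45 assets and a batch size of 20, the loop alone sends two batches of 20. The remaining 5 are only sent by the explicit `flush()`.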
Example: pipelining | |
---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
-
The
qualifiedName
of every view starts with thequalifiedName
of its parent (schema), so we can limit the results to a particular schema by using thequalifiedName
. -
Create a batch of assets to build up the changes across multiple assets before applying those changes in Atlan itself. When processing in parallel (see the further notes on stream(true)), you need to use a parallel-capable ParallelBatch:
- The first parameter defines the Atlan tenant on which the batch will be processed.
- The second specifies the maximum number of assets to build up before sending them across to Atlan.
Additional parameters
By default (using only the options above) no classifications or custom metadata will be added or changed on the assets in each batch. To also include classifications and custom metadata, you need to use these additional parameters:
- A third parameter of true to replace all classifications on the assets in the batch, which would include removing classifications if none are provided for the assets in the batch itself (or false if you still want to ignore classifications).
- A fourth parameter to control how custom metadata should be handled for the assets: IGNORE any custom metadata changes in the batch, OVERWRITE to replace all custom metadata with what's provided in the batch (including removing custom metadata that already exists on an asset), or MERGE to only add or update custom metadata based on what's in the batch (leaving other existing custom metadata unchanged).
- A fifth parameter to control whether failures should be captured across batches (true) or ignored (false).
- A sixth parameter to control whether the batch should only attempt to update assets that already exist (true) or also create assets if they do not yet exist (false).
- A seventh parameter to control whether details about each created and updated asset across batches should be tracked (true) or ignored (false); counts will always be kept.
- An eighth parameter to control whether the matching for determining whether an asset already exists should be done in a case-insensitive way (true) or case-sensitively (false).
- A ninth parameter to control what kind of assets to create, if not running in updateOnly mode: partial assets (only available in lineage), or full assets.
- A tenth parameter to control whether the matching for determining whether an asset already exists should be done strictly according to the data type specified (false), or if tables, views and materialized views should be treated interchangeably (true).
-
You can then start defining a pipeline directly against the client's assets by using the select() method.
Including archived (soft-deleted) assets
A raw search returns all assets that are found, whether active or archived (soft-deleted). In most cases you probably only want the active ones, so select() returns only active assets by default. Passing true to select() will start the pipeline so that it also includes any archived (soft-deleted) assets in the results, if you do want them.
-
You can chain as many
where()
methods as you want to define all the conditions the search results must match. You can use the static constants within any given type to select a particular attribute (likeQUALIFIED_NAME
in this example), and then limit results to only those assets whosequalifiedName
starts with thequalifiedName
of the schema (by using thestartsWith()
predicate). In this example, that means only assets that are within this particular schema will be returned as results. -
Since there could be tables, views, materialized views and columns in this schema — but you only want views and materialized views — you can use the
Asset.TYPE_NAME.in
helper method to restrict results to only views and materialized views. -
Since you only want to update views that do not already have a certificate, you can further limit the results using the
whereNot()
method. This will exclude any assets where a certificate alreadyhasAnyValue()
. -
(Optional) You can play around with different page sizes, to further limit API calls by retrieving more results per page.
-
Add as many attributes as needed. Each attribute you add here will ensure that detail is included in each search result. So in this example, every view will include its description, certificate, and individual owners. (Limit these attributes to the minimum you need about each view to do your intended work.)
-
Once you have defined the criteria for your pipeline, call the stream() method to push down the pipeline to Atlan. This will:
- Create a search that combines all the criteria you have specified.
- Run that search against Atlan to produce the first page of results.
- Page through the results by lazily fetching each subsequent page as you iterate through them. (So if you use a limit() on the stream, for example, you can break out before retrieving all pages.)
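The lazy paging described in these steps can be sketched with a generator. The example is written in Python for brevity and does not use the SDK: `fetch_page` is a hypothetical stand-in for one search API call, and the total of 250 results is made up.

```python
from itertools import islice

calls = []  # records the offset of every simulated API call


def fetch_page(offset: int, page_size: int, total: int = 250) -> list:
    """Hypothetical stand-in for one search API call returning one page."""
    calls.append(offset)
    return list(range(offset, min(offset + page_size, total)))


def stream_results(page_size: int):
    """Yield results one by one, fetching the next page only when needed."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        offset += len(page)


# Limiting the stream stops the iteration before later pages are requested.
first_120 = list(islice(stream_results(page_size=50), 120))
```

Taking only the first 120 results touches three pages of 50, so the pages covering results 150 to 249 are never requested in this model.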
Can also run in parallel threads
You can also parallel-stream the results by passing true to the stream() method. This will spawn multiple threads that each independently process a page of results and combine the results in parallel. While this can be significantly faster for processing many results, keep in mind that any structure you collect the results into must be thread-safe. (For example, you'll need to use things like ConcurrentHashMap rather than just HashMap, and to use ParallelBatch rather than AssetBatch if making changes.)
-
For each result, you can then carry out your changes and submit them into the batch.
-
Every asset implements the
trimToRequired()
method, which gives you a builder containing only the bare minimum information needed to update that asset.
Limit your asset to only what you intend to update
When you send an update to Atlan, it will only attempt to change the information you send in your request — leaving any information not in your request as-is (unchanged) on the asset in Atlan. By using
trimToRequired()
you can remove all information you do not want to update, and then chain on only the details you do want to update. -
In this running example, you are updating the certificate to verified and setting a new owner — so you simply chain those updates onto the trimmed builder.
-
You can then add your (in-memory) modified asset to the batch.
Auto-saves as it goes
As long as the number of assets built up is below the maximum batch size specified when creating the batch, this will simply continue to build up the batch. As soon as you hit the size limit for the batch, though, this same method will call the save() operation to batch-update all of those assets in a single API call.
Remember to flush
Since your loop could finish before you reach another full batch, you must always remember to
flush()
the batch. This will send any remaining assets that were queued up, when the size of the queue did not yet reach the full batch size. -
Both the add() and flush() operations of the AssetBatch could send a request over to Atlan. Either can therefore also run into trouble and raise an AtlanException. It is up to you to handle such potential errors as you see fit.
-
The AssetBatch's add() method used in the previous step will automatically save once its internal queue of assets reaches a full batch size.
Remember to flush
However, since your loop could finish before you reach another full batch, you must always remember to
flush()
the batch. This will send any remaining assets that were queued up.
Requires numerous API calls
To implement the same logic purely through raw API calls will require making many calls:
- To run the search.
- To page through the results.
- To batch up a set of assets to update.
- To submit each batch of assets to update.
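As a rough illustration of how quickly these calls add up, the sketch below estimates the number of requests for a given workload. It assumes one request per page of search results and one per batch of updates. The function name and the figures are hypothetical, and real counts depend on the page and batch sizes you choose.

```python
import math


def estimate_api_calls(total_assets: int, page_size: int, batch_size: int) -> int:
    """Estimate requests needed: one per search page plus one per update batch."""
    # Even an empty search costs one request.
    search_calls = max(1, math.ceil(total_assets / page_size))
    update_calls = math.ceil(total_assets / batch_size)
    return search_calls + update_calls


# Made-up workload: 1,000 views, 50 results per page, 20 assets per batch.
estimate = estimate_api_calls(total_assets=1000, page_size=50, batch_size=20)
```

Under these assumptions the workload needs 20 search requests and 50 update requests. Larger page and batch sizes reduce both numbers.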