
Paging search results

More than you need to know to use the SDKs

The details below explain how paging works internally. Our SDKs handle these details for you: you can simply iterate through a search response and the SDK will lazily fetch the next page(s) when it needs to. The SDKs will even add a default sort by GUID to ensure stable results across pages when you do not provide any sorting criteria yourself.

For curious minds, though, you can page through search results using a combination of the following properties [1]:

| Property | Description | Example |
|---|---|---|
| from | Indicates the starting point for the results. | 0 |
| size | Indicates how many results to include per response (page). As a general rule of thumb we would recommend a size from 20-100, making 50 a common starting point. | 50 |
| track_total_hits | Includes an accurate number of total results, if set to true. With its default value on the raw REST APIs (false) the maximum number of results you will see in the approximateCount field in the response is 10000. (Again, the SDKs set this to true by default to avoid this confusion.) | true |

Always use some sort if you plan to page results

To get the most consistent results when paging, always specify sorting criteria, and keep those criteria the same for every page.
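To illustrate why a unique tie-breaker matters, here is a minimal sketch in plain Python that does not call Atlan at all. The `ASSETS` list, the `fetch_page` helper and the field names are hypothetical stand-ins, and the idea that tied hits may come back in a different internal order on each request is a simplifying assumption.

```python
# Hypothetical in-memory stand-in for a search index (not Atlan's API).
# Three assets share the same "updated" timestamp.
ASSETS = [
    {"guid": "a1", "updated": 300},
    {"guid": "b2", "updated": 200},
    {"guid": "c3", "updated": 200},
    {"guid": "d4", "updated": 200},
    {"guid": "e5", "updated": 100},
]


def by_update_only(asset):
    """Sort key: most recently updated first, with no tie-breaker."""
    return -asset["updated"]


def by_update_then_guid(asset):
    """Sort key: most recently updated first, then GUID ascending."""
    return (-asset["updated"], asset["guid"])


def fetch_page(index_order, from_, size, sort_key):
    """Simulate one request: sort whatever order the index holds, then slice."""
    hits = sorted(index_order, key=sort_key)
    return [hit["guid"] for hit in hits[from_:from_ + size]]


# Assumption: the index may hold tied hits in a different order per request.
first_request = ASSETS
second_request = list(reversed(ASSETS))

unstable = (fetch_page(first_request, 0, 2, by_update_only)
            + fetch_page(second_request, 2, 2, by_update_only))
stable = (fetch_page(first_request, 0, 2, by_update_then_guid)
          + fetch_page(second_request, 2, 2, by_update_then_guid))

print(unstable)  # ['a1', 'b2', 'c3', 'b2']: b2 repeats and d4 is skipped
print(stable)    # ['a1', 'b2', 'c3', 'd4']
```

Without the GUID tie-breaker, the two pages overlap on one asset and miss another. With it, every asset has exactly one position in the ordering, so the pages line up.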


For example:

Annotated sort options, as you would define them in the Java SDK
SortOptions byUpdate = Asset.UPDATE_TIME.order(SortOrder.Desc); // (1)
SortOptions byGuid = Asset.GUID.order(SortOrder.Asc); // (2)
  1. Include any of your own sorting, like this example putting the most recently-updated assets first in the results.
  2. Also consider a tie-breaker sorting mechanism. In this case, we use an asset's GUID to further sort any results that have the same last modified timestamp, since GUID is guaranteed to be unique for every asset.
Build the request
IndexSearchRequest index = IndexSearchRequest.builder(
  IndexSearchDSL.builder(someQuery) // (1)
      .from(100) // (2)
      .size(50) // (3)
      .trackTotalHits(true) // (4)
      .sortOption(byUpdate) // (5)
      .sortOption(byGuid)
      .build())
    .build();
  1. You still need a query, to get some results 😉.
  2. Starting point for the page of results being requested. In this example, you are asking for the third page: a from of 0 returns results 0-49 (the first page), 50 returns results 50-99 (the second page), and 100 returns results 100-149 (the third page).
  3. The number of results per page (in this example, 50 results per page).
  4. Enable trackTotalHits so that your response includes an accurate total number of results. (Actually the Java SDK enables this by default, so this step is redundant unless you want to turn it off.)
  5. And we need to include the sorting criteria we defined just above.
Iterate through multiple pages of results
IndexSearchResponse response = index.search(); // (1)
long totalResults = response.getApproximateCount(); // (2)
for (Asset result : response) { // (3)
    // Do something with each result of the search...
}
response.forEach(a -> log.info("Found asset: {}", a.getGuid())); // (4)
response.stream() // (5)
    .filter(a -> !(a instanceof ILineageProcess)) // (6)
    .limit(100) // (7)
    .forEach(a -> log.info("Found asset: {}", a.getGuid())); // (8)
  1. Keep the response object from the initial search, as it has a helper method for paging.
  2. Since we set trackTotalHits to true (the default for the Java SDK even if we do not set it), the .getApproximateCount() will give us the total number of results. This can be over 10,000.
  3. Iterate through all the results, across all pages (each page is lazily-loaded, so you can break out at any time without actually retrieving all pages of results).
  4. Alternatively, you can iterate through all the results using forEach() on the response. (This uses the same underlying iterable-based implementation.)
  5. Alternatively, you can stream the results. Streaming will also lazily-load only the pages of results necessary to meet the chained criteria for processing the stream.
  6. When streaming, you can further filter the results to apply any complex filtering logic you could not push-down as part of the query itself.
  7. When streaming, you can also limit the total number of results you want to process — independently of the page size.
  8. Don't forget to actually do something with the results in the stream 😉
Annotated sort options, as you would define them in the Python SDK
from pyatlan.client.atlan import AtlanClient
from pyatlan.model.enums import SortOrder
from pyatlan.model.assets import Referenceable
from pyatlan.model.search import IndexSearchRequest, DSL

by_update = Referenceable.UPDATE_TIME.order(SortOrder.DESCENDING)  # (1)
by_guid = Referenceable.GUID.order(SortOrder.ASCENDING)  # (2)
  1. Include any of your own sorting, like this example putting the most recently-updated assets first in the results.
  2. Also consider a tie-breaker sorting mechanism. In this case, we use an asset's GUID to further sort any results that have the same last modified timestamp, since GUID is guaranteed to be unique for every asset.
Build the request
index = IndexSearchRequest(
    dsl=DSL(
        query=someQuery,  # (1)
        from_=100,  # (2)
        size=50,  # (3)
        track_total_hits=True,  # (4)
        sort=[  # (5)
            by_update,
            by_guid
        ],
    )
)
  1. You still need a query, to get some results 😉.
  2. Starting point for the page of results being requested. In this example, you are asking for the third page: a from_ of 0 returns results 0-49 (the first page), 50 returns results 50-99 (the second page), and 100 returns results 100-149 (the third page).
  3. The number of results per page (in this example, 50 results per page).
  4. Enable track_total_hits so that your response includes an accurate total number of results. (Actually the Python SDK enables this by default, so this step is redundant unless you want to turn it off.)
  5. And we need to include the sorting criteria we defined just above.
Iterate through multiple pages of results
client = AtlanClient()
response = client.asset.search(index)  # (1)
total_results = response.count  # (2)
for result in response:  # (3)
    # Do something with each result of the search...
    pass  # placeholder so that the loop body is valid Python
  1. Keep the response object from the initial search, as it has a helper method for paging.
  2. Since we set track_total_hits to True (the default for the Python SDK even if we do not set it), the .count property will give us the total number of results. This can be over 10,000.
  3. Iterate through all the results, across all pages (each page is lazily-loaded, so you can break out at any time without actually retrieving all pages of results). Don't forget to actually do something with each result inside the loop 😉
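The lazy page-by-page iteration described above can be sketched in plain Python roughly as follows. This is not the SDK's actual implementation: `iterate_results`, `fake_fetch` and `FAKE_RESULTS` are hypothetical names, and the fetch function is a local stub standing in for a real search request.

```python
from typing import Callable, Iterator, List

# Hypothetical stand-in for 120 search results held on the server.
FAKE_RESULTS = [{"guid": f"guid-{i:03d}"} for i in range(120)]
requests: List[int] = []  # records the "from" of every page actually fetched


def fake_fetch(from_: int, size: int) -> List[dict]:
    """Stub for one search request; a real one would call the search API."""
    requests.append(from_)
    return FAKE_RESULTS[from_:from_ + size]


def iterate_results(
    fetch_page: Callable[[int, int], List[dict]], size: int = 50
) -> Iterator[dict]:
    """Yield results one at a time, fetching the next page only when needed."""
    from_ = 0
    while True:
        page = fetch_page(from_, size)
        yield from page
        if len(page) < size:  # a short (or empty) page means we are done
            return
        from_ += size  # from = from + size


# Break out early: the third page is never requested.
for result in iterate_results(fake_fetch):
    if result["guid"] == "guid-060":
        break

print(requests)  # [0, 50]
```

Because the generator only requests a page when the consumer asks for a result beyond the current one, breaking out of the loop early avoids the remaining requests.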
POST /api/meta/search/indexsearch
{
  "dsl": {
    "from": 100, // (1)
    "size": 50, // (2)
    "track_total_hits": true, // (3)
    "query": {...}, // (4)
    "sort": [ // (5)
      { "__modificationTimestamp": { "order": "desc" }}, // (6)
      { "__guid": { "order": "asc" }} // (7)
    ]
  }
}
  1. Starting point for the page of results being requested. In this example, you are asking for the third page: a from of 0 returns results 0-49 (the first page), 50 returns results 50-99 (the second page), and 100 returns results 100-149 (the third page).
  2. The number of results per page (in this example, 50 results per page).
  3. Enable track_total_hits so that your response includes an accurate total number of results.
  4. You still need a query, to get some results 😉.
  5. When paging, we should always sort the results (for consistency across the pages).
  6. Include any of your own sorting, like this example putting the most recently-updated assets first in the results.
  7. Also consider a tie-breaker sorting mechanism. In this case, we use the GUID of an asset to further sort any results that have the same last modified timestamp.
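As a rough illustration of the arithmetic in the request above, the following sketch builds the same request body for any zero-based page number. The helper name `build_page_request` is hypothetical, and the `match_all` query is only a placeholder for whatever query you actually want to run.

```python
import json


def build_page_request(query: dict, page_index: int, size: int = 50) -> dict:
    """Build an index search request body for a zero-based page number."""
    if page_index < 0 or size <= 0:
        raise ValueError("page_index must be >= 0 and size must be > 0")
    return {
        "dsl": {
            "from": page_index * size,  # page 0 -> 0, page 1 -> 50, page 2 -> 100
            "size": size,
            "track_total_hits": True,
            "query": query,
            # Keep the same sort for every page, with GUID as the tie-breaker.
            "sort": [
                {"__modificationTimestamp": {"order": "desc"}},
                {"__guid": {"order": "asc"}},
            ],
        }
    }


# Placeholder query; the third page (index 2) starts at from = 100.
body = build_page_request({"match_all": {}}, page_index=2)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent as the body of the POST request shown above.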
Annotated response, in plain JSON
{
  "queryType": "INDEX",
  "searchParameters": {
      "showSearchScore": false,
      "suppressLogs": false,
      "allowDeletedRelations": false,
      "query": "{\"from\":100,\"size\":50,\"track_total_hits\":true,\"query\":{...},\"sort\":[{\"__modificationTimestamp\":\"desc\"},{\"__guid\":\"asc\"}]}" // (1)
  },
  "entities": [ // (2)
    {...},
    {...},
    ...
  ],
  "approximateCount": 24631 // (3)
}
  1. Note that every response to a search includes the query that was run. You can deconstruct this programmatically to always determine the from you will need for the next page of results. (Basically: from = from + size.) And in fact, since you can programmatically extract both the query and sorting criteria from this you have everything you need to get the next page — the query, the from, the size and the sort.
  2. The results themselves are the objects in the entities array. This array will contain at most size elements. The final page of results may not be a complete page, so on that page the array can contain fewer than size elements.
  3. Since the request sets track_total_hits to true, the approximateCount in the response will have an accurate number of total results. Note that this can go beyond 10,000.
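The from = from + size step described in the first annotation can be sketched as below. The helper name `next_page_dsl` and the sample response are hypothetical; the sample uses a placeholder `match_all` query so that the embedded query string is valid JSON.

```python
import json
from typing import Optional


def next_page_dsl(response: dict) -> Optional[dict]:
    """Derive the DSL for the next page from a search response.

    Returns None when the current page is short, which means it was the last.
    """
    dsl = json.loads(response["searchParameters"]["query"])
    entities = response.get("entities") or []
    if len(entities) < dsl["size"]:
        return None
    dsl["from"] = dsl["from"] + dsl["size"]  # query, size and sort are unchanged
    return dsl


# Hypothetical response for the third page (from=100, size=50).
sample_response = {
    "queryType": "INDEX",
    "searchParameters": {
        "query": json.dumps({
            "from": 100,
            "size": 50,
            "track_total_hits": True,
            "query": {"match_all": {}},
            "sort": [{"__modificationTimestamp": "desc"}, {"__guid": "asc"}],
        })
    },
    "entities": [{"guid": f"guid-{i}"} for i in range(50)],
    "approximateCount": 24631,
}

next_dsl = next_page_dsl(sample_response)
print(next_dsl["from"])  # 150
```

If the total number of results is an exact multiple of size, this approach makes one extra request that returns an empty page before stopping.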

  1. If you're familiar with Elasticsearch, there are alternative paging options using search_after and point-in-time (PIT) state preservation. (There also used to be scrolling, but this is no longer recommended by Elasticsearch.) We do not currently expose the search_after or PIT approaches through Atlan's search. However, you should still be able to page beyond the first 10,000 results using the approach outlined above.