Paging search results
Automatically (via SDK)
Our SDKs are designed to simplify paging, so you do not need to worry about the underlying details. You can simply iterate through a search response and the SDK will automatically fetch the next page(s) when it needs to (lazily).
The SDKs will even add a default sort by GUID to ensure stable results across pages, even when you do not provide any sorting criteria yourself.
Automatic paging

- You can start building a query across all assets using the `select()` method on the `assets` member of any client. You can chain as many mandatory (`where()`) conditions, mandatory exclusion (`whereNot()`) conditions, and sets of conditions of which some must match (`whereSome()`) as you want.
- The number of results to include (per page).
- You can stream the results directly from the response. This will also lazily load and loop through each page of results.
  Can be chained without creating a request in-between: You can chain the `stream()` method directly onto the end of your query and request construction, without creating a `request` or `response` object in-between.
- With streaming, you can apply your own limits to the maximum number of results you want to process.
  Independent of page size: This limit is independent of page size. You could page through results 50 at a time, but only process a maximum of 100 total results this way. Since the results are lazily loaded when streaming, only the first two pages of results would be retrieved in such a scenario.
- You can also apply your own logical filters to the results.
  Push down as much as you can to the query: You should push down as many of the filters as you can to the query itself, but if you have a particularly complex check that cannot be encoded in the query, this can be a useful secondary filter over the results.
- The `forEach()` on the resulting stream will then apply whatever actions you want to the results that come through.
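The interaction between page size and a processing limit can be sketched in plain Python. This is not the SDK's implementation: `fetch_page` and `lazy_results` are hypothetical stand-ins for a paged search, and the results are just numbers. It only illustrates why paging 50 at a time with a limit of 100 retrieves two pages.

```python
from itertools import islice
from typing import Iterator, List

TOTAL_RESULTS = 1000  # pretend the search matches 1,000 assets
pages_fetched: List[int] = []  # offset of every page that was requested


def fetch_page(offset: int, size: int) -> List[int]:
    """Hypothetical stand-in for one search request (returns numbers, not assets)."""
    pages_fetched.append(offset)
    return list(range(offset, min(offset + size, TOTAL_RESULTS)))


def lazy_results(size: int) -> Iterator[int]:
    """Yield results one at a time, requesting the next page only when needed."""
    offset = 0
    while True:
        page = fetch_page(offset, size)
        if not page:
            return
        yield from page
        offset += size


# Page 50 at a time, but process at most 100 results in total.
processed = list(islice(lazy_results(size=50), 100))
print(len(processed), pages_fetched)
```

Because the generator only asks for a page when the consumer needs another result, the limit decides how many requests are made, not the total number of matches.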
Build the query

- You can start building a query using a `FluentSearch` object. You can have as many mandatory (`where()`) conditions, mandatory exclusion (`where_not()`) conditions, and sets of conditions of which some must match (`where_some()`) as you want.
- This helper provides a query that ensures results are active (not archived) assets.
- You can now build all of this search configuration into a request.
- You can then run the search against this request.
- This will iterate through all the results without the need to be concerned with pages.
  Iterating over results produces a Generator: This means that results are retrieved from the backend a page at a time. This also means that you can only iterate over the results once.
- Remember that each result is a generic `Asset`. You should push down as many of the filters as you can to the query itself, but if you have a particularly complex check that cannot be encoded in the query, this can be a useful secondary filter over the results.
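The single-pass behavior of a generator, and a secondary filter applied while iterating, can be illustrated with a minimal sketch. The `results()` function and the dictionaries it yields are hypothetical stand-ins for a search response, not SDK objects.

```python
from typing import Dict, Iterator, List


def results() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for iterating over a search response."""
    yield {"name": "orders", "type": "Table"}
    yield {"name": "orders_view", "type": "View"}
    yield {"name": "customers", "type": "Table"}


response = results()

# Secondary filter applied on the client, for a check the query could not express.
tables: List[str] = [r["name"] for r in response if r["type"] == "Table"]

# The generator is now exhausted, so a second pass yields nothing.
second_pass = list(response)
print(tables, second_pass)
```

If you need to go over the results more than once, collect them into a list during the first pass (keeping in mind the memory this needs for large result sets).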
Use an SDK
The SDKs manage making multiple requests and parsing results to make subsequent requests in the most efficient way possible. You will need to make many different API requests if you want to do the same directly via the raw REST APIs.
Manually (via Elastic)
For curious minds, though, you can page through search results using a combination of the following properties [1]:
Property | Description | Example |
---|---|---|
`from` | Indicates the starting point for the results. | `0` |
`size` | Indicates how many results to include per response (page). As a general rule of thumb we would recommend a size from `20`-`100`, making `50` a common starting point. | `50` |
`track_total_hits` | Includes an accurate number of total results, if set to `true`. With its default value on the raw REST APIs (`false`) the maximum number of results you will see in the `approximateCount` field in the response is 10000. (Again, the SDKs set this to `true` by default to avoid this confusion.) | `true` |
Constraints with this approach
For the most consistent results when paging, you must always sort the results and include at least one sorting criterion as a tie-breaker. (You must also keep the sorting criteria the same for every page.)
Furthermore, as `from` grows large (more than ~10,000), Elastic will use significantly more resources to process your paging. To reduce this impact, if you need to page through many results you should implement your own timestamp-based offset mechanism so that `from` is kept consistently low.
(Again, the SDKs do both of these for you automatically.)
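One possible shape for a timestamp-based offset mechanism is sketched below. This is an illustration under stated assumptions, not how the SDKs implement it: the backend is an in-memory stand-in (`make_search`), results are sorted ascending by a hypothetical `ts` field with `guid` as tie-breaker, and `max_from` is an arbitrary threshold. Once the offset reaches the threshold, the sketch restarts from offset 0 with a filter on the last timestamp seen, skipping records it has already returned at that timestamp.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Set

Record = Dict[str, Any]
offsets_requested: List[int] = []  # every `from` value sent to the backend


def make_search(records: List[Record]) -> Callable[[int, int, int], List[Record]]:
    """Hypothetical backend: filter by timestamp, sort, then slice by from/size."""
    def search(min_ts: int, offset: int, size: int) -> List[Record]:
        offsets_requested.append(offset)
        matching = sorted(
            (r for r in records if r["ts"] >= min_ts),
            key=lambda r: (r["ts"], r["guid"]),  # timestamp, then GUID tie-breaker
        )
        return matching[offset:offset + size]
    return search


def iterate_all(search, size: int = 50, max_from: int = 100) -> Iterator[Record]:
    """Yield every record once while keeping the offset below max_from where possible."""
    min_ts, offset = 0, 0
    last_ts: Optional[int] = None
    last_guids: Set[str] = set()  # GUIDs already yielded at timestamp last_ts
    while True:
        page = search(min_ts, offset, size)
        if not page:
            return
        for record in page:
            if record["ts"] == last_ts:
                if record["guid"] in last_guids:
                    continue  # already yielded before the window moved
                last_guids.add(record["guid"])
            else:
                last_ts, last_guids = record["ts"], {record["guid"]}
            yield record
        offset += size
        # Move the window only if that makes progress; otherwise keep paging deeper.
        if offset >= max_from and last_ts > min_ts:
            min_ts, offset = last_ts, 0


records = [{"guid": f"g{i:04d}", "ts": i // 3 + 1} for i in range(500)]
guids = [r["guid"] for r in iterate_all(make_search(records))]
print(len(guids), max(offsets_requested))
```

The sketch assumes the data does not change while paging and that timestamps are positive integers. If more records share one timestamp than fit below the threshold, it falls back to deeper offsets rather than looping.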
For example:
Annotated sort options, as you would define them in the Java SDK

- Include any of your own sorting, like this example putting the most recently-updated assets first in the results.
- Also consider a tie-breaker sorting mechanism. In this case, we use an asset's GUID to further sort any results that have the same last modified timestamp, since GUID is guaranteed to be unique for every asset.
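The effect of a tie-breaker can be shown with a small sketch in plain Python. The `updated` and `guid` keys are illustrative names for the last-modified timestamp and GUID, not SDK fields. Without the GUID in the sort key, the two assets that share a timestamp could come back in either order from one page request to the next.

```python
from typing import Any, Dict, List

Asset = Dict[str, Any]


def sort_for_paging(assets: List[Asset]) -> List[Asset]:
    """Most recently updated first; GUID ascending breaks ties."""
    return sorted(assets, key=lambda a: (-a["updated"], a["guid"]))


assets = [
    {"guid": "b", "updated": 20},
    {"guid": "a", "updated": 20},  # same timestamp as "b"
    {"guid": "c", "updated": 30},
]

ordered = [a["guid"] for a in sort_for_paging(assets)]
ordered_from_reversed = [a["guid"] for a in sort_for_paging(list(reversed(assets)))]
print(ordered, ordered_from_reversed)
```

With the tie-breaker the order is the same regardless of the order in which the assets arrive, which is what keeps page boundaries stable.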
Build the request

- You still need a query to get some results.
- Starting point for the page of results being requested. In this example, you would be asking for the third page. (`0` would be from `0-50` for the first page, `50` would be from `50-100` for the second page, and this gives us `100-150` for the third page.)
- The number of results per page (in this example, `50` results per page).
- Enable `trackTotalHits` so that your response includes an accurate total number of results. (The Java SDK enables this by default, so this step is redundant unless you want to turn it off.)
- And we need to include the sorting criteria we defined just above.
Iterate through multiple pages of results

- Keep the response object from the initial search, as it has a helper method for paging.
- Since we set `trackTotalHits` to `true` (the default for the Java SDK even if we do not set it), `.getApproximateCount()` will give us the total number of results. This can be over 10,000.
- Iterate through all the results, across all pages (each page is lazily loaded, so you can break out at any time without actually retrieving all pages of results).
- Alternatively, you can iterate through all the results using `forEach()` on the response. (This uses the same underlying iterable-based implementation.)
- Alternatively, you can stream the results. Streaming will also lazily load only the pages of results necessary to meet the chained criteria for processing the stream.
- When streaming, you can further filter the results to apply any complex filtering logic you could not push down as part of the query itself.
- When streaming, you can also limit the total number of results you want to process, independently of the page size.
- Don't forget to actually do something with the results in the stream.
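How a chained filter and limit determine the pages that get loaded can be sketched in plain Python (the Java stream operations are not reproduced here). `fetch_page` and `lazy_results` are hypothetical stand-ins for a paged search over numbered results. Keeping only even-numbered results and stopping after 60 of them means about 120 results must be read, which is three pages of 50.

```python
from itertools import islice
from typing import Iterator, List

pages_fetched: List[int] = []  # offset of every page that was requested


def fetch_page(offset: int, size: int, total: int = 1000) -> List[int]:
    """Hypothetical stand-in for one search request."""
    pages_fetched.append(offset)
    return list(range(offset, min(offset + size, total)))


def lazy_results(size: int) -> Iterator[int]:
    """Yield results one at a time, requesting the next page only when needed."""
    offset = 0
    while True:
        page = fetch_page(offset, size)
        if not page:
            return
        yield from page
        offset += size


# Filter first (a check the query could not express), then limit.
even_only = (n for n in lazy_results(size=50) if n % 2 == 0)
limited = list(islice(even_only, 60))
print(limited[-1], pages_fetched)
```

The more selective the client-side filter, the more pages are needed to satisfy the same limit, which is one reason to push filters into the query where possible.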
Annotated sort options, as you would define them in the Python SDK

- Include any of your own sorting, like this example putting the most recently-updated assets first in the results.
- Also consider a tie-breaker sorting mechanism. In this case, we use an asset's GUID to further sort any results that have the same last modified timestamp, since GUID is guaranteed to be unique for every asset.
Build the request

- You still need a query to get some results.
- Starting point for the page of results being requested. In this example, you would be asking for the third page. (`0` would be from `0-50` for the first page, `50` would be from `50-100` for the second page, and this gives us `100-150` for the third page.)
- The number of results per page (in this example, `50` results per page).
- Enable `track_total_hits` so that your response includes an accurate total number of results. (The Python SDK enables this by default, so this step is redundant unless you want to turn it off.)
- And we need to include the sorting criteria we defined just above.
Iterate through multiple pages of results

- Keep the response object from the initial search, as it has a helper method for paging.
- Since we set `track_total_hits` to `True` (the default for the Python SDK even if we do not set it), the `.count` property will give us the total number of results. This can be over 10,000.
- Iterate through all the results, across all pages (each page is lazily loaded, so you can break out at any time without actually retrieving all pages of results). Don't forget to actually do something with the results in the stream.
POST /api/meta/search/indexsearch

- Starting point for the page of results being requested. In this example, you would be asking for the third page. (`0` would be from `0-50` for the first page, `50` would be from `50-100` for the second page, and this gives us `100-150` for the third page.)
- The number of results per page (in this example, `50` results per page).
- Enable `track_total_hits` so that your response includes an accurate total number of results.
- You still need a query to get some results.
- When paging, we should always sort the results (for consistency across the pages).
- Include any of your own sorting, like this example putting the most recently-updated assets first in the results.
- Also consider a tie-breaker sorting mechanism. In this case, we use the GUID of an asset to further sort any results that have the same last modified timestamp.
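The paging properties described above can be assembled as follows. This is a sketch of the Elasticsearch-style portion of a request only: the sort field names `__modificationTimestamp` and `__guid` are assumptions used for illustration, `match_all` is a placeholder query, and any wrapper or extra fields the endpoint expects around these properties are not shown.

```python
import json
from typing import Any, Dict


def build_search_body(page_number: int, size: int = 50) -> Dict[str, Any]:
    """Build the paging-related properties for a 1-based page number."""
    return {
        "from": (page_number - 1) * size,  # page 3 with size 50 starts at 100
        "size": size,
        "track_total_hits": True,
        "query": {"match_all": {}},  # placeholder; use your real query here
        "sort": [
            # Assumed field names: last-modified timestamp, then GUID tie-breaker.
            {"__modificationTimestamp": {"order": "desc"}},
            {"__guid": {"order": "asc"}},
        ],
    }


body = build_search_body(page_number=3)
print(json.dumps(body, indent=2))
```

Only `from` changes between pages; the query, size and sort stay the same, which is what keeps the pages consistent with each other.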
Annotated response, in plain JSON

- Note that every response to a search includes the query that was run. You can deconstruct this programmatically to always determine the `from` you will need for the next page of results. (Basically: `from = from + size`.) And in fact, since you can programmatically extract both the query and sorting criteria from this, you have everything you need to get the next page: the `query`, the `from`, the `size` and the `sort`.
- The results themselves are the objects in the `entities` array. This array will hold at most `size` elements. Your final page of results may not be a complete page, so on the final page this array can hold fewer than `size` elements.
- Since the request sets `track_total_hits` to `true`, the `approximateCount` in the response will have an accurate number of total results. Note that this can go beyond 10,000.
[1] If you're familiar with Elasticsearch, there are alternative paging options using `search_after` and point-in-time (PIT) state preservation. (There also used to be scrolling, but this is no longer recommended by Elasticsearch.) We do not currently expose the `search_after` or PIT approaches through Atlan's search. However, you should still be able to page beyond the first 10,000 results using the approach outlined above.