This page explains how to use the sample operator function in APL.
The sample
operator in APL psuedo-randomly selects rows from the input dataset at a rate specified by a parameter. This operator is useful when you want to analyze a subset of data, reduce the dataset size for testing, or quickly explore patterns without processing the entire dataset. The sampling algorithm is not statistically rigorous but provides a way to explore and understand a dataset. For statistically rigorous analysis, use summarize
instead.
You can find the sample
operator useful when working with large datasets, where processing the entire dataset is resource-intensive or unnecessary. It’s ideal for scenarios like log analysis, performance monitoring, or sampling for data quality checks.
If you come from other query languages, this section explains how to adjust your existing queries to achieve the same results in APL.
Splunk SPL users
In Splunk SPL, the sample
command works similarly, returning a subset of data rows randomly. However, the APL sample
operator requires a simpler syntax without additional arguments for biasing the randomness.
ANSI SQL users
In ANSI SQL, there is no direct equivalent to the sample
operator, but you can achieve similar results using the TABLESAMPLE
clause. In APL, sample
operates independently and is more flexible, as it’s not tied to a table scan.
ProportionOfRows
: A float greater than 0 and less than 1 which specifies the proportion of rows to return from the dataset. The rows are selected randomly.The operator returns a table containing the specified number of rows, selected randomly from the input dataset.
In this use case, you sample a small number of rows from your HTTP logs to quickly analyze trends without working through the entire dataset.
Query
Output
_time | req_duration_ms | id | status | uri | method | geo.city | geo.country |
---|---|---|---|---|---|---|---|
2023-10-16 12:45:00 | 234 | user1 | 200 | /index | GET | New York | US |
2023-10-16 12:47:00 | 120 | user2 | 404 | /login | POST | Paris | FR |
2023-10-16 12:48:00 | 543 | user3 | 500 | /checkout | POST | Tokyo | JP |
This query returns a random subset of 5 % of all rows from the HTTP logs, helping you quickly identify any potential issues or patterns without analyzing the entire dataset.
In this use case, you sample a small number of rows from your HTTP logs to quickly analyze trends without working through the entire dataset.
Query
Output
_time | req_duration_ms | id | status | uri | method | geo.city | geo.country |
---|---|---|---|---|---|---|---|
2023-10-16 12:45:00 | 234 | user1 | 200 | /index | GET | New York | US |
2023-10-16 12:47:00 | 120 | user2 | 404 | /login | POST | Paris | FR |
2023-10-16 12:48:00 | 543 | user3 | 500 | /checkout | POST | Tokyo | JP |
This query returns a random subset of 5 % of all rows from the HTTP logs, helping you quickly identify any potential issues or patterns without analyzing the entire dataset.
In this use case, you sample traces to investigate performance metrics for a particular service across different spans.
Query
Output
_time | duration | span_id | trace_id | service.name | kind | status_code |
---|---|---|---|---|---|---|
2023-10-16 14:05:00 | 1.34s | span5678 | trace123 | checkoutservice | client | 200 |
2023-10-16 14:06:00 | 0.89s | span3456 | trace456 | checkoutservice | server | 500 |
This query returns 5 % of all traces for the checkoutservice
to identify potential performance bottlenecks.
In this use case, you sample security log data to spot irregular activity in requests, such as 500-level HTTP responses.
Query
Output
_time | req_duration_ms | id | status | uri | method | geo.city | geo.country |
---|---|---|---|---|---|---|---|
2023-10-16 14:30:00 | 543 | user4 | 500 | /payment | POST | Berlin | DE |
2023-10-16 14:32:00 | 876 | user5 | 500 | /order | POST | London | GB |
This query helps you quickly spot failed requests (HTTP 500 responses) and investigate any potential causes of these errors.
take
when you want to return the first N rows in the dataset rather than a random subset.where
to filter rows based on conditions rather than sampling randomly.top
to return the highest N rows based on a sorting criterion.This page explains how to use the sample operator function in APL.
The sample
operator in APL psuedo-randomly selects rows from the input dataset at a rate specified by a parameter. This operator is useful when you want to analyze a subset of data, reduce the dataset size for testing, or quickly explore patterns without processing the entire dataset. The sampling algorithm is not statistically rigorous but provides a way to explore and understand a dataset. For statistically rigorous analysis, use summarize
instead.
You can find the sample
operator useful when working with large datasets, where processing the entire dataset is resource-intensive or unnecessary. It’s ideal for scenarios like log analysis, performance monitoring, or sampling for data quality checks.
If you come from other query languages, this section explains how to adjust your existing queries to achieve the same results in APL.
Splunk SPL users
In Splunk SPL, the sample
command works similarly, returning a subset of data rows randomly. However, the APL sample
operator requires a simpler syntax without additional arguments for biasing the randomness.
ANSI SQL users
In ANSI SQL, there is no direct equivalent to the sample
operator, but you can achieve similar results using the TABLESAMPLE
clause. In APL, sample
operates independently and is more flexible, as it’s not tied to a table scan.
ProportionOfRows
: A float greater than 0 and less than 1 which specifies the proportion of rows to return from the dataset. The rows are selected randomly.The operator returns a table containing the specified number of rows, selected randomly from the input dataset.
In this use case, you sample a small number of rows from your HTTP logs to quickly analyze trends without working through the entire dataset.
Query
Output
_time | req_duration_ms | id | status | uri | method | geo.city | geo.country |
---|---|---|---|---|---|---|---|
2023-10-16 12:45:00 | 234 | user1 | 200 | /index | GET | New York | US |
2023-10-16 12:47:00 | 120 | user2 | 404 | /login | POST | Paris | FR |
2023-10-16 12:48:00 | 543 | user3 | 500 | /checkout | POST | Tokyo | JP |
This query returns a random subset of 5 % of all rows from the HTTP logs, helping you quickly identify any potential issues or patterns without analyzing the entire dataset.
In this use case, you sample a small number of rows from your HTTP logs to quickly analyze trends without working through the entire dataset.
Query
Output
_time | req_duration_ms | id | status | uri | method | geo.city | geo.country |
---|---|---|---|---|---|---|---|
2023-10-16 12:45:00 | 234 | user1 | 200 | /index | GET | New York | US |
2023-10-16 12:47:00 | 120 | user2 | 404 | /login | POST | Paris | FR |
2023-10-16 12:48:00 | 543 | user3 | 500 | /checkout | POST | Tokyo | JP |
This query returns a random subset of 5 % of all rows from the HTTP logs, helping you quickly identify any potential issues or patterns without analyzing the entire dataset.
In this use case, you sample traces to investigate performance metrics for a particular service across different spans.
Query
Output
_time | duration | span_id | trace_id | service.name | kind | status_code |
---|---|---|---|---|---|---|
2023-10-16 14:05:00 | 1.34s | span5678 | trace123 | checkoutservice | client | 200 |
2023-10-16 14:06:00 | 0.89s | span3456 | trace456 | checkoutservice | server | 500 |
This query returns 5 % of all traces for the checkoutservice
to identify potential performance bottlenecks.
In this use case, you sample security log data to spot irregular activity in requests, such as 500-level HTTP responses.
Query
Output
_time | req_duration_ms | id | status | uri | method | geo.city | geo.country |
---|---|---|---|---|---|---|---|
2023-10-16 14:30:00 | 543 | user4 | 500 | /payment | POST | Berlin | DE |
2023-10-16 14:32:00 | 876 | user5 | 500 | /order | POST | London | GB |
This query helps you quickly spot failed requests (HTTP 500 responses) and investigate any potential causes of these errors.
take
when you want to return the first N rows in the dataset rather than a random subset.where
to filter rows based on conditions rather than sampling randomly.top
to return the highest N rows based on a sorting criterion.