sample

The sample operator in APL pseudo-randomly selects rows from the input dataset at a rate specified by a parameter. This operator is useful when you want to analyze a subset of data, reduce the dataset size for testing, or quickly explore patterns without processing the entire dataset. The sampling algorithm is not statistically rigorous, but it provides a way to explore and understand a dataset. For statistically rigorous analysis, use summarize instead.

The sample operator is most useful with large datasets, where processing every row is resource-intensive or unnecessary. It’s ideal for scenarios like log analysis, performance monitoring, or data quality checks.

For users of other query languages

If you come from other query languages, this section explains how to adjust your existing queries to achieve the same results in APL.

Splunk SPL users

In Splunk SPL, the sample command works similarly, returning a random subset of data rows. The APL sample operator has a simpler syntax, with no additional arguments for biasing the randomness.

| sample 10

ANSI SQL users

In ANSI SQL, there is no direct equivalent to the sample operator, but you can achieve similar results using the TABLESAMPLE clause. In APL, sample operates independently and is more flexible, as it’s not tied to a table scan.

SELECT * FROM table TABLESAMPLE (10 ROWS);

Usage

Syntax

| sample ProportionOfRows

Parameters

ProportionOfRows: A float greater than 0 and less than 1 that specifies the proportion of rows to return from the dataset. For example, 0.05 returns about 5% of rows. The rows are selected pseudo-randomly.

Returns

The operator returns a table containing approximately the specified proportion of rows, selected pseudo-randomly from the input dataset.

Use case examples

In this use case, you sample a small proportion of rows from your HTTP logs to quickly analyze trends without working through the entire dataset.

Query

['sample-http-logs']
| sample 0.05

Output

_time	req_duration_ms	id	status	uri	method	geo.city	geo.country
2023-10-16 12:45:00	234	user1	200	/index	GET	New York	US
2023-10-16 12:47:00	120	user2	404	/login	POST	Paris	FR
2023-10-16 12:48:00	543	user3	500	/checkout	POST	Tokyo	JP

This query returns a pseudo-random subset of about 5% of all rows from the HTTP logs, helping you quickly identify potential issues or patterns without analyzing the entire dataset.
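As a sketch of how a sample can feed a quick exploratory aggregation, the following query samples the same dataset and then counts the sampled rows per HTTP status. It assumes the status field shown in the example output and uses the standard summarize operator with count(). The counts describe only the sampled rows, so treat them as rough indicators rather than exact totals.

```kusto
['sample-http-logs']
| sample 0.05
| summarize count() by status
```

Because the sample is pseudo-random, running the query again can return different counts. For exact figures, run summarize on the full dataset without sample.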

List of related operators

take: Use take when you want to return the first N rows in the dataset rather than a random subset.
where: Use where to filter rows based on conditions rather than sampling randomly.
top: Use top to return the highest N rows based on a sorting criterion.
