# Sampling methods

* Generic sampling methods

  + No sampling
  + First records
  + Random sampling (approximate ratio)
  + Random sampling (approximate count)
  + Column values subset
  + Class rebalancing (approximate number of records)
  + Class rebalancing (approximate ratio)

* Exploration / Visual data preparation

Many parts of DSS support sampling, either to extract a subset of the data or to reduce the amount of data to process.

Sampling can be configured in the following locations in DSS:

* Exploration

* Visual data preparation

* Charts

* The sampling recipe

* Machine learning

* Various APIs for fetching datasets data

DSS provides a variety of sampling methods.

## Generic sampling methods

The following methods are available in most places where DSS offers sampling.

### No sampling

All data is used; no sampling takes place.

### First records

This method takes the first N rows of the dataset. It is very fast, as it only reads N rows, but may result in a very biased view of the dataset.
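As an illustration only (this is not DSS code), the idea can be sketched in plain Python. The `counting_reader` helper is a hypothetical row source that records how many rows were actually read, to show that reading stops after N rows.

```python
import itertools


def first_records(rows, n):
    """Return the first n rows of an iterable without reading any further."""
    return list(itertools.islice(rows, n))


def counting_reader(total, reads):
    """Hypothetical row source that logs every row it hands out."""
    for i in range(total):
        reads.append(i)
        yield {"id": i}


reads = []
sample = first_records(counting_reader(1_000_000, reads), 5)
ids = [row["id"] for row in sample]
```

Only five rows are read from the million-row source, and the sample contains nothing but the lowest ids. If the data is stored in a meaningful order (by date, for example), such a sample reflects only the start of that order.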

### Random sampling (approximate ratio)

This method randomly selects approximately X% of the rows. The resulting number of records is approximate; the larger the input dataset, the closer it gets to the target.

This method requires a full pass reading the data.
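One plausible way to obtain an approximate ratio in a single pass is to make an independent keep-or-drop decision for each row. The sketch below assumes this approach; the source does not state how DSS implements it, and `sample_by_ratio` is a hypothetical name.

```python
import random


def sample_by_ratio(rows, ratio, seed=None):
    """Keep each row independently with probability `ratio` (single pass)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < ratio]


# Ask for roughly 10% of 100,000 rows.
sample = sample_by_ratio(range(100_000), 0.10, seed=0)
sample_size = len(sample)
```

The sample size lands near 10,000 but is not exactly 10,000. The relative deviation shrinks as the input grows, which is consistent with the precision note above.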

### Random sampling (approximate count)

This method randomly selects approximately N records. The resulting number of records is approximate; the larger the input dataset, the closer it gets to the target.

This method requires 2 full passes reading the data.
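A minimal sketch of why two passes can be needed, assuming the first pass counts the rows and the second applies a per-row keep probability of N divided by that count. This is an assumption for illustration, not a description of DSS internals; `sample_by_count` and `read_rows` are hypothetical names.

```python
import random


def sample_by_count(read_rows, n, seed=None):
    """Select roughly n rows.

    `read_rows` is a zero-argument callable that returns a fresh iterator
    over the data, so the dataset can be read twice.
    """
    total = sum(1 for _ in read_rows())  # pass 1: count the rows
    if total == 0:
        return []
    keep_prob = min(1.0, n / total)
    rng = random.Random(seed)
    # pass 2: keep each row independently with probability n / total
    return [row for row in read_rows() if rng.random() < keep_prob]


sample = sample_by_count(lambda: iter(range(50_000)), 1_000, seed=1)
small = sample_by_count(lambda: iter(range(10)), 100, seed=1)
```

When the dataset holds fewer than N rows, as in `small`, the keep probability is capped at 1 and every row is returned.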

### Column values subset

This method randomly selects a subset of values and chooses all rows with these values, in order to obtain approximately N rows. This is useful for selecting a subset of customers, for example.

This sampling method requires 2 full passes reading the data. The time taken by this method is thus linear with the size of the dataset.

This method is useful if you want to have all records for some values of the column, for your analysis. For example, if your dataset is a log of user actions, it is more interesting to have “all actions for a sample of the users” rather than “a sample of all actions”, as it allows you to really study the sequences of actions of these users.

“Column values subset” sampling will only provide interesting results if the selected column has a sufficiently large number of values. A user id would generally be a good choice for the sampling column.
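The behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the DSS implementation: the first pass counts rows per value, values are then picked at random until the expected row count reaches N, and the second pass keeps every row that carries a picked value. The names `sample_column_values`, `read_rows`, and the `user`/`action` fields are hypothetical.

```python
import random
from collections import Counter


def sample_column_values(read_rows, column, n, seed=None):
    """Keep all rows for a random subset of values of `column`, aiming at ~n rows."""
    # Pass 1: count how many rows each value has.
    counts = Counter(row[column] for row in read_rows())
    values = list(counts)
    random.Random(seed).shuffle(values)

    # Pick values in random order until the expected row count reaches n.
    chosen, expected = set(), 0
    for value in values:
        if expected >= n:
            break
        chosen.add(value)
        expected += counts[value]

    # Pass 2: keep every row whose value was picked.
    return [row for row in read_rows() if row[column] in chosen]


# A log of 100 users with 10 actions each (1,000 rows).
log = [{"user": u, "action": a} for u in range(100) for a in range(10)]
sample = sample_column_values(lambda: iter(log), "user", 50, seed=3)
actions_per_user = Counter(row["user"] for row in sample)
```

Here the sample contains five users with all ten of their actions each, rather than fifty unrelated actions scattered across many users.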

### Class rebalancing (approximate number of records)

This method randomly selects approximately N rows while trying to balance all modalities (distinct values) of a column equally. It only undersamples and never oversamples, so some rare modalities may remain under-represented. In all cases, rebalancing is approximate.

This sampling method requires 2 full passes reading the data. The time taken by this method is thus linear with the size of the dataset.
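To make the undersampling behaviour concrete, here is a minimal sketch under a simple assumption: each modality gets an equal share of N, and rows are kept with a probability capped at 1. It is not the DSS algorithm, and `rebalance_sample`, `read_rows`, and the `label` field are hypothetical names.

```python
import random
from collections import Counter


def rebalance_sample(read_rows, column, n, seed=None):
    """Undersample so that each modality of `column` gets roughly n / k rows."""
    # Pass 1: count rows per modality.
    counts = Counter(row[column] for row in read_rows())
    if not counts:
        return []
    per_class = n / len(counts)
    # A probability capped at 1 means rare modalities are never oversampled.
    keep_prob = {c: min(1.0, per_class / k) for c, k in counts.items()}
    rng = random.Random(seed)
    # Pass 2: keep each row with the probability of its modality.
    return [row for row in read_rows() if rng.random() < keep_prob[row[column]]]


rows = [{"label": "a"}] * 9_000 + [{"label": "b"}] * 900 + [{"label": "c"}] * 100
sample = rebalance_sample(lambda: iter(rows), "label", 1_500, seed=7)
kept = Counter(row["label"] for row in sample)
```

With a target of 1,500 rows over three modalities, `a` and `b` each end up near 500 rows, while `c` keeps all of its 100 rows and stays under-represented because nothing is oversampled. A ratio-based variant could presumably reuse the same logic with the target derived from the total row count.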

### Class rebalancing (approximate ratio)

This method randomly selects approximately X% of the rows while trying to balance all modalities (distinct values) of a column equally.

It only undersamples and never oversamples, so some rare modalities may remain under-represented. In all cases, rebalancing is approximate.

This sampling method requires 2 full passes reading the data. The time taken by this method is thus linear with the size of the dataset.

## Exploration / Visual data preparation

For exploration and visual data preparation, additional sampling methods are available, thanks to the in-memory nature of these features.

See Sampling in explore for more information.
