> For the complete documentation index, see [llms.txt](https://cueobserve.cuebook.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://cueobserve.cuebook.ai/datasets.md).

# Datasets

Datasets are similar to aggregated SQL VIEWS of your data. When you run an anomaly detection job, the associated dataset's SQL query is run and the results are stored as a Pandas dataframe in memory.

![](/files/-MeoIhSs0RHpZwtTDEvw)

You write a SQL GROUP BY query with aggregate functions to roll-up your data. You then map the columns as dimensions or measures.

![](/files/-MeoEhoM9pc_c1THRdFn)

1. Dataset must have only one timestamp column. This timestamp column is used to generate timeseries data for anomaly detection.
2. Dataset must have at least one aggregate column. CueObserve currently supports only COUNT or SUM as aggregate functions. Aggregate columns must be mapped as measures.
3. Dataset can have one or more dimension columns (optional).
4. Dataset can be classified as a non-rollup dataset, details are provided below.

### **Non-Rollup Datasets**

A dataset can be created as a non-rollup dataset using a switch to inform the system that it does not need to roll up aggregate the data during the pre-processing of the data.

![Non Roll-up switch](/files/pVM19lbQhbNCSaFVEM3I)

By default, all datasets are "rolled up" i.e. metric data points are aggregated(summed up) on the timestamp buckets for a specific dimension value.

But for metrics like percentage etc. such aggregation might not be relevant, so one can specify to the system that it is a non-rollup dataset. Currently we support only single dimension on Non-rollup datasets to avoid duplicate timestamp values after pre-processing.

## SQL GROUP BY Query

Your SQL must group by timestamp and all dimension columns. You must truncate the timestamp column to HOUR or DAY before grouping. For example, if you want to track hourly anomalies on the dataset, truncate the timestamp to HOUR.

Below is a sample GROUP BY query for BigQuery. See [Data Sources](/sources.md) for sample queries on other databases and data warehouses.

```sql
SELECT
TIMESTAMP_TRUNC(CreatedTS, DAY) as OrderDate, -- HOUR or DAY granularity
City, State, -- dimensions
COUNT(1) as Orders, SUM(IFNULL(Order_Amount,0)) as OrderAmount -- measures
FROM ORDERS
WHERE CreatedTS >= TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY), INTERVAL 400 DAY)  -- limit historical data to use for forecasting
GROUP BY 1, 2, 3
ORDER BY 1
```

Since the last time bucket might be partial, CueObserve ignores the last time bucket when generating timeseries.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://cueobserve.cuebook.ai/datasets.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
