Data warehouse queries

Analytical SQL queries that efficiently process entire streams

Beneath automatically writes records to a data warehouse to enable analytical SQL queries (OLAP). Check out Unified data system to learn more about the different formats Beneath stores data in.

Query examples

Assume we have a stream of page views with path example/project/page-views and schema:

type PageView @schema {
  user_id: Int! @key
  time: Timestamp! @key
  url: String
}

We can count the total number of page views:

SELECT count(*)
FROM `example/project/page-views`

or count the number of unique visitors per day:

SELECT timestamp_trunc(time, DAY), count(distinct user_id)
FROM `example/project/page-views`
GROUP BY timestamp_trunc(time, DAY)

Query syntax

Under the hood, Beneath uses Google BigQuery as its data warehouse. It generally shares the same syntax as BigQuery, and you can use most of BigQuery’s built-in functions.

The main deviation from BigQuery is in how you reference streams: Streams are given as paths quoted with backticks (`), for example `example/project/page-views` (see examples above for more).

Deduplicating by key

Every record in a stream is added to the warehouse. That means if you write multiple records with the same schema key, they will all be processed by the query. To support deduplication of records with the same key, we include two hidden columns for each record:

  • __key contains a bytes-serialized representation of the record’s unique key
  • __timestamp contains the write timestamp for the record

So the following pattern can be used to deduplicate records:

WITH deduplicated AS (
  SELECT r.* FROM (
    SELECT array_agg(t ORDER BY t.__timestamp DESC LIMIT 1)[OFFSET(0)] r
    FROM `username/project/stream` t
    GROUP BY t.__key
  )
)
SELECT count(*)
FROM deduplicated