Production data science projects often need to integrate several different data technologies to adequately consume and query data. To provide a seamless experience, when you write data to a table in Beneath, it is automatically replicated to:

- a streaming log
- a data warehouse
- an operational data store
Beneath lets you access these systems through a single layer of abstraction, so you can focus on your use cases without getting bogged down in integration and maintenance. The following sections explain each of these technologies in more detail.
The streaming log keeps a real-time, ordered history of every record written to a table. It lets you replay that history as if you had been subscribed since the table’s beginning, then stay subscribed for updates without missing a single change. If your code is down for a while or only runs periodically, you get every change that happened in the meantime once you reconnect (an at-least-once delivery guarantee).
The streaming log makes many things simpler, like filtering data in real-time, enriching incoming data with machine learning, or synchronizing data to an external system.
(Systems that can serve as a streaming log are sometimes called event logs or message queues, and stand-alone implementations include Apache Kafka, Amazon Kinesis, Cloud Pub/Sub and RabbitMQ).
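To make this concrete, here is a minimal sketch of the replay-then-subscribe pattern in Python. It assumes the Beneath Python client exposes a `consume` helper; the table path `examples/demo/readings` is hypothetical:

```python
import asyncio
import beneath

async def main():
    # Called once per record: first for the replayed history,
    # then for new records as they arrive (at-least-once delivery)
    async def callback(record):
        print(record)

    # Hypothetical table path; replace with your own table
    await beneath.consume("examples/demo/readings", callback)

asyncio.run(main())
```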
The data warehouse stores records for analytical processing with SQL, making it ideal for business intelligence and ad-hoc exploration. It’s slow at finding individual records, but lets you scan and analyze an entire table in seconds.
(Systems that serve as a data warehouse are sometimes called data lakes or OLAP databases, and stand-alone implementations include BigQuery, Snowflake, Redshift and Hive).
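As a sketch, an ad-hoc analytical query from Python might look like the following. It assumes the Beneath Python client exposes a `query_warehouse` helper; the table path and column names are hypothetical:

```python
import asyncio
import beneath

async def main():
    # Scans the warehouse replica of an entire table with SQL
    # (the table path and columns are hypothetical)
    result = await beneath.query_warehouse(
        "SELECT city, AVG(temperature) AS avg_temp "
        "FROM `examples/demo/readings` "
        "GROUP BY city "
        "ORDER BY avg_temp DESC"
    )
    print(result)

asyncio.run(main())
```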
The operational data store enables fast, indexed lookups of individual records or specific ranges of records. It allows you to fetch records in milliseconds, thousands of times per second, which is useful when rendering a website or serving an API.
In Beneath, records are currently indexed based on their unique key (see Tables for more). For tables that contain multiple records with the same unique key (for example due to updates), the operational data store only indexes the most recent record.
(Key-value stores and OLTP databases, while broader categories, often serve as operational data stores, and popular stand-alone implementations include MongoDB, Postgres, Cassandra and Bigtable).
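A point lookup on the unique key might look like this sketch. The client methods shown (`find_table`, `query_index`, `read_next`) and the table path are assumptions about the Python client, not confirmed API:

```python
import asyncio
import json
import beneath

async def main():
    client = beneath.Client()
    await client.start()

    # Fetch the most recent record for a specific key from the
    # operational data store (method names and the table path
    # are assumptions for illustration)
    table = await client.find_table("examples/demo/forecasts")
    cursor = await table.query_index(filter=json.dumps({"city": "Copenhagen"}))
    records = await cursor.read_next(limit=1)
    print(records)

    await client.stop()

asyncio.run(main())
```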
To illustrate how these systems work in tandem, imagine you’re building a weather forecasting website. Every time you get new weather data, you write it to Beneath and it becomes available in every system. The streaming log instantly pushes the data to your weather prediction model, which uses it to compute an updated forecast that it writes back into Beneath. Every time someone visits your website, you serve them the most recent forecast from the operational data store. Once a day, you re-train your weather prediction model with a complex SQL query that runs in the data warehouse.
We’re not interested in reinventing the wheel, so under the hood, Beneath uses battle-tested data technologies. The cloud version of Beneath uses a combination of Google Bigtable and Cloud Pub/Sub for the streaming log, Google BigQuery as the data warehouse, and Google Bigtable as the operational data store. If you self-host Beneath, we can provide drivers for a variety of other technologies. While the choice of underlying technologies has certain implications, Beneath generally abstracts away many of the differences.
In addition to the three data technologies described above, some rarer technologies are worth mentioning, such as graph databases (for querying networks of data) and full-text search systems (for advanced search). We’re committed to covering more data access paradigms, so if Beneath doesn’t currently serve your use case, we would love to hear from you.