Tuesday, April 18, 2023

Storing and Retrieving State Changes

I'm exploring how to design a database that stores state changes together with the timestamp at which they occurred. The goal is fast access to all states within a given time period (e.g. 5 seconds). Inserting new values shouldn't be excessively slow, but the number one priority is query speed.

The data is generated like this:

ts: 1672527600010, field_0001 = true, field_0002 = 1491, field_0003 = "v07:01"
ts: 1672527600010, field_1001 = 47, field_1002 = -83
ts: 1672527600020, field_0002 = 595
ts: 1672527600050, field_1001 = 150, field_1003 = 75

The timestamp is in milliseconds since the Unix epoch. Only the states of the fields that changed are transmitted, and every block of fields is transmitted separately. It is not guaranteed that the timestamps arrive in the correct order. There are 10 blocks with a total of 250 fields.

Querying this data with ts_begin = 1672527600000 and ts_end = 1672527605000, the result should look like this:

ts_occurred_ms  field_0001  field_0002  field_0003  field_1001  field_1002  field_1003
1672527600010   true        1491        "v07:01"    47          -83         NULL
1672527600020   true        595         "v07:01"    47          -83         NULL
1672527600050   true        595         "v07:01"    150         -83         75

The result should combine all tables into one and fill empty cells with the last available value for each field, so that the user can see the state of every field at every timestamp.

What are some best practices for designing such a database?

What kind of schema should I use to store the state changes?

The first idea I came up with was to create a table for each block of fields:

data_0000(ts_corrected, ts_occurred_ms, field_0001, field_0002, field_0003, ...)
data_1000(ts_corrected, ts_occurred_ms, field_1001, field_1002, field_1003)
data_2000(ts_corrected, ts_occurred_ms, field_2001, field_2002, field_2003, ...)
...

ts_corrected and ts_occurred_ms together form the primary key of each table.
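
Roughly as DDL (the concrete column types are just for illustration; only the key structure matters here), one block table looks like this:

CREATE TABLE data_0000 (
    ts_corrected   BIGINT NOT NULL,
    ts_occurred_ms BIGINT NOT NULL,   -- milliseconds since Unix epoch
    field_0001     BOOLEAN,
    field_0002     INTEGER,
    field_0003     TEXT,
    -- ... remaining fields of block 0000
    PRIMARY KEY (ts_corrected, ts_occurred_ms)
);

To produce the combined result above, I join the block tables on the timestamp and carry the last non-NULL value of every field forward: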

WITH data AS (
    SELECT
        -- whichever block table has a row at this timestamp provides it
        COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms) AS ts_occurred_ms,
        -- carry the last non-NULL value of each field forward
        LAST_VALUE(field_0001) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_0001,
        LAST_VALUE(field_0002) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_0002,
        LAST_VALUE(field_0003) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_0003,
        LAST_VALUE(field_1001) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_1001,
        LAST_VALUE(field_1002) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_1002,
        LAST_VALUE(field_1003) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_1003,
        LAST_VALUE(field_2001) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_2001,
        LAST_VALUE(field_2002) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_2002,
        LAST_VALUE(field_2003) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_2003
    FROM data_0000
        FULL JOIN (SELECT * FROM data_1000 WHERE ts_corrected = 0 AND ts_occurred_ms < 1672527605000) data_1000 ON data_0000.ts_occurred_ms = data_1000.ts_occurred_ms
        FULL JOIN (SELECT * FROM data_2000 WHERE ts_corrected = 0 AND ts_occurred_ms < 1672527605000) data_2000 ON COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms) = data_2000.ts_occurred_ms
)
SELECT *
FROM data
WHERE ts_occurred_ms BETWEEN 1672527600000 AND 1672527605000
ORDER BY ts_occurred_ms;

Unfortunately, for a mere 150,000 state changes this query already takes about 7 seconds in my test environment.

I didn't find a way to make this query more efficient: the subqueries have to join the whole tables, because a column's value may not have changed for hundreds of thousands of rows, so the last value before the query window can lie arbitrarily far in the past.
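
To illustrate the alternative: seeding just a single field at the start of the window would require a per-field lookup like the sketch below, and there would have to be one such probe for each of the 250 fields.

-- last known value of field_0002 before the query window (illustration only)
SELECT field_0002
FROM data_0000
WHERE ts_corrected = 0
  AND ts_occurred_ms < 1672527600000
  AND field_0002 IS NOT NULL
ORDER BY ts_occurred_ms DESC
LIMIT 1;   -- LIMIT / FETCH FIRST syntax depends on the database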

The second idea was to create a single table for all changes.

data(ts_corrected, ts_occurred_ms, field_id, value)

This significantly increases the storage space needed, as every field change also stores the corresponding timestamp.
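
As a rough DDL sketch (types again simplified; all values go into one text column so that booleans, numbers and strings fit, and the key shown is my working assumption), together with the first sample block from above:

CREATE TABLE data (
    ts_corrected   BIGINT  NOT NULL,
    ts_occurred_ms BIGINT  NOT NULL,   -- milliseconds since Unix epoch
    field_id       INTEGER NOT NULL,
    value          TEXT,               -- one column for all value types
    PRIMARY KEY (ts_corrected, ts_occurred_ms, field_id)
);

-- the first sample block becomes three rows instead of one
INSERT INTO data (ts_corrected, ts_occurred_ms, field_id, value) VALUES
    (0, 1672527600010, 0001, 'true'),
    (0, 1672527600010, 0002, '1491'),
    (0, 1672527600010, 0003, 'v07:01');

The reconstruction query then pivots the rows back into one column per field and fills the gaps from preceding rows: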

WITH cte AS (
    SELECT ts_occurred_ms,
        -- pivot: one column per field, NULL where the field did not change at this timestamp
        MAX(CASE WHEN ts_corrected = 0 AND field_id = 0001 THEN value END) AS field_0001,
        MAX(CASE WHEN ts_corrected = 0 AND field_id = 0002 THEN value END) AS field_0002,
        MAX(CASE WHEN ts_corrected = 0 AND field_id = 0003 THEN value END) AS field_0003,
        MAX(CASE WHEN ts_corrected = 0 AND field_id = 1001 THEN value END) AS field_1001,
        MAX(CASE WHEN ts_corrected = 0 AND field_id = 1002 THEN value END) AS field_1002,
        MAX(CASE WHEN ts_corrected = 0 AND field_id = 1003 THEN value END) AS field_1003,
        MAX(CASE WHEN ts_corrected = 0 AND field_id = 2001 THEN value END) AS field_2001,
        MAX(CASE WHEN ts_corrected = 0 AND field_id = 2002 THEN value END) AS field_2002,
        MAX(CASE WHEN ts_corrected = 0 AND field_id = 2003 THEN value END) AS field_2003
    FROM (
        SELECT * FROM data WHERE ts_occurred_ms BETWEEN 1672527600000 AND 1672527605000
        UNION ALL
        -- all history before the window, so that the fill-up below has starting values
        SELECT *
        FROM ( SELECT * FROM data WHERE ts_occurred_ms < 1672527600000 ORDER BY ts_occurred_ms DESC ) t1
    ) t2
    GROUP BY ts_occurred_ms
),
cte2 AS (
    SELECT ts_occurred_ms,
        -- fill NULL cells using values from the preceding rows
        COALESCE(field_0001, MAX(field_0001) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_0001,
        COALESCE(field_0002, MAX(field_0002) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_0002,
        COALESCE(field_0003, MAX(field_0003) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_0003,
        COALESCE(field_1001, MAX(field_1001) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_1001,
        COALESCE(field_1002, MAX(field_1002) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_1002,
        COALESCE(field_1003, MAX(field_1003) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_1003,
        COALESCE(field_2001, MAX(field_2001) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_2001,
        COALESCE(field_2002, MAX(field_2002) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_2002,
        COALESCE(field_2003, MAX(field_2003) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_2003
    FROM cte
)
SELECT *
FROM cte2
WHERE ts_occurred_ms BETWEEN 1672527600000 AND 1672527605000
ORDER BY ts_occurred_ms;

This is a lot faster, but once I add a couple of million state changes, queries also begin to take a long time. Maybe I am missing something.
