I'm exploring how to design a database that can store state changes together with the timestamp at which they occurred. The goal is fast access to all states within a given time period (e.g. 5 seconds). Inserting new values shouldn't be super slow, but the number one priority is querying speed.
The data is generated like this:
ts: 1672527600010, field_0001 = true, field_0002 = 1491, field_0003 = "v07:01"
ts: 1672527600010, field_1001 = 47, field_1002 = -83
ts: 1672527600020, field_0002 = 595
ts: 1672527600050, field_1001 = 150, field_1003 = 75
The timestamp is in milliseconds since the Unix epoch. Only the fields whose state changed are transmitted, and each block of fields is transmitted separately. Timestamps are not guaranteed to arrive in the correct order. There are 10 blocks with a total of 250 fields.
Querying this data with ts_begin = 1672527600000 and ts_end = 1672527605000, the result should look like this:
ts_occurred_ms | field_0001 | field_0002 | field_0003 | field_1001 | field_1002 | field_1003 |
---|---|---|---|---|---|---|
1672527600010 | true | 1491 | "v07:01" | 47 | -83 | NULL |
1672527600020 | true | 595 | "v07:01" | 47 | -83 | NULL |
1672527600050 | true | 595 | "v07:01" | 150 | -83 | 75 |
The result should combine all tables into one and fill empty cells with the last available value for each field, so that the user can see the state of all fields at every timestamp.
What are some best practices for designing such a database?
What kind of schema should I use to store the state changes?
The first idea I came up with was to create a table for each block of fields:
data_0000(ts_corrected, ts_occurred_ms, field_0001, field_0002, field_0003, ...)
data_1000(ts_corrected, ts_occurred_ms, field_1001, field_1002, field_1003)
data_2000(ts_corrected, ts_occurred_ms, field_2001, field_2002, field_2003, ...)
...
ts_corrected and ts_occurred_ms combined are the primary key in each table.
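For clarity, a minimal DDL sketch of what I have in mind for one block (the column types are my assumption, based on the example values above):

CREATE TABLE data_0000 (
    ts_corrected   SMALLINT NOT NULL,  -- correction marker; I filter on ts_corrected = 0 in the queries below
    ts_occurred_ms BIGINT   NOT NULL,  -- milliseconds since Unix epoch
    field_0001     BOOLEAN,
    field_0002     INTEGER,
    field_0003     TEXT,
    -- ... remaining fields of this block
    PRIMARY KEY (ts_corrected, ts_occurred_ms)
);

To produce the combined view, I then run a query like this: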
WITH data AS (
    SELECT
        COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms) AS ts_occurred_ms,
        LAST_VALUE(field_0001) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_0001,
        LAST_VALUE(field_0002) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_0002,
        LAST_VALUE(field_0003) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_0003,
        LAST_VALUE(field_1001) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_1001,
        LAST_VALUE(field_1002) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_1002,
        LAST_VALUE(field_1003) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_1003,
        LAST_VALUE(field_2001) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_2001,
        LAST_VALUE(field_2002) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_2002,
        LAST_VALUE(field_2003) IGNORE NULLS OVER (ORDER BY COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms, data_2000.ts_occurred_ms)) AS field_2003
    FROM data_0000
    FULL JOIN (SELECT * FROM data_1000 WHERE ts_corrected = 0 AND ts_occurred_ms < 1672527605000) data_1000
        ON data_0000.ts_occurred_ms = data_1000.ts_occurred_ms
    FULL JOIN (SELECT * FROM data_2000 WHERE ts_corrected = 0 AND ts_occurred_ms < 1672527605000) data_2000
        ON COALESCE(data_0000.ts_occurred_ms, data_1000.ts_occurred_ms) = data_2000.ts_occurred_ms
)
SELECT *
FROM data
WHERE ts_occurred_ms BETWEEN 1672527600000 AND 1672527605000
ORDER BY ts_occurred_ms;
Unfortunately, for a mere 150,000 state changes this query already takes about 7 seconds in my test environment.
I haven't found a way to make it more efficient: the subqueries have to join the whole tables, because a field's value may not have changed for hundreds of thousands of rows, so its last value can lie arbitrarily far before the query window.
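What I would like to avoid is scanning everything before the window just to find, per column, the last non-null value. Conceptually that is a single top-1 lookup per column, something like this (a sketch, assuming LIMIT is available):

SELECT field_0002
FROM data_0000
WHERE ts_corrected = 0
  AND ts_occurred_ms < 1672527600000
  AND field_0002 IS NOT NULL
ORDER BY ts_occurred_ms DESC
LIMIT 1;

Doing that for 250 fields and stitching the results into the windowed rows is exactly the part I don't know how to express efficiently.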
The second idea was to create a single table for all changes:
data(ts_corrected, ts_occurred_ms, field_id, value)
This significantly increases the storage space needed, as every single field change now carries its own timestamp (and field id).
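A sketch of that layout (types are again my assumption; value presumably has to be text or some variant type so it can hold booleans, numbers and strings):

CREATE TABLE data (
    ts_corrected   SMALLINT NOT NULL,
    ts_occurred_ms BIGINT   NOT NULL,   -- milliseconds since Unix epoch
    field_id       INTEGER  NOT NULL,
    value          TEXT,                -- assumption: all values stored as text
    PRIMARY KEY (ts_corrected, ts_occurred_ms, field_id)  -- my guess at a reasonable key
);

The combined view is then built like this: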
WITH cte AS (
    SELECT ts_occurred_ms,
           MAX(CASE WHEN ts_corrected = 0 AND field_id = 0001 THEN value END) AS field_0001,
           MAX(CASE WHEN ts_corrected = 0 AND field_id = 0002 THEN value END) AS field_0002,
           MAX(CASE WHEN ts_corrected = 0 AND field_id = 0003 THEN value END) AS field_0003,
           MAX(CASE WHEN ts_corrected = 0 AND field_id = 1001 THEN value END) AS field_1001,
           MAX(CASE WHEN ts_corrected = 0 AND field_id = 1002 THEN value END) AS field_1002,
           MAX(CASE WHEN ts_corrected = 0 AND field_id = 1003 THEN value END) AS field_1003,
           MAX(CASE WHEN ts_corrected = 0 AND field_id = 2001 THEN value END) AS field_2001,
           MAX(CASE WHEN ts_corrected = 0 AND field_id = 2002 THEN value END) AS field_2002,
           MAX(CASE WHEN ts_corrected = 0 AND field_id = 2003 THEN value END) AS field_2003
    FROM (
        SELECT * FROM data WHERE ts_occurred_ms BETWEEN 1672527600000 AND 1672527605000
        UNION ALL
        SELECT *
        FROM (SELECT * FROM data WHERE ts_occurred_ms < 1672527600000 ORDER BY ts_occurred_ms DESC) t1
    ) t2
    GROUP BY ts_occurred_ms
),
cte2 AS (
    SELECT ts_occurred_ms,
           COALESCE(field_0001, MAX(field_0001) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_0001,
           COALESCE(field_0002, MAX(field_0002) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_0002,
           COALESCE(field_0003, MAX(field_0003) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_0003,
           COALESCE(field_1001, MAX(field_1001) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_1001,
           COALESCE(field_1002, MAX(field_1002) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_1002,
           COALESCE(field_1003, MAX(field_1003) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_1003,
           COALESCE(field_2001, MAX(field_2001) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_2001,
           COALESCE(field_2002, MAX(field_2002) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_2002,
           COALESCE(field_2003, MAX(field_2003) OVER (ORDER BY ts_occurred_ms ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS field_2003
    FROM cte
)
SELECT *
FROM cte2
WHERE ts_occurred_ms BETWEEN 1672527600000 AND 1672527605000;
This is a lot faster, but once I add a couple of million state changes, queries start to take a long time as well. Maybe I am missing something.
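One thing I'm unsure about is whether indexing alone can save this layout. The index I would assume this access pattern needs looks roughly like this (name and column order are my guess):

-- supports per-field range scans and "latest value before the window" lookups
CREATE INDEX idx_data_field_ts
    ON data (ts_corrected, field_id, ts_occurred_ms);

Even with that, rarely-changing fields still force the gap filling to look far back into history, which is why I suspect the schema itself is the problem.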