Question Details

No question body available.

Tags

design memory estimation

Answers (3)

July 23, 2025 Score: 2 Rep: 220,779 Quality: Low Completeness: 20%

When using a 32-byte hash function for a number of IDs in the range of 5 million, the risk of hash collisions is extremely low, so that is certainly a workable approach. Still, you have to make sure no jobid is used twice for different jobs, not even accidentally, so you will need a central registry for the IDs.
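To put a rough number on "extremely low", here is a minimal sketch using the standard birthday-bound approximation. The function name `collision_probability` is made up for illustration, and the estimate assumes the hash output is uniformly distributed.

```python
import math


def collision_probability(n_keys: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among n_keys hashes.

    Birthday bound: p ~= 1 - exp(-n * (n - 1) / 2^(bits + 1)),
    assuming uniformly distributed hash values.
    """
    exponent = -n_keys * (n_keys - 1) / 2.0 ** (hash_bits + 1)
    # -expm1(x) computes 1 - exp(x) without losing precision for tiny x.
    return -math.expm1(exponent)


if __name__ == "__main__":
    # 5 million job IDs, as in the question; 256 bits is the 32-byte case.
    for bits in (32, 64, 128, 256):
        print(f"{bits:>3} bits: {collision_probability(5_000_000, bits):.3e}")
```

For 5 million keys, a 32-bit hash is practically guaranteed to collide, while 128 bits or more makes a collision vanishingly unlikely.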

But why complicate things with a hash function at all? Just introduce a unique, automatically generated job number, perhaps padded with leading zeros to get fixed-width numbers. Or, if you want to avoid the central registry, you can add a jobGUID field filled with GUIDs. That is the standard solution for distributed systems.
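A minimal sketch of both options, for illustration only. The in-process counter stands in for whatever central registry would actually hand out numbers (a database sequence, for example), and the names `next_job_number` and `new_job_guid` are hypothetical.

```python
import itertools
import uuid

# Assumption: a single in-process counter plays the role of the central registry.
_job_counter = itertools.count(1)


def next_job_number(width: int = 10) -> str:
    """Return the next sequential job number, zero-padded to a fixed width."""
    return f"{next(_job_counter):0{width}d}"


def new_job_guid() -> str:
    """Return a random (version 4) GUID; no central registry is required."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    print(next_job_number())
    print(new_job_guid())
```

The trade-off is size versus coordination: the sequential number is compact but needs one authority to assign it, while a GUID takes 16 bytes in binary form but can be generated independently on any host.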

July 23, 2025 Score: 2 Rep: 31,152 Quality: Low Completeness: 40%

If I understand the proposal correctly, you will need to track the hash of the jobts in some other location (hashes are not reversible).

If you already have to do that, there's no reason to limit yourself to hashes. You could create a sequential ID or use any other scheme you like. With 32 bits, you will be good for billions of jobts before you run out of unique IDs. You will need to make sure these IDs are unique before you associate logs with them, of course.

July 25, 2025 Score: 1 Rep: 312 Quality: Low Completeness: 100%

"We have a system that periodically polls and monitors host-level jobs. On every poll, a given job can emit multiple timeseries.
...
Each timeseries is identified by its timeseries id (e.g. "cpu", "memory"). It can also be user-defined.
...
The cardinality of job ids is ~5M and the cardinality of timeseries ids is ~1B
...
We want to store the datapoints of every job timeseries into a key value store. I figured the key could be: hash(jobts) + fixedwidthunixtimestamp. The value is the datapoint value (e.g. cpu of the job at time T).
...
I want to optimize for our memory footprint on the key value store - what are some techniques/tradeoffs I should consider?"

Map "CPU", "memory", and any user-defined string to a 32-bit number stored in a key map table; lookups are slower, but the memory savings are considerable.
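A minimal sketch of such a key map table, assuming it fits in memory and that IDs are handed out sequentially. The class name `KeyMap` and its methods are made up for illustration; a real deployment would need to persist the table somewhere.

```python
class KeyMap:
    """Assigns each distinct string a small integer ID that fits in 32 bits."""

    MAX_ID = 0xFFFFFFFF  # largest value an unsigned 32-bit integer can hold

    def __init__(self) -> None:
        self._id_by_name: dict[str, int] = {}
        self._name_by_id: list[str] = []

    def intern(self, name: str) -> int:
        """Return the ID for name, allocating the next free ID on first use."""
        existing = self._id_by_name.get(name)
        if existing is not None:
            return existing
        new_id = len(self._name_by_id)
        if new_id > self.MAX_ID:
            raise OverflowError("32-bit ID space exhausted")
        self._id_by_name[name] = new_id
        self._name_by_id.append(name)
        return new_id

    def name_of(self, int_id: int) -> str:
        """Reverse lookup: recover the original string from its ID."""
        return self._name_by_id[int_id]


if __name__ == "__main__":
    key_map = KeyMap()
    print(key_map.intern("cpu"), key_map.intern("memory"), key_map.intern("cpu"))
```

Each long string is then stored once in the table, and every key in the store carries only the 4-byte ID.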

Forget the hash. Instead of xxhash128(jobidstring + timeseriesidstring), your "series ID" (the unique identifier for jobid + timeseriesid) becomes a composite of the two integer IDs.
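As a rough sketch of what a full key built from such a composite might look like: the layout below is an assumption, not something specified in the question. It uses two unsigned 32-bit IDs followed by a 64-bit Unix timestamp, all big-endian, and the names `make_key` and `parse_key` are hypothetical.

```python
import struct

# Assumed layout: job ID (uint32), timeseries ID (uint32), Unix timestamp (uint64).
# Big-endian, so byte-wise key ordering matches numeric ordering.
KEY_FORMAT = ">IIQ"
KEY_SIZE = struct.calcsize(KEY_FORMAT)  # 16 bytes


def make_key(job_int_id: int, ts_int_id: int, unix_ts: int) -> bytes:
    """Pack the two integer IDs and the timestamp into one fixed-width key."""
    return struct.pack(KEY_FORMAT, job_int_id, ts_int_id, unix_ts)


def parse_key(key: bytes) -> tuple[int, int, int]:
    """Unpack a key back into (job_int_id, ts_int_id, unix_ts)."""
    return struct.unpack(KEY_FORMAT, key)


if __name__ == "__main__":
    key = make_key(42, 7, 1_753_228_800)
    print(len(key), parse_key(key))
```

With this layout the series identifier takes 8 bytes instead of the 16 bytes of a 128-bit hash, and in an ordered key value store all datapoints of one series sit next to each other, sorted by time.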

You could combine these two int32 values into a single int64 (8-byte) value. For instance, you could concatenate them (e.g., (jobintid