What if we told you that there was a way to dramatically reduce the cost to operate on cloud providers? That’s what we found when we dug into the different data structures offered in Redis. Before we committed to one, we did some research into the difference in memory usage between using HASH versus using SET. We found that HASHes can be used to potentially reduce the cost to operate by at least 50% over SET. Read on to see how we determined this and how you can put these savings into practice for yourself!
Introduction
Redis is an in-memory data structure store, often used as a distributed cache, that enables rapid operations. It offers simple key-value storage as well as more complex data structures such as lists, hashes (map/dict), sorted sets, and more. This blog post assumes some understanding of Redis; you can read more in the official Redis documentation.
Salesforce Activity Platform apps and customers have a configurable data retention period; we TTL (time-to-live)/delete any data older than that setting. Our data can live either at the Organization (parent) level or at the User (child) level. A parent-level deletion job is recognized as complete if and only if all of its child-level jobs are complete. Our deletion pipeline fanned out child jobs for parallel execution and used a Mongo set data structure to track which child jobs had completed. However, this fanout put a massive load on Mongo because each child job had to update its respective parent set, so the number of database calls scaled linearly with the average number of users per org. We decided to move this set to Redis to reduce the load on Mongo.
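To make the pattern concrete, here is a minimal sketch of what this kind of completion tracking can look like in Redis using a plain SET. The key layout (deletion:{org_id}:done), the expected_child_count parameter, and the function names are illustrative, not our production code.

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def mark_child_complete(org_id: str, user_id: str) -> None:
    # Hypothetical key layout: one "completed children" set per parent org.
    r.sadd(f"deletion:{org_id}:done", user_id)

def parent_job_complete(org_id: str, expected_child_count: int) -> bool:
    # The parent deletion job is complete if and only if every child job
    # has reported completion.
    return r.scard(f"deletion:{org_id}:done") >= expected_child_count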
Since this deletion job runs daily, the amount of data we delete and the number of jobs we execute fluctuate. To provision a properly sized cache, we performed memory usage calculations for our use case. The testing in this post was done on more generalized test data, but the findings and the optimizations we chose are still applicable.
Obviously use built-in Redis SET, right?
Redis offers a SET data structure that is very intuitive and simple to use. However, it’s healthy to be cautious about datastore implementations of data structures. Our initial concerns revolved around the extra memory needed to maintain the SET data structure. Could there be a trade-off in using the data structure that is simpler to interact with? The Redis docs note memory optimizations for small HASHes, but we wanted to see if it would benefit our use case.
Luckily, simulating a SET using a HASH data structure requires minimal additional work. To do so, we paired each key (what we would want to store in the SET) with a small value such as “1.” In reality, this value could be anything; since it is not important to our operation, it is ideal to keep it small. Effectively, the “key-set” of our HASH is our “set” data structure.
Our strategy results in a view of the HASH like this:
"redisHashKey":
{
"valueToStore1": 1,
"valueToStore2": 1,
"valueToStore3": 1,
"valueToStore4": 1
}
> HSET hashKey valueToStore 1 # insert
> HDEL hashKey valueToDelete # delete
> HKEYS hashKey # retrieve "set"
> HLEN hashKey # retrieve size
as opposed to…
"redisSetKey": {"valueToStore1","valueToStore2","valueToStore3","valueToStore4"}> SADD setKey valueToStore # insert
> SREM setKey valueToDelete # delete
> SMEMBERS setKey # retrieve set
> SCARD setKey # retrieve size
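Since the rest of our testing uses redis-py, here is roughly how the same operations look from Python. A minimal sketch; the key names are just placeholders.

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# HASH used as a "set": the field is the member, the value is a throwaway "1".
r.hset("hashKey", "valueToStore", 1)   # insert
r.hdel("hashKey", "valueToDelete")     # delete
members = r.hkeys("hashKey")           # retrieve "set"
size = r.hlen("hashKey")               # retrieve size

# Native SET equivalents.
r.sadd("setKey", "valueToStore")       # insert
r.srem("setKey", "valueToDelete")      # delete
members = r.smembers("setKey")         # retrieve set
size = r.scard("setKey")               # retrieve size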
From testing, we see that HASH can actually be better than SET for tracking unique elements. We’ve been able to cut our memory usage by more than half by employing this simple trick.
“hash-max-ziplist-entries” config for HASH
If using HASH, there is a mysterious config to be aware of that isn’t heavily documented. You can read more about it in the official Redis docs as well as a great blog post from Peter Bengtsson.
Effectively, this config can be thought of as: “Redis will employ tricks to optimize memory usage for HASHes whose number of fields is below the configured value.” Beyond that value, a HASH effectively has the same memory usage as storing plain key-value pairs.
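You can watch the optimization switch over by observing a HASH’s internal encoding as it crosses the threshold. A minimal sketch, assuming a local, disposable Redis instance and redis-py; note that on newer Redis versions the small-hash encoding is reported as “listpack” rather than “ziplist,” and the config also has a hash-max-listpack-entries alias.

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Check the current threshold (512 by default).
print(r.config_get("hash-max-ziplist-entries"))

# Lower the threshold so the switch is easy to observe (test instance only).
r.config_set("hash-max-ziplist-entries", 4)

r.delete("demo")
for i in range(4):
    r.hset("demo", f"field{i}", 1)
# At or below the threshold: compact encoding ("ziplist" or "listpack").
print(r.object("encoding", "demo"))

r.hset("demo", "field4", 1)  # cross the threshold
# Above the threshold: regular "hashtable" encoding.
print(r.object("encoding", "demo"))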
Maybe not! We tested it out.
We performed our tests with a simple script that:
- Flushed the cache
- Assigned the config value of “hash-max-ziplist-entries”
- Inserted multiple ranges of values to be stored (24-character random strings)
- Measured the memory usage of the data structure
The plots below show the memory usage of the data structure as the number of entries grows. They also show the first derivative of the original “num entries vs memory usage” curve, which equates to the “cost to add a new entry” as the data structure grows.
Here is a code snippet of the script used to generate these graphs:
import redis, string, random, csv
import numpy as np
import matplotlib.pyplot as plt

r = redis.Redis(host='localhost', port=6379, db=0)
config_name = 'hash-max-ziplist-entries'
hashKey = 'hash1'
setKey = 'set1'

def get_random_entry():
    return ''.join(random.choice(string.ascii_lowercase) for i in range(24))

def set_hash_max_ziplist(value):
    r.config_set(config_name, value)
    print('Setting config %s' % (r.config_get(config_name)))

def run_sadd_10k():  # adding to SET
    print('Flushing cache: %s' % r.flushall())
    x_num_entries = np.arange(10000)
    y_mem_usage = np.zeros(len(x_num_entries))
    for i in x_num_entries:
        r.sadd(setKey, get_random_entry())
        mem_usage = r.memory_usage(setKey)
        y_mem_usage[i] = mem_usage
    diff_y = np.diff(y_mem_usage)  # per-entry cost
    return x_num_entries, y_mem_usage, diff_y

def run_hset_10k():  # adding to HASH
    print('Flushing cache: %s' % r.flushall())
    x_num_entries = np.arange(10000)
    y_mem_usage = np.zeros(len(x_num_entries))
    for i in x_num_entries:
        r.hset(hashKey, get_random_entry(), 1)
        mem_usage = r.memory_usage(hashKey)
        y_mem_usage[i] = mem_usage
    diff_y = np.diff(y_mem_usage)  # per-entry cost
    return x_num_entries, y_mem_usage, diff_y

set_hash_max_ziplist(512)  # set hash-max-ziplist-entries to 512
x, y, dy = run_hset_10k()  # run_hset_10k or run_sadd_10k
csv_matrix = np.transpose(np.array([x, y, np.insert(dy, 0, 0, axis=0)]))
np.savetxt("run_hset_10k_512.csv", csv_matrix, delimiter=',',
           header="Num Entry,Mem Usage,Diff", comments="")

plt.title("HASH Num entries VS memory usage - 10k entries/512 config")
plt.xlabel("num entries")
plt.ylabel("memory usage (in bytes)")
plt.plot(x, y, color="red")
plt.show()
Adding to SET : hash-max-ziplist-entries=512 : 1k entries
Adding to SET : hash-max-ziplist-entries=512 : 10k entries
Adding to HASH : hash-max-ziplist-entries=512 : 1k entries
Adding to HASH : hash-max-ziplist-entries=512 : 10k entries
There are a few things we can see just by looking at the graphs for 1k entries with ‘hash-max-ziplist-entries’=512.
- When the number of entries is below hash-max-ziplist-entries, HASH uses less memory than SET, and it costs less to add a new entry.
- SET’s cost per new entry is roughly constant throughout (ignoring optimizations).
- HASH memory usage jumps once the entry count exceeds the configured value, but after that the growth is again roughly constant.
- Before exceeding the configured value, HASH grows linearly and is more predictable than SET.
- For a large number of entries (well past the configured value), SET uses less memory than HASH.
How does changing ‘hash-max-ziplist-entries’ affect memory usage?
Below, we change the config from 512 to 32768 and insert 50k entries as a dramatic example. SET does not appear to be affected by this change when comparing the 50k-entry runs at the two config values.
For HASH, however, we see a steady increase in memory usage from 0 to 32,767 entries and a spike at x=32768 (instead of x=512). This confirms the effect of changing this config.
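As a sketch of how this run differs from the earlier one: it reuses the helpers from the script above (set_hash_max_ziplist and the run_* functions), with the only assumed change being a wider range for the 50k-entry runs.

# Same measurement as before, but with a much larger threshold.
# Assumes r, set_hash_max_ziplist, and the run_* helpers from the earlier script;
# for 50k-entry runs, np.arange(10000) would be widened to np.arange(50000).
set_hash_max_ziplist(32768)
x, y, dy = run_hset_10k()   # or run_sadd_10k() for the SET comparison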
Does size increase of one HASH affect others?
An important question to ask is, if one HASH exceeds the config size that Redis optimizes for, does it nullify HASH optimizations for the entire cache? What if one of our “sets” is an outlier with more entries than expected?
For this, we’ll set the config value to 1028. We will pre-populate HASH2 and HASH3 with 900 and 700 entries respectively. Then we will record the memory usage of all three HASHes as HASH1 grows.
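A minimal sketch of this experiment, reusing r and get_random_entry() from the earlier script; the key names HASH1/HASH2/HASH3 and the 2,000-entry cap are illustrative.

# Assumes r and get_random_entry() from the earlier script.
r.flushall()
r.config_set("hash-max-ziplist-entries", 1028)

# Pre-populate the two "bystander" hashes below the threshold.
for _ in range(900):
    r.hset("HASH2", get_random_entry(), 1)
for _ in range(700):
    r.hset("HASH3", get_random_entry(), 1)

# Grow HASH1 past the threshold and watch all three.
for i in range(2000):
    r.hset("HASH1", get_random_entry(), 1)
    if i % 250 == 0:
        print(i,
              r.memory_usage("HASH1"),
              r.memory_usage("HASH2"),
              r.memory_usage("HASH3"))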
As you can see, HASH2 and HASH3 are unaffected by the loss of optimization from HASH1 exceeding the limit.
When do these spikes happen?
If we look at the “cost per new entry” graph for HASH or SET, we can see that the spikes happen at powers of two. These fluctuations in the cost to add a new entry are otherwise unexplained outliers.
We can infer that the increases come from extending the size of the backing collection, and the decreases may come from some kind of metadata compression or aggregation. However, the initial loss of optimization is the largest in percentage terms relative to current memory usage (511→512 is a 250% increase, 4095→4096 is 20%).
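You can locate these jumps by scanning the per-entry cost array (dy) produced by the earlier script; the 1,000-byte cutoff below is an arbitrary choice.

import numpy as np

# dy comes from the earlier script: dy[i] is the extra memory consumed by one insert.
# Large one-off jumps mark where the backing structure was resized.
spikes = np.where(dy > 1000)[0]   # 1000-byte cutoff is arbitrary
print(spikes)                     # expect the jumps to cluster at powers of two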
Conclusion
The data shows that, when possible, it is more space-efficient to represent a “set” in Redis as a map/dict, i.e. as a HASH rather than a SET. In our case, the majority of our sets will stay below the default value of hash-max-ziplist-entries, which is 512. And in the scenario where a set’s size exceeds the configured value (which we do run into sometimes), we can be confident that our existing and new HASHes will not be affected.
There is still more work to be done on the performance trade-offs of increasing the config value. Given that Redis is in-memory, it is already faster than we need, so it is ideal to keep the size of all your HASHes below the optimization limit. Although it is more of a separate topic, there are other techniques to help ensure an even distribution across multiple HASH objects, such as generating a hashcode from your object key and bucketing on it instead.
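For example, here is one minimal way such bucketing could look: spread a large logical set across several HASHes by hashing each member with a stable function. The bucket count, key format, and function names are illustrative assumptions.

import zlib
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

NUM_BUCKETS = 64  # keep each bucket comfortably under hash-max-ziplist-entries

def bucket_key(logical_set: str, member: str) -> str:
    # zlib.crc32 is stable across processes, unlike Python's built-in hash().
    bucket = zlib.crc32(member.encode()) % NUM_BUCKETS
    return f"{logical_set}:{bucket}"

def add(logical_set: str, member: str) -> None:
    r.hset(bucket_key(logical_set, member), member, 1)

def contains(logical_set: str, member: str) -> bool:
    return r.hexists(bucket_key(logical_set, member), member)

def size(logical_set: str) -> int:
    return sum(r.hlen(f"{logical_set}:{b}") for b in range(NUM_BUCKETS))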
Cost savings
At the time of writing, AWS ElastiCache pricing shows that cost scales directly with cache size: double the cache size, double the hourly cost to operate. From the plots showing the memory usage of 1k entries with hash-max-ziplist-entries configured to 512, there is a memory saving of ~56% using HASH instead of SET (14,403 bytes vs. 32,920 bytes).
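That figure is simple arithmetic on the two measurements: (32,920 − 14,403) / 32,920 ≈ 0.56, i.e. roughly a 56% reduction in memory for the same 1,000 members.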
Employing this simple trick can help scale down the instance type, potentially saving 50% on the cost to operate.
Redis already recommends provisioning an instance that is double your peak memory usage. Regardless, you should run these calculations on your own use case to see how much you can save.
Learn how you can join our team to tackle problems like this one by joining our Talent Network!