Big Data leads to Data Hoarding

I’ve said before that the three areas a Data Guardian will spend time on are Availability, Performance and Security of their databases.

There’s one trend in the industry that is endangering those responsibilities.

Keep watching to find out how Big Data affects the Data Guardian.

What is Data Hoarding?

Big Data is a term for collecting data from many sources, usually with the goal of predicting user behavior.

While beneficial to businesses, this can lead to bad hygiene practices when handling data.

Typically, applications made for users should be kept lean to keep performance stable.

But with storage being cheap and a growing urge to collect everything, applications are hoarding data in the hope that it might be useful someday.

How does this affect the Data Guardian’s area of responsibilities?

It directly impacts performance, availability, and security of the database.

How does data hoarding affect performance?

Databases must process data to return the results of the queries asked of them.

Large datasets require more resources for processing.

For instance, even seemingly simple queries that use indexes well will not scale under many concurrent requests against large datasets.

As a result, large datasets will hurt the performance of your database.

How does data hoarding affect availability?

For data availability, one of the main concerns is how long it takes to recover from a failure.

This recovery time is defined as the Mean Time To Recover (MTTR).

As you might expect, the larger the dataset, the longer it takes to recover completely.

Vitess, a technology for automatic sharding of MySQL, recommends shard sizes stay below 250GB to ensure recovery time is kept relatively low.

This is not a bad rule of thumb.

Be careful, though, if you don't already have a method in place for managing complex database systems.
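
To make the link between data size and recovery time concrete, here is a minimal back-of-the-envelope sketch in Python. The function name and the 100 MB/s restore throughput are assumptions for illustration only; real restores also involve log replay and cache warm-up, so measure your own numbers.

```python
def estimated_restore_hours(data_size_gb: float, restore_mb_per_sec: float) -> float:
    """Rough time to copy a backup back into place.

    Ignores log replay, index rebuilds, and cache warm-up, so treat the
    result as a lower bound rather than a real MTTR.
    """
    if restore_mb_per_sec <= 0:
        raise ValueError("restore throughput must be positive")
    seconds = (data_size_gb * 1024) / restore_mb_per_sec
    return seconds / 3600


# Assumed throughput for illustration; replace with a measured value.
ASSUMED_RESTORE_MB_PER_SEC = 100

for size_gb in (250, 1000, 4000):
    hours = estimated_restore_hours(size_gb, ASSUMED_RESTORE_MB_PER_SEC)
    print(f"{size_gb} GB -> about {hours:.1f} hours")
```

Under that assumed throughput, a 250GB shard is back in well under an hour, while a 4,000GB dataset takes more than eleven hours before any other recovery work begins.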

How does data hoarding affect security?

On the security front, you might be collecting sensitive information with all the other data you are collecting.

Various laws, such as the EU's GDPR and the US's HIPAA, require special handling of sensitive information.

Special handling includes limits on how long data is kept, anonymizing data, and the ability to delete data upon request.

Big Data can leave your business with a liability if not properly planned for.
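
The three kinds of special handling mentioned above (retention limits, anonymization, and deletion on request) might look something like the following sketch. It uses Python's built-in sqlite3 module; the `user_events` table, its columns, and the 365-day retention period are all hypothetical, and whether blanking identifiers satisfies any particular law is a question for your legal team, not this code.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 365  # hypothetical policy, not taken from any law


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the hypothetical table used in this sketch."""
    conn.execute(
        "CREATE TABLE user_events ("
        " id INTEGER PRIMARY KEY,"
        " user_id INTEGER,"
        " email TEXT,"
        " created_at INTEGER NOT NULL)"  # epoch seconds, UTC
    )


def purge_expired(conn: sqlite3.Connection, now: datetime) -> int:
    """Delete events older than the retention window; return rows removed."""
    cutoff = int((now - timedelta(days=RETENTION_DAYS)).timestamp())
    cur = conn.execute("DELETE FROM user_events WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


def anonymize_user(conn: sqlite3.Connection, user_id: int) -> int:
    """Blank the identifying columns but keep the rows for aggregate counts."""
    cur = conn.execute(
        "UPDATE user_events SET user_id = NULL, email = NULL WHERE user_id = ?",
        (user_id,),
    )
    conn.commit()
    return cur.rowcount


def delete_user(conn: sqlite3.Connection, user_id: int) -> int:
    """Remove every row for a user, as for a deletion request."""
    cur = conn.execute("DELETE FROM user_events WHERE user_id = ?", (user_id,))
    conn.commit()
    return cur.rowcount


# Small demonstration against an in-memory database.
demo = sqlite3.connect(":memory:")
create_schema(demo)
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
stale = int((now - timedelta(days=400)).timestamp())
fresh = int((now - timedelta(days=10)).timestamp())
demo.executemany(
    "INSERT INTO user_events (user_id, email, created_at) VALUES (?, ?, ?)",
    [(1, "a@example.com", stale), (1, "a@example.com", fresh), (2, "b@example.com", fresh)],
)
print("purged:", purge_expired(demo, now))
print("anonymized:", anonymize_user(demo, 1))
print("deleted:", delete_user(demo, 2))
```

The point of the sketch is that each obligation becomes a small, testable routine only if you know where the sensitive columns live, which is much harder once data has been hoarded across many tables.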

Conclusion

I’m not saying Big Data is bad, but rather that a Data Guardian will need to plan for how to handle large datasets.

Practice good data hygiene for your data and keep only what is required for your user applications.

If you must collect data for analytical datasets, separate them and set appropriate expectations on SLOs for availability and performance.

And make sure these systems follow security requirements based on the type of data that is collected.
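
As one way to picture separating analytical data from the user-facing database, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` and `orders_history` tables, their columns, and the `archive_old_orders` helper are hypothetical; a real system would more likely use a dedicated pipeline or warehouse, but the shape of the idea is the same.

```python
import sqlite3


def create_tables(app_db: sqlite3.Connection, analytics_db: sqlite3.Connection) -> None:
    """Create the hypothetical lean application table and its history table."""
    columns = "(id INTEGER PRIMARY KEY, customer_id INTEGER, total_cents INTEGER, created_at INTEGER)"
    app_db.execute("CREATE TABLE orders " + columns)
    analytics_db.execute("CREATE TABLE orders_history " + columns)


def archive_old_orders(
    app_db: sqlite3.Connection,
    analytics_db: sqlite3.Connection,
    cutoff_epoch: int,
) -> int:
    """Copy orders older than the cutoff to the analytics store, then delete them."""
    rows = app_db.execute(
        "SELECT id, customer_id, total_cents, created_at FROM orders WHERE created_at < ?",
        (cutoff_epoch,),
    ).fetchall()
    # Commit the copy first: a crash before the delete leaves duplicates,
    # not data loss, and INSERT OR IGNORE makes a rerun harmless.
    analytics_db.executemany(
        "INSERT OR IGNORE INTO orders_history VALUES (?, ?, ?, ?)", rows
    )
    analytics_db.commit()
    app_db.execute("DELETE FROM orders WHERE created_at < ?", (cutoff_epoch,))
    app_db.commit()
    return len(rows)


# Small demonstration with two in-memory databases.
app = sqlite3.connect(":memory:")
analytics = sqlite3.connect(":memory:")
create_tables(app, analytics)
app.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 10, 500, 100), (2, 11, 700, 200), (3, 12, 900, 300)],
)
moved = archive_old_orders(app, analytics, cutoff_epoch=250)
print("archived rows:", moved)
```

Keeping the two stores apart lets the application database stay small and fast to recover, while the analytics store can carry looser availability and performance expectations.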