Laracon 2024: Jess Archer: Analyzing Analytical Databases

Jess Archer taught attendees about analytical databases and how they compare to other more traditional databases.

I think this is the talk that taught me the most of the entire conference.

Definitions

  • OLTP (Online Transaction Processing): MySQL, PostgreSQL, SQLite, etc.
  • OLAP (Online Analytical Processing): SingleStore, ClickHouse, etc.

Her preference is ClickHouse; it’s free and open-source, and has excellent documentation and performance.

Comparisons

OLTP databases tend to be row-oriented, storing all of a row’s data together on disk along with its index.

OLAP databases tend to be column-oriented, storing each column of data together. This makes queries like AVG() and SUM() much more performant, since the engine only has to read a single column’s file instead of reading every row the way an OLTP database would.

She had downloaded a dump (22 GB compressed) of all Stack Overflow posts and imported it into both a MySQL database and a ClickHouse database to run queries live on stage.

It could take 5–6 seconds to compute an average view count using MySQL, versus 27.5 ms using ClickHouse.
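
Roughly, the kind of query she compared looks like the sketch below (the table and column names are my guesses, not her exact schema). Because ClickHouse stores each column in its own file, it only needs to read the view-count column to answer this:

    -- Hypothetical names; ClickHouse reads only the view_count column file for this
    SELECT avg(view_count)
    FROM posts;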

What’s the catch?

At least for ClickHouse, the ID field is not unique, meaning that you could have multiple rows with the same ID, and that selecting a row by ID requires a full table scan (using LIMIT 1 can help by “bailing out” once a match has been found).
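
A sketch of that kind of point lookup (names assumed): without the LIMIT, ClickHouse keeps scanning even after it has found the row.

    -- LIMIT 1 lets the scan stop as soon as a matching row is found
    SELECT *
    FROM posts
    WHERE id = 4
    LIMIT 1;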

Ordering: the table’s sort order should be designed to match what the typical query needs, to prevent extra reads from disk.
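
For example, a minimal sketch of a table whose ORDER BY key matches a “posts over time” style of query (this schema is my assumption, not from the talk):

    CREATE TABLE posts
    (
        id            UInt64,
        creation_date DateTime,
        view_count    UInt32
    )
    ENGINE = MergeTree
    ORDER BY (creation_date, id);  -- queries filtering or aggregating by date read less from disk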

Inserts: bulk inserts are optimal, rather than single-row inserts (see the sketch after this list).

  • Each individual insert creates a “part” or folder on disk
  • The database engine will eventually merge and compact them (see the MergeTree engine)
  • The async_insert feature can also help
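
A minimal sketch of both approaches, using the same assumed posts table as above:

    -- Preferred: one insert carrying many rows, which creates a single part on disk
    INSERT INTO posts (id, creation_date, view_count) VALUES
        (1, '2024-08-01 10:00:00', 12),
        (2, '2024-08-01 10:05:00', 40),
        (3, '2024-08-01 10:10:00', 7);

    -- Alternative: let the server buffer small inserts and flush them as one part
    SET async_insert = 1;
    SET wait_for_async_insert = 1;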

Updates: ideally, data is immutable so the engine doesn’t have to rewrite an entire file on disk.
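
ClickHouse does support updates, but they are “mutations” that rewrite the affected parts on disk, which is the reason to avoid them (names below are assumed):

    -- A mutation: rewrites every part containing matching rows, so use sparingly
    ALTER TABLE posts UPDATE view_count = view_count + 1 WHERE id = 4;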

Deletes: can be optimized and automated; a marker indicates that a row has been deleted, and at some point the engine compacts the files and removes those rows.
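
A sketch of what that looks like, again with assumed names: a lightweight delete sets the marker, OPTIMIZE forces the compaction, and a TTL automates expiry.

    -- Marks matching rows as deleted; background merges physically remove them later
    DELETE FROM posts WHERE view_count = 0;

    -- Optionally force a merge that drops the marked rows now
    OPTIMIZE TABLE posts FINAL;

    -- Automate deletion: expire rows a year after creation
    ALTER TABLE posts MODIFY TTL creation_date + INTERVAL 1 YEAR;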

Other Notes about ClickHouse

The LowCardinality column type: similar to an enum, but better; it builds a dictionary of the distinct values automatically.
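
In a table definition it wraps the underlying type; a standalone sketch with assumed names:

    CREATE TABLE votes
    (
        post_id   UInt64,
        vote_type LowCardinality(String)  -- e.g. 'up' / 'down', stored as small dictionary codes
    )
    ENGINE = MergeTree
    ORDER BY post_id;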

The ReplacingMergeTree engine: inserting and then updating an entry results in two rows on disk until the engine compacts the files; the FINAL query modifier resolves the duplicates automatically at read time.
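
A minimal sketch (the table name and version column are my assumptions):

    CREATE TABLE post_stats
    (
        id         UInt64,
        view_count UInt32,
        updated_at DateTime
    )
    ENGINE = ReplacingMergeTree(updated_at)  -- keeps the row with the newest updated_at per key
    ORDER BY id;

    -- FINAL collapses not-yet-merged duplicates at query time
    SELECT * FROM post_stats FINAL WHERE id = 4;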

ClickHouse can also easily fill gaps in time-series data, which would be more complicated with other database engines.
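
The WITH FILL modifier handles this in plain SQL; a sketch against the same assumed posts table:

    -- Produces a row for every day in the range; missing days get a count of 0
    SELECT
        toDate(creation_date) AS day,
        count() AS posts_created
    FROM posts
    GROUP BY day
    ORDER BY day WITH FILL STEP INTERVAL 1 DAY;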

Packages

She mentioned these packages for using ClickHouse in a Laravel application:

A week later, I’m still thinking about this talk and how we could use ClickHouse to provide better features and performance for some of our clients.