What is ClickHouse?
ClickHouse is an open-source column-oriented database management system for online analytical processing (OLAP).
ClickHouse generates analytical reports from data in real time using SQL queries. The system is marketed for high performance. It is simple and works out of the box. The project was released as open-source software under the Apache License 2.0 in June 2016.
ClickHouse is promoted as the first open-source SQL data warehouse to match the performance and scalability of proprietary databases such as Sybase IQ, Vertica, and Snowflake. Its features include:
- Column storage that handles tables with trillions of rows and thousands of columns.
- Fault-tolerance and read scaling thanks to built-in replication.
- Outstanding aggregation through materialized views.
- Features to solve real-world problems such as funnel analytics and last point queries.
ClickHouse development is driven by a community consisting of hundreds of contributors focused on solving real problems, not implementing corporate roadmaps.
ClickHouse processes hundreds of millions to more than a billion rows, and tens of gigabytes of data, per server per second, and it runs on clusters of hundreds of nodes. It can also be installed easily on a single server or a virtual machine.
ClickHouse uses all available hardware to its full potential to process each query as fast as possible. The peak processing performance for a single query stands at more than two terabytes per second.
ClickHouse allows companies to add servers to their clusters without investing time or money in any additional DBMS modification. It is CPU efficient because of its vectorized query execution, which uses SIMD processor instructions and runtime code generation.
- ClickHouse was developed by the Russian IT company Yandex for Yandex.Metrica, its web analytics service.
- Metrica previously used a classical approach in which raw data was stored in aggregated form. This approach reduces the amount of stored data.
- A different approach is to store unaggregated data. Processing raw data requires a high-performance system, since all calculations are made in real time. To solve this problem, a column-oriented DBMS that could handle analytical data at the scale of the entire internet was needed.
- The first ClickHouse prototype appeared in 2009.
- At the end of 2014, Yandex.Metrica version 2.0 was released. The new version has an interface for creating custom reports and uses ClickHouse to store and process data.
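The trade-off between storing aggregated data and storing raw data can be illustrated with a small sketch. The event fields, sample values, and the `report` helper below are hypothetical and only serve the illustration; they are not taken from Yandex.Metrica or ClickHouse.

```python
from collections import Counter

# Hypothetical raw page-view events: (day, url, browser).
events = [
    ("2014-12-01", "/home", "firefox"),
    ("2014-12-01", "/home", "chrome"),
    ("2014-12-01", "/docs", "chrome"),
    ("2014-12-02", "/home", "chrome"),
]

# Classical approach: keep only counts per (day, url).
# Fewer records are stored, but the browser dimension is lost for good.
pre_aggregated = Counter((day, url) for day, url, _ in events)

# Raw approach: keep every event and compute any report at query time.
def report(raw_events, key):
    """Count events grouped by an arbitrary key function."""
    return Counter(key(e) for e in raw_events)

by_browser = report(events, key=lambda e: e[2])
by_day = report(events, key=lambda e: e[0])
```

With the pre-aggregated table, a report by browser cannot be produced at all. With raw events, any custom report is possible, at the cost of scanning every event for each query, which is why a fast analytical engine is needed.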
Features of ClickHouse
The main features of ClickHouse are:
- True column-oriented DBMS: No extra data is stored with the values. This means that constant-length values must be supported, so that the length of each value does not have to be stored as a number next to it.
- Linear scalability: A cluster can be extended by adding servers.
- Fault tolerance: The system is a cluster of shards, where each shard is a group of replicas. ClickHouse uses asynchronous multi-master replication and can be deployed across multiple data centers. Data is written to any available replica and then distributed to all the remaining replicas. ZooKeeper is used for coordinating processes but is not involved in query processing and execution.
- SQL support: ClickHouse supports an extended SQL language that includes arrays and nested data structures, approximate and URI functions, and the ability to connect an external key-value store.
- High performance: A vector calculation approach is used for high CPU performance. In this approach, data is stored by columns and processed by vectors (parts of columns). ClickHouse supports sampling and approximate calculations. Parallel and distributed query processing, including JOINs, is also available.
- HDD optimization: The system can process data that doesn't fit in random access memory.
- Blazing fast: ClickHouse uses all available hardware to its full potential to process each query as fast as possible.
- Easy to use: ClickHouse is simple and instantly available for building reports. SQL allows expressing the desired result without involving the custom non-standard APIs found in some alternative systems.
- Highly reliable: ClickHouse can be configured as a distributed system located on independent nodes, with no single point of failure. It also includes many enterprise-grade security features and fail-safe mechanisms against human errors.
- Clients for database connectivity: Database connection options include the console client, the HTTP API, or one of the client wrappers. A JDBC driver is also available for ClickHouse.
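The column-oriented storage and vector processing described in the list above can be sketched in a few lines of Python. This is only an illustration of the idea, not ClickHouse's actual implementation; the sample records, the `sum_in_blocks` helper, and the block size are assumptions made for the example.

```python
from array import array

# Hypothetical page-view records in row-oriented form: (url, duration, user_id).
rows = [
    ("/home", 120, 1),
    ("/docs", 340, 2),
    ("/home", 80, 3),
    ("/blog", 200, 1),
]

# Column-oriented form: each column is stored separately. Numeric columns
# use fixed-width typed arrays, so no per-value length is stored.
urls = [r[0] for r in rows]
durations = array("I", (r[1] for r in rows))
user_ids = array("I", (r[2] for r in rows))

def sum_in_blocks(column, block_size=2):
    """Sum a column one block (vector) at a time instead of row by row."""
    total = 0
    for start in range(0, len(column), block_size):
        block = column[start:start + block_size]  # a part of the column
        total += sum(block)
    return total

# Only the durations column is read; urls and user_ids are never touched.
total_duration = sum_in_blocks(durations)
```

A query that aggregates one column reads only that column's contiguous values, which is the main reason a columnar layout suits analytical workloads with wide tables.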
Disadvantages of ClickHouse
The following points can be considered disadvantages:
- There is no support for transactions.
- Already inserted data cannot be modified or deleted at a high rate and with low latency.
- The sparse index makes ClickHouse less efficient for point queries that retrieve single rows by their keys.
- By default, when performing aggregations, the intermediate query states must fit in the RAM of a single server. When they do not fit, ClickHouse can be configured to spill to disk.
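The point-query limitation follows from how a sparse index works: it records one key per block of rows rather than one entry per row, so a lookup still has to scan a whole block. The sketch below is a simplified model under assumed names and sizes (`GRANULE_SIZE`, `marks`, `point_lookup` are hypothetical), not ClickHouse's real index format.

```python
from bisect import bisect_right

GRANULE_SIZE = 4  # hypothetical block size, kept tiny for the example

# Table data sorted by key: 20 rows with keys 0, 2, 4, ..., 38.
keys = list(range(0, 40, 2))
values = [f"row-{k}" for k in keys]

# Sparse index: only the first key of each block is recorded.
marks = keys[::GRANULE_SIZE]

def point_lookup(key):
    """Return (value, rows_scanned) for a single key."""
    block = bisect_right(marks, key) - 1  # last block whose first key <= key
    if block < 0:
        return None, 0  # key is smaller than every key in the table
    start = block * GRANULE_SIZE
    end = min(start + GRANULE_SIZE, len(keys))
    scanned = 0
    for i in range(start, end):  # scan within the block
        scanned += 1
        if keys[i] == key:
            return values[i], scanned
    return None, scanned

value, scanned = point_lookup(14)
```

In this model the index stays small (5 entries for 20 rows), but finding the single row with key 14 scans all 4 rows of its block. With realistically large blocks, that scan is what makes single-row lookups comparatively expensive.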