Detailed explanation of Apache Druid in one article
Foreword:
What is Apache Druid?
It is an analytical data platform that combines the characteristics of a time series database, a data warehouse, and a full-text retrieval system.
This article gives a brief overview of Druid's characteristics, usage scenarios, technical features, and architecture. This will help when choosing a data storage solution and when comparing Druid with time series storage.
Overview
A modern cloud-native, stream-native, analytical database
Druid is designed for workflows that require fast queries and fast data ingestion. Its strengths are powering UIs, running operational (ad hoc) queries, and handling high-concurrency workloads. Druid can be regarded as an open source alternative to data warehouses for a wide range of user scenarios.
Easy integration with existing data pipelines
Druid can stream data from a message bus (such as Kafka or Amazon Kinesis), or batch-load files from a data lake (such as HDFS, Amazon S3, and similar storage).
100x faster performance than traditional solutions
In benchmarks, Druid's data ingestion and query performance significantly exceed traditional solutions.
Druid's architecture combines the best features of data warehouses, time series databases and retrieval systems.
Unlock new workflows
Druid unlocks new types of queries and workflows for clickstream, APM (application performance management), supply chain, network telemetry, digital marketing, and other event-driven scenarios. Druid is built for fast ad hoc queries on both real-time and historical data.
Deployed on AWS/GCP/Azure, hybrid cloud, k8s and rented servers
Druid can be deployed in any *NIX environment, whether on-premises or in the cloud. Deploying Druid is easy: scale up or down simply by adding or removing services.
Usage Scenarios
Apache Druid is suitable for scenarios with high requirements for real-time data ingestion, high-performance queries, and high availability. Druid is therefore often used as an analysis system with a rich GUI, or as the backend of a high-concurrency API that requires fast aggregations. Druid works best with event-oriented data.
Common usage scenarios:
Click stream analysis (web and mobile analysis)
Risk control analysis
Network telemetry analysis (network performance monitoring)
Server indicator storage
Supply chain analysis (manufacturing indicators)
Application performance indicators
Business intelligence/real-time online analysis system OLAP
These usage scenarios will be analyzed in detail below:
User activities and behaviors
Druid is often used for clickstream, access stream, and activity stream data. Specific scenarios include measuring user engagement, tracking A/B test data for product releases, and understanding user usage patterns. Druid can compute user metrics both exactly and approximately, such as unique-count metrics. This means that a metric like daily active users can be computed approximately (with an average accuracy of about 98%) within a second to see overall trends, or computed exactly for presentation to stakeholders. Druid can also be used for funnel analysis, measuring how many users took one action but not another, which is useful for products tracking user sign-ups.
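As a rough illustration, here is a minimal sketch of how the daily-active-users query described above could be issued through Druid's SQL-over-HTTP endpoint. The broker address, the `clickstream` datasource, and the `user_id` column are assumptions made for this example:

```python
import requests

# Assumed broker address and datasource/column names -- adjust to your cluster.
BROKER_SQL_URL = "http://localhost:8082/druid/v2/sql/"

# Approximate daily active users per day over the last week.
# APPROX_COUNT_DISTINCT trades a little accuracy for speed; COUNT(DISTINCT ...)
# can be used instead when an exact figure is required.
sql = """
SELECT
  TIME_FLOOR(__time, 'P1D') AS "day",
  APPROX_COUNT_DISTINCT(user_id) AS approx_dau
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY 1
ORDER BY 1
"""

response = requests.post(BROKER_SQL_URL, json={"query": sql})
response.raise_for_status()
for row in response.json():
    print(row)
```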
Network flow
Druid is often used to collect and analyze network flow data, where flows need to be sliced and combined along arbitrary sets of attributes. Druid can ingest large volumes of network flow records and quickly group and rank across dozens of attributes at query time, which facilitates network flow analysis. These attributes include core attributes such as IP addresses and port numbers, as well as enriched attributes such as location, service, application, device, and ASN. Druid can handle non-fixed schemas, which means you can add any attributes you want.
Digital Marketing
Druid is often used to store and query online advertising data. This data usually comes from advertising service providers, and it is crucial for measuring and understanding advertising campaign performance, click-through rates, conversion rates, and other indicators.
Druid was originally designed to power a user-facing analytical application for advertising data. Druid has seen extensive production use for storing advertising data, and users around the world have stored petabytes of data on clusters of thousands of servers.
Application Performance Management
Druid is often used to track operational data generated by applications. Similar to user activity usage scenarios, this data can be about how users interact with the application, and it can be indicator data reported by the application itself. Druid can be used to drill down to discover how different components of an application are performing, locate bottlenecks, and identify problems.
Unlike many traditional solutions, Druid offers a smaller storage footprint, lower complexity, and greater data throughput. It can quickly analyze application events across thousands of attributes and compute complex load, performance, and utilization metrics, for example the 95th-percentile query latency per API endpoint. Data can be organized and sliced by any ad hoc attribute, such as by day, by user profile, or by data center location.
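To make the latency example concrete, below is a hedged sketch of a percentile query in Druid SQL. The `api_metrics` datasource, the `endpoint` and `latency_ms` columns, and the broker address are illustrative assumptions, and APPROX_QUANTILE_DS requires the druid-datasketches extension to be loaded:

```python
import requests

# Assumed broker address and datasource/column names -- adjust to your cluster.
BROKER_SQL_URL = "http://localhost:8082/druid/v2/sql/"

# 95th-percentile request latency per API endpoint over the last day.
# APPROX_QUANTILE_DS is provided by the druid-datasketches extension.
sql = """
SELECT
  endpoint,
  APPROX_QUANTILE_DS(latency_ms, 0.95) AS p95_latency_ms
FROM api_metrics
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY endpoint
ORDER BY p95_latency_ms DESC
"""

response = requests.post(BROKER_SQL_URL, json={"query": sql})
response.raise_for_status()
print(response.json())
```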
IoT and Device Metrics
Druid can be used as a time series database solution to store and process metrics from servers and devices. It collects machine-generated data in real time and performs fast ad hoc analysis to measure performance, optimize hardware resources, and locate problems.
Unlike many traditional time series databases, Druid is essentially an analysis engine. Druid combines the concepts of time series databases, columnar analysis databases, and retrieval systems. It supports time-based partitioning, column storage, and search indexing in a single system. This means that time-based queries, numeric aggregations, and retrieval filter queries will be extremely fast.
You can include millions of unique dimension values in your metrics and freely group and filter by any dimension (dimensions in Druid are similar to tags in time series databases). You can compute a large number of complex metrics over groups of tags, and searching and filtering on tags is faster than in traditional time series databases.
OLAP and Business Intelligence
Druid is often used in business intelligence scenarios, where companies deploy it to accelerate queries and power applications. Unlike Hadoop-based SQL engines (such as Presto or Hive), Druid is designed for high concurrency and sub-second queries and enhances interactive data exploration through the UI, which makes it better suited to truly interactive visual analysis.
Technology
Apache Druid is an open source distributed data storage engine. Druid's core design combines concepts from OLAP/analytic databases, time series databases, and search systems to create a unified system suitable for a wide range of use cases. Druid integrates the main features of these three types of systems into its ingestion layer, storage format, querying layer, and core architecture.
Druid’s main features include:
Column storage
Druid stores and compresses each column separately, reads only the columns needed by a particular query, and supports fast scans, rankings, and groupBys.
Native search index
Druid creates an inverted index for string values to achieve fast search and filtering of data.
Streaming and batch data ingestion
Out-of-the-box connectors for Apache Kafka, HDFS, AWS S3, stream processors, and more.
Flexible data schemas
Druid elegantly adapts to changing data schemas and nested data types.
Time-based optimized partitioning
Druid intelligently partitions data based on time, so time-based queries are significantly faster than in traditional databases.
Support SQL statements
In addition to native JSON-based queries, Druid also supports SQL over HTTP and JDBC.
Horizontal scalability
Data ingestion rates of millions of events per second, massive data retention, and sub-second queries.
Easy to operate and maintain
Capacity can be scaled up or down by adding or removing servers; Druid supports automatic rebalancing and failover.
Data Ingestion
Druid supports both streaming and batch data ingestion. Druid typically connects to raw data sources through a message bus like Kafka (loading streaming data) or through a distributed file system like HDFS (loading batch data).
Through an indexing process, Druid stores the raw data on data nodes as segments, a query-optimized data structure.
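For example, to start streaming ingestion from Kafka, a supervisor spec is submitted to the Overlord, after which Druid consumes events and builds segments continuously. The sketch below is a minimal, illustrative spec; the Overlord address, Kafka brokers, topic, and column names are assumptions, and the exact fields may differ between Druid versions, so consult the documentation for your release:

```python
import requests

# Assumed Overlord address -- in practice this may also be reached via the Router.
SUPERVISOR_URL = "http://localhost:8081/druid/indexer/v1/supervisor"

# Minimal Kafka supervisor spec: read JSON events from the "clicks" topic,
# roll them up to minute granularity, and store them in the "clickstream" datasource.
kafka_supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "page", "country"]},
            "granularitySpec": {
                "segmentGranularity": "HOUR",
                "queryGranularity": "MINUTE",
                "rollup": True,
            },
            "metricsSpec": [{"type": "count", "name": "count"}],
        },
        "ioConfig": {
            "topic": "clicks",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

response = requests.post(SUPERVISOR_URL, json=kafka_supervisor_spec)
response.raise_for_status()
print(response.json())  # returns the supervisor id on success
```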
Data Storage
Like most analytical databases, Druid uses columnar storage. Depending on the data type of different columns (string, number, etc.), Druid uses different compression and encoding methods. Druid also builds different types of indexes for different column types.
Similar to the retrieval system, Druid creates an inverted index for string columns to achieve faster search and filtering. Similar to a time series database, Druid intelligently partitions data based on time to achieve faster time-based queries.
Unlike most traditional systems, Druid can pre-aggregate data at ingestion time. This pre-aggregation is called rollup and can significantly reduce storage costs.
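The effect of rollup can be illustrated without Druid at all. The toy sketch below collapses raw events that share the same minute and dimension values into a single pre-aggregated row, which is essentially what Druid does at ingestion time when rollup is enabled (the event fields here are invented for the illustration):

```python
from collections import defaultdict
from datetime import datetime

# Raw events: (timestamp, country, bytes). With rollup enabled, rows that share the
# same truncated timestamp and dimension values are stored as one aggregated row.
raw_events = [
    ("2024-01-01T10:00:05", "US", 200),
    ("2024-01-01T10:00:40", "US", 300),
    ("2024-01-01T10:00:59", "DE", 100),
    ("2024-01-01T10:01:10", "US", 400),
]

rolled_up = defaultdict(lambda: {"count": 0, "bytes_sum": 0})
for ts, country, nbytes in raw_events:
    minute = datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H:%M")  # minute query granularity
    rolled_up[(minute, country)]["count"] += 1
    rolled_up[(minute, country)]["bytes_sum"] += nbytes

for (minute, country), aggs in sorted(rolled_up.items()):
    print(minute, country, aggs)
# Four raw rows become three stored rows; with many events per minute the savings grow.
```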
Query
Druid supports both JSON-over-HTTP and SQL queries. In addition to standard SQL operations, Druid also supports a large set of unique operations; its algorithm suite can quickly perform counting, ranking, and quantile calculations.
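As a sketch of the native JSON-over-HTTP interface, the example below posts a simple timeseries query to the broker. The broker address, the `clickstream` datasource, and the time interval are assumptions for illustration:

```python
import requests

# Assumed broker address and datasource -- adjust to your cluster.
BROKER_URL = "http://localhost:8082/druid/v2/"

# Native timeseries query: hourly event counts for a single day.
native_query = {
    "queryType": "timeseries",
    "dataSource": "clickstream",
    "granularity": "hour",
    "intervals": ["2024-01-01/2024-01-02"],
    "aggregations": [{"type": "count", "name": "events"}],
}

response = requests.post(BROKER_URL, json=native_query)
response.raise_for_status()
print(response.json())
```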
Architecture
Druid is a microservice architecture, which can be understood as a database disassembled into multiple services. Each of Druid's core services (ingestion, querying, and coordination) can be deployed individually or jointly on commodity hardware.
Druid clearly names each service to ensure that operation and maintenance personnel can adjust the parameters of the corresponding service according to usage and load conditions. For example, when the load demands it, operators can give more resources to the data ingestion service and reduce resources to the data query service.
Each Druid service can fail independently without affecting the operation of the other services.
Operation and Maintenance
Druid is designed to be a robust system that runs 24/7. The following features ensure long-term operation without data loss.
Data replication
Druid creates multiple replicas of the data based on the configured replication factor, so a single machine failure will not affect Druid queries.
Independent services
Druid clearly names each main service, and each service can be adjusted accordingly according to usage. Services can fail independently without affecting the normal operation of other services. For example, if the data ingestion service fails, no new data will be loaded into the system, but existing data can still be queried.
Automatic data backup
Druid automatically backs up all indexed data to a file system, which can be a distributed file system such as HDFS. You can lose the entire Druid cluster and quickly reload the data from this backup.
Rolling update
Through rolling update, you can update the Druid cluster without downtime, so that it is invisible to users. All Druid versions are backwards compatible.
If you want to learn more about time series databases and how they compare, see this other article:
First introduction and selection of time series database (TSDB)