A Detailed Explanation of Apache Druid in One Article

Foreword:

What is Apache Druid?

It is an analytical data platform that combines the characteristics of time series databases, data warehouses, and full-text search systems.

This article gives a brief overview of Druid's characteristics, usage scenarios, technical features, and architecture. It should help when choosing a data storage solution and when building a deeper understanding of Druid's storage and of time series storage in general.

Overview

A modern cloud-native, stream-native, analytical database

Druid is designed for workflows where fast queries and fast data ingestion matter. Its strengths are powering user interfaces, running operational (ad hoc) queries, and handling high concurrency. Druid can be regarded as an open source alternative to a data warehouse for a wide variety of use cases.

Easy integration with existing data pipelines

Druid can stream data from a message bus (such as Kafka, Amazon Kinesis), or batch load files from a data lake (such as HDFS, Amazon S3 and other similar data sources).

Up to 100x faster than traditional solutions

In benchmarks, Druid's data ingestion and query performance significantly exceed those of traditional solutions.

Druid's architecture combines the best features of data warehouses, time series databases and retrieval systems.

Unlock new workflows

Druid unlocks new query methods and workflows for clickstream, APM (application performance management), supply chain, network telemetry, digital marketing, and other forms of event-driven data. Druid is built for fast ad hoc queries over both real-time and historical data.

Deploy on AWS/GCP/Azure, hybrid clouds, Kubernetes, and bare-metal servers

Druid can be deployed in any *NIX environment, whether on-premises or in the cloud. Deployment is straightforward: scale up or down by adding or removing services.

Usage Scenarios

Apache Druid suits scenarios that demand real-time ingestion, high-performance queries, and high availability. It is therefore often used as the backend for analytics applications with a rich GUI, or for high-concurrency APIs that need fast aggregations. Druid works best with event-oriented data.

Common usage scenarios:

Clickstream analysis (web and mobile analytics)

Risk control analysis

Network telemetry analysis (network performance monitoring)

Server metrics storage

Supply chain analysis (manufacturing metrics)

Application performance metrics

Business intelligence / OLAP (online analytical processing)

These usage scenarios will be analyzed in detail below:

User activities and behaviors

Druid is often used for clickstream, view stream, and activity stream data. Specific scenarios include measuring user engagement, tracking A/B test data for product launches, and understanding usage patterns. Druid can compute user metrics such as distinct counts either exactly or approximately. This means a metric such as daily active users can be approximated in about a second (with an average accuracy of 98%) to see the overall trend, or computed exactly for presentation to stakeholders. Druid can also be used for "funnel analysis", which measures how many users took one action but did not take another. This is useful, for example, for products that track user sign-ups.
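
To make the trade-off between approximate and exact counting concrete, here is a minimal Python sketch of a K-minimum-values estimator. It is an illustration only and not Druid's implementation (Druid ships its own sketch-based aggregators); the function names, the value of k, and the sample data are assumptions made for this example.

```python
import hashlib
import heapq


def hash_to_unit(value):
    """Map a value to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_distinct(values, k=1024):
    """Estimate the number of distinct values while keeping only k hashes."""
    heap = []        # negated hashes, so heap[0] is the largest hash kept
    members = set()  # hashes currently kept, used to skip duplicates
    for value in values:
        h = hash_to_unit(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    kth_smallest = -heap[0]
    return int((k - 1) / kth_smallest)


# Hypothetical activity stream: 100,000 events from 20,000 users.
events = [f"user-{i % 20000}" for i in range(100000)]
exact = len(set(events))          # needs memory for every distinct user
approx = approx_distinct(events)  # needs memory for only k hashes
print(exact, approx)
```

The estimator trades a small error for bounded memory, which is the same kind of trade-off the approximate daily-active-users figure above relies on.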

Network flow

Druid is often used to collect and analyze network flow data, slicing and dicing the streaming flows by arbitrary attributes. It can ingest large volumes of network flow records and quickly combine and rank dozens of attributes at query time, which makes network flow analysis easier. These attributes include core ones such as IP address and port number, as well as enriched ones such as geolocation, service, application, device, and ASN. Druid handles non-fixed schemas, so you can add any attribute you want.

Digital marketing

Druid is often used to store and query online advertising data. This data usually comes from ad service providers and is crucial for measuring and understanding campaign performance, click-through rate, conversion rate, and other metrics.

Druid was originally designed to power a user-facing analytics application for advertising data. It has a long production track record with advertising data, and users around the world have stored petabytes of data on thousands of servers.

Application Performance Management

Druid is often used to track operational data generated by applications. As in the user activity scenario, the data can describe how users interact with the application, or it can be metrics reported by the application itself. Druid can be used to drill down into how different components of an application are performing, locate bottlenecks, and identify problems.

Unlike many traditional solutions, Druid offers a smaller storage footprint, lower complexity, and higher data throughput. It can quickly analyze application events across thousands of attributes and compute complex load, performance, and utilization metrics, for example the 95th-percentile query latency per API endpoint. Data can be organized and sliced by any ad hoc set of attributes, such as by day, by user profile, or by data center location.
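
As a worked example of the metric mentioned above, the following Python sketch computes a 95th-percentile latency per API endpoint with the nearest-rank method. It is an exact, in-memory illustration rather than how Druid computes quantiles; the endpoints, latencies, and function names are hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the value at 1-based rank ceil(pct% * n)."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p95_by_endpoint(events):
    """Group (endpoint, latency_ms) pairs and compute p95 for each endpoint."""
    by_endpoint = defaultdict(list)
    for endpoint, latency_ms in events:
        by_endpoint[endpoint].append(latency_ms)
    return {ep: percentile(lat, 95) for ep, lat in by_endpoint.items()}


# Hypothetical request log: 20 /search calls (10..200 ms) and 3 /login calls.
events = [("/search", ms) for ms in range(10, 210, 10)]
events += [("/login", 5), ("/login", 7), ("/login", 300)]

print(p95_by_endpoint(events))  # {'/search': 190, '/login': 300}
```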

IoT and Device Metrics

Druid can be used as a time series database solution for storing and processing metrics from servers and devices. It ingests machine-generated data in real time and supports fast ad hoc analysis to measure performance, optimize hardware resources, and locate problems.

Unlike many traditional time series databases, Druid is at its core an analytics engine. It combines ideas from time series databases, columnar analytic databases, and search systems, supporting time-based partitioning, column storage, and search indexes in a single system. As a result, time-based queries, numeric aggregations, and search or filter queries are all very fast.

Your metrics can include millions of unique dimension values, and you can freely group and filter by any combination of dimensions (dimensions in Druid are similar to tags in time series databases). You can group and rank by tags and compute a wide range of complex metrics. Searching and filtering on tags is also faster than in traditional time series databases.

OLAP and Business Intelligence

Druid is often used in business intelligence scenarios. Organizations deploy Druid to speed up queries and power applications. Unlike Hadoop-based SQL engines (such as Presto or Hive), Druid is designed for high concurrency and sub-second queries, and it powers interactive data exploration through a UI. This makes Druid a better fit for truly interactive visual analysis.

Technology

Apache Druid is an open source distributed data store. Its core design combines ideas from OLAP/analytic databases, time series databases, and search systems to create a unified system suitable for a wide range of use cases. Druid merges the key characteristics of each of these three systems into its ingestion layer, storage format, querying layer, and core architecture.

Druid’s main features include:

Column storage

Druid stores and compresses each column separately. A query reads only the columns it needs, which supports fast scans, rankings, and groupBy operations.

Native search index

Druid builds inverted indexes for string values to make searching and filtering fast.

Streaming and batch data ingestion

Out-of-the-box connectors for Apache Kafka, HDFS, AWS S3, and stream processors.

Flexible data schemas

Druid elegantly adapts to changing data schemas and nested data types.

Time-based optimized partitioning

Druid intelligently partitions data based on time. Therefore, Druid time-based queries will be significantly faster than traditional databases.

SQL support

In addition to native JSON-based queries, Druid also supports SQL over HTTP or JDBC.

Horizontal scalability

Ingestion rates of millions of events per second, massive data retention, and sub-second queries.

Easy to operate and maintain

Scale up or down simply by adding or removing servers. Druid supports automatic rebalancing and failover.

Data Ingestion

Druid supports both streaming and batch data ingestion. Druid typically connects to raw data sources through a message bus like Kafka (for streaming data) or through a distributed file system like HDFS (for batch data).

Through a process called indexing, Druid converts raw data into segments stored on data nodes. A segment is a data structure optimized for queries.
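
Because data is partitioned by time, a query with a time filter only has to look at the segments whose interval overlaps the query. The following Python sketch shows that idea with one toy "segment" per day. The day-sized interval, the row layout, and the helper names are assumptions for illustration and do not reflect Druid's actual segment format.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def day_chunk(ts):
    """Return the [start, end) day interval that contains the timestamp."""
    start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return (start, start + timedelta(days=1))


def build_segments(rows):
    """Group rows into one toy segment per day interval."""
    segments = defaultdict(list)
    for row in rows:
        segments[day_chunk(row["ts"])].append(row)
    return dict(segments)


def segments_for(segments, query_start, query_end):
    """Pick only the intervals that overlap [query_start, query_end)."""
    return [iv for iv in sorted(segments)
            if iv[0] < query_end and iv[1] > query_start]


rows = [
    {"ts": datetime(2024, 1, 1, 9, 30), "page": "/home"},
    {"ts": datetime(2024, 1, 1, 23, 59), "page": "/cart"},
    {"ts": datetime(2024, 1, 2, 0, 5), "page": "/home"},
    {"ts": datetime(2024, 1, 3, 12, 0), "page": "/pay"},
]
segments = build_segments(rows)

# A query for January 2 touches one of the three segments.
hits = segments_for(segments, datetime(2024, 1, 2), datetime(2024, 1, 3))
print(len(segments), hits)
```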

Data Storage

Like most analytical databases, Druid uses columnar storage. Depending on the data type of different columns (string, number, etc.), Druid uses different compression and encoding methods. Druid also builds different types of indexes for different column types.
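
As one illustration of type-specific encoding, a string column with few distinct values can be dictionary encoded: each distinct string is stored once, and the column itself becomes a list of small integers. The sketch below is a toy version in Python. The column contents and function names are made up for the example, and a real on-disk format involves considerably more than this.

```python
def dictionary_encode(column):
    """Replace each string with its integer position in a sorted dictionary."""
    dictionary = sorted(set(column))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[value] for value in column]


def dictionary_decode(dictionary, encoded):
    """Recover the original strings from the integer ids."""
    return [dictionary[i] for i in encoded]


country = ["US", "DE", "US", "US", "FR", "DE"]
dictionary, encoded = dictionary_encode(country)

print(dictionary)  # ['DE', 'FR', 'US']
print(encoded)     # [2, 0, 2, 2, 1, 0]
```

Integer ids compress well and are cheap to compare, which is why this kind of encoding is common in columnar stores.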

Like a search system, Druid builds inverted indexes for string columns to speed up search and filtering. Like a time series database, Druid intelligently partitions data by time to speed up time-based queries.
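
An inverted index maps each distinct value of a column to the rows that contain it, so a filter becomes a lookup plus a set operation instead of a full scan. Here is a minimal Python sketch of the idea. It uses plain Python sets where a production system would use compressed bitmaps, and the columns and names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(column):
    """Map each distinct value to the set of row numbers that contain it."""
    index = defaultdict(set)
    for row_id, value in enumerate(column):
        index[value].add(row_id)
    return index


city = ["Berlin", "Paris", "Berlin", "Tokyo", "Paris"]
device = ["ios", "android", "android", "ios", "ios"]
city_index = build_inverted_index(city)
device_index = build_inverted_index(device)

# city = 'Paris' AND device = 'ios' is a set intersection.
both = sorted(city_index["Paris"] & device_index["ios"])
# city = 'Berlin' OR city = 'Tokyo' is a set union.
either = sorted(city_index["Berlin"] | city_index["Tokyo"])

print(both)    # [4]
print(either)  # [0, 2, 3]
```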

Unlike most traditional systems, Druid can optionally pre-aggregate data at ingestion time. This pre-aggregation step is called rollup, and it can significantly reduce storage costs.
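
The effect of rollup can be sketched in a few lines of Python: raw events that share the same truncated timestamp and dimension values collapse into one row holding aggregated metrics. The hourly granularity, the field names, and the sample events are assumptions chosen for this example.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events):
    """Pre-aggregate events by (hour, page): count rows and sum bytes."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for event in events:
        hour = event["ts"].replace(minute=0, second=0, microsecond=0)
        key = (hour, event["page"])
        totals[key]["count"] += 1
        totals[key]["bytes"] += event["bytes"]
    return dict(totals)


raw = [
    {"ts": datetime(2024, 1, 1, 10, 5), "page": "/home", "bytes": 100},
    {"ts": datetime(2024, 1, 1, 10, 40), "page": "/home", "bytes": 250},
    {"ts": datetime(2024, 1, 1, 10, 59), "page": "/cart", "bytes": 80},
    {"ts": datetime(2024, 1, 1, 11, 1), "page": "/home", "bytes": 120},
]
rolled = rollup(raw)

# Four raw events become three stored rows.
print(len(raw), len(rolled))
```

The saving grows with the number of events that share a key, but the individual events can no longer be recovered after rollup, which is the cost of the reduced storage.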

Query

Druid supports queries through native JSON over HTTP and through SQL. In addition to standard SQL operators, Druid supports its own operators that draw on its suite of approximate algorithms to compute counts, rankings, and quantiles quickly.
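
As a rough sketch of what a SQL query over HTTP looks like, the Python snippet below builds (but does not send) a POST request to Druid's SQL endpoint, /druid/v2/sql, with the query wrapped in a JSON body. The host and port, the pageviews datasource, and the column names are assumptions for illustration; adjust them to your own cluster.

```python
import json
import urllib.request


def build_sql_request(base_url, sql):
    """Build, without sending, an HTTP POST that carries a Druid SQL query."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical datasource "pageviews"; __time is Druid's timestamp column.
sql = (
    "SELECT page, COUNT(*) AS views "
    "FROM pageviews "
    "WHERE __time >= TIMESTAMP '2024-01-01 00:00:00' "
    "GROUP BY page ORDER BY views DESC LIMIT 10"
)
request = build_sql_request("http://localhost:8888", sql)
print(request.get_method(), request.full_url)

# Against a running cluster, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
```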

Architecture

Druid has a microservice-based architecture, which can be thought of as a database disassembled into multiple services. Each of Druid's core services (ingestion, querying, and coordination) can be deployed separately or together on commodity hardware.

Druid explicitly names every main service so that operators can tune each one according to the use case and workload. For example, when the workload demands it, operators can give more resources to the ingestion service and fewer to the query service.

Druid services can fail independently without affecting the operation of other services.

Operation and Maintenance

Druid is designed to be a robust system that runs 24/7. It has the following features to keep the system running over the long term without losing data.

Data replication

Druid replicates data according to the configured number of replicas, so the failure of a single machine does not affect queries.

Independent services

Druid explicitly names every main service, and each service can be tuned according to how it is used. Services can fail independently without affecting the normal operation of other services. For example, if the ingestion service fails, no new data is loaded into the system, but existing data remains queryable.

Automatic data backup

Druid automatically backs up all indexed data to a file system, which can be a distributed file system such as HDFS. Even if the data on every Druid server is lost, it can be quickly reloaded from this backup.

Rolling updates

With rolling updates, a Druid cluster can be upgraded without downtime, invisibly to users. Druid releases are backwards compatible with the previous version.

To learn more about time series databases and how they compare, see another article:

A First Introduction to Time Series Databases (TSDB) and How to Choose One

Statement
This article is reproduced from 掘金 (Juejin).