
Principles of Modern Data Infrastructure


The evolution of the internet over the past few decades has undeniably impacted how our societies function. From facilitating globalization to making new technology like social media and consumer apps available to nearly every person on the planet, the web has seeped into most aspects of our day-to-day lives. However, this ubiquity comes with an ever-growing need to manage enormous amounts of data, which requires better and better data infrastructure.


Back in the Web 1.0 era, we could really only read static content on the internet. A decade later, with Web 2.0, it became possible to read and write on online social networks. Now, with Web 3.0 and the dawn of AI and blockchain, a single person generates around 1.7 MB of data every second. That adds up to approximately 146.88 GB of data per person per day. Such demanding workloads mean that data infrastructure is now mission-critical for most businesses. Modern data infrastructure supports everything from daily operational workloads (OLTP) to strategic decision-making workloads (OLAP).

Considering the data requirements of the world we live in today, we can conclude that the purpose of a modern data infrastructure is to handle large volumes of data efficiently without compromising latency, consistency, security, or developer experience. So when engineers are planning a new app or piece of software (or upgrading an existing one), they must design a modern data infrastructure that can handle a high velocity of data growth while maintaining efficiency, security, and usability. In this article, we will discuss the principles of modern data infrastructure at a high level, so that when choosing a technology for your infrastructure, you're able to evaluate it through the lens of how well it fulfills each principle.

Design to Scale

Since a major requirement for software today is the ability to handle massive (and growing) amounts of data, it follows that scalability should be front and center when designing a modern data infrastructure. More specifically, it is crucial to be able to scale both vertically and horizontally. Unlike legacy data platforms, which often rely on monolithic architectures that cannot adapt to such high-volume demands, software built on a modern data infrastructure must be capable of first pushing a single server instance (with multiple cores and hundreds of GB of memory) to its limit and then extending to multiple instances in a distributed setup. It also needs to be elastic, to handle growing data volumes and sudden traffic surges.

Why?

Vertical scaling, or scaling up, involves adding more resources to an existing system. This can mean CPU, RAM, and storage capacity upgrades, which tend to be cheaper while workloads are small and preserve the ability to grow more and faster in the future. Software that can scale vertically is able to use its hardware to full capacity. Vertical scaling also tends to be easier to implement initially, since it doesn't require a new system architecture, and easier to manage, since it ultimately requires fewer nodes. It can also improve the performance of apps that are CPU- or memory-intensive, while reducing latency and response times thanks to in-memory data and faster processors. On the flip side, even optimal hardware has its limits, and as upgrades become more expensive, it becomes less efficient to scale only vertically. Scaling up also fails to address fault tolerance, since it creates a single point of failure.

Then there's horizontal scaling, or scaling out, which allows systems to grow significantly more (theoretically infinitely, although practical limitations do come up) and handle more simultaneous users and requests by spreading the workload across multiple machines. The multiple nodes also do a better job of addressing the single point of failure inherent in vertical scaling, as well as improving load balancing. Elastic scaling is also possible with cloud platforms, where resources can be added or removed as necessary. Horizontal scaling can also be cheaper at higher volumes than vertical scaling. Then again, there's the sheer complexity of so many nodes to consider, along with network overhead and the difficulty of maintaining data consistency.
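As a rough illustration of how scaling out spreads work across machines, here is a minimal Python sketch of consistent hashing, one common technique for routing keys to nodes so that adding or removing a node only remaps a small share of the data. The node names and virtual-node count are made up for the example; real distributed data stores ship this kind of logic built in.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: maps keys to nodes so that adding or
    removing a node only remaps a small fraction of the keys."""

    def __init__(self, nodes=None, vnodes=100):
        self.vnodes = vnodes          # virtual nodes per physical node, smooths the distribution
        self._ring = []               # sorted list of (hash, node) entries
        for node in (nodes or []):
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["db-node-1", "db-node-2", "db-node-3"])
print(ring.get_node("user:42"))   # e.g. 'db-node-2'; only ~1/3 of keys move if a fourth node joins
```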

The best way to mitigate the weaknesses of each and make efficient use of both is to build software that can implement a combination of vertical and horizontal scaling. A modern data infrastructure should be able to initially scale up to maximize existing resources and transition to scaling out as workloads grow. If the software architecture allows it, it is also worth looking into a hybrid approach, where vertical scaling optimizes individual nodes and horizontal scaling provides overall system growth and redundancy.

Design to Fail Fast

Designing a modern data infrastructure to fail fast means creating systems that can quickly detect and handle failures, improving reliability and resilience. If a system goes down, most of the time, the problem is with the data layer not being able to handle the stress rather than the application compute layer. While scaling, when one or more components within the data infrastructure fail, they should fail fast and recover fast. In the meantime, since the data layer is stateful, the whole fail-and-recovery process should minimize data inconsistency as well. High availability should be intuitive and effortless for data infrastructure today.

The evolution path from backup to replication to automatic failover is crucial to achieving high availability. Each stage improves how data is protected and how the system recovers from failures.

  • Transitioning from periodic backups, which are necessary for long-term data recovery, to continuous replication improves data availability and reduces recovery times.

  • Implementing automatic failover on top of replication ensures that applications are up and running. Failover systems detect failures and switch to replicas automatically to ensure high availability.

  • Load balancing, distributed architectures, and container orchestration can further help improve availability.

Legacy data infrastructures often have single points of failure and lack redundancy mechanisms, making them vulnerable to downtime and data loss. Nowadays, the features discussed above are essential and should be easily accessible to developers.
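To make the fail-fast-and-recover idea concrete, here is a small Python sketch of a client that gives each endpoint a short timeout and automatically falls back to a replica instead of hanging. The endpoint names and the query function are hypothetical placeholders; in practice this detection and switchover is handled by the data platform or a proxy rather than hand-rolled in application code.

```python
# Hypothetical endpoints; in practice these come from service discovery or configuration.
PRIMARY = "primary.db.internal:5432"
REPLICAS = ["replica-1.db.internal:5432", "replica-2.db.internal:5432"]

class QueryTimeout(Exception):
    pass

def run_query(endpoint: str, sql: str, timeout_s: float = 0.2):
    """Placeholder for a real driver call made with a short per-request timeout.
    The small timeout is the 'fail fast' part: a sick node is detected in
    milliseconds instead of stalling every request behind it."""
    if endpoint == PRIMARY:           # simulate the primary being unreachable
        raise QueryTimeout(f"{endpoint} did not answer within {timeout_s}s")
    return f"result of {sql!r} from {endpoint}"

def query_with_failover(sql: str):
    for endpoint in [PRIMARY, *REPLICAS]:
        try:
            return run_query(endpoint, sql)
        except QueryTimeout as err:
            print(f"fail fast: {err}; trying next replica")   # detect quickly, then move on
    raise RuntimeError("all endpoints unavailable")

print(query_with_failover("SELECT 1"))   # served by a replica after the primary times out
```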

Modern data infrastructure needs to have high availability and fault tolerance, and it should be a simple toggle (either in the UI or in the CLI) from the user's perspective. Obviously, an application without available data is pointless, and downtime can lead to a loss of revenue and reputation. Thus, automatic failover and high availability are a must.

Let's look at an example. If an eCommerce site goes down during a Black Friday sale because the data layer doesn't provide high availability, it will directly cause revenue loss, and that kind of revenue loss might not be recoverable. To add high availability to your data store, modern data infrastructure should allow you to simply toggle it on and choose your availability zone(s). With a few clicks in the UI, or with minimal additional configuration, high availability should be at your fingertips.

Design for Speed

These days, we get frustrated when a Google search doesn't load immediately or the UI in an app takes more than a moment to be ready for us. By default, databases and data stores need to be able to respond quickly to user queries under heavy throughput. Users expect a real-time or near-real-time experience from all applications, and much of the time, even a few milliseconds is too slow. For instance, a web API request may translate into one or a few queries to the primary on-disk database and then a few to even tens of operations against the in-memory data store. For each in-memory operation, a sub-millisecond response time is a bare necessity for the expected user experience.
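The request flow described above (a handful of database queries plus many sub-millisecond in-memory lookups) is often implemented with a cache-aside pattern. Below is a minimal, self-contained Python sketch: the in-memory dict stands in for a real in-memory data store such as Redis, and `load_user_from_db` is a hypothetical stand-in for the slower on-disk database query.

```python
import time

cache = {}          # stands in for an in-memory store (e.g., Redis); lookups are sub-millisecond
CACHE_TTL_S = 60    # how long a cached entry stays fresh

def load_user_from_db(user_id: int) -> dict:
    """Hypothetical primary-database query, simulated here as a slow call."""
    time.sleep(0.02)                 # pretend the on-disk query takes ~20 ms
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry and entry[0] > time.monotonic():        # cache hit, still fresh
        return entry[1]
    user = load_user_from_db(user_id)                # cache miss: go to the primary database
    cache[key] = (time.monotonic() + CACHE_TTL_S, user)
    return user

start = time.perf_counter()
get_user(7)                                          # first call pays the database latency
first_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
get_user(7)                                          # second call is served from memory
second_ms = (time.perf_counter() - start) * 1000
print(f"miss: {first_ms:.1f} ms, hit: {second_ms:.3f} ms")
```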

100ms or less is an ideal wait time for a human experiencing technology, as it feels instantaneous. Anything over 200ms makes the latency feel obvious, and the human feels frustrated. So if an application has a latency higher than 200ms, people tend to report that it is difficult to use. For example, if a payment request takes more than a few seconds to process, the customer may be left to wonder if their payment went through, and if they have to spend time figuring it out, they may just lose interest in buying.

Design for Security

As more and more is done online, we are required to share personal information and data online to complete tasks. Even when we don't share data ourselves, applications collect information about our online behavior that can say a lot about who we are. Simply by using software and apps, everyone is left vulnerable to data breaches, cybersecurity threats, and even identity theft. This leaves engineers with the responsibility to carefully consider security when designing their modern data infrastructures, along with the need to maintain compliance and data integrity.

By implementing RBAC, ACLs, and secured network practices, engineers can develop a fundamentally robust security framework to handle any threats and protect their software's data.

RBAC, or role-based access control, is a system for restricting access based on roles that are assigned to users. Beyond defining roles and permissions, RBAC requires a regular review of these assignments to block unauthorized access. RBAC also provides granular control over user authorization and makes it easier to manage permissions as people join and/or leave the organization.
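As a rough illustration of the idea (not any specific product's API), a role-based check boils down to mapping roles to permissions and users to roles, then verifying a permission at the point of access. The roles, users, and permission names below are made up.

```python
# Minimal RBAC sketch: roles own permissions, users own roles.
ROLE_PERMISSIONS = {
    "analyst":  {"read:reports"},
    "engineer": {"read:reports", "write:pipelines"},
    "admin":    {"read:reports", "write:pipelines", "manage:users"},
}

USER_ROLES = {
    "alice": {"engineer"},
    "bob":   {"analyst"},
}

def is_allowed(user: str, permission: str) -> bool:
    """A user is allowed an action if any of their roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))

print(is_allowed("alice", "write:pipelines"))  # True
print(is_allowed("bob", "write:pipelines"))    # False: analysts cannot modify pipelines
```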

ACLs, or Access Control Lists, define which users or systems are granted or denied access to specific resources. ACLs are even more granular than RBAC and provide flexibility since they can be applied to different types of resources, like files, directories, and network devices.

Secured network practices protect data in transit and make sure that network communications are shielded from unauthorized access and attacks. To implement them, encryption protocols like TLS and SSL should be used to secure data during transmission. Firewalls and security groups should control traffic based on the organization's security rules. A network should be segmented into different zones to contain breaches and limit the spread of attacks. VPNs and secure access solutions also help protect remote access.
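For data in transit, most language runtimes already ship the necessary primitives; for example, Python's standard-library `ssl` module can wrap a plain socket in TLS with certificate verification and hostname checking enabled by default. The host name below is a placeholder.

```python
import socket
import ssl

HOSTNAME = "data.example.internal"   # placeholder endpoint for illustration
PORT = 443

# create_default_context() enables certificate validation and hostname checking,
# and negotiates a modern TLS version by default.
context = ssl.create_default_context()

with socket.create_connection((HOSTNAME, PORT), timeout=5) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=HOSTNAME) as tls_sock:
        print("negotiated:", tls_sock.version())   # e.g. 'TLSv1.3'
        tls_sock.sendall(b"ping")                  # application traffic is now encrypted
```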

It is also important to secure data-sharing mechanisms within the organization with encrypted transmissions and secure file-sharing platforms like Google Drive or Dropbox, depending on the company's needs. Maintaining clear documentation for data-sharing procedures also makes it easier to maintain consistency.

Design for Maintainability

Outdated systems often feature tightly coupled components and rigid architectures, making it difficult to configure, extend, and integrate new parts without creating silos and increasing the complexity of maintenance. Software's data infrastructure is an ecosystem of moving parts. Every part must work together, be configurable, and be extensible — all without creating silos. In practice, this is not easy to do due to the way fallible humans use each moving part. However, here are a few tips to make the task more approachable:

  • Use a modular data architecture (microservices and containerization) so individual components can be updated or replaced without touching the rest of the ecosystem.

  • Automate tasks like deployment, scaling, and monitoring wherever possible to reduce human intervention and error.

  • Maintain high-quality documentation and standardization across the entire ecosystem.

Design for Cost Efficiency

We touched lightly on cost efficiency while talking about scalability, but let's step a little further into the topic. With the increasing complexity and scale of data operations, cost efficiency is central to continued innovation. In the competitive landscape of the software industry, companies, especially startups and medium-sized enterprises, often face tight budgets and financial constraints — every dollar counts. This makes it imperative to ensure that every dollar spent contributes directly to value creation and operational efficiency without sacrificing scalability for future growth.

An example of a tool that promotes operational efficiency is DuckDB. Sure, a huge cluster of powerful machines can compute the results of our complex analytic queries, but engineers should ask: do they really need that much power? The vast majority of organizations only need to analyze hundreds of GB to a few TB of data. DuckDB, as a lightweight engine, can give engineers what they need without breaking the bank on features and power they will never use. In contrast, legacy data infrastructures weren't designed for cost efficiency because they often relied on expensive, proprietary hardware and software, required significant upfront investments, and incurred high ongoing maintenance costs. Additionally, their inability to scale efficiently led to over-provisioning resources to handle peak loads, resulting in wasted capacity and higher operational expenses.
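For a sense of how lightweight this can be, the sketch below runs an analytic aggregation with DuckDB's Python API directly against a local Parquet file, no cluster required. The file name and column names are hypothetical.

```python
import duckdb   # pip install duckdb -- a single lightweight dependency, no server to run

# Hypothetical event log stored as a local Parquet file.
con = duckdb.connect()            # in-memory database; everything runs in this process
rows = con.execute("""
    SELECT event_type,
           count(*)         AS events,
           avg(duration_ms) AS avg_duration_ms
    FROM read_parquet('events.parquet')
    WHERE event_date >= DATE '2024-01-01'
    GROUP BY event_type
    ORDER BY events DESC
    LIMIT 10
""").fetchall()

for event_type, events, avg_duration_ms in rows:
    print(f"{event_type}: {events} events, {avg_duration_ms:.1f} ms on average")
```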

It's also important to consider whether the tool the team is adopting provides transparency around how pricing is calculated. Some products charge based on the "number of reads and writes," the "number of rows fetched," or the "total data processed." But what do these numbers really mean? Most teams don't even have access to such metrics, let alone understand how the product arrives at them. This can result in ridiculously high costs that are difficult to track and fix. Shopify, for example, once stumbled over a $1M query in BigQuery, and BigQuery is by no means an obsolete technology.

At the end of the day, in a modern data infrastructure, costs should be predictable and efficient, even at scale. All team members should be able to understand the pricing model, which should be a major consideration for engineers as they develop software. That being said, if a product's user base grows 10X over the years, its data infrastructure costs certainly should not grow 10X as well, provided the infrastructure has scaled efficiently.

Design for Developer Experience

A modern data infrastructure that is optimized for a positive developer experience can increase productivity, accelerate development, and reduce errors. So what are developers looking for in a good experience? We mean ease of use, familiar tools and integrations, freedom to access and process data easily, and not having to worry about security. Conversely, we don't want to work with an old-fashioned data store that has an extremely complex configuration, doesn't work well out of the box, and requires very specialized knowledge to even get started.

Intuitive and familiar APIs and SDKs can make it easier for engineers to jump into building data-driven applications and should be made available and easily accessible. For example, CockroachDB is compatible with the PostgreSQL wire protocol and API, making it much easier for developers to migrate existing applications. This compatibility allows CockroachDB to stand as part of the PostgreSQL ecosystem, enabling developers to leverage their existing knowledge and tools. In addition to providing clear and easy-to-use documentation for the data infrastructure, it's also a good idea to ensure there's documentation in place for APIs, SDKs, and any other tools to help developers avoid bottlenecks and hurdles.
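Because CockroachDB speaks the PostgreSQL wire protocol, a standard PostgreSQL driver is usually all that's needed. The sketch below uses psycopg2 with a placeholder connection string (26257 is CockroachDB's default SQL port); the credentials, database, and table are illustrative only.

```python
import psycopg2  # the same driver an existing PostgreSQL application would already use

# Placeholder connection string; 26257 is CockroachDB's default SQL port.
conn = psycopg2.connect(
    "postgresql://app_user:app_pass@localhost:26257/defaultdb?sslmode=require"
)

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS accounts (
            id      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
            balance DECIMAL NOT NULL
        )
    """)
    cur.execute("INSERT INTO accounts (balance) VALUES (%s) RETURNING id", (100.50,))
    print("created account:", cur.fetchone()[0])   # ordinary SQL, ordinary Postgres tooling

conn.close()
```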

To make working with the data infrastructure as easy as possible, it should also offer rich features: features that support various use cases and provide commonly required shortcuts. These can include full-text search, geospatial queries, built-in connectors for various data sources, and so on.

Support for diverse data types (strings, numbers, and vectors and embeddings for AI) and multiple models (relational, key-value, graph, document) reduces the need for extra tools and integrations, reduces complexity around data processing, and makes it easier to query and analyze data across different formats. And yes, a vector is just a data type, and it is (or soon will be) supported by all major data platforms.
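As a small illustration, assuming a PostgreSQL-compatible database with the pgvector extension available (the connection string, table, and embeddings below are made up), storing and querying vectors looks much like working with any other column type:

```python
import psycopg2  # reusing the standard PostgreSQL driver from the previous sketch

# Assumes a PostgreSQL-compatible database where the pgvector extension can be enabled.
conn = psycopg2.connect("postgresql://app_user:app_pass@localhost:5432/appdb")

with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id        SERIAL PRIMARY KEY,
            body      TEXT,
            embedding vector(3)          -- tiny dimension just for illustration
        )
    """)
    cur.execute("INSERT INTO docs (body, embedding) VALUES ('hello', '[0.1, 0.9, 0.0]')")
    # Nearest-neighbour search: '<->' is pgvector's Euclidean distance operator.
    cur.execute("SELECT body FROM docs ORDER BY embedding <-> %s::vector LIMIT 3",
                ('[0.0, 1.0, 0.0]',))
    print(cur.fetchall())

conn.close()
```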

And then there's security. We've talked about the importance and components of security already, but it's also important to note that from the user's perspective, security should be built in. A development team with no security expertise should not have to worry about it once the data endpoint is properly protected. Additional features, such as encryption at rest, should be easily configurable and toggleable as well.

All in all, developing any tool for developers is all about making tedious processes faster and easier so that engineers can focus on the innovation around what they are building.

Conclusion

When designing a modern data infrastructure, the major principles to keep in mind are scalability, high availability, speed, security, maintainability, efficiency, and of course, developer experience.

Take the time to assess your own product's data infrastructure against these principles: do you have a modern data infrastructure? And revisit these principles as you add and remove data technologies in the future.
