Introduction to NoSQL
NoSQL stands for "Not Only SQL".
In modern computing systems, huge amounts of data are generated on the Internet every day.
A large part of this data is processed by relational database management systems (RDBMSs). In 1970, E.F. Codd's paper "A Relational Model of Data for Large Shared Data Banks" proposed the relational model, which made data modeling and application programming simpler.
Practice has shown that the relational model suits client-server programming very well, with benefits far beyond what was originally expected. Today it is the dominant technology for storing structured data in web and business applications.
NoSQL is a database movement that breaks with this tradition. The idea was proposed early on, and the trend gained real momentum in 2009. NoSQL advocates promote non-relational data storage, which brings new thinking to a field overwhelmingly dominated by relational databases.
Relational databases follow the ACID rules
A database transaction resembles a transaction in the real world. It has the following four properties:
1. A (Atomicity)
Atomicity means that the operations in a transaction either all complete or none of them do. A transaction succeeds only if every operation in it succeeds. If any operation fails, the whole transaction fails and must be rolled back.
For example, a bank transfer of 100 yuan from account A to account B takes two steps: 1) withdraw 100 yuan from account A; 2) deposit 100 yuan into account B. The two steps must complete together or not at all. If only the first step completed and the second failed, 100 yuan would simply vanish.
2. C (Consistency)
Consistency means that the database is always in a consistent state: running a transaction never violates the database's existing integrity constraints.
For example, given the integrity constraint a+b=10, a transaction that changes a must also change b so that a+b=10 still holds when the transaction ends. Otherwise the transaction fails.
3. I (Isolation)
Isolation means that concurrent transactions do not affect each other. If one transaction reads data that another transaction is modifying, it does not see those modifications until the other transaction commits.
For example, suppose a transaction is transferring 100 yuan from account A to account B. If B checks his account before the transaction completes, he does not see the additional 100 yuan.
4. D (Durability)
Durability means that once a transaction is committed, its changes are stored permanently in the database and are not lost even if the system goes down.
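The bank transfer described under atomicity can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `accounts` table, the `transfer` and `balances` helpers, and the account names are hypothetical and not part of the original text. The second step is made to fail by naming a destination account that does not exist, so the rollback of the first step is visible.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as a single transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            # Step 1: withdraw from the source account.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            # Step 2: deposit into the destination account.
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            if cur.rowcount != 1:
                raise LookupError(f"no such account: {dst}")
        return True
    except (sqlite3.IntegrityError, LookupError):
        # Either the CHECK constraint or the missing account aborted the
        # transaction; every change made inside it has been undone.
        return False


def balances(conn):
    """Return {account name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 150), ("B", 0)])
conn.commit()

print(transfer(conn, "A", "B", 100))  # both steps succeed
print(transfer(conn, "A", "C", 50))   # step 2 fails, so step 1 is rolled back
print(balances(conn))
```

The `CHECK (balance >= 0)` clause also gives a small taste of consistency: a transfer that would overdraw account A violates the constraint and the whole transaction is rejected.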
Distributed system
A distributed system consists of multiple computers and communicating software components connected through a computer network (local network or wide area network).
A distributed system is a software system built on top of a network. Because it is implemented in software, it can present a high degree of cohesion and transparency.
Thus, the difference between networks and distributed systems lies more in the high-level software (especially the operating system) than in the hardware.
Distributed systems can run on different platforms, such as PCs, workstations, LANs, and WANs.
Advantages of distributed computing
Reliability (fault tolerance):
An important advantage of distributed computing systems is reliability. A crash on one server does not affect the remaining servers.
Scalability:
In a distributed computing system, more machines can be added as needed.
Resource sharing:
Sharing data is essential for applications such as banks and reservation systems.
Flexibility:
Since the system is very flexible, it is easy to install, implement and debug new services.
Faster speed:
A distributed computing system can combine the computing power of multiple computers, which can make it faster than a single machine.
Open system:
Because it is an open system, the service can be accessed locally or remotely.
Higher performance:
A distributed system can provide higher performance (and better cost-effectiveness) than centralized computer network clusters.
Disadvantages of distributed computing
Troubleshooting:
Troubleshooting and diagnosing problems.
Software:
Limited software support is a major drawback of distributed computing systems.
Network:
Problems with network infrastructure, including: transmission problems, high load, information loss, etc.
Security:
The open nature of the system creates risks for data security and data sharing in distributed computing systems.
What is NoSQL?
NoSQL refers to non-relational databases. The name is sometimes expanded as "Not Only SQL", and it is a general term for database management systems that differ from traditional relational databases.
NoSQL is used to store data at very large scale (Google or Facebook, for example, collect trillions of bits of data about their users every day). These data stores do not require a fixed schema and can scale horizontally without extra operations.
Why use NoSQL?
Today we can easily access and capture data through third-party platforms (such as Google and Facebook). Users' personal information, social networks, geographical locations, user-generated content, and operation logs have grown exponentially. Relational databases are often a poor fit for mining user data at this scale, whereas NoSQL databases were developed to handle such large data sets well.
Example
Social relationship network:
Separate records: UserID, first_name, last_name, age, gender, ...
Task: Find all friends of friends of friends of ... friends of a given user.
Wikipedia Page:
Combination of structured and unstructured data
Task: Retrieve all pages regarding athletics of Summer Olympics before 1950.
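The "friends of friends" task above can be sketched as a breadth-first traversal. In a relational schema each extra hop typically means another self-join on the friendship table, while a graph-shaped store can simply follow edges. The `friendships` data and the `friends_within` helper below are hypothetical and serve only as an illustration.

```python
from collections import deque


def friends_within(graph, start, max_hops):
    """Return the users reachable from `start` in 1..max_hops friendship hops."""
    seen = {start}
    found = set()
    queue = deque([(start, 0)])
    while queue:
        user, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                found.add(friend)
                queue.append((friend, hops + 1))
    return found


# Adjacency lists: each user maps to that user's direct friends.
friendships = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice"],
    "dave": ["bob", "erin"],
    "erin": ["dave"],
}

print(sorted(friends_within(friendships, "alice", 1)))
print(sorted(friends_within(friendships, "alice", 2)))
```

Each user is visited at most once, so the cost grows with the number of people and friendships actually reached rather than with the size of the whole user table.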
RDBMS vs NoSQL
RDBMS
- Highly organized, structured data
- Structured Query Language (SQL)
- Data and relationships are stored in separate tables
- Data manipulation language, data definition language
- Strict consistency
- Basic transactions
NoSQL
- Stands for "Not Only SQL"
- No declarative query language
- No predefined schema
- Key-value, column, document, and graph stores
- Eventual consistency rather than ACID properties
- Unstructured and unpredictable data
- CAP theorem
- High performance, high availability, and scalability
A brief history of NoSQL
The term NoSQL first appeared in 1998 as the name of a lightweight, open source relational database developed by Carlo Strozzi that did not provide an SQL interface.
In 2009, Johan Oskarsson of Last.fm organized a discussion on distributed open source databases [2], and Eric Evans of Rackspace reintroduced the term NoSQL. By then, NoSQL mainly referred to a non-relational, distributed database design pattern that does not provide ACID guarantees.
The "no:sql(east)" seminar held in Atlanta in 2009 was a milestone, and its slogan was "select fun, profit from real_world where relational=false;". Therefore, the most common interpretation of NoSQL is "non-relational", emphasizing the advantages of Key-Value Stores and document databases, rather than simply opposing RDBMS.
CAP theorem
In computer science, the CAP theorem, also known as Brewer's theorem, states that a distributed computing system cannot satisfy all three of the following guarantees at the same time:
Consistency (all nodes see the same data at the same time)
Availability (every request receives a response, whether it succeeded or failed)
Partition tolerance (the system keeps operating even when messages between nodes are lost or delayed)
The core of the CAP theorem is that a distributed system cannot fully meet the three requirements of consistency, availability, and partition tolerance at once; it can meet at most two of them well.
Accordingly, NoSQL databases are often grouped into three categories, depending on which pair they satisfy: CA, CP, or AP:
CA - Single-site clusters that satisfy consistency and availability; they generally scale less well.
CP - Systems that satisfy consistency and partition tolerance; their performance is usually not especially high.
AP - Systems that satisfy availability and partition tolerance; they generally accept weaker consistency.
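The difference between the CP and AP choices can be illustrated with a toy model. The sketch below is a deliberate simplification, not how any real database is implemented: `TwoNodeStore`, `Unavailable`, and `write_during_partition` are hypothetical names, and a network partition is modeled as a simple flag.

```python
class Unavailable(Exception):
    """Raised when a CP store refuses a request during a partition."""


class TwoNodeStore:
    """Hypothetical store with two replicas; `mode` is "CP" or "AP"."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False

    def write(self, node, key, value):
        if not self.partitioned:
            # Healthy network: every replica receives the write.
            for replica in self.replicas:
                replica[key] = value
        elif self.mode == "CP":
            # Stay consistent by giving up availability.
            raise Unavailable("other replica unreachable")
        else:
            # Stay available by letting the replicas diverge.
            self.replicas[node][key] = value

    def read(self, node, key):
        return self.replicas[node].get(key)


def write_during_partition(mode):
    """Return (write accepted?, value on node 0, value on node 1)."""
    store = TwoNodeStore(mode)
    store.write(0, "x", "old")
    store.partitioned = True
    try:
        store.write(0, "x", "new")
        accepted = True
    except Unavailable:
        accepted = False
    return accepted, store.read(0, "x"), store.read(1, "x")


print(write_during_partition("CP"))
print(write_during_partition("AP"))
```

In CP mode the write is rejected and both replicas still agree; in AP mode the write is accepted but the two replicas temporarily hold different values.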
Advantages/Disadvantages of NoSQL
Advantages:
- High scalability
- Distributed computing
- Low cost
- Flexible architecture, semi-structured data
- No complex relationships
Disadvantages:
- No standardization
- Limited query capabilities (so far)
- Eventual consistency is an unintuitive programming model
BASE
BASE: Basically Available, Soft-state, Eventually Consistent. Defined by Eric Brewer.
BASE describes the relaxed availability and consistency requirements of NoSQL databases:
- Basically Available
- Soft state (flexible transactions). "Soft state" can be understood as "connectionless", while "hard state" is "connection-oriented".
- Eventual consistency. Eventual consistency is also the ultimate goal of ACID.
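Eventual consistency can be sketched with two replicas that accept writes independently and reconcile later. The example assumes a last-write-wins rule with version numbers supplied by the caller; real systems generally use more careful conflict resolution. The `Replica` class is hypothetical and exists only for illustration.

```python
class Replica:
    """One node holding key -> (version, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        # Keep the write only if it is newer than what this replica holds.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        # Pull every entry from `other`; the newest version wins.
        for key, (version, value) in other.data.items():
            self.write(key, value, version)


a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], version=1)  # accepted by replica a only
stale = b.read("cart")                # b has not seen the write yet
b.sync_from(a)                        # replicas exchange state later
fresh = b.read("cart")                # now b agrees with a
print(stale, fresh)
```

Between the write and the synchronization the replicas disagree, which is the "soft state" period; once updates stop and the replicas have synchronized, they converge on the same value.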
ACID vs BASE
ACID | BASE |
---|---|
Atomicity | Basically Available |
Consistency | Soft state (flexible transactions) |
Isolation | Eventual consistency |
Durability | |
NoSQL database classification
Type | Representative systems | Characteristics |
---|---|---|
Column storage | HBase, Cassandra, Hypertable | As the name suggests, data is stored by column. The main advantages are that it suits structured and semi-structured data, compresses well, and has a large I/O advantage for queries that touch only one or a few columns. |
Document storage | MongoDB, CouchDB | Documents are generally stored in a JSON-like format, and the stored content is document-typed. This makes it possible to create indexes on certain fields and provide some of the functionality of a relational database. |
Key-value storage | Tokyo Cabinet / Tyrant, Berkeley DB, MemcacheDB, Redis | A value can be looked up quickly by its key. In general the store accepts any value regardless of its format. (Redis offers additional features.) |
Graph storage | Neo4J, FlockDB | Optimized for storing graph relationships. Solving such problems with a traditional relational database leads to poor performance and an awkward design. |
Object storage | db4o, Versant | The database is operated through a syntax similar to that of an object-oriented language, and data is accessed as objects. |
XML database | Berkeley DB XML, BaseX | Stores XML data efficiently and supports XML query languages such as XQuery and XPath. |
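The difference between key-value and document storage described in the table can be sketched as follows. A plain Python dict stands in for the store, and `kv_store`, `build_index`, and the sample records are hypothetical; real products expose their own APIs for this.

```python
import json
from collections import defaultdict

# Key-value view: the store treats each value as an opaque string.
kv_store = {
    "user:1": json.dumps({"name": "Ann", "city": "Oslo"}),
    "user:2": json.dumps({"name": "Raj", "city": "Pune"}),
    "user:3": json.dumps({"name": "Liv", "city": "Oslo"}),
}


def build_index(store, field):
    """Document view: parse each value and index the keys by one field."""
    index = defaultdict(list)
    for key, raw in store.items():
        doc = json.loads(raw)
        if field in doc:
            index[doc[field]].append(key)
    return index


# Lookup by key works in both models.
ann = json.loads(kv_store["user:1"])

# Lookup by field value needs the store to understand the document structure.
city_index = build_index(kv_store, "city")
oslo_users = sorted(city_index["Oslo"])
print(ann["name"], oslo_users)
```

Because a document store understands the structure of each value, it can maintain an index like `city_index` itself, which is what lets it offer some query features of a relational database.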
Who uses NoSQL
Facebook
Mozilla
Adobe
Foursquare
LinkedIn
Digg
McGraw-Hill Education
Vermont Public Radio