
python - Should I use MySQL or mogodb to analyze data scraped by a crawler?

PHP中文网 · 2741 days ago

All replies (7)

  • 高洛峰 · 2017-04-18 10:03:31

    First, a spelling correction: it should be MongoDB.
    Each database has its own strengths and weaknesses and suits different situations. Since I am on the MongoDB side, and MySQL and HDFS have already been mentioned above, I will lay out the advantages of MongoDB over MySQL and HDFS for data analysis. See whether these advantages are what you need, then decide based on the actual situation of your project.
    MySQL is a long-established RDBMS with the common features of the category, including full ACID support. Its technology has matured through long deployment and testing and is at a very stable stage. In practice, the main advantage of an RDBMS over NoSQL is strong transactions. In an OLAP workload, however, strong transactions are of little use, while they do get in the way of distributed support. If the application keeps growing, horizontal scaling will eventually become the main bottleneck of choosing MySQL. In addition, a crawler usually captures unstructured data, which the relational model is poorly suited to storing and querying. There is one caveat: if the sites you care about are all of the same type and you only want specific content from their pages, the data can be organized into structured form, and MySQL remains competent there. Even so, as the application develops you will still end up sacrificing flexibility in how the data is stored. So for crawler-style applications, MySQL's main problems are an inflexible data model and horizontal scaling that is difficult or impossible.
    As for those two problems, HDFS actually handles both, so HDFS has an advantage over MySQL for crawler-style applications. MongoDB also solves both problems well. So what advantage does MongoDB have over HDFS? One important point is that MongoDB can build a secondary index on any field of a document, just like a relational database, so indexes can be exploited fully during analysis. Beyond that, HDFS provides something closer to a file system, while MongoDB provides a flexible database: features such as geographic distribution and archiving of expired documents are easy to implement on MongoDB.
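    As a minimal sketch of those two points (schemaless storage plus secondary indexes, and expiring old documents), here is what they look like with pymongo; the connection string, database, and field names are illustrative assumptions, not something from this thread:

    ```python
    # Store a schemaless crawled page and index arbitrary fields with pymongo.
    from datetime import datetime, timezone
    from pymongo import MongoClient, ASCENDING

    coll = MongoClient("mongodb://localhost:27017")["crawler"]["pages"]  # hypothetical names

    # Documents need not share a schema; extra fields are simply stored.
    coll.insert_one({"url": "http://example.com",
                     "fetched_at": datetime.now(timezone.utc),
                     "tags": ["news", "tech"]})

    # Secondary index on any field, as with a relational column.
    coll.create_index([("tags", ASCENDING)])

    # TTL index: documents are removed automatically ~30 days after fetched_at,
    # one way to realize the "expired document" handling mentioned above.
    coll.create_index("fetched_at", expireAfterSeconds=30 * 24 * 3600)
    ```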
    In terms of ecosystem, HDFS's surrounding tooling is certainly richer; after all, it has a longer history. MongoDB currently mainly offers:

    • BI Connector: presents a PostgreSQL- or MySQL-compatible interface to the outside world so existing BI tools can be reused

    • Spark Connector: connects MongoDB with Spark for computation (a sketch follows this list)
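    As a sketch of the Spark Connector item, this is roughly how a MongoDB collection is read into Spark for analysis from Python; the format name and option key assume the v10.x connector, and the URI and field name are illustrative:

    ```python
    # Read a MongoDB collection into a Spark DataFrame via the MongoDB
    # Spark Connector (spark-submit must be given the connector package).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("crawler-analysis")
             .config("spark.mongodb.read.connection.uri",
                     "mongodb://localhost:27017/crawler.pages")  # hypothetical URI
             .getOrCreate())

    df = spark.read.format("mongodb").load()
    df.groupBy("site").count().show()  # "site" is a hypothetical field
    ```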

    Back to your question: in all fairness, at the scale of one to ten million records the efficiency difference is not significant; used correctly, neither database will show a qualitative performance gap. On availability, MongoDB's high-availability setup can recover from a failure within seconds; MySQL has corresponding solutions as well, but operating them may be more complex. On security there is not much difference between the two.
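    To make the availability point concrete: MongoDB's second-level recovery comes from replica sets, where the driver simply re-routes to the newly elected primary. A minimal connection sketch with pymongo, with hypothetical host names:

    ```python
    # Connect to a three-node replica set; on primary failure the driver
    # retries against the new primary once the election completes.
    from pymongo import MongoClient

    client = MongoClient(
        "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0",
        retryWrites=True)  # retry a write once across a failover window
    ```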

  • PHP中文网 · 2017-04-18 10:03:31

    MySQL struggles when processing large amounts of data; MongoDB, by contrast, should handle it better through a cluster.

    In fact, you may not need a database at all, a database can easily become the I/O bottleneck of a crawler.

    You can try HDFS with Hadoop.
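    As a sketch of that suggestion, crawled records can be appended to HDFS as line-delimited JSON; this uses the third-party "hdfs" WebHDFS client (pip install hdfs), and the NameNode address, user, and path are illustrative assumptions:

    ```python
    # Append crawled pages to a daily file on HDFS instead of a database.
    # HDFS favors large append-only files over many small writes.
    import json
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:50070", user="crawler")  # hypothetical
    record = {"url": "http://example.com", "html": "<html>...</html>"}

    with client.write("/crawl/2017-04-18.jsonl", append=True,
                      encoding="utf-8") as writer:
        writer.write(json.dumps(record) + "\n")
    ```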

  • 巴扎黑 · 2017-04-18 10:03:31

    If you choose Hadoop as the processing platform, MySQL is generally the better choice for the underlying data storage. The MongoDB + Hadoop combination suits real-time monitoring, such as the bullet-comment stream during the Spring Festival Gala live broadcast, because MongoDB supports millisecond-level queries and real-time analysis. Hadoop is write-once, read-many; paired with MySQL it fits your project better. Security is actually about the same either way: as long as the firewall is done properly you are fine, since your database is isolated anyway. So I suggest you choose MySQL.

  • PHP中文网 · 2017-04-18 10:03:31

    “We are now going to write a crawler to capture a large amount of data (predicted to later reach on the order of two million to twenty million records).”

    With only that much data, either MySQL or MongoDB will do. Relatively speaking, though, MongoDB is more flexible.

  • 天蓬老师 · 2017-04-18 10:03:31

    Two million to twenty million records is a relatively small amount of data; consider which of the two you are more familiar with and use that. That said, once a table reaches the tens of millions of rows, query performance problems appear, so if the data keeps growing you can consider MongoDB: building a MongoDB sharded cluster is much simpler than building a MySQL cluster (see the sketch below), and it is more flexible to work with.
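    For reference, this is roughly all it takes to shard a collection once a MongoDB cluster with a mongos router is up; the database, collection, and host names are illustrative assumptions:

    ```python
    # Enable sharding on the crawl collection through a mongos router.
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")  # hypothetical router
    client.admin.command("enableSharding", "crawler")
    client.admin.command("shardCollection", "crawler.pages",
                         key={"url": "hashed"})  # hashed key spreads writes evenly
    ```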

  • 天蓬老师 · 2017-04-18 10:03:31

    1. There is no need to use Hadoop for a data volume of 2 to 20 million records, unless your team is already familiar with the Hadoop stack;

    2. Performance-wise, both MySQL and MongoDB can handle data at this level. The key is whether your data is structured or unstructured; relatively speaking, Mongo is more flexible.

  • 天蓬老师 · 2017-04-18 10:03:31

    It just so happens that my current company has done work in this area, and I am responsible for it, so I can describe it for reference.
    What I mainly do is log processing and archiving: hot/cold statistics over the access logs generated each day, various data reports, and so on. A crawler ends up being much the same.
    I first considered MySQL, but a single MySQL table performs poorly beyond tens of millions of rows, so I chose MongoDB at the time.
    What you would do is actually simple: use Python to periodically pull each day's server logs locally, use the pandas library to shape the data into the structure you want, run whatever group aggregations you need, and finally write the daily results into MongoDB (roughly as in the sketch below).
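    A minimal sketch of that daily pipeline, assuming an access log with one space-separated line per request; the file path, column names, and database names are illustrative:

    ```python
    # Parse one day's log with pandas, aggregate, and store the result in MongoDB.
    import pandas as pd
    from pymongo import MongoClient

    df = pd.read_csv("/var/log/app/access-2017-04-18.log",  # hypothetical path
                     sep=" ", names=["ts", "ip", "path", "status"])

    # Group aggregation: daily hit count and unique visitor IPs per path.
    daily = (df.groupby("path")
               .agg(hits=("ip", "size"), unique_ips=("ip", "nunique"))
               .reset_index())

    coll = MongoClient("mongodb://localhost:27017")["logs"]["daily_stats"]
    coll.insert_many(daily.assign(date="2017-04-18").to_dict("records"))
    ```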
    We currently have about 80 million documents in MongoDB, and retrieval efficiency is still acceptable. Remember to add indexes.
    Besides recording data into MongoDB, we also wrote a RESTful API with Flask so the operations system can fetch the statistical results. The operations side additionally keeps a MySQL table that collects the statistics from my MongoDB, recomputes the totals, and stores them in MySQL, so the API does not have to hit MongoDB with the same aggregation every time it serves the data.
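    And a sketch of the reporting side: a small Flask endpoint that serves the pre-aggregated MongoDB results instead of re-running the aggregation per request. The route and collection names are illustrative assumptions:

    ```python
    # Serve pre-computed daily statistics over a RESTful endpoint.
    from flask import Flask, jsonify
    from pymongo import MongoClient

    app = Flask(__name__)
    coll = MongoClient("mongodb://localhost:27017")["logs"]["daily_stats"]

    @app.route("/stats/<date>")
    def stats(date):
        # Return the stored rows; no aggregation happens at request time.
        rows = list(coll.find({"date": date}, {"_id": 0}))
        return jsonify(rows)

    if __name__ == "__main__":
        app.run()
    ```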
