java - PHP or python for data collection and analysis, what are the more mature frameworks?

WBOY · Original · 2016-10-22 00:14:10 · 1379 views

I need to automatically collect data from a website's article list and from the articles in that list. The list gives the id of each article, and every article is served through one unified interface (pass the article id as a parameter and you get the corresponding JSON back). Some of the data in that JSON needs to be collected and then analyzed.

Are there any relatively mature frameworks or wheels that meet these needs? It has to be multi-threaded and run stably 24/7, because the volume to collect is huge.

I would also like to ask how to store the collected content (millions to tens of millions of items). Some of it is numerical data that needs statistical analysis. Can I use MySQL? Or are there other mature, simple wheels I could use?

Reply content:

If it is data analysis:
MapReduce handles log analysis.
Dpark can handle PV and UV analysis.
Spark is also good.
Once the data report is produced, you can use Pandas for analysis and display.
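
To make the PV/UV point concrete, here is a minimal sketch of the computation in map-reduce style. It is plain Python, not the Dpark or Spark API, and the record fields `page` and `visitor` are made up for illustration.

```python
from collections import defaultdict

def map_record(record):
    """Map step: emit a (page, visitor) pair for one log record."""
    return record["page"], record["visitor"]

def reduce_pv_uv(pairs):
    """Reduce step: count views (PV) and distinct visitors (UV) per page."""
    views = defaultdict(int)
    visitors = defaultdict(set)
    for page, visitor in pairs:
        views[page] += 1
        visitors[page].add(visitor)
    return {page: {"pv": views[page], "uv": len(visitors[page])} for page in views}

# Hypothetical sample records; real input would come from the collected data.
log = [
    {"page": "/a", "visitor": "u1"},
    {"page": "/a", "visitor": "u1"},
    {"page": "/a", "visitor": "u2"},
    {"page": "/b", "visitor": "u3"},
]
report = reduce_pv_uv(map(map_record, log))
print(report)
```

A distributed framework would run the same two steps over partitions of the data on many machines; the logic per record stays about this simple.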

If it is data collection, there are many tools.

Why do I get the feeling you want to build a search engine? The volume is fairly large, so something distributed is recommended.
MySQL is not practical for this.

Young man, isn't what you want simply a web crawler?

  1. Crawler framework: scrapy

  2. Database selection: at your scale, MySQL with proper indexes will last you another 500 years

You can also try MongoDB
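
A rough sketch of the storage side, assuming a hypothetical `articles` table with one numeric column. It uses Python's built-in `sqlite3` as a stand-in so it runs without a server; with MySQL you would connect through a MySQL driver instead, and the upsert syntax would differ (for example `INSERT ... ON DUPLICATE KEY UPDATE`), while the aggregate query is ordinary SQL.

```python
import sqlite3

# sqlite3 stands in for MySQL here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, views INTEGER)"
)

def save_articles(conn, rows):
    """Insert or overwrite a batch of (id, title, views) rows in one transaction."""
    with conn:
        conn.executemany(
            # INSERT OR REPLACE is SQLite syntax; MySQL spells the upsert differently.
            "INSERT OR REPLACE INTO articles (id, title, views) VALUES (?, ?, ?)",
            rows,
        )

save_articles(conn, [(1, "first", 120), (2, "second", 80), (3, "third", 100)])
save_articles(conn, [(2, "second", 90)])  # re-collecting an article updates its row

# Simple statistics computed by the database itself.
count, total, average = conn.execute(
    "SELECT COUNT(*), SUM(views), AVG(views) FROM articles"
).fetchone()
print(count, total, average)
```

Keying the table on the article id means a crawler that runs around the clock can re-fetch an article without creating duplicate rows.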

You didn't say anything about the language or environment. For this kind of concurrent collection, Node.js and Python are the usual choices at the moment. Both can store the data in MySQL or something similar. Millions or tens of millions of items is not a problem.
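
As one possible shape for the Python side, here is a minimal sketch that fetches article JSON by id on a thread pool using only the standard library. The endpoint `API_URL` and its `id` parameter are placeholders, since the question does not give the real interface.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical endpoint; the real interface and parameter name are not given.
API_URL = "https://example.com/api/article?id={id}"

def fetch_article(article_id, timeout=10):
    """Download and decode the JSON for one article id."""
    url = API_URL.format(id=article_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

def collect(article_ids, fetch=fetch_article, workers=8):
    """Fetch many articles concurrently; return (results, failures) keyed by id."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, i): i for i in article_ids}
        for future in as_completed(futures):
            article_id = futures[future]
            try:
                results[article_id] = future.result()
            except Exception as exc:
                # Record the error and keep going so one bad id does not stop the run.
                failures[article_id] = exc
    return results, failures
```

Calling `collect(ids)` with the ids taken from the article list returns the decoded articles plus a dict of ids that failed, which can be retried later. Passing a different `fetch` function makes it easy to try the pool without touching the network.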

Have you ever tried Python with selenium + PhantomJS?

This is scrapy in python language or this is
