
python - What language do companies usually use for web crawlers?

What language do companies usually use for crawling and scraping? When I search for books on JD.com, they are all about Java.

阿神 · 2741 days ago · 1675

All replies (30)

  • ringa_lee · 2017-04-17 17:50:02

    You can try jsoup, an HTML parsing tool written in Java.

  • 阿神 · 2017-04-17 17:50:02

    We're starting to use Node now. JavaScript is the language that understands HTML best.

  • 天蓬老师 · 2017-04-17 17:50:02

    nodejs +1

  • PHP中文网 · 2017-04-17 17:50:02

    nodejs +1

  • 伊谢尔伦 · 2017-04-17 17:50:02

    Actually, I don't quite agree with the person who wrote the DHT crawler.
    Different languages naturally suit different jobs; arguing about which is better or worse with no context is pointless.
    1. If you are doing it for fun, crawling a few pages in a targeted way, and efficiency is not a core requirement, there is no real problem: any language will work and the performance gap will not matter much. Of course, if you hit a very complex page and your regular expressions get very involved, the crawler's maintainability will suffer (a sketch of this simplest case follows).
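
    For illustration, a minimal sketch of this simplest case in Python, assuming the third-party requests library; the URL and the regex pattern are made up for the example:

        # Fetch one page and pull data out with a regular expression.
        # http://example.com/books and the <h2> pattern are placeholders.
        import re
        import requests

        resp = requests.get("http://example.com/books", timeout=10)
        resp.raise_for_status()

        # Fragile by design: on complex pages, regexes like this quickly
        # become the maintainability problem described above.
        titles = re.findall(r"<h2[^>]*>(.*?)</h2>", resp.text, re.S)
        for title in titles:
            print(title.strip())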

    2. If you are doing targeted crawling and the target renders its content with dynamic JS,
    then the ordinary approach of requesting the page and reading the response simply will not work. You need a JS engine like the one in Firefox or Chrome to actually execute the page's scripts. For that I would recommend casperJS + phantomjs or slimerJS + phantomjs.

    3. If it is large-scale website crawling,
    then efficiency, scalability, and maintainability all have to be considered.
    Large-scale crawling involves many problems: distributed crawling, deduplication, task scheduling. None of these is easy once you dig into it (a toy sketch of the deduplication idea follows these points).
    Language selection is very important at this stage.
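
    To make the deduplication point concrete, a toy sketch in Python; a real large-scale crawler would put this behind a Bloom filter or a shared store such as Redis rather than an in-process set:

        # Toy URL frontier with deduplication -- the single-machine analogue
        # of what a distributed crawler does with a shared dedup store.
        from collections import deque

        seen = set()        # URLs already scheduled (deduplication)
        frontier = deque()  # URLs waiting to be crawled (task scheduling)

        def schedule(url):
            key = url.rstrip("/")  # crude normalization, just for the demo
            if key not in seen:
                seen.add(key)
                frontier.append(url)

        schedule("http://example.com/a")
        schedule("http://example.com/a/")  # duplicate, silently dropped
        schedule("http://example.com/b")
        print(list(frontier))  # ['http://example.com/a', 'http://example.com/b']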

    NodeJs: very efficient for crawling. High concurrency is cheap, multi-threaded-style work becomes simple iteration plus callbacks, and memory and CPU usage stay low, but you do have to keep the callbacks under control.

    PHP: frameworks everywhere, just grab any one. But PHP's efficiency is a real problem... I'll leave it at that.

    Python: I write mostly Python, and it has good support for all kinds of problems. The scrapy framework is easy to use and has a lot going for it.

    Personally I don't think plain JS is well suited to this, for efficiency reasons; but I haven't written a crawler in it, so take that with a grain of salt.

    As far as I know, big companies also use C++. In any case, most crawlers are built on top of open source frameworks; few people truly reinvent the wheel. It isn't worth it.

    I wrote this off the top of my head; corrections welcome.

  • PHP中文网 · 2017-04-17 17:50:02

    Use pyspider: its performance is no worse than scrapy's, it's more flexible, comes with a web UI, and supports crawling JS-rendered pages too~
    You can try its demo yourself~
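
    For flavor, this is roughly what pyspider's default handler template looks like (you normally edit and run it from the web UI; example.com is a placeholder):

        from pyspider.libs.base_handler import *

        class Handler(BaseHandler):
            crawl_config = {}

            @every(minutes=24 * 60)
            def on_start(self):
                # Seed URL; pyspider schedules and deduplicates requests itself.
                self.crawl('http://example.com/', callback=self.index_page)

            @config(age=10 * 24 * 60 * 60)
            def index_page(self, response):
                # Queue every outgoing link for the detail handler.
                for each in response.doc('a[href^="http"]').items():
                    self.crawl(each.attr.href, callback=self.detail_page)

            def detail_page(self, response):
                return {"url": response.url, "title": response.doc('title').text()}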

  • 迷茫 · 2017-04-17 17:50:02

    selenium
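
    For anyone who hasn't used it: selenium drives a real browser, so JS-rendered pages come back already executed. A minimal sketch with the Python bindings (needs the selenium package plus a matching browser driver on your PATH; the URL is a placeholder):

        # Render a JS-heavy page by driving a real browser through selenium.
        from selenium import webdriver

        driver = webdriver.Chrome()  # Firefox() works too; older selenium also had PhantomJS()
        try:
            driver.get("http://example.com/")
            html = driver.page_source  # the DOM after scripts have run
            print(driver.title, len(html))
        finally:
            driver.quit()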

  • 黄舟 · 2017-04-17 17:50:02

    nodejs +1

    No, wait, I was wrong.

    Unlike a server, a high-performance crawler doesn't gain that much from concurrency; for efficiency (less duplicated work), parallelism suits it better than concurrency.

    Hmm, wrong again.

    Concurrency and parallelism are pretty much the same thing for a crawler~

    No, they're different.

    Forget it, nodejs +1.

  • 大家讲道理 · 2017-04-17 17:50:02

    Most use Python, though there is plenty of Java and C++ as well. Python is quick to get going with and has big advantages for small and medium jobs. At large scale you optimize, or rewrite the performance bottlenecks in C.

  • 天蓬老师 · 2017-04-17 17:50:02

    You can try Python's scrapy; a minimal example follows.
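
    A minimal sketch of a scrapy spider, targeting quotes.toscrape.com, the public practice site used in scrapy's tutorial; save it as quotes_spider.py and run it with: scrapy runspider quotes_spider.py

        # A self-contained spider: scrapy handles scheduling, dedup and retries.
        import scrapy

        class QuotesSpider(scrapy.Spider):
            name = "quotes"
            start_urls = ["http://quotes.toscrape.com/"]

            def parse(self, response):
                for quote in response.css("div.quote"):
                    yield {"text": quote.css("span.text::text").get()}
                # Follow pagination; scrapy queues the request for us.
                next_page = response.css("li.next a::attr(href)").get()
                if next_page:
                    yield response.follow(next_page, callback=self.parse)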
