This article describes how to implement a P2P seed (torrent) search function in Java.
I became interested in P2P many years ago, but that interest stayed theoretical and I never had the chance to put it into practice. I recently implemented this search function, and I think some of what I learned along the way is worth sharing. Let's get to the point.
Before talking about P2P, I want to talk about how we download files. Here are several ways to download a file:
1. Download over the HTTP protocol. The most common way is probably downloading a file through a browser.
2. Download over FTP. FTP has two modes. One is port (active) mode: the client opens a local port N (>1023) to establish the FTP control connection, then listens on port N+1 and sends that port number to the FTP server for the data transfer. Because the server has to connect back to the client, this mode fails when the client is behind a firewall or NAT. The other is passive mode: in addition to port 21, the FTP server opens a port greater than 1023, and the client initiates both the control connection and the data connection. As long as that port is open on the FTP server, there is no problem.
Both methods use a client/server (C/S) architecture, in which resources are concentrated on the server. When the amount of data grows large enough, the server becomes a bottleneck. To solve this problem we naturally think of distribution and decentralization, and this is how P2P came about. P2P stands for peer to peer: in this architecture every node is both a client and a server.
Once resources are stored on individual nodes, a question arises: when I want to download a resource, how do I know which machines hold the file and whether it can be downloaded?
Early P2P architectures had a tracker role. The tracker is responsible for storing the metadata of the files: the files themselves are saved on the peers, and information about them is obtained from the tracker.
Under this architecture the files are all distributed, while the tracker stores the metadata of every file. The tracker therefore only needs to store a small amount of data, which is much easier than storing the files themselves.
But once the tracker server goes down or its service becomes unavailable, no file can be downloaded, because the architecture is not fully distributed. To decentralize completely, a trackerless architecture was developed later.
In this architecture the tracker no longer exists, and everything, including the file metadata, is stored in a distributed manner.
A DHT (Distributed Hash Table) is used to replace the tracker. There are many algorithms for implementing a DHT, such as Kademlia.
Several concepts:
1. node ID: each node ID in the DHT network is 160 bits long.
2. XOR: the distance between two nodes is calculated with XOR.
3. routing table: each node keeps a routing table of the other nodes it knows about.
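The XOR distance between two node IDs can be sketched as follows. This is only an illustration, assuming IDs are held as 20-byte arrays; the class and method names (XorDistance, distance, isCloser) are made up for this example.

```java
import java.math.BigInteger;

/** Sketch of the XOR distance between two 160-bit node IDs. */
final class XorDistance {
    static final int ID_LENGTH = 20; // 160 bits = 20 bytes

    /** Returns the XOR of the two IDs, interpreted as an unsigned integer. */
    static BigInteger distance(byte[] a, byte[] b) {
        if (a.length != ID_LENGTH || b.length != ID_LENGTH) {
            throw new IllegalArgumentException("node IDs must be 20 bytes");
        }
        byte[] xor = new byte[ID_LENGTH];
        for (int i = 0; i < ID_LENGTH; i++) {
            xor[i] = (byte) (a[i] ^ b[i]);
        }
        return new BigInteger(1, xor); // signum 1: treat the bytes as unsigned
    }

    /** True if candidate is closer to target than other is. */
    static boolean isCloser(byte[] target, byte[] candidate, byte[] other) {
        return distance(target, candidate).compareTo(distance(target, other)) < 0;
    }
}
```

A smaller result means the two IDs are closer; an ID's distance to itself is 0.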
The focus here is implementation. There is a lot of material on the Internet covering the principles, which you can refer to.
Implementing seed search takes two steps. The first is a crawler that collects seed information from the network. The second is to add search on top of the collected data.
This requires knowledge of the following: seeds, the BitTorrent DHT protocol, and bencoding.
When it comes to P2P, we have to mention seeds, that is, files ending in .torrent. Most people have probably used a BT seed to download files, and those downloads use the BitTorrent protocol. So how do we collect seeds from the network?
The main fields contained in a BT seed are described here: https://segmentfault.com/a/1190000000681331
Seeds obtained through the DHT are called trackerless torrents. They have no announce attribute and have a nodes attribute instead. The official recommendation is not to automatically add router.bittorrent.com to seeds or to the routing table.
To obtain seed information, you need a thorough understanding of the DHT protocol, which is described in bep_0005: http://www.bittorrent.org/beps/bep_0005.html
How to implement a routing table:
The routing table covers all node IDs, from 0 to 2^160. It is made up of buckets, and each bucket covers part of the ID space.
At the beginning the routing table has only one bucket, which covers all node IDs. Each bucket can hold at most K nodes; K is currently 8. If a bucket is full, all the nodes in it are good, and our own node ID falls within the range of that bucket, the bucket is split into two new buckets that each cover half of the old range. The initial bucket, for example, splits into buckets covering 0..2^159 and 2^159..2^160.
When a bucket is full of good nodes, a new node is simply discarded. If a node in the bucket has gone offline, it is replaced by the new node. If a node has not been heard from in the last 15 minutes, it is pinged; if it does not respond, it is also replaced.
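A bucket and its split can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the implementation from my project: IDs are held as BigInteger values, the Bucket class and its fields are illustrative names, and the tracking of good, questionable and bad nodes is left out.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Sketch of one routing-table bucket covering the ID range [min, max). */
final class Bucket {
    static final int K = 8; // maximum number of nodes per bucket

    final BigInteger min; // inclusive lower bound
    final BigInteger max; // exclusive upper bound
    final List<BigInteger> nodeIds = new ArrayList<>();

    Bucket(BigInteger min, BigInteger max) {
        this.min = min;
        this.max = max;
    }

    /** True if the given ID falls inside this bucket's range. */
    boolean covers(BigInteger id) {
        return id.compareTo(min) >= 0 && id.compareTo(max) < 0;
    }

    boolean isFull() {
        return nodeIds.size() >= K;
    }

    /** Splits this bucket at the midpoint and redistributes its nodes. */
    Bucket[] split() {
        BigInteger mid = min.add(max).shiftRight(1);
        Bucket low = new Bucket(min, mid);
        Bucket high = new Bucket(mid, max);
        for (BigInteger id : nodeIds) {
            (low.covers(id) ? low : high).nodeIds.add(id);
        }
        return new Bucket[] {low, high};
    }
}
```

A routing table built on this would call split() only for a full bucket whose range contains our own node ID, and otherwise discard the new node or replace a dead one.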
Each bucket should have a last changed attribute that indicates how fresh the bucket is. This field is updated in these situations:
1. A node in the bucket is pinged and responds.
2. A node is added to the bucket.
3. A node in the bucket is replaced.
If a bucket has not updated this field in 15 minutes, a random ID within the bucket's range is chosen and a find_node operation is performed on it.
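The refresh rule can be sketched like this. It is only an illustration: the BucketRefresh class and its method names are hypothetical, and the bucket range is assumed to be given as BigInteger bounds [min, max).

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.time.Duration;
import java.time.Instant;

/** Sketch of the 15-minute bucket refresh rule. */
final class BucketRefresh {
    static final Duration REFRESH_AFTER = Duration.ofMinutes(15);
    private static final SecureRandom RANDOM = new SecureRandom();

    /** True if the bucket's last changed time is at least 15 minutes before now. */
    static boolean needsRefresh(Instant lastChanged, Instant now) {
        return Duration.between(lastChanged, now).compareTo(REFRESH_AFTER) >= 0;
    }

    /** Picks a random ID in [min, max) to use as the find_node target. */
    static BigInteger randomIdInRange(BigInteger min, BigInteger max) {
        BigInteger span = max.subtract(min);
        if (span.signum() <= 0) {
            throw new IllegalArgumentException("max must be greater than min");
        }
        BigInteger offset;
        do {
            // Draw from [0, 2^bitLength) and retry until the value is below span.
            offset = new BigInteger(span.bitLength(), RANDOM);
        } while (offset.compareTo(span) >= 0);
        return min.add(offset);
    }
}
```

A periodic task could call needsRefresh for every bucket and send a find_node query for randomIdInRange(min, max) whenever it returns true.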
Messages in the DHT network are transmitted using the KRPC protocol.
1. ping
The ping query is mainly used as a heartbeat check.
2. find_node
Finds a node. The queried node looks up the N nodes closest to the target in its own routing table and returns them; N is usually 8.
3. get_peers
Finds the peers that have a given infohash. If the queried node knows peers for the infohash, it returns them; otherwise it returns the nodes closest to the infohash.
4. announce_peer
Tells other nodes that this peer also has the infohash.
Note that all four queries refresh the routing table.
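KRPC messages are bencoded dictionaries, so building a query comes down to bencoding a small map. Below is a minimal sketch of an encoder that is just enough for ping and find_node queries. The Krpc class and its method names are my own for this example; the message keys (t, y, q, a, id, target) are the ones defined in bep_0005.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Minimal bencode encoder, just enough to build KRPC queries. */
final class Krpc {
    static byte[] encode(Object value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(value, out);
        return out.toByteArray();
    }

    private static void raw(String s, ByteArrayOutputStream out) {
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        out.write(b, 0, b.length);
    }

    private static void write(Object value, ByteArrayOutputStream out) {
        if (value instanceof byte[]) {
            byte[] bytes = (byte[]) value;
            raw(bytes.length + ":", out); // byte string: <length>:<bytes>
            out.write(bytes, 0, bytes.length);
        } else if (value instanceof String) {
            write(((String) value).getBytes(StandardCharsets.UTF_8), out);
        } else if (value instanceof Integer || value instanceof Long) {
            raw("i" + value + "e", out); // integer: i<number>e
        } else if (value instanceof List) {
            out.write('l');
            for (Object item : (List<?>) value) {
                write(item, out);
            }
            out.write('e');
        } else if (value instanceof Map) {
            // Dictionary keys must be written in sorted order; TreeMap does that.
            TreeMap<String, Object> sorted = new TreeMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                sorted.put((String) e.getKey(), e.getValue());
            }
            out.write('d');
            for (Map.Entry<String, Object> e : sorted.entrySet()) {
                write(e.getKey(), out);
                write(e.getValue(), out);
            }
            out.write('e');
        } else {
            throw new IllegalArgumentException("unsupported type: " + value);
        }
    }

    private static byte[] query(String t, String name, Map<String, Object> args) {
        Map<String, Object> msg = new TreeMap<>();
        msg.put("t", t);    // transaction id chosen by the sender
        msg.put("y", "q");  // message type: query
        msg.put("q", name); // query name
        msg.put("a", args); // query arguments
        return encode(msg);
    }

    static byte[] ping(String t, byte[] nodeId) {
        Map<String, Object> args = new TreeMap<>();
        args.put("id", nodeId);
        return query(t, "ping", args);
    }

    static byte[] findNode(String t, byte[] nodeId, byte[] target) {
        Map<String, Object> args = new TreeMap<>();
        args.put("id", nodeId);
        args.put("target", target);
        return query(t, "find_node", args);
    }
}
```

The byte array returned by ping or findNode is what would be placed in a UDP datagram. A real crawler also needs a matching decoder for responses, which is omitted here.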
The routing table has no nodes at the beginning, so you need to start from a super node (for example dht.transmissionbt.com): send it find_node requests to find and add nodes, and then send find_node requests to the nodes it returns.
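The nodes returned by find_node arrive in the compact node info format from bep_0005: each node takes 26 bytes, made up of a 20-byte node ID, a 4-byte IPv4 address and a 2-byte port in network byte order. A minimal parsing sketch, with CompactNodes and Node as illustrative names:

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of parsing the "nodes" value of a find_node response. */
final class CompactNodes {
    static final int NODE_LENGTH = 26; // 20-byte id + 4-byte IPv4 + 2-byte port

    static final class Node {
        final byte[] id;
        final InetSocketAddress address;

        Node(byte[] id, InetSocketAddress address) {
            this.id = id;
            this.address = address;
        }
    }

    static List<Node> parse(byte[] nodes) {
        if (nodes.length % NODE_LENGTH != 0) {
            throw new IllegalArgumentException("length must be a multiple of 26");
        }
        List<Node> result = new ArrayList<>();
        for (int off = 0; off < nodes.length; off += NODE_LENGTH) {
            byte[] id = Arrays.copyOfRange(nodes, off, off + 20);
            byte[] ip = Arrays.copyOfRange(nodes, off + 20, off + 24);
            // The port is big-endian; mask each byte to read it as unsigned.
            int port = ((nodes[off + 24] & 0xFF) << 8) | (nodes[off + 25] & 0xFF);
            try {
                InetAddress addr = InetAddress.getByAddress(ip);
                result.add(new Node(id, new InetSocketAddress(addr, port)));
            } catch (UnknownHostException e) {
                // Cannot happen for a 4-byte address, but the API declares it.
                throw new IllegalStateException(e);
            }
        }
        return result;
    }
}
```

Each parsed node can then be added to the routing table and used as the destination of the next find_node request.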
The routing table I implemented myself is slightly different from the one described above.
The DHT network uses UDP for data transmission, so I only need to open one UDP port and keep sending find_node requests to build the routing table, and then collect seed infohashes from the get_peers and announce_peer queries received from other nodes.
After joining the DHT network, the four queries introduced above only give us the infohash of a seed, so we still need to download the seed itself using the infohash. For details, see bep_0009: http://www.bittorrent.org/beps/bep_0009.html
We mainly use bep_0009 to obtain the name field of the seed. Once we have the name, we can build an index on the name and infohash to provide search. (Here we mainly build magnet links. With a magnet link, you can download the resource with Thunder, Baidu Netdisk, and so on.)
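To make the idea of indexing by name and infohash concrete, here is a toy in-memory inverted index. It is purely an illustration and not the storage used in my project: the TorrentIndex class is hypothetical, and a real deployment would more likely hand this job to a search engine or database.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index from name tokens to infohashes (hex strings). */
final class TorrentIndex {
    private final Map<String, Set<String>> tokenToHashes = new HashMap<>();

    void add(String infoHashHex, String name) {
        for (String token : tokenize(name)) {
            tokenToHashes.computeIfAbsent(token, k -> new TreeSet<>()).add(infoHashHex);
        }
    }

    /** Returns the infohashes whose name contains every token of the query. */
    Set<String> search(String query) {
        Set<String> result = null;
        for (String token : tokenize(query)) {
            Set<String> hits = tokenToHashes.getOrDefault(token, Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? Collections.<String>emptySet() : result;
    }

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

The crawler would call add each time it resolves an infohash to a name, and the search page would call search with the user's keywords.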
The most common magnet link format: magnet:?xt=urn:btih:infohash
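Building such a link from a raw infohash is a matter of hex-encoding the 20 bytes. A small sketch, with Magnet and fromInfoHash as illustrative names:

```java
/** Sketch of building a magnet link from a 20-byte infohash. */
final class Magnet {
    static String fromInfoHash(byte[] infoHash) {
        if (infoHash.length != 20) {
            throw new IllegalArgumentException("infohash must be 20 bytes");
        }
        StringBuilder sb = new StringBuilder("magnet:?xt=urn:btih:");
        for (byte b : infoHash) {
            // Mask to an unsigned value so negative bytes print as two hex digits.
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }
}
```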
The method introduced above builds a magnet link from the infohash and then relies on third-party software for the download. Of course, you can also implement the download yourself using the BitTorrent protocol; if you are interested, you can look into it.
That covers the implementation steps only briefly; many details and specifics are not mentioned. For my own implementation I referred to several DHT projects on GitHub and then wrote it myself. The code is here: https://github.com/mistletoe9527/dht-spider