Home  >  Article  >  Java  >  Hash--Introduction to common algorithms

Hash--Introduction to common algorithms

零下一度
零下一度Original
2017-06-29 11:29:192835browse

Hash (Hash), also known as hashing, is a very common algorithm. Hash is mainly used in JavaHashMap data structure. The hash algorithm consists of two parts: hash function and hash table. We can know the characteristics of our array. We can quickly locate elements through subscripts (O(1)). Similarly, in the hash table, we can use the key (hash value) To quickly locate a certain value, the hash value is calculated through the hash function (hash(key) = address). The element [address] = value can be located through the hash value. The principle is similar to that of an array.

The best hash function is of course one that can calculate a unique hash value for each key value, but there may often be different keyvalue hash value, which causes conflicts. There are two aspects to judge whether a hash function is well designed:

 1. Few conflicts .

 2. Calculation is fast.

The following are several commonly used hash functions. They all have certain mathematical principles behind them and have undergone a lot of practice. Their mathematical principles will not be explored here.

BKDRHash function (h = 31 * h + c)

This hash function is applied to string hash value calculation in Java.

//String#hashCodepublic int hashCode() {int h = hash;if (h == 0 && value.length > 0) {char val[] = value;for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];    //BKDR 哈希函数,常数可以是131、1313、13131……        }
        hash = h;
    }return h;
}

##DJB2 Hash function (h = h << 5 + h + c = h = 33 * h + c

ElasticSearch

uses DJB2The hash function hashes the specified key of the document to be indexed.

##SDBMHash function (h = h << 6 + h << 16 - h + c = 65599 * h + c) is applied in

SDBM

(a simple database engine). The above is just a list of three hash functions. Let’s do an experiment to see how they conflict.

 

Java

 1 package com.algorithm.hash; 2  3 import java.util.HashMap; 4 import java.util.UUID; 5  6 /** 7  * 三种哈希函数冲突数比较 8  * Created by yulinfeng on 6/27/17. 9  */10 public class HashFunc {11 12     public static void main(String[] args) {13         int length = 1000000;   //100万字符串14         //利用HashMap来计算冲突数,HashMap的键值不能重复所以length - map.size()即为冲突数15         HashMap<String, String> bkdrMap = new HashMap<String, String>();16         HashMap<String, String> djb2Map = new HashMap<String, String>();17         HashMap<String, String> sdbmMap = new HashMap<String, String>();18         getStr(length, bkdrMap, djb2Map, sdbmMap);19         System.out.println("BKDR哈希函数100万字符串的冲突数:" + (length - bkdrMap.size()));20         System.out.println("DJB2哈希函数100万字符串的冲突数:" + (length - djb2Map.size()));21         System.out.println("SDBM哈希函数100万字符串的冲突数:" + (length - sdbmMap.size()));22     }23 24     /**25      * 生成字符串,并计算冲突数26      * @param length27      * @param bkdrMap28      * @param djb2Map29      * @param sdbmMap30      */31     private static void getStr(int length, HashMap<String, String> bkdrMap, HashMap<String, String> djb2Map, HashMap<String, String> sdbmMap) {32         for (int i = 0; i < length; i++) {33             System.out.println(i);34             String str = UUID.randomUUID().toString();35             bkdrMap.put(String.valueOf(str.hashCode()), str);   //Java的String.hashCode就是BKDR哈希函数, h = 31 * h + c36             djb2Map.put(djb2(str), str);    //DJB2哈希函数37             sdbmMap.put(sdbm(str), str);    //SDBM哈希函数38         }39     }40 41     /**42      * djb2哈希函数43      * @param str44      * @return45      */46     private static String djb2(String str) {47         int hash = 0;48         for (int i = 0; i != str.length(); i++) {49             hash = hash * 33 + str.charAt(i);    //h = h << 5 + h + c = h = 33 * h + c50         }51         return String.valueOf(hash);52     }53 54     /**55      * sdbm哈希函数56      * @param str57      * @return58      */59     private static String sdbm(String str) {60         int hash = 0;61         for (int i = 0; i != str.length(); i++) {62             hash = 65599 * hash + str.charAt(i);    //h = h << 6 + h << 16 - h + c = 65599 * h + c63         }64         return String.valueOf(hash);65     }66 }
The following are
10

万、100 Ten thousand, 200 Conflict number situation:

After repeated trials, the number of collisions of the three hash functions is actually about the same.

 

Python3

 1 import uuid 2  3 def hash_test(length, bkdrDic, djb2Dic, sdbmDic): 4     for i in range(length): 5         string = str(uuid.uuid1())  #基于时间戳 6         bkdrDic[bkdr(string)] = string 7         djb2Dic[djb2(string)] = string 8         sdbmDic[sdbm(string)] = string 9 10 #BDKR哈希函数11 def bkdr(string):12     hash = 013     for i in range(len(string)):14         hash = 31 * hash + ord(string[i])   # h = 31 * h + c15     return hash16 17 #DJB2哈希函数18 def djb2(string):19     hash = 020     for i in range(len(string)):21         hash = 33 * hash + ord(string[i])   # h = h << 5 + h + c22     return hash23 24 #SDBM哈希函数25 def sdbm(string):26     hash = 027     for i in range(len(string)):28         hash = 65599 * hash + ord(string[i])    # h = h << 6 + h << 16 - h + c29     return hash30 31 length = 10032 bkdrDic = dict() #bkdrDic = {}33 djb2Dic = dict()34 sdbmDic = dict()35 hash_test(length, bkdrDic, djb2Dic, sdbmDic)36 print("BKDR哈希函数100万字符串的冲突数:%d"%(length - len(bkdrDic)))37 print("DJB2哈希函数100万字符串的冲突数:%d"%(length - len(djb2Dic)))38 print("SDBM哈希函数100万字符串的冲突数:%d"%(length - len(sdbmDic)))

A hash table is a data structure that needs to be used with a hash function to create an index for quick search ——"Algorithm Notes". Generally speaking, it is a fixed-length storage space. For example, HashMapThe default hash table is a fixed-length hash table of 16EntryArray. After having a fixed-length storage space, the remaining question is how to put the value into which position. Usually if the hash value is m, the length is n, then this value is placed at the m mod n position.

The picture above shows the hash and hash table, as well as the solution to the conflict (zipper method). There are many solutions to conflicts, including hashing again until there is no conflict, and also using the zipper method to use linked lists to concatenate elements at the same position as shown in the figure above.

Imagine that the length of the hash table in the above example is 10, resulting in 1 conflicts. If the hash The table length is 20, then there will be no conflict. The search is faster but will waste more space. If the hash table length is 2, the conflict search will be inverted 3 times slower, but this will save a lot of space. So the length selection of the hash table is crucial, but it is also an important problem.

Supplement:

Hashes are used in many aspects. For example, different values ​​have different hash values, but The hash algorithm can also be designed so that similar or identical values ​​have similar or identical hash values. That is to say, if two objects are completely different, then their hash values ​​are completely different; if two objects are exactly the same, then their hash values ​​are also exactly the same; the more similar the two objects are, then their hash values ​​are also completely different. The more similar they are. This is actually a similarity problem, which means that this idea can be generalized and applied to similarity calculations (such as the Jaccard distance problem), and ultimately applied to accurate advertising placement, product recommendation, etc.

In addition, consistent hashing can also be applied in load balancing. How to ensure that each server can evenly distribute the load pressure, a good hash algorithm can also do it.

The above is the detailed content of Hash--Introduction to common algorithms. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Previous article:Detailed explanation of 4 different forms of data source configurationNext article:Detailed explanation of 4 different forms of data source configuration

Related articles

See more