Home  >  Q&A  >  body text

java - 算法问题,2个数组,数组a保存1000W条手机号,数组b保存5000W条,找出两个数组相同的手机号,执行时间需要 <2s

有人说用归并算法,但是执行下来时间远远不止2s。
在下实在想不出还有什么好办法,希望各位能给个提示或者解法,谢谢。

下面是我的测试代码:

  public class TestA {

    public static void main(String[] args) {
        long[] a = new long[50000000];
        long num = 13000000000l;
        for (int i = 0; i < a.length; i++) {
            a[i] = (num + i);
        }
        long[] b = new long[10000000];
        long num2 = 14000000000l;
        for (int i = 0; i < b.length - 2; i++) {
            b[i] = (num2 + i);
        }
        b[9999999] =  13000000000l;
        b[9999998] =  13000000001l;

        long[] c = new long[a.length+b.length];
        long start = System.currentTimeMillis();
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i];
        }
        for (int i = 0; i < b.length; i++) {
            c[i + a.length] = b[i];
        }
        System.out.println("start");
        sort(c, 0, c.length-1);
        long end = System.nanoTime();
        System.out.println(System.currentTimeMillis() - start);
        for (int i = 0; i < c.length; i++) {
            
            System.out.println(c[i]);
        }

    }

    public static void sort(long[] data, int left, int right) {
        if (left < right) {
            // 找出中间索引
            int center = (left + right) / 2;
            // 对左边数组进行递归
            sort(data, left, center);
            // 对右边数组进行递归
            sort(data, center + 1, right);
            // 合并
            merge(data, left, center, right);
        }
    }

    public static void merge(long[] data, int left, int center, int right) {
        long[] tmpArr = new long[data.length];
        int mid = center + 1;
        // third记录中间数组的索引
        int third = left;
        int tmp = left;
        while (left <= center && mid <= right) {

            // 从两个数组中取出最小的放入中间数组
            if (data[left] <= data[mid]) {
                if(data[left] == data[mid]){
                    System.out.println(data[left]);
                }
                tmpArr[third++] = data[left++];
             
            } else {
                tmpArr[third++] = data[mid++];
            }
        }
        // 剩余部分依次放入中间数组
        while (mid <= right) {
            tmpArr[third++] = data[mid++];
        }
        while (left <= center) {
            tmpArr[third++] = data[left++];
        }
        // 将中间数组中的内容复制回原数组
        while (tmp <= right) {
            data[tmp] = tmpArr[tmp++];
        }
    }

}
怪我咯怪我咯2763 days ago14364

reply all(10)I'll reply

  • 大家讲道理

    大家讲道理2017-04-18 10:53:41

    To provide an idea, what we are looking for is the same phone number in the two arrays. Then we build a dictionary tree from the phone numbers in the first array.
    In the second time, just search it directly in the dictionary tree. The overall complexity is O(N * 11) = O(N). 11 digits per phone number. The total complexity is only O(N).
    Reference Zhihu: https://zhuanlan.zhihu.com/p/...

    public class TrieTree {
        private class Node {
            char ch;
            TreeMap<Character, Node> node;
            int count;
    
            public Node(char ch) {
                ch = this.ch;
                node = new TreeMap<>();
                count = 0;
            }
        }
    
        public class StringCount implements Comparable{
            public String str;
            public int count;
    
            public StringCount(String str, int count) {
                this.str = str;
                this.count = count;
            }
    
            @Override
            public int compareTo(Object o) {
                StringCount t = (StringCount) o;
                if(this.count > t.count){
                    return -1;
                }
                if(this.count < t.count){
                    return 1;
                }
                return 0;
            }
        }
    
        private int indexStringCount;
        private StringCount[] stringCounts;
        private Node root;
        private int size;
        private static boolean DEBUG = true;
        
        public TrieTree() {
            root = new Node('$');
            size = 0;
        }
    
        public int insert(String str) {
            int res = 0;
            Node temp = root;
            for (int i = 0; i < str.length(); i++) {
                char ch = str.charAt(i);
                if (temp.node.get(ch) == null) temp.node.put(ch, new Node(ch));
                temp = temp.node.get(ch);
            }
            if(temp.count == 0) size ++;
            res = ++temp.count;
            return res;
        }
        
        public StringCount[] getStringCount(){
            indexStringCount = 0;
            stringCounts = new StringCount[this.size];
            ergodic(root, "");
            Arrays.sort(stringCounts);
            {
                for(StringCount s : stringCounts){
                    System.out.println(s.str + " " + s.count);
                }
            }
            return stringCounts;
        }
        private void ergodic(Node foo, String str){
            if(foo.count != 0){
                stringCounts[indexStringCount ++] = new StringCount(str, foo.count);
            }
            for(Character ch:foo.node.keySet()){
                ergodic(foo.node.get(ch), str + ch);
            }
        }
        
        private void show(Node foo, String str) {
            if (foo.count != 0) {
                System.out.println(str + " : " + foo.count);
            }
            for(Character ch:foo.node.keySet()){
                show(foo.node.get(ch), str + ch);
            }
        }
        public void showALL() {
            show(root, "");
        }
        public int size(){
            return size;
        }
        
        public static void main(String[] args) {
            TrieTree tree = new TrieTree();
            String[] strs = { "a", "aa", "a", "b", "aab", "bba" };
            for (int i = 0; i < strs.length; i++) {
                tree.insert(strs[i]);
            }
            tree.showALL();
            System.out.println(tree.size);
            tree.getStringCount();
        }
    }

    reply
    0
  • 巴扎黑

    巴扎黑2017-04-18 10:53:41

    Just found a method, the execution time is about 2s:

    public class TestB {
        static long count;
    
        public static void main(String[] args) {
            long[] a = new long[50000000];
            long num = 13000000000l;
            for (int i = 0; i < a.length; i++) {
                a[i] = num + i;
            }
    
            long[] b = new long[10000000];
            long num2 = 14000000000l;
            for (int i = 0; i < b.length - 3; i++) {
                b[i] = num2 + i;
            }
            b[9999999] = 13000000000l;
            b[9999998] = 13000000002l;
            b[9999997] = 13000000002l;
    
            long start = System.currentTimeMillis();
      
            Arrays.sort(a);
            int flag = -1;
            for (int i = 0; i < b.length; i++) {
                count = b[i];
                flag = Arrays.binarySearch(a, count);
                if (flag <= 50000000 && flag >= 0) {
                    System.out.println(count + "   " +flag);
                }
            }
            System.out.println(System.currentTimeMillis() - start);
        }
    }
    

    The main idea of ​​this method is to sort first, then use the Arrays.binarySearch() method to perform a binary query and return the matching array subscript.


    I modified the method of obtaining the data source and found that if a random data source is used, the time consumed is about 8s. The error time is mainly consumed in the sort() sorting method. The rules of the data source still affect the time complexity of the sorting algorithm.
    The code is modified as follows:

    public class TestB {

    static long count;
    
    public static void main(String[] args) {
        System.out.println(random());
        long[] a = new long[50000000];
        for (int i = 0; i < a.length; i++) {
            a[i] = random();
        }
    
        long[] b = new long[10000000];
        for (int i = 0; i < b.length; i++) {
            b[i] = random();
        }
        long start = System.currentTimeMillis();
        Arrays.sort(b);
        Arrays.sort(a);
        int flag = -1;
        int cc =0;
        for (int i = 0; i < b.length; i++) {
            count = b[i];
            flag = Arrays.binarySearch(a, count);
            if (flag <= 50000000 && flag >= 0) {
                System.out.println(count + "   " + flag);
                cc++;
            }
        }
        System.out.println("相同数据的数量:"+cc);
        System.out.println(System.currentTimeMillis() - start);
    }
    
    public static long random() {
        return Math.round((new Random()).nextDouble()*10000000000L+10000000000L);
    }

    }

    reply
    0
  • 迷茫

    迷茫2017-04-18 10:53:41

    考虑bitmap, 参考https://github.com/RoaringBit...
    RoaringBitmap aBM = new RoaringBitmap()
    for (int i = 0; i < a.length; i++) {

    aBM.add(a[i])

    }
    ...
    RoaringBitmap interactionBM = RoaringBitmap.and(aBM, bBM)
    for (int item: interactionBM) {

    System.out.println(item)

    }

    reply
    0
  • 迷茫

    迷茫2017-04-18 10:53:41

            long start = System.currentTimeMillis();
            HashSet<Long> alongs = new HashSet<>();
    
            for (long l : b) {
                alongs.add(l);
            }
    
            ArrayList<Long> sames = new ArrayList<>();
            for (long l : a) {
                if (alongs.contains(l)) {
                    sames.add(l);
                }
            }
    
            long end = System.currentTimeMillis();
            System.out.println(end - start);

    Using the above code, on my machine, it is 8s

    reply
    0
  • 阿神

    阿神2017-04-18 10:53:41

    http://tieba.baidu.com/p/3866...

    reply
    0
  • 巴扎黑

    巴扎黑2017-04-18 10:53:41

    C#, run locally, release, 611ms

    long[] a = new long[50000000];
    long num = 13000000000L;
    for (int i = 0; i < a.Length; i++)
    {
        a[i] = (num + i);
    }
    
    long[] b = new long[10000000];
    long num2 = 14000000000L;
    for (int i = 0; i < b.Length - 2; i++)
    {
        b[i] = (num2 + i);
    }
    b[9999999] = 13000000000L;
    b[9999998] = 13000000001L;
    var hashSetB = new HashSet<long>(b);
    
    var matches = new List<long>();
    var timer = new System.Diagnostics.Stopwatch();
    Console.WriteLine("Starts...");
    timer.Start();
    for (var i = 0; i < a.Length; i++)
    {
        if (hashSetB.Contains(a[i]))
        {
            matches.Add(a[i]);
        }
    }
    timer.Stop();
    Console.WriteLine(timer.ElapsedMilliseconds);
    Console.WriteLine("Found match: " + string.Join(", ", matches));
    Console.ReadLine();

    reply
    0
  • ringa_lee

    ringa_lee2017-04-18 10:53:41

    redis SINTER (Returns all members of a set that is the intersection of all given sets.)

    reply
    0
  • 大家讲道理

    大家讲道理2017-04-18 10:53:41

    If this operation can only be performed in an array, there is no trick. At least the smaller array must be traversed, and even sorting is inevitable. And if the array contents can be exported to other data structures, it seems to violate the original intention of the question. Did the questioner want to test array operations?

    reply
    0
  • ringa_lee

    ringa_lee2017-04-18 10:53:41

    Let’s take a simpler method, which only takes about 200ms on MBP. Ordinary Pentium G2020 only takes 250ms
    Um, this algorithm is completely impossible.

    reply
    0
  • 高洛峰

    高洛峰2017-04-18 10:53:41

    In fact, algorithm is the key to this question.
    I suggest you read the first chapter of Programming Pearls. Will provide good ideas.
    First of all, the question must be refined, that is, does the mobile phone number have to be a Chinese mobile phone number only? Otherwise the number would be much higher.
    Let me briefly explain how phone numbers are stored in Zhuji.
    It only uses a binary number to store all mobile phone numbers. A binary digit can be very long. The length is equal to the largest possible phone number. For example, 13999999999, in fact, 13 can be omitted. Our number is a binary number with 999999999 digits.
    On each digit, if there is this phone number, it is marked as 1, if not, it is marked as 0.
    For the sake of simplicity, I will assume that the range of the mobile phone number is 0-9. We first prepare a 10-digit binary Number 0000000000.
    Assume that the first array has 3 phone numbers, namely 1, 5, and 7. Then the binary number storing this array is 0010100010. The 1, 5, and 7 bits of this number are 1, and the other bits are 0. .
    Suppose the second array has 6 phone numbers, namely 1, 2, 3, 4, 7, and 8. Then the binary number storing this array is 0110011110, the 1, 2, 3, 4, 7, and 8 digits of this number. The upper bit is 1 and the other bits are 0.
    So how to find out the same phone number in these two arrays?
    You only need to do the "bitwise AND" (AND) operation in bit operations. If both of the same bits are 1, it is still 1. As long as one of them is 0, it is 0.
    0010100010 & 0110011110 = 0010000010

    I found out immediately that it is a binary number with 1 in the 1st and 7th digits.

    reply
    0
  • Cancelreply