Collection Deduplication: An Efficient Algorithm


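I'm 鲤鱼大山, a blogger on 靠谱客. This article, which I put together during recent development work, covers an efficient way to deduplicate collections. I found it quite useful and am sharing it here for reference.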

Overview

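The method we most often reach for to remove from one collection the elements that also appear in another is removeAll. With two ArrayLists, however, it is already slow once each list holds tens of thousands of elements, and at a million elements the speed becomes unbearable: processing takes more than 10 minutes. The test code is as follows: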

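import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class Test {
    public static void main(String[] args) {
        List<Integer> list1 = new ArrayList<>();
        List<Integer> list2 = new ArrayList<>();
        // list1 holds 0..1000000, list2 holds 500000..1500000
        for (int i = 0; i <= 1000000; i++) {
            list1.add(i);
        }
        for (int i = 500000; i <= 1500000; i++) {
            list2.add(i);
        }
        System.out.println("start: " + new Date());
        // Remove from list2 every element that also occurs in list1
        list2.removeAll(list1);
        System.out.println("end: " + new Date());
    }
}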
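
Much of this cost comes from the argument type rather than from removeAll itself: ArrayList.removeAll tests each element with contains on the argument collection, and contains on an ArrayList is a linear scan. As a minimal sketch that assumes only the JDK (the class name and the difference helper are illustrative, not part of the original test), wrapping the argument in a HashSet turns each lookup into a hash probe:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

class RemoveAllWithHashSet {
    // Hypothetical helper: returns the elements of source that do not occur
    // in toRemove, keeping their original order. Neither argument is modified.
    static <T> List<T> difference(List<T> source, Collection<T> toRemove) {
        List<T> result = new ArrayList<>(source);
        // contains() on a HashSet is constant time on average,
        // so removeAll needs only one linear pass over result.
        result.removeAll(new HashSet<>(toRemove));
        return result;
    }

    public static void main(String[] args) {
        List<Integer> list1 = new ArrayList<>();
        List<Integer> list2 = new ArrayList<>();
        for (int i = 0; i <= 1000000; i++) list1.add(i);
        for (int i = 500000; i <= 1500000; i++) list2.add(i);
        List<Integer> onlyInList2 = difference(list2, list1);
        // 1000001..1500000 remain, i.e. 500000 elements
        System.out.println(onlyInList2.size());
    }
}
```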

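In this case we can use the intersection method of CollectionUtils (Apache Commons Collections) together with its disjunction method. Note that disjunction is a symmetric difference, not a plain difference. Because the intersection is contained in list2, disjunction(list2, intersection) leaves exactly the elements of list2 that do not appear in list1, which is the deduplicated collection we want. The code is as follows: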

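import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.List;

// Commons Collections 3.x package, assumed from the raw-typed source quoted in this article;
// 4.x uses org.apache.commons.collections4 instead.
import org.apache.commons.collections.CollectionUtils;

public class Test {
    public static void main(String[] args) {
        List<Integer> list1 = new ArrayList<>();
        List<Integer> list2 = new ArrayList<>();
        for (int i = 0; i <= 1000000; i++) {
            list1.add(i);
        }
        for (int i = 500000; i <= 1500000; i++) {
            list2.add(i);
        }
        System.out.println("start: " + new Date());
        // Elements common to both lists
        Collection intersection = CollectionUtils.intersection(list1, list2);
        // Elements of list2 that are not in the intersection
        Collection disjunction = CollectionUtils.disjunction(list2, intersection);
        System.out.println("end: " + new Date());
    }
}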

Test result:
start: Thu Dec 06 20:36:11 CST 2018
end: Thu Dec 06 20:36:12 CST 2018
With about one million elements in each collection, processing takes roughly 1 s. The improvement is obvious.
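
Because new Date() prints with one-second resolution, the two timestamps only pin the run time down to about a second. A rough sketch of finer measurement with System.nanoTime() follows. The helper names elapsedMillis and range, and the reduced size of 20,000 elements, are assumptions made for illustration, and a single run without JVM warm-up is still only indicative.

```java
import java.util.ArrayList;
import java.util.List;

class TimingSketch {
    // Hypothetical helper: runs the task once and returns the elapsed time in milliseconds.
    static long elapsedMillis(Runnable task) {
        long start = System.nanoTime(); // monotonic clock, suited to measuring intervals
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Builds the list lo, lo+1, ..., hi (inclusive), like the loops in the test code.
    static List<Integer> range(int lo, int hi) {
        List<Integer> list = new ArrayList<>();
        for (int i = lo; i <= hi; i++) {
            list.add(i);
        }
        return list;
    }

    public static void main(String[] args) {
        List<Integer> list1 = range(0, 20000);
        List<Integer> list2 = range(10000, 30000);
        // Time the plain list-against-list removeAll at a scale that finishes quickly
        long ms = elapsedMillis(() -> list2.removeAll(list1));
        System.out.println("removeAll took " + ms + " ms, " + list2.size() + " elements left");
    }
}
```

The same helper can wrap the CollectionUtils calls to compare the two approaches on identical input.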

The source code is as follows:

    public static Collection intersection(Collection a, Collection b) {
        ArrayList list = new ArrayList();
        Map mapa = getCardinalityMap(a);
        Map mapb = getCardinalityMap(b);
        Set elts = new HashSet(a);
        elts.addAll(b);
        Iterator it = elts.iterator();

        while(it.hasNext()) {
            Object obj = it.next();
            int i = 0;

            for(int m = Math.min(getFreq(obj, mapa), getFreq(obj, mapb)); i < m; ++i) {
                list.add(obj);
            }
        }

        return list;
    }


    public static Collection disjunction(Collection a, Collection b) {
        ArrayList list = new ArrayList();
        Map mapa = getCardinalityMap(a);
        Map mapb = getCardinalityMap(b);
        Set elts = new HashSet(a);
        elts.addAll(b);
        Iterator it = elts.iterator();

        while(it.hasNext()) {
            Object obj = it.next();
            int i = 0;

            for(int m = Math.max(getFreq(obj, mapa), getFreq(obj, mapb)) - Math.min(getFreq(obj, mapa), getFreq(obj, mapb)); i < m; ++i) {
                list.add(obj);
            }
        }

        return list;
    }

    public static Map getCardinalityMap(Collection coll) {
        Map count = new HashMap();
        Iterator it = coll.iterator();

        while(it.hasNext()) {
            Object obj = it.next();
            Integer c = (Integer)((Integer)count.get(obj));
            if (c == null) {
                count.put(obj, INTEGER_ONE);
            } else {
                count.put(obj, new Integer(c + 1));
            }
        }

        return count;
    }


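From the source it is easy to see that the main loop runs only once per distinct element in the union of a and b, and that each frequency lookup goes through a Map instead of scanning a list. Building the two cardinality maps takes a single pass over each collection, so the total work grows roughly linearly with the input size. That is why processing is so much faster.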
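
To make the counting idea concrete, here is a minimal JDK-only sketch in the same spirit as the library code, not a copy of it. The class name and the generic cardinalityMap and disjunction methods are illustrative reimplementations, with Map.merge standing in for the explicit null check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CardinalitySketch {
    // Counts how many times each element occurs in coll.
    static <T> Map<T, Integer> cardinalityMap(Collection<? extends T> coll) {
        Map<T, Integer> count = new HashMap<>();
        for (T obj : coll) {
            count.merge(obj, 1, Integer::sum);
        }
        return count;
    }

    // Symmetric difference with multiplicity: each element appears
    // |count in a - count in b| times in the result.
    static <T> List<T> disjunction(Collection<? extends T> a, Collection<? extends T> b) {
        Map<T, Integer> mapA = cardinalityMap(a);
        Map<T, Integer> mapB = cardinalityMap(b);
        Set<T> elements = new LinkedHashSet<>(a); // keeps first-seen order for readable output
        elements.addAll(b);
        List<T> result = new ArrayList<>();
        for (T obj : elements) {
            int diff = Math.abs(mapA.getOrDefault(obj, 0) - mapB.getOrDefault(obj, 0));
            for (int i = 0; i < diff; i++) {
                result.add(obj);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 2, 2, 3);
        List<Integer> b = Arrays.asList(2, 3, 4);
        System.out.println(disjunction(a, b)); // prints [1, 2, 4]
    }
}
```

The value 4, which occurs only in b, is part of the result, so this operation equals "a minus b" only when b is contained in a, as it is when b is the intersection.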