1亿个数引发的思考(二)之 开拓视野 BitMap Index 和 布隆过滤器

发布时间 2023-06-02 08:17:45作者: zno2

java实现:

java.util.BitSet

谷歌实现

<dependency>
    <groupId>com.googlecode.javaewah</groupId>
    <artifactId>JavaEWAH</artifactId>
    <version>1.2.3</version>
</dependency>

资料:

https://encyclopedia.thefreedictionary.com/Bitmap+index

理解:

1bit可以表示两种状态,0不存在,1存在
下标可以表示某个值(通过hash放到篮子里)
如何表示N个数的状态呢,byte=1字节=8位,long=8字节=64位,可以通过m个连续的long即long[m] ,m = N/64 + 1
把N个数add到bit set 中,即找到下标,bit位值为1

public void set(int bitIndex) {...}
public boolean get(int bitIndex) {...}
所以需求变成了小内存(压缩到BitSet后可以存放这些数据),流读取数据,先get,后set,如果get返回true则这个数重复,可以记录重复次数


Hash算法是散列算法,有很多种,有最出名的md5,sha256,还有很多其他的

 

验证1:

import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import cn.hutool.core.util.RandomUtil;

public class BitSetTest {

    public static void main(String[] args) {
        BitSet bitSet = new BitSet(100);
        int[] randomInts = RandomUtil.randomInts(100);
        List<Integer> list = new ArrayList<>();
        for (int r : randomInts) {
            list.add(r);
        }
        list.add(RandomUtil.randomInt(0, 100));
        list.add(2);
        list.add(2);

        Map<Integer, Integer> occur = new HashMap<>();
        // 内存不够用时,这里要换成流形式加载 
        for (int i : list) {
            boolean b = bitSet.get(i);
            bitSet.set(i);
            if (b) {
                Integer integer = occur.get(i);
                if (integer == null) {
                    occur.put(i, 2);
                } else {
                    occur.put(i, integer + 1);
                }
            }
        }
        // 最终输出哪些数字出现了多次 
        System.out.println(occur);
    }
}

 

验证2:

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.lang.management.ManagementFactory;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

import cn.hutool.core.util.RandomUtil;

public class BitSetTest2 {

    public static void main(String[] args) throws IOException {
        // 1) 从0开始生成1亿个数字,随机两个位置插入两个数字 666、888
        generate1yi();
        
        // 2) 通过BitSet 查找重复出现的数字 
        find();
    }

    private static void generate1yi() throws IOException {
        
        int max = 10000 * 10000;
        int i1 = RandomUtil.randomInt(0, max);
        int i2 = RandomUtil.randomInt(0, max);
        
        int[] randomInts = RandomUtil.randomInts(max);
        File file = new File("D://out//1yi.txt");
        FileWriter fileWriter = new FileWriter(file);
        for (int i : randomInts) {
            fileWriter.append(i + "\r\n");
            if (i == i1) {
                fileWriter.append(666 + "\r\n");
            }else if(i == i2) {
                fileWriter.append(888 + "\r\n");
            }
        }
        fileWriter.close();
    }

    private static void find() throws NumberFormatException, IOException {
        usage();
        BitSet bitSet = new BitSet(10000 * 10000 + 1);
        usage();
        Map<Integer, Integer> occur = new HashMap<>();

        BufferedReader br = new BufferedReader(new FileReader(new File("D://out/1yi.txt")));
        String line = null;
        while ((line = br.readLine()) != null) {
            int i = Integer.valueOf(line);
            boolean b = bitSet.get(i);
            bitSet.set(i);
            if (b) {
                Integer integer = occur.get(i);
                if (integer == null) {
                    occur.put(i, 2);
                } else {
                    occur.put(i, integer + 1);
                }
            }
        }
        br.close();
        usage();
        System.out.println(occur);
    }
    
    public static String toStr(InputStream in) throws IOException {
        BufferedInputStream b = new BufferedInputStream(in);
        byte[] buffer = new byte[1024];
        int l = 0;
        StringBuffer sb = new StringBuffer();
        while ((l = b.read(buffer)) != -1) {
            sb.append(new String(buffer, 0, l));
        }
        return sb.toString();
    }
    
    private static void usage() {
        System.out.println(ManagementFactory.getMemoryMXBean().getHeapMemoryUsage());
    }
}

 

// 执行1和2
init = 536870912(524288K) used = 738080064(720781K) committed = 2633498624(2571776K) max = 7615283200(7436800K)
init = 536870912(524288K) used = 750580088(732988K) committed = 2633498624(2571776K) max = 7615283200(7436800K)
init = 536870912(524288K) used = 1033007456(1008796K) committed = 2310012928(2255872K) max = 7615283200(7436800K)
{888=2, 666=2}

 

// 仅执行2 
init = 536870912(524288K) used = 2694856(2631K) committed = 514850816(502784K) max = 7615283200(7436800K)
init = 536870912(524288K) used = 15194880(14838K) committed = 514850816(502784K) max = 7615283200(7436800K)
init = 536870912(524288K) used = 265467672(259245K) committed = 1259864064(1230336K) max = 7615283200(7436800K)
{888=2, 666=2}

 

结论:执行1 时一个1亿长度的int数组占用内存大约7百兆,仅执行2时初始一个1亿长度的BitSet 内存占用大约12兆;该方案时可以通过使用较少内存来判读某些数字重复出现的

 

思考:BitSet 压缩问题,比如两个数1,11111111111