bitmap技術解析：redis與roaringBitmap

bitmap的表象意義是，使用一個01標識位來表示是否的狀態，可以達到節省空間和高效判定的效果。在我們的實際工作中，也有著許多的應用場景，相信瞭解bitmap定會給你帶來一些額外的收穫。

1. bitmap使用場景說明

比如，我想記錄某使用者某天是否登入過，我們有很多做法。簡單的，只要使用者登入，就在db中記錄一條資料，要判斷使用者某天是否登入過，只需查詢對應日期是否有記錄即可。如果想統計當日登入的人數，count下記錄就可以了（可能需要去重））。這樣不是不可以，只是如果我們想優化下效能怎麼辦？這時我們就可以使用bitmap了，以日期為單位建立獨立bitmap，一個使用者id一個標識，要想判斷使用者是否登入過，直接取對應位標識即可。

再比如，我們有一批白名單使用者，在白名單裡則放行，否則拒絕存取。同樣，我們可以用一行行的記錄用db來儲存處理，但這可能很佔空間，或者效能不怎麼樣。同樣，使用bitmap可以快速處理這種場景。

再比如，用於快速去重一些處理，第一次處理時，將標識位改為1，後續將進行冪等，用bitmap可以快速過濾。

2. bitmap的通俗理解

bitmap簡單理解來說就是，bit的對映，將一個個的key對映到bit位上去，這樣就可以快速通過key直接定位到標識上去了。另外，因都是一個個的bit，所以進行count操作很方便。對於兩個bitmap的and/or位運算也是很方便和快速的。

3. redis的bitmap實現

理解了面上的bitmap的意思，要怎麼做也就大概會有個思路了。

談到redis，大家的對映問題：高效能，高並行，快取解決方案。好像有了redis，就有了底氣和別人一比高下了似的。那麼，在bitmap方面，它是否也神奇之處呢？

實際上，redis的bitmap實現比較簡單，和字面上的意思差不多，它是基於string實現的。

簡單來說就是，string底層的儲存也是二進位制的，也就是說string天生就看起來和bitmap的儲存類似，比如'ab'的底層儲存就是\x61\x62, 拆解成二進位制就是 0110 0001 0110 0010。如果我我們直接將其代表bitmap運算元，那麼總共就有6個資料有值，分別是2,3,8,10,11,15位有值。這樣說bitmap應該很清晰了。

接下來我們來討論下一個空間問題。我們知道一個16位元的二進位制可以表示最大 65536，32位元最大表示 4294967296，好像一個比較小的位就可以表示很大的資料了。但是這和我們說的bitmap還不一樣，在這裡，一個16位元的二進位制數只能表示16個bitmap值，32位元數只能表示32個值。從這點來說，bitmap好像很浪費空間呢。我們知道，現在的大多數機器都是64位元的。所以，如果我以這種結構儲存的話，應該只能存64個標識了。

那麼自然的，我們必須要使用一個合適的結構來儲存這些bit，redis中使用string結構來儲存bitmap，也就是說將string轉換成二進位制後，用其每一位來表示一個標識。這樣的話，能夠存放多少個標識就擴充套件到了string最大限制上去了。redis限制string最大是512M，也就是2^19KB=2^29B=2^32b，即最大2^32位元。

redis的bitmap操作命令，簡單範例如下：（咱們不是檔案，如需手冊，請參考官網）

setbit key offset 1|0
getbit key
bitcount key

下面，我們簡單看下redis的setbit的實現原始碼，具體看看其處理邏輯。

// bitops.c
// 操作命令: setbit key offset 0|1
/* SETBIT key offset bitvalue */
void setbitCommand(client *c) {
    robj *o;
    char *err = "bit is not an integer or out of range";
    uint64_t bitoffset;
    ssize_t byte, bit;
    int byteval, bitval;
    long on;
  // 解析 offset 值
    if (getBitOffsetFromArgument(c,c->argv[2],&bitoffset,0,0) != C_OK)
        return;
  // 解析 0|1 值
    if (getLongFromObjectOrReply(c,c->argv[3],&on,err) != C_OK)
        return;
  // 只接受0|1的輸入，其他一律報錯
    /* Bits can only be set or cleared... */
    if (on & ~1) {
        addReplyError(c,err);
        return;
    }
  // 獲取key對應的string物件，方便後續操作
    int dirty;
    if ((o = lookupStringForBitCommand(c,bitoffset,&dirty)) == NULL) return;

  // 計算偏移量: 1byte=8bit, 所以真正的位所在就等於 byte大定位 + 小移位
  // 從高到低計數, 即類似於 big-endian
    /* Get current values */
    byte = bitoffset >> 3;
    byteval = ((uint8_t*)o->ptr)[byte];
    bit = 7 - (bitoffset & 0x7);
    bitval = byteval & (1 << bit);

    /* Either it is newly created, changed length, or the bit changes before and after.
     * Note that the bitval here is actually a decimal number.
     * So we need to use `!!` to convert it to 0 or 1 for comparison. */
    if (dirty || (!!bitval != on)) {
    // 先取反保留當前值, 再重新設定on 進去
        /* Update byte with new bit value. */
        byteval &= ~(1 << bit);
        byteval |= ((on & 0x1) << bit);
        ((uint8_t*)o->ptr)[byte] = byteval;
    // 叢集擴散
        signalModifiedKey(c,c->db,c->argv[1]);
        notifyKeyspaceEvent(NOTIFY_STRING,"setbit",c->argv[1],c->db->id);
        server.dirty++;
    }
  // 返回舊的值給使用者端
    /* Return original value. */
    addReply(c, bitval ? shared.cone : shared.czero);
}
// 查詢key對應的 string 物件
/* This is a helper function for commands implementations that need to write
 * bits to a string object. The command creates or pad with zeroes the string
 * so that the 'maxbit' bit can be addressed. The object is finally
 * returned. Otherwise if the key holds a wrong type NULL is returned and
 * an error is sent to the client. */
robj *lookupStringForBitCommand(client *c, uint64_t maxbit, int *dirty) {
    size_t byte = maxbit >> 3;
    robj *o = lookupKeyWrite(c->db,c->argv[1]);
    if (checkType(c,o,OBJ_STRING)) return NULL;
    if (dirty) *dirty = 0;

    if (o == NULL) {
        o = createObject(OBJ_STRING,sdsnewlen(NULL, byte+1));
        dbAdd(c->db,c->argv[1],o);
        if (dirty) *dirty = 1;
    } else {
        o = dbUnshareStringValue(c->db,c->argv[1],o);
        size_t oldlen = sdslen(o->ptr);
        o->ptr = sdsgrowzero(o->ptr,byte+1);
        if (dirty && oldlen != sdslen(o->ptr)) *dirty = 1;
    }
    return o;
}

　　很簡單吧，不過應對一些場景還是綽綽有餘了，選對場景很重要。

　　 redis的bitmap實現簡單，易於理解，但也有比較大的弊端。這種基於string的實現方式簡單是簡單，但存在以下幾個問題：

1. 會存在較大間隙值，比如一開始就儲存一個較大的偏移標識進去，比如8位元的偏移，就可能讓記憶體佔用上M級別（然而你還什麼都沒幹）；

2.儲存範圍受限，最大隻能存int型的數位偏移，如果以userid為偏移，在使用者量小且以自增id生成使用者id也許沒問題，但其他情況就不好說了；

3.隨著資料量越來越大，單次設定標識的耗時就會越來越長（大key問題），且一不小心使用get命令進行讀取資料時，redis就尷尬了；

4. roaringbitmap實現

上篇講到redis的實現，簡單易懂，但是會存在一個極大空間浪費的問題，而且受限於陣列大小，儲存空間有限。那有沒有什麼辦法，可以壓縮儲存空間？

roaringbitmap使用多級分段儲存方式，避免了直接儲存的問題：一是空隙值問題，二是數值限制問題。它主要通過將64位元2個32位元儲存，將32位元分2個16位元儲存的方式實現。其操作主要有:add/contains/getlongcadinaty... 等常規介面。

其大致儲存結構圖如下：

具體實現如下：

// 1. 引入依賴包
        <dependency>
            <groupId>org.roaringbitmap</groupId>
            <artifactId>RoaringBitmap</artifactId>
            <version>0.9.28</version>
        </dependency>
// 2. 建立單元測試
    @Test
    public void testRoaringBitmap() {
        Roaring64NavigableMap bitmapObj = new Roaring64NavigableMap();
        bitmapObj.add(11122233366L);
        boolean exists = bitmapObj.contains(1);
        long eleSize = bitmapObj.getLongCardinality();
        System.out.println("exits:" + exists + ", eleSize:" + eleSize);
    }
// 具體實現
// Roaring64NavigableMap
  /**
   * Set all the specified values to true. This can be expected to be slightly faster than calling
   * "add" repeatedly. The provided integers values don't have to be in sorted order, but it may be
   * preferable to sort them from a performance point of view.
   *
   * @param dat set values
   */
  public void add(long... dat) {
    for (long oneLong : dat) {
      addLong(oneLong);
    }
  }
  
  /**
   * Add the value to the container (set the value to "true"), whether it already appears or not.
   *
   * Java lacks native unsigned longs but the x argument is considered to be unsigned. Within
   * bitmaps, numbers are ordered according to {@link Long#compareUnsigned}. We order the numbers
   * like 0, 1, ..., 9223372036854775807, -9223372036854775808, -9223372036854775807,..., -1.
   *
   * @param x long value
   */
  @Override
  public void addLong(long x) {
  // 高低位32位元拆分 (int) (id >> 32)
    int high = high(x);
    int low = low(x);

    // Copy the reference to prevent race-condition
    Map.Entry<Integer, BitmapDataProvider> local = latestAddedHigh;

    BitmapDataProvider bitmap;
    if (local != null && local.getKey().intValue() == high) {
      bitmap = local.getValue();
    } else {
      bitmap = highToBitmap.get(high);
      if (bitmap == null) {
      // 使用 RoaringBitmap 來儲存低層資料, 一級儲存
    // 使用 treemap 儲存整個結構，保證查詢快速
        bitmap = newRoaringBitmap();
        pushBitmapForHigh(high, bitmap);
      }
    // 使用臨時儲存當前高位範例的方式，避免經常查詢map帶來的效能消耗
    // 但實際上這要求使用者端的操作是按序操作的，這樣才能很好利用這個特性，如果只是隨機值的話，效果就大打折扣了
      latestAddedHigh = new AbstractMap.SimpleImmutableEntry<>(high, bitmap);
    }
  // 儲存低位資訊
    bitmap.add(low);
  // 擴容處理
    invalidateAboveHigh(high);
  }

  private void invalidateAboveHigh(int high) {
    // The cardinalities after this bucket may not be valid anymore
    if (compare(firstHighNotValid, high) > 0) {
      // High was valid up to now
      firstHighNotValid = high;

      int indexNotValid = binarySearch(sortedHighs, firstHighNotValid);

      final int indexAfterWhichToReset;
      if (indexNotValid >= 0) {
        indexAfterWhichToReset = indexNotValid;
      } else {
        // We have invalidate a high not already present: added a value for a brand new high
        indexAfterWhichToReset = -indexNotValid - 1;
      }

      // This way, sortedHighs remains sorted, without making a new/shorter array
      Arrays.fill(sortedHighs, indexAfterWhichToReset, sortedHighs.length, highestHigh());
    }
    allValid = false;
  }

// 低位儲存實現
// roaringbitmap
  /**
   * Add the value to the container (set the value to "true"), whether it already appears or not.
   *
   * Java lacks native unsigned integers but the x argument is considered to be unsigned.
   * Within bitmaps, numbers are ordered according to {@link Integer#compareUnsigned}.
   * We order the numbers like 0, 1, ..., 2147483647, -2147483648, -2147483647,..., -1.
   *
   * @param x integer value
   */
  @Override
  public void add(final int x) {
  // 再分高低位儲存, 即32位元拆分為2個16位元, (char) (x >>> 16)
    final char hb = Util.highbits(x);
  // 已經儲存過了，則直接更新值即可
  // highLowContainer = new RoaringArray();
    final int i = highLowContainer.getIndex(hb);
    if (i >= 0) {
    // 此處查詢成功，只是代表高位已經被某些值儲存過了，但低位仍然在變化
      highLowContainer.setContainerAtIndex(i,
          highLowContainer.getContainerAtIndex(i).add(Util.lowbits(x)));
    } else {
    // 否則新插入一個你們數, 預設以陣列形式儲存, 預設初始化大小為4
      final ArrayContainer newac = new ArrayContainer();
      highLowContainer.insertNewKeyValueAt(-i - 1, hb, newac.add(Util.lowbits(x)));
    }
  }
  
  // involves a binary search
  int getIndex(char x) {
    // before the binary search, we optimize for frequent cases
    if ((size == 0) || (keys[size - 1] == x)) {
      return size - 1;
    }
  // 使用二分查詢法查詢值的存在性，實際上內部還有其他優化
    // no luck we have to go through the list
    return this.binarySearch(0, size, x);
  }
  // insert a new key, it is assumed that it does not exist
  void insertNewKeyValueAt(int i, char key, Container value) {
    extendArray(1);
    System.arraycopy(keys, i, keys, i + 1, size - i);
    keys[i] = key;
    System.arraycopy(values, i, values, i + 1, size - i);
    values[i] = value;
    size++;
  }
  // RoaringArray
  protected Container getContainerAtIndex(int i) {
    return this.values[i];
  }
// 陣列的低位儲存實現
// ArrayContainer
  /**
   * running time is in O(n) time if insert is not in order.
   */
  @Override
  public Container add(final char x) {
  // 要插入的值大於當前容量/未大於當前容量分別處理
    if (cardinality == 0 || (cardinality > 0
            && (x) > (content[cardinality - 1]))) {
    // 大於 4096 後，擴充套件為 bitmap儲存結構
      if (cardinality >= DEFAULT_MAX_SIZE) {
        return toBitmapContainer().add(x);
      }
    // 擴容，策略分多種情況處理
      if (cardinality >= this.content.length) {
        increaseCapacity();
      }
    // 直接陣列儲存具體值即可
    // 也就是說，表面看起來這裡可能會被插入重複的值，但是實際這裡插入的是比最大值還大的值
    // 更小的值則會先查詢存在性，再進行找位插入
      content[cardinality++] = x;
    } else {
      int loc = Util.unsignedBinarySearch(content, 0, cardinality, x);
    // 小的值被插入到中間，如果找到相同的值，則本次add將被忽略
    // 也就是說，這種實現的是資料的有序插入
      if (loc < 0) {
        // Transform the ArrayContainer to a BitmapContainer
        // when cardinality = DEFAULT_MAX_SIZE
        if (cardinality >= DEFAULT_MAX_SIZE) {
          return toBitmapContainer().add(x);
        }
        if (cardinality >= this.content.length) {
          increaseCapacity();
        }
        // insertion : shift the elements > x by one position to
        // the right
        // and put x in it's appropriate place
        System.arraycopy(content, -loc - 1, content, -loc, cardinality + loc + 1);
        content[-loc - 1] = x;
        ++cardinality;
      }
    }
    return this;
  }
  // temporarily allow an illegally large size, as long as the operation creating
  // the illegal container does not return it.
  private void increaseCapacity(boolean allowIllegalSize) {
    int newCapacity = (this.content.length == 0) ? DEFAULT_INIT_SIZE
        : this.content.length < 64 ? this.content.length * 2
            : this.content.length < 1067 ? this.content.length * 3 / 2
                : this.content.length * 5 / 4;
    // never allocate more than we will ever need
    if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE && !allowIllegalSize) {
      newCapacity = ArrayContainer.DEFAULT_MAX_SIZE;
    }
    // if we are within 1/16th of the max, go to max
    if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE - ArrayContainer.DEFAULT_MAX_SIZE / 16
        && !allowIllegalSize) {
      newCapacity = ArrayContainer.DEFAULT_MAX_SIZE;
    }
    this.content = Arrays.copyOf(this.content, newCapacity);
  }

// bitmap的低位儲存實現
// BitmapContainer
  // 轉移老資料到bitmap的低位儲存中
  /**
   * Copies the data in a bitmap container.
   *
   * @return the bitmap container
   */
  @Override
  public BitmapContainer toBitmapContainer() {
    BitmapContainer bc = new BitmapContainer();
    bc.loadData(this);
    return bc;
  }

  void loadData(final ArrayContainer arrayContainer) {
    this.cardinality = arrayContainer.cardinality;
    for (int k = 0; k < arrayContainer.cardinality; ++k) {
      final char x = arrayContainer.content[k];
    // 取整64, 這個移位是真沒看懂, 反正超過31之後的
      bitmap[(x) / 64] |= (1L << x);
    }
  }
  
  @Override
  public Container add(final char i) {
    final long previous = bitmap[i >>> 6];
    long newval = previous | (1L << i);
    bitmap[i >>> 6] = newval;
    if (USE_BRANCHLESS) {
      cardinality += (int)((previous ^ newval) >>> i);
    } else if (previous != newval) {
      ++cardinality;
    }
    return this;
  }

　　整體說明，當資料為空時，結構自然為空，當只有一位資料時，就非常小，當資料量越來越大，空間也跟著變大（這很正常）。只要不是超大數量級的bitmap，空間就不會很大。但如果將每個位上都儲存上值，那麼它佔用的空間比簡單的bitmap資料結構是要大些的，因為它的每個key還儲存至少超過1bit的資料，甚至是原始資料，另外還有一些額外的treemap的資料結構的開銷。

　　另外，當總體量級上千萬的話，其實這種儲存方案，存在大物件的問題，你可能就需要jvm引數調優來解決，或者整體換方案了。

5. 還有沒有其他更好的實現？

上面兩種方案，其實都不錯，但好像都還有些問題存在。我們主要針對巨量資料量的問題，兩個方案都沒辦法解決，那麼是否就真的無解了呢？其實，辦法還是有的，比如我們做一些自定義的資料分段，比如 1-100的存在bitmap1， 101-200存在bitmap2，這樣就可以解決大容量的問題了。

只是這種方案需要我們小心處理自定義分段帶來的技術複雜性問題，也得小心對待，尤其是重要的生產場景，必須要有大量的效能測試和準確性測試，否則挖坑給自己就不好玩了。

　文章原創地址：bitmap技術解析：redis與roaringBitmap