使用 Node.js 和 Redis 探索 Bloom Filter 的魅力-＆＃＆按-PHP中文網

首頁

CMS教程

＆＃＆按

使用 Node.js 和 Redis 探索 Bloom Filter 的魅力

PHPz

Sep 01, 2023 pm 10:53 PM

使用 Node.js 和 Redis 探索 Bloom Filter 的魅力

在正確的用例中，布隆過濾器看起來就像魔法。這是一個大膽的說法，但在本教程中，我們將探討這種奇怪的資料結構、如何最好地使用它，以及一些使用 Redis 和 Node.js 的實際範例。

布隆過濾器是一種機率性、單向資料結構。在這種情況下，「過濾器」一詞可能會令人困惑；過濾器意味著它是一個活躍的事物，一個動詞，但將它視為存儲，一個名詞可能更容易。使用簡單的布隆過濾器，您可以做兩件事：

新增一個項目。
檢查之前是否未新增過某個項目。

這些是需要理解的重要限制 - 您無法刪除項目，也無法在布隆過濾器中列出項目。此外，您無法確定過去是否已將某個項目新增至篩選器。這就是布隆過濾器的機率性質發揮作用的地方——誤報是可能的，但誤報則不然。如果過濾器設定正確，誤報的可能性會非常小。

存在布隆過濾器的變體，它們添加了其他功能，例如刪除或縮放，但它們也增加了複雜性和限制。在繼續了解變體之前，首先了解簡單的布隆過濾器非常重要。本文僅介紹簡單的布隆過濾器。

有了這些限制，您可以獲得許多好處：固定大小、基於哈希的加密和快速查找。

當您設定布隆過濾器時，您需要為其指定一個大小。此大小是固定的，因此如果過濾器中有一項或十億項，它永遠不會增長超過指定的大小。當您在篩選器中新增更多項目時，誤報的可能性就會增加。如果您指定了較小的過濾器，則與使用較大的過濾器相比，誤報率會增加得更快。

布隆濾鏡建立在單向雜湊的概念上。與正確儲存密碼非常相似，布隆過濾器使用雜湊演算法來確定傳入其中的項目的唯一識別碼。哈希本質上是無法逆轉的，並且由看似隨機的字串表示。因此，如果有人獲得了布隆過濾器的存取權限，它不會直接洩露任何內容。

最後，布隆濾鏡速度很快。與其他方法相比，此操作涉及的比較次數要少得多，並且可以輕鬆儲存在記憶體中，從而防止影響效能的資料庫命中。

現在您已經了解了布隆過濾器的限制和優點，讓我們來看看可以使用它們的一些情況。

設定

我們將使用 Redis 和 Node.js 來說明 Bloom 過濾器。 Redis 是 Bloom 過濾器的儲存媒體；它速度快，位於記憶體中，並且有一些特定的命令（GETBIT、SETBIT），可以提高實作效率。我假設您的系統上安裝了 Node.js、npm 和 Redis。您的 Redis 伺服器應該在 localhost 上的預設連接埠上運行，以便我們的範例正常運作。

在本教學中，我們不會從頭開始實作篩選器；而是從頭開始實作篩選器。相反，我們將重點放在 npm 中預先建置模組的實際用途：bloom-redis。 bloom-redis 有一組非常簡潔的方法：add、contains 和 clear。

如前所述，布隆過濾器需要哈希演算法來產生項目的唯一識別碼。 bloom-redis 使用眾所周知的 MD5 演算法，儘管它可能不適合 Bloom 過濾器（有點慢，有點過大），但可以正常工作。

獨特的使用者名稱

用戶名，尤其是在 URL 中標識用戶的用戶名，需要是唯一的。如果您建立一個允許用戶更改用戶名的應用程序，那麼您可能需要一個從未使用過的用戶名，以避免用戶名混淆和被攻擊。

如果沒有布隆過濾器，您需要引用一個包含曾經使用過的每個使用者名稱的表，而大規模時這可能會非常昂貴。布隆過濾器可讓您在使用者每次採用新名稱時新增一個項目。當使用者檢查使用者名稱是否被佔用時，您所需要做的就是檢查布隆過濾器。它將能夠絕對確定地告訴您所要求的使用者名稱是否先前已新增。過濾器可能會錯誤地返回用戶名已被使用，而實際上用戶名尚未被使用，但這只是為了謹慎起見，不會造成任何真正的傷害（除了用戶可能無法聲明“k3w1d00d47”） .

為了說明這一點，讓我們使用 Express 建立一個快速的 REST 伺服器。首先，建立 package.json 文件，然後執行以下終端命令。

npm 安裝bloom-redis --save

npm install express --save

npm install redis --save

bloom-redis 的預設選項大小設定為 2 MB。出於謹慎考慮，這是錯誤的，但它相當大。設定布隆過濾器的大小至關重要：太大會浪費內存，太小則誤報率會太高。確定大小所涉及的數學非常複雜，超出了本教程的範圍，但幸運的是，有一個布隆過濾器大小計算器可以完成這項工作，而無需破解教科書。

現在，建立 app.js 如下：

var
  Bloom         =   require('bloom-redis'),
  express       =   require('express'),
  redis         =   require('redis'),
  
  app,
  client,
  filter;

//setup our Express server
app = express();

//create the connection to Redis
client = redis.createClient();


filter = new Bloom.BloomFilter({ 
  client    : client, //make sure the Bloom module uses our newly created connection to Redis
  key       : 'username-bloom-filter', //the Redis key
  
  //calculated size of the Bloom filter.
  //This is where your size / probability trade-offs are made
  //http://hur.st/bloomfilter?n=100000&p=1.0E-6
  size      : 2875518, // ~350kb
  numHashes : 20
});

app.get('/check', function(req,res,next) {
  //check to make sure the query string has 'username'
  if (typeof req.query.username === 'undefined') {
    //skip this route, go to the next one - will result in a 404 / not found
    next('route');
  } else {
   filter.contains(
     req.query.username, // the username from the query string
     function(err, result) {
       if (err) { 
        next(err); //if an error is encountered, send it to the client
        } else {
          res.send({ 
            username : req.query.username, 
            //if the result is false, then we know the item has *not* been used
            //if the result is true, then we can assume that the item has been used
            status : result ? 'used' : 'free' 
          });
        }
      }
    );
  }
});


app.get('/save',function(req,res,next) {
  if (typeof req.query.username === 'undefined') {
    next('route');
  } else {
    //first, we need to make sure that it's not yet in the filter
    filter.contains(req.query.username, function(err, result) {
      if (err) { next(err); } else {
        if (result) {
          //true result means it already exists, so tell the user
          res.send({ username : req.query.username, status : 'not-created' });
        } else {
          //we'll add the username passed in the query string to the filter
          filter.add(
            req.query.username, 
            function(err) {
              //The callback arguments to `add` provides no useful information, so we'll just check to make sure that no error was passed
              if (err) { next(err); } else {
                res.send({ 
                  username : req.query.username, status : 'created' 
                });
              }
            }
          );
        }
      }
    });
  }
});

app.listen(8010);

要執行此伺服器：node app.js。前往瀏覽器並將其指向：https://localhost:8010/check?username=kyle。回應應該是：{"username":"kyle","status":"free"}。

現在，讓我們透過將瀏覽器指向 http://localhost:8010/save?username=kyle 來儲存該使用者名稱。回應將是：{"username":"kyle","status":"created"}。如果回傳位址http://localhost:8010/check?username=kyle，回應將會是{"username":"kyle","status ":"已使用"} .同樣，返回http://localhost:8010/save?username=kyle 將導致{"username":"kyle","status":"not -建立「} 。

從終端機中，您可以看到過濾器的大小： redis-cli strlen 使用者名稱-bloom-filter。

現在，對於一項，它應該顯示 338622。

現在，繼續嘗試使用 /save 路由新增更多使用者名稱。您想嘗試多少就嘗試多少。

如果您再次檢查尺寸，您可能會發現尺寸略有增加，但並非每次添加都會增加。很好奇，對吧？在內部，布隆過濾器在保存在 username-bloom 的字串中的不同位置設定各個位元（1/0）。然而，這些並不是連續的，因此如果您在索引 0 處設定一位，然後在索引 10,000 處設定一位，則兩者之間的所有內容都將為 0。對於實際用途，一開始了解每個操作的精確機制並不重要，只需知道這一點即可這是正常的，您在 Redis 中的儲存永遠不會超過您指定的值。

新鮮內容

網站上的新鮮內容可以吸引用戶回頭客，那麼如何每次都向用戶展示新的內容呢？使用傳統的資料庫方法，您可以在表中新增一個包含使用者識別碼和故事識別碼的新行，然後在決定顯示一段內容時查詢該表。正如您可能想像的那樣，您的資料庫將成長得非常快，尤其是隨著使用者和內容的成長。

在這種情況下，漏報（例如不顯示看不見的內容）的後果非常小，這使得布隆過濾器成為一個可行的選擇。乍一看，您可能認為每個使用者都需要一個布隆過濾器，但我們將使用使用者識別碼和內容標識符的簡單串聯，然後將該字串插入到我們的過濾器中。這樣我們就可以對所有使用者使用單一過濾器。

在此範例中，讓我們建立另一個顯示內容的基本 Express 伺服器。每次您造訪路由 /show-content/any-username （any-username 是任何 URL 安全值）時，都會顯示一個新內容，直到網站沒有內容。在範例中，內容是古騰堡計畫前十名書籍的第一行。

我們需要再安裝一個 npm 模組。從終端運行： npm install async --save

您的新 app.js 檔案：

var
  async         =   require('async'),
  Bloom         =   require('bloom-redis'),
  express       =   require('express'),
  redis         =   require('redis'),
  
  app,
  client,
  filter,
  
  // From Project Gutenberg - opening lines of the top 10 public domain ebooks
  // https://www.gutenberg.org/browse/scores/top
  openingLines = {
    'pride-and-prejudice' : 
      'It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.',
    'alices-adventures-in-wonderland' : 
      'Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it' }

如果您仔細注意開發工具中的往返時間，您會發現使用使用者名稱請求單一路徑的次數越多，所需的時間就越長。雖然檢查過濾器需要固定的時間，但在本例中，我們正在檢查是否有更多項目。布隆過濾器能夠告訴您的資訊有限，因此您正在測試每個項目是否存在。當然，在我們的範例中，它相當簡單，但測試數百個專案效率很低。

過時資料

在此範例中，我們將建立一個小型 Express 伺服器，它將執行兩件事：透過 POST 接受新數據，並顯示當前數據（使用 GET 請求）。當新資料被 POST 到伺服器時，應用程式將檢查它是否存在於過濾器中。如果它不存在，我們會將其新增到 Redis 中的集合中，否則我們將傳回 null。 GET 請求將從 Redis 獲取它並將其發送到客戶端。

這與前兩種情況不同，誤報是不行的。我們將使用布隆過濾器作為第一道防線。考慮到布隆過濾器的屬性，我們只能確定某些東西不在過濾器中，因此在這種情況下，我們可以繼續讓資料進入。如果布隆過濾器傳回的資料可能在過濾器中，我們將根據實際資料來源進行檢查。

那么，我们得到了什么？我们获得了不必每次都检查实际来源的速度。在数据源速度较慢的情况下（外部 API、小型数据库、平面文件的中间），确实需要提高速度。为了演示速度，我们在示例中添加 150 毫秒的实际延迟。我们还将使用 console.time / console.timeEnd 来记录 Bloom 过滤器检查和非 Bloom 过滤器检查之间的差异。

在此示例中，我们还将使用极其有限的位数：仅 1024。它很快就会填满。当它填满时，它将显示越来越多的误报 - 您会看到响应时间随着误报率的填满而增加。

该服务器使用与之前相同的模块，因此将 app.js 文件设置为：

var
  async           =   require('async'),
  Bloom           =   require('bloom-redis'),
  bodyParser      =   require('body-parser'),
  express         =   require('express'),
  redis           =   require('redis'),
  
  app,
  client,
  filter,
  
  currentDataKey  = 'current-data',
  usedDataKey     = 'used-data';
  
app = express();
client = redis.createClient();

filter = new Bloom.BloomFilter({ 
  client    : client,
  key       : 'stale-bloom-filter',
  //for illustration purposes, this is a super small filter. It should fill up at around 500 items, so for a production load, you'd need something much larger!
  size      : 1024,
  numHashes : 20
});

app.post(
  '/',
  bodyParser.text(),
  function(req,res,next) {
    var
      used;
      
    console.log('POST -', req.body); //log the current data being posted
    console.time('post'); //start measuring the time it takes to complete our filter and conditional verification process
    
    //async.series is used to manage multiple asynchronous function calls.
    async.series([
      function(cb) {
        filter.contains(req.body, function(err,filterStatus) {
          if (err) { cb(err); } else {
            used = filterStatus;
            cb(err);
          }
        });
      },
      function(cb) {
        if (used === false) {
          //Bloom filters do not have false negatives, so we need no further verification
          cb(null);
        } else {
          //it *may* be in the filter, so we need to do a follow up check
          //for the purposes of the tutorial, we'll add a 150ms delay in here since Redis can be fast enough to make it difficult to measure and the delay will simulate a slow database or API call
          setTimeout(function() {
            console.log('possible false positive');
            client.sismember(usedDataKey, req.body, function(err, membership) {
              if (err) { cb(err); } else {
                //sismember returns 0 if an member is not part of the set and 1 if it is.
                //This transforms those results into booleans for consistent logic comparison
                used = membership === 0 ? false : true;
                cb(err);
              }
            });
          }, 150);
        }
      },
      function(cb) {
        if (used === false) {
          console.log('Adding to filter');
          filter.add(req.body,cb);
        } else {
          console.log('Skipped filter addition, [false] positive');
          cb(null);
        }
      },
      function(cb) {
        if (used === false) {
          client.multi()
            .set(currentDataKey,req.body) //unused data is set for easy access to the 'current-data' key
            .sadd(usedDataKey,req.body) //and added to a set for easy verification later
            .exec(cb); 
        } else {
          cb(null);
        }
      }
      ],
      function(err, cb) {
        if (err) { next(err); } else {
          console.timeEnd('post'); //logs the amount of time since the console.time call above
          res.send({ saved : !used }); //returns if the item was saved, true for fresh data, false for stale data.
        }
      }
    );
});

app.get('/',function(req,res,next) {
  //just return the fresh data
  client.get(currentDataKey, function(err,data) {
    if (err) { next(err); } else {
      res.send(data);
    }
  });
});

app.listen(8012);

由于使用浏览器 POST 到服务器可能会很棘手，所以让我们使用curl 来测试。

curl --data“您的数据放在这里”--header“内容类型：text/plain”http://localhost:8012/

可以使用快速 bash 脚本来显示填充整个过滤器的外观：

#!/bin/bash
for i in `seq 1 500`;
do
  curl --data “data $i" --header "Content-Type: text/plain" http://localhost:8012/
done

观察填充或完整的过滤器很有趣。由于这个很小，你可以使用 redis-cli 轻松查看。通过在添加项目之间从终端运行 redis-cli get stale-filter ，您将看到各个字节增加。完整的过滤器将为每个字节 \xff 。此时，过滤器将始终返回正值。

结论

布隆过滤器并不是万能的解决方案，但在适当的情况下，布隆过滤器可以为其他数据结构提供快速、有效的补充。

如果您仔细注意开发工具中的往返时间，您会发现使用用户名请求单个路径的次数越多，所需的时间就越长。虽然检查过滤器需要固定的时间，但在本例中，我们正在检查是否存在更多项目。布隆过滤器能够告诉您的信息有限，因此您正在测试每个项目是否存在。当然，在我们的示例中，它相当简单，但测试数百个项目效率很低。

以上是使用 Node.js 和 Redis 探索 Bloom Filter 的魅力的詳細內容。更多資訊請關注PHP中文網其他相關文章！

陳述

本文內容由網友自願投稿，版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容，請聯絡admin@php.cn

我可以在3天內學習WordPress嗎？Apr 09, 2025 am 12:16 AM

能在三天內學會WordPress。 1.掌握基礎知識，如主題、插件等。 2.理解核心功能，包括安裝和工作原理。 3.通過示例學習基本和高級用法。 4.了解調試技巧和性能優化建議。

WordPress是CMS嗎？Apr 08, 2025 am 12:02 AM

WordPress是內容管理系統（CMS）。它提供內容管理、用戶管理、主題和插件功能，支持創建和管理網站內容。其工作原理包括數據庫管理、模板系統和插件架構，適用於從博客到企業網站的各種需求。

WordPress有什麼用？Apr 07, 2025 am 12:06 AM

wordpressgood forvortalyanewebprojectDuetoItsAsatilityAsacms.itexcelsin：1）用戶友好性，允許Aeserywebsitesetup; 2）sexibility andcustomized andcustomization and numerthemesandplugins; 3）seoop timigimization; and4）and4）

我應該使用Wix或WordPress嗎？Apr 06, 2025 am 12:11 AM

Wix適合沒有編程經驗的用戶，WordPress適合希望有更多控制和擴展能力的用戶。 1)Wix提供拖放式編輯器和豐富模板，易於快速搭建網站。 2)WordPress作為開源CMS，擁有龐大社區和插件生態，支持深度自定義和擴展。

WordPress的成本是多少？Apr 05, 2025 am 12:13 AM

WordPress本身免費，但使用需額外費用：1.WordPress.com提供從免費到付費的套餐，價格從每月幾美元到幾十美元不等；2.WordPress.org需購買域名（每年10-20美元）和託管服務（每月5-50美元）；3.插件和主題多數免費，付費的價格在幾十到幾百美元之間；通過選擇合適的託管服務、合理使用插件和主題、定期維護和優化，可以有效控制和優化WordPress的成本。