Tricky Golang interview questions - Part Data Race

Here is another code review interview question for you. This one is more advanced than the previous questions and is aimed at a more senior audience. It requires an understanding of slices and of sharing data between concurrent processes.

If you are not familiar with slices and how they are constructed, check out my previous article on slice headers.

What is a data race?

A data race occurs when two or more threads (or goroutines, in the case of Go) access shared memory at the same time and at least one of those accesses is a write. Without proper synchronization mechanisms (such as locks or channels) to manage access, the result can be unpredictable behavior, including data corruption, inconsistent state, or crashes.

In essence, a data race happens when:

  • Two or more threads (or goroutines) access the same memory location at the same time.
  • At least one of the threads (or goroutines) is writing to that memory.
  • There is no synchronization controlling access to that memory.

As a result, the order in which threads or goroutines access or modify the shared memory is unpredictable, leading to non-deterministic behavior that can vary from run to run.

     +----------------------+      +---------------------+
     | Thread A: Write      |      | Thread B: Read      |
     +----------------------+      +---------------------+
     | 1. Reads x           |      | 1. Reads x          |
     | 2. Adds 1 to x       |      |                     |
     | 3. Writes new value  |      |                     |
     +----------------------+      +---------------------+

                    Shared variable x
                    (Concurrent access without synchronization)

Here, thread A is modifying x (writing to it) while thread B is reading it at the same time. If both threads run concurrently without synchronization, thread B may read x before thread A has finished updating it. As a result, the data can be incorrect or inconsistent.
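
To make the definition concrete, here is a minimal, self-contained sketch (not part of the code under review, just an illustration) of exactly this pattern: two goroutines perform an unsynchronized read-modify-write on a shared counter, and running it with go run -race flags the conflict.

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    counter := 0 // shared memory, accessed without any synchronization

    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                counter++ // read, add 1, write back: races with the other goroutine
            }
        }()
    }

    wg.Wait()
    fmt.Println(counter) // may or may not be 2000; go run -race reports the race either way
}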

Question: One of your teammates has submitted the following code for review. Examine the code carefully and identify any potential problems.
Here is the code you have to review:

package main  

import (  
    "bufio"  
    "bytes"
    "io"
    "math/rand"
    "time"
)  

func genData() []byte {  
    r := rand.New(rand.NewSource(time.Now().Unix()))  
    buffer := make([]byte, 512)  
    if _, err := r.Read(buffer); err != nil {  
       return nil  
    }  
    return buffer  
}  

func publish(input []byte, output chan<- []byte) {  
    reader := bytes.NewReader(input)  

    bufferSize := 8  
    buffer := make([]byte, bufferSize)  

    for {  
       n, err := reader.Read(buffer)  
       if err != nil || err == io.EOF {  
          return  
       }  
       if n == 0 {  
          break  
       }  
       output <- buffer[:n]  
    }  
}  

func consume(input []byte) {  
    scanner := bufio.NewScanner(bytes.NewReader(input))  
    for scanner.Scan() {  
       b := scanner.Bytes()  
       _ = b  
       // does the magic  
    }  
}  

func main() {  
    data := genData()  
    workersCount := 4  
    chunkChannel := make(chan []byte, workersCount)  

    for i := 0; i < workersCount; i++ {  
       go func() {  
          for chunk := range chunkChannel {  
             consume(chunk)  
          }  
       }()  
    }  

    publish(data, chunkChannel)  
    close(chunkChannel)  
}

What do we have here?

The publish() function is responsible for reading the input data chunk by chunk and sending each chunk to the output channel. It starts by creating a reader from the input data with bytes.NewReader(input), which allows the data to be read sequentially. A buffer of size 8 is created to hold each chunk of data read from the input. During each iteration, reader.Read(buffer) reads up to 8 bytes from the input, and the function then sends a slice of the buffer containing up to those 8 bytes (buffer[:n]) to the output channel. The loop continues until reader.Read(buffer) encounters an error or reaches the end of the input data.

The consume() function processes the chunks of data received from the channel. It handles each chunk with a bufio.Scanner, which scans the chunk and, depending on how it is configured, may break it into lines or tokens. The variable b := scanner.Bytes() retrieves the token currently being scanned. This function represents basic input processing.
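
As a quick reference, here is a tiny standalone sketch (assumed input, not taken from the reviewed code) of how bufio.Scanner tokenizes data with its default line splitter. One detail worth remembering: the slice returned by Scanner.Bytes() may be overwritten by the next call to Scan(), so it must be copied if it needs to outlive the current iteration.

package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    scanner := bufio.NewScanner(strings.NewReader("first line\nsecond line\nthird line"))
    for scanner.Scan() {
        b := scanner.Bytes() // current token; only valid until the next call to Scan()
        fmt.Println(string(b))
    }
    if err := scanner.Err(); err != nil {
        fmt.Println("scan error:", err)
    }
}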

main() creates a buffered channel, chunkChannel, with a capacity equal to workersCount, which is set to 4 in this example. It then starts 4 worker goroutines, each of which reads from chunkChannel concurrently. Every time a worker receives a chunk of data, it processes the chunk by calling the consume() function. The publish() function reads the generated data, splits it into chunks of up to 8 bytes, and sends them to the channel.

The program uses goroutines to create multiple consumers, allowing concurrent data processing. Each consumer runs in a separate goroutine, processing chunks of data independently.

If you run this code, something suspicious happens:

[Running] go run "main.go"

[Done] exited with code=0 in 0.94 seconds

But there is a problem: we are at risk of a data race. In this code, a potential data race exists because the publish() function reuses the same buffer slice for every chunk. The consumers read from this buffer concurrently, and since slices share their underlying memory, multiple consumers may end up reading the same memory, leading to a data race. Let's try race detection. Go provides a built-in tool for detecting data races: the race detector. You can enable it by running the program with the -race flag:

go run -race main.go

If we add the -race flag to the run command, we get the following output:

[Running] go run -race "main.go"

==================
WARNING: DATA RACE
Read at 0x00c00011e018 by goroutine 6:
  runtime.slicecopy()
      /GOROOT/go1.22.0/src/runtime/slice.go:325 +0x0
  bytes.(*Reader).Read()
      /GOROOT/go1.22.0/src/bytes/reader.go:44 +0xcc
  bufio.(*Scanner).Scan()
      /GOROOT/go1.22.0/src/bufio/scan.go:219 +0xef4
  main.consume()
      /GOPATH/example/main.go:40 +0x140
  main.main.func1()
      /GOPATH/example/main.go:55 +0x48

Previous write at 0x00c00011e018 by main goroutine:
  runtime.slicecopy()
      /GOROOT/go1.22.0/src/runtime/slice.go:325 +0x0
  bytes.(*Reader).Read()
      /GOROOT/go1.22.0/src/bytes/reader.go:44 +0x168
  main.publish()
      /GOPATH/example/main.go:27 +0xe4
  main.main()
      /GOPATH/example/main.go:60 +0xdc

Goroutine 6 (running) created at:
  main.main()
      /GOPATH/example/main.go:53 +0x50
==================
Found 1 data race(s)
exit status 66


The warning you’re seeing is a classic data race detected by Go’s race detector. The warning message indicates that two goroutines are accessing the same memory location (0x00c00011e018) concurrently. One goroutine is reading from this memory, while another goroutine is writing to it at the same time, without proper synchronization.

The first part of the warning tells us that Goroutine 6 (which is one of the worker goroutines in your program) is reading from the memory address 0x00c00011e018 during a call to bufio.Scanner.Scan() inside the consume() function.

Read at 0x00c00011e018 by goroutine 6:
  runtime.slicecopy()
  /GOROOT/go1.22.0/src/runtime/slice.go:325 +0x0
  bytes.(*Reader).Read()
  /GOROOT/go1.22.0/src/bytes/reader.go:44 +0xcc
  bufio.(*Scanner).Scan()
  /GOROOT/go1.22.0/src/bufio/scan.go:219 +0xef4
  main.consume()
  /GOPATH/example/main.go:40 +0x140
  main.main.func1()
  /GOPATH/example/main.go:55 +0x48

The second part of the warning shows that the main goroutine previously wrote to the same memory location (0x00c00011e018) during a call to bytes.Reader.Read() inside the publish() function.

Previous write at 0x00c00011e018 by main goroutine:
  runtime.slicecopy()
  /GOROOT/go1.22.0/src/runtime/slice.go:325 +0x0
  bytes.(*Reader).Read()
  /GOROOT/go1.22.0/src/bytes/reader.go:44 +0x168
  main.publish()
  /GOPATH/example/main.go:27 +0xe4
  main.main()
  /GOPATH/example/main.go:60 +0xdc

The final part of the warning explains that Goroutine 6 was created in the main function.

Goroutine 6 (running) created at:
  main.main()
  /GOPATH/example/main.go:53 +0x50

In this case, while one goroutine (Goroutine 6) is reading from the buffer in consume(), the publish() function in the main goroutine is simultaneously writing to the same buffer, leading to the data race.

+-------------------+               +--------------------+
|     Publisher     |               |      Consumer      |
+-------------------+               +--------------------+
        |                                   |
        v                                   |
1. Read data into buffer                    |
        |                                   |
        v                                   |
2. Send slice of buffer to chunkChannel     |
        |                                   |
        v                                   |
 +----------------+                         |
 |  chunkChannel  |                         |
 +----------------+                         |
        |                                   |
        v                                   |
3. Consume reads from slice                 |
                                            v
                                    4. Concurrent access
                                    (Data Race occurs)

Why the Data Race Occurs

The data race in this code arises because of how Go slices work and how memory is shared between goroutines when a slice is reused. To fully understand this, let’s break it down into two parts: the behavior of the buffer slice and the mechanics of how the race occurs. When you pass a slice like buffer[:n] to a function or channel, what you are really passing is the slice header which contains a reference to the slice’s underlying array. Any modifications to the slice or the underlying array will affect all other references to that slice.

buffer = [ a, b, c, d, e, f, g, h ]  <- Underlying array
           ↑
          Slice: buffer[:n]
func publish(input []byte, output chan<- []byte) {  
    reader := bytes.NewReader(input)  

    bufferSize := 8  
    buffer := make([]byte, bufferSize)  

    for {  
       // ....
       output <- buffer[:n] // <-- what is passed is a reference to the underlying array
    }  
}

If you send buffer[:n] to a channel, both the publish() function and any consumer goroutines will be accessing the same memory. During each iteration, the reader.Read(buffer) function reads up to 8 bytes from the input data into this buffer slice. After reading, the publisher sends buffer[:n] to the output channel, where n is the number of bytes read in the current iteration.

The problem here is that buffer is reused across iterations. Every time reader.Read() is called, it overwrites the data stored in buffer.

  • Iteration 1: publish() function reads the first 8 bytes into buffer and sends buffer[:n] (say, [a, b, c, d, e, f, g, h]) to the channel.
  • Iteration 2: The publish() function overwrites the buffer with the next 8 bytes, let’s say [i, j, k, l, m, n, o, p], and sends buffer[:n] again.

At this point, if one of the worker goroutines is still processing the first chunk, it is now reading stale or corrupted data, because the buffer has been overwritten by the second chunk. Reusing a slice means sharing the same memory.
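
A short sketch (illustrative only, with made-up values) of why reusing the slice means sharing memory: the slice that was already "sent" still points at the same backing array, so it silently changes when the publisher overwrites the buffer.

package main

import "fmt"

func main() {
    buffer := []byte{'a', 'b', 'c', 'd'}
    sent := buffer[:2] // what the publisher "sends": a view into buffer's backing array

    fmt.Println(string(sent)) // "ab"

    // The next Read() overwrites the buffer in place...
    copy(buffer, []byte{'i', 'j', 'k', 'l'})

    fmt.Println(string(sent)) // "ij": the chunk that was already "sent" has changed underneath
}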

How to fix the Data Race?

To avoid the race condition, we must ensure that each chunk of data sent to the channel has its own independent memory. This can be achieved by creating a new slice for each chunk and copying the data from the buffer to this new slice. The key fix is to copy the contents of the buffer into a new slice before sending it to the chunkChannel:

chunk := make([]byte, n)    // Step 1: Create a new slice with its own memory
copy(chunk, buffer[:n])     // Step 2: Copy data from buffer to the new slice
output <- chunk             // Step 3: Send the new chunk to the channel

Why does this fix work? By creating a new slice (chunk) for each iteration, you ensure that each chunk has its own memory. This prevents the consumers from reading from the buffer that the publisher is still modifying. The copy() function copies the contents of the buffer into the newly allocated slice (chunk), which decouples the memory used by each chunk from the buffer. Now, when the publisher reads new data into the buffer, it doesn't affect the chunks that have already been sent to the channel.

+-------------------------+           +------------------------+
|  Publisher (New Memory) |           | Consumers (Read Copy)  |
|  [ a, b, c ] --> chunk1 |           |  Reading: chunk1       |
|  [ d, e, f ] --> chunk2 |           |  Reading: chunk2       |
+-------------------------+           +------------------------+
         ↑                                    ↑
        (1)                                  (2)
   Publisher Creates New Chunk          Consumers Read Safely

This solution works because it breaks the connection between the publisher and the consumers by eliminating shared memory. Each consumer now works on its own copy of the data, which the publisher does not modify. Here's how the modified publish() function looks:

func publish(input []byte, output chan<- []byte) {
    reader := bytes.NewReader(input)
    bufferSize := 8
    buffer := make([]byte, bufferSize)

    for {
        n, err := reader.Read(buffer)
        if err != nil || err == io.EOF {
            return
        }
        if n == 0 {
            break
        }

        // Create a new slice for each chunk and copy the data from the buffer
        chunk := make([]byte, n)
        copy(chunk, buffer[:n])

        // Send the newly created chunk to the channel
        output <- chunk
    }
}
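
A variation worth mentioning (not the article's fix, just an alternative sketch with the same effect; publishAlloc is a hypothetical name): allocate a fresh buffer inside the loop, so every slice sent to the channel owns its backing array and no copy is needed. The cost is the same one allocation per chunk as the copy-based fix, minus the copy.

func publishAlloc(input []byte, output chan<- []byte) {
    reader := bytes.NewReader(input)
    bufferSize := 8

    for {
        buffer := make([]byte, bufferSize) // new backing array on every iteration
        n, err := reader.Read(buffer)
        if err != nil { // io.EOF is also a non-nil error, so this covers end of input
            return
        }
        output <- buffer[:n] // safe: nothing else will ever write into this array again
    }
}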

Summary

Slices Are Reference Types:
As mentioned earlier, Go slices are reference types, meaning they point to an underlying array. When you pass a slice to a channel or a function, you’re passing a reference to that array, not the data itself. This is why reusing a slice leads to a data race: multiple goroutines end up referencing and modifying the same memory.

Memory Allocation:
When we create a new slice with make([]byte, n), Go allocates a separate block of memory for it. This means the new slice (chunk) has its own backing array, independent of the buffer. By copying the data from buffer[:n] into chunk, we ensure that each chunk has its own private memory space.

Decoupling the Memory:
By decoupling each chunk's memory from the buffer, the publisher can keep reading new data into the buffer without affecting the chunks that have already been sent to the channel. Each chunk now holds its own independent copy of the data, so the consumers can process chunks without interference from the publisher.

Preventing the Data Race:
The main source of the data race was the concurrent access to the shared buffer. By creating new slices and copying the data, we eliminate the shared memory, and each goroutine operates on its own data. This removes the possibility of a race condition, because there is no longer any contention for the same memory.
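
As a final sanity check (an assumed verification step, not shown in the original article), re-running the program with the race detector after applying the fix should now finish without any warnings:

go run -race main.go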

Conclusion

The core of the fix is simple but powerful: by making sure each chunk of data has its own memory, we eliminate the shared resource (the buffer) that caused the data race. This is achieved by copying the data from the buffer into a new slice before sending it to the channel. With this approach, each consumer works on its own copy of the data, independent of the publisher's actions, ensuring safe concurrent processing without race conditions. This technique of decoupling shared memory is a fundamental strategy in concurrent programming. It prevents the unpredictable behavior caused by race conditions and keeps your Go programs safe, predictable, and correct, even when multiple goroutines access the data concurrently.

It's that simple!
