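This is the second part of writing a Go application that determines how many tokens a user sends to an LLM, based on a chosen text.
In the previous article, I mentioned that I wanted to build something written only in Golang. Among the GitHub repositories I looked at, this one seemed really good: go-huggingface. The code appears to be very new, but it "sort of" works for me.
First, the code accesses Hugging Face to get the list of "tokenizers" that go with an LLM, so the user needs a Hugging Face (HF) token. I therefore put my token in a .env file, as shown.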
HF_TOKEN="your-huggingface-token"
Then, using the example provided on the following page (https://github.com/gomlx/go-huggingface?tab=readme-ov-file), I built my own code around it.
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
	"os/exec"
	"runtime"

	"fyne.io/fyne/v2"
	"fyne.io/fyne/v2/app"
	"fyne.io/fyne/v2/container"
	"fyne.io/fyne/v2/widget"
	"github.com/gomlx/go-huggingface/hub"
	"github.com/gomlx/go-huggingface/tokenizers"
	"github.com/joho/godotenv"
	"github.com/sqweek/dialog"
)

// Model IDs we use for testing.
var hfModelIDs = []string{
	"ibm-granite/granite-3.1-8b-instruct",
	"meta-llama/Llama-3.3-70B-Instruct",
	"mistralai/Mistral-7B-Instruct-v0.3",
	"google/gemma-2-2b-it",
	"sentence-transformers/all-MiniLM-L6-v2",
	"protectai/deberta-v3-base-zeroshot-v1-onnx",
	"KnightsAnalytics/distilbert-base-uncased-finetuned-sst-2-english",
	"KnightsAnalytics/distilbert-NER",
	"SamLowe/roberta-base-go_emotions-onnx",
}

// runCmd runs an external command and forwards its output to the terminal.
func runCmd(name string, arg ...string) {
	cmd := exec.Command(name, arg...)
	cmd.Stdout = os.Stdout
	_ = cmd.Run() // clearing the screen is cosmetic, so a failure is ignored
}

// ClearTerminal clears the terminal on Windows, macOS and Linux.
func ClearTerminal() {
	if runtime.GOOS == "windows" {
		runCmd("cmd", "/c", "cls")
		return
	}
	runCmd("clear")
}

// FileSelectionDialog opens a file dialog box and lets the user select a text file.
func FileSelectionDialog() string {
	filePath, err := dialog.File().Filter("Text Files", "txt").Load()
	if errors.Is(err, dialog.ErrCancelled) {
		log.Fatal("File selection was cancelled.")
	}
	if err != nil {
		log.Fatalf("Error selecting file: %v", err)
	}
	fmt.Printf("Selected file: %s\n", filePath)
	return filePath
}

func main() {
	// Read the '.env' file.
	if err := godotenv.Load(); err != nil {
		log.Fatal("Error loading .env file")
	}

	// Get the value of the 'HF_TOKEN' environment variable.
	hfAuthToken := os.Getenv("HF_TOKEN")
	if hfAuthToken == "" {
		log.Fatal("HF_TOKEN environment variable is not set")
	}

	// Display a list of LLMs; the selected one is used to tokenize the text later on.
	var llm string
	myApp := app.New()
	myWindow := myApp.NewWindow("Select a LLM in the list")
	items := hfModelIDs

	// Label to display the selected item.
	selectedItem := widget.NewLabel("Selected LLM: None")

	// Create a list widget.
	list := widget.NewList(
		func() int { return len(items) }, // number of items in the list
		func() fyne.CanvasObject { return widget.NewLabel("Template") }, // template for each row
		func(id widget.ListItemID, obj fyne.CanvasObject) {
			// Update the template with the actual data.
			obj.(*widget.Label).SetText(items[id])
		},
	)

	// Handle list item selection.
	list.OnSelected = func(id widget.ListItemID) {
		selectedItem.SetText("Selected LLM: " + items[id])
		llm = items[id]
	}

	// A border layout lets the list fill the window, with the label at the bottom.
	content := container.NewBorder(nil, selectedItem, nil, nil, list)
	myWindow.SetContent(content)
	myWindow.Resize(fyne.NewSize(300, 400))
	myWindow.ShowAndRun() // blocks until the window is closed

	ClearTerminal()
	if llm == "" {
		log.Fatal("No LLM was selected")
	}
	fmt.Printf("Selected LLM: %s\n", llm)

	// The repository of the selected model (the original code passed an empty model ID here).
	// Models URL -> "https://huggingface.co/api/models"
	repo := hub.New(llm).WithAuth(hfAuthToken)

	// List the files of the selected model.
	fmt.Printf("\n%s:\n", llm)
	for fileName, err := range repo.IterFileNames() {
		if err != nil {
			panic(err)
		}
		fmt.Printf("fileName\t%s\n", fileName)
	}

	// Show the tokenizer class of the selected model.
	fmt.Printf("\trepo=%s\n", repo)
	config, err := tokenizers.GetConfig(repo)
	if err != nil {
		panic(err)
	}
	fmt.Printf("\ttokenizer_class=%s\n", config.TokenizerClass)

	tokenizer, err := tokenizers.New(repo)
	if err != nil {
		panic(err)
	}

	// Call the file selection dialog box and read the text file into a string.
	filePath := FileSelectionDialog()
	data, err := os.ReadFile(filePath)
	if err != nil {
		fmt.Printf("Error reading file: %v\n", err)
		return
	}
	sentence := string(data)

	tokens := tokenizer.Encode(sentence)
	fmt.Println("Sentence:\n", sentence)
	fmt.Printf("Tokens: \t%v\n", tokens)
	fmt.Printf("Token count: \t%d\n", len(tokens))
}
In the "var" section for "hfModelIDs", I added some new references, such as IBM's Granite, Meta's Llama, and a Mistral model.
The Hugging Face token is also fetched and read directly in the Go code.
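I added a dialog box to display the list of LLMs (which I will change eventually), a dialog box to load the text from a file (I love this kind of stuff), and a few lines of code to clear the screen.
The input text is the following: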
The popularity of the Rust language continues to explode; yet, many critical codebases remain authored in C, and cannot be realistically rewritten by hand. Automatically translating C to Rust is thus an appealing course of action. Several works have gone down this path, handling an ever-increasing subset of C through a variety of Rust features, such as unsafe. While the prospect of automation is appealing, producing code that relies on unsafe negates the memory safety guarantees offered by Rust, and therefore the main advantages of porting existing codebases to memory-safe languages. We instead explore a different path, and explore what it would take to translate C to safe Rust; that is, to produce code that is trivially memory safe, because it abides by Rust's type system without caveats. Our work sports several original contributions: a type-directed translation from (a subset of) C to safe Rust; a novel static analysis based on "split trees" that allows expressing C's pointer arithmetic using Rust's slices and splitting operations; an analysis that infers exactly which borrows need to be mutable; and a compilation strategy for C's struct types that is compatible with Rust's distinction between non-owned and owned allocations. We apply our methodology to existing formally verified C codebases: the HACL* cryptographic library, and binary parsers and serializers from EverParse, and show that the subset of C we support is sufficient to translate both applications to safe Rust. Our evaluation shows that for the few places that do violate Rust's aliasing discipline, automated, surgical rewrites suffice; and that the few strategic copies we insert have a negligible performance impact. Of particular note, the application of our approach to HACL* results in a 80,000 line verified cryptographic library, written in pure Rust, that implements all modern algorithms - the first of its kind.
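Tests
Once run, the code displays the dialog box where you can select the desired LLM.
If all goes well, the next step is the download of the "tokenizer" files locally (see the explanations in the GitHub repository). A dialog box is then shown to choose the text file whose content will be evaluated in terms of number of tokens.
So far, I have requested access to the Meta Llama and Google "google/gemma-2-2b-it" models, and I am waiting for access to be granted.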
google/gemma-2-2b-it:
	repo=google/gemma-2-2b-it
panic: request for metadata from "https://huggingface.co/google/gemma-2-2b-it/resolve/299a8560bedf22ed1c72a8a11e7dce4a7f9f51f8/tokenizer_config.json" failed with the following message: "403 Forbidden"
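I think I am on the right path to what I want to achieve: a Golang program able to determine the number of tokens in a user's query sent to an LLM.
The only aim of this project is to learn the internal system behind determining the number of tokens in queries to various LLMs, and to discover how they are calculated.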
Thanks for reading; comments are welcome.
And until the final conclusion, stay tuned…