After the big boss's departure, the first model has arrived!
Today, Stability AI officially announced its new code model, Stable Code Instruct 3B.
These are rough times for Stability AI: the CEO's departure has caused some disruption for Stable Diffusion, trouble has cropped up with investors, and even salaries may be in jeopardy.
Yet while the storm rages outside, the lab stands firm: research gets done, papers and discussions keep coming out, models keep getting tuned, and the company has not missed a single front in the large-model wars.
And it is not just spreading itself thin across an all-out war: each line of research keeps moving forward. Today's Stable Code Instruct 3B, for example, is an instruction-tuned version of the earlier Stable Code 3B.
Paper: https://static1.squarespace.com/static/6213c340453c3f502425776e/t/6601c5713150412ed /Stable_Code_TechReport_release.pdf
Through natural language prompts, Stable Code Instruct 3B can handle a variety of tasks, such as code generation, mathematics, and other software-development-related queries.
Among models of comparable parameter count, Stable Code Instruct 3B achieves the current SOTA, even outperforming models more than twice its size such as CodeLlama 7B Instruct, and matching StarChat 15B on software-engineering-related tasks.
[Figure: benchmark comparison of Stable Code Instruct 3B with CodeLlama 7B Instruct and DeepSeek-Coder Instruct 1.3B]
As the figure above shows, Stable Code Instruct 3B outperforms leading models such as Codellama 7B Instruct and DeepSeek-Coder Instruct 1.3B across a range of programming tasks.
Testing shows that Stable Code Instruct 3B can match or beat its competitors in code-completion accuracy, understanding of natural-language instructions, and versatility across different programming languages.
[Figure: Multi-PL comparison across programming languages]
Guided by the results of the Stack Overflow 2023 Developer Survey, training for Stable Code Instruct 3B focuses on programming languages such as Python, JavaScript, Java, C, C++, and Go.
The figure above uses the Multi-PL benchmark to compare the output quality of the three models across various programming languages. Stable Code Instruct 3B clearly outperforms CodeLlama in every language, with less than half the parameters.
Beyond these popular languages, Stable Code Instruct 3B is also trained on languages such as SQL, PHP, and Rust, and delivers strong test performance even in languages it was not trained on, such as Lua.
Stable Code Instruct 3B is proficient not only in code generation but also in FIM (fill-in-the-middle) tasks, database queries, code translation, explanation, and creation.
Through instruction tuning, the model can understand nuanced instructions and act on them, enabling a wide range of coding tasks beyond simple code completion, such as mathematical understanding, logical reasoning, and handling complex technical aspects of software development.
Model download: https://huggingface.co/stabilityai/stable-code-instruct-3b
Stable Code Instruct 3B is now available for commercial use with a Stability AI Membership. For non-commercial use, the model weights and code can be downloaded from Hugging Face.
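For readers who want to try it right away, below is a minimal sketch of loading the model from the Hugging Face link above and prompting it with the transformers library. The chat-template call and the generation settings are illustrative assumptions, not taken from the report.

```python
# Minimal sketch: load Stable Code Instruct 3B and answer a coding prompt.
# The chat-template usage and generation settings are assumptions for
# illustration, not the configuration from the technical report.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stable-code-instruct-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, temperature=0.2, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```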
Stable Code is built on Stable LM 3B: a decoder-only Transformer with a design similar to LLaMA. The following table lists some key architectural details:
[Table: key architectural details]
The main differences from LLaMA include:
Positional embeddings: rotary position embeddings (RoPE) are applied to the first 25% of the head embedding dimensions, which improves throughput (see the sketch below).
Normalization: LayerNorm with learned bias terms is used instead of RMSNorm.
Bias terms: all bias terms are removed from the feed-forward networks and multi-head self-attention layers, except for the KQV projection biases.
The model uses the same BPE tokenizer as Stable LM 3B, with a vocabulary size of 50,257; it also adopts StarCoder's special tokens, including those indicating the file name, repository stars, fill-in-the-middle (FIM), and so on.
For long-context training, special tokens are used to indicate when two concatenated files belong to the same repository.
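To make the "rotary embeddings on the first 25% of head dimensions" detail concrete, here is a minimal sketch of partial rotary position embeddings. The function names, shapes, and base frequency are illustrative assumptions, not the model's actual implementation.

```python
# Illustrative sketch of partial rotary position embeddings: RoPE is applied
# to only the first 25% of each head's dimensions, the rest pass through.
# Names, shapes, and the base frequency are assumptions for clarity.
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def partial_rope(q, rotary_pct=0.25, base=10000.0):
    # q: (batch, heads, seq_len, head_dim)
    head_dim = q.shape[-1]
    rot_dim = int(head_dim * rotary_pct)            # e.g. 16 of 64 dims
    q_rot, q_pass = q[..., :rot_dim], q[..., rot_dim:]

    seq_len = q.shape[-2]
    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2).float() / rot_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)                # (seq_len, rot_dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)         # (seq_len, rot_dim)
    cos, sin = emb.cos(), emb.sin()

    q_rot = q_rot * cos + rotate_half(q_rot) * sin
    # The remaining 75% of dimensions are left without position rotation.
    return torch.cat((q_rot, q_pass), dim=-1)
```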
Training data
The pre-training dataset is assembled from a variety of publicly accessible, large-scale sources, including code repositories, technical documentation (such as readthedocs), mathematics-focused texts, and broad web datasets.
The main goal of the initial pre-training phase is to learn rich internal representations to significantly improve the model's ability in mathematical understanding, logical reasoning, and processing complex technical texts related to software development.
Additionally, the training data includes a general text dataset to provide the model with broader language knowledge and context, ultimately enabling the model to handle a wider range of queries and tasks in a conversational manner.
The following table shows the data sources, categories and sampling weights of the pre-training corpus, where the ratio of code and natural language data is 80:20.
[Table: pre-training data sources, categories, and sampling weights]
In addition, the researchers introduced a small synthetic dataset, generated from the seed prompts of the CodeAlpaca dataset and containing 174,000 prompts.
Following the WizardLM approach, they gradually increased the complexity of the given seed prompts and obtained an additional 100,000 prompts.
The authors believe that introducing this synthetic data early in the pre-training stage helps the model respond better to natural language text.
Long context dataset
Since multiple files in a repository often depend on each other, context length is important for code models.
The researchers estimated the median and mean number of tokens per software repository to be 12k and 18k respectively, so 16,384 was chosen as the context length.
The next step was to create a long-context dataset. The researchers took files written in popular languages from each repository and concatenated them, inserting a special token between files to maintain separation while preserving content flow.
To avoid any potential bias from a fixed file order, the authors employed a randomization strategy: for each repository, two different orderings of the concatenated files are generated.
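As a rough sketch of this concatenate-and-shuffle procedure: the separator token name below is a placeholder assumption, since the report excerpt does not name the actual special token.

```python
# Rough sketch of the long-context data construction described above:
# concatenate files from one repository with a separator token, and produce
# two random orderings per repository to avoid file-order bias.
# "<|repo_sep|>" is a placeholder, not the actual special token.
import random

def build_long_context_samples(repo_files, sep_token="<|repo_sep|>", num_orderings=2):
    """repo_files: list of file contents (strings) from a single repository."""
    samples = []
    for _ in range(num_orderings):
        ordering = repo_files[:]          # copy so each shuffle is independent
        random.shuffle(ordering)
        samples.append(sep_token.join(ordering))
    return samples

# Example: one repository with three files yields two concatenated sequences.
samples = build_long_context_samples(["# a.py ...", "# b.py ...", "# c.py ..."])
```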
Phase-based training
Stable Code is trained on 32 Amazon P4d instances comprising 256 NVIDIA A100 (40GB HBM2) GPUs, with ZeRO used for distributed optimization.
[Figure: staged training procedure]
A staged training approach is used, as shown in the figure above.
Training follows standard autoregressive sequence modeling to predict the next token. The model is initialized from the Stable LM 3B checkpoint; the first stage uses a context length of 4096, followed by continued pre-training.
Training is performed in BFloat16 mixed precision, with FP32 used for all-reduce. The AdamW optimizer is set to β1=0.9, β2=0.95, ε=1e−6, and weight decay λ=0.1. The learning rate starts at 3.2e-4 with a minimum of 3.2e-5, following a cosine decay schedule.
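For readers who want to see what this optimizer and schedule look like in practice, here is a sketch in plain PyTorch. The stand-in model and the total step count are illustrative assumptions; only the hyperparameter values come from the report.

```python
# Sketch of the AdamW + cosine-decay configuration described above.
# The stand-in model and total_training_steps are illustrative assumptions.
import torch

model = torch.nn.Linear(8, 8)        # stand-in for the actual model
total_training_steps = 100_000       # illustrative value

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3.2e-4,                       # peak learning rate
    betas=(0.9, 0.95),
    eps=1e-6,
    weight_decay=0.1,
)
# Cosine decay from the peak learning rate down to the 3.2e-5 minimum.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_training_steps, eta_min=3.2e-5
)
```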
One of the core assumptions of natural-language model training is the left-to-right causal order, but for code this assumption does not always hold (e.g., function calls and function declarations can appear in either order).
To address this, the researchers use FIM (fill-in-the-middle): a document is randomly split into three segments (prefix, middle, and suffix), and the middle segment is moved to the end of the document. After rearrangement, the same autoregressive training objective is applied.
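As a minimal sketch of the FIM transformation, the split points are chosen at random and the middle segment is moved to the end; the sentinel token names below follow the common FIM convention and are assumptions here, not quoted from the report.

```python
# Sketch of the fill-in-the-middle (FIM) transformation: split a document
# into prefix/middle/suffix at two random points, then rearrange it so the
# middle segment is predicted last. Sentinel names are assumptions.
import random

def fim_transform(doc: str) -> str:
    i, j = sorted(random.sample(range(len(doc)), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # Prefix and suffix come first; the model learns to generate the middle.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```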
After pre-training, the authors further improve the model's conversational ability through a fine-tuning stage consisting of supervised fine-tuning (SFT) and Direct Preference Optimization (DPO).
SFT is performed first, using publicly available datasets from Hugging Face, including OpenHermes, Code Feedback, and CodeAlpaca.
After performing exact match deduplication, the three datasets provide a total of approximately 500,000 training samples.
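A minimal sketch of exact-match deduplication is shown below; the text does not specify the deduplication key, so treating each (prompt, response) pair as the key is an assumption.

```python
# Minimal sketch of exact-match deduplication across the SFT datasets:
# keep only the first occurrence of each identical (prompt, response) pair.
# Using the pair as the key is an assumption for illustration.
def exact_match_dedup(samples):
    """samples: list of (prompt, response) string tuples."""
    seen, unique = set(), []
    for prompt, response in samples:
        key = (prompt, response)
        if key not in seen:
            seen.add(key)
            unique.append((prompt, response))
    return unique
```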
A cosine learning-rate scheduler controls the training process; the global batch size is set to 512, and inputs are packed into sequences no longer than 4096 tokens.
After SFT comes the DPO phase, which uses data from UltraFeedback to curate a dataset of approximately 7,000 samples. In addition, to improve the model's safety, the authors also included the Helpful and Harmless RLHF dataset.
The researchers adopted RMSProp as the optimization algorithm and increased the learning rate to a peak of 5e-7 in the initial stage of DPO training.
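To show what the DPO objective in this stage optimizes, here is an illustrative sketch of the standard DPO loss: the policy is pushed to prefer the chosen response over the rejected one relative to a frozen reference model. The beta value is an assumption for illustration, not a hyperparameter from the report.

```python
# Illustrative sketch of the standard DPO objective. Inputs are summed
# log-probabilities of full responses under the policy and the frozen
# reference model. beta=0.1 is an assumed value for illustration.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```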
The following compares model performance on the code-completion task, evaluated with the Multi-PL benchmark.
Stable Code Base
The following table shows the Multi-PL performance of different code models with 3B parameters or fewer.
[Table: Multi-PL performance of code models with 3B parameters or fewer]
Although Stable Code has fewer than 40% and 20% of the parameters of Code Llama and StarCoder 15B respectively, its average performance across programming languages is on par with them.
Stable Code Instruct
The following table evaluates the instruction-tuned versions of several models on the Multi-PL benchmark.
[Table: Multi-PL performance of instruction-tuned models]
SQL Performance
An important application of code language models is database query tasks. In this area, the performance of Stable Code Instruct is compared with other popular instruction-tuned models, as well as models trained specifically for SQL, using the benchmark created by Defog AI.
[Table: SQL benchmark (Defog AI) results]
Inference performance
The table below gives the throughput and power consumption when running Stable Code on consumer-grade devices, along with the corresponding system environments.
[Table: throughput and power consumption on consumer-grade devices]
The results show that with lower precision, throughput nearly doubles. However, it is important to note that lower-precision quantization may cause some (potentially large) degradation in model performance.
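As a sketch of how one might trade some quality for throughput on consumer hardware, below is one common option: loading the model in 4-bit via bitsandbytes. This configuration is an assumption for illustration and is not the setup used in the table above.

```python
# Sketch of loading the model at lower precision for higher throughput on
# consumer hardware. The 4-bit bitsandbytes configuration is one common
# option and is an assumption, not the setup from the report's table.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stable-code-instruct-3b",
    quantization_config=quant_config,
    device_map="auto",
)
```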
Reference: https://www.php.cn/link/8cb3522da182ff9ea5925bbd8975b203