
LLM inference on multiple GPUs using the Accelerate library

WBOY (reposted)
2023-11-30 17:14:39

Large language models (LLMs) have revolutionized the field of natural language processing. As these models grow in size and complexity, the computational demands of inference grow substantially as well. To meet this challenge, making use of multiple GPUs becomes critical.


This article therefore runs inference on several GPUs at the same time. It covers an introduction to the Accelerate library, a simple method with working code examples, and performance benchmarks using multiple GPUs.

This article scales llama2-7b inference across multiple GPUs, using several 3090s.


Basic example

We start with a simple example that demonstrates "message passing" across multiple GPUs using Accelerate.

    from accelerate import Accelerator
    from accelerate.utils import gather_object

    accelerator = Accelerator()

    # each GPU creates a string
    message = [f"Hello this is GPU {accelerator.process_index}"]

    # collect the messages from all GPUs
    messages = gather_object(message)

    # output the messages only on the main process with accelerator.print()
    accelerator.print(messages)

The output is as follows:

['Hello this is GPU 0', 'Hello this is GPU 1', 'Hello this is GPU 2', 'Hello this is GPU 3', 'Hello this is GPU 4']
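To make the gather step more concrete, here is a minimal pure-Python sketch that imitates it without GPUs or Accelerate. It assumes that gather_object concatenates the list each process passes in, in process-index order, flattening one level. The helper simulate_gather_object and the choice of five processes are illustrative assumptions, not part of the Accelerate API.

```python
# Pure-Python stand-in for the gather step; runs without GPUs or Accelerate.
# simulate_gather_object is a hypothetical helper, not an Accelerate function.

def simulate_gather_object(per_process_objects):
    """Concatenate every process's iterable in process-index order."""
    gathered = []
    for local_object in per_process_objects:
        gathered.extend(local_object)
    return gathered

num_processes = 5  # assumption: five processes, one per GPU

# Each simulated process contributes a one-element list, as in the example above.
local_messages = [[f"Hello this is GPU {rank}"] for rank in range(num_processes)]
messages = simulate_gather_object(local_messages)
print(messages)

# A dict wrapped in a one-element list survives the flattening intact.
local_results = [[{"outputs": [], "num_tokens": rank}] for rank in range(num_processes)]
results_gathered = simulate_gather_object(local_results)
print(len(results_gathered))
```

Under this assumption, an object that is not wrapped in a list would be flattened into its elements (a dict into its keys), which is why per-process results are usually wrapped in a list before gathering.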

Multi-GPU inference

Below is a simple, non-batched inference approach. The code is very short because the Accelerate library already does most of the work for us, so we can use it directly:

    from accelerate import Accelerator
    from accelerate.utils import gather_object
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from statistics import mean
    import torch, time, json

    accelerator = Accelerator()

    # 10*10 Prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
    prompts_all = [
        "The King is dead. Long live the Queen.",
        "Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
        "The story so far: in the beginning, the universe was created.",
        "It was a bright cold day in April, and the clocks were striking thirteen.",
        "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
        "The sweat wis lashing oafay Sick Boy; he wis trembling.",
        "124 was spiteful. Full of Baby's venom.",
        "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
        "I write this sitting in the kitchen sink.",
        "We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
    ] * 10

    # load a base model and tokenizer
    model_path = "models/llama2-7b"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map={"": accelerator.process_index},
        torch_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # sync GPUs and start the timer
    accelerator.wait_for_everyone()
    start = time.time()

    # divide the prompt list onto the available GPUs
    with accelerator.split_between_processes(prompts_all) as prompts:
        # store output of generations in dict
        results = dict(outputs=[], num_tokens=0)

        # have each GPU do inference, prompt by prompt
        for prompt in prompts:
            prompt_tokenized = tokenizer(prompt, return_tensors="pt").to("cuda")
            output_tokenized = model.generate(**prompt_tokenized, max_new_tokens=100)[0]

            # remove prompt from output
            output_tokenized = output_tokenized[len(prompt_tokenized["input_ids"][0]):]

            # store outputs and number of tokens in results{}
            results["outputs"].append(tokenizer.decode(output_tokenized))
            results["num_tokens"] += len(output_tokenized)

        results = [results]  # transform to list, otherwise gather_object() will not collect correctly

    # collect results from all the GPUs
    results_gathered = gather_object(results)

    if accelerator.is_main_process:
        timediff = time.time() - start
        num_tokens = sum([r["num_tokens"] for r in results_gathered])

        print(f"tokens/sec: {num_tokens//timediff}, time {timediff}, total tokens {num_tokens}, total prompts {len(prompts_all)}")
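To get a feel for how a prompt list ends up divided across processes, the following sketch splits a list into contiguous, near-equal slices. This is one plausible splitting scheme written for illustration. The function split_evenly is hypothetical, and the exact rule used by split_between_processes may differ (for example in how leftover items are assigned).

```python
# Illustrative even split of a list across processes; no Accelerate required.
# split_evenly is a hypothetical helper, and the leftover-handling rule is an assumption.

def split_evenly(items, num_processes, process_index):
    """Return the contiguous slice of items assigned to one process."""
    per_process, extras = divmod(len(items), num_processes)
    # The first `extras` processes each take one additional item.
    start = process_index * per_process + min(process_index, extras)
    end = start + per_process + (1 if process_index < extras else 0)
    return items[start:end]

prompts_all = [f"prompt {i}" for i in range(100)]
num_processes = 3

shares = [split_evenly(prompts_all, num_processes, rank) for rank in range(num_processes)]
print([len(share) for share in shares])
```

Every prompt lands on exactly one process, and the share sizes differ by at most one, so no GPU sits idle for long while the others finish.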

Using multiple GPUs introduces some communication overhead: performance scales close to linearly at first and then plateaus at around 4 GPUs in this particular setup. The numbers here naturally depend on many parameters, such as model size and quantization, prompt length, number of generated tokens, and sampling strategy, so we only discuss the general case.

1 GPU: 44 tokens/sec, time: 225.5 s

2 GPUs: 88 tokens/sec, time: 112.9 s

3 GPUs: 128 tokens/sec, time: 77.6 s

4 GPUs: time: 72.7 s (tokens/sec figure not recoverable)

5 GPUs: 119 tokens/sec, time: 83.8 s
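The scaling behaviour can be quantified with a little arithmetic. The sketch below computes the speedup relative to one GPU and the parallel efficiency (speedup divided by the number of GPUs) from the tokens/sec figures reported above. The 4-GPU throughput is left out because it is not legible in the source. The function name scaling_table is illustrative only.

```python
# Speedup and parallel efficiency from the reported non-batched throughput.
# The 4-GPU tokens/sec figure is omitted because it is not legible in the source.
tokens_per_sec = {1: 44, 2: 88, 3: 128, 5: 119}

def scaling_table(throughput):
    """Map GPU count to (speedup vs. 1 GPU, efficiency = speedup / GPU count)."""
    base = throughput[1]
    table = {}
    for n_gpus, tps in sorted(throughput.items()):
        speedup = tps / base
        table[n_gpus] = (speedup, speedup / n_gpus)
    return table

table = scaling_table(tokens_per_sec)
for n_gpus, (speedup, efficiency) in table.items():
    print(f"{n_gpus} GPU(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency near 100% means the extra GPUs are fully used. By this measure the 5-GPU run is noticeably less efficient than the 2- and 3-GPU runs.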

Batch processing on multiple GPUs

In real-world use, batching can be used to speed things up. It reduces communication between GPUs and speeds up inference. We only need to add a prepare_prompts function that feeds the model a batch of data instead of a single item:

    from accelerate import Accelerator
    from accelerate.utils import gather_object
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from statistics import mean
    import torch, time, json

    accelerator = Accelerator()

    def write_pretty_json(file_path, data):
        import json
        with open(file_path, "w") as write_file:
            json.dump(data, write_file, indent=4)

    # 10*10 Prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
    prompts_all = [
        "The King is dead. Long live the Queen.",
        "Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
        "The story so far: in the beginning, the universe was created.",
        "It was a bright cold day in April, and the clocks were striking thirteen.",
        "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
        "The sweat wis lashing oafay Sick Boy; he wis trembling.",
        "124 was spiteful. Full of Baby's venom.",
        "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
        "I write this sitting in the kitchen sink.",
        "We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
    ] * 10

    # load a base model and tokenizer
    model_path = "models/llama2-7b"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map={"": accelerator.process_index},
        torch_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token

    # batch, left pad (for inference), and tokenize
    def prepare_prompts(prompts, tokenizer, batch_size=16):
        batches = [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]
        batches_tok = []
        tokenizer.padding_side = "left"
        for prompt_batch in batches:
            batches_tok.append(
                tokenizer(
                    prompt_batch,
                    return_tensors="pt",
                    padding="longest",
                    truncation=False,
                    pad_to_multiple_of=8,
                    add_special_tokens=False,
                ).to("cuda")
            )
        tokenizer.padding_side = "right"
        return batches_tok

    # sync GPUs and start the timer
    accelerator.wait_for_everyone()
    start = time.time()

    # divide the prompt list onto the available GPUs
    with accelerator.split_between_processes(prompts_all) as prompts:
        results = dict(outputs=[], num_tokens=0)

        # have each GPU do inference in batches
        prompt_batches = prepare_prompts(prompts, tokenizer, batch_size=16)

        for prompts_tokenized in prompt_batches:
            outputs_tokenized = model.generate(**prompts_tokenized, max_new_tokens=100)

            # remove prompt from gen. tokens
            outputs_tokenized = [
                tok_out[len(tok_in):]
                for tok_in, tok_out in zip(prompts_tokenized["input_ids"], outputs_tokenized)
            ]

            # count and decode gen. tokens
            num_tokens = sum([len(t) for t in outputs_tokenized])
            outputs = tokenizer.batch_decode(outputs_tokenized)

            # store in results{} to be gathered by accelerate
            results["outputs"].extend(outputs)
            results["num_tokens"] += num_tokens

        results = [results]  # transform to list, otherwise gather_object() will not collect correctly

    # collect results from all the GPUs
    results_gathered = gather_object(results)

    if accelerator.is_main_process:
        timediff = time.time() - start
        num_tokens = sum([r["num_tokens"] for r in results_gathered])

        print(f"tokens/sec: {num_tokens//timediff}, time elapsed: {timediff}, num_tokens {num_tokens}")
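The reason for left padding, and for slicing the prompt off with len(tok_in), can be illustrated on plain lists of token ids. The sketch below is a toy model of the idea, not the tokenizer's or the model's real behaviour: left_pad, fake_generate, and the pad id 0 are hypothetical, and pad_to_multiple_of is ignored for simplicity.

```python
# Toy illustration of left padding and prompt removal on plain Python lists.
# left_pad, fake_generate and PAD are hypothetical stand-ins, not library calls.
PAD = 0  # assumed pad token id

def left_pad(batch, pad_id=PAD):
    """Pad every sequence on the left so all rows share the longest length."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]

def fake_generate(padded_batch, new_tokens):
    """Stand-in for generation: append each row's new tokens after its prompt."""
    return [row + new for row, new in zip(padded_batch, new_tokens)]

prompts = [[11, 12, 13], [21]]            # two prompts of different lengths
padded = left_pad(prompts)                # every row now has the same length
outputs = fake_generate(padded, [[101, 102], [201, 202]])

# Because every padded row has the same length, one slice per row strips the prompt.
generated = [out[len(inp):] for inp, out in zip(padded, outputs)]
print(padded)
print(generated)
```

With left padding, each prompt's last real token sits directly before the first generated token, so generation continues from the prompt rather than from padding, and the same slice works for every row in the batch.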

As the numbers below show, batching speeds up inference considerably.

1 GPU: 520 tokens/sec, time: 19.2 s

2 GPUs: 900 tokens/sec (time not recoverable)

3 GPUs: 1205 tokens/sec, time: 8.2 s

4 GPUs: 1655 tokens/sec, time: 6.0 s

5 GPUs: 1658 tokens/sec, time: 6.0 s
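One further factor that may contribute to the plateau in the batched run is simple arithmetic: with 100 prompts and batch_size=16, the number of sequential generate calls on the busiest GPU stops shrinking after 4 GPUs. The sketch below counts those calls, assuming the prompts are split as evenly as possible. The function batches_per_gpu is illustrative, and this is only one possible reading of the results, not a measured cause.

```python
import math

# Count sequential batches on the busiest GPU, assuming an even split of prompts.
# batches_per_gpu is a hypothetical helper for illustration.

def batches_per_gpu(num_prompts, num_gpus, batch_size):
    """Number of batches the GPU with the largest share has to process."""
    largest_share = math.ceil(num_prompts / num_gpus)
    return math.ceil(largest_share / batch_size)

counts = {n: batches_per_gpu(100, n, 16) for n in range(1, 6)}
for n_gpus, n_batches in counts.items():
    print(f"{n_gpus} GPU(s): {n_batches} batch(es) on the busiest GPU")
```

Under these assumptions, 4 and 5 GPUs both need two batches per GPU, which is consistent with the nearly identical throughput reported for them (1655 vs. 1658 tokens/sec). A smaller batch size or a larger prompt set would shift where this step occurs.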

Summary

As of this writing, llama.cpp and ctransformers do not support multi-GPU inference. There appears to have been a multi-GPU merge in llama.cpp in June, but I have not seen an official update, so for now I treat multi-GPU as unsupported there. If anyone can confirm that it does support multiple GPUs, please leave a comment.

Hugging Face's Accelerate package gives us a very convenient way to use multiple GPUs. Using multiple GPUs for inference can improve performance significantly, but the cost of communication between GPUs grows noticeably as the number of GPUs increases.


Note:
This article was reposted from 51cto.com. In case of infringement, please contact admin@php.cn to have it removed.