Liked by LeCun: Running LLaMA on Apple M1/M2 chips! The 13-billion-parameter model needs only 4GB of memory
Not long ago, after Meta released its open-source large language model LLaMA, netizens posted a download link with no access barrier, and the model was "tragically" made open to everyone.
As soon as the news broke, the community came alive and everyone started downloading and testing it.
But those without a high-end graphics card could only look at the model and sigh.
That is no longer a big problem, though. Georgi Gerganov recently created a project called "llama.cpp", which runs LLaMA without a GPU.
Project address:
Yes, that includes Macs with Apple chips. The project was also retweeted in support by LeCun.
There are currently two fairly comprehensive tutorials, based on Apple's M1 and M2 processors.
First article: 12 /using-llama-with-m1-mac/
Second article: https :/ /
Without further ado, let's look at the results first.
For example, on the smallest 7B model, ask: "Who was the first person to land on the moon?"
It is fast: the result comes out within a few seconds.
-p 'The first man on the moon was'
The first man on the moon was 38-year-old astronaut Neil A. Armstrong.
Apollo 11 landed on the moon on July 20, 1969.
Apollo 11 was the first crewed mission to land on the moon and the culmination of the space race. In July 1969, Neil Armstrong and Edwin "Buzz" Aldrin became the first humans to land on the moon.
Apollo 11 was launched from the Kennedy Space Center on July 16.
Judging by Armstrong's age, middle initial, and the landing date, there are no obvious factual errors. That is not easy for a model this small.
With the following prompt, you can generate some practical Python code.
-p 'def open_and_return_content(filename):'
def open_and_return_content(filename):
    """
    Opens file (returning the content) and performs basic sanity checks
    """
    if os.path.isfile(filename):
        with open(filename) as f:
            content = f.read()
            return content
    else:
        print('WARNING: file "{}" does not exist'.format(filename), file=sys.stderr)
        return ''

def get_file_info(filename, fullpath):
    """
    Get file information (i.e., permission, owner, group, size)
    """
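The generated snippet relies on os and sys without importing them and breaks off partway through a second function. A minimal runnable sketch of the first function, with the imports added, might look like this (the demo file names are made up for illustration):

```python
import os
import sys
import tempfile


def open_and_return_content(filename):
    """Return the file's text, or '' (with a warning on stderr) if it is missing."""
    if os.path.isfile(filename):
        with open(filename) as f:
            return f.read()
    print('WARNING: file "{}" does not exist'.format(filename), file=sys.stderr)
    return ''


if __name__ == "__main__":
    # Demo on a throwaway folder so the example does not depend on local paths.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "hello.txt")
        with open(path, "w") as f:
            f.write("hello")
        print(open_and_return_content(path))
        print(repr(open_and_return_content(os.path.join(tmp, "missing.txt"))))
```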
Step 1: Download the model
The first thing to do is download the LLaMA model.
You can apply to Meta through the official form, or get it directly from the link shared by netizens.
Either way, when the download finishes you will see a pile of files like this:
As you can see, each model lives in its own folder. Each model folder has a params.json containing details about the model. For example:
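To inspect those details from Python, a small sketch like the one below could be used. The helper name load_params is hypothetical, and the sample values are the 7B figures printed in this article's conversion log, written to a temporary folder so the example runs without the real weights.

```python
import json
import tempfile
from pathlib import Path


def load_params(model_dir):
    """Read params.json from a model folder and return it as a dict."""
    with open(Path(model_dir) / "params.json") as f:
        return json.load(f)


if __name__ == "__main__":
    # Stand-in folder holding the 7B values from the conversion log;
    # point load_params at your real model folder instead.
    sample = {"dim": 4096, "multiple_of": 256, "n_heads": 32,
              "n_layers": 32, "norm_eps": 1e-06, "vocab_size": 32000}
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "params.json").write_text(json.dumps(sample))
        params = load_params(tmp)
        # Embedding width divided by head count gives the per-head dimension.
        print(params["dim"] // params["n_heads"])
```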
Step 2: Install dependencies
xcode-select --install
brew install pkgconfig cmake
As for the environment, if you are using Python 3.11 you can create a virtual environment:
/opt/homebrew/bin/python3.11 -m venv venv
. venv/bin/
pip3 install --pre torch torchvision --extra-index-url
python
Python 3.11.2 (main, Feb 16 2023, 02:55:59) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; torch.backends.mps.is_available()
True
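The same check can be wrapped in a script. The sketch below is only illustrative: pick_device is a hypothetical helper name, and it falls back to "cpu" when PyTorch is missing or reports no Metal (MPS) support.

```python
def pick_device():
    """Return "mps" when PyTorch reports Metal support, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    # Older PyTorch builds may not have the mps backend module at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```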
Step 3: Compile LLaMA CPP
git clone
make
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:  -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main -framework Accelerate
./main -h
usage: ./main [options]

options:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  RNG seed (default: -1)
  -t N, --threads N     number of threads to use during computation (default: 4)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)
  -n N, --n_predict N   number of tokens to predict (default: 128)
  --top_k N             top-k sampling (default: 40)
  --top_p N             top-p sampling (default: 0.9)
  --temp N              temperature (default: 0.8)
  -b N, --batch_size N  batch size for prompt processing (default: 8)
  -m FNAME, --model FNAME
                        model path (default: models/llama-7B/ggml-model.bin)

c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize -framework Accelerate
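If you would rather drive the compiled binary from a script, the flags in the help text above can be assembled in Python. This is a rough sketch and not part of llama.cpp: build_main_command and run_main are hypothetical helper names, and only the -m, -t, -n and -p flags from the help output are used.

```python
import subprocess


def build_main_command(model, prompt, threads=4, n_predict=128, binary="./main"):
    """Assemble an argument list from flags listed in the help text."""
    return [binary, "-m", str(model), "-t", str(threads),
            "-n", str(n_predict), "-p", prompt]


def run_main(model, prompt, **kwargs):
    """Run the binary and return whatever it wrote to stdout."""
    cmd = build_main_command(model, prompt, **kwargs)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    # Only prints the command; call run_main once a model file exists.
    print(build_main_command("./models/7B/ggml-model-q4_0.bin",
                             "The first president of the USA was ", threads=8))
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes.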
This assumes you have already placed the models under models/ in the llama.cpp repo.
python models/7B 1
{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': 32000}
n_parts = 1
Processing part 0
Processing variable: tok_embeddings.weight with shape: torch.Size([32000, 4096]) and type: torch.float16
Processing variable: norm.weight with shape: torch.Size([4096]) and type: torch.float16
  Converting to float32
Processing variable: output.weight with shape: torch.Size([32000, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wq.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wk.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wv.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wo.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.feed_forward.w1.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.0.feed_forward.w2.weight with shape: torch.Size([4096, 11008]) and type: torch.float16
Processing variable: layers.0.feed_forward.w3.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.0.attention_norm.weight with shape: torch.Size([4096]) and type: torch.float16
...
Done. Output file: models/7B/ggml-model-f16.bin, (part 0)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
llama_model_quantize: loading model from './models/7B/ggml-model-f16.bin'
llama_model_quantize: n_vocab = 32000
llama_model_quantize: n_ctx   = 512
llama_model_quantize: n_embd  = 4096
llama_model_quantize: n_mult  = 256
llama_model_quantize: n_head  = 32
llama_model_quantize: n_layer = 32
llama_model_quantize: f16     = 1
...
layers.31.attention_norm.weight - [ 4096, 1], type = f32 size = 0.016 MB
layers.31.ffn_norm.weight - [ 4096, 1], type = f32 size = 0.016 MB
llama_model_quantize: model size = 25705.02 MB
llama_model_quantize: quant size =  4017.27 MB
llama_model_quantize: hist: 0.000 0.022 0.019 0.033 0.053 0.078 0.104 0.125 0.134 0.125 0.104 0.078 0.053 0.033 0.019 0.022

main: quantize time = 29389.45 ms
main:    total time = 29389.45 ms
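The two size figures in the quantize log show how much the model shrinks. The sketch below works out the ratio. It assumes, as a guess that the log does not confirm, that the "model size" figure counts every weight as a 32-bit float; under that assumption the quantized file works out to roughly 5 bits per weight, which would be consistent with 4-bit values plus some per-block overhead.

```python
MODEL_SIZE_MB = 25705.02  # "model size" from the quantize log
QUANT_SIZE_MB = 4017.27   # "quant size" from the quantize log


def compression_ratio(before_mb, after_mb):
    """How many times smaller the quantized model is."""
    return before_mb / after_mb


def bits_per_weight(before_mb, after_mb, source_bits=32):
    # Assumption: the "before" figure counts each weight at source_bits bits.
    return source_bits / compression_ratio(before_mb, after_mb)


if __name__ == "__main__":
    print(f"ratio: {compression_ratio(MODEL_SIZE_MB, QUANT_SIZE_MB):.2f}x")
    print(f"bits per weight: {bits_per_weight(MODEL_SIZE_MB, QUANT_SIZE_MB):.2f}")
```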
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 -p 'The first president of the USA was '
main: seed = 1678615879
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291

main: prompt: 'The first president of the USA was '
main: number of tokens in prompt = 9
     1 -> ''
  1576 -> 'The'
   937 -> ' first'
  6673 -> ' president'
   310 -> ' of'
   278 -> ' the'
  8278 -> ' USA'
   471 -> ' was'
 29871 -> ' '

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000

The first president of the USA was 57 years old when he assumed office (George Washington). Nowadays, the US electorate expects the new president to be more young at heart. President Donald Trump was 70 years old when he was inaugurated. In contrast to his predecessors, he is physically fit, healthy and active. And his fitness has been a prominent theme of his presidency. During the presidential campaign, he famously said he would be the “most active president ever” — a statement Trump has not yet achieved, but one that fits his approach to the office. His tweets demonstrate his physical activity.

main: mem per token = 14434244 bytes
main:     load time = 1311.74 ms
main:   sample time = 278.96 ms
main:  predict time = 7375.89 ms / 54.23 ms per token
main:    total time = 9216.61 ms
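A few numbers in this log can be cross-checked with simple arithmetic. The sketch below converts the per-token time into tokens per second, estimates how many tokens were evaluated, and tries to reproduce the reported memory_size. The memory formula is an assumption (one key and one value vector of 32-bit floats per layer and per context position), chosen because it matches the logged figure, not because the log states it.

```python
PREDICT_MS = 7375.89     # "predict time" from the log
MS_PER_TOKEN = 54.23     # "ms per token" from the log
N_LAYER, N_CTX, N_EMBD = 32, 512, 4096  # hyperparameters from the log


def tokens_per_second(ms_per_token):
    return 1000.0 / ms_per_token


def tokens_evaluated(predict_ms, ms_per_token):
    return round(predict_ms / ms_per_token)


def kv_cache_mb(n_layer, n_ctx, n_embd, bytes_per_value=4):
    # Assumption: keys and values (factor 2) are stored as 4-byte floats
    # for every layer and every context position.
    return 2 * n_layer * n_ctx * n_embd * bytes_per_value / (1024 ** 2)


if __name__ == "__main__":
    print(f"{tokens_per_second(MS_PER_TOKEN):.1f} tokens/s")
    print(f"{tokens_evaluated(PREDICT_MS, MS_PER_TOKEN)} tokens evaluated")
    print(f"{kv_cache_mb(N_LAYER, N_CTX, N_EMBD):.2f} MB")
```

On these figures the run manages a little over 18 tokens per second, and the memory estimate lands on the 512.00 MB shown as memory_size.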