What are the options for running LLM locally using pre-trained weights?
I have a cluster that is not connected to the internet, although there is a weight repository available. I need to run LLM inference on it.
The only option I've found so far is to use a combination of the transformers and langchain modules, but I don't want to tune the model's hyperparameters. I came across the ollama software, but I can't install anything on the cluster except Python libraries. So, naturally, I'm wondering what my options are for running LLM inference. A couple of questions remain:
Can I use the ollama-python package without installing their Linux software, or do I need both to run my inference? And if I do install ollama on this cluster, how can I provide the model with the pretrained weights? If it helps, they are stored in (sometimes multiple) .bin files.

You don't actually have to install ollama. Instead, you can run the LLM directly from Python, for example a mistral model through LangChain's GPT4All wrapper:
from langchain_community.llms import GPT4All
from langchain_core.callbacks import StreamingStdOutCallbackHandler

# one common choice of callback handler: stream generated tokens to stdout
callbacks = [StreamingStdOutCallbackHandler()]

llm = GPT4All(
    model="/home/jeff/.cache/huggingface/hub/gpt4all/mistral-7b-openorca.q4_0.gguf",
    device="gpu",
    n_threads=8,
    callbacks=callbacks,
    verbose=True,
)
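As a quick smoke test, the resulting llm object can be called through LangChain's standard Runnable interface; the prompt below is only an illustration, not part of the original answer.

# Hypothetical usage of the GPT4All-backed llm defined above.
response = llm.invoke("Name two file formats commonly used for local LLM weights.")
print(response)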
Or, for falcon, you can build a transformers text-generation pipeline and wrap it with LangChain's HuggingFacePipeline:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
import torch

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipeline = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    # trust_remote_code=True,
    device_map="auto",
    max_new_tokens=100,
    # max_length=200,
)

llm = HuggingFacePipeline(pipeline=pipeline)
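Because the cluster has no internet access, the same transformers/LangChain route can also be pointed straight at a local copy of the weights instead of a Hub model id. Below is a minimal sketch, assuming the weight repository holds a standard Hugging Face model directory (config.json, tokenizer files, and the possibly sharded .bin weights); the directory path and prompt are hypothetical.

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

# never try to reach the Hugging Face Hub
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# hypothetical path into the cluster's weight repository
local_model_dir = "/path/to/weight-repository/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(local_model_dir, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    local_model_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    local_files_only=True,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,
)
llm = HuggingFacePipeline(pipeline=pipe)
print(llm.invoke("Hello from an offline cluster."))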
I have a laptop with a 16 GB NVIDIA RTX 4090, which is enough to run both of the models above locally.