Llama 2 in LangChain — FIRST Open Source Conversational Agent!
The best open source model as a conversational agent
Sign up for access to the model on Meta's website, using the same email address as your Hugging Face account.
Download and initialize the model on your local GPU or a cloud service. Note: the download takes a while because the 70B weights are very large.
Install the necessary libraries:
pip install torch transformers datasets accelerate bitsandbytes langchain
Create a Hugging Face access token (API key) and paste it into the following code:
HF_API_KEY = "YOUR_API_KEY"
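As an alternative to passing the token into every from_pretrained call, you can authenticate the whole session with the huggingface_hub library (installed as a dependency of transformers); a minimal optional sketch:
from huggingface_hub import login
login(token=HF_API_KEY)  # authenticates this session so gated model downloads work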
Initialize the Llama 2 70B chat model and tokenizer. The checkpoint is hosted under the meta-llama organization on Hugging Face and is gated, so the access token is required. The tokenizer is loaded here; the model weights themselves are loaded, already quantized, in the next step:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_id = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id, token=HF_API_KEY)
Quantize the model. In transformers, quantization is applied while the weights are being loaded (via the bitsandbytes library), so the quantization settings are passed to from_pretrained through a BitsAndBytesConfig and the model comes back already in 8-bit:
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place the layers on the available GPU(s)
    token=HF_API_KEY,
)
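If even the 8-bit weights do not fit in your GPU memory, bitsandbytes also supports 4-bit NF4 quantization through the same config object; an optional alternative sketch (the exact settings are illustrative):
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # normalized-float-4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16
)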
Get the device:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Because the model was loaded with device_map="auto", its weights are already placed on the GPU, so there is no need to call model.to(device); 8-bit models cannot be moved this way anyway. The device variable is still useful for moving input tensors onto the GPU. If you loaded the model without a device map, move it manually:
model = model.to(device)
Set the model's evaluation mode:
model.eval()
The model is now initialized and ready to use!
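As a quick sanity check (a minimal sketch; the prompt and generation settings are just illustrative), you can tokenize a prompt, move the inputs to the device, and generate a short completion:
prompt = "Explain what LangChain is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():  # inference only, no gradients needed
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))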
Here are some additional notes:
The HF_API_KEY is required to download the gated model from Hugging Face. You can create a free access token in your account settings at https://huggingface.co/settings/tokens.
The BitsAndBytesConfig object configures the quantization of the model. Setting load_in_8bit=True loads the weights as 8-bit integers, which greatly reduces the GPU memory required.
The device variable specifies the device the model runs on: the GPU if one is available, otherwise the CPU.
Because the model is loaded with device_map="auto", its weights are already placed on the available device(s); model.to(device) is only needed when loading without a device map.
The model.eval() method sets the model to evaluation mode. This means the model will not be trained, only used for inference, and training-time behavior such as dropout is disabled.
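Finally, to use the model from LangChain, one common approach is to wrap it in a transformers text-generation pipeline and hand that to LangChain's HuggingFacePipeline wrapper. A sketch, assuming a LangChain version that still exposes langchain.llms.HuggingFacePipeline (newer releases move it to langchain_community.llms); the generation settings are illustrative:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
generate_text = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,       # cap the length of each response
    repetition_penalty=1.1,   # discourage the model from repeating itself
)
llm = HuggingFacePipeline(pipeline=generate_text)
llm("Hi, how are you today?")  # on newer LangChain versions use llm.invoke(...)
The resulting llm object can then be dropped into LangChain chains, memory, and agents like any other LLM.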