Llama 2 in LangChain — FIRST Open Source Conversational Agent!

The best open source model as a conversational agent

  1. Sign up with Meta for access to the model, using the same email address as your Hugging Face account.

  2. Download and initialize the model. You can use a local GPU or a cloud service. Note: this takes a while because of the model's size. The remaining steps walk through the process.

  3. Install the necessary libraries (accelerate and bitsandbytes are needed for the 8-bit quantization step below):

pip install torch transformers datasets langchain accelerate bitsandbytes sentencepiece
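
Before downloading tens of gigabytes of weights, it is worth confirming that PyTorch can see a GPU, since the bitsandbytes 8-bit quantization used below requires a CUDA device. A minimal check:

import torch

# Should print True and the name of your GPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))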
  4. Create a Hugging Face access token (API key) and paste it into the following code:

HF_API_KEY = "YOUR_API_KEY"
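
If you prefer not to pass the token to each from_pretrained call below, an alternative (a small sketch using the huggingface_hub library that transformers depends on) is to log in once per session:

from huggingface_hub import login

# Authenticates this session so gated repositories such as Llama 2 can be downloaded.
login(token=HF_API_KEY)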
  5. Load the tokenizer for the Llama 2 70B chat model. The model is hosted under the meta-llama organization on Hugging Face, and access is gated, so the token from the previous step is passed to from_pretrained:

from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=HF_API_KEY)
  6. Load and quantize the model. In transformers, quantization is configured at load time by passing a BitsAndBytesConfig to from_pretrained rather than by quantizing an already-loaded model; device_map="auto" lets accelerate place the weights on the available GPU(s):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    use_auth_token=HF_API_KEY,
)
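
If the 8-bit model is still too large for your hardware, 4-bit quantization is another option. A rough sketch (the parameter choices here are illustrative, not tuned):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights instead of 8-bit
    bnb_4bit_quant_type="nf4",             # "normal float 4" quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config_4bit,
    device_map="auto",
    use_auth_token=HF_API_KEY,
)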
  7. Get the device the model will run on (you will also need it for input tensors at inference time):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  8. Move the model to the device. This is only needed if you loaded the model without device_map="auto"; a quantized model loaded with a device map is already placed on the GPU and should not be moved again:

model = model.to(device)
  9. Set the model to evaluation mode:

model.eval()
  10. The model is now initialized and ready to use!
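
To use the model as a conversational agent in LangChain, one straightforward pattern (a sketch; the generation settings are illustrative) is to wrap it in a transformers text-generation pipeline and hand that to LangChain's HuggingFacePipeline wrapper:

from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

# Standard transformers generation pipeline around the quantized model.
generate = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.1,
)

# Expose the pipeline to LangChain as an LLM.
llm = HuggingFacePipeline(pipeline=generate)

# A simple conversation chain that remembers previous turns.
chat = ConversationChain(llm=llm, memory=ConversationBufferMemory())
print(chat.predict(input="Hi! What can you help me with?"))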

Here are some additional notes:

  • The HF_API_KEY is required to download the model from Hugging Face, because access to the Llama 2 weights is gated. You can create a free access token in your account settings: https://huggingface.co/settings/tokens.

  • The BitsAndBytesConfig object is used to configure the quantization of the model. With load_in_8bit=True, the weights are loaded as 8-bit integers, roughly halving the memory footprint compared to the original 16-bit weights.

  • The device variable specifies the device the model (and its inputs) will run on: the GPU if one is available, otherwise the CPU.

  • The model.to(device) call moves the model to the specified device. It is only needed when the model was loaded without device_map="auto"; a quantized model loaded with a device map is already on the GPU.

  • The model.eval() call puts the model in evaluation mode, which disables training-only behaviour such as dropout, so the model is used purely for inference.
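
For reference, using the model directly (without LangChain) for a single prompt looks roughly like this; wrapping generation in torch.no_grad() avoids tracking gradients during inference:

import torch

prompt = "Explain 8-bit quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate without tracking gradients, since we are only doing inference.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))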