# Replicate
Replicate is a commercial service that runs uploaded models in Docker containers, packaged in the format produced by their Cog build tool. For public models, Replicate charges only for actual GPU time used; for private models, it also charges for startup and idle time. Several models wrapped for LMQL/LMTP use have already been uploaded as public models, and this chapter documents how to run them, and how to build and deploy more.
## Running A 🤗 Transformers Model On Replicate
To run a 🤗 Transformers model on Replicate, you need to:

1. Export the environment variable `REPLICATE_API_TOKEN` with the credential used to authenticate requests.
2. Set the `endpoint=` argument of your model to `replicate:ORG/MODEL`, matching the name under which the model was uploaded. If you want to use one of your organization's deployments instead, set the `endpoint=` argument to `replicate:deployment/ORG/MODEL`.
3. Set the `tokenizer=` argument of your model to a Hugging Face transformers name from which the correct configuration for the tokenizer in use can be downloaded.
For example:
```lmql
argmax
    """Review: We had a great stay. Hiking in the mountains was fabulous and the food is really good.\n
    Q: What is the underlying sentiment of this review and why?\n
    A:[ANALYSIS]\n
    Q: Summarizing the above analysis in a single word -- of the options "positive", "negative", and "neutral" -- how is the review best described?\n
    A:[CLASSIFICATION]"""
from
    lmql.model(
        # the model name is not actually used: endpoint= completely overrides model selection
        "meta-llama/Llama-2-13b-chat-hf",
        # in this case, uses the model from https://replicate.com/charles-dyfis-net/llama-2-13b-hf--lmtp-8bit
        endpoint="replicate:charles-dyfis-net/llama-2-13b-hf--lmtp-8bit",
        # a model with the same tokenizer as meta-llama/Llama-2-13b-hf, but ungated on Hugging Face
        tokenizer="AyyYOO/Luna-AI-Llama2-Uncensored-FP16-sharded",
    )
where
    STOPS_AT(ANALYSIS, "\n") and len(TOKENS(ANALYSIS)) < 200
distribution
    CLASSIFICATION in [" positive", " negative", " neutral"]
```
## Uploading A 🤗 Model To Replicate
You can also upload and deploy your own LMQL models to Replicate. To do so, first install Cog. In addition, LMQL provides scripts that largely automate the process of building and uploading models (see the `scripts/replicate-build` directory of the LMQL source distribution). Then:
1. Create a corresponding model on the Replicate website.
2. Copy `config.toml.example` to `config.toml`, and customize it:
   - Change `dest_prefix` to replace `YOURACCOUNT` with the name of the actual Replicate account to which you will be uploading models.
   - For each model you wish to build and upload, your config file should have a `[models.MODELNAME]` section. Make sure `MODELNAME` matches the name of the model as created in your Replicate account.
   - `huggingface.repo` should reflect the Hugging Face model name you wish to wrap. If you want to pin a version, also set `huggingface.version`.
   - The `config` section may be used to set any values you want to pass in the `model_args` dictionary.
3. Run the `./build` script, with your current working directory being `scripts/replicate-build`. This will create a `work/` subdirectory for each model defined in your configuration file.
4. In each `work/MODELNAME` directory, run the generated `./push` script to build and upload your model, or `cog predict` to test your model locally.
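As a rough illustration of step 2, a `config.toml` might look something like the sketch below. The account name, model name, and config keys shown are hypothetical placeholders; consult `config.toml.example` in `scripts/replicate-build` for the authoritative layout and supported options.

```toml
# Hypothetical sketch -- see config.toml.example for the real schema.
# YOURACCOUNT replaced with the Replicate account receiving the upload.
dest_prefix = "r8.im/YOURACCOUNT"

[models.my-llama-13b-8bit]          # must match the model created on Replicate
huggingface.repo = "meta-llama/Llama-2-13b-hf"
# huggingface.version = "..."       # optional: pin a specific revision

[models.my-llama-13b-8bit.config]   # values passed via the model_args dictionary
load_in_8bit = true
```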