# Edge Deployment: Running LLMs on Mobile & IoT Devices

## Edge Deployment of LLMs

### Model Selection for Edge

Suitable models (a rough memory check follows the list):
- Phi-2 (2.7B parameters)
- Llama 2 7B (quantized)
- TinyLlama (1.1B parameters)
- MobileLLM (sub-billion parameters)
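What makes a model "suitable" here is mostly its weight footprint. A back-of-the-envelope estimate (the helper below is hypothetical, not from any library) shows why a 2.7B-parameter model like Phi-2 fits on a phone only once quantized:

```python
# Hypothetical helper: weight memory only, ignoring activations and KV cache
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    return num_params * bits_per_weight / 8 / 1e9

# Phi-2 (2.7B parameters) at different precisions
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(2.7e9, bits):.1f} GB")
# 16-bit: 5.4 GB, 8-bit: 2.7 GB, 4-bit: 1.4 GB (approximate)
```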
### Quantization for Mobile

```python
import torch
from transformers import AutoModelForCausalLM

# Load the full-precision model
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Quantize the linear layers to INT8 with dynamic quantization
model_int8 = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Save the quantized weights for mobile packaging
torch.save(model_int8.state_dict(), "model_mobile.pt")
```
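The Android example below consumes an ONNX model rather than a PyTorch state dict. One common route to produce one is an export through Hugging Face Optimum followed by ONNX Runtime's own dynamic quantization. This is a sketch of that route, not the original's export step, and the output file name can vary by Optimum version:

```python
from optimum.onnxruntime import ORTModelForCausalLM
from onnxruntime.quantization import QuantType, quantize_dynamic

# Export the model to an ONNX graph
ort_model = ORTModelForCausalLM.from_pretrained("microsoft/phi-2", export=True)
ort_model.save_pretrained("phi2-onnx")

# Quantize the exported graph to INT8 using ONNX Runtime's tooling
quantize_dynamic(
    "phi2-onnx/model.onnx",       # file name written by save_pretrained
    "phi2-onnx/model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```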
### Android Integration

```kotlin
// Using ONNX Runtime
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession

class LLMInference {
    private lateinit var session: OrtSession

    fun initialize(modelPath: String) {
        val env = OrtEnvironment.getEnvironment()
        session = env.createSession(modelPath)
    }

    fun generate(prompt: String): String {
        // createInputTensor and decodeOutput are app-specific helpers
        // (tokenizer and detokenizer); they are not shown here
        val inputTensor = createInputTensor(prompt)
        val outputs = session.run(mapOf("input_ids" to inputTensor))
        return decodeOutput(outputs)
    }
}
```
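Before bundling the model in the app, check the exported graph's actual input signature from Python: exported causal LMs often expect more than `input_ids` (attention mask, position ids, past key values, depending on export settings). A minimal sketch, assuming the file path from the export example above:

```python
import onnxruntime as ort

session = ort.InferenceSession("phi2-onnx/model_int8.onnx")

# Print the exact inputs the Android code will have to feed
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```

Whatever names appear here are the keys the Kotlin `session.run(...)` map must supply.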
### iOS Integration

```swift
import CoreML

class LLMModel {
    var model: MLModel?

    func loadModel() {
        // Xcode compiles bundled .mlmodel/.mlpackage files to .mlmodelc;
        // MLModel(contentsOf:) expects the compiled form
        guard let modelURL = Bundle.main.url(forResource: "model",
                                             withExtension: "mlmodelc") else { return }
        model = try? MLModel(contentsOf: modelURL)
    }

    func predict(prompt: String) -> String? {
        // Inference code: tokenize the prompt, call model?.prediction(from:),
        // and decode the output (app-specific, not shown)
        return nil
    }
}
```
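To produce the Core ML model this class loads, the usual route is Python's coremltools on a traced PyTorch module. The sketch below shows the flow only; the wrapper class and short trace length are assumptions, and converting a model of Phi-2's size in one shot like this is slow and memory-hungry in practice:

```python
import numpy as np
import torch
import coremltools as ct
from transformers import AutoModelForCausalLM

# Wrapper so tracing sees plain tensors instead of HF output objects
# (hypothetical helper, not part of transformers)
class LogitsOnly(torch.nn.Module):
    def __init__(self, lm):
        super().__init__()
        self.lm = lm

    def forward(self, input_ids):
        return self.lm(input_ids).logits

lm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
example = torch.randint(0, 1000, (1, 64))
traced = torch.jit.trace(LogitsOnly(lm).eval(), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=example.shape, dtype=np.int32)],
)
mlmodel.save("model.mlpackage")  # add to Xcode, which compiles it to .mlmodelc
```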
### Optimization Strategies

- Pruning: remove 30-50% of weights by magnitude (see the sketch after this list)
- Quantization: INT8 or INT4 weights
- Distillation: train a smaller student model to mimic a larger teacher
- Caching: cache outputs for frequent prompts
- Batching: process multiple inputs together
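As a concrete instance of the pruning item, PyTorch ships magnitude-pruning utilities; a minimal sketch with a hypothetical helper name:

```python
import torch
import torch.nn.utils.prune as prune

# Hypothetical helper: prune a fraction of each Linear layer's weights
# by L1 magnitude, then bake the masks into the tensors
def prune_linear_layers(model: torch.nn.Module, amount: float = 0.4) -> torch.nn.Module:
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the pruning permanent
    return model
```

Note that unstructured zeros keep the dense tensor shape; the model only shrinks on disk or in memory after a sparse-aware export or compression step.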