On-Device AI: Building Smarter, Faster, And Private Applications

It's not a stretch to say that AI has become a handy tool in our daily lives. It can recognize faces, understand or clone speech, analyze large amounts of data, and create personalized app experiences, such as music playlists tailored to your listening habits or workout plans matched to your progress.

Here's the thing, though: where AI tools live and do their work matters a lot. Take self-driving vehicles, for example. These cars need AI to process data from cameras, sensors, and other inputs to make split-second decisions, such as detecting obstacles or adjusting speed for sharp turns. If all that processing depended on the cloud, network latency could lead to delayed responses or system failures. That's why the AI should operate directly in the car. This allows the car to respond instantly without needing an internet connection.

This is what we call On-Device AI (ODAI). In simple terms, ODAI is AI that does its work right where you are: on your phone, in your car, or on your wearable device, without needing to connect to the cloud or the internet. This type of setup falls under the category of Embedded AI (EMAI), where the intelligence is embedded into the device itself.

OK, so I mentioned ODAI, and then EMAI as a subset of ODAI. However, EMAI differs slightly from other terms you may come across, such as Edge AI, Web AI, and Cloud AI. So, what's the difference? Here's a quick breakdown:

Edge AI

This refers to running AI models directly on devices instead of relying on remote servers or the cloud. A simple example is a security camera that analyzes footage right where it is. It processes everything locally, close to where the data is collected.

Embedded AI

In this case, AI is built directly into the hardware or device itself, as if the device has its own mini AI brain. Think of AI-powered drones that can map terrain or monitor areas on their own. One of the main differences between Embedded AI and Edge AI is that Embedded AI relies on dedicated chips integrated with AI models and algorithms to perform intelligent tasks locally.

Cloud AI

When AI lives and works on remote servers, it's referred to as Cloud AI. When you use a language translation app, it sends your text to a cloud server, where the AI processes it and returns the translation. The entire operation happens in the cloud, so it requires an internet connection to work.

Web AI

These are tools or apps that run in your browser or are part of a website or online platform. You might see product suggestions based on your previous purchases or browsing history. These tools rely on AI models hosted in the cloud to analyze data and make recommendations.

The main difference? It's about where the AI does the work: on your device, nearby, or far away in the cloud or on the web.

What Makes On-Device AI Useful

On-device AI is, first and foremost, about privacy: keeping your data secure and under your control. It handles all processing directly on your device, avoiding the need to send personal data to external servers. But what exactly makes this technology worth using?

Real-Time Processing

On-device AI processes data instantly because it doesn't need to send anything to the cloud. Think of a smart doorbell that recognizes a visitor's face immediately and notifies you. If it had to wait for cloud servers to analyze the image, the notification would arrive too late to be useful. Or consider using voice commands to open an app or call a friend, then receiving a summary afterward. Your phone processes the audio locally, and the AI system manages everything directly on the device without help from external servers. Your data stays private, secure, and under your control.

Offline Functionality

A big advantage of ODAI is that it doesn't need the internet to function, which means it can work even in areas with poor or no connectivity. Take modern GPS navigation systems in cars as an example: they provide turn-by-turn directions even when there's no signal.

Reduced Latency

ODAI skips the round trip of sending data to the cloud and waiting for a response. This means that when you change a setting or issue a command, the device processes the input immediately, making your experience smoother and more responsive.

The Technical Pieces of the On-Device AI Puzzle

At its core, ODAI uses specialized hardware and efficient model designs to carry out tasks directly on devices such as smartphones, smartwatches, and Internet of Things (IoT) gadgets. Thanks to advances in hardware technology, AI can now run locally, especially for tasks requiring AI-specific computer processing, such as the following:

Neural Processing Units (NPUs)

These chips are optimized for neural networks, deep learning, and machine learning applications. They can handle large-scale AI computations efficiently while consuming minimal power.

Graphics Processing Units (GPUs)

GPUs are known for their ability to process multiple tasks simultaneously, and they excel at speeding up AI operations, especially with large datasets.

Here are some of the most innovative AI chips on the market:

Product | Organization | Key features
Spiking neural network chip | Indian Institute of Technology | Ultra-low power consumption
Hierarchical learning processor | Ceromorphic | Alternative transistor structure
Intelligent Processing Units (IPUs) | Graphcore | Multiple products targeting end devices as well as the cloud
Katana Edge AI | Synaptics | Combines motion, sound, and vision detection
ET-SoC-1 | Esperanto Technology | Built on RISC-V to run AI and non-AI workloads
NeuRRAM | CEA | Neuromorphic compute-in-memory chip based on resistive RAM (RRAM)

Techniques for Optimizing AI Models

Creating AI models that fit resource-constrained devices often requires combining clever hardware utilization with techniques to make models smaller and more efficient. I'd like to share a few notable examples of how teams optimize AI to do more while using less energy.

Meta's MobileLLM

Meta's approach to ODAI introduced a model built specifically for smartphones. Instead of scaling up traditional models, MobileLLM was designed from scratch to balance efficiency and performance. One key innovation was increasing the number of smaller layers rather than using fewer large ones. This design choice improved the model's accuracy and speed while keeping it lightweight. You can try out the model on Hugging Face or with vLLM, a library for LLM inference and serving.

Quantization

Quantization simplifies a model's internal calculations by using lower-precision numbers, such as 8-bit integers, instead of 32-bit floating-point numbers. It reduces memory requirements and computation costs, often with minimal impact on model accuracy.

Pruning

Although neural networks contain many weights (connections between neurons), not all of them are crucial. Pruning identifies and removes the less important weights, resulting in a smaller, faster model without a significant loss of accuracy. (A short code sketch of quantization and pruning follows this list.)

Matrix Decomposition

Large matrices are a core component of AI models. Matrix decomposition splits these matrices into smaller ones, reducing computational complexity while approximating the original model's behavior.

Knowledge Distillation

This technique involves training a smaller model (the "student") to mimic the outputs of a larger, pre-trained model (the "teacher"). The smaller model learns to replicate the teacher's behavior, achieving similar accuracy while being much more efficient. For instance, DistilBERT reduced BERT's size by 40% while retaining 97% of its performance.
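To make quantization and pruning less abstract, here is a minimal PyTorch sketch applying both passes to a toy network. The model, the layer sizes, and the 30% pruning ratio are arbitrary choices for illustration, not values from any particular paper:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy network standing in for a model we want to shrink (sizes are arbitrary).
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Pruning: zero out the 30% of weights with the smallest magnitude in each
# Linear layer, then make the sparsity permanent.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# Quantization: replace the Linear layers' 32-bit float weights with 8-bit
# integers, computed dynamically at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10])
```

In practice, you would measure accuracy after each pass; the two techniques compose, so a pruned model can be quantized afterward for further savings.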

The Technologies Used for On-Device AI

All the model compression techniques and specialized chips are cool because they're what make ODAI possible. But what's even more interesting for us as developers is actually putting these tools to work. This section covers some of the key technologies and frameworks that make ODAI accessible.

MediaPipe Solutions

MediaPipe Solutions is a developer toolkit for adding AI-powered features to apps and devices. It offers cross-platform, customizable tools that are optimized for running AI locally, from real-time video analysis to natural language processing. At the heart of MediaPipe Solutions is MediaPipe Tasks, a core library that lets developers deploy ML solutions with minimal code. It supports platforms such as Android, Python, and Web/JavaScript, so you can easily integrate AI into a wide range of applications. MediaPipe offers a variety of specialized tasks for different AI needs; here are a few of them.

LLM Inference

This API runs lightweight large language models (LLMs) entirely on-device for tasks such as text generation and summarization. It supports open models such as Gemma and external options such as Phi-2.
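MediaPipe's LLM Inference API itself ships for Android, iOS, and the Web rather than Python, so as an illustration of the same idea, running a small open model fully on local hardware, here is a sketch using the Hugging Face transformers library with Phi-2, one of the models mentioned above. The prompt and generation settings are arbitrary:

```python
from transformers import pipeline

# Downloads the Phi-2 weights once, then runs generation entirely on
# local hardware; after the download, no prompt data leaves the machine.
generator = pipeline("text-generation", model="microsoft/phi-2")

result = generator(
    "Summarize in one sentence: on-device AI runs models locally "
    "instead of sending data to the cloud.",
    max_new_tokens=60,
)
print(result[0]["generated_text"])
```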

Object Detection

This tool identifies and locates objects within images or videos, which is ideal for real-time applications such as detecting people, animals, or objects directly on the device.
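Here is a minimal sketch of the object detection task using MediaPipe's Python API. The model file name and the 0.5 score threshold are assumptions for illustration; a pre-trained detector such as EfficientDet-Lite0 can be downloaded from the MediaPipe model pages first:

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Point the task at a pre-trained .tflite detection model on disk.
base_options = python.BaseOptions(model_asset_path="efficientdet_lite0.tflite")
options = vision.ObjectDetectorOptions(
    base_options=base_options,
    score_threshold=0.5,  # drop low-confidence detections
)
detector = vision.ObjectDetector.create_from_options(options)

# Run detection on a single image, entirely on-device.
image = mp.Image.create_from_file("photo.jpg")
result = detector.detect(image)

for detection in result.detections:
    top = detection.categories[0]
    print(top.category_name, round(top.score, 2), detection.bounding_box)
```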

Image Segmentation

MediaPipe can also segment images, for example, isolating a person in a video feed from the background. It works on both single images (like photographs) and continuous video streams (like live video or recorded footage).

LiteRT

LiteRT, formerly known as TensorFlow Lite, is a lightweight, high-performance runtime for ODAI. It lets you run pre-trained models or convert TensorFlow, PyTorch, and JAX models into a LiteRT-compatible format using AI Edge tools.
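As a minimal sketch of running a converted model with LiteRT, here is the classic TensorFlow Lite interpreter flow in Python; the model path and dummy input are placeholders (the newer standalone ai-edge-litert package exposes an equivalent Interpreter):

```python
import numpy as np
import tensorflow as tf

# Load a model that was converted to the .tflite / LiteRT format.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Build a dummy input matching the model's expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)

# Run inference locally; no data leaves the machine.
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]["index"])
print(output.shape)
```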

Model Explorer

Model Explorer helps you visualize machine learning models and graphs. It simplifies preparing models for on-device deployment by letting you understand a model's structure and fine-tune it for better performance. You can use Model Explorer locally or in Colab for testing and experimenting.

ExecuTorch

If you're a PyTorch user, ExecuTorch makes it easy to deploy models to mobile, wearable, and edge devices. It's part of the PyTorch Edge ecosystem, which supports building AI experiences for edge devices such as embedded systems and microcontrollers.
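To show what the ExecuTorch workflow looks like, here is a sketch of exporting a tiny PyTorch module to a .pte file that the on-device ExecuTorch runtime can load. It follows ExecuTorch's documented export flow, though API details vary between versions, and the model here is a trivial stand-in:

```python
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.relu(x)

model = TinyModel().eval()
example_inputs = (torch.randn(1, 8),)

# Capture the model as an exported program, lower it to ExecuTorch's
# edge dialect, then serialize a program the on-device runtime can run.
exported = torch.export.export(model, example_inputs)
edge_program = to_edge(exported)
et_program = edge_program.to_executorch()

# The .pte file is what gets bundled into a mobile or embedded app.
with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)
```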

Large Language Models for On-Device AI

Gemini is a powerful AI model that doesn't just excel at processing text or images; it can handle various types of data with ease. Better yet, it's designed to work right on your devices. Gemini Nano, a lightweight version of the model, runs efficiently on-device while keeping user data private. What can Gemini Nano be used for?

Call Notes on Pixel devices

This feature creates private summaries and transcripts of conversations. It works entirely on-device, ensuring privacy for everyone involved.

Pixel Recorder app

This app uses Gemini Nano to summarize recordings directly on your device.

TalkBack

Nano's multimodal capabilities enhance TalkBack, Android's accessibility feature, by providing clear descriptions of images. Note: This is similar to the application we created using LLaVA in a previous article.

Gemini Nano is far from the only language model designed specifically for ODAI. I've gathered a few others that are worth mentioning:

Model | Developer | Research paper
Octopus v2 | NexaAI | On-device language model for super agent
OpenELM | Apple ML Research | An efficient language model family with an open-source training and inference framework
Ferret-v2 | Apple | A significant improvement over its predecessor, with enhanced visual processing and an advanced training regime
MiniCPM | Tsinghua University | A GPT-4V-level multimodal LLM on your phone
Phi-3 | Microsoft | A highly capable language model running locally on your phone

The Trade-Offs of Using On-Device AI

Getting a lightweight, private solution for your application involves some compromises. Here are a few of them:

Limited Resources

Phones and wearables don't have as much computing power as larger machines. AI models must fit within limited storage and memory while still running efficiently. Running AI can also drain the battery, so models need to be optimized to balance power consumption and performance.

Data and Updates

AI in devices such as drones, self-driving cars, and the like processes data quickly using sensors or lidar. However, these models, or the system as a whole, don't usually receive real-time updates or additional training unless they are connected to the cloud. Without regular model updates and retraining, the system may struggle to adapt to new situations.

Biases

Biases in training data are a common challenge in AI, and ODAI models are no exception. These biases can lead to unfair decisions or errors, such as misidentifying people. For ODAI, the challenge is not only addressing these biases during training but also ensuring the solutions work efficiently within the device's constraints.

These are not the only challenges of on-device AI. It's a growing technology that is still in its infancy, and the limited number of professionals familiar with it makes it harder to implement.

On-Device AI vs. Cloud-Based AI

The choice between on-device AI and cloud-based AI comes down to the needs of your application. Here's a quick comparison to make things clear:

Aspect | On-Device AI | Cloud-Based AI
Privacy | Data stays on the device, ensuring privacy. | Data is sent to the cloud, which could raise privacy concerns.
Latency | Processes instantly with no delay. | Relies on internet speed, which can introduce delays.
Connectivity | Works offline and is reliable in any environment. | Requires a stable internet connection.
Processing power | Limited by the device's hardware. | Can leverage powerful cloud servers for complex tasks.
Cost | No ongoing server expenses. | Can incur continuous cloud infrastructure costs.

For apps that need fast processing and strong privacy, ODAI is the way to go. On the other hand, cloud-based AI is a better option when you need more computing power and frequent updates. The decision depends on your project's needs and what matters most to you.
