Artificial Intelligence (AI), with its major advances, is in the spotlight in 2024. One problem is that the term has become ambiguous as it has gained so much attention: we all know what it means to use AI, but few of us understand what infrastructure it requires. Let's take a look at the concepts that make AI tick: how data is stored, correlated, and connected so that an algorithm can learn to interpret it. As with most data-oriented architectures, it all begins with a database.

Data As Coordinates

The process of creating intelligence, artificial or natural, is very similar in both cases: we store information in chunks and then connect those chunks. Many visualization tools and metaphors show this as dots on a graph connected by lines, and intelligence emerges from those connections and intersections. We can combine "chocolate tastes good" with "drinking warm milk keeps you warm" to arrive at "hot cocoa". As humans, we don't stress too much about whether the connections are made correctly; our brains simply work that way. For building AI, however, we must be more explicit. Think of it as a road map: for a plane to fly from CountryA to CountryB, it needs a precise system. On a map we have two axes, and a position can be represented by a vector such as [28.3772, 81.5707]. For our artificial intelligence we need a more complex system: two dimensions are not enough, we need thousands. That is what vector databases are: databases that store data as high-dimensional vectors, allowing for efficient semantic matching and similarity searches. With data stored this way, the system can correlate words by the distance or angle between them, create cross-references, and establish patterns for where each term occurs.
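To make this concrete, here is a minimal sketch of storing vectors and searching them by similarity. It uses the FAISS library as the index and random vectors as stand-ins for real embeddings; the library choice, the 384 dimensions, and the sample sizes are assumptions for illustration only.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dimensions = 384      # a common embedding size (an assumption for this sketch)
num_entries = 1_000   # pretend these are embedded text entries

# Random vectors stand in for real embeddings of stored data.
rng = np.random.default_rng(42)
stored_vectors = rng.random((num_entries, dimensions)).astype("float32")

# A flat index that compares vectors by Euclidean (L2) distance.
index = faiss.IndexFlatL2(dimensions)
index.add(stored_vectors)

# Querying: embed the query the same way, then ask for the closest entries.
query = rng.random((1, dimensions)).astype("float32")
distances, ids = index.search(query, 3)
print("closest entries:", ids[0], "distances:", distances[0])
```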
Querying By Approximation

As we discussed in the previous section, matching your search terms (your prompts) against the stored data is an exercise in semantic matching (it finds the patterns of the keywords used in your prompt within its own data), while the similarity search measures the distance (angular or linear) between entries. A similarity search takes each of the numbers within a vector, which is thousands of coordinates long, and treats the whole vector as a point in a high-dimensional space. To establish similarity, the angles and/or distances between those points are measured. This is one reason AI is not deterministic (we aren't either): for the same query, the search can produce different results depending on how the scores were defined at the time.

When building an AI system, there are algorithms you can choose from to determine how your data is evaluated, and the right choice can produce more accurate and precise results for your data type. The three main algorithms are cosine similarity, dot product, and Euclidean distance. Each performs better with certain types of data, so understanding the shape of your data and how its concepts correlate with each other is key to choosing the right one. Here's a rule of thumb for each algorithm:

- Cosine similarity: measures the angle between vectors, so the magnitude (the actual numbers) matters less. It's perfect for text and semantic similarity.
- Dot product: captures linear alignment and correlation. It's a great way to establish relationships between multiple features or points.
- Euclidean distance: calculates the straight-line distance between points. It is useful for spaces with dense numerical values, as it highlights spatial distance.
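As a quick illustration, here is a minimal sketch of the three metrics computed with NumPy; the two short vectors are made up purely for the example.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.5])

# Dot product: linear alignment between the two vectors.
dot = np.dot(a, b)

# Cosine similarity: the angle between the vectors, ignoring magnitude.
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance between the two points.
euclidean = np.linalg.norm(a - b)

print(f"dot={dot:.2f} cosine={cosine:.4f} euclidean={euclidean:.2f}")
```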
Note: When dealing with unstructured information (text entries such as your tweets, your book, multiple recipes, or your product documentation), cosine similarity is the way to go.

Now that we have a better understanding of how data is stored and how relationships are built, we can talk about how the intelligence works. Let's start the training!

Language Models

A language model is a system trained to understand, predict, and generate human-like text by learning statistical patterns and relationships among words and phrases. In such a system, language is represented as probabilistic sequences. A language model can then be used for efficient completion, translation, and conversation (hence the quote that 90% of Google's code is written by AI: auto-completion). These tasks are low-hanging fruit for AI because they rely on estimating the likelihood of word combinations, and they improve by reinforcing patterns based on feedback from users (rebalancing the similarity scores).

Now that we know what a language model is, we can classify them into large and small.

Large Language Models (LLMs), as the name suggests, are trained on large datasets and have billions of parameters, up to around 70 billion. This allows them to be diverse and to create human-like text across different knowledge domains; imagine them as generalists. It makes them extremely powerful and versatile, but as a result they require a lot of computational work to train.

Small Language Models (SLMs) are smaller, ranging from roughly 100 million to 3 billion parameters. They require less computational effort and are therefore less versatile, but they are also more efficient and can make faster inferences when processing user input.
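To make "probabilistic sequences" a bit more tangible, here is a toy sketch that counts which word follows which in a tiny made-up corpus and turns those counts into probabilities. Real language models do this with neural networks over billions of tokens; this is only meant to illustrate the idea of predicting the next word.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus; real models learn from billions of tokens.
corpus = "warm milk keeps you warm . chocolate tastes good . hot cocoa tastes good".split()

# Count how often each word follows another (a simple bigram model).
followers = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current][nxt] += 1

def next_word_probabilities(word):
    counts = followers[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# "Completion" is picking a likely continuation of the prompt.
print(next_word_probabilities("tastes"))  # {'good': 1.0}
print(next_word_probabilities("warm"))    # {'milk': 0.5, '.': 0.5}
```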
Fine-Tuning

Fine-tuning an LLM involves adjusting its weights through specialized training on a specific, high-quality dataset. It is basically adapting a pre-trained model so that it performs better in a specific domain or task. As the model is trained, its heuristics are refined, which results in more accurate and context-specific outcomes without having to create a custom language model for each task. On each iteration of training, developers adjust the learning rate, batch size, and weights while providing a dataset tailored to that particular knowledge area, and each iteration also depends on benchmarking the output of the model. As previously mentioned, fine-tuning can be particularly useful for applying a well-defined task to a niche knowledge field, such as creating summaries of nutritional scientific articles or correlating symptoms with subsets of possible conditions. Fine-tuning requires many iterations, and it is not intended for factual data, especially data that depends on current events.
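For a sense of what those knobs look like in code, here is a minimal fine-tuning sketch using the Hugging Face Transformers Trainer. The model name, the hyperparameter values, and the two-sentence "dataset" are illustrative assumptions, not a recipe; a real fine-tune needs a curated, high-quality corpus and proper benchmarking between runs.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "distilgpt2"  # a small pre-trained model, chosen only to keep the sketch light
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny made-up domain dataset; the quality of this corpus drives the result.
texts = ["Vitamin D supports bone health.", "Iron deficiency can cause fatigue."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=64),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="finetune-demo",
    learning_rate=2e-5,             # the learning rate the article mentions...
    per_device_train_batch_size=2,  # ...the batch size...
    num_train_epochs=3,             # ...and how many passes are made over the data
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # each run would normally be followed by benchmarking the model's output
```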
Context Enhancement With Information

The majority of our conversations depend on context, and AI is no different. There are certainly use cases that do not depend on current events at all (translations, summaries, data analysis, and so on), but many others do, and it is not yet feasible to retrain LLMs or SLMs on a daily basis. A technique called Retrieval-Augmented Generation (RAG) can help. It involves injecting a smaller, targeted dataset into the LLM's context to provide it with more specific and/or current information. The LLM doesn't become better trained with RAG; it still has the same generalist training as before, but it now receives new information before it generates output.

Note: RAG provides better context for the LLM, allowing it to gain a deeper understanding of the topic.

The data must be formatted and prepared in a way the LLM can digest. It is a multi-step procedure, sketched below:

- Retrieval: query external data, such as web pages, knowledge bases, and databases.
- Pre-processing: tokenization, stemming, and removal of stopwords.
- Grounded generation: the pre-processed information is then integrated into the pre-trained LLM's context.
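Here is a toy sketch of those three steps, using an in-memory list as the "external data" and simple keyword overlap as the retrieval score. In a real system the knowledge base would live in a vector database, retrieval would be a similarity search over embeddings, and the assembled prompt would be sent to your language model of choice; everything named here is made up for the example.

```python
# Toy RAG pipeline: retrieval -> pre-processing -> grounded generation (prompt assembly).
knowledge_base = [
    "Hot cocoa is made by mixing chocolate with warm milk.",
    "Euclidean distance measures straight-line distance between points.",
    "Cosine similarity compares the angle between two vectors.",
]

STOPWORDS = {"is", "the", "a", "by", "with", "between", "two", "of", "how"}

def preprocess(text):
    # Tokenization, lowercasing, and stopword removal (stemming omitted for brevity).
    return [t.strip(".,?!").lower() for t in text.split() if t.lower() not in STOPWORDS]

def retrieve(query, top_k=1):
    # Retrieval: score every entry by keyword overlap with the query.
    query_tokens = set(preprocess(query))
    scored = [(len(query_tokens & set(preprocess(doc))), doc) for doc in knowledge_base]
    scored.sort(reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def grounded_prompt(query):
    # Grounded generation: inject the retrieved context ahead of the user's question.
    context = "\n".join(retrieve(query))
    return f"Use the following context to answer.\nContext:\n{context}\n\nQuestion: {query}"

print(grounded_prompt("How is hot cocoa made?"))
# The assembled prompt is what gets sent to the LLM to generate the final answer.
```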
RAG retrieves relevant data from a database based on a query created by the LLM. By integrating RAG with an LLM, the model can be given better context and a deeper understanding of the topic, and this enhanced context allows it to generate more precise and informative responses. The approach is mainly used for data-driven responses, as it allows easy access to fresh information through database records that are easy to update; that data is context-focused and therefore more accurate. Think of RAG as a tool that can transform your LLM into a specialist. It can be used to enhance the context of an LLM, which is especially useful for chatbots and other applications where output quality is directly related to domain knowledge. In short, RAG is a strategy that collects and "injects" data into a language model's context. But that data needs to be prepared first, and that preparation is where embedding comes in.

Embedding

In order to make data digestible for the LLM, it is necessary to capture the semantic meaning of each entry so that the language model can establish patterns and relationships. This process, called embedding, creates a static vector representation of the data. Different language models use different levels of embedding precision; you can find embeddings ranging from 384 to 3,072 dimensions. Compared to our Cartesian coordinates on a map (e.g. [28.3772, 81.5707]), which have only two dimensions, the embedded entry for an LLM can have anywhere from 384 to 3,072 dimensions.
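As a small sketch of what producing an embedding looks like, the example below uses the sentence-transformers library with the all-MiniLM-L6-v2 model, which outputs 384-dimensional vectors. The library and the model are assumptions chosen for illustration; any embedding model would play the same role.

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# A small open model that produces 384-dimensional embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")

entries = [
    "Hot cocoa is chocolate dissolved in warm milk.",
    "Our return policy allows refunds within 30 days.",
]
vectors = model.encode(entries)

print(vectors.shape)  # (2, 384): one 384-dimensional vector per entry
# These vectors are what gets stored in the vector database and compared by similarity.
```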
Let's Build

I hope this has helped you better understand what these terms mean and the processes that hide behind the word "AI". This does not cover all of AI's complexity, though: we still need to discuss AI agents and how all of these approaches interact to create richer experiences. Let me know if you would like to see that in a future article, and let me know what you think and what you create with this.