Large language models (LLMs) have revolutionized the field of natural language processing (NLP), enabling state-of-the-art performance on a wide range of tasks, including text classification, translation, summarization, and generation.
When it comes to use cases that leverage LLMs to extract insights from our own knowledge repositories, there are broadly two design approaches:
- Fine-tuning an LLM
- RAG (Retrieval Augmented Generation)
Many fields have their own specialised terminology. This vocabulary may be missing from the common pretraining data utilised by LLMs.
Fine-Tuning an LLM
Fine-tuning is the process of further training a pre-trained LLM on a smaller, domain-specific, labelled dataset.
To fine-tune an LLM, you'll need a labelled dataset in which each data point is an input/output pair. The input might be a written passage, a query, or a code snippet; the output might be a label, a summary, a translation, or code.
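As a concrete illustration, such a dataset could be represented as below. The examples and the `labelled_pairs` name are made up purely for illustration.

```python
# A hypothetical labelled dataset: each data point pairs an input with the
# desired output (here, one summarization and one translation example).
labelled_pairs = [
    {"input": "Summarise: The quarterly report shows revenue grew 12% year over year.",
     "output": "Revenue grew 12% year over year."},
    {"input": "Translate to French: Good morning",
     "output": "Bonjour"},
]
```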
Once you have a dataset, you can fine-tune the LLM with supervised learning: the training algorithm learns to map each input to its output by minimising a loss function.
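A minimal sketch of such a supervised fine-tuning loop is shown below. It assumes PyTorch, the Hugging Face transformers library, a small causal LM such as gpt2, and the hypothetical `labelled_pairs` list from above; a real setup would add batching, evaluation, and checkpointing.

```python
# Minimal supervised fine-tuning sketch (assumes torch, transformers and the
# hypothetical `labelled_pairs` list defined above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Concatenate each input/output pair into one training text.
texts = [f"{p['input']}\n{p['output']}" for p in labelled_pairs]
enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

labels = enc["input_ids"].clone()
labels[enc["attention_mask"] == 0] = -100          # ignore padding in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    # Passing `labels` makes the model return the next-token cross-entropy loss.
    outputs = model(input_ids=enc["input_ids"],
                    attention_mask=enc["attention_mask"],
                    labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```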
It can be computationally costly to fine-tune an LLM.
A more economical variant of this approach is PEFT (parameter-efficient fine-tuning), and LoRA is the most popular PEFT technique today.
LoRA (Low-Rank Adaptation of Large Language Models) is a fine-tuning technique that is far cheaper in both compute and memory than standard fine-tuning. Traditional fine-tuning updates all of an LLM's parameters, which is computationally expensive and memory-intensive, particularly for large models with billions of parameters. LoRA, by contrast, freezes the pretrained weights and trains only a few small low-rank matrices, which makes it much more efficient.
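The core idea can be sketched in plain PyTorch (the layer size, rank and scaling below are illustrative, not taken from any particular model): the pretrained weight W is frozen and only two small matrices A and B are trained, so the effective weight becomes W + (alpha / r) · B·A and the number of trainable parameters drops from in·out to r·(in + out).

```python
# Illustrative LoRA-style layer: the pretrained linear weight is frozen and
# only the low-rank matrices A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # frozen pretrained weight
        self.base.bias.requires_grad_(False)
        # B @ A has rank at most r; A starts random, B starts at zero so the
        # adapted layer initially behaves exactly like the pretrained one.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(768, 768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")   # only A and B train
```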
RAG (Retrieval Augmented Generation)
RAG is an effective strategy for improving the performance and relevance of LLMs by combining "prompt engineering" with "context retrieval" from external data sources.
Given below is a high-level process flow for RAG.
- All documents from a domain-specific knowledge source are converted into embeddings and stored in a special vector database. These embeddings are simply N-dimensional vectors, i.e. arrays of numbers.
- When the user types a query, the query itself is also converted into an embedding (a vector of numbers) using an embedding model, typically the same one used for the documents.
- Semantic search is then used to find the document embeddings most relevant to the query. The most popular measure is cosine similarity, which computes the cosine of the angle between two embedding vectors (their dot product divided by the product of their norms) to identify the sentences that are "near" or "close" to the query in the embedding space (see the sketch after this list).
- The retrieved, semantically similar sentences/paragraphs from the various documents are then sent to an LLM, together with the original query, for summarization. The LLM paraphrases these disparate sentences into a coherent, readable answer.
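A minimal sketch of this flow is given below. It assumes the sentence-transformers library for the embedding model and does the cosine-similarity search in memory with NumPy rather than in a real vector database; the documents, query and final LLM call are placeholders.

```python
# Minimal RAG retrieval sketch: embed documents and query, rank by cosine
# similarity, and assemble a prompt for the LLM (the LLM call itself is omitted).
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support tickets are answered within one business day.",
    "The warranty covers manufacturing defects for two years.",
]
query = "How long do I have to return a product?"

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents)      # one embedding vector per document
query_vector = embedder.encode([query])[0]    # embedding vector for the query

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over product of norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine_similarity(query_vector, v) for v in doc_vectors]
top_k = np.argsort(scores)[::-1][:2]          # indices of the two closest documents

context = "\n".join(documents[i] for i in top_k)
prompt = ("Answer the question using only the context below.\n\n"
          f"Context:\n{context}\n\nQuestion: {query}")
# `prompt` would now be sent to the LLM of your choice to generate the final answer.
print(prompt)
```

In practice the document embeddings would be stored and indexed in a vector database rather than kept in a Python list, but the similarity calculation is the same.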
The table below shows the advantages and disadvantages of both approaches. For most use cases, a proper combination of prompt engineering and RAG will suffice.