Skip to content

LLM-PowerHouse: Unleash LLMs' potential through curated tutorials, best practices, and ready-to-use code for custom training and inferencing.

License

Notifications You must be signed in to change notification settings

ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ—£οΈ LLM PowerHouse

Unleash LLMs' potential through curated tutorials, best practices, and ready-to-use code for custom training and inferencing.

Overview

Welcome to LLM-PowerHouse, your ultimate resource for unleashing the full potential of Large Language Models (LLMs) with custom training and inferencing. This GitHub repository is a comprehensive and curated guide designed to empower developers, researchers, and enthusiasts to harness the true capabilities of LLMs and build intelligent applications that push the boundaries of natural language understanding.

Table of contents

Foundations of LLMs

This section offers fundamental insights into mathematics, Python, and neural networks. It may not be the ideal starting point, but you can consult it whenever necessary.

⬇️ Ready to Embrace Foundations of LLMs? ⬇️
graph LR
    Foundations["πŸ“š Foundations of Large Language Models (LLMs)"] --> ML["1️⃣ Mathematics for Machine Learning"]
    Foundations["πŸ“š Foundations of Large Language Models (LLMs)"] --> Python["2️⃣ Python for Machine Learning"]
    Foundations["πŸ“š Foundations of Large Language Models (LLMs)"] --> NN["3️⃣ Neural Networks"]
    Foundations["πŸ“š Foundations of Large Language Models (LLMs)"] --> NLP["4️⃣ Natural Language Processing (NLP)"]
    
    ML["1️⃣ Mathematics for Machine Learning"] --> LA["πŸ“ Linear Algebra"]
    ML["1️⃣ Mathematics for Machine Learning"] --> Calculus["πŸ“ Calculus"]
    ML["1️⃣ Mathematics for Machine Learning"] --> Probability["πŸ“Š Probability & Statistics"]
    
    Python["2️⃣ Python for Machine Learning"] --> PB["🐍 Python Basics"]
    Python["2️⃣ Python for Machine Learning"] --> DS["πŸ“Š Data Science Libraries"]
    Python["2️⃣ Python for Machine Learning"] --> DP["πŸ”„ Data Preprocessing"]
    Python["2️⃣ Python for Machine Learning"] --> MLL["πŸ€– Machine Learning Libraries"]
    
    NN["3️⃣ Neural Networks"] --> Fundamentals["πŸ”§ Fundamentals"]
    NN["3️⃣ Neural Networks"] --> TO["βš™οΈ Training & Optimization"]
    NN["3️⃣ Neural Networks"] --> Overfitting["πŸ“‰ Overfitting"]
    NN["3️⃣ Neural Networks"] --> MLP["🧠 Implementation of MLP"]
    
    NLP["4️⃣ Natural Language Processing (NLP)"] --> TP["πŸ“ Text Preprocessing"]
    NLP["4️⃣ Natural Language Processing (NLP)"] --> FET["πŸ” Feature Extraction Techniques"]
    NLP["4️⃣ Natural Language Processing (NLP)"] --> WE["🌐 Word Embedding"]
    NLP["4️⃣ Natural Language Processing (NLP)"] --> RNN["πŸ”„ Recurrent Neural Network"]

Loading

1. Mathematics for Machine Learning

Before mastering machine learning, it's essential to grasp the fundamental mathematical concepts that underpin these algorithms.

Concept Description
Linear Algebra Crucial for understanding many algorithms, especially in deep learning. Key concepts include vectors, matrices, determinants, eigenvalues, eigenvectors, vector spaces, and linear transformations.
Calculus Important for optimizing continuous functions in many machine learning algorithms. Essential topics include derivatives, integrals, limits, series, multivariable calculus, and gradients.
Probability and Statistics Vital for understanding how models learn from data and make predictions. Key concepts encompass probability theory, random variables, probability distributions, expectations, variance, covariance, correlation, hypothesis testing, confidence intervals, maximum likelihood estimation, and Bayesian inference.

Further Exploration

Reference Description Link
3Blue1Brown - The Essence of Linear Algebra Offers a series of videos providing geometric intuition to fundamental linear algebra concepts. πŸ”—
StatQuest with Josh Starmer - Statistics Fundamentals Provides clear and straightforward explanations for various statistical concepts through video tutorials. πŸ”—
AP Statistics Intuition by Ms Aerin Curates a collection of Medium articles offering intuitive insights into different probability distributions. πŸ”—
Immersive Linear Algebra Presents an alternative visual approach to understanding linear algebra concepts. πŸ”—
Khan Academy - Linear Algebra Tailored for beginners, this resource provides intuitive explanations for fundamental linear algebra topics. πŸ”—
Khan Academy - Calculus Delivers an interactive course covering the essentials of calculus comprehensively. πŸ”—
Khan Academy - Probability and Statistics Offers easy-to-follow material for learning probability and statistics concepts. πŸ”—

2. Python for Machine Learning

Concept Description
Python Basics Mastery of Python programming entails understanding its basic syntax, data types, error handling, and object-oriented programming principles.
Data Science Libraries Familiarity with essential libraries such as NumPy for numerical operations, Pandas for data manipulation, and Matplotlib and Seaborn for data visualization is crucial for effective data analysis.
Data Preprocessing This phase involves crucial tasks such as feature scaling, handling missing data, outlier detection, categorical data encoding, and data partitioning into training, validation, and test sets to ensure data quality and model performance.
Machine Learning Libraries Proficiency with Scikit-learn, a comprehensive library for machine learning, is indispensable. Understanding and implementing algorithms like linear regression, logistic regression, decision trees, random forests, k-nearest neighbors (K-NN), and K-means clustering are essential for building predictive models. Additionally, familiarity with dimensionality reduction techniques like PCA and t-SNE aids in visualizing complex data structures effectively.

Further Exploration

Reference Description Link
Real Python A comprehensive resource offering articles and tutorials for both beginner and advanced Python concepts. πŸ”—
freeCodeCamp - Learn Python A lengthy video providing a thorough introduction to all core Python concepts. πŸ”—
Python Data Science Handbook A free digital book that is an excellent resource for learning pandas, NumPy, Matplotlib, and Seaborn. πŸ”—
freeCodeCamp - Machine Learning for Everybody A practical introduction to various machine learning algorithms for beginners. πŸ”—
Udacity - Intro to Machine Learning An introductory course on machine learning for beginners, covering fundamental algorithms. πŸ”—

3. Neural Networks

Concept Description
Fundamentals Understand the basic structure of a neural network, including layers, weights, biases, and activation functions like sigmoid, tanh, and ReLU.
Training and Optimization Learn about backpropagation and various loss functions such as Mean Squared Error (MSE) and Cross-Entropy. Become familiar with optimization algorithms like Gradient Descent, Stochastic Gradient Descent, RMSprop, and Adam.
Overfitting Grasp the concept of overfitting, where a model performs well on training data but poorly on unseen data, and explore regularization techniques like dropout, L1/L2 regularization, early stopping, and data augmentation to mitigate it.
Implement a Multilayer Perceptron (MLP) Build a Multilayer Perceptron (MLP), also known as a fully connected network, using PyTorch.

Further Exploration

Reference Description Link
3Blue1Brown - But what is a Neural Network? This video provides an intuitive explanation of neural networks and their inner workings. πŸ”—
freeCodeCamp - Deep Learning Crash Course This video efficiently introduces the most important concepts in deep learning. πŸ”—
Fast.ai - Practical Deep Learning A free course designed for those with coding experience who want to learn about deep learning. πŸ”—
Patrick Loeber - PyTorch Tutorials A series of videos for complete beginners to learn about PyTorch. πŸ”—

4. Natural Language Processing (NLP)

Concept Description
Text Preprocessing Learn various text preprocessing steps such as tokenization (splitting text into words or sentences), stemming (reducing words to their root form), lemmatization (similar to stemming but considers the context), and stop word removal.
Feature Extraction Techniques Become familiar with techniques to convert text data into a format understandable by machine learning algorithms. Key methods include Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and n-grams.
Word Embeddings Understand word embeddings, a type of word representation that allows words with similar meanings to have similar representations. Key methods include Word2Vec, GloVe, and FastText.
Recurrent Neural Networks (RNNs) Learn about RNNs, a type of neural network designed to work with sequence data, and explore LSTMs and GRUs, two RNN variants capable of learning long-term dependencies.

Further Exploration

Reference Description Link
RealPython - NLP with spaCy in Python An exhaustive guide on using the spaCy library for NLP tasks in Python. πŸ”—
Kaggle - NLP Guide A collection of notebooks and resources offering a hands-on explanation of NLP in Python. πŸ”—
Jay Alammar - The Illustrated Word2Vec A detailed reference for understanding the Word2Vec architecture. πŸ”—
Jake Tae - PyTorch RNN from Scratch A practical and straightforward implementation of RNN, LSTM, and GRU models in PyTorch. πŸ”—
colah's blog - Understanding LSTM Networks A theoretical article explaining LSTM networks. πŸ”—

Unlock the Art of LLM Science

In this segment of the curriculum, participants delve into mastering the creation of top-notch LLMs through cutting-edge methodologies.

⬇️ Ready to Embrace LLM Science? ⬇️
graph LR
    Scientist["Art of LLM Science πŸ‘©β€πŸ”¬"] --> Architecture["The LLM architecture πŸ—οΈ"]
    Scientist["Art of LLM Science πŸ‘©β€πŸ”¬"] --> Instruction["Building an instruction dataset πŸ“š"]
    Scientist["Art of LLM Science πŸ‘©β€πŸ”¬"] --> Pretraining["Pretraining models πŸ› οΈ"]
    Scientist["Art of LLM Science πŸ‘©β€πŸ”¬"] --> FineTuning["Supervised Fine-Tuning 🎯"]
    Scientist["Art of LLM Science πŸ‘©β€πŸ”¬"] --> RLHF["RLHF πŸ”"]
    Scientist["Art of LLM Science πŸ‘©β€πŸ”¬"] --> Evaluation["Evaluation πŸ“Š"]
    Scientist["Art of LLM Science πŸ‘©β€πŸ”¬"] --> Quantization["Quantization βš–οΈ"]
    Scientist["Art of LLM Science πŸ‘©β€πŸ”¬"] --> Trends["New Trends πŸ“ˆ"]
    Architecture["The LLM architecture πŸ—οΈ"] --> HLV["High Level View πŸ”"]
    Architecture["The LLM architecture πŸ—οΈ"] --> Tokenization["Tokenization πŸ” "]
    Architecture["The LLM architecture πŸ—οΈ"] --> Attention["Attention Mechanisms 🧠"]
    Architecture["The LLM architecture πŸ—οΈ"] --> Generation["Text Generation ✍️"]
    Instruction["Building an instruction dataset πŸ“š"] --> Alpaca["Alpaca-like dataset πŸ¦™"]
    Instruction["Building an instruction dataset πŸ“š"] --> Advanced["Advanced Techniques πŸ“ˆ"]
    Instruction["Building an instruction dataset πŸ“š"] --> Filtering["Filtering Data πŸ”"]
    Instruction["Building an instruction dataset πŸ“š"] --> Prompt["Prompt Templates πŸ“"]
    Pretraining["Pretraining models πŸ› οΈ"] --> Pipeline["Data Pipeline πŸš€"]
    Pretraining["Pretraining models πŸ› οΈ"] --> CLM["Casual Language Modeling πŸ“"]
    Pretraining["Pretraining models πŸ› οΈ"] --> Scaling["Scaling Laws πŸ“"]
    Pretraining["Pretraining models πŸ› οΈ"] --> HPC["High-Performance Computing πŸ’»"]
    FineTuning["Supervised Fine-Tuning 🎯"] --> Full["Full fine-tuning πŸ› οΈ"]
    FineTuning["Supervised Fine-Tuning 🎯"] --> Lora["Lora and QLoRA πŸŒ€"]
    FineTuning["Supervised Fine-Tuning 🎯"] --> Axoloti["Axoloti 🦠"]
    FineTuning["Supervised Fine-Tuning 🎯"] --> DeepSpeed["DeepSpeed ⚑"]
    RLHF["RLHF πŸ”"] --> Preference["Preference Datasets πŸ“"]
    RLHF["RLHF πŸ”"] --> Optimization["Proximal Policy Optimization 🎯"]
    RLHF["RLHF πŸ”"] --> DPO["Direct Preference Optimization πŸ“ˆ"]
    Evaluation["Evaluation πŸ“Š"] --> Traditional["Traditional Metrics πŸ“"]
    Evaluation["Evaluation πŸ“Š"] --> General["General Benchmarks πŸ“ˆ"]
    Evaluation["Evaluation πŸ“Š"] --> Task["Task-specific Benchmarks πŸ“‹"]
    Evaluation["Evaluation πŸ“Š"] --> HF["Human Evaluation πŸ‘©β€πŸ”¬"]
    Quantization["Quantization βš–οΈ"] --> Base["Base Techniques πŸ› οΈ"]
    Quantization["Quantization βš–οΈ"] --> GGUF["GGUF and llama.cpp 🐐"]
    Quantization["Quantization βš–οΈ"] --> GPTQ["GPTQ and EXL2 πŸ€–"]
    Quantization["Quantization βš–οΈ"] --> AWQ["AWQ πŸš€"]
    Trends["New Trends πŸ“ˆ"] --> Positional["Positional Embeddings 🎯"]
    Trends["New Trends πŸ“ˆ"] --> Merging["Model Merging πŸ”„"]
    Trends["New Trends πŸ“ˆ"] --> MOE["Mixture of Experts 🎭"]
    Trends["New Trends πŸ“ˆ"] --> Multimodal["Multimodal Models πŸ“·"]
Loading

1. The LLM architecture πŸ—οΈ

An overview of the Transformer architecture, with emphasis on inputs (tokens) and outputs (logits), and the importance of understanding the vanilla attention mechanism and its improved versions.

Concept Description
Transformer Architecture (High-Level) Review encoder-decoder Transformers, specifically the decoder-only GPT architecture used in modern LLMs.
Tokenization Understand how raw text is converted into tokens (words or subwords) for the model to process.
Attention Mechanisms Grasp the theory behind attention, including self-attention and scaled dot-product attention, which allows the model to focus on relevant parts of the input during output generation.
Text Generation Learn different methods the model uses to generate output sequences. Common strategies include greedy decoding, beam search, top-k sampling, and nucleus sampling.

Further Exploration

Reference Description Link
The Illustrated Transformer by Jay Alammar A visual and intuitive explanation of the Transformer model πŸ”—
The Illustrated GPT-2 by Jay Alammar Focuses on the GPT architecture, similar to Llama's. πŸ”—
Visual intro to Transformers by 3Blue1Brown Simple visual intro to Transformers πŸ”—
LLM Visualization by Brendan Bycroft 3D visualization of LLM internals πŸ”—
nanoGPT by Andrej Karpathy Reimplementation of GPT from scratch (for programmers) πŸ”—
Decoding Strategies in LLMs Provides code and visuals for decoding strategies πŸ”—

2. Building an instruction dataset πŸ“š

While it's easy to find raw data from Wikipedia and other websites, it's difficult to collect pairs of instructions and answers in the wild. Like in traditional machine learning, the quality of the dataset will directly influence the quality of the model, which is why it might be the most important component in the fine-tuning process.

Concept Description
Alpaca-like dataset This dataset generation method utilizes the OpenAI API (GPT) to synthesize data from scratch, allowing for the specification of seeds and system prompts to foster diversity within the dataset.
Advanced techniques Delve into methods for enhancing existing datasets with Evol-Instruct, and explore approaches for generating top-tier synthetic data akin to those outlined in the Orca and phi-1 research papers.
Filtering data Employ traditional techniques such as regex, near-duplicate removal, and prioritizing answers with substantial token counts to refine datasets.
Prompt templates Recognize the absence of a definitive standard for structuring instructions and responses, underscoring the importance of familiarity with various chat templates like ChatML and Alpaca.

Further Exploration

Reference Description Link
Preparing a Dataset for Instruction tuning by Thomas Capelle Explores the Alpaca and Alpaca-GPT4 datasets and discusses formatting methods. πŸ”—
Generating a Clinical Instruction Dataset by Solano Todeschini Provides a tutorial on creating a synthetic instruction dataset using GPT-4. πŸ”—
GPT 3.5 for news classification by Kshitiz Sahay Demonstrates using GPT 3.5 to create an instruction dataset for fine-tuning Llama 2 in news classification. πŸ”—
Dataset creation for fine-tuning LLM Notebook containing techniques to filter a dataset and upload the result. πŸ”—
Chat Template by Matthew Carrigan Hugging Face's page about prompt templates πŸ”—

3. Pretraining models πŸ› οΈ

Pre-training, being both lengthy and expensive, is not the primary focus of this course. While it's beneficial to grasp the fundamentals of pre-training, practical experience in this area is not mandatory.

Concept Description
Data pipeline Pre-training involves handling vast datasets, such as the 2 trillion tokens used in Llama 2, which necessitates tasks like filtering, tokenization, and vocabulary preparation.
Causal language modeling Understand the distinction between causal and masked language modeling, including insights into the corresponding loss functions. Explore efficient pre-training techniques through resources like Megatron-LM or gpt-neox.
Scaling laws Delve into the scaling laws, which elucidate the anticipated model performance based on factors like model size, dataset size, and computational resources utilized during training.
High-Performance Computing While beyond the scope of this discussion, a deeper understanding of HPC becomes essential for those considering building their own LLMs from scratch, encompassing aspects like hardware selection and distributed workload management.

Further Exploration

Reference Description Link
LLMDataHub by Junhao Zhao Offers a carefully curated collection of datasets tailored for pre-training, fine-tuning, and RLHF. πŸ”—
Training a causal language model from scratch by Hugging Face Guides users through the process of pre-training a GPT-2 model from the ground up using the transformers library. πŸ”—
TinyLlama by Zhang et al. Provides insights into the training process of a Llama model from scratch, offering a comprehensive understanding. πŸ”—
Causal language modeling by Hugging Face Explores the distinctions between causal and masked language modeling, alongside a tutorial on efficiently fine-tuning a DistilGPT-2 model. πŸ”—
Chinchilla's wild implications by nostalgebraist Delves into the scaling laws and their implications for LLMs, offering valuable insights into their broader significance. πŸ”—
BLOOM by BigScience Provides a comprehensive overview of the BLOOM model's construction, offering valuable insights into its engineering aspects and encountered challenges. πŸ”—
OPT-175 Logbook by Meta Offers research logs detailing the successes and failures encountered during the pre-training of a large language model with 175B parameters. πŸ”—
LLM 360 Presents a comprehensive framework for open-source LLMs, encompassing training and data preparation code, datasets, evaluation metrics, and models. πŸ”—

4. Supervised Fine-Tuning 🎯

Pre-trained models are trained to predict the next word, so they're not great as assistants. But with SFT, you can adjust them to follow instructions. Plus, you can fine-tune them on different data, even private stuff GPT-4 hasn't seen, and use them without needing paid APIs like OpenAI's.

Concept Description
Full fine-tuning Full fine-tuning involves training all parameters in the model, though it's not the most efficient approach, it can yield slightly improved results.
LoRA LoRA, a parameter-efficient technique (PEFT) based on low-rank adapters, focuses on training only these adapters rather than all model parameters.
QLoRA QLoRA, another PEFT stemming from LoRA, also quantizes model weights to 4 bits and introduces paged optimizers to manage memory spikes efficiently.
Axolotl Axolotl stands as a user-friendly and potent fine-tuning tool, extensively utilized in numerous state-of-the-art open-source models.
DeepSpeed DeepSpeed facilitates efficient pre-training and fine-tuning of large language models across multi-GPU and multi-node settings, often integrated within Axolotl for enhanced performance.

Further Exploration

Reference Description Link
The Novice's LLM Training Guide by Alpin Provides an overview of essential concepts and parameters for fine-tuning LLMs. πŸ”—
LoRA insights by Sebastian Raschka Offers practical insights into LoRA and guidance on selecting optimal parameters. πŸ”—
Fine-Tune Your Own Llama 2 Model Presents a hands-on tutorial on fine-tuning a Llama 2 model using Hugging Face libraries. πŸ”—
Padding Large Language Models by Benjamin Marie Outlines best practices for padding training examples in causal LLMs. πŸ”—

RLHF πŸ”

Following supervised fine-tuning, RLHF serves as a crucial step in harmonizing the LLM's responses with human expectations. This entails acquiring preferences from human or artificial feedback, thereby mitigating biases, implementing model censorship, or fostering more utilitarian behavior. RLHF is notably more intricate than SFT and is frequently regarded as discretionary.

Concept Description
Preference datasets Typically containing several answers with some form of ranking, these datasets are more challenging to produce than instruction datasets.
Proximal Policy Optimization This algorithm utilizes a reward model to predict whether a given text is highly ranked by humans. It then optimizes the SFT model using a penalty based on KL divergence.
Direct Preference Optimization DPO simplifies the process by framing it as a classification problem. It employs a reference model instead of a reward model (requiring no training) and only necessitates one hyperparameter, rendering it more stable and efficient.

Further Exploration

Reference Description Link
An Introduction to Training LLMs using RLHF by Ayush Thakur Explain why RLHF is desirable to reduce bias and increase performance in LLMs. πŸ”—
Illustration RLHF by Hugging Face Introduction to RLHF with reward model training and fine-tuning with reinforcement learning. πŸ”—
StackLLaMA by Hugging Face Tutorial to efficiently align a LLaMA model with RLHF using the transformers library πŸ”—
LLM Training RLHF and Its Alternatives by Sebastian Rashcka Overview of the RLHF process and alternatives like RLAIF. πŸ”—
Fine-tune Llama2 with DPO Tutorial to fine-tune a Llama2 model with DPO πŸ”—

6. Evaluation πŸ“Š

Assessing LLMs is an often overlooked aspect of the pipeline, characterized by its time-consuming nature and moderate reliability. Your evaluation criteria should be tailored to your downstream task, while bearing in mind Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."

Concept Description
Traditional metrics Metrics like perplexity and BLEU score, while less favored now due to their contextual limitations, remain crucial for comprehension and determining their applicable contexts.
General benchmarks The primary benchmark for general-purpose LLMs, such as ChatGPT, is the Open LLM Leaderboard, which is founded on the Language Model Evaluation Harness. Other notable benchmarks include BigBench and MT-Bench.
Task-specific benchmarks Tasks like summarization, translation, and question answering boast dedicated benchmarks, metrics, and even subdomains (e.g., medical, financial), exemplified by PubMedQA for biomedical question answering.
Human evaluation The most dependable evaluation method entails user acceptance rates or human-comparison metrics. Additionally, logging user feedback alongside chat traces, facilitated by tools like LangSmith, aids in pinpointing potential areas for enhancement.

Further Evaluation

Reference Description Link
Perplexity of fixed-length models by Hugging Face Provides an overview of perplexity along with code to implement it using the transformers library. πŸ”—
BLEU at your own risk by Rachael Tatman Offers insights into the BLEU score, highlighting its various issues through examples. πŸ”—
A Survey on Evaluation of LLMs by Chang et al. Presents a comprehensive paper covering what to evaluate, where to evaluate, and how to evaluate language models. πŸ”—
Chatbot Arena Leaderboard by lmsys Showcases an Elo rating system for general-purpose language models, based on comparisons made by humans. πŸ”—

7. Quantization βš–οΈ

Quantization involves converting the weights (and activations) of a model to lower precision. For instance, weights initially stored using 16 bits may be transformed into a 4-bit representation. This technique has gained significance in mitigating the computational and memory expenses linked with LLMs

Concept Description
Base techniques Explore various levels of precision (FP32, FP16, INT8, etc.) and learn how to conduct naΓ―ve quantization using techniques like absmax and zero-point.
GGUF and llama.cpp Originally intended for CPU execution, llama.cpp and the GGUF format have emerged as popular tools for running LLMs on consumer-grade hardware.
GPTQ and EXL2 GPTQ and its variant, the EXL2 format, offer remarkable speed but are limited to GPU execution. However, quantizing models using these formats can be time-consuming.
AWQ This newer format boasts higher accuracy compared to GPTQ, as indicated by lower perplexity, but demands significantly more VRAM and may not necessarily exhibit faster performance.

Further Exploration

Reference Description Link
Introduction to quantization Offers an overview of quantization, including absmax and zero-point quantization, and demonstrates LLM.int8() with accompanying code. πŸ”—
Quantize Llama models with llama.cpp Provides a tutorial on quantizing a Llama 2 model using llama.cpp and the GGUF format. πŸ”—
4-bit LLM Quantization with GPTQ Offers a tutorial on quantizing an LLM using the GPTQ algorithm with AutoGPTQ. πŸ”—
ExLlamaV2 Presents a guide on quantizing a Mistral model using the EXL2 format and running it with the ExLlamaV2 library, touted as the fastest library for LLMs. πŸ”—
Understanding Activation-Aware Weight Quantization by FriendliAI Provides an overview of the AWQ technique and its associated benefits. πŸ”—

8. New Trends πŸ“ˆ

Concept Description
Positional embeddings Explore how LLMs encode positions, focusing on relative positional encoding schemes like RoPE. Implement extensions to context length using techniques such as YaRN (which multiplies the attention matrix by a temperature factor) or ALiBi (applying attention penalty based on token distance).
Model merging Model merging has gained popularity as a method for creating high-performance models without additional fine-tuning. The widely-used mergekit library incorporates various merging methods including SLERP, DARE, and TIES.
Mixture of Experts The resurgence of the MoE architecture, exemplified by Mixtral, has led to the emergence of alternative approaches like frankenMoE, seen in community-developed models such as Phixtral, offering cost-effective and high-performance alternatives.
Multimodal models These models, such as CLIP, Stable Diffusion, or LLaVA, process diverse inputs (text, images, audio, etc.) within a unified embedding space, enabling versatile applications like text-to-image generation.
glaive-function-calling-v2 High-quality dataset with pairs of instructions and answers in different languages.
See Locutusque/function-calling-chatml for a variant without conversation tags.
Agent-FLAN Mix of AgentInstruct, ToolBench, and ShareGPT datasets.

Further Exploration

Reference Description Link
Extending the RoPE by EleutherAI Article summarizing various position-encoding techniques. πŸ”—
Understanding YaRN by Rajat Chawla Introduction to YaRN. πŸ”—
Merge LLMs with mergekit Tutorial on model merging using mergekit. πŸ”—
Mixture of Experts Explained by Hugging Face Comprehensive guide on MoEs and their functioning. πŸ”—
Large Multimodal Models by Chip Huyen: Overview of multimodal systems and recent developments in the field. πŸ”—

Building Production-Ready LLM Applications

Learn to create and deploy robust LLM-powered applications, focusing on model augmentation and practical deployment strategies for production environments.

⬇️ Ready to Build Production-Ready LLM Applications?⬇️
graph LR
    Scientist["Production-Ready LLM Applications πŸ‘©β€πŸ”¬"] --> Architecture["Running LLMs πŸ—οΈ"]
    Scientist --> Storage["Building a Vector Storage πŸ“¦"]
    Scientist --> Retrieval["Retrieval Augmented Generation πŸ”"]
    Scientist --> AdvancedRAG["Advanced RAG βš™οΈ"]
    Scientist --> Optimization["Inference Optimization ⚑"]
    Scientist --> Deployment["Deploying LLMs πŸš€"]
    Scientist --> Secure["Securing LLMs πŸ”’"]

    Architecture --> APIs["LLM APIs 🌐"]
    Architecture --> OpenSource["Open Source LLMs 🌍"]
    Architecture --> PromptEng["Prompt Engineering πŸ’¬"]
    Architecture --> StructOutputs["Structure Outputs πŸ—‚οΈ"]

    Storage --> Ingest["Ingesting Documents πŸ“₯"]
    Storage --> Split["Splitting Documents βœ‚οΈ"]
    Storage --> Embed["Embedding Models 🧩"]
    Storage --> VectorDB["Vector Databases πŸ“Š"]

    Retrieval --> Orchestrators["Orchestrators 🎼"]
    Retrieval --> Retrievers["Retrievers πŸ€–"]
    Retrieval --> Memory["Memory 🧠"]
    Retrieval --> Evaluation["Evaluation πŸ“ˆ"]

    AdvancedRAG --> Query["Query Construction πŸ”§"]
    AdvancedRAG --> Agents["Agents and Tools πŸ› οΈ"]
    AdvancedRAG --> PostProcess["Post Processing πŸ”„"]
    AdvancedRAG --> Program["Program LLMs πŸ’»"]

    Optimization --> FlashAttention["Flash Attention ⚑"]
    Optimization --> KeyValue["Key-value Cache πŸ”‘"]
    Optimization --> SpecDecoding["Speculative Decoding πŸš€"]

    Deployment --> LocalDeploy["Local Deployment πŸ–₯️"]
    Deployment --> DemoDeploy["Demo Deployment 🎀"]
    Deployment --> ServerDeploy["Server Deployment πŸ–§"]
    Deployment --> EdgeDeploy["Edge Deployment 🌐"]

    Secure --> PromptEngSecure["Prompt Engineering πŸ”"]
    Secure --> Backdoors["Backdoors πŸšͺ"]
    Secure --> Defensive["Defensive measures πŸ›‘οΈ"]
Loading

1. Running LLMs

Running LLMs can be demanding due to significant hardware requirements. Based on your use case, you might opt to use a model through an API (like GPT-4) or run it locally. In either scenario, employing additional prompting and guidance techniques can improve and constrain the output for your applications.

Category Details
LLM APIs APIs offer a convenient way to deploy LLMs. This space is divided between private LLMs (OpenAI, Google, Anthropic, Cohere, etc.) and open-source LLMs (OpenRouter, Hugging Face, Together AI, etc.).
Open-source LLMs The Hugging Face Hub is an excellent resource for finding LLMs. Some can be run directly in Hugging Face Spaces, or downloaded and run locally using apps like LM Studio or through the command line interface with llama.cpp or Ollama.
Prompt Engineering Techniques such as zero-shot prompting, few-shot prompting, chain of thought, and ReAct are commonly used in prompt engineering. These methods are more effective with larger models but can also be adapted for smaller ones.
Structuring Outputs Many tasks require outputs to be in a specific format, such as a strict template or JSON. Libraries like LMQL, Outlines, and Guidance can help guide the generation process to meet these structural requirements.

Further Exploration

Reference Description Link
Run an LLM locally with LM Studio by Nisha Arya A brief guide on how to use LM Studio for running a local LLM. πŸ”—
Prompt engineering guide by DAIR.AI An extensive list of prompt techniques with examples. πŸ”—
Outlines - Quickstart A quickstart guide detailing the guided generation techniques enabled by the Outlines library. πŸ”—
LMQL - Overview An introduction to the LMQL language, explaining its features and usage. πŸ”—

2. Building a Vector Storage

Creating a vector storage is the first step in building a Retrieval Augmented Generation (RAG) pipeline. This involves loading and splitting documents, and then using the relevant chunks to produce vector representations (embeddings) that are stored for future use during inference.

Category Details
Ingesting Documents Document loaders are convenient wrappers that handle various formats such as PDF, JSON, HTML, Markdown, etc. They can also retrieve data directly from some databases and APIs (e.g., GitHub, Reddit, Google Drive).
Splitting Documents Text splitters break down documents into smaller, semantically meaningful chunks. Instead of splitting text after a certain number of characters, it's often better to split by header or recursively, with some additional metadata.
Embedding Models Embedding models convert text into vector representations, providing a deeper and more nuanced understanding of language, which is essential for performing semantic search.
Vector Databases Vector databases (like Chroma, Pinecone, Milvus, FAISS, Annoy, etc.) store embedding vectors and enable efficient retrieval of data based on vector similarity.

Further Exploration

Reference Description Link
LangChain - Text splitters A list of different text splitters implemented in LangChain. πŸ”—
Sentence Transformers library A popular library for embedding models. πŸ”—
MTEB Leaderboard Leaderboard for evaluating embedding models. πŸ”—
The Top 5 Vector Databases by Moez Ali A comparison of the best and most popular vector databases. πŸ”—

3. Retrieval Augmented Generation

Using RAG, LLMs access relevant documents from a database to enhance the precision of their responses. This method is widely used to expand the model's knowledge base without the need for fine-tuning.

Category Details
Orchestrators Orchestrators (like LangChain, LlamaIndex, FastRAG, etc.) are popular frameworks to connect your LLMs with tools, databases, memories, etc. and augment their abilities.
Retrievers User instructions are not optimized for retrieval. Different techniques (e.g., multi-query retriever, HyDE, etc.) can be applied to rephrase/expand them and improve performance.
Memory To remember previous instructions and answers, LLMs and chatbots like ChatGPT add this history to their context window. This buffer can be improved with summarization (e.g., using a smaller LLM), a vector store + RAG, etc.
Evaluation We need to evaluate both the document retrieval (context precision and recall) and generation stages (faithfulness and answer relevancy). It can be simplified with tools Ragas and DeepEval.

Further Exploration

Reference Description Link
Llamaindex - High-level concepts Main concepts to know when building RAG pipelines. πŸ”—
Pinecone - Retrieval Augmentation Overview of the retrieval augmentation process. πŸ”—
LangChain - Q&A with RAG Step-by-step tutorial to build a typical RAG pipeline. πŸ”—
LangChain - Memory types List of different types of memories with relevant usage. πŸ”—
RAG pipeline - Metrics Overview of the main metrics used to evaluate RAG pipelines. πŸ”—

4. Advanced RAG

Real-world applications often demand intricate pipelines that utilize SQL or graph databases and dynamically choose the appropriate tools and APIs. These sophisticated methods can improve a basic solution and offer extra capabilities.

Category Details
Query construction Structured data stored in traditional databases requires a specific query language like SQL, Cypher, metadata, etc. We can directly translate the user instruction into a query to access the data with query construction.
Agents and tools Agents augment LLMs by automatically selecting the most relevant tools to provide an answer. These tools can be as simple as using Google or Wikipedia, or more complex like a Python interpreter or Jira.
Post-processing The final step processes the inputs that are fed to the LLM. It enhances the relevance and diversity of documents retrieved with re-ranking, RAG-fusion, and classification.
Program LLMs Frameworks like DSPy allow you to optimize prompts and weights based on automated evaluations in a programmatic way.

Further Exploration

Reference Description Link
LangChain - Query Construction Blog post about different types of query construction. πŸ”—
LangChain - SQL Tutorial on how to interact with SQL databases with LLMs, involving Text-to-SQL and an optional SQL agent. πŸ”—
Pinecone - LLM agents Introduction to agents and tools with different types. πŸ”—
LLM Powered Autonomous Agents by Lilian Weng More theoretical article about LLM agents. πŸ”—
LangChain - OpenAI's RAG Overview of the RAG strategies employed by OpenAI, including post-processing. πŸ”—
DSPy in 8 Steps General-purpose guide to DSPy introducing modules, signatures, and optimizers. πŸ”—

5. Inference Optimization

Text generation is an expensive process that requires powerful hardware. Besides quantization, various techniques have been proposed to increase throughput and lower inference costs.

Category Details
Flash Attention Optimization of the attention mechanism to transform its complexity from quadratic to linear, speeding up both training and inference.
Key-value cache Understanding the key-value cache and the improvements introduced in Multi-Query Attention (MQA) and Grouped-Query Attention (GQA).
Speculative decoding Using a small model to produce drafts that are then reviewed by a larger model to speed up text generation.

Further Exploration

Reference Description Link
GPU Inference by Hugging Face Explain how to optimize inference on GPUs. πŸ”—
LLM Inference by Databricks Best practices for how to optimize LLM inference in production. πŸ”—
Optimizing LLMs for Speed and Memory by Hugging Face Explain three main techniques to optimize speed and memory, namely quantization, Flash Attention, and architectural innovations. πŸ”—
Assisted Generation by Hugging Face HF's version of speculative decoding, it's an interesting blog post about how it works with code to implement it. πŸ”—

6. Deploying LLMs

Deploying LLMs at scale is a complex engineering task that may require multiple GPU clusters. However, demos and local applications can often be achieved with significantly less complexity.

Category Details
Local deployment Privacy is an important advantage that open-source LLMs have over private ones. Local LLM servers (LM Studio, Ollama, oobabooga, kobold.cpp, etc.) capitalize on this advantage to power local apps.
Demo deployment Frameworks like Gradio and Streamlit are helpful to prototype applications and share demos. You can also easily host them online, for example using Hugging Face Spaces.
Server deployment Deploying LLMs at scale requires cloud infrastructure (see also SkyPilot) or on-prem infrastructure and often leverages optimized text generation frameworks like TGI, vLLM, etc.
Edge deployment In constrained environments, high-performance frameworks like MLC LLM and mnn-llm can deploy LLMs in web browsers, Android, and iOS.

Further Exploration

Reference Description Link
Streamlit - Build a basic LLM app Tutorial to make a basic ChatGPT-like app using Streamlit. πŸ”—
HF LLM Inference Container Deploy LLMs on Amazon SageMaker using Hugging Face's inference container. πŸ”—
Philschmid blog by Philipp Schmid Collection of high-quality articles about LLM deployment using Amazon SageMaker. πŸ”—
Optimizing latency by Hamel Husain Comparison of TGI, vLLM, CTranslate2, and mlc in terms of throughput and latency. πŸ”—

7. Securing LLMs

Along with the usual security concerns of software, LLMs face distinct vulnerabilities arising from their training and prompting methods.

Category Details
Prompt hacking Techniques related to prompt engineering, including prompt injection (adding instructions to alter the model’s responses), data/prompt leaking (accessing original data or prompts), and jailbreaking (crafting prompts to bypass safety features).
Backdoors Attack vectors targeting the training data itself, such as poisoning the training data with false information or creating backdoors (hidden triggers to alter the model’s behavior during inference).
Defensive measures Protecting LLM applications involves testing them for vulnerabilities (e.g., using red teaming and tools like garak) and monitoring them in production (using a framework like langfuse).

Further Exploration

Reference Description Link
OWASP LLM Top 10 by HEGO Wiki List of the 10 most critical vulnerabilities found in LLM applications. πŸ”—
Prompt Injection Primer by Joseph Thacker Short guide dedicated to prompt injection techniques for engineers. πŸ”—
LLM Security by @llm_sec Extensive list of resources related to LLM security. πŸ”—
Red teaming LLMs by Microsoft Guide on how to perform red teaming assessments with LLMs. πŸ”—

In-Depth Articles

NLP

Article Resources
LLMs Overview πŸ”—
NLP Embeddings πŸ”—
Preprocessing πŸ”—
Sampling πŸ”—
Tokenization πŸ”—
Transformer πŸ”—
Interview Preparation πŸ”—

Models

Article Resources
Generative Pre-trained Transformer (GPT) πŸ”—

Training

Article Resources
Activation Function πŸ”—
Fine Tuning Models πŸ”—
Enhancing Model Compression: Inference and Training Optimization Strategies πŸ”—
Model Summary πŸ”—
Splitting Datasets πŸ”—
Train Loss > Val Loss πŸ”—
Parameter Efficient Fine-Tuning πŸ”—
Gradient Descent and Backprop πŸ”—
Overfitting And Underfitting πŸ”—
Gradient Accumulation and Checkpointing πŸ”—
Flash Attention πŸ”—

Enhancing Model Compression: Inference and Training Optimization Strategies

Article Resources
Quantization πŸ”—
Intro to Quantization πŸ”—
Knowledge Distillation πŸ”—
Pruning πŸ”—
DeepSpeed πŸ”—
Sharding πŸ”—
Mixed Precision Training πŸ”—
Inference Optimization πŸ”—

Evaluation Metrics

Article Resources
Classification πŸ”—
Regression πŸ”—
Generative Text Models πŸ”—

Open LLMs

Article Resources
Open Source LLM Space for Commercial Use πŸ”—
Open Source LLM Space for Research Use πŸ”—
LLM Training Frameworks πŸ”—
Effective Deployment Strategies for Language Models πŸ”—
Tutorials about LLM πŸ”—
Courses about LLM πŸ”—
Deployment πŸ”—

Resources for cost analysis and network visualization

Article Resources
Lambda Labs vs AWS Cost Analysis πŸ”—
Neural Network Visualization πŸ”—

Codebase Mastery: Building with Perfection

Title Repository
Instruction based data prepare using OpenAI πŸ”—
Optimal Fine-Tuning using the Trainer API: From Training to Model Inference πŸ”—
Efficient Fine-tuning and inference LLMs with PEFT and LoRA πŸ”—
Efficient Fine-tuning and inference LLMs Accelerate πŸ”—
Efficient Fine-tuning with T5 πŸ”—
Train Large Language Models with LoRA and Hugging Face πŸ”—
Fine-Tune Your Own Llama 2 Model in a Colab Notebook πŸ”—
Guanaco Chatbot Demo with LLaMA-7B Model πŸ”—
PEFT Finetune-Bloom-560m-tagger πŸ”—
Finetune_Meta_OPT-6-1b_Model_bnb_peft πŸ”—
Finetune Falcon-7b with BNB Self Supervised Training πŸ”—
FineTune LLaMa2 with QLoRa πŸ”—
Stable_Vicuna13B_8bit_in_Colab πŸ”—
GPT-Neo-X-20B-bnb2bit_training πŸ”—
MPT-Instruct-30B Model Training πŸ”—
RLHF_Training_for_CustomDataset_for_AnyModel πŸ”—
Fine_tuning_Microsoft_Phi_1_5b_on_custom_dataset(dialogstudio) πŸ”—
Finetuning OpenAI GPT3.5 Turbo πŸ”—
Finetuning Mistral-7b FineTuning Model using Autotrain-advanced πŸ”—
RAG LangChain Tutorial πŸ”—
Mistral DPO Trainer πŸ”—
LLM Sharding πŸ”—
Integrating Unstructured and Graph Knowledge with Neo4j and LangChain for Enhanced Question πŸ”—
vLLM Benchmarking πŸ”—
Milvus Vector Database πŸ”—
Decoding Strategies πŸ”—
Peft QLora SageMaker Training πŸ”—
Optimize Single Model SageMaker Endpoint πŸ”—
Multi Adapter Inference πŸ”—
Inf2 LLM SM Deployment πŸ”—
Text Chunk Visualization In Progress πŸ”—
Fine-tune Llama 3 with ORPO πŸ”—
4 bit LLM Quantization with GPTQ πŸ”—
Model Family Tree πŸ”—
Create MoEs with MergeKit πŸ”—

LLM PlayLab

LLM Projects Respository
CSVQConnect πŸ”—
AI_VIRTUAL_ASSISTANT πŸ”—
DocuBotMultiPDFConversationalAssistant πŸ”—
autogpt πŸ”—
meta_llama_2finetuned_text_generation_summarization πŸ”—
text_generation_using_Llama πŸ”—
llm_using_petals πŸ”—
llm_using_petals πŸ”—
Salesforce-xgen πŸ”—
text_summarization_using_open_llama_7b πŸ”—
Text_summarization_using_GPT-J πŸ”—
codllama πŸ”—
Image_to_text_using_LLaVA πŸ”—
Tabular_data_using_llamaindex πŸ”—
nextword_sentence_prediction πŸ”—
Text-Generation-using-DeciLM-7B-instruct πŸ”—
Gemini-blog-creation πŸ”—
Prepare_holiday_cards_with_Gemini_and_Sheets πŸ”—
Code-Generattion_using_phi2_llm πŸ”—
RAG-USING-GEMINI πŸ”—
Resturant-Recommendation-Multi-Modal-RAG-using-Gemini πŸ”—
slim-sentiment-tool πŸ”—
Synthetic-Data-Generation-Using-LLM πŸ”—
Architecture-for-building-a-Chat-Assistant πŸ”—
LLM-CHAT-ASSISTANT-WITH-DYNAMIC-CONTEXT-BASED-ON-QUERY πŸ”—
Text Classifier using LLM πŸ”—
Multiclass sentiment Analysis πŸ”—
Text-Generation-Using-GROQ πŸ”—
DataAgents πŸ”—
PandasQuery_tabular_data πŸ”—
Exploratory_Data_Analysis_using_LLM πŸ”—

LLM Datasets

Dataset # Authors Date Notes Category
Buzz 31.2M Alignment Lab AI May 2024 Huge collection of 435 datasets with data augmentation, deduplication, and other techniques. General Purpose
WebInstructSub 2.39M Yue et al. May 2024 Instructions created by retrieving document from Common Crawl, extracting QA pairs, and refining them. See the MAmmoTH2 paper (this is a subset). General Purpose
Bagel >2M? Jon Durbin Jan 2024 Collection of datasets decontaminated with cosine similarity. General Purpose
Hercules v4.5 1.72M Sebastian Gabarain Apr 2024 Large-scale general-purpose dataset with math, code, RP, etc. See v4 for the list of datasets. General Purpose
Dolphin-2.9 1.39M Cognitive Computations Apr 2023 Large-scale general-purpose dataset used by the Dolphin models. General Purpose
WildChat-1M 1.04M Zhao et al. May 2023 Real conversations between human users and GPT-3.5/4, including metadata. See the WildChat paper. General Purpose
OpenHermes-2.5 1M Teknium Nov 2023 Another large-scale dataset used by the OpenHermes models. General Purpose
SlimOrca 518k Lian et al. Sep 2023 Curated subset of OpenOrca using GPT-4-as-a-judge to remove wrong answers. General Purpose
Tulu V2 Mix 326k Ivison et al. Nov 2023 Mix of high-quality datasets. See Tulu 2 paper. General Purpose
UltraInteract SFT 289k Yuan et al. Apr 2024 Focus on math, coding, and logic tasks with step-by-step answers. See Eurus paper. General Purpose
NeurIPS-LLM-data 204k Jindal et al. Nov 2023 Winner of NeurIPS LLM Efficiency Challenge, with an interesting data preparation strategy. General Purpose
UltraChat 200k 200k Tunstall et al., Ding et al. Oct 2023 Heavily filtered version of the UItraChat dataset, consisting of 1.4M dialogues generated by ChatGPT. General Purpose
WizardLM_evol_instruct_V2 143k Xu et al. Jun 2023 Latest version of Evol-Instruct applied to Alpaca and ShareGPT data. See WizardLM paper. General Purpose
sft_datablend_v1 128k NVIDIA Jan 2024 Blend of publicly available datasets: OASST, CodeContests, FLAN, T0, Open_Platypus, and GSM8K and others (45 total). General Purpose
Synthia-v1.3 119k Migel Tissera Nov 2023 High-quality synthetic data generated using GPT-4. General Purpose
FuseChat-Mixture 95k Wan et al. Feb 2024 Selection of samples from high-quality datasets. See FuseChat paper. General Purpose
oasst1 84.4k KΓΆpf et al. Mar 2023 Human-generated assistant-style conversation corpus in 35 different languages. See OASST1 paper and oasst2. General Purpose
WizardLM_evol_instruct_70k 70k Xu et al. Apr 2023 Evol-Instruct applied to Alpaca and ShareGPT data. See WizardLM paper. General Purpose
airoboros-3.2 58.7k Jon Durbin Dec 2023 High-quality uncensored dataset. General Purpose
ShareGPT_Vicuna_unfiltered 53k anon823 1489123 Mar 2023 Filtered version of the ShareGPT dataset, consisting of real conversations between users and ChatGPT. General Purpose
lmsys-chat-1m-smortmodelsonly 45.8k Nebulous, Zheng et al. Sep 2023 Filtered version of lmsys-chat-1m with responses from GPT-4, GPT-3.5-turbo, Claude-2, Claude-1, and Claude-instant-1. General Purpose
Open-Platypus 24.9k Lee et al. Sep 2023 Collection of datasets that were deduplicated using Sentence Transformers (it contains an NC dataset). See Platypus paper. General Purpose
databricks-dolly-15k 15k Conover et al. May 2023 Generated by Databricks employees, prompt/response pairs in eight different instruction categories, including the seven outlined in the InstructGPT paper. General Purpose
OpenMathInstruct-1 5.75M Toshniwal et al. Feb 2024 Problems from GSM8K and MATH, solutions generated by Mixtral-8x7B. Math
MetaMathQA 395k Yu et al. Dec 2023 Bootstrap mathematical questions by rewriting them from multiple perspectives. See MetaMath paper. Math
MathInstruct 262k Yue et al. Sep 2023 Compiled from 13 math rationale datasets, six of which are newly curated, and focuses on chain-of-thought and program-of-thought. Math
Orca-Math 200k Mitra et al. Feb 2024 Grade school math world problems generated using GPT4-Turbo. See Orca-Math paper. Math
CodeFeedback-Filtered-Instruction 157k Zheng et al. Feb 2024 Filtered version of Magicoder-OSS-Instruct, ShareGPT (Python), Magicoder-Evol-Instruct, and Evol-Instruct-Code. Code
Tested-143k-Python-Alpaca 143k Vezora Mar 2024 Collection of generated Python code that passed automatic tests to ensure high quality. Code
glaive-code-assistant 136k Glaive.ai Sep 2023 Synthetic data of problems and solutions with ~60% Python samples. Also see the v2 version. Code
Magicoder-Evol-Instruct-110K 110k Wei et al. Nov 2023 A decontaminated version of evol-codealpaca-v1. Decontamination is done in the same way as StarCoder (bigcode decontamination process). See Magicoder paper. Code
dolphin-coder 109k Eric Hartford Nov 2023 Dataset transformed from leetcode-rosetta. Code
synthetic_tex_to_sql 100k Gretel.ai Apr 2024 Synthetic text-to-SQL samples (~23M tokens), covering diverse domains. Code
sql-create-context 78.6k b-mc2 Apr 2023 Cleansed and augmented version of the WikiSQL and Spider datasets. Code
Magicoder-OSS-Instruct-75K 75k Wei et al. Nov 2023 OSS-Instruct dataset generated by gpt-3.5-turbo-1106. See Magicoder paper. Code
Code-Feedback 66.4k Zheng et al. Feb 2024 Diverse Code Interpreter-like dataset with multi-turn dialogues and interleaved text and code responses. See OpenCodeInterpreter paper. Code
Open-Critic-GPT 55.1k Vezora Jul 2024 Use a local model to create, introduce, and identify bugs in code across multiple programming languages. Code
self-oss-instruct-sc2-exec-filter-50k 50.7k Lozhkov et al. Apr 2024 Created in three steps with seed functions from TheStack v1, self-instruction with StarCoder2, and self-validation. See the blog post. Code
Bluemoon 290k Squish42 Jun 2023 Posts from the Blue Moon roleplaying forum cleaned and scraped by a third party. Conversation & Role-Play
PIPPA 16.8k Gosling et al., kingbri Aug 2023 Deduped version of Pygmalion's PIPPA in ShareGPT format. Conversation & Role-Play
Capybara 16k LDJnr Dec 2023 Strong focus on information diversity across a wide range of domains with multi-turn conversations. Conversation & Role-Play
RPGPT_PublicDomain-alpaca 4.26k practical dreamer May 2023 Synthetic dataset of public domain character dialogue in roleplay format made with build-a-dataset. Conversation & Role-Play
Pure-Dove 3.86k LDJnr Sep 2023 Highly filtered multi-turn conversations between GPT-4 and real humans. Conversation & Role-Play
Opus Samantha 1.85k macadelicc Apr 2024 Multi-turn conversations with Claude 3 Opus. Conversation & Role-Play
LimaRP-augmented 804 lemonilia, grimulkan Jan 2024 Augmented and cleansed version of LimaRP, consisting of human roleplaying conversations. Conversation & Role-Play
glaive-function-calling-v2 113k Sahil Chaudhary Sep 2023 High-quality dataset with pairs of instructions and answers in different languages.
See Locutusque/function-calling-chatml for a variant without conversation tags.
Agent & Function calling
xlam-function-calling-60k 60k Salesforce Jun 2024 Samples created using a data generation pipeline designed to produce verifiable data for function-calling applications. Agent & Function calling
Agent-FLAN 34.4k internlm Mar 2024 Mix of AgentInstruct, ToolBench, and ShareGPT datasets. Agent & Function calling

LLM Alligmment

Alignment is an emerging field of study where you ensure that an AI system performs exactly what you want it to perform. In the context of LLMs specifically, alignment is a process that trains an LLM to ensure that the generated outputs align with human values and goals.

What are the current methods for LLM alignment?

You will find many alignment methods in research literature, we will only stick to 3 alignment methods for the sake of discussion

πŸ“Œ RLHF:

  • Step 1 & 2: Train an LLM (pre-training for the base model + supervised/instruction fine-tuning for chat model)
  • Step 3: RLHF uses an ancillary language model (it could be much smaller than the main LLM) to learn human preferences. This can be done using a preference dataset - it contains a prompt, and a response/set of responses graded by expert human labelers. This is called a β€œreward model”.
  • Step 4: Use a reinforcement learning algorithm (eg: PPO - proximal policy optimization), where the LLM is the agent, the reward model provides a positive or negative reward to the LLM based on how well it’s responses align with the β€œhuman preferred responses”. In theory, it is as simple as that. However, implementation isn’t that easy - requiring lot of human experts and compute resources. To overcome the β€œexpense” of RLHF, researchers developed DPO.
  • RLHF : RLHF: Reinforcement Learning from Human Feedback

πŸ“Œ DPO:

  • Step 1&2 remain the same
  • Step 4: DPO eliminates the need for the training of a reward model (i.e step 3). How? DPO defines an additional preference loss as a function of it’s policy and uses the language model directly as the reward model. The idea is simple, If you are already training such a powerful LLM, why not train itself to distinguish between good and bad responses, instead of using another model?
  • DPO is shown to be more computationally efficient (in case of RLHF you also need to constantly monitor the behavior of the reward model) and has better performance than RLHF in several settings.
  • Blog on DPO : Aligning LLMs with Direct Preference Optimization (DPO)β€” background, overview, intuition and paper summary

πŸ“Œ ORPO:

  • The newest method out of all 3, ORPO combines Step 2, 3 & 4 into a single step - so the dataset required for this method is a combination of a fine-tuning + preference dataset.
  • The supervised fine-tuning and alignment/preference optimization is performed in a single step. This is because the fine-tuning step, while allowing the model to specialize to tasks and domains, can also increase the probability of undesired responses from the model.
  • ORPO combines the steps using a single objective function by incorporating an odds ratio (OR) term - reward preferred responses & penalizing rejected responses.
  • Blog on ORPO : ORPO Outperforms SFT+DPO | Train Phi-2 with ORPO

Data Generation

Data Filtering

Datasets Descriptions Link
Rule-based filtering Remove samples based on a list of unwanted words, like refusals and "As an AI assistant" πŸ”—
SemHash Fuzzy deduplication based on fast embedding generation with a distilled model. πŸ”—

SFT Datasets

Datasets Descriptions Link
Distilabel General-purpose framework that can generate and augment data (SFT, DPO) with techniques like UltraFeedback and DEITA πŸ”—
Auto Data Lightweight library to automatically generate fine-tuning datasets with API models. πŸ”—
Bonito Library for generating synthetic instruction tuning datasets for your data without GPT (see also AutoBonito). πŸ”—
Augmentoolkit Framework to convert raw text into datasets using open-source and closed-source models. πŸ”—
Magpie Your efficient and high-quality synthetic data generation pipeline by prompting aligned LLMs with nothing. πŸ”—
Genstruct An instruction generation model, which is designed to generate valid instructions from raw data πŸ”—
DataDreamer A python library for prompting and synthetic data generation. πŸ”—

Pre-training datasets

Datasets Descriptions Link
llm-swarm Generate synthetic datasets for pretraining or fine-tuning using either local LLMs or Inference Endpoints on the Hugging Face Hub πŸ”—
Cosmopedia Hugging Face's code for generating the Cosmopedia dataset. πŸ”—
textbook_quality A repository for generating textbook-quality data, mimicking the approach of the Microsoft's Phi models. πŸ”—

Data exploration

Datasets Descriptions Link
sentence-transformers A python module for working with popular language embedding models. πŸ”—
Lilac Tool to curate better data for LLMs, used by NousResearch, databricks, cohere, Alignment Lab AI. It can also apply filters. πŸ”—
Nomic Atlas Interact with instructed data to find insights and store embeddings. πŸ”—
text-clustering) Easily embed, cluster and semantically label text datasets πŸ”—

Data scraping

Datasets Descriptions Link
Trafilatura Python and command-line tool to gather text and metadata on the web. Used for the creation of RefinedWeb(https://arxiv.org/abs/2306.01116). πŸ”—
marker Convert PDF to markdown + JSON quickly with high accuracy πŸ”—

Understand LLM

Resources Link
Brown, Tom B. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020). πŸ”—
Kambhampati, Subbarao, et al. "LLMs can't plan, but can help planning in LLM-modulo frameworks." arXiv preprint arXiv:2402.01817 (2024). πŸ”—

What I am learning

After immersing myself in the recent GenAI text-based language model hype for nearly a month, I have made several observations about its performance on my specific tasks.

Please note that these observations are subjective and specific to my own experiences, and your conclusions may differ.

  • We need a minimum of 7B parameter models (<7B) for optimal natural language understanding performance. Models with fewer parameters result in a significant decrease in performance. However, using models with more than 7 billion parameters requires a GPU with greater than 24GB VRAM (>24GB).
  • Benchmarks can be tricky as different LLMs perform better or worse depending on the task. It is crucial to find the model that works best for your specific use case. In my experience, MPT-7B is still the superior choice compared to Falcon-7B.
  • Prompts change with each model iteration. Therefore, multiple reworks are necessary to adapt to these changes. While there are potential solutions, their effectiveness is still being evaluated.
  • For fine-tuning, you need at least one GPU with greater than 24GB VRAM (>24GB). A GPU with 32GB or 40GB VRAM is recommended.
  • Fine-tuning only the last few layers to speed up LLM training/finetuning may not yield satisfactory results. I have tried this approach, but it didn't work well.
  • Loading 8-bit or 4-bit models can save VRAM. For a 7B model, instead of requiring 16GB, it takes approximately 10GB or less than 6GB, respectively. However, this reduction in VRAM usage comes at the cost of significantly decreased inference speed. It may also result in lower performance in text understanding tasks.
  • Those who are exploring LLM applications for their companies should be aware of licensing considerations. Training a model with another model as a reference and requiring original weights is not advisable for commercial settings.
  • There are three major types of LLMs: basic (like GPT-2/3), chat-enabled, and instruction-enabled. Most of the time, basic models are not usable as they are and require fine-tuning. Chat versions tend to be the best, but they are often not open-source.
  • Not every problem needs to be solved with LLMs. Avoid forcing a solution around LLMs. Similar to the situation with deep reinforcement learning in the past, it is important to find the most appropriate approach.
  • I have tried but didn't use langchains and vector-dbs. I never needed them. Simple Python, embeddings, and efficient dot product operations worked well for me.
  • LLMs do not need to have complete world knowledge. Humans also don't possess comprehensive knowledge but can adapt. LLMs only need to know how to utilize the available knowledge. It might be possible to create smaller models by separating the knowledge component.
  • The next wave of innovation might involve simulating "thoughts" before answering, rather than simply predicting one word after another. This approach could lead to significant advancements.
  • The overparameterization of LLMs presents a significant challenge: they tend to memorize extensive amounts of training data. This becomes particularly problematic in RAG scenarios when the context conflicts with this "implicit" knowledge. However, the situation escalates further when the context itself contains contradictory information. A recent survey paper comprehensively analyzes these "knowledge conflicts" in LLMs, categorizing them into three distinct types:
    • Context-Memory Conflicts: Arise when external context contradicts the LLM's internal knowledge.

      • Solution
        • Fine-tune on counterfactual contexts to prioritize external information.
        • Utilize specialized prompts to reinforce adherence to context
        • Apply decoding techniques to amplify context probabilities.
        • Pre-train on diverse contexts across documents.
    • Inter-Context Conflicts: Contradictions between multiple external sources.

      • Solution:
        • Employ specialized models for contradiction detection.
        • Utilize fact-checking frameworks integrated with external tools.
        • Fine-tune discriminators to identify reliable sources.
        • Aggregate high-confidence answers from augmented queries.
    • Intra-Memory Conflicts: The LLM gives inconsistent outputs for similar inputs due to conflicting internal knowledge.

      • Solution:
        • Fine-tune with consistency loss functions.
        • Implement plug-in methods, retraining on word definitions.
        • Ensemble one model's outputs with another's coherence scoring.
        • Apply contrastive decoding, focusing on truthful layers/heads.
  • The difference between PPO and DPOs: in DPO you don’t need to train a reward model anymore. Having good and bad data would be sufficient!
  • ORPO: β€œA straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. β€œ Hong, Lee, Thorne (2024)
  • KTO: β€œKTO does not need preferences -- only a binary signal of whether an output is desirable or undesirable for a given input. This makes it far easier to use in the real world, where preference data is scarce and expensive.” Ethayarajh et al (2024)

Contributing

Contributions are welcome! If you'd like to contribute to this project, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License.

About The Author

Sunil Ghimire is a NLP Engineer passionate about literature. He believes that words and data are the two most powerful tools to change the world.


Star History Chart

Created with ❀️ by Sunil Ghimire

About

LLM-PowerHouse: Unleash LLMs' potential through curated tutorials, best practices, and ready-to-use code for custom training and inferencing.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages