- Observability of llm applications: Exploration and practice from the perspective of trace[None][2024]
- Triton inference server[None][2024]
- TensorFlow: A system for Large-Scale machine learning[Proc.~USENIX OSDI][2016]
- Taming throughput-latency tradeoff in llm inference with sarathi-serve[arXiv][2024]
- Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills[arXiv][2023]
- Gptcache: An open-source semantic cache for llm applications enabling faster answers and cost savings[Proc.~the 3rd Workshop for Natural Language Processing Open Source Software][2023]
- Legion: Expressing locality and independence with logical regions[Proc. IEEE SC][2012]
- Semantic parsing on Freebase from question-answer pairs[Proc.~EMNLP][2013]
- Clipper: A Low-Latency online prediction serving system[Proc.~USENIX NSDI][2017]
- Flashattention: Fast and memory-efficient exact attention with io-awareness[Proc.~NeurIPS][2022]
- Bert: Pre-training of deep bidirectional transformers for language understanding[Proc.~NAACL][2019]
- Retrieval-augmented generation for large language models: A survey[arXiv][2023]
- Prompt cache: Modular attention reuse for low-latency inference[arXiv][2023]
- Musketeer: all for one, one for all in data processing systems[Proc.~ACM EuroSys][2015]
- Serving DNNs like clockwork: Performance predictability from the bottom up[Proc.~USENIX OSDI][2020]
- Flashdecoding++: Faster large language model inference on gpus[Proc.~Machine Learning and Systems][2023]
- Data interpreter: An llm agent for data science[arXiv][2024]
- Metagpt: Meta programming for a multi-agent collaborative framework[arXiv][2023]
- Inference without interference: Disaggregate llm inference for mixed downstream workloads[arXiv][2024]
- A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions[arXiv][2023]
- Tool calling: Enhancing medication consultation via retrieval-augmented large language models[arXiv][2024]
- Dryad: distributed data-parallel programs from sequential building blocks[Proc.~ACM EuroSys][2007]
- Query expansion by prompting large language models[arXiv][2023]
- Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity[arXiv][2024]
- Ragcache: Efficient knowledge caching for retrieval-augmented generation[arXiv][2024]
- Dspy: Compiling declarative language model calls into self-improving pipelines[arXiv][2023]
- An llm compiler for parallel function calling[arXiv][2023]
- Efficient memory management for large language model serving with pagedattention[Proc.~ACM SOSP][2023]
- Retrieval-augmented generation for knowledge-intensive nlp tasks[Proc.~NeurIPS][2020]
- AlpaServe: Statistical multiplexing with model parallelism for deep learning serving[Proc.~USENIX OSDI][2023]
- Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache[arXiv][2024]
- Parrot: Efficient serving of llm-based applications with semantic variable[Proc.~USENIX OSDI][2024]
- Truthfulqa: Measuring how models mimic human falsehoods[arXiv][2021]
- Optimizing llm queries in relational workloads[arXiv][2024]
- Online speculative decoding[arXiv][2023]
- Ra-isf: Learning to answer and understand from retrieval augmentation via iterative self-feedback[arXiv][2024]
- Self-refine: Iterative refinement with self-feedback[Proc.~NeurIPS][2024]
- Specinfer: Accelerating large language model serving with tree-based speculative inference and verification[Proc.~ACM ASPLOS][2024]
- Spotserve: Serving generative large language models on preemptible instances[Proc.~ACM ASPLOS][2024]
- Ray: A distributed framework for emerging AI applications[Proc.~USENIX OSDI][2018]
- Lossless acceleration of large language model via adaptive n-gram parallel decoding[arXiv][2024]
- Splitwise: Efficient generative llm inference using phase splitting[arXiv][2023]
- Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface[Proc.~NeurIPS][2023]
- Fairness in serving large language models[arXiv][2023]
- Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for llms[arXiv][2024]
- Gemma 2: Improving open language models at a practical size[arXiv][2024]
- Llama: Open and efficient foundation language models[arXiv][2023]
- Llama 2: Open foundation and fine-tuned chat models[arXiv][2023]
- Attention is All you Need[Proc.~NeurIPS][2017]
- Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism[arXiv][2024]
- Autogen: Enabling next-gen llm applications via multi-agent conversation framework[arXiv][2023]
- Smoothquant: Accurate and efficient post-training quantization for large language models[Proc.~ICML][2023]
- C-pack: Packaged resources to advance general chinese embedding[arXiv][2023]
- HotpotQA: A dataset for diverse, explainable multi-hop question answering[Proc.~EMNLP][2018]
- Orca: A distributed serving system for Transformer-Based generative models[Proc.~USENIX OSDI][2022]
- Apache spark: a unified engine for big data processing[Communications of the ACM][2016]
- SHEPHERD: Serving DNNs in the wild[Proc.~USENIX NSDI][2023]
- Efficiently programming large language models using sglang[arXiv][2023]
- Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving[arXiv][2024]
- On optimal caching and model multiplexing for large model inference[arXiv][2023]
Second batch:
- NVIDIA Effective Transformer[GitHub][2020]
- NVIDIA FasterTransformer[GitHub][2021]
- DeepSpeed Inference[GitHub][2022]
- NVIDIA H100 Tensor Core GPU Architecture[Webpage][2022]
- AnyScale LLMPerf leaderboard[GitHub][2023]
- AWS Inferentia[Webpage][2023]
- ChatGLM2-6B[Webpage][2023]
- CTranslate2[GitHub][2023]
- DeepSpeed-FastGen[GitHub][2023]
- DeepSpeed-Inference vs. ZeRO-Inference[GitHub][2023]
- DeepSpeed-MII[GitHub][2023]
- FlexFlow-Serve[GitHub][2023]
- FlexGen[GitHub][2023]
- ggml[GitHub][2023]
- gpt-fast[GitHub][2023]
- Graphcore[Webpage][2023]
- Graphcore PopTransformer[GitHub][2023]
- Hugging Face Text Generation Inference[GitHub][2023]
- Intel Extension for Transformers[GitHub][2023]
- InternLM LMDeploy[GitHub][2023]
- LightLLM[GitHub][2023]
- Llama-v2-7b benchmark[Webpage][2023]
- NVIDIA cuDNN MultiHeadAttn[Webpage][2023]
- NVIDIA CUTLASS[GitHub][2023]
- NVIDIA TensorRT-LLM[GitHub][2023]
- OpenLLM[GitHub][2023]
- RayLLM[GitHub][2023]
- SambaNova[Webpage][2023]
- vLLM[GitHub][2023]
- Xorbits Inference (Xinference)[GitHub][2023]
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills[arXiv][2023]
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints[arXiv][2023]
- Batch: machine learning inference serving on serverless platforms with adaptive batching[SC][2020]
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory[arXiv][2023]
- Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale[arXiv][2022]
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers[arXiv][2023]
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding[arXiv][2023]
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding[arXiv][2023]
- PipeSwitch: Fast pipelined context switching for deep learning applications[OSDI][2020]
- Exponentially Faster Language Modelling[arXiv][2023]
- Longformer: The long-document transformer[arXiv][2020]
- Demystifying parallel and distributed deep learning: An in-depth concurrency analysis[ACM Computing Surveys][2019]
- Improving language models by retrieving from trillions of tokens[ICML][2022]
- Petals: Collaborative inference and fine-tuning of large models[arXiv][2022]
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet[arXiv][2023]
- Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers[arXiv][2023]
- Speculative computation, parallelism, and functional programming[IEEE Trans. Comput.][1985]
- Medusa: Simple framework for accelerating llm generation with multiple decoding heads[GitHub][2023]
- DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization[arXiv][2023]
- Transformer Inference Arithmetic[Webpage][2022]
- Accelerating large language model decoding with speculative sampling[arXiv][2023]
- Punica: Multi-Tenant LoRA Serving[arXiv][2023]
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance[arXiv][2023]
- Evaluating large language models trained on code[arXiv][2021]
- E.T.: Re-thinking self-attention for transformer models on GPUs[HPCA][2021]
- Extending context window of large language models via positional interpolation[arXiv][2023]
- Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks[arXiv][2022]
- Adapting Language Models to Compress Contexts[arXiv][2023]
- Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality[Webpage][2023]
- Generating long sequences with sparse transformers[arXiv][2019]
- Accelerating transformer networks through recomposing softmax layers[IISWC][2022]
- Palm: Scaling language modeling with pathways[arXiv][2022]
- Adaptively Sparse Transformers[EMNLP-IJCNLP][2019]
- SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention[arXiv][2023]
- LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking[arXiv][2023]
- LLM Inference Performance Engineering: Best Practices[Webpage][2023]
- Language modeling with gated convolutional networks[ICML][2017]
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference[arXiv][2023]
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale[arXiv][2022]
- Qlora: Efficient finetuning of quantized llms[arXiv][2023]
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression[arXiv][2023]
- The case for 4-bit precision: k-bit Inference Scaling Laws[arXiv][2022]
- Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster[arXiv][2023]
- Longnet: Scaling transformers to 1,000,000,000 tokens[arXiv][2023]
- Towards Next-Generation Intelligent Assistants Leveraging LLM Techniques[KDD][2023]
- Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs[ACM Transactions on Architecture and Code Optimization][2023]
- Glam: Efficient scaling of language models with mixture-of-experts[ICML][2022]
- A Comprehensive Performance Study of Large Language Models on Novel AI Accelerators[arXiv][2023]
- LLMCarbon: Modeling the end-to-end Carbon Footprint of Large Language Models[arXiv][2023]
- Reducing Transformer Depth on Demand with Structured Dropout[ICLR][2019]
- Hierarchical Neural Story Generation[ACL][2018]
- Turbotransformers: an efficient gpu serving system for transformer models[ACM SIGPLAN PPoPP][2021]
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity[JMLR][2022]
- The CoRa tensor compiler: Compilation for ragged tensors with minimal padding[MLSys][2022]
- Extending Context Window of Large Language Models via Semantic Compression[arXiv][2023]
- Tensorir: An abstraction for automatic tensorized program optimization[ASPLOS][2023]
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot[arXiv][2023]
- Gptq: Accurate post-training quantization for generative pre-trained transformers[arXiv][2022]
- OPTQ: Accurate quantization for generative pre-trained transformers[ICLR][2022]
- Compiling machine learning programs via high-level tracing[Systems for Machine Learning][2018]
- Hungry Hungry Hippos: Towards Language Modeling with State Space Models[ICLR][2022]
- Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding[Webpage][2023]
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts[MLSys][2023]
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs[WANT@NeurIPS][2023]
- In-context autoencoder for context compression in a large language model[arXiv][2023]
- Lossless acceleration for Seq2seq generation with aggressive decoding[arXiv][2022]
- Mask-Predict: Parallel Decoding of Conditional Masked Language Models[EMNLP-IJCNLP][2019]
- Semi-autoregressive training improves mask-predict decoding[arXiv][2020]
- A survey of quantization methods for efficient neural network inference[Low-Power Computer Vision][2022]
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference[arXiv][2023]
- PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination[ICML][2020]
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces[arXiv][2023]
- Efficiently Modeling Long Sequences with Structured State Spaces[ICLR][2021]
- Non-autoregressive neural machine translation[ICLR][2018]
- Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade[ACL-IJCNLP][2021]
- Knowledge Distillation of Large Language Models[arXiv][2023]
- Cocktail: A multidimensional optimization for model serving in cloud[USENIX NSDI][2022]
- Non-autoregressive neural machine translation with enhanced decoder input[AAAI][2019]
- Turbocharge NLP Inference at the Edge via Elastic Pipelining[ASPLOS][2023]
- Star-Transformer[NAACL-HLT][2019]
- Global memory augmentation for transformers[arXiv][2020]
- Memory-efficient Transformers via Top-k Attention[Second Workshop on Simple and Efficient Natural Language Processing][2021]
- Compression of deep learning models for text: A survey[ACM TKDD][2022]
- Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences[OSDI][2022]
- Simplifying Transformer Blocks[arXiv][2023]
- FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models[ACM SIGPLAN PPoPP][2022]
- Magic pyramid: Accelerating inference with early exiting and token pruning[arXiv][2021]
- REST: Retrieval-Based Speculative Decoding[arXiv][2023]
- The curious case of neural text degeneration[arXiv][2019]
- FlashDecoding++: Faster Large Language Model Inference on GPUs[arXiv][2023]
- SPEED: Speculative Pipelined Execution for Efficient Decoding[arXiv][2023]
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference[arXiv][2023]
- Tutel: Adaptive mixture-of-experts at scale[MLSys][2023]
- Calculon: a methodology and tool for high-level co-design of systems and large language models[SC][2023]
- GPT-Zip: Deep Compression of Finetuned Large Language Models[Workshop on Efficient Systems for Foundation Models][2023]
- Compressing LLMs: The Truth is Rarely Pure and Never Simple[arXiv][2023]
- TASO: optimizing deep learning computation with automatic generation of graph substitutions[SOSP][2019]
- Beyond Data and Model Parallelism for Deep Neural Networks[MLSys][2019]
- Mistral 7B[arXiv][2023]
- Llmlingua: Compressing prompts for accelerated inference of large language models[arXiv][2023]
- LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression[arXiv][2023]
- HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment[arXiv][2023]
- TinyBERT: Distilling BERT for Natural Language Understanding[EMNLP][2020]
- S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput[arXiv][2023]
- The promise and peril of generative AI[Nature][2023]
- Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings[ISCA][2023]
- Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation[EMNLP][2020]
- MLIR-based code generation for GPU tensor cores[Compiler Construction][2022]
- Transformers are rnns: Fast autoregressive transformers with linear attention[ICML][2020]
- CTRL: A conditional transformer language model for controllable generation[arXiv][2019]
- Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context[ACL][2018]
- Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization[arXiv][2023]
- SqueezeLLM: Dense-and-Sparse Quantization[arXiv][2023]
- Full stack optimization of transformer inference: a survey[arXiv][2023]
- Big little transformer decoder[arXiv][2023]
- Reformer: The Efficient Transformer[ICLR][2019]
- Accelerating Inference for Pretrained Language Models by Unified Multi-Perspective Early Exiting[COLING][2022]
- Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference[EMNLP][2021]
- Ziplm: Hardware-aware structured pruning of language models[arXiv][2023]
- Efficient Memory Management for Large Language Model Serving with PagedAttention[SOSP][2023]
- Relax: Composable Abstractions for End-to-End Dynamic Machine Learning[arXiv][2023]
- Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement[EMNLP][2018]
- xFormers: A modular and hackable Transformer modelling library[arXiv][2022]
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding[ICLR][2020]
- Fast inference from transformers via speculative decoding[ICML][2023]
- Accelerating Distributed MoE Training and Inference with Lina[USENIX ATC][2023]
- Cascadebert: Accelerating inference of pre-trained language models via calibrated complete models cascade[arXiv][2020]
- A Speed Odyssey for Deployable Quantization of LLMs[arXiv][2023]
- Compressing Context to Enhance Inference Efficiency of Large Language Models[EMNLP][2023]
- An efficient transformer decoder with compressed sub-layers[AAAI][2021]
- LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation[arXiv][2023]
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving[OSDI][2023]
- A global past-future early exit method for accelerating inference of pre-trained language models[NAACL][2021]
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration[arXiv][2023]
- Ring Attention with Blockwise Transformers for Near-Infinite Context[arXiv][2023]
- Lost in the middle: How language models use long contexts[arXiv][2023]
- FastBERT: a Self-distilling BERT with Adaptive Inference Time[ACL][2020]
- Online Speculative Decoding[arXiv][2023]
- CacheGen: Fast Context Loading for Language Model Applications[arXiv][2023]
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time[arXiv][2023]
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models[arXiv][2023]
- Deja vu: Contextual sparsity for efficient llms at inference time[ICML][2023]
- BumbleBee: Secure Two-party Inference Framework for Large Transformers[Cryptology ePrint Archive][2023]
- LLM-Pruner: On the Structural Pruning of Large Language Models[arXiv][2023]
- Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools[ACM Computing Surveys][2020]
- Long Range Language Modeling via Gated State Spaces[ICLR][2022]
- SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification[arXiv][2023]
- SpotServe: Serving Generative Large Language Models on Preemptible Instances[ASPLOS][2024]
- Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism[VLDB][2023]
- Are sixteen heads really better than one?[NeurIPS][2019]
- Online normalizer calculation for softmax[arXiv][2018]
- Accelerating sparse deep neural networks[arXiv][2021]
- Adapler: Speeding up inference by adaptive length reduction[arXiv][2022]
- Landmark Attention: Random-Access Infinite Context Length for Transformers[arXiv][2023]
- PaSS: Parallel Speculative Sampling[arXiv][2023]
- Learning to compress prompts with gist tokens[arXiv][2023]
- Generating benchmarks for factuality evaluation of language models[arXiv][2023]
- Saturn: An Optimized Data System for Large Model Deep Learning Workloads[arXiv][2023]
- Memory-efficient pipeline-parallel dnn training[ICML][2021]
- Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models[NeurIPS][2023]
- Paella: Low-latency Model Serving with Software-defined GPU Scheduling[SOSP][2023]
- Evomoe: An evolutional mixture-of-experts training framework via dense-to-sparse gate[arXiv][2021]
- FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement[Proc. ACM on Management of Data][2023]
- The statistical recurrent unit[ICML][2017]
- GPT-4 Technical Report[arXiv][2023]
- Resurrecting recurrent neural networks for long sequences[arXiv][2023]
- MemGPT: Towards LLMs as Operating Systems[arXiv][2023]
- Faster Causal Attention Over Large Sequences Through Sparse Flash Attention[arXiv][2023]
- nuqmm: Quantized matmul for efficient inference of large-scale generative language models[arXiv][2022]
- RWKV: Reinventing RNNs for the Transformer Era[arXiv][2023]
- Instruction tuning with gpt-4[arXiv][2023]
- Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models[arXiv][2023]
- Efficiently scaling transformer inference[MLSys][2023]
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation[ICLR][2021]
- The future of AI is hybrid[Qualcomm][2023]
- Self-attention Does Not Need O(n^2) Memory[arXiv][2021]
- Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale[ICML][2022]
- Zero-shot text-to-image generation[ICML][2021]
- Mlperf inference benchmark[ISCA][2020]
- Hash layers for large sparse models[NeurIPS][2021]
- Efficient content-based sparse attention with routing transformers[TACL][2021]
- Long short-term memory recurrent neural network architectures for large scale acoustic modeling[Interspeech][2014]
- Apache TVM Unity: a vision for the ML software and hardware ecosystem[2022]
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter[arXiv][2019]
- Movement pruning: Adaptive sparsity by fine-tuning[NeurIPS][2020]
- What Matters In The Structured Pruning of Generative Language Models?[arXiv][2023]
- Accelerating Transformer Inference for Translation via Parallel Decoding[arXiv][2023]
- Memory Augmented Language Models through Mixture of Word Experts[arXiv][2023]
- Consistent Accelerated Inference via Confident Adaptive Transformers[EMNLP][2021]
- Fast transformer decoding: One write-head is all you need[arXiv][2019]
- Outrageously large neural networks: The sparsely-gated mixture-of-experts layer[arXiv][2017]
- Efficient LLM Inference on CPUs[arXiv][2023]
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters[arXiv][2023]
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU[ICML][2023]
- Speeding up neural machine translation decoding by shrinking run-time vocabulary[ACL][2017]
- Welder: Scheduling Deep Learning Memory Access via Tile-graph[USENIX OSDI][2023]
- Megatron-lm: Training multi-billion parameter language models using model parallelism[arXiv][2019]
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU[arXiv][2023]
- Accelerating llm inference with staged speculative decoding[arXiv][2023]
- Blockwise parallel decoding for deep autoregressive models[NeurIPS][2018]
- Roformer: Enhanced transformer with rotary position embedding[arXiv][2021]
- A Simple and Effective Pruning Approach for Large Language Models[arXiv][2023]
- Patient Knowledge Distillation for BERT Model Compression[EMNLP-IJCNLP][2019]
- A simple hash-based early exiting approach for language understanding and generation[arXiv][2022]
- Retentive Network: A Successor to Transformer for Large Language Models[arXiv][2023]
- Spectr: Fast speculative decoding via optimal transport[Workshop on Efficient Systems for Foundation Models @ ICML][2023]
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs[arXiv][2023]
- Stanford alpaca: An instruction-following llama model[arXiv][2023]
- Sparse sinkhorn attention[ICML][2020]
- Efficient Transformers: A Survey[ACM Computing Surveys][2023]
- DeciLM 6B[Hugging Face][2023]
- MLC-LLM[GitHub][2023]
- Branchynet: Fast inference via early exiting from deep neural networks[ICPR][2016]
- Triton: an intermediate language and compiler for tiled neural network computations[ACM SIGPLAN Workshop on Machine Learning and Programming Languages][2019]
- MLP-Mixer: An all-MLP architecture for vision[NeurIPS][2021]
- AutoML in the Age of Large Language Models: Current Challenges, Future Opportunities and Risks[arXiv][2023]
- Llama 2: Open foundation and fine-tuned chat models[arXiv][2023]
- Efficient methods for natural language processing: A survey[TACL][2023]
- Flash-Decoding for long-context inference[arXiv][2023]
- Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization[USENIX OSDI][2022]
- Mini-GPTs: Efficient Large Language Models through Contextual Pruning[arXiv][2023]
- SUMMA: Scalable universal matrix multiplication algorithm[Concurrency: Practice and Experience][1997]
- Attention is all you need[NeurIPS][2017]
- Linformer: Self-attention with linear complexity[arXiv][2020]
- Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers[NeurIPS][2020]
- LightSeq: A high performance inference library for transformers[arXiv][2020]
- Tabi: An Efficient Multi-Level Inference System for Large Language Models[EuroSys][2023]
- Chain-of-thought prompting elicits reasoning in large language models[NeurIPS][2022]
- MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters[USENIX NSDI][2022]
- Bloom: A 176b-parameter open-access multilingual language model[arXiv][2022]
- Fast Distributed Inference Serving for Large Language Models[arXiv][2023]
- TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task[WMT][2021]
- Speeding up Transformer Decoding via an Attention Refinement Network[COLING][2022]
- PyTorch 2.0: The Journey to Bringing Compiler Technologies to the Core of PyTorch (Keynote)[CGO][2023]
- Autogen: Enabling next-gen llm applications via multi-agent conversation framework[arXiv][2023]
- Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases[arXiv][2023]
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity[arXiv][2023]
- Smoothquant: Accurate and efficient post-training quantization for large language models[arXiv][2022]
- Efficient Streaming Language Models with Attention Sinks[arXiv][2023]
- Sharing Attention Weights for Fast Transformer[IJCAI][2019]
- A survey on non-autoregressive generation for neural machine translation and beyond[IEEE TPAMI][2023]
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference[ACL][2020]
- Wizardlm: Empowering large language models to follow complex instructions[arXiv][2023]
- LLMCad: Fast and Scalable On-device Large Language Model Inference[arXiv][2023]
- Retrieval meets Long Context Large Language Models[arXiv][2023]
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt[arXiv][2023]
- Baichuan 2: Open large-scale language models[arXiv][2023]
- Inference with reference: Lossless acceleration of large language models[arXiv][2023]
- Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding[arXiv][2023]
- A comprehensive study on post-training quantization for large language models[arXiv][2023]
- Zeroquant: Efficient and affordable post-training quantization for large-scale transformers[NeurIPS][2022]
- TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference[ACL][2021]
- SparseTIR: Composable abstractions for sparse compilation in deep learning[ASPLOS][2023]
- A Scalable GPT-2 Inference Hardware Architecture on FPGA[IJCNN][2023]
- Orca: A Distributed Serving System for Transformer-Based Generative Models[OSDI][2022]
- Metaformer is actually what you need for vision[CVPR][2022]
- RPTQ: Reorder-based Post-training Quantization for Large Language Models[arXiv][2023]
- Large Language Model Cascades with Mixture of Thought Representations for Cost-Efficient Reasoning[arXiv][2023]
- Big bird: Transformers for longer sequences[NeurIPS][2020]
- Glm-130b: An open bilingual pre-trained model[arXiv][2022]
- Learning to Skip for Language Modeling[arXiv][2023]
- An attention free transformer[arXiv][2021]
- Bytetransformer: A high-performance transformer boosted for variable-length inputs[IPDPS][2023]
- DePA: Improving Non-autoregressive Translation with Dependency-Aware Decoder[IWSLT][2023]
- MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving[USENIX ATC][2019]
- Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding[arXiv][2023]
- Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models[arXiv][2023]
- LatticeGen: A Cooperative Framework which Hides Generated Text in a Lattice for Privacy-Aware Generation on Cloud[arXiv][2023]
- Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer[EMNLP][2023]
- Opt: Open pre-trained transformer language models[arXiv][2022]
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models[arXiv][2023]
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving[arXiv][2023]
- Alpa: Automating inter- and intra-operator parallelism for distributed deep learning[OSDI][2022]
- EINNET: Optimizing Tensor Programs with Derivation-Based Transformations[OSDI][2023]
- PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation[SOSP][2023]
- Informer: Beyond efficient transformer for long sequence time-series forecasting[AAAI][2021]
- Transpim: A memory-based acceleration via software-hardware co-design for transformer[HPCA][2022]
- Bert loses patience: Fast and robust inference with early exit[NeurIPS][2020]
- Mixture-of-experts with expert choice routing[NeurIPS][2022]
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation[arXiv][2023]
- PetS: A Unified Framework for Parameter-Efficient Transformers Serving[USENIX ATC][2022]
- On Optimal Caching and Model Multiplexing for Large Model Inference[arXiv][2023]
- Minigpt-4: Enhancing vision-language understanding with advanced large language models[arXiv][2023]
- A survey on model compression for large language models[arXiv][2023]
- Falcon LLM: A New Frontier in Natural Language Processing[AC Investment Research Journal][2023]