- Observability of llm applications: Exploration and practice from the perspective of trace[None][2024]
- Triton inference server[None][2024]
- TensorFlow: A system for Large-Scale machine learning[Proc.~USENIX OSDI][2016]
- Taming throughput-latency tradeoff in llm inference with sarathi-serve[arXiv][2024]
- Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills[arXiv][2023]
- Gptcache: An open-source semantic cache for llm applications enabling faster answers and cost savings[Proc.~the 3rd Workshop for Natural Language Processing Open Source Software][2023]
- Legion: Expressing locality and independence with logical regions[Proc. IEEE SC][2012]
- Semantic parsing on Freebase from question-answer pairs[Proc.~EMNLP][2013]
- Clipper: A Low-Latency online prediction serving system[Proc.~USENIX NSDI][2017]
- Flashattention: Fast and memory-efficient exact attention with io-awareness[Proc.~NeurIPS][2022]
- Bert: Pre-training of deep bidirectional transformers for language understanding[Proc.~NAACL][2019]
- Retrieval-augmented generation for large language models: A survey[arXiv][2023]
- Prompt cache: Modular attention reuse for low-latency inference[arXiv][2023]
- Musketeer: all for one, one for all in data processing systems[Proc.~ACM EuroSys][2015]
- Serving DNNs like clockwork: Performance predictability from the bottom up[Proc.~USENIX OSDI][2020]
- Flashdecoding++: Faster large language model inference on gpus[Proc.~Machine Learning and Systems][2023]
- Data interpreter: An llm agent for data science[arXiv][2024]
- Metagpt: Meta programming for a multi-agent collaborative framework[arXiv][2023]
- Inference without interference: Disaggregate llm inference for mixed downstream workloads[arXiv][2024]
- A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions[arXiv][2023]
- Tool calling: Enhancing medication consultation via retrieval-augmented large language models[arXiv][2024]
- Dryad: distributed data-parallel programs from sequential building blocks[Proc.~ACM EuroSys][2007]
- Query expansion by prompting large language models[arXiv][2023]
- Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity[arXiv][2024]
- Ragcache: Efficient knowledge caching for retrieval-augmented generation[arXiv][2024]
- Dspy: Compiling declarative language model calls into self-improving pipelines[arXiv][2023]
- An llm compiler for parallel function calling[arXiv][2023]
- Efficient memory management for large language model serving with pagedattention[Proc.~ACM SOSP][2023]
- Retrieval-augmented generation for knowledge-intensive nlp tasks[Proc.~NeurIPS][2020]
- AlpaServe: Statistical multiplexing with model parallelism for deep learning serving[Proc.~USENIX OSDI][2023]
- Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache[arXiv][2024]
- Parrot: Efficient serving of llm-based applications with semantic variable[Proc.~USENIX OSDI][2024]
- Truthfulqa: Measuring how models mimic human falsehoods[arXiv][2021]
- Optimizing llm queries in relational workloads[arXiv][2024]
- Online speculative decoding[arXiv][2023]
- Ra-isf: Learning to answer and understand from retrieval augmentation via iterative self-feedback[arXiv][2024]
- Self-refine: Iterative refinement with self-feedback[Proc.~NeurIPS][2024]
- Specinfer: Accelerating large language model serving with tree-based speculative inference and verification[Proc.~ACM ASPLOS][2024]
- Spotserve: Serving generative large language models on preemptible instances[Proc.~ACM ASPLOS][2024]
- Ray: A distributed framework for emerging AI applications[Proc.~USENIX OSDI][2018]
- Lossless acceleration of large language model via adaptive n-gram parallel decoding[arXiv][2024]
- Splitwise: Efficient generative llm inference using phase splitting[arXiv][2023]
- Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface[Proc.~NeurIPS][2023]
- Fairness in serving large language models[arXiv][2023]
- Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for llms[arXiv][2024]
- Gemma 2: Improving open language models at a practical size[arXiv][2024]
- Llama: Open and efficient foundation language models[arXiv][2023]
- Llama 2: Open foundation and fine-tuned chat models[arXiv][2023]
- Attention is All you Need[Proc.~NeurIPS][2017]
- Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism[arXiv][2024]
- Autogen: Enabling next-gen llm applications via multi-agent conversation framework[arXiv][2023]
- Smoothquant: Accurate and efficient post-training quantization for large language models[Proc.~ICML][2023]
- C-pack: Packaged resources to advance general chinese embedding[arXiv][2023]
- HotpotQA: A dataset for diverse, explainable multi-hop question answering[Proc.~EMNLP][2018]
- Orca: A distributed serving system for Transformer-Based generative models[Proc.~USENIX OSDI][2022]
- Apache spark: a unified engine for big data processing[Communications of the ACM][2016]
- SHEPHERD: Serving DNNs in the wild[Proc.~USENIX NSDI][2023]
- Efficiently programming large language models using sglang[arXiv][2023]
- Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving[arXiv][2024]
- On optimal caching and model multiplexing for large model inference[arXiv][2023]
Second batch:
- NVIDIA Effective Transformer[GitHub][2020]
- NVIDIA FasterTransformer[GitHub][2021]
- DeepSpeed Inference[GitHub][2022]
- NVIDIA H100 Tensor Core GPU Architecture[Webpage][2022]
- AnyScale LLMPerf leaderboard[GitHub][2023]
- AWS Inferentia[Webpage][2023]
- ChatGLM2-6B[Webpage][2023]
- CTranslate2[GitHub][2023]
- DeepSpeed-FastGen[GitHub][2023]
- DeepSpeed-Inference vs. ZeRO-Inference[GitHub][2023]
- DeepSpeed-MII[GitHub][2023]
- FlexFlow-Serve[GitHub][2023]
- FlexGen[GitHub][2023]
- ggml[GitHub][2023]
- gpt-fast[GitHub][2023]
- Graphcore[Webpage][2023]
- Graphcore PopTransformer[GitHub][2023]
- Hugging Face Text Generation Inference[GitHub][2023]
- Intel Extension for Transformers[GitHub][2023]
- InternLM LMDeploy[GitHub][2023]
- LightLLM[GitHub][2023]
- Llama-v2-7b benchmark[Webpage][2023]
- NVIDIA cuDNN MultiHeadAttn[Webpage][2023]
- NVIDIA CUTLASS[GitHub][2023]
- NVIDIA TensorRT-LLM[GitHub][2023]
- OpenLLM[GitHub][2023]
- RayLLM[GitHub][2023]
- SambaNova[Webpage][2023]
- vLLM[GitHub][2023]
- Xorbits Inference (Xinference)[GitHub][2023]
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills[arXiv][2023]
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints[arXiv][2023]
- Batch: machine learning inference serving on serverless platforms with adaptive batching[SC][2020]
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory[arXiv][2023]
- Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale[arXiv][2022]
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers[arXiv][2023]
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding[arXiv][2023]
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding[arXiv][2023]
- PipeSwitch: Fast pipelined context switching for deep learning applications[OSDI][2020]
- Exponentially Faster Language Modelling[arXiv][2023]
- Longformer: The long-document transformer[arXiv][2020]
- Demystifying parallel and distributed deep learning: An in-depth concurrency analysis[ACM Computing Surveys][2019]
- Improving language models by retrieving from trillions of tokens[ICML][2022]
- Petals: Collaborative inference and fine-tuning of large models[arXiv][2022]
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet[arXiv][2023]
- Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers[arXiv][2023]
- Speculative computation, parallelism, and functional programming[IEEE Trans. Comput.][1985]
- Medusa: Simple framework for accelerating llm generation with multiple decoding heads[GitHub][2023]
- DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization[arXiv][2023]
- Transformer Inference Arithmetic[Webpage][2022]
- Accelerating large language model decoding with speculative sampling[arXiv][2023]
- Punica: Multi-Tenant LoRA Serving[arXiv][2023]
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance[arXiv][2023]
- Evaluating large language models trained on code[arXiv][2021]
- E.T.: Re-thinking self-attention for transformer models on GPUs[HPCA][2021]
- Extending context window of large language models via positional interpolation[arXiv][2023]
- Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks[arXiv][2022]
- Adapting Language Models to Compress Contexts[arXiv][2023]
- Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality[Webpage][2023]
- Generating long sequences with sparse transformers[arXiv][2019]
- Accelerating transformer networks through recomposing softmax layers[IISWC][2022]
- Palm: Scaling language modeling with pathways[arXiv][2022]
- Adaptively Sparse Transformers[EMNLP-IJCNLP][2019]
- SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention[arXiv][2023]
- LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking[arXiv][2023]
- LLM Inference Performance Engineering: Best Practices[Webpage][2023]
- Language modeling with gated convolutional networks[ICML][2017]
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference[arXiv][2023]
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale[arXiv][2022]
- Qlora: Efficient finetuning of quantized llms[arXiv][2023]
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression[arXiv][2023]
- The case for 4-bit precision: k-bit Inference Scaling Laws[arXiv][2022]
- Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster[arXiv][2023]
- Longnet: Scaling transformers to 1,000,000,000 tokens[arXiv][2023]
- Towards Next-Generation Intelligent Assistants Leveraging LLM Techniques[KDD][2023]
- Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs[ACM Transactions on Architecture and Code Optimization][2023]
- Glam: Efficient scaling of language models with mixture-of-experts[ICML][2022]
- A Comprehensive Performance Study of Large Language Models on Novel AI Accelerators[arXiv][2023]
- LLMCarbon: Modeling the end-to-end Carbon Footprint of Large Language Models[arXiv][2023]
- Reducing Transformer Depth on Demand with Structured Dropout[ICLR][2019]
- Hierarchical Neural Story Generation[ACL][2018]
- Turbotransformers: an efficient gpu serving system for transformer models[ACM SIGPLAN PPoPP][2021]
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity[JMLR][2022]
- The CoRa tensor compiler: Compilation for ragged tensors with minimal padding[MLSys][2022]
- Extending Context Window of Large Language Models via Semantic Compression[arXiv][2023]
- Tensorir: An abstraction for automatic tensorized program optimization[ASPLOS][2023]
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot[arXiv][2023]
- Gptq: Accurate post-training quantization for generative pre-trained transformers[arXiv][2022]
- OPTQ: Accurate quantization for generative pre-trained transformers[ICLR][2022]
- Compiling machine learning programs via high-level tracing[Systems for Machine Learning][2018]
- Hungry Hungry Hippos: Towards Language Modeling with State Space Models[ICLR][2022]
- Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding[Webpage][2023]
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts[MLSys][2023]
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs[WANT@NeurIPS][2023]
- In-context autoencoder for context compression in a large language model[arXiv][2023]
- Lossless acceleration for Seq2seq generation with aggressive decoding[arXiv][2022]
- Mask-Predict: Parallel Decoding of Conditional Masked Language Models[EMNLP-IJCNLP][2019]
- Semi-autoregressive training improves mask-predict decoding[arXiv][2020]
- A survey of quantization methods for efficient neural network inference[Low-Power Computer Vision][2022]
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference[arXiv][2023]
- PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination[ICML][2020]
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces[arXiv][2023]
- Efficiently Modeling Long Sequences with Structured State Spaces[ICLR][2021]
- Non-autoregressive neural machine translation[ICLR][2018]
- Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade[ACL-IJCNLP][2021]
- Knowledge Distillation of Large Language Models[arXiv][2023]
- Cocktail: A multidimensional optimization for model serving in cloud[USENIX NSDI][2022]
- Non-autoregressive neural machine translation with enhanced decoder input[AAAI][2019]
- Turbocharge NLP Inference at the Edge via Elastic Pipelining[ASPLOS][2023]
- Star-Transformer[NAACL-HLT][2019]
- Global memory augmentation for transformers[arXiv][2020]
- Memory-efficient Transformers via Top-k Attention[Second Workshop on Simple and Efficient Natural Language Processing][2021]
- Compression of deep learning models for text: A survey[ACM TKDD][2022]
- Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences[OSDI][2022]
- Simplifying Transformer Blocks[arXiv][2023]
- FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models[ACM SIGPLAN PPoPP][2022]
- Magic pyramid: Accelerating inference with early exiting and token pruning[arXiv][2021]
- REST: Retrieval-Based Speculative Decoding[arXiv][2023]
- The curious case of neural text degeneration[arXiv][2019]
- FlashDecoding++: Faster Large Language Model Inference on GPUs[arXiv][2023]
- SPEED: Speculative Pipelined Execution for Efficient Decoding[arXiv][2023]
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference[arXiv][2023]
- Tutel: Adaptive mixture-of-experts at scale[MLSys][2023]
- Calculon: a methodology and tool for high-level co-design of systems and large language models[SC][2023]
- GPT-Zip: Deep Compression of Finetuned Large Language Models[Workshop on Efficient Systems for Foundation Models][2023]
- Compressing LLMs: The Truth is Rarely Pure and Never Simple[arXiv][2023]
- TASO: optimizing deep learning computation with automatic generation of graph substitutions[SOSP][2019]
- Beyond Data and Model Parallelism for Deep Neural Networks[MLSys][2019]
- Mistral 7B[arXiv][2023]
- Llmlingua: Compressing prompts for accelerated inference of large language models[arXiv][2023]
- LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression[arXiv][2023]
- HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment[arXiv][2023]
- TinyBERT: Distilling BERT for Natural Language Understanding[EMNLP][2020]
- S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput[arXiv][2023]
- The promise and peril of generative AI[Nature][2023]
- Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings[ISCA][2023]
- Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation[EMNLP][2020]
- MLIR-based code generation for GPU tensor cores[Compiler Construction][2022]
- Transformers are rnns: Fast autoregressive transformers with linear attention[ICML][2020]
- CTRL: A conditional transformer language model for controllable generation[arXiv][2019]
- Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context[ACL][2018]
- Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization[arXiv][2023]
- SqueezeLLM: Dense-and-Sparse Quantization[arXiv][2023]
- Full stack optimization of transformer inference: a survey[arXiv][2023]
- Big little transformer decoder[arXiv][2023]
- Reformer: The Efficient Transformer[ICLR][2019]
- Accelerating Inference for Pretrained Language Models by Unified Multi-Perspective Early Exiting[COLING][2022]
- Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference[EMNLP][2021]
- Ziplm: Hardware-aware structured pruning of language models[arXiv][2023]
- Efficient Memory Management for Large Language Model Serving with PagedAttention[SOSP][2023]
- Relax: Composable Abstractions for End-to-End Dynamic Machine Learning[arXiv][2023]
- Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement[EMNLP][2018]
- xFormers: A modular and hackable Transformer modelling library[arXiv][2022]
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding[ICLR][2020]
- Fast inference from transformers via speculative decoding[ICML][2023]
- Accelerating Distributed MoE Training and Inference with Lina[USENIX ATC][2023]
- Cascadebert: Accelerating inference of pre-trained language models via calibrated complete models cascade[arXiv][2020]
- A Speed Odyssey for Deployable Quantization of LLMs[arXiv][2023]
- Compressing Context to Enhance Inference Efficiency of Large Language Models[EMNLP][2023]
- An efficient transformer decoder with compressed sub-layers[AAAI][2021]
- LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation[arXiv][2023]
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving[OSDI][2023]
- A global past-future early exit method for accelerating inference of pre-trained language models[NAACL][2021]
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration[arXiv][2023]
- Ring Attention with Blockwise Transformers for Near-Infinite Context[arXiv][2023]
- Lost in the middle: How language models use long contexts[arXiv][2023]
- FastBERT: a Self-distilling BERT with Adaptive Inference Time[ACL][2020]
- Online Speculative Decoding[arXiv][2023]
- CacheGen: Fast Context Loading for Language Model Applications[arXiv][2023]
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time[arXiv][2023]
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models[arXiv][2023]
- Deja vu: Contextual sparsity for efficient llms at inference time[ICML][2023]
- BumbleBee: Secure Two-party Inference Framework for Large Transformers[Cryptology ePrint Archive][2023]
- LLM-Pruner: On the Structural Pruning of Large Language Models[arXiv][2023]
- Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools[ACM Computing Surveys][2020]
- Long Range Language Modeling via Gated State Spaces[ICLR][2022]
- SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification[arXiv][2023]
- SpotServe: Serving Generative Large Language Models on Preemptible Instances[ASPLOS][2024]
- Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism[VLDB][2023]
- Are sixteen heads really better than one?[NeurIPS][2019]
- Online normalizer calculation for softmax[arXiv][2018]
- Accelerating sparse deep neural networks[arXiv][2021]
- Adapler: Speeding up inference by adaptive length reduction[arXiv][2022]
- Landmark Attention: Random-Access Infinite Context Length for Transformers[arXiv][2023]
- PaSS: Parallel Speculative Sampling[arXiv][2023]
- Learning to compress prompts with gist tokens[arXiv][2023]
- Generating benchmarks for factuality evaluation of language models[arXiv][2023]
- Saturn: An Optimized Data System for Large Model Deep Learning Workloads[arXiv][2023]
- Memory-efficient pipeline-parallel dnn training[ICML][2021]
- Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models[NeurIPS][2023]
- Paella: Low-latency Model Serving with Software-defined GPU Scheduling[SOSP][2023]
- Evomoe: An evolutional mixture-of-experts training framework via dense-to-sparse gate[arXiv][2021]
- FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement[Proc. ACM on Management of Data][2023]
- The statistical recurrent unit[ICML][2017]
- GPT-4 Technical Report[arXiv][2023]
- Resurrecting recurrent neural networks for long sequences[arXiv][2023]
- MemGPT: Towards LLMs as Operating Systems[arXiv][2023]
- Faster Causal Attention Over Large Sequences Through Sparse Flash Attention[arXiv][2023]
- nuqmm: Quantized matmul for efficient inference of large-scale generative language models[arXiv][2022]
- RWKV: Reinventing RNNs for the Transformer Era[arXiv][2023]
- Instruction tuning with gpt-4[arXiv][2023]
- Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models[arXiv][2023]
- Efficiently scaling transformer inference[MLSys][2023]
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation[ICLR][2021]
- The future of AI is hybrid[Qualcomm][2023]
- Self-attention Does Not Need O(n^2) Memory[arXiv][2021]
- Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale[ICML][2022]
- Zero-shot text-to-image generation[ICML][2021]
- Mlperf inference benchmark[ISCA][2020]
- Hash layers for large sparse models[NeurIPS][2021]
- Efficient content-based sparse attention with routing transformers[TACL][2021]
- Long short-term memory recurrent neural network architectures for large scale acoustic modeling[Interspeech][2014]
- Apache TVM Unity: a vision for the ML software and hardware ecosystem[2022]
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter[arXiv][2019]
- Movement pruning: Adaptive sparsity by fine-tuning[NeurIPS][2020]
- What Matters In The Structured Pruning of Generative Language Models?[arXiv][2023]
- Accelerating Transformer Inference for Translation via Parallel Decoding[arXiv][2023]
- Memory Augmented Language Models through Mixture of Word Experts[arXiv][2023]
- Consistent Accelerated Inference via Confident Adaptive Transformers[EMNLP][2021]
- Fast transformer decoding: One write-head is all you need[arXiv][2019]
- Outrageously large neural networks: The sparsely-gated mixture-of-experts layer[arXiv][2017]
- Efficient LLM Inference on CPUs[arXiv][2023]
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters[arXiv][2023]
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU[ICML][2023]
- Speeding up neural machine translation decoding by shrinking run-time vocabulary[ACL][2017]
- Welder: Scheduling Deep Learning Memory Access via Tile-graph[USENIX OSDI][2023]
- Megatron-lm: Training multi-billion parameter language models using model parallelism[arXiv][2019]
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU[arXiv][2023]
- Accelerating llm inference with staged speculative decoding[arXiv][2023]
- Blockwise parallel decoding for deep autoregressive models[NeurIPS][2018]
- Roformer: Enhanced transformer with rotary position embedding[arXiv][2021]
- A Simple and Effective Pruning Approach for Large Language Models[arXiv][2023]
- Patient Knowledge Distillation for BERT Model Compression[EMNLP-IJCNLP][2019]
- A simple hash-based early exiting approach for language understanding and generation[arXiv][2022]
- Retentive Network: A Successor to Transformer for Large Language Models[arXiv][2023]
- Spectr: Fast speculative decoding via optimal transport[Workshop on Efficient Systems for Foundation Models @ ICML][2023]
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs[arXiv][2023]
- Stanford alpaca: An instruction-following llama model[arXiv][2023]
- Sparse sinkhorn attention[ICML][2020]
- Efficient Transformers: A Survey[ACM Computing Surveys][2023]
- DeciLM 6B[Hugging Face][2023]
- MLC-LLM[GitHub][2023]
- Branchynet: Fast inference via early exiting from deep neural networks[ICPR][2016]
- Triton: an intermediate language and compiler for tiled neural network computations[ACM SIGPLAN Workshop on Machine Learning and Programming Languages][2019]
- MLP-Mixer: An all-MLP architecture for vision[NeurIPS][2021]
- AutoML in the Age of Large Language Models: Current Challenges, Future Opportunities and Risks[arXiv][2023]
- Llama 2: Open foundation and fine-tuned chat models[arXiv][2023]
- Efficient methods for natural language processing: A survey[TACL][2023]
- Flash-Decoding for long-context inference[arXiv][2023]
- Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization[USENIX OSDI][2022]
- Mini-GPTs: Efficient Large Language Models through Contextual Pruning[arXiv][2023]
- SUMMA: Scalable universal matrix multiplication algorithm[Concurrency: Practice and Experience][1997]
- Attention is all you need[NeurIPS][2017]
- Linformer: Self-attention with linear complexity[arXiv][2020]
- Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers[NeurIPS][2020]
- LightSeq: A high performance inference library for transformers[arXiv][2020]
- Tabi: An Efficient Multi-Level Inference System for Large Language Models[EuroSys][2023]
- Chain-of-thought prompting elicits reasoning in large language models[NeurIPS][2022]
- MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters[USENIX NSDI][2022]
- Bloom: A 176b-parameter open-access multilingual language model[arXiv][2022]
- Fast Distributed Inference Serving for Large Language Models[arXiv][2023]
- TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task[WMT][2021]
- Speeding up Transformer Decoding via an Attention Refinement Network[COLING][2022]
- PyTorch 2.0: The Journey to Bringing Compiler Technologies to the Core of PyTorch (Keynote)[CGO][2023]
- Autogen: Enabling next-gen llm applications via multi-agent conversation framework[arXiv][2023]
- Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases[arXiv][2023]
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity[arXiv][2023]
- Smoothquant: Accurate and efficient post-training quantization for large language models[arXiv][2022]
- Efficient Streaming Language Models with Attention Sinks[arXiv][2023]
- Sharing Attention Weights for Fast Transformer[IJCAI][2019]
- A survey on non-autoregressive generation for neural machine translation and beyond[IEEE TPAMI][2023]
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference[ACL][2020]
- Wizardlm: Empowering large language models to follow complex instructions[arXiv][2023]
- LLMCad: Fast and Scalable On-device Large Language Model Inference[arXiv][2023]
- Retrieval meets Long Context Large Language Models[arXiv][2023]
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt[arXiv][2023]
- Baichuan 2: Open large-scale language models[arXiv][2023]
- Inference with reference: Lossless acceleration of large language models[arXiv][2023]
- Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding[arXiv][2023]
- A comprehensive study on post-training quantization for large language models[arXiv][2023]
- Zeroquant: Efficient and affordable post-training quantization for large-scale transformers[NeurIPS][2022]
- TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference[ACL][2021]
- SparseTIR: Composable abstractions for sparse compilation in deep learning[ASPLOS][2023]
- A Scalable GPT-2 Inference Hardware Architecture on FPGA[IJCNN][2023]
- Orca: A Distributed Serving System for Transformer-Based Generative Models[OSDI][2022]
- Metaformer is actually what you need for vision[CVPR][2022]
- RPTQ: Reorder-based Post-training Quantization for Large Language Models[arXiv][2023]
- Large Language Model Cascades with Mixture of Thought Representations for Cost-Efficient Reasoning[arXiv][2023]
- Big bird: Transformers for longer sequences[NeurIPS][2020]
- Glm-130b: An open bilingual pre-trained model[arXiv][2022]
- Learning to Skip for Language Modeling[arXiv][2023]
- An attention free transformer[arXiv][2021]
- Bytetransformer: A high-performance transformer boosted for variable-length inputs[IPDPS][2023]
- DePA: Improving Non-autoregressive Translation with Dependency-Aware Decoder[IWSLT][2023]
- MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving[USENIX ATC][2019]
- Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding[arXiv][2023]
- Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models[arXiv][2023]
- LatticeGen: A Cooperative Framework which Hides Generated Text in a Lattice for Privacy-Aware Generation on Cloud[arXiv][2023]
- Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer[EMNLP][2023]
- Opt: Open pre-trained transformer language models[arXiv][2022]
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models[arXiv][2023]
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving[arXiv][2023]
- Alpa: Automating inter- and intra-operator parallelism for distributed deep learning[OSDI][2022]
- EINNET: Optimizing Tensor Programs with Derivation-Based Transformations[OSDI][2023]
- PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation[SOSP][2023]
- Informer: Beyond efficient transformer for long sequence time-series forecasting[AAAI][2021]
- Transpim: A memory-based acceleration via software-hardware co-design for transformer[HPCA][2022]
- Bert loses patience: Fast and robust inference with early exit[NeurIPS][2020]
- Mixture-of-experts with expert choice routing[NeurIPS][2022]
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation[arXiv][2023]
- PetS: A Unified Framework for Parameter-Efficient Transformers Serving[USENIX ATC][2022]
- On Optimal Caching and Model Multiplexing for Large Model Inference[arXiv][2023]
- Minigpt-4: Enhancing vision-language understanding with advanced large language models[arXiv][2023]
- A survey on model compression for large language models[arXiv][2023]
- Falcon LLM: A New Frontier in Natural Language Processing[AC Investment Research Journal][2023]