Awesome-ML-System-CoDesign

LLM Inference/Serving

  • Observability of llm applications: Exploration and practice from the perspective of trace[None][2024]
  • Triton inference server[None][2024]
  • TensorFlow: A system for Large-Scale machine learning[Proc. USENIX OSDI][2016]
  • Taming throughput-latency tradeoff in llm inference with sarathi-serve[arXiv][2024]
  • Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills[arXiv][2023]
  • Gptcache: An open-source semantic cache for llm applications enabling faster answers and cost savings[Proc. 3rd Workshop for Natural Language Processing Open Source Software][2023]
  • Legion: Expressing locality and independence with logical regions[Proc. IEEE SC][2012]
  • Semantic parsing on Freebase from question-answer pairs[Proc. EMNLP][2013]
  • Clipper: A Low-Latency online prediction serving system[Proc. USENIX NSDI][2017]
  • Flashattention: Fast and memory-efficient exact attention with io-awareness[Proc. NeurIPS][2022]
  • Bert: Pre-training of deep bidirectional transformers for language understanding[Proc. NAACL][2019]
  • Retrieval-augmented generation for large language models: A survey[arXiv][2023]
  • Prompt cache: Modular attention reuse for low-latency inference[arXiv][2023]
  • Musketeer: all for one, one for all in data processing systems[Proc. ACM Eurosys][2015]
  • Serving DNNs like clockwork: Performance predictability from the bottom up[Proc. USENIX OSDI][2020]
  • Flashdecoding++: Faster large language model inference on gpus[Proc. Machine Learning and Systems][2023]
  • Data interpreter: An llm agent for data science[arXiv][2024]
  • Metagpt: Meta programming for a multi-agent collaborative framework[arXiv][2023]
  • Inference without interference: Disaggregate llm inference for mixed downstream workloads[arXiv][2024]
  • A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions[arXiv][2023]
  • Tool calling: Enhancing medication consultation via retrieval-augmented large language models[arXiv][2024]
  • Dryad: distributed data-parallel programs from sequential building blocks[Proc. ACM EuroSys][2007]
  • Query expansion by prompting large language models[arXiv][2023]
  • Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity[arXiv][2024]
  • Ragcache: Efficient knowledge caching for retrieval-augmented generation[arXiv][2024]
  • Dspy: Compiling declarative language model calls into self-improving pipelines[arXiv][2023]
  • An llm compiler for parallel function calling[arXiv][2023]
  • Efficient memory management for large language model serving with pagedattention[Proc. ACM SOSP][2023]
  • Retrieval-augmented generation for knowledge-intensive nlp tasks[Proc. NeurIPS][2020]
  • AlpaServe: Statistical multiplexing with model parallelism for deep learning serving[Proc. USENIX OSDI][2023]
  • Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache[arXiv][2024]
  • Parrot: Efficient serving of llm-based applications with semantic variable[Proc. USENIX OSDI][2024]
  • Truthfulqa: Measuring how models mimic human falsehoods[arXiv][2021]
  • Optimizing llm queries in relational workloads[arXiv][2024]
  • Online speculative decoding[arXiv][2023]
  • Ra-isf: Learning to answer and understand from retrieval augmentation via iterative self-feedback[arXiv][2024]
  • Self-refine: Iterative refinement with self-feedback[Proc. NeurIPS][2023]
  • Specinfer: Accelerating large language model serving with tree-based speculative inference and verification[Proc. ACM ASPLOS][2024]
  • Spotserve: Serving generative large language models on preemptible instances[Proc. ACM ASPLOS][2024]
  • Ray: A distributed framework for emerging AI applications[Proc. USENIX OSDI][2018]
  • Lossless acceleration of large language model via adaptive n-gram parallel decoding[arXiv][2024]
  • Splitwise: Efficient generative llm inference using phase splitting[arXiv][2023]
  • Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface[Proc. NeurIPS][2023]
  • Fairness in serving large language models[arXiv][2023]
  • Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for llms[arXiv][2024]
  • Gemma 2: Improving open language models at a practical size[arXiv][2024]
  • Llama: Open and efficient foundation language models[arXiv][2023]
  • Llama 2: Open foundation and fine-tuned chat models[arXiv][2023]
  • Attention is All you Need[Proc. NeurIPS][2017]
  • Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism[arXiv][2024]
  • Autogen: Enabling next-gen llm applications via multi-agent conversation framework[arXiv][2023]
  • Smoothquant: Accurate and efficient post-training quantization for large language models[Proc. ICML][2023]
  • C-pack: Packaged resources to advance general chinese embedding[arXiv][2023]
  • HotpotQA: A dataset for diverse, explainable multi-hop question answering[Proc. EMNLP][2018]
  • Orca: A distributed serving system for Transformer-Based generative models[Proc. USENIX OSDI][2022]
  • Apache spark: a unified engine for big data processing[Communications of the ACM][2016]
  • SHEPHERD: Serving DNNs in the wild[Proc. USENIX NSDI][2023]
  • Efficiently programming large language models using sglang[arXiv][2023]
  • Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving[arXiv][2024]
  • On optimal caching and model multiplexing for large model inference[arXiv][2023]

Second batch:

  • NVIDIA Effective Transformer[Github][2020]

  • NVIDIA FasterTransformer[Github][2021]

  • DeepSpeed Inference[Github][2022]

  • NVIDIA H100 Tensor Core GPU Architecture[Webpage][2022]

  • AnyScale LLMPerf leaderboard[Github][2023]

  • AWS Inferentia[Webpage][2023]

  • ChatGLM2-6B[Webpage][2023]

  • CTranslate2[Github][2023]

  • DeepSpeed-FastGen[Github][2023]

  • DeepSpeed-Inference vs. ZeRO-Inference[Github][2023]

  • DeepSpeed-MII[Github][2023]

  • FlexFlow-Serve[Github][2023]

  • FlexGen[Github][2023]

  • ggml[Github][2023]

  • gpt-fast[Github][2023]

  • Graphcore[Webpage][2023]

  • Graphcore PopTransformer[Github][2023]

  • Huggingface Text Generation Inference[Github][2023]

  • Intel Extension for Transformers[Github][2023]

  • InternLM LMDeploy[Github][2023]

  • LightLLM[Github][2023]

  • Llama-v2-7b benchmark[Webpage][2023]

  • NVIDIA cuDNN MultiHeadAttn[Webpage][2023]

  • NVIDIA CUTLASS[Github][2023]

  • NVIDIA TensorRT-LLM[Github][2023]

  • OpenLLM[Github][2023]

  • RayLLM[Github][2023]

  • Sambanova[Webpage][2023]

  • vLLM[Github][2023]

  • Xorbits Inference (Xinference)[Github][2023]

  • SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills[arXiv][2023]

  • GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints[arXiv][2023]

  • Batch: machine learning inference serving on serverless platforms with adaptive batching[SC20][2020]

  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory[arXiv][2023]

  • Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale[arXiv][2022]

  • Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers[arXiv][2023]

  • Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding[arXiv][2023]

  • LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding[arXiv][2023]

  • PipeSwitch: Fast pipelined context switching for deep learning applications[OSDI 20][2020]

  • Exponentially Faster Language Modelling[arXiv][2023]

  • Longformer: The long-document transformer[arXiv][2020]

  • Demystifying parallel and distributed deep learning: An in-depth concurrency analysis[ACM Computing Surveys][2019]

  • Improving language models by retrieving from trillions of tokens[ICML][2022]

  • Petals: Collaborative inference and fine-tuning of large models[arXiv][2022]

  • Distributed Inference and Fine-tuning of Large Language Models Over The Internet[arXiv][2023]

  • Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers[arXiv][2023]

  • Speculative computation, parallelism, and functional programming[IEEE Trans. Comput.][1985]

  • Medusa: Simple framework for accelerating llm generation with multiple decoding heads[Github][2023]

  • DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization[arXiv][2023]

  • Transformer Inference Arithmetic[Webpage][2022]

  • Accelerating large language model decoding with speculative sampling[arXiv][2023]

  • Punica: Multi-Tenant LoRA Serving[arXiv][2023]

  • FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance[arXiv][2023]

  • Evaluating large language models trained on code[arXiv][2021]

  • Et: re-thinking self-attention for transformer models on gpus[HPCA][2021]

  • Extending context window of large language models via positional interpolation[arXiv][2023]

  • Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks[arXiv][2022]

  • Adapting Language Models to Compress Contexts[arXiv][2023]

  • Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality[Webpage][2023]

  • Generating long sequences with sparse transformers[arXiv][2019]

  • Accelerating transformer networks through recomposing softmax layers[IISWC][2022]

  • Palm: Scaling language modeling with pathways[arXiv][2022]

  • Adaptively Sparse Transformers[EMNLP-IJCNLP][2019]

  • SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention[arXiv][2023]

  • LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking[arXiv][2023]

  • LLM Inference Performance Engineering: Best Practices[Webpage][2023]

  • Language modeling with gated convolutional networks[ICML][2017]

  • SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference[arXiv][2023]

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale[arXiv][2022]

  • Qlora: Efficient finetuning of quantized llms[arXiv][2023]

  • SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression[arXiv][2023]

  • The case for 4-bit precision: k-bit Inference Scaling Laws[arXiv][2022]

  • Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster[arXiv][2023]

  • Longnet: Scaling transformers to 1,000,000,000 tokens[arXiv][2023]

  • Towards Next-Generation Intelligent Assistants Leveraging LLM Techniques[KDD][2023]

  • Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs[ACM Transactions on Architecture and Code Optimization][2023]

  • Glam: Efficient scaling of language models with mixture-of-experts[ICML][2022]

  • A Comprehensive Performance Study of Large Language Models on Novel AI Accelerators[arXiv][2023]

  • LLMCarbon: Modeling the end-to-end Carbon Footprint of Large Language Models[arXiv][2023]

  • Reducing Transformer Depth on Demand with Structured Dropout[ICLR][2019]

  • Hierarchical Neural Story Generation[ACL][2018]

  • Turbotransformers: an efficient gpu serving system for transformer models[ACM SIGPLAN][2021]

  • Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity[JMLR][2022]

  • The CoRa tensor compiler: Compilation for ragged tensors with minimal padding[Machine Learning and Systems][2022]

  • Extending Context Window of Large Language Models via Semantic Compression[arXiv][2023]

  • Tensorir: An abstraction for automatic tensorized program optimization[ASPLOS][2023]

  • SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot[arXiv][2023]

  • Gptq: Accurate post-training quantization for generative pre-trained transformers[arXiv][2022]

  • OPTQ: Accurate quantization for generative pre-trained transformers[ICLR][2022]

  • Compiling machine learning programs via high-level tracing[Systems for Machine Learning][2018]

  • Hungry Hungry Hippos: Towards Language Modeling with State Space Models[ICLR][2022]

  • Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding[Webpage][2023]

  • MegaBlocks: Efficient Sparse Training with Mixture-of-Experts[Machine Learning and Systems][2023]

  • Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs[WANT NeurIPS 2023][2023]

  • In-context autoencoder for context compression in a large language model[arXiv][2023]

  • Lossless acceleration for Seq2seq generation with aggressive decoding[arXiv][2022]

  • Mask-Predict: Parallel Decoding of Conditional Masked Language Models[EMNLP-IJCNLP][2019]

  • Semi-autoregressive training improves mask-predict decoding[arXiv][2020]

  • A survey of quantization methods for efficient neural network inference[Low-Power Computer Vision][2022]

  • Prompt Cache: Modular Attention Reuse for Low-Latency Inference[arXiv][2023]

  • PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination[ICML][2020]

  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces[arXiv][2023]

  • Efficiently Modeling Long Sequences with Structured State Spaces[ICLR][2021]

  • Non-autoregressive neural machine translation[ICLR][2018]

  • Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade[ACL-IJCNLP 2021][2021]

  • Knowledge Distillation of Large Language Models[arXiv][2023]

  • Cocktail: A multidimensional optimization for model serving in cloud[USENIX NSDI 2022][2022]

  • Non-autoregressive neural machine translation with enhanced decoder input[AAAI][2019]

  • Turbocharge NLP Inference at the Edge via Elastic Pipelining[ASPLOS][2023]

  • Star-Transformer[NAACL-HLT][2019]

  • Global memory augmentation for transformers[arXiv][2020]

  • Memory-efficient Transformers via Top-k Attention[Second Workshop on Simple and Efficient Natural Language Processing][2021]

  • Compression of deep learning models for text: A survey[ACM TKDD][2022]

  • Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences[OSDI][2022]

  • Simplifying Transformer Blocks[arXiv][2023]

  • FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models[SIGPLAN][2022]

  • Magic pyramid: Accelerating inference with early exiting and token pruning[arXiv][2021]

  • REST: Retrieval-Based Speculative Decoding[arXiv][2023]

  • The curious case of neural text degeneration[arXiv][2019]

  • FlashDecoding++: Faster Large Language Model Inference on GPUs[arXiv][2023]

  • SPEED: Speculative Pipelined Execution for Efficient Decoding[arXiv][2023]

  • Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference[arXiv][2023]

  • Tutel: Adaptive mixture-of-experts at scale[Machine Learning and Systems][2023]

  • Calculon: a methodology and tool for high-level co-design of systems and large language models[HPC][2023]

  • GPT-Zip: Deep Compression of Finetuned Large Language Models[Workshop on Efficient Systems for Foundation Models][2023]

  • Compressing LLMs: The Truth is Rarely Pure and Never Simple[arXiv][2023]

  • TASO: optimizing deep learning computation with automatic generation of graph substitutions[SOSP][2019]

  • Beyond Data and Model Parallelism for Deep Neural Networks[Machine Learning and Systems][2019]

  • Mistral 7B[arXiv][2023]

  • Llmlingua: Compressing prompts for accelerated inference of large language models[arXiv][2023]

  • LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression[arXiv][2023]

  • HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment[arXiv][2023]

  • TinyBERT: Distilling BERT for Natural Language Understanding[EMNLP][2020]

  • S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput[arXiv][2023]

  • The promise and peril of generative AI[Nature][2023]

  • Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings[ISCA][2023]

  • Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation[EMNLP][2020]

  • MLIR-based code generation for GPU tensor cores[Compiler Construction][2022]

  • Transformers are rnns: Fast autoregressive transformers with linear attention[ICML][2020]

  • CTRL: A conditional transformer language model for controllable generation[arXiv][2019]

  • Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context[ACL][2018]

  • Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization[arXiv][2023]

  • SqueezeLLM: Dense-and-Sparse Quantization[arXiv][2023]

  • Full stack optimization of transformer inference: a survey[arXiv][2023]

  • Big little transformer decoder[arXiv][2023]

  • Reformer: The Efficient Transformer[ICLR][2019]

  • Accelerating Inference for Pretrained Language Models by Unified Multi-Perspective Early Exiting[COLING][2022]

  • Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference[EMNLP][2021]

  • Ziplm: Hardware-aware structured pruning of language models[arXiv][2023]

  • Efficient Memory Management for Large Language Model Serving with PagedAttention[SOSP][2023]

  • Relax: Composable Abstractions for End-to-End Dynamic Machine Learning[arxiv][2023]

  • Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement[EMNLP][2018]

  • xFormers: A modular and hackable Transformer modelling library[arxiv][2022]

  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding[ICLR][2020]

  • Fast inference from transformers via speculative decoding[ICML][2023]

  • Accelerating Distributed MoE Training and Inference with Lina[USENIX ATC 23][2023]

  • Cascadebert: Accelerating inference of pre-trained language models via calibrated complete models cascade[arxiv][2020]

  • A Speed Odyssey for Deployable Quantization of LLMs[arxiv][2023]

  • Compressing Context to Enhance Inference Efficiency of Large Language Models[EMNLP][2023]

  • An efficient transformer decoder with compressed sub-layers[AAAI][2021]

  • LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation[arxiv][2023]

  • AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving[OSDI 23][2023]

  • A global past-future early exit method for accelerating inference of pre-trained language models[NAACL][2021]

  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration[arxiv][2023]

  • Ring Attention with Blockwise Transformers for Near-Infinite Context[arxiv][2023]

  • Lost in the middle: How language models use long contexts[arxiv][2023]

  • FastBERT: a Self-distilling BERT with Adaptive Inference Time[ACL][2020]

  • Online Speculative Decoding[arxiv][2023]

  • CacheGen: Fast Context Loading for Language Model Applications[arxiv][2023]

  • Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time[arxiv][2023]

  • LLM-QAT: Data-Free Quantization Aware Training for Large Language Models[arxiv][2023]

  • Deja vu: Contextual sparsity for efficient llms at inference time[ICML][2023]

  • BumbleBee: Secure Two-party Inference Framework for Large Transformers[Cryptology ePrint Archive][2023]

  • LLM-Pruner: On the Structural Pruning of Large Language Models[arxiv][2023]

  • Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools[CSUR][2020]

  • Long Range Language Modeling via Gated State Spaces[ICLR][2022]

  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification[arxiv][2023]

  • SpotServe: Serving Generative Large Language Models on Preemptible Instances[ASPLOS][2024]

  • Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism[VLDB][2023]

  • Are sixteen heads really better than one?[NeurIPS][2019]

  • Online normalizer calculation for softmax[arxiv][2018]

  • Accelerating sparse deep neural networks[arxiv][2021]

  • Adapler: Speeding up inference by adaptive length reduction[arxiv][2022]

  • Landmark Attention: Random-Access Infinite Context Length for Transformers[arxiv][2023]

  • PaSS: Parallel Speculative Sampling[arxiv][2023]

  • Learning to compress prompts with gist tokens[arxiv][2023]

  • Generating benchmarks for factuality evaluation of language models[arxiv][2023]

  • Saturn: An Optimized Data System for Large Model Deep Learning Workloads[arxiv][2023]

  • Memory-efficient pipeline-parallel dnn training[ICML][2021]

  • Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models[NeurIPS][2023]

  • Paella: Low-latency Model Serving with Software-defined GPU Scheduling[SOSP][2023]

  • Evomoe: An evolutional mixture-of-experts training framework via dense-to-sparse gate[arxiv][2021]

  • FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement[ACM on Management of Data][2023]

  • The statistical recurrent unit[International Conference on Machine Learning][2017]

  • GPT-4 Technical Report[CoRR][2023]

  • Resurrecting recurrent neural networks for long sequences[arXiv][2023]

  • MemGPT: Towards LLMs as Operating Systems[arXiv][2023]

  • Faster Causal Attention Over Large Sequences Through Sparse Flash Attention[arXiv][2023]

  • nuqmm: Quantized matmul for efficient inference of large-scale generative language models[arXiv][2022]

  • RWKV: Reinventing RNNs for the Transformer Era[arXiv][2023]

  • Instruction tuning with gpt-4[arXiv][2023]

  • Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models[arXiv][2023]

  • Efficiently scaling transformer inference[Proceedings of Machine Learning and Systems][2023]

  • Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation[International Conference on Learning Representations][2021]

  • The future of AI is hybrid[Qualcomm][2023]

  • Self-attention Does Not Need O(n^2) Memory[arXiv][2021]

  • Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale[International Conference on Machine Learning][2022]

  • Zero-shot text-to-image generation[International Conference on Machine Learning][2021]

  • Mlperf inference benchmark[2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)][2020]

  • Hash layers for large sparse models[Advances in Neural Information Processing Systems][2021]

  • Efficient content-based sparse attention with routing transformers[Transactions of the Association for Computational Linguistics][2021]

  • Long short-term memory recurrent neural network architectures for large scale acoustic modeling[Interspeech][2014]

  • Apache TVM Unity: a vision for the ML software and hardware ecosystem[Webpage][2022]

  • DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter[arXiv][2019]

  • Movement pruning: Adaptive sparsity by fine-tuning[Advances in Neural Information Processing Systems][2020]

  • What Matters In The Structured Pruning of Generative Language Models?[arXiv][2023]

  • Accelerating Transformer Inference for Translation via Parallel Decoding[arXiv][2023]

  • Memory Augmented Language Models through Mixture of Word Experts[arXiv][2023]

  • Consistent Accelerated Inference via Confident Adaptive Transformers[Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing][2021]

  • Fast transformer decoding: One write-head is all you need[arXiv][2019]

  • Outrageously large neural networks: The sparsely-gated mixture-of-experts layer[arXiv][2017]

  • Efficient LLM Inference on CPUs[arXiv][2023]

  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters[arXiv][2023]

  • FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU[International Conference on Machine Learning][2023]

  • Speeding up neural machine translation decoding by shrinking run-time vocabulary[Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)][2017]

  • Welder: Scheduling Deep Learning Memory Access via Tile-graph[17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)][2023]

  • Megatron-lm: Training multi-billion parameter language models using model parallelism[arXiv][2019]

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU[arXiv][2023]

  • Accelerating llm inference with staged speculative decoding[arXiv][2023]

  • Blockwise parallel decoding for deep autoregressive models[Advances in Neural Information Processing Systems][2018]

  • Roformer: Enhanced transformer with rotary position embedding[arXiv][2021]

  • A Simple and Effective Pruning Approach for Large Language Models[arXiv][2023]

  • Patient Knowledge Distillation for BERT Model Compression[Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)][2019]

  • A simple hash-based early exiting approach for language understanding and generation[arXiv][2022]

  • Retentive Network: A Successor to Transformer for Large Language Models[arXiv][2023]

  • Spectr: Fast speculative decoding via optimal transport[Workshop on Efficient Systems for Foundation Models @ ICML 2023][2023]

  • FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs[arXiv][2023]

  • Stanford alpaca: An instruction-following llama model[arXiv][2023]

  • Sparse sinkhorn attention[International Conference on Machine Learning][2020]

  • Efficient Transformers: A Survey[ACM Comput. Surv.][2023]

  • DeciLM 6B[Hugging Face][2023]

  • MLC-LLM[GitHub][2023]

  • Branchynet: Fast inference via early exiting from deep neural networks[2016 23rd International Conference on Pattern Recognition (ICPR)][2016]

  • Triton: an intermediate language and compiler for tiled neural network computations[Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages][2019]

  • MLP-Mixer: An all-MLP architecture for vision[Advances in Neural Information Processing Systems][2021]

  • AutoML in the Age of Large Language Models: Current Challenges, Future Opportunities and Risks[arXiv][2023]

  • Llama 2: Open foundation and fine-tuned chat models[arXiv][2023]

  • Efficient methods for natural language processing: A survey[Transactions of the Association for Computational Linguistics][2023]

  • Flash-Decoding for long-context inference[arXiv][2023]

  • Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization[16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)][2022]

  • Mini-GPTs: Efficient Large Language Models through Contextual Pruning[arXiv][2023]

  • SUMMA: Scalable universal matrix multiplication algorithm[Concurrency: Practice and Experience][1997]

  • Attention is all you need[Advances in Neural Information Processing Systems][2017]

  • Linformer: Self-attention with linear complexity[arXiv][2020]

  • Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers[Advances in Neural Information Processing Systems][2020]

  • LightSeq: A high performance inference library for transformers[arXiv][2020]

  • Tabi: An Efficient Multi-Level Inference System for Large Language Models[Proceedings of the Eighteenth European Conference on Computer Systems][2023]

  • Chain-of-thought prompting elicits reasoning in large language models[Advances in Neural Information Processing Systems][2022]

  • MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters[19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)][2022]

  • Bloom: A 176b-parameter open-access multilingual language model[arXiv][2022]

  • Fast Distributed Inference Serving for Large Language Models[arXiv][2023]

  • TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task[Proceedings of the Sixth Conference on Machine Translation][2021]

  • Speeding up Transformer Decoding via an Attention Refinement Network[Proceedings of the 29th International Conference on Computational Linguistics][2022]

  • PyTorch 2.0: The Journey to Bringing Compiler Technologies to the Core of PyTorch (Keynote)[Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization][2023]

  • Autogen: Enabling next-gen llm applications via multi-agent conversation framework[arXiv][2023]

  • Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases[arXiv][2023]

  • Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity[arXiv][2023]

  • Smoothquant: Accurate and efficient post-training quantization for large language models[arxiv][2022]

  • Efficient Streaming Language Models with Attention Sinks[arxiv][2023]

  • Sharing Attention Weights for Fast Transformer[IJCAI][2019]

  • A survey on non-autoregressive generation for neural machine translation and beyond[IEEE Transactions on Pattern Analysis and Machine Intelligence][2023]

  • DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference[ACL][2020]

  • Wizardlm: Empowering large language models to follow complex instructions[arxiv][2023]

  • LLMCad: Fast and Scalable On-device Large Language Model Inference[arxiv][2023]

  • Retrieval meets Long Context Large Language Models[arxiv][2023]

  • Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt[arxiv][2023]

  • Baichuan 2: Open large-scale language models[arxiv][2023]

  • Inference with reference: Lossless acceleration of large language models[arxiv][2023]

  • Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding[arxiv][2023]

  • A comprehensive study on post-training quantization for large language models[arxiv][2023]

  • Zeroquant: Efficient and affordable post-training quantization for large-scale transformers[Advances in Neural Information Processing Systems][2022]

  • TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference[ACL][2021]

  • SparseTIR: Composable abstractions for sparse compilation in deep learning[ASPLOS][2023]

  • A Scalable GPT-2 Inference Hardware Architecture on FPGA[IJCNN][2023]

  • Orca: A Distributed Serving System for Transformer-Based Generative Models[OSDI][2022]

  • Metaformer is actually what you need for vision[CVPR][2022]

  • RPTQ: Reorder-based Post-training Quantization for Large Language Models[arxiv][2023]

  • Large Language Model Cascades with Mixture of Thought Representations for Cost-Efficient Reasoning[arxiv][2023]

  • Big bird: Transformers for longer sequences[Advances in Neural Information Processing Systems][2020]

  • Glm-130b: An open bilingual pre-trained model[arxiv][2022]

  • Learning to Skip for Language Modeling[arxiv][2023]

  • An attention free transformer[arxiv][2021]

  • Bytetransformer: A high-performance transformer boosted for variable-length inputs[IPDPS][2023]

  • DePA: Improving Non-autoregressive Translation with Dependency-Aware Decoder[IWSLT][2023]

  • MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving[USENIX ATC][2019]

  • Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding[arxiv][2023]

  • Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models[arxiv][2023]

  • LatticeGen: A Cooperative Framework which Hides Generated Text in a Lattice for Privacy-Aware Generation on Cloud[arxiv][2023]

  • Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer[EMNLP][2023]

  • Opt: Open pre-trained transformer language models[arxiv][2022]

  • H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models[arxiv][2023]

  • Atom: Low-bit Quantization for Efficient and Accurate LLM Serving[arxiv][2023]

  • Alpa: Automating inter- and intra-operator parallelism for distributed deep learning[OSDI][2022]

  • EINNET: Optimizing Tensor Programs with Derivation-Based Transformations[OSDI][2023]

  • PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation[SOSP][2023]

  • Informer: Beyond efficient transformer for long sequence time-series forecasting[AAAI][2021]

  • Transpim: A memory-based acceleration via software-hardware co-design for transformer[HPCA][2022]

  • Bert loses patience: Fast and robust inference with early exit[Advances in Neural Information Processing Systems][2020]

  • Mixture-of-experts with expert choice routing[Advances in Neural Information Processing Systems][2022]

  • DistillSpec: Improving Speculative Decoding via Knowledge Distillation[arxiv][2023]

  • PetS: A Unified Framework for Parameter-Efficient Transformers Serving[USENIX ATC][2022]

  • On Optimal Caching and Model Multiplexing for Large Model Inference[arxiv][2023]

  • Minigpt-4: Enhancing vision-language understanding with advanced large language models[arxiv][2023]

  • A survey on model compression for large language models[arxiv][2023]

  • Falcon LLM: A New Frontier in Natural Language Processing[AC Investment Research Journal][2023]

LLM-based Agent Inference/Serving

LLM Training
