What I'm Reading This Week (2025/02.23-03.01)

Published on March 01, 2025 • By vski5 • 2-minute read

Trends

Kids, your emperor is manipulating coin prices again.

Trump tweeted a claim that he will establish a crypto reserve.

What I Am Reading

Haven’t read any books this week

0. When you feel a lot of pressure, just take a look at Zelensky

A national leader was humiliated live at the White House, though the humiliation was self-inflicted
Vance did what a childless cat lady would do and went looking for emotional validation, blaming Zelensky for not thanking America
In a rare consensus, China, the US, and Russia all support a ceasefire in Ukraine
It remains uncertain whether Ukraine will be carved up and what Europe will do next

1. A Week of DeepSeek Releasing New Stuff

Day 1: FlashMLA

Released FlashMLA - an efficient MLA decoding kernel for Hopper GPUs
- Optimized for variable-length sequences, now in production
- Supports BF16
- Supports paged KV cache (block size 64)
- Performance: up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations (measured on H800)
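
To make the paged KV cache bullet concrete, here is a minimal sketch of the bookkeeping a paged cache implies, assuming the block size of 64 listed above. The names `block_table`, `blocks_needed`, and `locate_token` are hypothetical and are not FlashMLA's API; the real kernel does this lookup on GPU tensors.

```python
BLOCK_SIZE = 64  # block size from the release notes

def blocks_needed(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of cache blocks a sequence of seq_len tokens occupies."""
    return (seq_len + block_size - 1) // block_size

def locate_token(block_table: list[int], pos: int,
                 block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Map a logical token position to (physical block id, offset in block)."""
    return block_table[pos // block_size], pos % block_size

# Hypothetical block table: a 150-token sequence stored in physical blocks 7, 2, 9.
block_table = [7, 2, 9]
print(blocks_needed(150))               # 3
print(locate_token(block_table, 130))   # (9, 2)
```

Because each sequence only claims whole blocks as it grows, variable-length sequences can share one cache pool without reserving space for the longest possible sequence.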

Day 2: DeepEP

Released DeepEP - the first open-source expert-parallelism (EP) communication library for MoE model training and inference
- Features:
  - Efficient and optimized all-to-all communication
  - Supports intranode (NVLink) and internode (RDMA) communication
  - High-throughput kernels for training and inference prefilling
  - Low-latency kernels for inference decoding
  - Native FP8 dispatch support
  - Flexible GPU resource control for computation-communication overlapping
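
For readers unfamiliar with MoE dispatch, the all-to-all step can be pictured as sorting each token into a bucket per destination rank. The sketch below simulates that grouping in plain Python. It assumes experts are placed contiguously across ranks (`experts_per_rank` is a hypothetical parameter) and it does not use or mirror DeepEP's actual API.

```python
from collections import defaultdict

def dispatch(token_experts: dict[int, list[int]],
             experts_per_rank: int) -> dict[int, list[tuple[int, int]]]:
    """Group (token, expert) pairs by the rank that hosts the expert."""
    buckets = defaultdict(list)
    for token, experts in token_experts.items():
        for expert in experts:
            # Assumed placement: experts 0..k-1 on rank 0, k..2k-1 on rank 1, ...
            buckets[expert // experts_per_rank].append((token, expert))
    return dict(buckets)

# Hypothetical routing: each token is routed to two experts.
routing = {0: [0, 5], 1: [2, 3], 2: [5, 6]}
send_plan = dispatch(routing, experts_per_rank=2)
for rank in sorted(send_plan):
    print(rank, send_plan[rank])
```

Each bucket is what one rank would send to another. A real library performs this exchange over NVLink or RDMA and then runs the reverse "combine" step to return expert outputs.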

Day 3: DeepGEMM

Released DeepGEMM - an FP8 GEMM library supporting dense and MoE GEMMs
- Applied to V3/R1 training and inference
- Performance: Up to 1350+ FP8 TFLOPS on Hopper GPUs
- Features:
  - No heavy dependencies, as clean as a tutorial
  - Fully Just-In-Time compiled
  - Core logic at ~300 lines, yet outperforms expert-tuned kernels across most matrix sizes
  - Supports dense layout and two MoE layouts
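
To put the 1350+ TFLOPS figure in perspective, a GEMM of shape M x K times K x N costs 2·M·N·K floating-point operations, so the peak rate gives a lower bound on runtime. The matrix shape below is a hypothetical example, not one taken from DeepGEMM's benchmarks, and the bound assumes the peak rate is sustained for the whole call.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations for an (m x k) @ (k x n) matrix multiply."""
    return 2 * m * n * k

def min_seconds(m: int, n: int, k: int, peak_tflops: float = 1350.0) -> float:
    """Lower bound on runtime if the kernel sustained peak_tflops throughout."""
    return gemm_flops(m, n, k) / (peak_tflops * 1e12)

# Hypothetical shape, chosen only for illustration.
m, n, k = 4096, 7168, 2048
print(gemm_flops(m, n, k))                              # 120259084288
print(f"{min_seconds(m, n, k) * 1e6:.1f} microseconds")
```

At the quoted peak, this example shape would take roughly 89 microseconds at best; real kernels land above that bound depending on shape and memory traffic.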

Off-Peak Discounts Alert

DeepSeek API Platform off-peak discounts
- Time: Daily 16:30–00:30 UTC
- Discounts:
  - DeepSeek-V3: 50% off
  - DeepSeek-R1: 75% off
- Goal: Optimize resource utilization and save costs
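
The discount window crosses midnight UTC, which is easy to get wrong in a scheduler. Here is a minimal sketch of the check, assuming the start is inclusive and the end is exclusive (the announcement does not say). The function names and the example list price are hypothetical, and the dictionary keys are the labels used above, not necessarily API model identifiers.

```python
from datetime import time

OFF_PEAK_START = time(16, 30)  # UTC
OFF_PEAK_END = time(0, 30)     # UTC, on the following day
DISCOUNTS = {"DeepSeek-V3": 0.50, "DeepSeek-R1": 0.75}

def is_off_peak(t: time) -> bool:
    """True if UTC time t is in 16:30-00:30; the window wraps past midnight."""
    return t >= OFF_PEAK_START or t < OFF_PEAK_END

def discounted_price(model: str, list_price: float, t: time) -> float:
    """Apply the off-peak discount for `model` if t falls inside the window."""
    if is_off_peak(t):
        return list_price * (1 - DISCOUNTS[model])
    return list_price

print(is_off_peak(time(23, 0)))                             # True
print(is_off_peak(time(12, 0)))                             # False
print(discounted_price("DeepSeek-R1", 1.0, time(0, 15)))    # 0.25
```

Note the `or` in `is_off_peak`: because the end time is earlier in the day than the start time, a time qualifies if it is after the start or before the end.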

Day 4: Optimized Parallelism Strategies

Released optimized parallelism strategies
- DualPipe:
  - A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training
  - Link: https://github.com/deepseek-ai/DualPipe
- EPLB:
  - An expert-parallel load balancer for V3/R1
  - Link: https://github.com/deepseek-ai/eplb
- Analysis:
  - Computation-communication overlap analysis for V3/R1
  - Link: https://github.com/deepseek-ai/profile-data
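
To illustrate what an expert-parallel load balancer is trying to achieve, here is a simple greedy placement: put the heaviest expert on whichever GPU currently has the least load. This is a generic longest-processing-time heuristic written for illustration, not EPLB's actual algorithm, and the expert names and load numbers are made up.

```python
import heapq

def balance_experts(expert_loads: dict[str, float],
                    num_gpus: int) -> list[list[str]]:
    """Greedy placement: heaviest remaining expert goes to the least-loaded GPU."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]  # (current load, gpu index)
    placement: list[list[str]] = [[] for _ in range(num_gpus)]
    for name, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        total, gpu = heapq.heappop(heap)
        placement[gpu].append(name)
        heapq.heappush(heap, (total + load, gpu))
    return placement

# Hypothetical per-expert token loads.
loads = {"e0": 90, "e1": 60, "e2": 50, "e3": 40, "e4": 30, "e5": 10}
print(balance_experts(loads, 2))
# [['e0', 'e3', 'e5'], ['e1', 'e2', 'e4']]
```

In this example both GPUs end up with a total load of 140, so neither one stalls the other during the all-to-all exchange.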

Day 5: 3FS, Thruster for All DeepSeek Data Access

Released 3FS (Fire-Flyer File System) - a parallel file system
- Fully utilizes the bandwidth of modern SSDs and RDMA networks
- Performance:
  - 180-node cluster: 6.6 TiB/s aggregate read throughput
  - 25-node cluster GraySort benchmark: 3.66 TiB/min throughput
  - Per client node KVCache lookup peak: 40+ GiB/s
- Features:
  - Disaggregated architecture with strong consistency semantics
  - Supports training data preprocessing, dataset loading, checkpoint saving/reloading, embedding vector search, and inference KVCache lookups for V3/R1
- Links:
  - 3FS: https://github.com/deepseek-ai/3FS
  - Smallpond (data processing framework based on 3FS): https://github.com/deepseek-ai/smallpond
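
The throughput figures above use different units, so a quick conversion helps compare them. The sketch below assumes binary units (1 TiB = 1024 GiB) and, for the per-node number, that the aggregate read throughput is spread evenly across the 180 nodes, which is a simplification.

```python
GIB_PER_TIB = 1024

def per_node_gib_s(aggregate_tib_s: float, nodes: int) -> float:
    """Average per-node throughput in GiB/s, assuming an even split."""
    return aggregate_tib_s * GIB_PER_TIB / nodes

def tib_min_to_gib_s(tib_per_min: float) -> float:
    """Convert TiB/min to GiB/s."""
    return tib_per_min * GIB_PER_TIB / 60

print(round(per_node_gib_s(6.6, 180), 1))   # 37.5
print(round(tib_min_to_gib_s(3.66), 1))     # 62.5
```

So the 180-node read benchmark averages about 37.5 GiB/s per node, and the GraySort run sustains about 62.5 GiB/s across its cluster.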

Day 6: DeepSeek-V3/R1 Inference System Overview

Released an overview of the DeepSeek-V3/R1 inference system
- Optimized throughput and latency:
  - Enhanced batch scaling via cross-node EP
  - Computation-communication overlap
  - Load balancing
- Online service statistics:
  - Per H800 node: 73.7k/14.8k input/output tokens per second
  - Cost profit margin: 545%
- Goal: Provide value to the community and contribute to AGI goals
- Deep Dive: https://bit.ly/4ihZUiO
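
A 545% figure can look odd for a margin, so here is the arithmetic, assuming "cost profit margin" means profit divided by cost. The revenue and cost below are hypothetical round numbers chosen only to reproduce the percentage; they are not DeepSeek's reported figures.

```python
def cost_profit_margin(revenue: float, cost: float) -> float:
    """Profit as a percentage of cost: (revenue - cost) / cost * 100."""
    return (revenue - cost) * 100 / cost

# Hypothetical numbers: revenue is 6.45x the cost.
print(cost_profit_margin(revenue=645.0, cost=100.0))   # 545.0
```

Under that definition, a 545% margin means revenue is about 6.45 times the cost of serving.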
