Advanced Topics Seminar

Research in Storage Systems

Advanced Topics in Storage Systems

Prof. Yongsoo Joo, School of Software

March 23, 2026 (Mon)

https://adv-tech.pages.dev

Table of Contents

I

Introduction to Storage Systems
Market, Industry, Research Ecosystem

II

Research Flow in Storage
From Observation to Implementation

III

Advanced Topics: Recent Research
HotStorage '24 & FAST '26

IV

Future Research Directions
Open Problems & Outlook

Redefining the Role of Storage Systems

Storage in Traditional Computer Systems

Von Neumann architecture: CPU, memory, and storage hierarchy
Secondary storage: persistent data retention
Slow HDD-based I/O → root cause of system bottlenecks
OS kernel design centered on the block I/O stack, file system, and buffer cache

Redefining Storage in the AI Era

Data explosion: LLM training datasets of tens to hundreds of TB, plus a multimodal surge
GPU bottleneck relief: high-speed data pipelines to maximize GPU utilization
Checkpoint & Recovery: high-bandwidth storage is essential for fault recovery in large-scale distributed training
Inference serving: KV cache offloading and model weight loading → low-latency requirements
CXL / NVMe-oF: reshaping the storage-memory boundary

▼

From "secondary storage" to a core component of AI infrastructure

Storage Market Size

Global Storage Market ($1 ≈ ₩1,490 as of 2026.3.18)

Category	2024	2025 (est.)	2028 (proj.)	CAGR*
NAND Flash Revenue	~$53B (approx. 79T KRW)	~$78B (approx. 116T KRW)	~$120B+ (approx. 179T KRW+)	~20%
Enterprise SSD	~$18B (approx. 27T KRW)	~$28B (approx. 42T KRW)	~$50B+ (approx. 75T KRW+)	~25%
HDD (Nearline)	~$7B (approx. 10T KRW)	~$8B (approx. 12T KRW)	~$9B (approx. 13T KRW)	~5%
Total Storage Systems	~$70B (approx. 104T KRW)	~$90B (approx. 134T KRW)	~$140B+ (approx. 209T KRW+)	~18%

*CAGR (Compound Annual Growth Rate): annualized growth rate compounded over a given period
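
As a quick illustration of the CAGR definition, the sketch below computes the rate from two endpoint values. The function name `cagr` is illustrative. The second call uses the rounded Total Storage Systems figures from the table and assumes the CAGR column spans the four years from 2024 to 2028, which the slide does not state, so the result only approximates the rounded value in the table.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Sanity check: growing from 100 to 121 over 2 years is 10% per year.
simple = cagr(100.0, 121.0, 2)

# Total Storage Systems row: ~$70B (2024) to ~$140B (2028), assumed 4 years.
total_storage = cagr(70.0, 140.0, 4)

print(f"simple example       : {simple:.1%}")
print(f"total storage systems: {total_storage:.1%}")
```

Because the 2028 figures carry a "+" and all inputs are rounded, small differences from the table's CAGR column are expected.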

Key Takeaways

Expanding AI infrastructure investment drives surging Enterprise SSD demand
NAND cycle: historic downturn in 2023 → V-shaped recovery in 2024~2025
HDD: restructuring around high-capacity Nearline (datacenter) drives, SMR/HAMR technology race
Samsung + SK hynix → combined global NAND market share of roughly 45~50%

Recent Storage Industry Trends

Key Industry Trends (2024~2026)

Samsung: 9th-gen V-NAND (300+ layers) mass production, accelerated CXL development
SK hynix: 238-layer NAND SSD, Solidigm acquisition benefits
Kioxia: 2024 IPO, spin-off from WD completed
Micron: 232-layer NAND, DC SSD market share expansion
Western Digital: HDD/Flash division spin-off
Pure Storage / VAST Data: all-flash startups growing rapidly
AI storage: NVIDIA DGX/HGX → high-performance NVMe SSDs deployed at scale

Major Korean Storage Companies

Large Enterprises

Company	Core Business
Samsung Electronics	NAND Flash, Enterprise/Consumer SSD, UFS, eMMC
SK hynix	NAND Flash, Enterprise SSD, UFS, Solidigm (subsidiary)

Fabless SSD Controller Companies

Company	Key Products / Features
FADU	High-performance SSD controllers for AI datacenters
Jump Technology	Domestic SSD controller development
AD Technology	Domestic SSD controller / firmware
Silicon Motion	#1 global SSD controller fabless (Korea R&D center)
Phison	PCIe Gen5 SSD controller (Korea partnership)

Storage Solutions / Software

Company	Core Business
Gluesys	Unified storage management software
Namu Tech	Cloud storage solutions
ManTech	Backup / archive storage

Storage Systems Research Flow

Known Problem

Demand Paging & Cold Start

Demand Paging

The OS does not load all code and data into memory upfront when launching a program
A page fault occurs on first access → only then is the page fetched from storage
Saves memory, but the first launch triggers a chain of hundreds to thousands of I/Os, issued one after another

Cold Start Problem

Cold start: first launch when app data is not in memory (page cache)
All pages must be read from storage, making launch several times slower
Severe on HDDs because of added seek time → still a bottleneck on SSDs due to small random I/Os

Key Question
How can we reduce or speed up the thousands of I/Os generated during cold start?

Research Flow

FAST [USENIX FAST '11]

FAST: Quick Application Launch on Solid-State Drives — Joo et al.

Research Phase	Application in FAST Paper
1. System I/O Observation	Collected block I/O traces during application cold start via Blktrace; recorded thousands of block requests at kernel level during app launch
2. Pattern Discovery	Page fault order and I/O request order are identical on every cold start (deterministic); block request sequences remain unchanged across repeated launches of the same app; during a page fault the CPU waits while only storage works, and during CPU computation storage sits idle
3. Improvement Idea	Overlap CPU computation time with SSD I/O time (concurrent execution); a prefetcher issues I/O in advance so the data already resides in the page cache when the app needs it; launch time can be shortened to t_launch ≈ max(t_ssd, t_cpu)
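
The overlap idea in step 3 can be sketched as a two-line timing model. This is a simplification under stated assumptions: the cold-start baseline is taken as fully serialized (t_cpu + t_ssd), which follows from the observation that CPU and storage take turns, and the function names and sample times are hypothetical.

```python
def cold_start_time(t_cpu: float, t_ssd: float) -> float:
    """Demand paging: CPU and SSD alternate, so their times add up."""
    return t_cpu + t_ssd


def prefetched_time(t_cpu: float, t_ssd: float) -> float:
    """Ideal overlap: launch time is bounded by the slower of the two."""
    return max(t_cpu, t_ssd)


def reduction(t_cpu: float, t_ssd: float) -> float:
    """Fraction of launch time saved by perfect CPU/SSD overlap."""
    base = cold_start_time(t_cpu, t_ssd)
    return 1.0 - prefetched_time(t_cpu, t_ssd) / base


# Hypothetical app: 3.0 s of CPU work and 1.0 s of SSD reads.
saving = reduction(3.0, 1.0)
print(f"ideal launch time reduction: {saving:.0%}")
```

In this model the saving peaks at 50% when t_cpu equals t_ssd and shrinks as one side dominates, which is why the benefit differs from app to app.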

Research Phase	Application in FAST Paper
4. Algorithm Design	Designed an LBA → file reverse mapping algorithm (block address → file name + offset); the file system (EXT3) does not support reverse mapping from LBA to file; solved by building a per-app LBA-to-inode map based on a red-black tree
5. Implementation	User-level FAST system implementation that prefetches via posix_fadvise(); components: launch manager, system-call profiler, disk I/O profiler, launch sequence extractor, LBA-to-inode mapper, prefetcher generator; runs on Linux 2.6.32 without kernel modifications
6. System Integration	Integration test with 22 real applications on Linux Fedora 12 + Intel 80GB SSD: Acrobat, Eclipse, Firefox, Gnome, Matlab, OpenOffice, etc.
7. Evaluation	Average 28% launch time reduction vs. cold start (range: 16~46%); prefetcher accuracy verified with a 1.6% miss ratio; gains also confirmed when multiple apps launch concurrently (Fig. 8: 5~20 apps, improvement 20% → 7%)
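
A user-level prefetcher of the kind described in step 5 can be sketched with Python's `os.posix_fadvise`, which wraps the same system call. This is not the authors' code: the `(path, offset, length)` launch-sequence format and the `prefetch` helper are assumptions for illustration, and the fallback to a blocking `os.pread` is only there so the sketch still runs where the hint is unavailable.

```python
import os
import tempfile


def prefetch(launch_sequence):
    """Hint the kernel to load each (path, offset, length) range into the page cache.

    Returns the number of ranges processed.
    """
    processed = 0
    for path, offset, length in launch_sequence:
        fd = os.open(path, os.O_RDONLY)
        try:
            try:
                # WILLNEED starts the read in the background and returns at once,
                # so the caller can keep computing while the SSD works.
                os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
            except (AttributeError, OSError):
                # Fallback (assumption): a plain read also warms the page cache.
                os.pread(fd, length, offset)
            processed += 1
        finally:
            os.close(fd)
    return processed


# Demo with a throwaway file standing in for an application binary.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 8192)
    demo_path = tmp.name

hinted = prefetch([(demo_path, 0, 4096), (demo_path, 4096, 4096)])
os.unlink(demo_path)
print(f"ranges hinted: {hinted}")
```

Issuing the hints in the recorded launch order is what lets the reads run ahead of the page faults that would otherwise trigger them.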

Known Problem

Mobile App Loading: Expectation vs. Reality

Mobile-Specific Challenges

Smartphones keep only one foreground app active → background apps have their memory reclaimed
Switching apps can repeatedly trigger cold start-like loading

Gap Between Specs and Experience

eMMC/UFS manufacturers advertise sequential read bandwidth of hundreds of MB/s
Yet why isn't actual app loading as fast as expected?
The prefetching of FAST '11 alone had limitations in the mobile environment

Key Question
Storage specs are fast enough, so why isn't that performance realized during actual app loading?

Research Flow

Enlarging I/O Size [IEEE ESL '20]

Enlarging I/O Size for Faster Loading of Mobile Applications — Joo et al.

Research Phase	Application in Paper
1. System I/O Observation	Collected block I/O traces during mobile app launch via Blktrace; recorded the I/O patterns generated by demand paging at kernel level
2. Pattern Discovery	Demand paging issues many I/Os of varying sizes at queue depth 1; on flash storage, I/O size vs. bandwidth is nonlinear: bandwidth is low for small I/Os and rises sharply for large I/Os; cause: larger I/Os activate multiple dies inside the eMMC simultaneously, increasing parallel throughput → enlarging the I/O size transfers the same data volume much faster
3. Improvement Idea	Three-step optimization for explicit loading + I/O size enlargement; ① I/O Reordering: sort the working set by LBA (prior technique); ② I/O Merging: merge adjacent chunks (prior technique); ③ I/O Padding: insert unneeded pages between non-contiguous chunks to further enlarge the I/O size (key contribution of the paper)
4. Algorithm Design	Optimal padding decision algorithm based on dynamic programming; working set W = (b₀, b₁, …, b_{n-1}), padding vector P = (p₀, …, p_{n-2}); t(i,j) = min { read(i,j), t(i,s) + t(s+1,j) }: optimal trade-off between padding and splitting; storage performance model: lookup table of measured read times per I/O size
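
The recurrence in step 4 can be sketched as an interval dynamic program. This is a minimal illustration, not the paper's implementation: the chunk format `(start_page, num_pages)`, the helper names, and the linear cost model in `read_time_us` (which stands in for the measured lookup table) are all assumptions.

```python
from functools import lru_cache


def read_time_us(pages: int) -> float:
    """Hypothetical storage model: fixed per-request cost plus per-page cost."""
    return 100.0 + 10.0 * pages


def optimal_load_time(chunks, read_time=read_time_us) -> float:
    """chunks: LBA-sorted, non-overlapping (start_page, num_pages) tuples."""
    n = len(chunks)
    if n == 0:
        return 0.0

    def span_pages(i: int, j: int) -> int:
        # One padded I/O covering chunks i..j also reads the gap pages.
        return chunks[j][0] + chunks[j][1] - chunks[i][0]

    @lru_cache(maxsize=None)
    def t(i: int, j: int) -> float:
        best = read_time(span_pages(i, j))      # pad every gap in i..j
        for s in range(i, j):                   # or split after chunk s
            best = min(best, t(i, s) + t(s + 1, j))
        return best

    return t(0, n - 1)


# Two nearby chunks (2-page gap) and one far-away chunk.
chunks = [(0, 4), (6, 4), (1000, 8)]
no_padding = sum(read_time_us(length) for _, length in chunks)
with_padding = optimal_load_time(chunks)
print(f"no padding: {no_padding} us, optimal: {with_padding} us")
```

With this cost model the optimizer pads the small gap between the first two chunks but keeps the distant chunk as a separate request, which is the padding versus splitting trade-off the recurrence encodes.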

Research Phase	Application in Paper
5. Implementation	Explicit application loading module (based on the pread system call); Profiler: collects the working set via blktrace; Optimizer: reordering → merging → DP padding; Loader: executes the optimized I/O sequence via pread
6. System Integration	Tested on a Google Nexus 5 (32GB eMMC) with 16 Android apps: Chrome, Facebook, Firefox, Netflix, WhatsApp, Twitter, Skype, etc.
7. Evaluation	Average I/O size increased 5.6x, app loading time reduced by up to 30%; I/O merging alone: 19.3% reduction; with I/O padding added: 29.9% reduction; performance gains maintained under background I/O interference; energy consumption also reduced by 1/3 (larger I/O sizes improve eMMC efficiency)

Fig.1: LBA block map with I/O padding (Messenger)

What Is the Essence of Storage Research?

Three Core Competencies in Systems Research

🔍

Observation & Problem Discovery

Observe system I/O and
find patterns others miss

📈

Algorithm Design

Transform insights from observation
into systematic solutions

🛠

System Implementation

Apply to real systems at the
kernel, firmware, and device level

Which competency is most irreplaceable?

Algorithm
Once the problem is defined,
someone can solve it

System Impl.
Traditionally a barrier to entry,
now rapidly lowered by AI

Observation & Discovery
The starting point and essence of research
The key is curiosity

FAST '11: observed "identical I/O sequences on every app launch" → derived prefetching
ESL '20: observed "nonlinear relationship between I/O size and bandwidth" → derived I/O padding
→ Both papers started from observation and were completed through algorithm design and system implementation

Background

Interrupt vs Polling

Two ways to notify the CPU of I/O completion

Interrupt

After issuing an I/O request, the CPU does other work (context switch)
The device notifies the CPU with an interrupt signal upon completion
Pros: no CPU idle time (other work can proceed)
Cons: context switch overhead, interrupt handling latency

Polling

After issuing an I/O request, the CPU repeatedly checks for completion (busy waiting)
Completion is detected immediately → no additional latency
Pros: ultra-low latency (no context switch)
Cons: monopolizes the CPU → no other work possible

Reality: Interrupt Is the Default for Storage I/O
Storage devices (HDD/SSD) are thousands to tens of thousands of times slower than the CPU → an I/O takes tens of µs to several ms
Busy waiting for that long is wasteful → interrupt is effectively the only choice
The Linux kernel also uses interrupt mode by default for block I/O

But Is That Always the Case?
① VM environment: what if the data is in the host's DRAM cache? → the I/O completes almost instantly
② Ultra-Low Latency (ULL) SSD: what if the device itself responds in a few µs? → interrupt overhead becomes comparable to the I/O time itself
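
To make the two questions above concrete, the sketch below estimates what share of the total completion latency is interrupt and context switch overhead for devices of different speeds. All numbers are hypothetical round figures chosen to match the orders of magnitude on the slides (a few µs of overhead, ms for HDDs, ~10 µs for ULL SSDs), not measurements.

```python
CS_OVERHEAD_US = 5.0  # hypothetical interrupt + context switch cost

# Hypothetical device response times in microseconds.
DEVICE_LATENCY_US = {
    "HDD": 5000.0,
    "NAND SSD": 100.0,
    "ULL SSD": 10.0,
    "host DRAM cache hit": 1.0,
}


def interrupt_latency(device_us: float, overhead_us: float = CS_OVERHEAD_US) -> float:
    """Interrupt mode: device time plus the fixed wake-up overhead."""
    return device_us + overhead_us


def overhead_share(device_us: float, overhead_us: float = CS_OVERHEAD_US) -> float:
    """Fraction of the total latency spent on overhead rather than the device."""
    return overhead_us / interrupt_latency(device_us, overhead_us)


shares = {name: overhead_share(us) for name, us in DEVICE_LATENCY_US.items()}
for name, share in shares.items():
    print(f"{name:>20}: {share:.1%} of latency is overhead")
```

Under these assumptions the fixed overhead is negligible for an HDD but dominates once the "device" answers in about a microsecond, which is the regime where polling becomes attractive.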

Known Problem

VM Storage Performance Problem

I/O Path in Virtualized Environments

Guest OS I/O request → processed by the host OS on its behalf (via a virtual block device)
The host OS has its own page cache (DRAM) → frequently accessed data is cached there
If the data requested by the guest OS is in the host cache, it can be served immediately without SSD access

So Why Is It Slow?

The guest OS cannot know whether the data is in the host cache or on the SSD
It always assumes the data is on a slow device → unconditionally performs I/O in interrupt mode
Enabling the host cache should improve performance, but in practice it falls short of expectations

Key Question
The data is in the host cache, so why is performance poor? What is the cause, and how can it be fixed?

Research Flow

Expanding Polled I/O Path [HotStorage '24]

Improving Virtualized I/O Performance by Expanding the Polled I/O Path of Linux — Seo, Joo, Dutt

Research Phase	Application in Paper
1. System I/O Observation	Measured the guest OS I/O path and host page cache access performance in a virtualized environment; measured 4KB random read throughput between guest and host page cache with fio
2. Pattern Discovery	Even with the data in the host page cache (DRAM), guest throughput was only 105 MB/s; DRAM bandwidth is tens of GB/s, yet unnecessary context switches occur in the guest I/O path; cause: the guest OS cannot determine whether data resides on the host SSD or in the host cache; it always assumes the data is on the storage device → unconditional task switching after I/O submit
3. Improvement Idea	Attempt polling on host cache access to avoid context switch overhead; if the data is in the host cache, the I/O completes almost instantly → polling is advantageous; system-level barrier: the Linux kernel's buffered I/O and mmap paths lack a poll path; remedy: extend the polled I/O path to buffered I/O (path ③) and mmap (path ④)
4. Algorithm Design	Designed a hipri flag injection technique; identified where in the block layer to insert the hipri flag into the bi_opf of I/O requests; buffered I/O: set hipri right before submit_bio() in the readahead module; memory-mapped I/O: same approach applied to the faultaround/readaround modules; clear hipri after blk_poll() returns to prevent side effects on page status updates
5. Implementation	Linux kernel source modification: 12 files, +74 / -19 lines; targets the ext4 file system, easily extensible to other file systems; applicable to both guest OS and host OS kernels
6. System Integration	Host: Intel i7-10700K + 32GB DDR4 + Intel Optane 900P (ULL SSD); Guest: VirtualBox 6.1.26, Ubuntu 20.04, kernel 5.15; fio microbenchmarks + real applications (Matlab, Android Studio, Kdenlive, etc.)
7. Evaluation	Buffered I/O 4KB random read: 105 → 332 MB/s (3.2x improvement); DIO, buffered, and mmap all improved throughput 1.9~3.3x vs. vanilla; small-file copy time reduced 43% (guest cp), host cp also 20% faster; 31~36% memory savings when guest OS I/O optimization features are disabled
8. TODO	Remzi's OSTEP textbook already anticipated the hybrid (two-phased) approach; currently always polls → can be extended to a poll/interrupt hybrid (two-phased) approach
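
The throughput figures in step 7 can be turned into an approximate per-request time, which makes the context switch cost easier to see. The sketch assumes one 4KB request in flight at a time, 4KB = 4096 bytes, and MB = 10^6 bytes; none of these are stated on the slide, and the helper name is illustrative.

```python
BLOCK_BYTES = 4096  # assumed size of one "4KB" request


def per_io_us(throughput_mb_s: float, block_bytes: int = BLOCK_BYTES) -> float:
    """Average time per request in µs, assuming requests are served one at a time."""
    bytes_per_second = throughput_mb_s * 1_000_000
    return block_bytes / bytes_per_second * 1_000_000


vanilla_us = per_io_us(105.0)   # interrupt-driven buffered read
polled_us = per_io_us(332.0)    # with the expanded polled path
speedup = 332.0 / 105.0

print(f"vanilla: {vanilla_us:.1f} us per 4KB read")
print(f"polled : {polled_us:.1f} us per 4KB read")
print(f"speedup: {speedup:.2f}x")
```

Under these assumptions, tens of microseconds per request for data that already sits in host DRAM points at software overhead rather than the memory itself.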

Known Problem

I/O Completion Notification in the ULL SSD Era

SSDs Are Getting Faster

Conventional NAND SSD: I/O latency of tens to hundreds of µs → interrupt mode is sufficient
ULL SSDs (3D XPoint, Z-NAND, etc.): I/O latency reduced to ~10 µs or less
The device got faster, but interrupt + context switch overhead (a few µs) is now comparable to the I/O time

Prior Approaches

Polling: detect I/O completion immediately via busy waiting, as introduced earlier
Hybrid Polling: sleep for a set time, then wake up and poll; a compromise that reduces the CPU waste of polling
Both are implemented in the Linux kernel (io_uring, etc.)

Key Question
Polling and hybrid polling already exist, so why aren't they widely used in practice? What is holding back adoption?

Research Flow

DPAS [USENIX FAST '26]

DPAS: A Prompt, Accurate and Safe I/O Completion Method for SSDs — Seo, Jung, Yoon, Chen, Joo, Lim, Dutt

Research Phase	Application in Paper
1. System I/O Observation	Measured interrupt, polling, and hybrid polling performance on 3 SSD types (3D XPoint, Z-NAND, TLC NAND); compared IOPS, latency, and CPU utilization for 4KB random read, QD=1, 1~20 threads, with and without CPU contention
2. Pattern Discovery	No single I/O completion method is optimal in all situations (I/O Completion Trilemma); Interrupt: CPU-efficient, but the share of the fixed context switch (CS) overhead grows as SSDs get faster; Classic polling: lowest latency, but occupies 100% of the CPU → performance collapses under contention; Hybrid polling: epoch-based sleep estimation is inaccurate and suffers from latency shelving (oversleep → polluted measurements → positive feedback loop); under CPU contention the OS scheduler wakes the thread later than the timer → hybrid polling becomes worse than interrupt
3. Improvement Idea	Two core ideas; PAS: adjusts the sleep duration using only binary outcomes (UNDER/OVER) without measuring I/O time → avoids the fundamental limitations of the epoch-based approach (inaccuracy, latency shelving); DPAS: switches dynamically at runtime among three modes (polling, PAS, interrupt) → selects the best mode for the current situation via timer failure detection + queue depth monitoring
4. Algorithm Design	PAS: 4 rules that determine adjust from the outcome pair of the last two I/Os (sr_pnlt, sr_last); (UNDER, UNDER): adjust += UP; (OVER, OVER): adjust -= DN; (UNDER, OVER): adjust = 1-DN; (OVER, UNDER): adjust = 1+UP; duration = duration × adjust (UP=0.01, DN=0.1 → avoiding oversleep is prioritized 10x); DPAS mode transitions: CP ↔ PAS (normal) ↔ PAS (overloaded) ↔ Interrupt, where CP is classic polling; QD=1: CP; timer failure detected: PAS overloaded; QD>θ: switch to interrupt
5. Implementation	Implemented in the Linux multi-queue block layer: 9 files, +1,224 / -30 lines; per-core bucket structure (PAS ~592B, DPAS +~100B per core); block layer integration → applies transparently to all sync I/O apps (no app modification needed)
6. System Integration	Intel Xeon Gold 6230 (20 cores) + 3 SSD types tested: Intel Optane P5800X (3D XPoint), Samsung 983 ZET (Z-NAND), SK hynix P41 (TLC NAND); fio microbenchmarks + RocksDB/YCSB A~F, combined with CPU contention and I/O interference
7. Evaluation	PAS: 21pp lower CPU utilization vs. LHP (4KB random read); DPAS: under CPU contention + I/O interference, OPS improved 9% (Optane), 7% (ZSSD), 5% (P41) vs. INT; consistently best performance on 8 additional NAND SSDs without per-device tuning; Artifact Evaluated: Available + Functional + Reproduced (all three USENIX badges)
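
The four PAS rules in step 4 can be sketched as a small state machine. This is an illustration of the rules as listed on the slide, not the kernel implementation. Three things are assumptions: UNDER is read as "woke up before the I/O completed" and OVER as "woke up after it completed"; the handling of the very first sample (when no pair exists yet) is invented for the sketch; and the `MIN_ADJUST` floor is a hypothetical guard that the slide does not mention.

```python
UNDER, OVER = "UNDER", "OVER"
UP, DN = 0.01, 0.1      # step sizes from the slide
MIN_ADJUST = 0.1        # hypothetical floor, not from the slide


class PasEstimator:
    """Tracks the sleep duration using only binary sleep outcomes."""

    def __init__(self, initial_duration_us: float):
        self.duration = initial_duration_us
        self.adjust = 1.0
        self.sr_pnlt = None   # penultimate sleep result
        self.sr_last = None   # last sleep result

    def update(self, result: str) -> float:
        self.sr_pnlt, self.sr_last = self.sr_last, result
        pair = (self.sr_pnlt, self.sr_last)
        if pair == (UNDER, UNDER):
            self.adjust += UP            # keep lengthening, a bit faster
        elif pair == (OVER, OVER):
            self.adjust -= DN            # keep shortening, a lot faster
        elif pair == (UNDER, OVER):
            self.adjust = 1.0 - DN       # direction flipped: restart downward
        elif pair == (OVER, UNDER):
            self.adjust = 1.0 + UP       # direction flipped: restart upward
        else:
            # First sample (assumption): start from the matching restart value.
            self.adjust = 1.0 + UP if result == UNDER else 1.0 - DN
        self.adjust = max(self.adjust, MIN_ADJUST)
        self.duration *= self.adjust
        return self.duration


est = PasEstimator(10.0)
history = []
for outcome in [UNDER, UNDER, UNDER, OVER, OVER, UNDER]:
    est.update(outcome)
    history.append(round(est.adjust, 2))
print(history, round(est.duration, 3))
```

Because DN is ten times UP, a single oversleep pulls the duration down much harder than an undersleep pushes it up, matching the slide's note that avoiding oversleep is prioritized.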

USENIX FAST '26 Summary

USENIX FAST Status

Top-tier venue in File and Storage Technologies
Started in 2002; 24th edition in 2026
Santa Clara, CA, USA (2026.2.24~26)

FAST '26 Acceptance Statistics

Item	Data
Submissions	253 papers (all-time record, up 51% vs. '25)
Accepted	44 papers (acceptance rate 17.4%)
Best Paper	2 papers (both from China; one on a generative-AI-based file system)
Test of Time	F2FS (Korean researchers, Samsung, FAST '15)

First-Author Distribution by Country

Key Session Topics

FAST '26 Authors by Institution (dual affiliations counted separately, same company merged)

Open Problem: Interrupt Coalescing

What Is Interrupt Coalescing?
If an interrupt fires for every single I/O completion, hundreds of thousands of interrupts per second can overwhelm the CPU under heavy load (interrupt storm)
→ Coalescing: batch multiple I/O completions and report them with a single interrupt, drastically reducing the interrupt count and CPU load
→ However, waiting to batch adds delay, so there is a trade-off between latency and CPU efficiency
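
The trade-off described above can be illustrated with a small simulation. The policy below (fire an interrupt when `max_batch` completions are pending, or `timeout_us` after the oldest pending completion) is a generic textbook-style policy assumed for illustration; the parameter names and the sample workload are hypothetical.

```python
def coalesce(completions_us, max_batch: int, timeout_us: float):
    """Simulate coalescing over sorted completion times.

    Returns (number of interrupts, average added delay per completion in µs).
    """
    interrupts = 0
    delays = []
    pending = []
    for t in completions_us:
        # The timeout for the oldest pending completion expired before t.
        if pending and t - pending[0] >= timeout_us:
            fire = pending[0] + timeout_us
            delays.extend(fire - p for p in pending)
            interrupts += 1
            pending = []
        pending.append(t)
        if len(pending) >= max_batch:
            # Batch is full: fire now, the newest completion waits 0 µs.
            delays.extend(t - p for p in pending)
            interrupts += 1
            pending = []
    if pending:
        # Leftover completions are flushed when their timeout expires.
        fire = pending[0] + timeout_us
        delays.extend(fire - p for p in pending)
        interrupts += 1
    average_delay = sum(delays) / len(delays) if delays else 0.0
    return interrupts, average_delay


# Hypothetical workload: one completion every 2 µs, ten in total.
times = list(range(0, 20, 2))
per_io = coalesce(times, max_batch=1, timeout_us=8)
batched = coalesce(times, max_batch=4, timeout_us=8)
print(f"one interrupt per I/O: {per_io}")
print(f"batches of up to 4   : {batched}")
```

In this toy workload batching cuts the interrupt count to less than a third while adding a few microseconds of average delay, which is the latency versus CPU efficiency trade-off in miniature.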

Research Challenge
The next challenge is integrating coalescing as a fourth mode into DPAS's dynamic mode switching (CP ↔ PAS ↔ INT)

Open Problem: PAS + Interrupt Hybrid

Limitations of PAS

PAS's sleep duration prediction cannot be perfect
On oversleep → an increase in app-perceived latency is unavoidable
Current DPAS: corrects after the fact once oversleep is detected (adjusts from the next I/O)

Idea: Dual Registration of Timer + SSD Interrupt

On I/O submit, arm both a timer interrupt and the SSD interrupt
The mode is determined by whichever interrupt fires first:

Timer expires first
→ PAS mode: the sleep was accurate → poll and complete immediately
→ Cancel the SSD interrupt

SSD interrupt arrives first
→ Oversleep scenario → handle immediately via the interrupt
→ Cancel the timer, minimizing latency loss
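
The dual-registration idea can be sketched as a comparison of two event times. This is a timing model only, not the patented mechanism: the function names are hypothetical, and interrupt handling and cancellation costs are ignored.

```python
def pas_only(timer_us: float, io_done_us: float) -> float:
    """Plain PAS: completion is noticed no earlier than the timer wake-up."""
    return max(timer_us, io_done_us)


def hybrid(timer_us: float, io_done_us: float):
    """Timer and SSD interrupt are both armed; the first event decides the path.

    Returns (winner, time completion is noticed in µs, time spent polling in µs).
    """
    if timer_us <= io_done_us:
        # Timer first: wake up, poll until the I/O finishes, cancel the SSD interrupt.
        return ("timer", io_done_us, io_done_us - timer_us)
    # SSD interrupt first: the sleep was too long, so handle it now and cancel the timer.
    return ("ssd_interrupt", io_done_us, 0.0)


accurate = hybrid(timer_us=8.0, io_done_us=10.0)
oversleep = hybrid(timer_us=15.0, io_done_us=10.0)
print("accurate sleep:", accurate)
print("oversleep     :", oversleep, "vs plain PAS at", pas_only(15.0, 10.0), "us")
```

In the oversleep case the hybrid notices completion at 10 µs instead of 15 µs in this model, which is the latency loss the dual registration is meant to remove.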

Registered Patent 10-2919645
Hybrid Interrupt Handling Apparatus and Method
Kookmin University Industry-Academic Cooperation Foundation · Inventor: Yongsoo Joo
Registered: 2026.01.26

Open Problem: Per-SSD Value-Add Technology

Research in progress on a new technology that adds at least 10 KRW (~$0.007) of value per SSD
A patent application covering the design has been filed

10 KRW per device × market shipments = ?

Target Market	Annual Shipments (est.)	Annual Value at 10 KRW/device
Enterprise SSD	~140M units ($28B ÷ avg. $200/unit)	~1.4B KRW/yr (~$1M)
All SSDs (incl. Consumer)	~300M units	~3B KRW/yr (~$2M)
All NAND Devices (incl. UFS/eMMC)	~5B units	~50B KRW/yr (~$34M)
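
The arithmetic behind the table can be reproduced in a few lines. The inputs are the slide's own estimates (shipment volumes, the $200 average unit price, and the ₩1,490 exchange rate); the variable names are illustrative.

```python
KRW_PER_USD = 1490            # exchange rate used in the slides
VALUE_PER_DEVICE_KRW = 10     # added value per device

# Enterprise SSD units: $28B revenue divided by an assumed $200 average price.
enterprise_units = 28e9 / 200

shipments = {
    "Enterprise SSD": enterprise_units,
    "All SSDs": 300e6,
    "All NAND devices": 5e9,
}

annual_value_krw = {name: units * VALUE_PER_DEVICE_KRW for name, units in shipments.items()}
annual_value_usd = {name: krw / KRW_PER_USD for name, krw in annual_value_krw.items()}

for name in shipments:
    print(f"{name:>16}: {annual_value_krw[name] / 1e9:.1f}B KRW/yr "
          f"(~${annual_value_usd[name] / 1e6:.1f}M)")
```

The result scales linearly with both the per-device value and the shipment estimate, so any change in the assumed average unit price shifts the Enterprise SSD row proportionally.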

Market Size Reference (per Part I slides, $1 ≈ ₩1,490)
Enterprise SSD: 2025 ~$28B (approx. 42T KRW), CAGR ~25%
Total storage systems: 2025 ~$90B (approx. 134T KRW)
→ Even a tiny 10 KRW per device adds up to billions to tens of billions of KRW per year at market-scale shipment volumes

※ Shipment volumes are estimates based on market revenue and average unit prices. The actual value added varies with the scope of deployment.