# Data Pipes Declarative Control over Data Movement

Lukas Vogel, Daniel Ritter, Danica Porobic, Pınar Tözün, Tianzheng Wang, Alberto Lerner

JNIVERSITY OF COPENHAGEN



SFU

#### Status Quo: Data movement is a bottleneck



| ٠ | <ul> <li>Wildly different hardware at every level</li> <li>Characteristics (latency, throughput,)</li> <li>Granularity (Cache line, page, block,)</li> <li>Access patterns (sequential, random)</li> </ul>                   |
|---|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ٠ | <ul> <li>Applications spend a lot of time moving data</li> <li>Explicitly: e.g., pread</li> <li>Implicitly: e.g., CPU caches</li> </ul>                                                                                      |
| ٠ | <ul> <li>Moving data is complex</li> <li>APIs (pread, mmap, io_uring, SPDK, S3,)</li> <li>Protocols (NVMe/PCI, HTTP, SATA, DDR(-T))</li> <li>Instructions (CLFLUSH(OPT), CLWB, CLDEMOTE)</li> <li>CPU involvement</li> </ul> |
| • | <ul> <li>We'd like to utilize "shortcuts"</li> <li>Such as DMA (I/OAT), DDIO</li> </ul>                                                                                                                                      |

- Exist between nearly any device pair
- We call them "data movement primitives"

#### Reason 1: Primitives are hard to control



Xeon Gold 6212U, 24 Cores, ~1.5 MB L3 Cache/Core, Samsung 970 Pro (PCIe 3)

#### ⇒ Hard to control!

#### Reason 2: Primitives are hard to target



Fincing "PeerFine.h" Fincing sufficient "peerform" Sincing "peerform" and PeerFinest init thread(self same)) (

4

#### Primitives have to be re-thought

- Problem: Hard to utilize, hard to control
  - Vendor-specific, transparent
  - Varying APIs, unconventional or not widely supported
- The culprit: **implicit**, **bottom-up approach** 
  - Vendors see primitives as accelerators for specific use-cases
  - Breaks down when straying from the happy path

#### 2.3 Intel<sup>®</sup> DDIO Requires No Industry Enabling

Inte DDIO is enabled by default on all Intel® Xeon® processor E5 servers and workstations.

Intel DDIO has no hardware dependencies and is invisible to software. No driver changes are required. No OS or VMM changes are required. No application changes are required. All industry I/O devices from IHvs benefit from Intel DDIO including InfiniBand Architecture<sup>®</sup>, Fibre Channel, RAID, and Ethernet. However, Intel Ethernet products with their high performing, stateless architecture excel with Intel DDIO.

#### Proposal: Declarative, top-down data movement

- Build a unified abstraction
- Give control to the developers
- We call it: Data Pipes



- "Typed" locators make data location explicit
- Pipes make data movement explicit
- Top-down approach is declarative and shows intentions
- Framework underneath decides on how data is actually moved
  - Employ DMA hardware capabilities
  - Software fallback

## Data Pipes Example

SSDResourceLocator ssd("file/path");

DRAMResourceLocator dram(1 \* GB);

parallel\_per\_core {

}

CacheResourceLocator cache(CACHE\_SZ);

Pipe ssd\_up(ssd, cache);

Pipe cache\_down(cache, dram);

```
ssd_up.transfer(1 * MB);
```

```
sort(cache, 1 * MB);
```

```
cache_down.transfer(1 * MB);
```



Please take a look at the paper for an **inversion of control** and an **OS-supported** approach

Properties of Data Pipes (1/2)





Declarative





Optimizable

## Properties of Data Pipes (2/2)





Orthogonal to existing primitives



Adaptable to new primitives

#### **Research Agenda**

• Current Prototype: Data Pipes in user space (e.g., SPDK)

- Integrate Data Pipes into the OS
  - Common baseline for different libraries
  - Reserve resources managed by the Data Pipe runtime (e.g., no swapping)
  - API with epoll: See paper!

- Cloud Infrastructure Opportunities
  - Predictability => Reduce overprovisioning (noisy neighbors!)
  - Common API helps customers to migrate to new device generations

#### Hardware Vendor Wishlist

- Give us more access to the hardware!
  - Expose state and tuning knobs of DDIO, I/OAT, ...
  - More control over the caching layer

- Intel's integrated DMA unit lacks some features
  - Lacking documentation
  - Low bandwidth for smaller data transfers
  - Limited number of channels

- Increased reach of DMA units
  - Some device pairs have no acceleration, e.g. DMA directly into the registers?

# Conclusions

Ο

- Data movement is a bottleneck in data-intensive systems
- Existing data movement primitives are heterogeneous and hard to use
- Data Pipes: Abstraction to ...
  - ... present a uniform interface to the application
  - ... make data movement primitives
    - top-down,
    - declarative,
    - and explicit.



## Inversion of Control approach

```
SSDResourceLocator ssd("file/path");
```

```
DRAMResourceLocator dram(1*GB);
```

```
PipeRuntime runtime;
```

```
runtime.fork_and_start();
```

```
vector<future> futures;
```

```
Pipe dram_down(dram, ssd, &runtime);
```

parallel\_per\_core { promise<void> write\_promise; futures.push\_back(write\_promise.get\_future()); // Do some computation with data in DRAM dram\_down.transfer(20 \* MB, write\_promise); wait\_all(futures).wait();

}

futures.clear();

## OS supported approach

```
SSDResourceLocator ssd("file/path");
```

```
DRAMResourceLocator dram(1*GB);
```

```
int dram_downpipe_fd = create_pipe(dram,
ssd);
```

```
int epoll_fd = epoll_create1(0);
```

```
epoll_event dram_pipe_op;
```

```
dram_pip_op.events = EPOLLTRANSFER;
```

```
dram_pip_op.fd = dram_downpipe_fd;
```

```
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, 0,
&dram_pipe_op)
```

```
// Wait until the pipe has capacity to transfer
epoll_event event;
epoll_wait(epoll_fd, &event, 1);
pipe_transfer(dram_downpipe_fd);
```

parallel\_per\_core {

}