Backlog #2827

Hypervisor-side block device cache management

Added by Stuart Longland over 3 years ago. Updated about 1 month ago.

Status: Pending
Start date: 04/07/2014
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Drivers - Storage
Target version: -

Description

Hi all,

I've been slowly working on getting an OpenNebula-based virtual machine cluster up and running at my workplace. We use Ceph to provide shared storage, and one complaint was that VM disk access was painfully slow. I did some testing and found that the VMs could access the disk at around 80MB/sec, though this would sometimes drop to 20MB/sec, especially if OpenNebula was doing an `rbd copy` of a disk image for deployment of a VM. I did some tweaks on the hosts, setting up bonded Ethernet connections for the storage backhaul (the storage nodes still have single gigabit, we're working on this), which brought speed up to 120MB/sec.

So I've begun looking at ways of speeding up the local VM speed. For now my efforts have been focussed on a driver that combines Ceph RBDs and FlashCache, but really these should be two separate drivers in separate subsystems: a driver that manages the access to the shared storage and maps/mounts it on the virtual machine host, and a driver that provisions and configures a chunk of local disk cache to run the VM in.

My work thus far is here: http://git.longlandclan.yi.org/?p=opennebula-ceph-flashcache.git

Right now this driver uses FlashCache and LVM for local cache, and Ceph version-2 RBDs for back-end storage. I've tested migration of VMs between hosts, and this seems to work, as does undeployment/redeployment of VMs. The code should be considered alpha-grade, however. I would not recommend merging this into OpenNebula; instead, consider this the prototype. :-)

The initial tests I've done yielded close to 240MB/sec read speed in the VM.

Unlike the current Ceph driver, this uses the newer RBD format and in particular, it uses the Copy-On-Write clone feature of this format. It also uses `rbd map` on the hosts to mount the RBD volume as a kernel block device so it can be passed as a disk to FlashCache: thus it will require a newer kernel. My testing is on Ubuntu 14.04 Beta.
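
For anyone unfamiliar with the format-2 workflow, the usual sequence is: snapshot the base image, protect the snapshot, clone it, then map the clone on the host. Below is a minimal sketch that only builds the command lines and does not run them. The pool, image and snapshot names are made up for illustration, and this is not necessarily the exact sequence the prototype driver uses.

```python
def rbd_clone_and_map_commands(pool, base_image, snapshot, clone_name):
    """Return rbd command lines for a copy-on-write clone followed by a kernel map.

    All names are placeholders; nothing is executed here.
    """
    parent = f"{pool}/{base_image}@{snapshot}"
    child = f"{pool}/{clone_name}"
    return [
        ["rbd", "snap", "create", parent],   # snapshot the base image
        ["rbd", "snap", "protect", parent],  # clones need a protected snapshot
        ["rbd", "clone", parent, child],     # copy-on-write clone of the snapshot
        ["rbd", "map", child],               # expose the clone as /dev/rbdX on the host
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in rbd_clone_and_map_commands("one", "base-image", "golden", "vm-42-disk-0"):
        print(" ".join(cmd))
```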

I'd like to split this driver up into its two parts, with the cache part being configured in the datastore template (we have DS_MAD and TM_MAD… so maybe a CM_MAD for cache manager?). That would allow its use with other network storage systems like Gluster and iSCSI.

It would also allow exploration of other cache managers such as bcache and dm-cache which as yet I haven't had the chance to try, as it'd require a re-write similar to the one I'm proposing here: so I might as well get some forward planning done and then we can write it proper the first time.

History

#1 Updated by Stuart Longland over 3 years ago

A little note, we updated to OpenNebula 4.6 and Ubuntu 14.04 across the whole cluster yesterday. This driver seems to work in OpenNebula 4.6, although more testing is needed.

#2 Updated by Ruben S. Montero over 3 years ago

  • Tracker changed from Feature to Backlog

#3 Updated by Stuart Longland over 3 years ago

I've just started implementing a better-engineered patch. One thing I've observed is that the scheduler appears to be blissfully unaware of any kind of local storage on the hosts, and is even less aware about how much the VM uses.

I'm therefore starting on adding some support in the core of OpenNebula for two components:

  • a Local Storage MAD (LS_MAD): which will be configured per host (in the host template, much like how virtual networking is configured) that provisions local storage for cache and volumes
  • a Local Cache MAD (LC_MAD): which configures a local cache driver (e.g. FlashCache, Enhanced I/O, dm-cache) to take the block device presented by the transfer manager and combine it with the locally provisioned cache.

Then my existing cephcache driver can have the cache and LVM parts stripped out: the new driver would be called "kceph", the "k" denoting that this uses the kernel RBD driver, rather than the userspace librbd driver used by "ceph".

Presently, TM_MAD, in its ln step, links directly to storage.

The datastore/transfer manager drivers might opt to override some caching settings for the sake of compatibility, or raise an error. For example, the ssh transfer manager can't support any cache mode other than OFFLINE. Modes other than NONE require the image to be in, or converted to, a raw format.
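
A minimal sketch of those two compatibility rules, assuming the cache mode, transfer manager name and image format arrive as plain strings. The function name and the choice to return a list of problems are illustrative only; they are not part of any existing OpenNebula API, and a real driver might override the setting instead of reporting an error.

```python
def check_cache_mode(cache_mode, tm_mad, image_format):
    """Return a list of compatibility problems; an empty list means the mode is usable.

    Encodes only the two rules stated in the text:
      * the ssh transfer manager supports OFFLINE only
      * any mode other than NONE needs a raw image
    """
    problems = []
    if tm_mad == "ssh" and cache_mode != "OFFLINE":
        problems.append("ssh transfer manager supports only cache mode OFFLINE")
    if cache_mode != "NONE" and image_format != "raw":
        problems.append("cache modes other than NONE need a raw image")
    return problems


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(check_cache_mode("BACK", "ssh", "qcow2"))
    print(check_cache_mode("OFFLINE", "ssh", "raw"))
```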

Here, I plan to first call on LS_MAD to provision whatever storage is needed to store the local image; there would be provision and unprovision scripts that do this. LS_MAD would be asked to provision some cache area and would return a device path to the storage (e.g. /dev/dm-123 for LVM, /dev/loop123 for file). This would be skipped when cache_mode == NONE. For an LVM-based LS_MAD driver, this would be a wrapper around lvcreate, lvdisplay and lvremove.
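
To give an idea of how thin the LVM wrapper could be, here is a sketch that builds the three command lines. The function names, the volume naming scheme and the use of the /dev/VG/LV symlink as the returned path are my assumptions for illustration; the real provision and unprovision scripts may look quite different. Nothing is executed.

```python
def lvm_provision_cmd(volume_group, name, size_mb):
    """lvcreate command line for a new logical volume of size_mb megabytes."""
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    return ["lvcreate", "-L", f"{size_mb}M", "-n", name, volume_group]


def lvm_query_cmd(volume_group, name):
    """lvdisplay command line to inspect an existing logical volume."""
    return ["lvdisplay", f"{volume_group}/{name}"]


def lvm_unprovision_cmd(volume_group, name):
    """lvremove command line; -f skips the interactive confirmation."""
    return ["lvremove", "-f", f"{volume_group}/{name}"]


def lvm_device_path(volume_group, name):
    """Path handed back to the caller (a symlink to the underlying /dev/dm-N node)."""
    return f"/dev/{volume_group}/{name}"


if __name__ == "__main__":
    # Hypothetical volume group and volume name.
    print(" ".join(lvm_provision_cmd("vg_cache", "one-42-disk-0", 4096)))
    print(lvm_device_path("vg_cache", "one-42-disk-0"))
    print(" ".join(lvm_unprovision_cmd("vg_cache", "one-42-disk-0")))
```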

TM_MAD would instead use a mount step, which would make calls on centralised storage to mount or retrieve the image, returning the path to the image on the host. The TM_MAD would be passed the path to that local storage.

  • For the shared TM_MAD driver, it would simply return the path to where that image resides on the host (e.g. /var/lib/one/datastores/123/...)
  • For the ssh TM_MAD driver, it could download a copy of the image, writing it directly over the locally provisioned block device, then return that device as the local path.
  • For kceph, it would do an rbd map and return the /dev/rbdX device.
  • An iSCSI driver would do similar: mount the iSCSI target and return the /dev/sdX device it generates.

Having received a device name from TM_MAD's mount command, the ln command in LC_MAD takes over. It is passed the results of TM_MAD mount (source_device) and LS_MAD provision (local_device):

  • If (source_device == local_device) or (cache_mode == NONE):
    • let vm_device = source_device
  • Else:
    • Call set-up functions to create cached block device (e.g. flashcache-create)
    • let vm_device = /path/to/provisioned/cache/device (e.g. /dev/dm-345)
    • If image_precache > 0:
      • Read in image_precache MB of vm_device to fill some of the cache with the initial image_precache MB of data (to speed up booting of some VMs)
  • Link vm_device to the system datastore
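
The decision in the ln step can be sketched as follows. This is only an illustration: `create_cached_device` is a hypothetical callback standing in for whatever set-up the cache driver does (e.g. the FlashCache set-up), and the final link into the system datastore is left out.

```python
def choose_vm_device(source_device, local_device, cache_mode, create_cached_device):
    """Pick the block device the VM should see.

    create_cached_device is a hypothetical callback that combines the source
    and local devices into a cached device and returns its path.
    """
    if source_device == local_device or cache_mode == "NONE":
        # Nothing to cache: hand the source straight to the VM.
        return source_device
    return create_cached_device(source_device, local_device, cache_mode)


def precache(device_path, precache_mb, chunk_size=1024 * 1024):
    """Read and discard the first precache_mb MB of the device to warm the cache.

    Returns the number of bytes actually read.
    """
    if precache_mb <= 0:
        return 0
    remaining = precache_mb * 1024 * 1024
    total = 0
    with open(device_path, "rb") as dev:
        while remaining > 0:
            data = dev.read(min(chunk_size, remaining))
            if not data:  # device shorter than the requested amount
                break
            total += len(data)
            remaining -= len(data)
    return total


if __name__ == "__main__":
    # Stand-in callback; a real one would run the cache driver's set-up tool.
    fake_setup = lambda src, local, mode: "/dev/dm-345"
    print(choose_vm_device("/dev/rbd0", "/dev/dm-123", "THRU", fake_setup))
    print(choose_vm_device("/dev/rbd0", "/dev/dm-123", "NONE", fake_setup))
```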

As an addition, I'll be making some extensions to the scheduler to improve the placement decisions, allowing the scheduler to place a VM based on the amount of local storage required by the VM. In order to do this, I plan to make the following changes:

  • Images/disks will carry 3 new attributes:
    • CACHE_MODE: The caching mode to use for a given image or disk, where:
      • NONE = No caching, the image is directly accessed from shared storage and is not transferred to the host
      • AROUND = Write-Around caching: only reads are cached, writes bypass cache completely.
      • THRU = Write-Thru caching: Writes are written simultaneously to back-end storage and cache.
      • BACK = Write-Back caching: Writes are stored locally in cache first, then later flushed to back-end storage.
      • OFFLINE = The entire image is copied to cache during the VM's PROLOG phase, and is copied back in the EPILOG phase. (i.e. like when TM_MAD==ssh)
    • CACHE_SIZE: The size of the cache in MB
    • PRELOAD_SIZE: The amount to pre-read during the PROLOG phase to speed up launching of VMs.
  • It may be desirable for the datastore to have these parameters which can be inherited as defaults.
  • Hosts will carry 4 new parameters:
    • LS_MAD: Local Storage Middleware Access Driver, specifies what driver is used for local storage:
      • dummy: No local storage provisioning
      • file: Provisioning using raw loopback files
      • lvm: Provisioning using an LVM volume
    • LS_PATH: Path on the host for local storage
      • For LS_MAD==file, the directory where volume files will be kept
      • For LS_MAD==lvm, the LVM volume group
    • LS_MAX: The maximum size of a cache partition in MB.
    • LC_MAD: The Local Cache Middleware Access Driver, e.g.
      • dummy: No cache provisioning
      • flashcache: Local caching using FlashCache

For the sake of simplicity, each host will have just one LS_MAD driver (and pool), and one LC_MAD driver, which will be used for all VMs deployed to that host.
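
As a rough idea of the check the scheduler could make with these attributes, here is a small sketch. It assumes the host reports its free local storage in MB (`free_local_mb`, a made-up parameter, since how that is reported isn't settled yet) and it only counts CACHE_SIZE; sizing for OFFLINE mode, where the whole image is copied, is left out.

```python
def host_can_cache(disks, ls_max_mb, free_local_mb):
    """Decide whether a host has enough local storage for a VM's cached disks.

    disks:         list of dicts carrying CACHE_MODE and CACHE_SIZE (MB)
    ls_max_mb:     the host's LS_MAX (largest allowed cache partition, MB)
    free_local_mb: hypothetical figure for free local storage on the host (MB)
    """
    needed = 0
    for disk in disks:
        if disk.get("CACHE_MODE", "NONE") == "NONE":
            continue  # uncached disks use no local storage
        size = int(disk.get("CACHE_SIZE", 0))
        if size > ls_max_mb:
            return False  # a single cache partition would exceed LS_MAX
        needed += size
    return needed <= free_local_mb


if __name__ == "__main__":
    # Hypothetical VM with one cached and one uncached disk.
    vm_disks = [
        {"CACHE_MODE": "THRU", "CACHE_SIZE": 2048},
        {"CACHE_MODE": "NONE"},
    ]
    print(host_can_cache(vm_disks, ls_max_mb=4096, free_local_mb=10240))
```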

I have cloned the OpenNebula git repository and will be making my changes here:

#4 Updated by Florian Heigl about 1 month ago

Hi Stuart,

I've had the same issues with ssd-cached disk backends of other types.
I'd love to try out at least a part of what you built there, unfortunately your git is a tad broken.

If you happen to fix it at some point, could you post an update here?
No idea why it didn't go into the addon marketplace, I think it would also help with the newer ceph cache pools.

Ah and finally, in one case a fw update of the caching SSDs has improved things a lot for me. They were worn and tired and now they feel better :-)
