Backlog #4420

Add quota for VCPU

Added by Bill Cole over 4 years ago. Updated over 4 years ago.

Status: Pending
Start date: 04/15/2016
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: Core & System
Target version: -

Description

To keep users from the problematic practice of running VMs at high VCPU/CPU ratios, it would be helpful to have a VCPU quota in addition to the CPU quota. This would be particularly helpful where hosts are loaded near their practical limits with interdependent VMs. We (CipherSpace) have seen a user effectively take down a host by launching multiple VMs with an aggregate VCPU/CPU ratio over 4 and an aggregate CPU that consumed all unreserved CPU time.

This sort of deployment seems functional under light actual load, but when all of the VMs need more CPU at once (e.g. as interconnected pieces of a modular application), the host can be sent into a failure mode characterized by very high context-switching rates (>100k/sec) and generally very slow response. This can cascade into UDP monitoring failure, which results in a fallback to SSH monitoring that worsens the load while also failing. Ultimately this can trigger a migration attempt if the HOST_HOOK is enabled, and we have seen that migration fail in a mode that leaves VMs running on two hosts accessing the same shared images, corrupting their filesystems.
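
To make the aggregate VCPU/CPU figure concrete, here is a minimal Python sketch of the arithmetic. The `VM` class, the `aggregate_ratio` function, and the sample values are hypothetical and only illustrate the ratio described above; they are not part of OpenNebula.

```python
from dataclasses import dataclass


@dataclass
class VM:
    """Hypothetical record holding the two values discussed in this issue."""
    name: str
    cpu: float  # share of physical CPU requested for the VM
    vcpu: int   # number of virtual CPUs exposed to the guest


def aggregate_ratio(vms):
    """Return total VCPU divided by total CPU across a set of VMs."""
    total_cpu = sum(vm.cpu for vm in vms)
    total_vcpu = sum(vm.vcpu for vm in vms)
    if total_cpu <= 0:
        raise ValueError("total CPU must be positive")
    return total_vcpu / total_cpu


# Illustrative values only: three VMs asking for little CPU but many VCPUs.
vms = [
    VM("app", cpu=0.5, vcpu=4),
    VM("db", cpu=1.0, vcpu=4),
    VM("cache", cpu=0.5, vcpu=2),
]
ratio = aggregate_ratio(vms)  # 10 VCPU over 2.0 CPU gives 5.0
```

With these made-up numbers the set of VMs requests only 2.0 CPU in total, yet the guests can schedule 10 virtual CPUs at once, which is the kind of ratio (over 4) reported in the incident.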

We currently handle this risk by cautioning users against creating VMs with VCPU/CPU > 2, but that oversimplifies the risk and relies on users behaving prudently. Limiting the total active VCPU for each customer, based on the specific nature of their VM usage, would give users more flexibility within an enforced limit and would model the risk more accurately.
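
As a rough illustration of the requested behaviour, the following Python sketch checks a new VM against both a CPU limit and a VCPU limit. It assumes a simple per-customer usage counter. `Usage`, `Quota`, and `can_deploy` are hypothetical names and do not reflect how the OpenNebula quota subsystem is implemented.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical running totals for one customer's active VMs."""
    cpu: float = 0.0
    vcpu: int = 0


@dataclass
class Quota:
    """Hypothetical per-customer limits: the CPU quota plus the proposed VCPU quota."""
    cpu: float
    vcpu: int


def can_deploy(usage, quota, req_cpu, req_vcpu):
    """Allow a new VM only if both the CPU and the VCPU totals stay within quota."""
    cpu_ok = usage.cpu + req_cpu <= quota.cpu
    vcpu_ok = usage.vcpu + req_vcpu <= quota.vcpu
    return cpu_ok and vcpu_ok


# Illustrative values only.
quota = Quota(cpu=4.0, vcpu=8)
usage = Usage(cpu=2.0, vcpu=6)

allowed = can_deploy(usage, quota, req_cpu=0.5, req_vcpu=2)  # fits both limits
blocked = can_deploy(usage, quota, req_cpu=0.5, req_vcpu=4)  # exceeds the VCPU limit
```

In this sketch the second request would pass a CPU-only quota, since it asks for just 0.5 CPU, and is rejected only because of the separate VCPU limit. A user could still choose any per-VM VCPU/CPU ratio as long as the aggregate stays within the enforced total.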


Related issues

Related to Backlog #3210: Extend quota subsystem (Pending, 09/29/2014)

History

#1 Updated by Bill Cole over 4 years ago

I stumbled across http://dev.opennebula.org/issues/3210, of which I suppose this could be considered a duplicate.

#2 Updated by Ruben S. Montero over 4 years ago

  • Tracker changed from Feature to Backlog
  • Category set to Core & System
  • Priority changed from Normal to High

#3 Updated by Ruben S. Montero over 4 years ago
