Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand FAILED_PRECONDITION error in NodeExpandVolume RPC call #431

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

gnufied
Copy link
Contributor

@gnufied gnufied commented May 14, 2020

Expand the error from NodeExpandVolume to include readonly volume errors and add additional context about returning FAILED_PRECONDITION error if NodeExpandVolume is not supported for node-published volumes.

@gnufied
Copy link
Contributor Author

gnufied commented May 14, 2020

cc @bswartz

@gnufied gnufied changed the title Change error associated with readonly volume expansion Change error associated with readonly NodeExpandVolume May 14, 2020
Copy link
Contributor

@bswartz bswartz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gnufied gnufied force-pushed the fix-readonly-handling branch 2 times, most recently from 23d227b to f72fbbe Compare May 21, 2020 19:18
@gnufied
Copy link
Contributor Author

gnufied commented May 22, 2020

The build failures on this branch are unrelated to this PR. Filed #434 for tracking.

@gnufied gnufied changed the title Change error associated with readonly NodeExpandVolume Expand FAILED_PRECONDITION error in NodeExpandVolume RPC call May 22, 2020
@gnufied gnufied force-pushed the fix-readonly-handling branch from 898aec4 to bd02984 Compare May 22, 2020 15:21
@@ -2576,7 +2580,7 @@ message NodeExpandVolumeResponse {
|-----------------------|-----------|-----------------------|-----------------------------------|
| Exceeds capabilities | 3 INVALID_ARGUMENT | Indicates that CO has specified capabilities not supported by the volume. | Caller MAY verify volume capabilities by calling ValidateVolumeCapabilities and retry with matching capabilities. |
| Volume does not exist | 5 NOT FOUND | Indicates that a volume corresponding to the specified volume_id does not exist. | Caller MUST verify that the volume_id is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Volume in use | 9 FAILED_PRECONDITION | Indicates that the volume corresponding to the specified `volume_id` could not be expanded because it is node-published or node-staged and the underlying filesystem does not support expansion of published or staged volumes. | Caller MUST NOT retry. |
| Volume in use or readonly | 9 FAILED_PRECONDITION | Indicates that the volume corresponding to the specified `volume_id` could not be expanded because it is node-published or node-staged or readonly and the storage provider does not support expansion of published or staged or readonly volumes. | Caller MUST NOT retry. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| Volume in use or readonly | 9 FAILED_PRECONDITION | Indicates that the volume corresponding to the specified `volume_id` could not be expanded because it is node-published or node-staged or readonly and the storage provider does not support expansion of published or staged or readonly volumes. | Caller MUST NOT retry. |
| Volume in use or readonly | 9 FAILED_PRECONDITION | Indicates that the volume corresponding to the specified `volume_id` could not be expanded because it is node-published or node-staged or readonly and the storage provider does not support expansion of published or staged or readonly volumes. | Caller SHOULD ensure that volume is not node-published before retrying with exponential back off. |

@@ -2523,12 +2523,16 @@ If plugin has `STAGE_UNSTAGE_VOLUME` node capability then:

Otherwise `NodeExpandVolume` MUST be called after successful `NodePublishVolume`.

A plugin that has `STAGE_UNSTAGE_VOLUME` node capability and supports `NodeExpandVolume` ONLY after `NodeStageVolume` and before `NodePublishVolume` but not after `NodePublishVolume` may return `FAILED_PRECONDITION` error code if `NodeExpandVolume` is called after `NodePublishVolume` - CO MUST NOT retry if volume is node-published.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to say CO MUST NOT retry if volume is node-published? Because that means offline is not supported?

Suggested change
A plugin that has `STAGE_UNSTAGE_VOLUME` node capability and supports `NodeExpandVolume` ONLY after `NodeStageVolume` and before `NodePublishVolume` but not after `NodePublishVolume` may return `FAILED_PRECONDITION` error code if `NodeExpandVolume` is called after `NodePublishVolume` - CO MUST NOT retry if volume is node-published.
An SP that has `STAGE_UNSTAGE_VOLUME` node capability and supports `NodeExpandVolume` ONLY after `NodeStageVolume` and before `NodePublishVolume` but not after `NodePublishVolume` may return `FAILED_PRECONDITION` error code if `NodeExpandVolume` is called after `NodePublishVolume` and, in response, the CO SHOULD ensure that the volume is is node-staged but not node-published before retrying with exponential backoff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did mean "CO MUST NOT retry if volume is node-published". The reason being - if a SP does not allow expansion after node-publish and volume already has been node-published, the only way NodeExpandVolume can succeed after retry is - if they NodeUnpublish the volume and the only way CO can retry expansion is on another node after node-stage but before node-publish. For CO it is already too late to retry expansion on this node because NodeUnpublish is very likely to be a unsafe operation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because that means offline is not supported?

I think it does not mean that - it means that CO must try expansion on another node after node-stage and before node-publish. It just means it can't happen on this node. Having said that - there is a slight problem with language:

If plugin has STAGE_UNSTAGE_VOLUME node capability then:

  • NodeExpandVolume MUST be called after successful NodeStageVolume.
  • NodeExpandVolume MAY be called before or after NodePublishVolume.

I know Kubernetes always calls NodeExpandVolume after NodeStageVolume if NodeExpand is supported but in general a different CO may choose to interpret this differently and may skip calling NodeExpand between node-stage and node-publish and only call NodeExpand after node-publish in which case "offline" expansion may not work on that particular CO. Even then I think recovery from volume in-use error on node shouldn't be "retry with exponential backoff" because it is highly problematic to node-unpublish a volume or if volume is readonly on a node, make it RW.

I think we have two options:

  1. Make no spec change for this particular behaviour in NodeExpandVolume. The downside might be - another CO may interpret this spec differently and implement call to NodeExpandVolume differently and plugins might be dependent on k8s behaviour (although I don't know any plugins which want this behaviour).
  2. Make the language change around when NdoeExpandVolume can be called. so something like - "it is desirable to call NodeExpandVolume before NodePublish to permit expansion of volumes which only permit expansion after node-stage and before node-publish and not after node-publish"

@gnufied gnufied force-pushed the fix-readonly-handling branch from bd02984 to 0d6a8ba Compare July 15, 2020 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants