IaaS Reconciliation
Note
Updated for bosh-release v183 (1.3010.0).
BOSH provides the bosh cloud-check CLI command (a.k.a bosh cck) to repair
IaaS resources used by a specific deployment. It is not commonly used during
normal operations; however, it becomes essential when some IaaS operations
fail and the Director cannot resolve problems without a human decision or when
the Resurrector is not enabled.
The Resurrector will only try to recover any VMs that are missing from the
IaaS or that have non-responsive agents. The bosh cck tool is similar to the
Resurrector in that it also looks for those two conditions; however, instead
of automatically trying to resolve these problems, it provides several options
to the operator.
In addition to looking for those two types of problems, bosh cck also checks
correct attachment and presence of persistent disks for each deployment job
instance.
Once the deployment is set via the bosh deployment command you can simply
run bosh cck. Here is an example output when no problems are detected:
bosh cck
Performing cloud check...
Processing deployment manifest
------------------------------
Director task 622
  Started scanning 1 vms
  Started scanning 1 vms > Checking VM states. Done (00:00:00)
  Started scanning 1 vms > 1 OK, 0 unresponsive, 0 missing, 0 unbound, 0 out of sync. Done (00:00:00)
     Done scanning 1 vms (00:00:00)
  Started scanning 0 persistent disks
  Started scanning 0 persistent disks > Looking for inactive disks. Done (00:00:00)
  Started scanning 0 persistent disks > 0 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00)
     Done scanning 0 persistent disks (00:00:00)
Task 622 done
Started     2015-01-09 23:29:34 UTC
Finished    2015-01-09 23:29:34 UTC
Duration    00:00:00
Scan is complete, checking if any problems found...
No problems found
Problems¶
VM is missing¶
Assuming there was a deployment with a VM, somehow that VM was deleted from
the IaaS outside of BOSH, here is what cloud-check would report:
bosh cck
Performing cloud check...
Processing deployment manifest
------------------------------
Director task 623
  Started scanning 1 vms
  Started scanning 1 vms > Checking VM states. Done (00:00:10)
  Started scanning 1 vms > 0 OK, 0 unresponsive, 1 missing, 0 unbound, 0 out of sync. Done (00:00:00)
     Done scanning 1 vms (00:00:10)
  Started scanning 0 persistent disks
  Started scanning 0 persistent disks > Looking for inactive disks. Done (00:00:00)
  Started scanning 0 persistent disks > 0 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00)
     Done scanning 0 persistent disks (00:00:00)
Task 623 done
Started     2015-01-09 23:32:45 UTC
Finished    2015-01-09 23:32:56 UTC
Duration    00:00:11
Scan is complete, checking if any problems found...
Found 1 problem
Problem 1 of 1: VM with cloud ID `i-914c046a' missing.
  1. Skip for now
  2. Recreate VM
  3. Delete VM reference
Please choose a resolution [1 - 3]: 3
Below is the list of resolutions you've provided
Please make sure everything is fine and confirm your changes
  1. VM with cloud ID `i-914c046a' missing.
     Delete VM reference
Apply resolutions? (type 'yes' to continue): yes
Applying resolutions...
Director task 624
  Started applying problem resolutions > missing_vm 168: Delete VM reference. Done (00:00:00)
Task 624 done
Started     2015-01-09 23:33:20 UTC
Finished    2015-01-09 23:33:20 UTC
Duration    00:00:00
Cloudcheck is finished
bosh cck determined that the i-914c046a VM was missing. Possible options are:
- 
Skip for now: the Director will not try to resolve this problem right now
- 
Recreate VM: the Director will just create new VM, deploy specified releases according to deployed manifest and finally start release jobsNote on current behaviour: cckwill not wait for all release job processes to start.
- 
Delete VM reference: the Director will not create new VM in its place. If cck is run again, it will not report that VM is missing since it's not expected to exist. Runningbosh deployafter cck will backfill missing VMs.
In the above example options 3 was picked and VM reference was deleted.
VM is not responsive (unresponsive agent)¶
Assuming there was a deployment with a VM, somehow Agent is no longer
responding to the Director. In this situation bosh vms will report VM's
agent as unresponsive agent:
bosh -d simple-deployment vms --details
Deployment `simple-deployment' Director task 630 Task 630 done +-----------------+--------------------+---------------+-----+------------+--------------------------------------+--------------+ | Job/index | State | Resource Pool | IPs | CID | Agent ID | Resurrection | +-----------------+--------------------+---------------+-----+------------+--------------------------------------+--------------+ | unknown/unknown | unresponsive agent | | | i-1db9ede6 | 59a30081-d63d-4c1b-80be-01fa681d8787 | active | +-----------------+--------------------+---------------+-----+------------+--------------------------------------+--------------+ VMs total: 1
Also bosh deploy will stop at Binding existing deployment stage since it is not able to communicate with unresponsive agent:
bosh deploy
..snip... Deploying --------- Deployment name: `tiny-dummy.yml' Director name: `idora' Are you sure you want to deploy? (type 'yes' to continue): yes Director task 631 Started preparing deployment Started preparing deployment > Binding deployment. Done (00:00:00) Started preparing deployment > Binding releases. Done (00:00:00) Started preparing deployment > Binding existing deployment. Failed: Timed out sending `get_state' to 59a30081-d63d-4c1b-80be-01fa681d8787 after 45 seconds (00:02:15) Error 450002: Timed out sending `get_state' to 59a30081-d63d-4c1b-80be-01fa681d8787 after 45 seconds Task 631 error For a more detailed error report, run: bosh task 631 --debug
bosh cck
Performing cloud check...
Processing deployment manifest
------------------------------
Director task 640
  Started scanning 1 vms
  Started scanning 1 vms > Checking VM states. Done (00:00:10)
  Started scanning 1 vms > 0 OK, 1 unresponsive, 0 missing, 0 unbound, 0 out of sync. Done (00:00:00)
     Done scanning 1 vms (00:00:10)
  Started scanning 0 persistent disks
  Started scanning 0 persistent disks > Looking for inactive disks. Done (00:00:00)
  Started scanning 0 persistent disks > 0 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00)
     Done scanning 0 persistent disks (00:00:00)
Task 640 done
Started   2015-01-09 23:33:45 UTC
Finished  2015-01-09 23:33:55 UTC
Duration  00:00:10
Scan is complete, checking if any problems found...
Found 1 problem
Problem 1 of 1: dummy/0 (i-914c046a) is not responding.
  1. Skip for now
  2. Reboot VM
  3. Recreate VM
  4. Delete VM reference (forceful; may need to manually delete VM from the Cloud to avoid IP conflicts)
Please choose a resolution [1 - 4]: 4
Below is the list of resolutions you've provided
Please make sure everything is fine and confirm your changes
  1. dummy/0 (i-914c046a) is not responding.
     Delete VM reference (forceful; may need to manually delete VM from the Cloud to avoid IP conflicts)
Apply resolutions? (type 'yes' to continue): yes
Applying resolutions...
Director task 641
  Started applying problem resolutions > unresponsive_agent 168: Delete VM reference (...). Done (00:00:05)
Task 641 done
Started   2015-01-09 23:35:20 UTC
Finished  2015-01-09 23:35:25 UTC
Duration  00:00:05
Cloudcheck is finished
cck determined that i-914c046a VM is unresponsive. Possible options are:
- 
Skip for now: the Director will not try to resolve this problem right now
- 
Reboot VM: the Director will power off and then power on existing VMNote on current behaviour: cckwill not wait for all release job processes to start.
- 
Recreate VM: the Director will delete existing VM, then create a new VM, deploy specified releases according to deployed manifest and finally start release jobsNote on current behaviour: cckwill not wait for all release job processes to start.
- 
Delete VM reference: the Director will not create new VM in its place. Ifcckis run again, it will not report that VM is unresponsive since it does not exist. Runningbosh deployafter cck will backfill missing VMs.
In the above example options 4 was picked and VM reference was deleted.
Persistent Disk is not attached¶
Assuming there was a deployment with a VM, somehow persistent disk got detached.
bosh cck
Performing cloud check...
Processing deployment manifest
------------------------------
Director task 656
  Started scanning 1 vms
  Started scanning 1 vms > Checking VM states. Done (00:00:00)
  Started scanning 1 vms > 1 OK, 0 unresponsive, 0 missing, 0 unbound, 0 out of sync. Done (00:00:00)
     Done scanning 1 vms (00:00:00)
  Started scanning 1 persistent disks
  Started scanning 1 persistent disks > Looking for inactive disks. Done (00:00:00)
  Started scanning 1 persistent disks > 0 OK, 0 missing, 0 inactive, 1 mount-info mismatch. Done (00:00:00)
     Done scanning 1 persistent disks (00:00:00)
Task 656 done
Started   2015-01-13 22:04:56 UTC
Finished  2015-01-13 22:04:56 UTC
Duration  00:00:00
Scan is complete, checking if any problems found...
Found 1 problem
Problem 1 of 1: Inconsistent mount information:
Record shows that disk 'vol-549f071f' should be mounted on i-4fcd99b4.
However it is currently :
  Not mounted in any VM.
  1. Skip for now
  2. Reattach disk to instance
  3. Reattach disk and reboot instance
Please choose a resolution [1 - 3]: 2
Below is the list of resolutions you've provided
Please make sure everything is fine and confirm your changes
  1. Inconsistent mount information:
Record shows that disk 'vol-549f071f' should be mounted on i-4fcd99b4.
However it is currently :
  Not mounted in any VM
     Reattach disk to instance
Apply resolutions? (type 'yes' to continue): yes
Applying resolutions...
Director task 657
  Started applying problem resolutions > mount_info_mismatch 23: Reattach disk to instance. Done (00:00:22)
Task 657 done
Started   2015-01-13 22:05:19 UTC
Finished  2015-01-13 22:05:41 UTC
Duration  00:00:22
Cloudcheck is finished
cck determined that vol-549f071f persistent disk is not attached to i-4fcd99b4 VM. Possible options are:
- 
Skip for now: the Director will not try to resolve this problem right now
- 
Reattach disk to instance: the Director will reattach persistent disk to the VM and mount it at its usual location/var/vcap/store.Note on current behavior: Release job processes will not be restarted when persistent disk is remounted. 
- 
Reattach disk and reboot instance: the Director will reattach persistent disk to the VM and reboot it so that Agent can safely mount persistent disk before starting any release job processes.Note on current behavior: cck will not wait until VM reboots and restarts all release job processes. 
Inactive Disk¶
Assuming there are are disks that are marked as inactive,
bosh cck
Using environment '10.0.0.5' as client 'admin' Using deployment 'test-deployment' Task 417 Task 417 | 23:59:42 | Scanning 7 VMs: Checking VM states (00:00:06) Task 417 | 23:59:48 | Scanning 7 VMs: 7 OK, 0 unresponsive, 0 missing, 0 unbound (00:00:00) Task 417 | 23:59:48 | Scanning 2 persistent disks: Looking for inactive disks (00:00:01) Task 417 | 23:59:49 | Scanning 2 persistent disks: 0 OK, 0 missing, 2 inactive, 0 mount-info mismatch (00:00:00) Task 417 Started Tue Apr 21 23:59:42 UTC 2020 Task 417 Finished Tue Apr 21 23:59:49 UTC 2020 Task 417 Duration 00:00:07 Task 417 done # Type Description 9 inactive_disk Disk 'disk-eaca0b50-daf5-4fba-6dbf-06354e11e0af' (102400M) for instance 'database/5cc2ab64-a269-4c91-b6ba-9a6ff3c2d9d3 (0)' is inactive 10 inactive_disk Disk 'disk-270c7959-5f06-4597-727c-6c4283afd138' (102400M) for instance 'blobstore/f048378e-27c8-4ab6-b08a-1c47d09570c5 (0)' is inactive 2 problems 1: Skip for now 2: Delete disk 3: Activate disk Disk 'disk-eaca0b50-daf5-4fba-6dbf-06354e11e0af' (102400M) for instance 'database/5cc2ab64-a269-4c91-b6ba-9a6ff3c2d9d3 (0)' is inactive (1): 3 1: Skip for now 2: Delete disk 3: Activate disk Disk 'disk-270c7959-5f06-4597-727c-6c4283afd138' (102400M) for instance 'blobstore/f048378e-27c8-4ab6-b08a-1c47d09570c5 (0)' is inactive (1): 3 Continue? [yN]: y Task 419 Task 419 | 00:00:43 | Applying problem resolutions: Disk 'disk-eaca0b50-daf5-4fba-6dbf-06354e11e0af' (102400M) for instance 'database/5cc2ab64-a269-4c91-b6ba-9a6ff3c2d9d3 (0)' is inactive (inactive_disk 1): Activate disk (00:00:06) Task 419 | 00:00:49 | Applying problem resolutions: Disk 'disk-270c7959-5f06-4597-727c-6c4283afd138' (102400M) for instance 'blobstore/f048378e-27c8-4ab6-b08a-1c47d09570c5 (0)' is inactive (inactive_disk 2): Activate disk (00:00:06) Task 419 Started Wed Apr 22 00:00:43 UTC 2020 Task 419 Finished Wed Apr 22 00:00:55 UTC 2020 Task 419 Duration 00:00:12 Task 419 done Succeeded
cck determined that disk-eaca0b50-daf5-4fba-6dbf-06354e11e0af persistent disk is inactive. Possible options are:
- 
Skip for now: the Director will not try to resolve this problem right now
- 
Delete disk: the Director will check if the disk is unmounted, detach from any active VM, and mark it as an orphaned disk
- 
Activate disk: the director will check if the disk is mounted and if the active VM already has a disk before marking the disk as active
Persistent Disk is missing¶
Assuming there was a deployment with a VM, somehow persistent disk got deleted.
Note: Not all CPIs implement needed functionality to determine if disk is missing. Those CPIs will report missing disk as Persistent Disk is not attached problem; however, both reattaching resolutions will fail since persistent disk would not be found.
Automate recovery selection¶
Automate recovery using the --resolution=RESOLUTION-VALUE cloud-check (cck)
option, where RESOLUTION-VALUE represents one of the following:
- ignore- skip resolution
- recreate_vm- recreate VM and wait for processes to start
- recreate_vm_without_wait- recreate VM without waiting for processes to start (and thus- post-startscripts won't run after processes have restarted, which could have consequences depending on how critical- post-startscripts are)
- reboot_vm- reboot the VM
- delete_vm- delete the VM
- delete_vm_reference- remove the VM reference that Director has (this could cause IaaS resources to be abandoned)
- delete_disk- delete the disk
- delete_disk_reference- remove the disk reference that Director has (this could cause IaaS resources to be abandoned)
- activate_disk- mark the disk as active
- reattach_disk- reattach persistent disk to the VM and mount it at its usual location- /var/vcap/store
- reattach_disk_and_reboot- reattach the persistent disk to the VM and reboot it so that Agent can safely mount persistent disk before starting any release job processes. cck will not wait until VM reboots and restarts all release job processes
Warning
Consider using cck interactively because the selected resolution will be
applied to all problems that are found. Specifying --resolution can be
risky if new, unexpected problems occur while you run the command and
selected resolution may no longer be appropriate.