We do use cPanel's system backups to back up the account files and also the "default" files and directories. However, I'm not sure these are enough.
Disaster scenario #1: Let's say a VPS breaks down. Right now there are 2 solutions: 1) restore a VPS image backup and 2) set up cPanel, CloudLinux, etc. and then restore each account and the system settings. Both solutions will take many hours to complete.
Yep, a restore could take hours (I remember pulling a 36-hour "shift" to restore a failed server once). Hard drives and computers are only so fast.
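To put a rough number on "only so fast", here is a minimal back-of-the-envelope sketch. The function name `restore_hours` and the sample sizes/throughputs are my own illustrative assumptions, not measurements from any real server, and it only counts raw copy time (no reinstalling, verifying or DNS changes).

```python
def restore_hours(data_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to copy data_gb gigabytes at a sustained throughput_mb_s.

    Hypothetical helper: counts raw transfer time only and assumes
    1 GB = 1000 MB.
    """
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (data_gb * 1000) / throughput_mb_s
    return seconds / 3600


if __name__ == "__main__":
    # Sample figures are assumptions chosen purely for illustration.
    for size_gb, rate in [(500, 100), (2000, 100), (2000, 30)]:
        print(f"{size_gb} GB at {rate} MB/s: {restore_hours(size_gb, rate):.1f} h")
```

Real restores tend to be slower than the sustained figure suggests, especially with lots of small files, so treat the result as a lower bound.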
If you have a VPS snapshot of the running system, you do not then need to set up cPanel/CloudLinux etc. - restoring the snapshot should restore everything. If you only have a "template snapshot" you use for your VPS, then yes, you will have to manually reinstall CloudLinux and WHM/cPanel before applying the backups (but if you have any deployment software in place, that reinstallation shouldn't take too long).
The only way(s) to avoid a restore taking hours are:
* By having a warm/hot spare - i.e. a "standby" server to which changes from your "main/live" server are copied very frequently - if your main server goes down, you just need to switch the IP addresses over to the standby and you should be up and running with minimal loss (emails that reached the old server after the most recent sync will be lost, but that's better than the hours lost if you have to restore from yesterday's backup).
* Having all the data on a network share/mount using a SAN (Storage Area Network) or NAS (Network Attached Storage). If your main server dies, the data is available for another server to take over. However, if the SAN/NAS fails in any way, then you have to resort to backups again (a good SAN should be more reliable than a "common server", but odd issues can still occur - decades ago, I had to resolve an issue where the file allocation table on a SAN got corrupted: all the files were still there, but the system had no idea "where")...
Disaster scenario #2: Let's say a dedicated server dies. The solution is to set up the Virtual Environment and then restore each VPS image backup. It will certainly take a day to complete.
You have two ways of restoring a VPS Host server: either per VPS image (which means that sites/accounts will gradually come back online as they are being restored) or "entire machine" (usually a block-level backup) - either way, it takes a while to restore.
However, if you take frequent VPS snapshots from server 'A' and restore them to server 'B' (even if they are transferred across in a "powered-off"/"frozen" state) and then server A fails, you can just start them on server B instead (see the warm/hot spare solution above).
Important Question: Is what we are doing enough? Can we do anything else to decrease the time needed to come back online and to also increase the chances of success/decrease the chances of failure?
What would you do differently in both scenarios? Have you heard of anything better?
Disaster Recovery procedures are a whole can of worms... You first need to figure out what you are actually protecting against. A hot spare is good for hardware failure, but not so much against a malicious employee or a virus changing the data. Backups are good for multiple "point in time" recoveries, which give you a chance to roll back to before the employee/virus struck. However, if you keep the backups/hot spare in the same datacentre (which will make backups faster), a datacentre fire means the data is lost. I would try and figure out how much an hour of outage would "cost" you (if you are a web hosting provider promising 99% uptime, that means you can be offline for over 3 days per year without your "compensation" kicking in) and then work from there. It's no good speccing a "99.999%" (about 5 minutes of downtime per year) backup/recovery system if the most an outage will cost you is $100 per day...
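The uptime figures above are simple arithmetic, sketched below in case you want to plug in your own SLA. The function names and the use of a 365-day year are my assumptions; check how your own SLA defines its measurement period.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Hours per year you can be offline while still meeting the uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def outage_cost(outage_hours: float, cost_per_hour: float) -> float:
    """Rough cost of an outage, given your own estimate of the hourly cost."""
    return outage_hours * cost_per_hour


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime: {hours / 24:.2f} days ({hours * 60:.1f} minutes) per year")
```

Comparing `outage_cost` for a realistic outage against the yearly price of a hot spare or SAN is one way to decide whether the extra kit is worth it.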
I would avoid cheap hardware/datacentres and have a mixed backup strategy to cover major issues (so maybe keep the previous hour's/day's backup on a secondary drive in the machine, the previous days'/weeks' backups on a different server in the same datacentre, and the backup from the week before that somewhere else entirely - I've found Amazon Glacier is quite good for stuff you hope you never have to touch).
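As a rough sketch of that mixed strategy, the snippet below maps a backup's age to where it would live. The cut-offs (1 day, 14 days) and the `tier_for` name are assumptions I picked for illustration; tune them to whatever your storage and budget allow.

```python
from datetime import timedelta

# (maximum age, location) pairs, checked in order. Thresholds are assumptions.
TIERS = [
    (timedelta(days=1), "secondary drive in the same machine"),
    (timedelta(days=14), "different server, same datacentre"),
]
OFFSITE = "off-site archive"  # anything older than the last tier


def tier_for(age: timedelta) -> str:
    """Return the storage location for a backup of the given age."""
    if age < timedelta(0):
        raise ValueError("age cannot be negative")
    for limit, location in TIERS:
        if age <= limit:
            return location
    return OFFSITE


if __name__ == "__main__":
    for age in (timedelta(hours=3), timedelta(days=5), timedelta(days=30)):
        print(f"{age}: {tier_for(age)}")
```

The point of the split is that the fastest-to-restore copies cover the common failures, while the slow off-site copy is there for the datacentre-level disaster.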