I agree, rebooting a production server isn’t great, but you also have to determine when you’re not making progress.
The first step was identifying where the large files/directories were.
The second step was deleting a bunch of them, only to find the space didn’t free, which meant you
hit the third step, which was trying to identify what was still held open and what wasn’t,
which led to the fourth step, where you were looking at possibly clearing out system libraries.
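For step three, you don’t have to guess what’s held open. A sketch of how I’d check (the service name at the end is just an example):

```shell
# Space from deleted-but-open files only frees when the holding process
# closes them. lsof +L1 lists open files with a link count below 1,
# i.e. deleted but still consuming disk space.
sudo lsof +L1

# Restarting (or reloading) whatever process shows up releases the space:
# sudo systemctl restart rsyslog    # example service name
```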
Rebooting solves 2-4. Now you’re back to 1. When the space usage spikes, don’t just DELETE stuff - use du to identify the folders and files that are growing. Identify the biggest offenders. Ignore anything less than 1MB. Figure out where the space usage is. THEN we can help you formulate a targeted plan for resolving the root cause, not just patching symptoms.
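Something like this is what I mean by hunting the biggest offenders (assumes GNU du; /var is just an example mount - point it at wherever the spike is):

```shell
# Biggest directories first, anything under 1MB dropped.
# -x stays on one filesystem so you don't wander into other mounts.
sudo du -x --threshold=1M /var 2>/dev/null | sort -rn | head -20

# Then drill into whatever the top entry is and run it again.
```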
Note there are also steps you can take to limit the damage - right now you’re not using LVM, you have a single hard drive and it’s partitioned. That’s like, 0 for 3.
I always split my filesystem into (AT LEAST) /, /var, /usr. Why? Because /var is going to grow. If it grows, it can fill /var - but it’s not going to fill / or /usr. Even better would be splitting out /var/lib/docker and /var/log as well. LVM makes this way easier because you can dynamically assign space to the different areas on the fly, without rebooting. In fact, with LVM you could just attach more space as a second drive and start using it immediately to resolve the immediate problem.
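To give you an idea of the “on the fly” part, here’s a sketch - the VG/LV names and sizes are just examples, and it assumes ext4:

```shell
# Grow /var by 5G from free space in the VG, online, no reboot.
sudo lvextend -L +5G /dev/system/var
sudo resize2fs /dev/system/var      # ext4 grows while mounted
# (or lvextend -r to resize the filesystem in one step,
#  or xfs_growfs /var if it's xfs)

# No free space left in the VG? Attach a new virtual disk and add it:
sudo pvcreate /dev/sdd
sudo vgextend system /dev/sdd
```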
Of course, this goes even FURTHER off topic. There are numerous howtos on how to do these sorts of things - Red Hat will be all over LVM, and there are guides for Debian, guides for Ubuntu, etc.
https://unix.stackexchange.com/questions/131311/moving-var-home-to-separate-partition
https://access.redhat.com/discussions/641923
https://www.control-escape.com/linux/lx-partition.html
But in the world of virtual servers, I recommend you do NOT partition your drives. My VMs generally fit this pattern:
/dev/sda 4GB drive, partitioned, /dev/sda1 is the full disk size, starting at block 2048 (properly aligned), type linux, active, and ext4 formatted. This gets mounted at /boot. Why? Because this is the only thing the BIOS/EFI needs to boot. It could be MBR, it could be GPT, it could be whatever - as long as you can install grub and it can get to the kernel and initrd, you’re golden.
/dev/sdb 8GB drive NOT PARTITIONED, literally mkswap -L swap /dev/sdb
/dev/sdc 100GB drive NOT PARTITIONED for LVM - pvcreate /dev/sdc, vgcreate system /dev/sdc
Then I carve out root, usr, var, home, opt. With Bionic I’ve noticed slow boots if / and /usr are on different partitions, so I keep them together now - something like 8GB for /, 5GB for /var, 1GB for /home, the rest in /opt, leaving about 5GB free for snapshots and future problems.
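The carving itself looks roughly like this (run from the installer or a live environment; sizes match the example layout above and would obviously change per box):

```shell
lvcreate -L 8G -n root system
lvcreate -L 5G -n var  system
lvcreate -L 1G -n home system

# Check what's left in the VG, then give /opt everything but ~5G headroom:
vgs system                       # note the VFree column
lvcreate -L 76G -n opt system    # example: 100G minus the above minus 5G

mkfs.ext4 /dev/system/root       # repeat for each LV
```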
/home is small. Usually I’m building servers, so home doesn’t need to be big. My package development/Jenkins box has a 100GB /home; on most of the others, even 1GB is mostly unused.
/var is also small. If it fills, I either extend it, or identify WHY it grew and move that stuff elsewhere (e.g. /opt), or decrease retention, or isolate it to a new partition. /var/lib/docker is a great one to isolate because docker puts EVERYTHING there - logs, containers, layers, volumes.
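Isolating /var/lib/docker after the fact is a well-worn path. A sketch, assuming the same “system” VG as above (the 40G size is just an example, and keep the .old copy until you’ve verified everything):

```shell
systemctl stop docker                      # nothing may touch the dir while we move it
lvcreate -L 40G -n docker system
mkfs.ext4 /dev/system/docker

mount /dev/system/docker /mnt
cp -a /var/lib/docker/. /mnt/              # -a preserves ownership/perms/attrs
umount /mnt

mv /var/lib/docker /var/lib/docker.old     # keep as a fallback until verified
mkdir /var/lib/docker
echo '/dev/system/docker /var/lib/docker ext4 defaults 0 2' >> /etc/fstab
mount /var/lib/docker
systemctl start docker
```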
And then the most critical piece - monitoring. You need to know when the drives reach 85% full, not when they’re 100% full and failing. When they reach 85% you usually have time to respond. (Usually. When I was playing with logstash, it ended up spewing 5GB into /var/log in a matter of 45 minutes.) See monitoring-plugins-basic and its check_disk, or just write a script - there are tons of options online - or go crazy and install Nagios.
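The “just write a script” option really can be this small - a minimal stand-in for check_disk that you’d wire up to cron and mail (the 85 threshold and the excluded filesystem types are the assumptions here):

```shell
#!/bin/sh
# Print any local filesystem at or above THRESHOLD percent full.
# Exits quietly when everything is fine, so cron only mails on trouble.
THRESHOLD=85
df -P -x tmpfs -x devtmpfs | awk -v t="$THRESHOLD" '
    NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= t) print $6 " at " $5 "%" }'
```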
All of this will, however, require more work than simply firing up an AMI or taking a stock template from a cloud provider. If you’re going to go that route, since you’re using docker anyway, CoreOS already does just about all of this in an 8GB footprint with automatic updates. You just need to add a drive for /var/lib/docker and away you go. (I add /storage too for any other odds and ends I might need to bind mount.)