NAKIVO Community Forum

SaaS Repositories generate high IOPS even when idle


Recommended Posts

I use Nakivo in a multi-tenant configuration (MSP) for Office 365 backups only.

I use the onboard transporter of the Director VA (4 vCPU and 16 GB RAM) for all tenants.

Each tenant has a SaaS repository configured on an NFS-backed VMDK.

The backups run overnight and work fine.

I noticed, however, that even when backups are not running, a transporter process appears to do some housekeeping on the repo. Each thread doing this housekeeping generates ~3,000 IOPS over NFS, and more than one thread can run concurrently. As a result, my storage sees a continuous load of 3,000-25,000 IOPS (peaking at ~260 MB/sec), 24/7.
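For anyone trying to attribute this kind of background IO to a specific process: on Linux, /proc/&lt;pid&gt;/io exposes cumulative per-process read/write byte counters. A minimal sketch (reading our own shell's counters as a stand-in for a transporter PID, which is an assumption for illustration only):

```shell
#!/bin/sh
# Hypothetical sketch: read per-process IO counters from /proc/<pid>/io.
# Substitute the transporter's PID for $$ to watch its read/write bytes grow.
pid=$$
io_lines=$(grep -Ec '^(read_bytes|write_bytes):' "/proc/$pid/io")
grep -E '^(read_bytes|write_bytes):' "/proc/$pid/io"
```

Sampling these counters a few seconds apart and taking the difference gives a rough per-process throughput without any extra tooling.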

No space or inodes appear to be freed.

I've only got 8 SaaS repos at the moment. What happens when I have 80? 

What is this background process doing that requires this much IO? Is there some way I can tell if it is actually accomplishing anything? Is there a way to limit it to running out of hours?

-Mark

 


15 hours ago, Manic Mark said:

What is this background process doing that requires this much IO? Is there some way I can tell if it is actually accomplishing anything? Is there a way to limit it to running out of hours?

-Mark

 

Hello, Mark. It is really hard to say what is going on without checking the logs. Our guess is that the product is performing the hourly repository refresh. Please try changing it to 4-6 hours and check whether that makes any difference (see the attached screenshot).

You can post the logs here but it is better to send a support bundle for further investigation to our Support team (https://helpcenter.nakivo.com/display/NH/Support+Bundles).

Should you have any further questions, let us know. We look forward to hearing from you.

unnamed.png


Hi,

I disabled the auto refresh of the repositories but it did not make any change to the usage.

After some digging, it looks like the transporter (or pgsql) is repeatedly trying to determine the used and free space in the repository.

Looking at all_pg_sql.log, it keeps logging events like this:

Quote

INFO: Get free space 2.5-TB: 0mls.
INFO: Get free space 2.5-TB: 0mls.
INFO: Get free space 2.5-TB: 1mls.
INFO: Get free space 2.5-TB: 0mls.
INFO: Get free space 2.5-TB: 0mls.
INFO: Get free space 2.5-TB: 0mls.
INFO: Get free space 2.5-TB: 1mls.
INFO: Get free space 2.5-TB: 0mls.
Get used space 181.0-GB: 14min. 36sec. 248mls

I suspect that "Get used space" is the culprit: it walks the directory structure and adds up the size of all the files in the repo (which contains ~3.3M files in this case). It appears to be triggered at tenant startup (or repo mount) and then 30 minutes after the previous iteration completes.

I have also tried switching off "system.repository.refresh.backup.size.calculation" under expert settings but "Get used space" still runs.
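To see why a scan like this is so expensive: a used-space walk has to stat() every file, so its cost grows linearly with file count, and over NFS every stat is at least one network round trip. A miniature hypothetical "repo" illustrates the shape of the work (the real one has ~3.3M files):

```shell
#!/bin/sh
# Hypothetical miniature repo to illustrate the cost of a used-space walk.
# du must stat() every file; on NFS each stat is a network round trip.
repo=$(mktemp -d)
i=1
while [ "$i" -le 200 ]; do
    head -c 1024 /dev/zero > "$repo/blob_$i"   # 200 x 1 KiB dummy blobs
    i=$((i + 1))
done
nfiles=$(find "$repo" -type f | wc -l)
du -sk "$repo"        # walks the whole tree, like "Get used space"
rm -rf "$repo"
echo "scanned $nfiles files"
```

Scale that walk up from 200 files to millions and a 14-minute "Get used space" over NFS is not surprising.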


On 4/23/2022 at 5:46 AM, Manic Mark said:

I disabled the auto refresh of the repositories but it did not make any change to the usage. [...]

Hello, @Manic Mark, please accept our apologies for keeping you waiting. Our engineering team is still investigating the ticket. We will get back to you with updates as soon as possible.

Thank you for understanding and your patience!


Hi,

I have implemented a workaround. It works for me in the sense that it takes the load off my NAS and moves it to my flash storage. Perhaps it will be of help to someone else, perhaps not. Try this at your own risk.

I stopped all tenants and the transporter service. I unmounted the existing ext4 filesystem that held all the SaaS repos and remounted it in another location.

I upgraded the VA's virtual hardware to the current version (7.0u2) and then added a new NFS-backed 10 TB VMDK (sda) and a 100 GB flash-backed VMDK (nvme0n1). I installed zfsutils-linux in the VA and created a zpool on the NFS-backed VMDK with metadata stored on the flash-backed VMDK, mounted in place of the original ext4 filesystem.

# Pool data on the NFS-backed VMDK (sda); metadata on the flash special vdev (nvme0n1)
zpool create nkv-saas -o ashift=13 -m /mnt/nkv-saas-repos /dev/sda special /dev/nvme0n1

I then copied all the SaaS repos to the new filesystem using rsync, and started the transporter service and the tenants.

The difference is shown below. 

BEFORE (ext4 on NFS-backed VMDK): continuous 3K-25K IOPS over NFS

Get used space 181.0-GB: 14min. 36sec. 248mls

AFTER (ZFS on NFS-backed VMDK with metadata on flash-backed VMDK): peaks at ~250 NFS IOPS during a backup, little to no load the rest of the time.

Get used space 181.9-GB: 1min. 44sec. 278mls

I'm using 23% of the 10 TB VMDK on my NAS and 29% of the 100 GB VMDK on my directly attached flash array, so the amount of flash storage required is trivial.

I still think that the way Nakivo tracks used space is naive at best. Walking the filesystem every 30 minutes to add up the space used by what could be hundreds of millions of files (I'm already up to 21M files) is a waste of system resources. Even more so considering that the repo is a PostgreSQL database and all the files in the repo are managed by Nakivo. A far more efficient approach would be to track the blob sizes in the database and, if needed, measure the sizes of the database and other files.
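To sketch what that alternative could look like (an assumption about a possible design, not how Nakivo actually works): keep a running byte counter that is adjusted on every blob write and delete, so reporting used space becomes an O(1) lookup instead of a multi-minute tree walk:

```shell
#!/bin/sh
# Hypothetical incremental used-space accounting: adjust a counter when a
# blob is written or removed, instead of re-walking millions of files.
counter=$(mktemp)
echo 0 > "$counter"

add_blob() {  # called after writing a blob of $1 bytes
    echo $(( $(cat "$counter") + $1 )) > "$counter"
}
del_blob() {  # called after deleting a blob of $1 bytes
    echo $(( $(cat "$counter") - $1 )) > "$counter"
}

add_blob 4096
add_blob 8192
del_blob 4096
used=$(cat "$counter")    # O(1) lookup: no filesystem walk needed
echo "used bytes: $used"
rm -f "$counter"
```

In a real implementation the counter would live in the existing PostgreSQL database and be updated in the same transaction as the blob metadata, so it could never drift from the actual contents.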

-Mark


12 hours ago, Manic Mark said:

I have implemented a workaround. It works for me in the sense that it takes the load off my NAS and moves it to my flash storage. [...]

@Manic Mark, thank you for your post. In this case, this is a workaround that moves the repository IOPS to different storage.

If further investigation is still needed, please send us a new support bundle.

Please include your ticket ID #142199 in the description.

We will add the information from this forum thread to the ticket and review everything. We look forward to your feedback.

Should you need any further information, please do not hesitate to contact us.

 
