PDM server refresh: lessons learnt

Background

With the lease period coming to an end and the OS close to end-of-life support, we refreshed our PDM servers.
I am not an IT engineer, but I had to do about 90% of the job myself: IT basically wired the LAN for us, and everything else was on me (rack mounting, hardware updates, iLO and RAID configuration, OS deployment, software install).

Hardware-wise we basically kept the old servers' main specs, with refreshed part numbers for the main components like CPU and memory, and moved the disks to SSD technology because 15k HDDs are basically no longer available.

They are still bare-metal servers like the old ones; I do not have the experience or the resources to set up and run PDM on a VM at the moment.

DATABASE: SQL Server 2022, dual-socket CPU, 16 cores (32 threads), 96 GB memory
ARCHIVE: 16 cores, 16 GB memory

Afterthoughts

It was the first time I was involved in this kind of server operation.
Previously I only covered the PDM software upgrades and tried to fix some issues we had on our legacy PDM setup.
This time I could see in more depth how the hardware matters and how PDM reacts, and I think it was a positive experience. With a decent dedicated SQL Server and SSDs, I no longer think it makes sense to keep the archive and the database on two separate machines.
If I had to build the servers again, I would probably just add an additional RAID controller dedicated to the archive storage and spend the money on upgrading memory and CPU on a single SQL machine instead.

The reasons are the additional latency introduced by using two separate machines, the double maintenance involved, the double licensing costs, etc.
Archive and database act as a single entity anyway, and they should be rebooted together to avoid issues, for example.

Lessons learnt (in random order)

  1. SQL Server was installed with bare defaults; nobody knew what was configured and why. Even the memory setting was the out-of-the-box maximum of 2147483647 MB. I suggest studying a bit what every option does and writing down the logic behind your setup. Do not pay your VAR to press "next" on an installer: ask for a detailed report of what they did and why, or do it yourself.

  2. Database files were buried in the "Program Files" folder (see 1.), so OS and DB were not segregated enough on disk. Personally I would make one partition for the OS, one for the database files, one for the transaction log and one for tempdb.

  3. RAID 1+0 on SSDs is fast, and with 6 disks it allows decent fault tolerance (unless both disks in a mirror pair die). I would use it for SQL and the OS, since the disk space involved is not that big and losing 50% of the capacity is acceptable there. For the PDM archive I would stick with RAID 6 (or RAID 5 if that is mandated by your religion, beliefs or whatever you read on the internet: I tested both and, performance-wise, it does not seem to change much IMHO).

  4. As the technology ages, SQL Server information gets outdated, and many "best practices" look closer to voodoo or digital black magic with no solid basis: be critical of what you find on the internet and also pay attention to WHEN it was written. Performing the installation during a full moon, after sacrificing a lamb to the IT gods, may help (if you think so).

  5. Vault-specific issues can be a thing: according to the KB and my personal experience, setting the SQL compatibility level to 140 may speed up some operations, like browsing vault folders with more than 1000 files. It also seems to be the fix for some odd behaviour, but it can be environment specific.

  6. Maintenance plans and backup strategy: test them in real-world conditions, and install everything from scratch at least once. Testing only on an already-installed environment hides a lot of annoying issues from your plan.

  7. Document everything, or at least the main bullet points of what you did: you are going to forget it sooner than you think.

  8. A set of scripts to manage the servers is useful: disk-space monitoring, planned graceful reboots (shutting down the PDM services without forcefully killing them), antivirus scans with capped CPU usage (yes, I am talking about you, dear Microsoft), automatic emails to the admins when something happens. Task Scheduler is your friend, and a filtered Event Viewer is useful too. Remember to manage your scripts in a repository, or have some sort of discipline, to avoid having 3 versions of the same script floating around. Put comments and a revision history inside.

  9. Domain admins try to screw your server all the time. Windows Update is good, but please avoid rebooting my server at 10.30 AM in the middle of the workday without telling me. They will also force a lot of policies you know nothing about, like making the antivirus you already manage with your scripts run without restrictions, so it eats 100% of the cores at a random time of the day (based on a true story).

  10. To move A LOT of data you need the right tools, but you should always be aware of the risks involved. I moved several million PDM files, and I would not trust Windows Explorer or even robocopy without hashing everything beforehand and verifying afterwards. You do not want a data corruption to be discovered months later, right?
    Just for reference, I hashed the whole archive server with CRC32 (a light algorithm, and 99% of the time you get no collisions): it took about 3.5 hours on RAID 5 with 15k HDDs, while hashing the same data after the robocopy onto the new server took about 32 minutes.
    You hash the 16 hex folders, then compare the 16 CRC32 result files from source and destination: if they match, there were no errors during the copy and you can sleep better at night.
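On point 1, the "max server memory" default: there are various community rules of thumb for sizing it, none of them an official Microsoft formula, so treat the numbers below as an assumption to adapt, not a recipe. The idea is simply to reserve some RAM for the OS and give SQL Server the rest instead of the 2147483647 MB default. A minimal sketch:

```python
def suggested_sql_max_memory_mb(total_ram_gb: int) -> int:
    """Rough community rule of thumb (NOT an official formula):
    reserve ~4 GB for the OS, plus ~1 GB for every 8 GB of RAM
    beyond the first 16 GB, and hand the rest to SQL Server's
    'max server memory' setting (which is expressed in MB)."""
    reserve_gb = 4 + max(0, (total_ram_gb - 16) // 8)
    return (total_ram_gb - reserve_gb) * 1024
```

For the 96 GB database server above, this heuristic would reserve 14 GB for the OS and suggest about 82 GB (83968 MB) for SQL Server; the important part is writing down whatever logic you chose.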
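On point 3, the capacity trade-off between the RAID levels is easy to quantify. A small sketch, assuming equal-size disks and the standard mirror/parity overheads:

```python
def raid_usable(disks: int, disk_tb: float, level: str) -> float:
    """Usable capacity in TB for common RAID levels, equal-size disks."""
    if level == "10":
        return disks / 2 * disk_tb   # mirrored pairs: 50% usable
    if level == "6":
        return (disks - 2) * disk_tb # two disks' worth of parity
    if level == "5":
        return (disks - 1) * disk_tb # one disk's worth of parity
    raise ValueError(f"unsupported RAID level: {level}")
```

With 6 disks of 2 TB each: RAID 1+0 gives 6 TB usable, RAID 6 gives 8 TB, RAID 5 gives 10 TB. That 50% loss on RAID 1+0 is why it fits the small OS/SQL volumes but not the archive.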
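On point 8, the disk-space check is the easiest script to start with. A minimal sketch (the volume letters, threshold and alert function are made-up placeholders, not anything from PDM):

```python
import shutil

def check_free_space(path: str, min_free_gb: float) -> bool:
    """True when the volume holding `path` has at least min_free_gb free."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= min_free_gb

# Example wiring: run from Task Scheduler, alert the admins on failure.
# for volume in (r"C:\\", r"D:\\"):            # hypothetical volumes
#     if not check_free_space(volume, 50):     # hypothetical 50 GB floor
#         pass  # send_alert_mail(volume) -- your notification stub here
```

Keep the thresholds and the alert logic in one commented, versioned script rather than three floating copies.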
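The hashing step in point 10 can be sketched like this: walk a folder tree, CRC32 every file, and compare the source map against the destination map after the robocopy (function names are mine for illustration, not from any PDM tool):

```python
import os
import zlib

def crc32_file(path: str, chunk: int = 1 << 20) -> int:
    """CRC32 of one file, streamed so large archive files fit in memory."""
    crc = 0
    with open(path, "rb") as f:
        while block := f.read(chunk):
            crc = zlib.crc32(block, crc)
    return crc

def crc32_tree(root: str) -> dict[str, int]:
    """Map of relative path -> CRC32 for every file under root."""
    result = {}
    for dirpath, _, files in os.walk(root):
        for name in sorted(files):
            full = os.path.join(dirpath, name)
            result[os.path.relpath(full, root)] = crc32_file(full)
    return result
```

Run `crc32_tree` on each of the 16 hex folders on the source before the copy and on the destination after it; identical maps mean the copy was bit-for-bit clean, and any differing key points you straight at the corrupted file.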
