Version failed to write to the archive, but was recorded in the database

I will edit this post with better information once I get to the office, but yesterday was a roller coaster of a day with multiple assistance requests from my engineers, and I barely managed to put together a description for my VAR. I have no access to the login screen here at my job, so I have to use my phone or my PC at home, with no records at hand of what happened in our vault.

I suspect a network issue (or a PDM operation that was canceled or somehow aborted in the middle of a file upload), as one user works on the factory shop floor in a different building.

I checked the local logs of two PDM clients via PDM Administration and the archive server log, then the file history and the XML in the archive version folder.

We had user A (CAD user) and user B (non CAD user) working in different buildings.
My guess is that user A tried to check in a drawing and somehow the file failed to upload to the archive server, but the operation was recorded in the database as version 40.

User B tried to retrieve the latest version of the file via Explorer (he apparently had no previous versions in his local cache) to check it out, fill in the data card and move it to our approved state. It failed with a dialog saying “cannot find ID xxxxxx on the server…”. At this point user B called user A to ask which button to push, and somehow he got “some version” from the archive server, since the latest version 40 was not available. He filled in the data card, checked in version 41, then moved it to an approved state.

The file history looks like
Version 41 multiple transitions to approval
Version 41 check in
Version 40 latest check in at 10:11
Version 39 check in
Version 38 check in

Archive server xml
version 40=41 datacard edits
version 38=39

The problem is that the file for version 38 exists on the archive server, while the file for version 40 does not.
This happened one week ago and version 40 is not in our backups either, so it was most likely never written back then.
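
To make the mismatch concrete, here is a minimal sketch of the check I did by eye. It assumes the version-to-file links from the XML have already been read into a plain dictionary (the real XML layout is not shown here, so the dictionary and the function name are hypothetical), and it simply lists the versions whose backing file is not on disk.

```python
def missing_physical_versions(version_to_file, files_on_disk):
    """Return the versions whose backing physical file is absent.

    version_to_file: maps each version in the history to the version
                     number of the physical file it points at
                     (a linked version points at an earlier file).
    files_on_disk:   set of version numbers that really exist as files.
    """
    return sorted(
        version
        for version, backing in version_to_file.items()
        if backing not in files_on_disk
    )


# Hypothetical data modelled on this thread:
# 39 is linked to the file of 38, 41 is linked to the file of 40.
version_to_file = {38: 38, 39: 38, 40: 40, 41: 40}
files_on_disk = {38}  # only the version 38 file is physically present

print(missing_physical_versions(version_to_file, files_on_disk))  # [40, 41]
```

Version 41 shows up as broken too, because it is only a pointer to the file that version 40 never wrote.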

One week later the file was being moved to a revision, and the transition failed for the drawing file because there is no physical version on the archive server.

The log files get interesting for USER A.
During the version 40 check-in, the local log shows an error:

10:11:14 E_EDM_FILE_NOT_LOCKED_BY_YOU catch error 3 (not 100% sure I have to check it again when back at the office)

At the same time the ARCHIVE SERVER LOG registered a failed operation.
The operation was recorded with something like

opRollback version 38 step 0

and a UUID-like string.

Note that USER A does not have a rollback permission (in theory, but he used to have it years ago, not sure if it is important at this point)

The log for USER B records only the failure to get version 40. Apparently, by pushing random buttons on the error dialogs, he was able to get the data card into edit mode, then checked in “something” and moved it to an approved state without a physical file.
He does not open the file with Solidworks, so the DB-only operation apparently never noticed that the physical file was not there.

I forgot something important:

I was able to retrieve a copy of the drawing in the approved state from USER A's workstation.
Its file hash differs from the other versions on the server, so I suspect it is the missing version 40.
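
For anyone who wants to repeat the hash comparison, this is a rough sketch of the idea, not the exact tool I used. SHA-256 is my choice here, and the function names are made up for illustration; any stable hash would do.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large CAD files do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matching_versions(candidate, archive_files):
    """Return the archive files whose content is identical to the candidate."""
    target = sha256_of(candidate)
    return [str(path) for path in archive_files if sha256_of(path) == target]
```

Call `matching_versions` with the recovered local copy and the list of version files in the document's archive folder. An empty result means the candidate matches none of the stored versions, which is what I got and why I think it is the lost version 40.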

Since user A tried (but failed) to move the file from approved to revised (I also failed to do it as admin on my PC, as there is no physical version on the server), he COPIED the drawing file with a “_2” temporary name and started doing his revision task.

The drawing file, still in a locked state, appears to be flagged WRITEABLE.
I added a / flag at the end of the path in Explorer to look at the real local cache state, and the Writeable column was marked yes. The Version tab in PDM Explorer had a green icon with the local version marked as “-”, iirc.

Likely you can manually put the file from the user’s local cache onto the archive and it will work. I don’t recall ever having a missing file like this, except in the case of replication server errors that resulted in a missing copy, where I had to manually copy the file and check the archive server tables.

As you probably know by now, the files in the archive are named by version number in hexadecimal format. The XML just indicates which file to pull per version ID. I think it’s only needed because you don’t always get a new file version with check-ins for data card edits only.

Yes, to save space in the archive, when only metadata or state changes occur the version is a “linked version”: just a pointer to an existing file, to avoid duplication. Not sure how robust it is…

I will test the candidate version 40 file on our test server and see what happens.

Another PDM MYSTERY

It’s been working for us for 17 years now. Most of our archive errors are replication related. We had 3 replication servers until recently, when our company forced us to move all servers to Azure. Since they are now all in the same cloud location, it made no sense to continue replicating them.

I have no experience with replication, but it sounds useful for us.
Is it possible to have partial server replication (only some folders, for outsourced projects for example)?

I am not 100% sure, but the only thing that scares me right now is botching our test server configuration and exposing it as a replicated archive, since I copy it from our backups and manually edit 3 tables (dbo.archiveservers and dbo.sysinfo in the vault DB, dbo.licenseserver in conisiomasterdb) with a different server name.

I keep different passwords for archive, SQL and admin for both TEST and production to separate them as much as possible.

If you haven’t seen it, the document attached to this KB article describes how it works in detail:

I’ve only had to fiddle with the index.xml file once in 15 years.

Thank you, it was already in my KB bookmarks. I have messed with those XML files a couple of times.

I would like to add that I suspect some user tried to do something “funny”, but just kept their mouth shut about it (as usual).

But the system should be robust enough to avoid funny things.

“the system is always right”
“trust the system”
“obey the system”
“be the system”
:joy:

Yes, you pick which folders you want to replicate by creating different schemas. I only had one that replicated the entire vault to all locations since everyone was working on the same products.

Keep in mind, this schema doesn’t control permissions. Even if the folders aren’t on a replication schedule, the system will replicate the file over to the other server “on demand” if a user can read it (Just takes longer since it triggers the replication). The schema schedule just saves your users time by having it already on their local server. Your user’s local vault view has to be configured for a specific replication server.

Say you have two locations with servers “London” and “Boston”. “London” makes motorcycles, and “Boston” makes scooters. The teams are separate, but you want them to share common components and reference each other’s files.

Folders:
c:\Vault\Products\Motorcycle
c:\Vault\Products\Scooter
c:\Vault\Library

Replication schemas

  • Library
    *Replicates every hour both ways. You have library admins in both locations that can add/edit content.
  • Motorcycle (London)
    *Replicates content to “Boston” server on Sunday at 12am.
  • Scooter (Boston)
    *Replicates content to “London” server on Sunday at 3am

This sounds quite interesting.
So all I have to do is restrict the folder permissions to the bare minimum for users operating at the replicated server location.
If they cannot see the folder, they cannot get its content or trigger an unwanted sync.

My VAR relayed my questions to DS, and they think that a version failing the check-in can be caused by something like the following (a provisional list, since I still need to translate it properly from Japanese):

pdm archive service crash
pdm server disk fault
pdm server crash
antivirus killing the file
no writing permission on the archive
server recovered from incomplete backup

Well, there is more to unpack, but it sounds like an “it is your fault, not ours, we just do not know where you messed up” kind of reply. I also suspect my Japanese VAR writes in Japanese to Solidworks Japan, which translates into English for their international HQ, and then everything goes back to Japanese with a few hops and jumps until the answer arrives in my mailbox.

Conclusion: it is Friday evening and I am going home. :joy::ok_hand:

Correct, you don’t have to schedule syncs at all. The “on demand” sync will still function but if you have each location isolated to its own folder where each group can’t “read” the other’s folder, then no syncing would take place. Except for you, if you are ever troubleshooting files and data from another location.

Unfortunately there isn’t much of a log to go by for you or them to troubleshoot.

  • pdm archive service crash - Don’t think I’ve ever had this happen
  • pdm server disk fault - Lost a RAID array on the DB server once, but not the archives. It’s all Azure now, so not sure if that’s even possible anymore.
  • pdm server crash - See above
  • antivirus killing the file - I could see this one happening. I had to make sure IT put in an exclusion not to scan the archive location for performance reasons.
  • no writing permission on the archive - I’ve had this happen on the archive once for a folder. Also happens on regular servers. Sometimes one folder’s permissions get messed up, no idea how.
  • server recovered from incomplete backup - We had a virus destroy a bunch of servers, including one of our replicated archives. I had to rebuild it and restore the files. The question was whether to restore from the previous night’s backup or from another archive. Since our archives were on an hourly sync schedule, we elected to restore from another archive. Still had some replication issues with missing files that I had to fix manually, but only 20-30 files.

I meant I physically checked my servers while reading their email.
I put the server in our racks and built the RAID 6 array myself, and I checked the system logs and the RAID logs… I had crashes and disks dying on an old server years ago, but not this time.

The PDM service crashing and restarting immediately sounds plausible, but I need to find a log to confirm the service restarted. IIRC, when the PDM archive service restarts it forces the clean-up scheduler to run, so a restart is unlikely to go unnoticed.

Same for the backup thing: I moved the server months ago and hashed 2TB of data to be sure no corruption happened. I was also able to write to the hex folder of the failed check-in multiple times without any kind of issue.

The only issue I was expecting, a network problem, was not even mentioned by them.

  1. Take a copy of the physical file from the local vault view of the workstation that botched the check-in. In my case it was apparently congruent with the version missing on the server.
  2. Look up the document ID of the file with the missing version (you can set up a special BOM with only the document ID and filename columns).
  3. Convert the document ID from DECIMAL to HEX with a calculator.
  4. Go to the archive server and open the archive root, which contains 16 folders (0 to F).
  5. The last digit of the hex document ID corresponds to the name of one of those 16 folders.
  6. Open that folder and look for a subfolder named after the hex document ID.
  7. Place the copy in that subfolder and rename it to the missing version number. The version number is in HEX too (version ten of a part file is 0000000A.sldprt); look at the other files in the folder to confirm how many zeros and digits you need.

Remember that the XML file inside the document ID folder will tell you a lot about the versions stored there, such as an unchanged file shared by multiple versions (e.g. a state change without a physical file change).

Also, test it in a non-production environment first.
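
If you would rather not do the hex conversion by hand, the lookup in the steps above can be sketched in a few lines. Treat this as a sketch under assumptions: the archive root `D:\Archive`, the document ID 123456 and the function name are all made up, and the 8-digit zero padding is taken from the 0000000A.sldprt example, so compare against the real folder and file names before copying anything.

```python
from pathlib import PureWindowsPath


def archive_file_path(archive_root, document_id, version, extension, pad=8):
    """Build the expected archive location of one version of one document.

    pad=8 is an assumption based on names like 0000000A.sldprt;
    check the existing folders and files on your own archive server.
    """
    doc_hex = format(document_id, "X").zfill(pad)    # decimal ID -> hex folder name
    version_hex = format(version, "X").zfill(pad)    # version number -> hex file name
    bucket = doc_hex[-1]                             # last hex digit picks folder 0..F
    return PureWindowsPath(archive_root) / bucket / doc_hex / f"{version_hex}.{extension}"


# Hypothetical document ID 123456, missing version 40 of a drawing.
print(archive_file_path(r"D:\Archive", 123456, 40, "slddrw"))
# D:\Archive\0\0001E240\00000028.slddrw
```

The function only builds the path as a string and never touches the disk, so it is safe to run anywhere while you double check the target name.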
