We’re in a load-balanced environment. Auth and Web1 are on one host, Web2 is on another. The 404 issue manifested on Auth and Web1 only. After investigating, we found that their host went through a host migration on Azure last night, which we suspect caused the problem.
This only seems to happen with one tree, which was copied and moved around by the client in a recent restructure, possibly with some uSync involvement. It’s possible there was a data integrity issue, so we ran the health check for that today.
I found this previous post, and sure enough after rebuilding the memory cache the pages came back, but we’re not sure how things got into this state. I’m now monitoring both front-end URLs to see if it happens again.
Further, during the host migration / new instance spin-up, both Auth and Web1 reported an Unhandled Exception in AppDomain, but gave no further details as to why.
This lines up with NuCache getting into an inconsistent state, and the host migration is almost certainly the trigger. When the AppDomain restarts unexpectedly (which an unhandled exception during the migration would cause), the NuCache files (NuCache.Content.db / NuCache.Media.db in umbraco/Data/TEMP/NuCache) can be left half-written or out of sync with the database. Since Auth and Web1 share a host and Web2 is separate, only the migrated host ended up with the corrupted local cache — which is why the 404s were isolated to those two and a reload fixed it.
The “published but URL cannot be routed” message is the classic symptom: the node is published in the database, but the in-memory cache that builds the routing table is missing it or has a broken parent/child relationship for it. A Reload rebuilds the in-memory cache from the database cache (the cmsContentNu table); a full Rebuild regenerates cmsContentNu itself from the content tables. The recent copy/move on that tree may have left a latent inconsistency that sat harmlessly until the restart forced a rebuild and surfaced it.
A few things to harden against a recurrence:
Make sure the TEMP/NuCache folder is local to each node, not shared storage. Shared NuCache files on a load-balanced setup will give you exactly this kind of divergence. Setting Umbraco:CMS:Hosting:LocalTempStorageLocation to EnvironmentTemp forces each node onto its own local temp.
Confirm the setup follows the flexible load balancing guidance (single elected scheduling publisher, others as subscribers) with MainDomLock set appropriately — SqlMainDomLock is the usual choice on Azure. MainDom handover going wrong during a host migration is a very plausible cause of that AppDomain exception.
Dig into the actual exception. The AppDomain line is just the symptom of the process going down. On v13 the real cause should be in umbraco/Logs/UmbracoTraceLog.*.json (or Application Insights) around that timestamp.
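For reference, a minimal appsettings.json sketch of the two settings mentioned above. The key names are as I understand them from the Umbraco configuration docs (LocalTempStorageLocation under Hosting, MainDomLock under Global), so check them against the version you're running. FileSystemMainDomLock is the other MainDomLock value commonly used here; SqlMainDomLock is only shown as an example.

```json
{
  "Umbraco": {
    "CMS": {
      "Hosting": {
        "LocalTempStorageLocation": "EnvironmentTemp"
      },
      "Global": {
        "MainDomLock": "SqlMainDomLock"
      }
    }
  }
}
```

On an Azure App Service the same keys can be set as application settings using double underscores, e.g. Umbraco__CMS__Hosting__LocalTempStorageLocation.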
We’re definitely running EnvironmentTemp for indexes, been caught out with this in the past and have learned our lessons.
Also have followed the load balancing guidance, although you’ve encouraged me to go back and check this over. We’re on FileSystemMainDomLock across all three instances as per the docs here:
Finally I’ve been through the traces for the startup on the second App Service instance. The AppDomain exception is the last thing logged on the previous instance as the new one is starting up. All entries for indexes seem to be fine (at least these entries appear on Auth but not on Web1):
Checked index {IndexName} at {IndexPath} and it is clean.
No errors on the startup for Web1 itself, but no mention of indexes being built. It does recognise this is a new server so goes for a clean boot. MainDomLock is acquired without any issues.
It looks like Umbraco didn’t see any integrity issue and built its caches without complaint, although something underlying must have hidden these pages from that process.
At one point there was a suggestion that two pages existed with the same URL. We suspected one had been moved and renamed on Prod, and one had been published by uSync, somehow resulting in two nodes with different GUIDs but the same URL. That was previously worked around by unpublishing and republishing the page, after which the back-office stopped complaining. I’m wondering whether something is still hanging about.
You’ve clearly checked the obvious stuff, so just one thing on the uSync side worth a quick look: is uSync only set to import on the scheduling publisher, with the subscribers left out of it?
If a subscriber instance ends up running an import (particularly ImportAtStartup), it can publish content on its own rather than letting it flow from the publisher, and that’s a really easy way to wind up with two nodes on the same URL with different GUIDs — which sounds a lot like the duplicate you think is still hanging about.
Might be worth comparing the uSync config across all three to make sure only the publisher’s doing the importing. If a subscriber’s been quietly importing on boot, that could well be how the second node appeared.
Thanks again Justin. We don’t generally have importAtStartup set, but I’ll check to make sure the front-ends aren’t touching uSync as you suggest. That could certainly be a candidate.
@justin-nevitech the three servers have the same uSync settings but none of the import options are set. If the editor imports staging content from prod, would that follow the correct publishing flow or could the same duplicate URL thing happen?
When you say the editor imports staging content from prod, what do you mean exactly? Is staging not a completely separate site and database from prod? You should only allow CMS access to a single instance in v13 when load balancing so the CMS site is the publisher and the other load balanced sites are subscribers. All content should then be managed on the publisher (CMS) instance.
Could you provide a bit more detail on your setup and how staging is connected?
if (_serverRegistrar.CurrentServerRole == ServerRole.Subscriber)
{
    if (_logger.IsEnabled(LogLevel.Information))
        _logger.LogInformation("This is a replicate server in a load balanced setup - uSync will not run {serverRole}", _serverRegistrar.CurrentServerRole);
    return;
}
using Microsoft.AspNetCore.Hosting;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Core.Sync;
using Umbraco.Cms.Infrastructure.DependencyInjection;

namespace www.Extensions.LoadBalancing
{
    // Explicitly assigns the server role from the hosting environment name
    // instead of relying on automatic election.
    public class ServerRegistrar : IServerRoleAccessor
    {
        private readonly IWebHostEnvironment _env;

        public ServerRegistrar(IWebHostEnvironment env)
        {
            _env = env;
        }

        public ServerRole CurrentServerRole
        {
            get
            {
                // Invariant casing avoids culture-specific surprises when matching names.
                return _env.EnvironmentName.ToLowerInvariant() switch
                {
                    // Environments whose name ends in "api" host the back office.
                    string n when n.EndsWith("api") => ServerRole.SchedulingPublisher,
                    "development" => ServerRole.Single,
                    // Everything else is a front-end subscriber.
                    _ => ServerRole.Subscriber,
                };
            }
        }
    }

    // Registers the custom role accessor with Umbraco at startup.
    public class RegisterServerRegistrar : IComposer
    {
        public void Compose(IUmbracoBuilder builder) => builder.SetServerRegistrar<ServerRegistrar>();
    }
}
PS: the current server role is also exposed in the System Information panel, if you want to check.
@justin-nevitech Maybe should have been clearer. We’re running uSync Complete. The concept / rule is that all content should be prepped on Staging then published to Prod using uSync. This is an imperfect workflow as the editor has to make various changes on Prod (i.e. SEOChecker redirects), and I think ended up moving some pages around on Prod without reflecting those changes on Staging first.
It’s also possible they used uSync Complete from the Prod back-office to pull content up from Staging. What we’re wondering is whether this pulled in the same item twice: if the editor copied an item on Prod (new GUID) and then pulled the original item from Staging, two items with identical paths might have been created.
And thanks @mistyn8. Yes, we do see the “This is a replicate server” message in the logs, so it’s good to know what that relates to, and I can confirm the Publisher / Subscriber settings are all 100% correct.
That sounds like the issue then. Copy on prod gives a new GUID, then a later uSync pull brings the original down with its own GUID, and now you’ve got two nodes on the same URL with neither aware of the other. The republish stopped the back-office complaining but didn’t get rid of the duplicate node; it’s still published and sitting in cmsContentNu. Most of the time one wins the URL and all looks fine. When the migrated instance rebuilt its cache the load order changed, the other won, and your child pages could no longer be routed.
So I’d leave the cache alone and go after the duplicate. Find the two nodes on that URL, work out which is the orphan, and properly delete it rather than unpublishing, then make sure all three instances have picked up the change.
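If it helps with the hunt, here is a rough, self-contained C# sketch of the check I mean: group published nodes by parent and URL segment and flag any group with more than one GUID. PublishedNode and DuplicateRouteFinder are hypothetical names for illustration, not Umbraco types, and the sample rows are made up. In practice you'd fill the list from the content service or a database query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified view of a published node; not an Umbraco type.
public record PublishedNode(Guid Key, int ParentId, string UrlSegment, string Name);

public static class DuplicateRouteFinder
{
    // Groups nodes by parent id and URL segment (case-insensitive) and
    // returns only the groups that contain more than one node.
    public static List<List<PublishedNode>> FindCollisions(IEnumerable<PublishedNode> nodes) =>
        nodes
            .GroupBy(n => (n.ParentId, Segment: n.UrlSegment.ToLowerInvariant()))
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList())
            .ToList();
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample data: two siblings sharing the "services" segment.
        var nodes = new[]
        {
            new PublishedNode(Guid.NewGuid(), 1100, "services", "Services"),
            new PublishedNode(Guid.NewGuid(), 1100, "Services", "Services"),
            new PublishedNode(Guid.NewGuid(), 1100, "about", "About"),
        };

        foreach (var group in DuplicateRouteFinder.FindCollisions(nodes))
            foreach (var node in group)
                Console.WriteLine($"{node.ParentId}/{node.UrlSegment} -> {node.Key}");
    }
}
```

Any group it reports is a pair of candidates for the orphan. Which one to delete still needs a human look at create dates and which GUID exists on Staging.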
To avoid it recurring, the rule needs to be that content structure only ever changes in one direction. All moves, renames and restructures happen on Staging, then flow down to prod via uSync. Nothing structural gets done directly in the prod back-office. The Prod-only bits (SEOChecker redirects and the like) are fine as long as they’re not touching node structure. The trouble started the moment an editor moved pages on prod and then pulled content the other way, so killing that reverse flow is what stops it happening again.
I think we’re on the same page, @justin-nevitech; this is what I think has happened too. We’re monitoring these items and hoping the integrity health check did the right thing. If not, I’ll go digging in the database. Thanks for your advice!