XML sitemaps for large sites

Porting our internal XML sitemap generator to v15 has got me thinking…

Previously we’ve simply crawled through the entire content tree looking for signals to determine if a node should be part of the sitemap (in our case, a true/false property with the alias umbracoSitemapHide being false).
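For context, the pre-v15 crawl looked roughly like this (a minimal sketch; the property alias is from our setup, the rest is standard `IPublishedContent` API):

```csharp
using Umbraco.Cms.Core.Models.PublishedContent;
using Umbraco.Extensions;

// Recursively walk the published tree, keeping nodes where the
// umbracoSitemapHide toggle is false. Note: a missing value also
// converts to false, so such nodes end up included here.
private static IEnumerable<IPublishedContent> CollectSitemapNodes(IPublishedContent node)
{
    if (!node.Value<bool>("umbracoSitemapHide"))
    {
        yield return node;
    }

    foreach (var child in node.Children())
    {
        foreach (var descendant in CollectSitemapNodes(child))
        {
            yield return descendant;
        }
    }
}
```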

But with v15 and its new cache, which doesn’t necessarily contain all content at all times, I’m thinking this approach is no longer the best.

What are my options then?

I’ve been looking into using Examine, but it seems that true/false properties sometimes aren’t indexed. I suspect that if the property is added to the content type after a content node is created, that node gets a null value for the property, which is then not indexed. Because of this, my Examine-based sitemap is missing some pages in my initial tests.

Are there other ways to do this in v15? Or should I run through all my content once and for all, and set an actual “false” value in my “sitemap hide” fields where it’s missing?

What are other people doing to automatically generate XML sitemaps?

Hey Sørren,

I may have over simplified this…

If you are happy to treat the absence of a hide value in the document as false, could you use Examine to generate an exclude list of all nodes that have a true value? Your sitemap would then only include nodes that are not in the exclude list.

Except that, currently, I only want to include nodes that have the property, and where the property is false. If a node doesn’t have the property, it should not be included :slight_smile:

A workaround, though, could be to create a list of content types with that property and use those aliases when searching in Examine.


Am I reading this correctly, that you can’t rely on the IPublishedContentCache to have all your content?

Does a rebuild of the Examine indexes not correct that? Or are the indexes rebuilt from the same incomplete content cache?

Am I reading this correctly, that you can’t rely on the IPublishedContentCache to have all your content?

Depends on how you set it up. By default, at startup it contains the first 100 nodes it finds, and it then grows whenever other nodes are requested.

I’m not an expert on the subject, but you can make it load all your content into the cache, like the old cache did.
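Going from memory here, so take the exact key names with a grain of salt and check the current docs: the v15 hybrid cache exposes seeding options in appsettings.json along these lines (the GUID is a placeholder):

```json
{
  "Umbraco": {
    "CMS": {
      "Cache": {
        // How many documents are seeded breadth-first at startup (default 100)
        "DocumentBreadthFirstSeedCount": 100,
        // Seed the cache with all documents of these content types
        "ContentTypeKeys": ["00000000-0000-0000-0000-000000000000"]
      }
    }
  }
}
```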


Does a rebuild of the Examine indexes not correct that? Or are the indexes rebuilt from the same incomplete content cache?

At least not when I tried :slight_smile:


I’m going to need to implement this soon too.

There’s a pattern emerging in Umbraco of having dedicated persisted indexes/repositories for use cases like this, updated whenever content changes.

URLs (routes) are the first example of this, and search is going the same way - with dedicated database tables that are effectively indexes of published content with specific property values.

I’m thinking about maintaining my own index of sitemap data in a table (with EF Core, so it doesn’t have to be in the Umbraco DB), then updating it when content changes.

So the complete opposite of Umbraco <15, where hitting the database from the frontend was to be avoided at all costs? Or do we now have to create caches of these tables for frontend use?

What we do is have an XmlSiteMap.cshtml view for the sitemap path.

We basically use Model.Root() to get the homepage, then render an XML entry for this page, and recursively do the same thing for each child.

The homepage and all its .Children seem to be of type IPublishedContent. Do you think the cache issue in v15 affects this as well?
Otherwise this approach seems tried and tested, and works great.

We also check for any toggles to opt out of sitemap listing, and there’s also a list of excluded document types we verify against at the same time.

Not sure it’s the most performant option, but this is generated here and then cached in the frontend, as we run a headless architecture.
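A minimal sketch of the recursive view described above, assuming the umbracoSitemapHide toggle from earlier in the thread (aliases and details are illustrative, not our exact code):

```cshtml
@inherits Umbraco.Cms.Web.Common.Views.UmbracoViewPage
@using Umbraco.Cms.Core.Models.PublishedContent
@using Umbraco.Extensions
@{
    Layout = null;
    Context.Response.ContentType = "text/xml";

    // Local function: recursively yield the homepage and its descendants,
    // honouring the opt-out toggle (a missing value counts as false here).
    IEnumerable<IPublishedContent> Collect(IPublishedContent node)
    {
        if (!node.Value<bool>("umbracoSitemapHide"))
        {
            yield return node;
        }
        foreach (var child in node.Children())
        {
            foreach (var descendant in Collect(child))
            {
                yield return descendant;
            }
        }
    }
}<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
@foreach (var page in Collect(Model.Root()))
{
    <url>
        <loc>@page.Url(mode: UrlMode.Absolute)</loc>
        <lastmod>@page.UpdateDate.ToString("yyyy-MM-dd")</lastmod>
    </url>
}
</urlset>
```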

This is flagged with a warning in your code in v15. Effectively it will fill up the cache with all the pages, eliminating all the performance gains of the new cache.

No, you should still cache this, but in @JasonElkin’s approach the cache would probably be a simple key/value list (node key and URL).


How would you implement this, via NotificationHandlers?

Listening for the ContentPublished, ContentUnpublished, ContentMoved, ContentMovedToRecycleBin and ContentDeleted notifications (I might’ve missed one).
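A hypothetical handler along those lines (the notification types and their PublishedEntities/UnpublishedEntities members are standard Umbraco API; the sitemap store itself is left abstract):

```csharp
using Umbraco.Cms.Core.Events;
using Umbraco.Cms.Core.Notifications;

// Keeps a custom sitemap store in sync with publish/unpublish events.
public class SitemapIndexUpdater :
    INotificationHandler<ContentPublishedNotification>,
    INotificationHandler<ContentUnpublishedNotification>
{
    public void Handle(ContentPublishedNotification notification)
    {
        foreach (var content in notification.PublishedEntities)
        {
            // Upsert this node's key/URL in your sitemap table or index.
        }
    }

    public void Handle(ContentUnpublishedNotification notification)
    {
        foreach (var content in notification.UnpublishedEntities)
        {
            // Remove this node from your sitemap table or index.
        }
    }
}
```

Registered in a composer with `builder.AddNotificationHandler<ContentPublishedNotification, SitemapIndexUpdater>()`, once per notification type you handle.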

So ripe for Redis?
Though it seems like these are things the CMS should be doing for us… we shouldn’t have to maintain our own caches just to generate a sitemap?

Would you get a publish/unpublish notification from a scheduled action? Also, wasn’t there something about bulk publishing (a node and its descendants) not raising notifications?

Seems a little fraught with danger to try and keep things in sync.

I wonder if we could hook into, or follow, the mechanism used for content syndication across load-balanced setups? ICacheInstructionService, if memory serves…

That sounds like the best idea to me: check the document type, then check that the property has the right value, all in Examine.

I disagree; how would the CMS know (out of the box) what content to include in the sitemap? At the very least, our sitemaps would be filled with unnecessary (and non-working) links if they contained a link for every node in the CMS.

The fact that Umbraco doesn’t have any opinions on how I structure the front layer of my website (including the front layer for robots, e.g. the sitemap, robots.txt, etc.) is one of the best things about Umbraco, IMO.

So what would the best approach be to handle this with Examine?

Introduce a property on all relevant document types that defaults to true, something like includeInSitemap?
I guess this would not be enough, however, since for each search result you would have to fetch the actual node to find its published URL, right? And then you’re back to square one, touching PublishedContent, and you lose the benefit.

Or would you rather write NotificationHandlers hoping to cover all relevant scenarios and update a separate Examine index instead?

For now I have ended up querying all content types using IContentTypeService, to find all content types that have a property with the alias umbracoSitemapHide. This query is runtime-cached for an hour.

I then do a search in the ExternalIndex using Examine, for nodes of the content types found above, where umbracoSitemapHide is not 1 (true).

If there are more than 50,000 results, I return a sitemap index with links to paged sitemaps (e.g. /xml-sitemap?page=1, ?page=2, etc.)
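A sketch of that flow (the Examine and Umbraco calls are standard API; the runtime caching and paging are elided, and the class/method names are just illustrative):

```csharp
using System;
using System.Linq;
using Examine;
using Examine.Search;
using Umbraco.Cms.Core.Services;

public class SitemapSearch
{
    private readonly IExamineManager _examineManager;
    private readonly IContentTypeService _contentTypeService;

    public SitemapSearch(IExamineManager examineManager, IContentTypeService contentTypeService)
    {
        _examineManager = examineManager;
        _contentTypeService = contentTypeService;
    }

    public ISearchResults GetSitemapNodes()
    {
        // Aliases of all content types that have the toggle
        // (runtime-cache this lookup in production).
        var aliases = _contentTypeService.GetAll()
            .Where(ct => ct.CompositionPropertyTypes.Any(p => p.Alias == "umbracoSitemapHide"))
            .Select(ct => ct.Alias)
            .ToArray();

        if (!_examineManager.TryGetIndex("ExternalIndex", out var index))
        {
            throw new InvalidOperationException("ExternalIndex not found.");
        }

        // Match any of the found content types, excluding hidden nodes.
        return index.Searcher.CreateQuery("content")
            .GroupedOr(new[] { "__NodeTypeAlias" }, aliases)
            .Not().Field("umbracoSitemapHide", "1")
            .Execute();
    }
}
```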


Sorry, you misunderstand… not what should be in a sitemap per se, but that I need to worry about a query like “give me a filtered set of content nodes” not getting me all my nodes :thinking: and having to work out an approach to make sure I get them all…
That would seem to be the SCRUD that the CMS should handle :person_shrugging: