Saturday, May 07, 2016

Sitecore Decennial Series #2 - Don't Sitecore Query your content

Sitecore does not query, how you think it queries

As a consultant, I am often brought in to projects and solutions that have - to some extent - gone wrong, or does not quite meet up to expectations. And when it comes to those, absolutely without question, the most common problem I come across, is under-performing solutions.

And let's just clear something up. While you will find a lot of posts out there discussing the speed and performance of Sitecore (Sitecore 8, in particular); these are all referring to the performance of the Sitecore Experience Editor. The client environment. And while there is rarely this much smoke without a fire somewhere, I will say this:

The Sitecore runtime is a very well oiled machine. If it doesn't perform, it means YOU broke it.

You are welcome to quote me on this.

Let me exemplify.






















Pretty gruesome, right?  You'd not blame this on "poor performance of ASP.NET", if you found this in some web application I assume.

Now believe me when I tell you; this horrifying example is actually very mild compared to what I often find hidden deep in some of the Sitecore solutions I visit. Usually tucked away in some "*Helper" or "*Utils" (pet peevees of mine) class, used arbitrarily throughout the entire project, never to have been visited by a developer since it was first put in place.

So anyway. Let's take a look at, how situations like this come to be. I will begin with some fundamentals.

The Sitecore (Sql) Data Store

To try and understand what's going on, we're going to take a bit of a dive into the Sitecore Sql data store. Now before you start; I am very well aware that Sitecore abstracts this layer from view (and rightly so), and under no circumstances should you ever be working on the database layer when doing Sitecore. The structures I am about to show you do change from time to time, and you will end up with a bloody nose if you go down this road.

But we cannot escape the fact, that Sitecore data storage is based on Sql Server (in most cases, actually in every single one I have come across in the past 10 years) - and any data storage mechanism comes with a set of built-in limitations and restrictions. So how Sitecore uses its data storage matters, and it will affect how your solution performs.

[Items]














Without question, the table most central to the Sitecore data store. I imagine the fields are self explanatory. 1 row in this table, for every Item in your Sitecore Database ("master", "web" and so on).

This table holds no other values other than Item.Name. For this, Sitecore looks to 3 other tables. They look almost identical.

[VersionedFields]















[UnversionedFields]



[SharedFields]


If you haven't guessed already; Sitecore decides where to place field values depending on how you check your "Shared" and "Unversioned" checkboxes when defining your Sitecore Templates. I will not go further into this, in this post.

But what does this mean?   Well there's a couple of things to pay close attention to here. The first one being; Sitecore sees its data store as a Tree Structure (d'uh I hear you say, you probably already knew that). But this is important. Notice how [Items] stores Id and ParentId (and a few other Ids we don't care about right now). To go anywhere in the "Tree", Sitecore must therefore traverse up and down this chained list of Ids - starting from the top.

So if I were to do something simple as say; Sitecore.Context.Database.GetItem("/sitecore/content") - what this ends up as, is something like this:

exec sp_executesql N' SELECT [ID] FROM [Items] WHERE [ParentID] = @parentId AND [Name] = @childName',N'@parentId uniqueidentifier,@childName nvarchar(7)',@parentId='11111111-1111-1111-1111-111111111111',@childName=N'content'

(Sitecore caching is in effect, so it was already aware of the parentId)

Which is not so bad; but do remember. Had I queried "/sitecore/content/home/data/articles/2015/february" - this would (uncached) have resulted in 6 queries like the one above, to Sql Server. Yep. Again; this is without considering caching, obviously, or we'd all be in big trouble.

And this is just getting an Item. As we saw above, the field values sit in separate tables. Once you start accessing actual content on the Item, this is what it looks like behind the scenes.

exec sp_executesql N'SELECT [ItemId], [Order], [Version], [Language], [Name], [Value], [FieldId], [MasterID], [ParentID]
                     FROM (
                        SELECT [Id] as [ItemId], 0 as [Order], 0 as [Version], '''' as [Language], [Name], '''' as [Value], [TemplateID] as [FieldId], [MasterID], [ParentID]
                        FROM [Items]

                        UNION ALL                          
                        SELECT [ParentId] as [ItemId], 1 as [Order], 0 as [Version], '''' as [Language], NULL as [Name], '''', NULL, NULL, [Id]
                        FROM [Items] 

                        UNION ALL 
                        SELECT [ItemId], 2 as [Order], 0 AS [Version], '''' as [Language], NULL as [Name], [Value], [FieldId], NULL, NULL
                        FROM [SharedFields] 

                        UNION ALL 
                        SELECT [ItemId], 2 as [Order], 0 AS [Version],       [Language], NULL as [Name], [Value], [FieldId], NULL, NULL
                        FROM [UnversionedFields] 

                        UNION ALL 
                        SELECT [ItemId], 2 as [Order],      [Version],       [Language], NULL as [Name], [Value], [FieldId], NULL, NULL 
                        FROM [VersionedFields]
                     ) as temp  WHERE [ItemId] IN (SELECT [ID] FROM [Items] WITH (nolock)  WHERE [ID] = @itemId) 
                     ORDER BY [ItemId], [Order] ASC, [Language] DESC, [Version] DESC',N'@itemId uniqueidentifier',@itemId='0DE95AE4-41AB-4D01-9EB0-67441B7C2450'

Incidentally; ever had a [Shared Field] that you switched to [Unshared], and now you're not getting the field values you expect?  this query is the reason. When you toggle that checkbox, Sitecore starts a background task that will migrate the field values from [SharedFields] to [UnversionedFields]. It has to, as the above query will otherwise return the first field value it finds in the UNION - which will be the one near the top; [SharedFields]. If this task somehow doesn't complete - your data storage gets left in a bit of a pickle. 

So yea. Sitecore essentially goes and grabs all your field values once you start accessing content on an item. And yes, this IS the efficient approach - a roundtrip to Sql for every field value you access hereafter would be absolutely crippling.

But keeping this in mind; picture this:

Sitecore.Context.Database.SelectItems("/sitecore/content//*[@PageTitle='news']")

Caches aside; you've just instructed Sitecore to recursively go find anything that has /sitecore/content as an ancestor (1 SQL statement for each item, to find them all). And then your request for the PageTitle field will then, for each of the found items, execute the big UNION statement from above so that it can determine the field value for PageTitle.

Yep, you read that right. You may be imagining things like; "but I can construct a SQL statement that finds this more effectively". And you're right, you can. But you'd also be ignoring Sitecore through and through - you might as well not bother using a CMS at all. What you're overlooking (to name a few) are things like Versions (which of the 7 versions of /Home is the active one?), Language, Security, Workflow Status. 

I'll say this as plainly and calmly as I possibly can. Do NOT use Sitecore Query for any code that executes at runtime on your solution. "If it doesn't perform, it means YOU broke it."

When it comes to constructing a more efficient query, you're not the first one to have this thought though; enter Sitecore Fast Query.

Sitecore Fast Query

SFQ is a query mechanism that aims to address some of the limitations of regular Sitecore Query. It does so by compromising exactly some of the things I mention above. So sacrifice a little functionality to gain a little speed. I'm almost tempted to paraphrase Frankling and claim "you shall have neither".

But let's look at it. From the "Using Sitecore Fast Query Cookbook" (how I miss those)
































Now; given everything that I have just shown you - this makes perfect sense. The only supported special attributes - all of those are what you find on the [Items] table. Standard Values are not supported (as these sit on a separate item). I'll leave it to you to speculate why field values need to be surrounded by % signs ;-)

Ok, so let's take a look then. I cook up something new:

Sitecore.Context.Database.SelectItems("fast:/sitecore/content//*[@#Page Title#=\"%article%\"]");

And the resulting Sql looks like this:

exec sp_executesql N'SELECT DISTINCT [i].[ID] [ID], [i].[ParentID] [ParentID] FROM [Items] [i] WITH (NOLOCK)  LEFT OUTER JOIN (SELECT [Fields].* from [Fields] INNER JOIN [Items] ON [Fields].[FieldID] = [Items].[ID] AND lower([Items].[Name]) = ''page title'') [Fields1] ON [i].[ID] = [Fields1].[ItemId] INNER JOIN [Descendants] ON [i].[ID] = [Descendants].[Descendant] INNER JOIN (SELECT DISTINCT [i].[ID] [ID], [i].[ParentID] [ParentID] FROM [Items] [i] WITH (NOLOCK)  INNER JOIN (SELECT DISTINCT [i].[ID] [ID], [i].[ParentID] [ParentID] FROM [Items] [i] WITH (NOLOCK)  WHERE LOWER([i].[Name]) = ''sitecore'' AND [i].[ParentID] = @value1) [a] ON [i].[ParentID] = [a].[ID] WHERE LOWER([i].[Name]) = ''content'') [a] ON [Descendants].[Ancestor] = [a].[ID] WHERE (coalesce([Fields1].[Value], '''') LIKE @value2)',N'@value1 uniqueidentifier,@value2 nvarchar(9)',@value1='00000000-0000-0000-0000-000000000000',@value2=N'%article%'

And a footprint like this.










Ok. So there's no doubt, this is BETTER than our previous situation. But only by a measure, it's not the end-all to our performance woes. I have about 20 (!) items in the vanilla solution I am using to write this blog post by the way. Twenty.

Interesting (unofficial) factoid: Do you know why FAST: query was introduced to begin with?  To solve performance problems in Sitecore Content Editor. Yes. Think about it; opening up any item in CE presents you with a series of fields. Each of these fields may be Treelists and whatnot, and loading up all that related meta-data and presets, populating all the dropdown boxes and so on - well it pretty much killed performance in early Sitecore (5 and 6) releases. And for Content Editor, the limitations of Fast Queries are fine. No doubt they're (often) better than the regular Sitecore Query equivalent. But that still doesn't make them good.

Don't look to caching to save you either. It helps, surely. But even cached results need to be generated at least once; and now you find yourself in a whole heap of other problems instead. When to clear cache?  on publish?  you sure?   So with editors publishing new content every 5 minutes (to test it out - trust me, they do), how much "cache debt" does your solution need to deal with?

Look; we're in my Decennial Series. I'm giving you my hard earned advice. 

Just stay away from any form of querying. There are other and better ways. I'll give you a few pointers; most of what I'm highlighting here is already well covered by blog posts everywhere (far from all of them, mine). Seek them out, please. Please. Next time you feel, you need to query for your content.

What to do instead


Information Architecture

The first and most important thing to keep in mind, is how you organise your content. Try and keep related items close together and organise them in a meaningful deep tree hierarchy. So for, as an example, News Articles, do this:

/sitecore/content/news/2016/04/[your articles here]

Which gives you the option to do:

var root = Sitecore.Context.Database.GetItem("/sitecore/content/news/2016/04");
var newsArticles = root.GetChildren();

And unless you have an extraordinarily "news'y" site, I assure you this will be better than the Query equivalent which might look something like:

var newsArticles = Sitecore.Context.Database.SelectItems("/sitecore/content/news/*[@newsDate > '2016-04-01' & @newsDate < '2016-05-01'])

Don't be overly worried about reading in 10-20 items in a .GetChildren() call; be worried about reading in 2000 items and discarding 1980 of them.

Code Smart(er)

There are often multiple ways of achieving something. This certainly holds true for Sitecore as well. One of the little known gems is the Sitecore LinkDatabase. Ever tried deleting an item, and have Sitecore pop up and tell you this:

















How does Sitecore know this?  And you probably guessed it already. The LinkDatabase. The SLDB is easily worthy of a post of its own, but in essence it's this:

A table maintained by Sitecore, where all reference field references (!) are stored. This allows Sitecore to keep track of inter-Item relations so it can pop this warning to you.

Since we're doing the low-level thing today, here's what it looks like:



















Turns out, this is an excellent tool in many other situations as well. I think my Listing "Related Articles" using LinkDatabase was the first time I blogged about it; and I regularly dust off SLDB and pick it up from my tool shelf even to this day.

A common scenario that often comes up, is "I need to find all News articles, but they're spread out all over my content tree. Silly editors". Well SLDB has a fix for that; although here it is used indirectly.









In case you're wondering; .GetUsageIDs() uses SLDB internally to achieve it's result.

I highly recommend you read up on the Sitecore LinkDatabase. It can really make your life easier in a lot of situations, where you would otherwise feel tempted to start Sitecore Querying.

Index your content

And finally, but also most importantly, use an index for every single Sitecore solution you ever do. I mean it. Seriously.

Want to find all News Articles in a specified date range?  use an index.
Want to find all News Articles in the "Business News" category, sorted by Author? use an index.

Even with all the tricks I have show you here, sooner or later you will come to a situation where the content cannot be organised any differently, where LinkDatabase cannot help you as we're dealing with content and not references - or flat out, we're dealing with so much content that we just have to narrow down the dataset to be able to work with it.

Fortunately, I've already given the answer to these problems. Use an index.

There are quite literally dozens and dozens of blog posts out there, on how to set up and use Sitecore ContentSearch (which is where I recommend you start out). It takes so little effort to get going, it's active out of the box (using Lucene as it's provider), there quite simply is not a single valid excuse not to use it.

I know it's "another thing to learn". This one is not optional however; you really cannot go around it. You may feel you're doing just fine and your Sitecore Queries run fine. Great. Give yourself a nice "It works on my machine" sticker, and then get those indexes going ;-)

Seriously.

Until next time :-)


No comments: