Seems I am on a writing streak this week. I am taking a week off, you see, from my normal everyday Sitecore Consulting, and seem to have a bit of time on my hands to catch up on some of the posts I’ve been meaning to write for a while. Don’t worry; after this I will probably be way too busy again for a while to find time to post ;-)
So I was catching up on StackOverflow the other day, and an interesting question was posed: “How to find related items by tags in Lucene.NET”.
And while there probably IS a way to actually do this with Lucene.NET, I remember my initial thought was “but why go through all the hassle of configuring and setting it up to do this?”. Not only would it complicate things from an Operations point of view; it would require more code, and that code would be completely dependent on specific configuration settings in the Lucene indexes.
Now, let me be very clear, I am no big expert on Lucene. There are many of you out there who know it well, and would probably be able to cook up a solution to answer the guy’s question using it. As for myself, I try to keep as much arcane configuration as possible out of any project I am involved in, especially to solve a problem such as this, where Sitecore pretty much gives you the tools you need to solve it straight out of the box.
So anyway. Guy was asking in a Lucene context, but was looking for proposals. And I decided to give it a whirl, mocked up some pseudo-code to solve the problem, and that was that. But see; everyone can write pseudo-code :P And it’s only fair I put my err… code where my mouth is, and write up a real example of how this can be achieved in the manner I explained. Here goes.
Setting it up in Sitecore
I start by making up two templates:
1) “Simple Value”, which will be used to organise the meta tags I will be drawing upon.
It has no fields.
2) “Article”, which I will use to demonstrate how to implement “Related Articles” functionality.
I then set up a meta-structure that I will be using to tag up my articles, and ultimately draw out related articles. I don’t fill out the entire structure, nor do I mean to imply this structure is perfect. But it is enough to demonstrate the point, and should be easy enough to follow. All the tags are based on the “Simple Value” template.
After this, I go through the somewhat tedious task of setting up a number of articles that are tagged in different ways.
For now, I type and tag in 7 articles; like this:
Name: Ben Hur
Tags: O2 Arena, Theatre
Name: Britney Spears
Tags: O2 Arena, Pop, Concert
Name: Depeche Mode
Tags: O2 Arena, Alternative, Concert
Name: Michael Jackson
Tags: O2 Arena, Pop, Concert
Name: Nickelback
Tags: O2 Arena, Rock, Concert
Name: Pet Shop Boys
Tags: O2 Arena, Pop, Concert
Name: War of the Worlds
Tags: O2 Arena, Musical
I should probably go on for a while longer if I really wanted to go all-out in demonstrating this. However, I do have enough now, and it’ll have to do. I hate typing in test data ;-)
Before I go on, I should explain exactly how I intend to deduce what “related articles” should be. It can be done and determined in many ways – but I am proceeding exactly in the manner that was originally in question on StackOverflow. The rule can be described as two statements:
1) An article is related if it shares one or more tags with the source article
2) The more tags it shares, the more relevant it becomes (i.e. should appear higher on the list)
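To make the rule concrete before touching Sitecore at all, here is a minimal sketch of it over plain in-memory data. The class name RelatedRule, the Rank method and the dictionary of tag names are made up for illustration only; the tie-break on article name is also my own addition, there purely so the output is deterministic. It uses a handful of the articles from the list above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RelatedRule
{
    // Ranks every other article by the number of tags it shares with the source.
    // Articles sharing no tags are left out (statement 1); the rest are sorted
    // by shared-tag count, highest first (statement 2).
    public static List<KeyValuePair<string, int>> Rank(
        string source, IDictionary<string, string[]> tagsByArticle )
    {
        var sourceTags = new HashSet<string>( tagsByArticle[ source ] );

        return tagsByArticle
            .Where( entry => entry.Key != source )
            .Select( entry => new KeyValuePair<string, int>(
                entry.Key,
                entry.Value.Count( tag => sourceTags.Contains( tag ) ) ) )
            .Where( pair => pair.Value > 0 )
            .OrderByDescending( pair => pair.Value )
            // Assumption: ties are broken by name, only to keep the output stable
            .ThenBy( pair => pair.Key, StringComparer.Ordinal )
            .ToList();
    }

    public static void Main()
    {
        var tagsByArticle = new Dictionary<string, string[]>
        {
            { "Ben Hur",         new[] { "O2 Arena", "Theatre" } },
            { "Britney Spears",  new[] { "O2 Arena", "Pop", "Concert" } },
            { "Depeche Mode",    new[] { "O2 Arena", "Alternative", "Concert" } },
            { "Michael Jackson", new[] { "O2 Arena", "Pop", "Concert" } },
            { "Nickelback",      new[] { "O2 Arena", "Rock", "Concert" } },
            { "Pet Shop Boys",   new[] { "O2 Arena", "Pop", "Concert" } }
        };

        foreach ( var pair in Rank( "Britney Spears", tagsByArticle ) )
            Console.WriteLine( "{0} ({1})", pair.Key, pair.Value );

        // Prints:
        // Michael Jackson (3)
        // Pet Shop Boys (3)
        // Depeche Mode (2)
        // Nickelback (2)
        // Ben Hur (1)
    }
}
```

This compares tag sets article by article, which is fine for seeing the rule at work on seven articles. It is not how I go about it in Sitecore, where the lookup runs the other way: from each tag to the items referencing it.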
Lastly, I set up a blank .ASPX page in my webroot named “TestRelated.aspx”, and I quickly mock up two DomainObjects that I will build upon for this functionality.
SimpleValue.cs
using CorePoint.DomainObjects.SC;
using CorePoint.DomainObjects;

namespace Website.Related
{
    [Template("user defined/simple value")]
    public class SimpleValue : StandardTemplate
    {
    }
}
Article.cs
using System;
using System.Collections.Generic;
using CorePoint.DomainObjects.SC;
using CorePoint.DomainObjects;

namespace Website.Related
{
    [Template("user defined/article")]
    public class Article : StandardTemplate
    {
        [Field("title")]
        public string Title { get; set; }

        [Field("text")]
        public string Text { get; set; }

        [Field("tags")]
        public List<Guid> Tags { get; set; }
    }
}
And finally, in my TestRelated.aspx.cs, I add a bit of code to test that everything is as expected.
public partial class TestRelated : System.Web.UI.Page
{
    protected void Page_Load( object sender, EventArgs e )
    {
        var director = new SCDirector();
        List<Article> articles = director.GetChildObjects<Article>( "/sitecore/content/global/articles" );

        foreach ( Article article in articles )
        {
            // Get the SimpleValues (name) from the tag Guids
            var simpleValues = article.Tags.ConvertAll<string>( a =>
            {
                return director.GetObjectByIdentifier<SimpleValue>( a ).Name;
            } );

            StringBuilder sb = new StringBuilder();
            simpleValues.ForEach( sv => sb.Append( sv + ' ' ) );

            Response.Write( string.Format( "Name: {0}<br />Tags: {1}<br /><br />", article.Name, sb.ToString() ) );
        }
    }
}
So far so good. I run the code, and I get back the list I already showed (tag order aside):
Name: Ben Hur
Tags: O2 Arena Theater
Name: Britney Spears
Tags: Pop Concert O2 Arena
Name: Depeche Mode
Tags: O2 Arena Concert Alternative
Name: Michael Jackson
Tags: Pop Concert O2 Arena
Name: Nickelback
Tags: Rock Concert O2 Arena
Name: Pet Shop Boys
Tags: O2 Arena Concert Pop
Name: War of the Worlds
Tags: O2 Arena Musical
Excellent. After all this, I am now ready to proceed to the good stuff ;-)
Finding Related Articles using the Sitecore LinkDatabase
Having an Article entity in place makes it the obvious place to add functionality such as Related Articles. I could either add it as a Lazy Load property named “RelatedArticles”, or I could write a method named “GetRelatedArticles()”. This is mostly down to aesthetics and practices; personally I prefer the first option.
I expand Article.cs with a little bit of code. The original pseudo-code I suggested is entered in comments, for reference.
private int _referenceCount;
List<Article> _RelatedArticles = null;

public List<Article> RelatedArticles
{
    get
    {
        if ( _RelatedArticles == null )
        {
            _RelatedArticles = new List<Article>();
            var referenceCount = new Dictionary<Guid, int>();

            // for each ID in tags
            foreach ( Guid id in Tags )
            {
                var sv = Director.GetObjectByIdentifier<SimpleValue>( id );

                // Personal note: In this particular instance, performance
                // could be gained here, by not loading up full articles
                // via DomainObjects but hitting the LinkDatabase directly instead

                // get all documents referencing this tag
                List<Article> articles = sv.GetReferrers<Article>();

                // for each document found
                articles.ForEach( a =>
                {
                    if ( a.Id != Id )
                    {
                        // if master-list contains document;
                        if ( referenceCount.ContainsKey( a.Id ) )
                            referenceCount[ a.Id ]++; // increase usage-count
                        else // else;
                            // add document to master list
                            referenceCount[ a.Id ] = 1;
                    }
                } );
            }

            // Now we have a list of all the relevant guids being referenced on all tags
            // on this article. Load them up, and stamp them with the reference count
            foreach ( var key in referenceCount.Keys )
            {
                var relatedArticle = Director.GetObjectByIdentifier<Article>( key );
                relatedArticle._referenceCount = referenceCount[ key ];
                _RelatedArticles.Add( relatedArticle );
            }

            // sort master-list by usage-count descending
            _RelatedArticles.Sort( ( a, b ) => b._referenceCount.CompareTo( a._referenceCount ) );
        }
        return _RelatedArticles;
    }
}
And to test if what I’m getting from this is what I expect, I also add some code to my TestRelated.aspx so it becomes:
protected void Page_Load( object sender, EventArgs e )
{
    var director = new SCDirector();
    List<Article> articles = director.GetChildObjects<Article>( "/sitecore/content/global/articles" );

    foreach ( Article article in articles )
    {
        // Get the SimpleValues (name) from the tag Guids
        var simpleValues = article.Tags.ConvertAll<string>( a =>
        {
            return director.GetObjectByIdentifier<SimpleValue>( a ).Name;
        } );

        StringBuilder sb = new StringBuilder();
        simpleValues.ForEach( sv => sb.Append( sv + ", " ) );

        Response.Write( string.Format( "Name: {0}<br />Tags: {1}<br />Related Articles: ", article.Name, sb.ToString() ) );
        article.RelatedArticles.ForEach( ra => Response.Write( string.Format( "{0},", ra.Name ) ) );
        Response.Write( "<hr />" );
    }
}
And after all this, I am pleased to find a result looking like:
Name: Ben Hur
Tags: O2 Arena, Theater,
Related Articles: Michael Jackson,Britney Spears,Depeche Mode,Nickelback,Pet Shop Boys,War of the Worlds,
Name: Britney Spears
Tags: Pop, Concert, O2 Arena,
Related Articles: Michael Jackson,Pet Shop Boys,Depeche Mode,Nickelback,Ben Hur,War of the Worlds,
Name: Depeche Mode
Tags: O2 Arena, Concert, Alternative,
Related Articles: Britney Spears,Michael Jackson,Nickelback,Pet Shop Boys,War of the Worlds,Ben Hur,
Name: Michael Jackson
Tags: Pop, Concert, O2 Arena,
Related Articles: Britney Spears,Pet Shop Boys,Depeche Mode,Nickelback,Ben Hur,War of the Worlds,
Name: Nickelback
Tags: Rock, Concert, O2 Arena,
Related Articles: Britney Spears,Depeche Mode,Pet Shop Boys,Michael Jackson,Ben Hur,War of the Worlds,
Name: Pet Shop Boys
Tags: O2 Arena, Concert, Pop,
Related Articles: Britney Spears,Michael Jackson,Depeche Mode,Nickelback,War of the Worlds,Ben Hur,
Name: War of the Worlds
Tags: O2 Arena, Musical,
Related Articles: Ben Hur,Britney Spears,Depeche Mode,Nickelback,Pet Shop Boys,Michael Jackson,
The first thing that strikes me is that my meta data and test data probably aren’t extensive enough to really see this functionality in full effect. The results all look almost the same.
However, I can determine that it works as expected. “Britney Spears”, “Michael Jackson” and “Pet Shop Boys” all share the same 3 meta tags, so each of them SHOULD list the other two at the top of its “Related Articles”. And they all do. Also note that the “Depeche Mode” concert in O2 Arena lists other concerts (although of different music genres) before it proceeds to list the musicals and theatre plays.
It works :-)
A few notes on performance
In this post, I’ve deliberately not focused excessively on performance implications. Don’t worry, it’s not at all bad. But in “real life” there are still obvious places in this code where you could potentially gain a significant amount of performance. As everyone will know, I/O operations are among the most expensive calls we can make, often by an order of magnitude, and there are definitely a few places where you could step in here.
A few suggestions I would look into if I were to take this code live:
- Code up a TagController; that will eventually act as a cache for all the tags in your solution. Load up the tags only once, and don’t repeatedly re-load them in your loops.
- In this case, bypass the very convenient .GetReferrers() method provided by DomainObjects and go through the extra work of working with the LinkDatabase directly yourself. For this part of the algorithm (counting up how many times a given ID is referencing your tag), you don’t really need to load up the Sitecore Item – something .GetReferrers() will automatically do. I will put this on the TODO list for DomainObjects.
- And – as ALWAYS – don’t forget to configure caching for whatever sublayouts and/or user controls you are calling this functionality on.
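As a rough sketch of the first two suggestions, assuming nothing about DomainObjects itself: TagNameCache, ReferrerCounter and the two delegates below are hypothetical names of my own, and the link lookup is faked with an in-memory dictionary. In a real solution I would expect the getReferrerIds delegate to be backed by Sitecore’s LinkDatabase (Globals.LinkDatabase.GetReferrers( item ) and the SourceItemID of each returned ItemLink, if memory serves; do check this against your Sitecore version).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: loads each tag name at most once, then serves it from memory.
public class TagNameCache
{
    private readonly Dictionary<Guid, string> _names = new Dictionary<Guid, string>();
    private readonly Func<Guid, string> _loadName;

    public TagNameCache( Func<Guid, string> loadName )
    {
        _loadName = loadName;
    }

    public string GetName( Guid tagId )
    {
        lock ( _names )
        {
            string name;
            if ( !_names.TryGetValue( tagId, out name ) )
            {
                name = _loadName( tagId ); // the only place the loader is called
                _names[ tagId ] = name;
            }
            return name;
        }
    }
}

// Hypothetical helper: counts shared tags using IDs only, without loading any items.
public static class ReferrerCounter
{
    public static Dictionary<Guid, int> CountSharedTags(
        Guid sourceId, IEnumerable<Guid> tagIds, Func<Guid, IEnumerable<Guid>> getReferrerIds )
    {
        var counts = new Dictionary<Guid, int>();
        foreach ( Guid tagId in tagIds )
        {
            // Distinct() guards against one item linking to the same tag twice
            foreach ( Guid referrerId in getReferrerIds( tagId ).Distinct() )
            {
                if ( referrerId == sourceId ) continue; // skip the source article itself

                int current;
                counts.TryGetValue( referrerId, out current ); // current is 0 when missing
                counts[ referrerId ] = current + 1;
            }
        }
        return counts;
    }
}

public static class Demo
{
    public static void Main()
    {
        Guid pop = Guid.NewGuid(), concert = Guid.NewGuid();
        Guid britney = Guid.NewGuid(), michael = Guid.NewGuid(), nickelback = Guid.NewGuid();

        // In-memory stand-in for the link lookup: tag ID -> IDs of items referencing it
        var links = new Dictionary<Guid, Guid[]>
        {
            { pop,     new[] { britney, michael } },
            { concert, new[] { britney, michael, nickelback } }
        };

        var counts = ReferrerCounter.CountSharedTags(
            britney, new[] { pop, concert }, tagId => links[ tagId ] );

        Console.WriteLine( counts[ michael ] );    // 2
        Console.WriteLine( counts[ nickelback ] ); // 1

        int loads = 0;
        var cache = new TagNameCache( id => { loads++; return "Pop"; } );
        cache.GetName( pop );
        cache.GetName( pop );
        Console.WriteLine( loads );                // 1
    }
}
```

With the counts in hand, only the articles that actually end up on the list need to be loaded. The cache here never expires anything, so on a live site it would have to be cleared whenever tags are renamed or published.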
That’s it for this time. I hope you found this useful :-)