Monday, April 13, 2009

Migrating data into Sitecore

One question I’ve noticed gets posed ever so often on the Sitecore Forums is; “How do I migrate data from my clients previous CMS into Sitecore?”  Or similar type questions, to more or less the same effect.  Also commonly asked on the same occasion is; “Does Sitecore provide a tool for this?”

 

Just to make sure we’re all on the same page here; I’m referring to the act of importing structured data (i.e. the clients existing website, sitting in any structured repository – CMS or otherwise – and getting that migrated into Sitecore for use in the Next Generation CMS system being developed by you).

 

This process is also known as Data Migration, and WikiPedia defines it as follows:

 

Data migration is the process of transferring data between storage types, formats, or computer systems. Data migration is usually performed programmatically to achieve an automated migration, freeing up human resources from tedious tasks. It is required when organizations or individuals change computer systems or upgrade to new systems, or when systems merge (such as when the organizations that use them undergo a merger/takeover).”

 

It further states:

 

“To achieve an effective data migration procedure, data on the old system is mapped to the new system providing a design for data extraction and data loading. The design relates old data formats to the new system's formats and requirements. Programmatic data migration may involve many phases but it minimally includes data extraction where data is read from the old system and data loading where data is written to the new system.”

 

And lastly:

 

“Changing application vendor - for instance a new CRM or ERP platform will inevitably involve substantial transformation as almost every application or suite operates on its own specific data model.”

 

If we boil all of this information down, we get something like “Data Migration is usually performed programmatically. To do it, a design for extracting data and loading data needs to be developed.”

 

And that’s where you come in.

 

We already have at least a partial answer to one of the above questions. Sitecore couldn’t really provide a tool that could achieve this. Data extraction would always be bespoke to the specific system being migrated from – so while you might be lucky enough to find something that would extract data from say a specific version EpiServer to a specific version of Sitecore; chances are you would still end up having to do quite a bit of the data design yourself to satisfy your business requirements.

 

That being said, I do believe Sitecore offers a few features to help you out.

 

For instance, there’s the Xml Importer.

    • It essentially works as a Transformation Engine. Given an XML input file (presumably produced by the system you intend to migrate from), you then write an XSLT file to transform the XML input file to a pre-defined format, which the Xml Importer will then use to create Sitecore Content Items.
    • This is probably only fine for the simplest of cases. In most migrations I have done, there has almost always been business requirements to perform more “advanced” features such as UrlMapping (setting up Url Redirection so that the existing links will work on the target Sitecore installation), Media Importing, category mapping to name a few. Not that you couldn’t achieve most of this using XSLT. I just think it will be very hard and unworkable.

 

There may or may not be similar tools available. I’ve never used them myself simply because in the vast majority of cases, a pre-built tool would not be able to achieve the requirements I was tasked to fulfil. The process of actually creating Sitecore Content Items is the least of my worries when it comes to migration.

 

While I can’t provide a specific solution in this post to whatever migration problem you may be facing, I will take you through a fictional scenario to help explain some of the points that needs to be addressed, and provide a few techniques that you may find useful.

 

For demonstration purposes, I will be importing the good old trusted Northwind database. Now before you start, I’m not saying this is neither a very good idea or anything similar – Northwind doesn’t hold web content and is therefore a very unlikely candidate for this type of migration. I have chosen it simply because many of you are already familiar with the structure of it, and it is freely available for anyone to download and experiment with.

 

Designing the model

To begin the process of migration, you first need to analyse the data model, and come up with an implementation that will work in your Sitecore solution.

 

In the example I’m going to cover, I’ve been tasked with the following:

 

“Migrate the products into a relevant categorised Sitecore Content Structure, making sure you preserve information about each supplier for the migrated products”.

 

After looking a bit at the products table definition, I conclude the following:

 

products

  • Each product has only 1 supplier
  • Each product is only defined in 1 category

 

This is fortunate, as it will make the migration a fairly simple task. More often than not, things will not be this well defined and simple.

 

So in terms of a Sitecore Content Structure, this could look like this:

 

  • /sitecore/content/Home
    • /sitecore/content/Home/Products
      • /sitecore/content/Home/Products/CategoryA
      • /sitecore/content/Home/Products/CategoryB
      • … and so on
    • /sitecore/content/Home/Suppliers
      • /sitecore/content/Home/Suppliers/SupplierA
      • /sitecore/content/Home/Suppliers/SupplierB
      • … and so on

 

You could of course map this in many different ways. A few of the things you should be considering when designing your structure are:

 

  • How many Sitecore Items will be created on each branch? 
    • Sitecore recommends that no more than 100 Items are created on any one branch level.
  • How will the existing meta-structure be preserved?  
    • In this case, a products category is implicitly derived from the Category node it sits under. Alternatively, as I will need to do with suppliers, you could create references from your data to point to the relevant meta data.
  • Will any filtering or data transformation need to take place?
    • Often, there are old data in the source system that you might as well filter out when doing the migration. This could be data prior to a certain date (i.e. only migrate the last 3 years worth of data) or certain categories that will not be migrated to the new system.
    • Are there inline links that you will need to handle?   As Northwind doesn’t hold web content, this is not relevant to this example. But in most cases, you will need to RegEx through content fields and rewrite Internal Links to point to the new locations of your content within Sitecore. This topic is complex enough to deserve an entire post of it’s own, so I won’t be going into detail with this here.

 

Having defined this structure, I now explore the relevant Database Schemas, and come up with the following 3 Sitecore Templates that will eventually hold my migrated data. In this case I started on a blank Sitecore 6 installation (090212, but that isn’t really relevant – this migration process would not differ significantly between Sitecore 5 and Sitecore 6).

 

Product:

product

 

Category:

category

 

Supplier:

supplier

 

I also create the basic skeleton structure I defined above, and run a Smart Publish. I’ll explain why, in just a bit.

 

Skeleton structure:

structure

 

And I am now ready to proceed to the next stage.

Extracting the data

When it comes to extracting the Northwind Data, fortunately this is almost a textbook example. The data is nice and clean, referential integrity is in place, and .NET makes it very straightforward to connect to my underlying MS SQL Server to reach the relevant tables. I know there are many boilerplate examples out there of how to easily access Northwind using LINQ and similar technologies, but since the data extraction will be so simple in this case I don’t really see a need for invoking any of them.

 

One toolkit I will use however, is my CorePoint.DomainObjects. Don’t see this as a shameless plug; one of the main reasons I developed the toolkit was to help in my many migration projects and it just makes the task ahead so much easier.

 

I set up a standard Sitecore project, and quickly mock up a few entities:

 

project

 

Product.cs:

using System;
using System.Data;
using CorePoint.DomainObjects.SC;
using CorePoint.DomainObjects;

namespace Website.Migration
{
    [Template("user defined/product")]
    public class Product : StandardTemplate
    {
        // A few properties that will be needed to map out the Sitecore hierarchy
        public int ProductId { get; set; }
        public int SupplierId { get; set; }
        public int CategoryId { get; set; }

        [Field("product name")]         public string ProductName { get; set; }
        [Field("quantity per unit")]    public string QuantityPerUnit { get; set; }
        [Field("unit price")]           public double UnitPrice { get; set; }
        [Field("units in stock")]       public int UnitsInStock { get; set; }
        [Field("units on order")]       public int UnitsOnOrder { get; set; }
        [Field("reorder level")]        public int ReorderLevel { get; set; }
        [Field("discontinued")]         public bool Discontinued { get; set; }
        [Field("supplier")]             public Guid Supplier { get; set; }

        public Product()
        {
        }

        public Product( DataRow row )
        {
            // For brewity, null values are not checked for. There are none in the 
            // Northwind products table
            ProductName = Convert.ToString( row[ "ProductName" ] );
            SupplierId = Convert.ToInt32( row[ "SupplierID" ] );
            CategoryId = Convert.ToInt32( row[ "CategoryID" ] );
            QuantityPerUnit = Convert.ToString( row[ "QuantityPerUnit" ] );
            UnitPrice = Convert.ToDouble( row[ "UnitPrice" ] );
            UnitsInStock = Convert.ToInt32( row[ "UnitsInStock" ] );
            UnitsOnOrder = Convert.ToInt32( row[ "UnitsOnOrder" ] );
            ReorderLevel = Convert.ToInt32( row[ "ReorderLevel" ] );
            Discontinued = Convert.ToBoolean( row[ "Discontinued" ] );
        }
    }
}

Category.cs:

using System;
using System.Data;
using CorePoint.DomainObjects.SC;
using CorePoint.DomainObjects;

namespace Website.Migration
{
    [Template("user defined/category")]
    public class Category : StandardTemplate
    {
        public int CategoryId { get; set; }

        [Field("category name")]    public string CategoryName { get; set; }
        [Field("description")]      public string Description { get; set; }

        public Category()
        {
        }

        public Category( DataRow row )
        {
            CategoryId = Convert.ToInt32( row[ "CategoryID" ] );
            CategoryName = Convert.ToString( row[ "CategoryName" ] );
            Description = Convert.ToString( row[ "Description" ] );
        }
    }
}

Supplier.cs:

using System;
using System.Data;
using CorePoint.DomainObjects.SC;
using CorePoint.DomainObjects;

namespace Website.Migration
{
    [Template("user defined/supplier")]
    public class Supplier : StandardTemplate
    {
        public int SupplierId { get; set; }

        [Field("company name")]        public string CompanyName { get; set; }

        // For brewity, the rest of the fields are omitted, as
        // they're not really needed to demonstrate the migration

        public Supplier()
        {
        }

        public Supplier( DataRow row )
        {
            SupplierId = Convert.ToInt32( row[ "SupplierID" ] );
            CompanyName = Convert.ToString( row[ "CompanyName" ] );
        }
    }
}

 

And now we’re ready to get going. As you can see, I cheated a bit on Supplier.cs. This article is long enough, without me listing out 15 or so similar fields that don’t really add value to the demonstration / example itself.

 

I also created a completely standard .ASPX page under the root of the site, DoMigrate.aspx. It is in this page I will be executing the migration itself.

 

I start by adding some code that will load up all of the required source data into memory. When developing a migration, there are a few things to keep in mind:

 

  • Your code is more or less fire and forget. It will not run in a live environment, it will not need to be maintained by yourself or your peers for an extended period of time (normally) and therefore many of the rules and guidelines that apply to developing web applications don’t really apply here.
  • For performance reasons, load up as much as you possibly can into RAM. Run 64 bit if you can, and don’t worry if your migration script is eating up gigabytes of memory. If your migration mapping is complex (this example isn’t), you will appreciate the performance gain from having in-memory tables and lookups, and can in many cases mean a difference of 4 hours versus 4 minutes of execution time.
  • Don’t expect to get it spot-on the first time. You will be running the script many times over.

 

I add a bit of code to the Page_Load method:

protected void Page_Load( object sender, EventArgs e )
{
    // First we start by loading all source data into memory.

    string connectionString = 
        @"user id=sa;password=removed;Data Source=.\SQLEXPRESS;Database=Northwind";

    DataTable products = new DataTable();
    DataTable suppliers = new DataTable();
    DataTable categories = new DataTable();

    using ( var conn = new SqlConnection( connectionString ) )
    {
        using ( var da = new SqlDataAdapter( "SELECT * FROM Products", conn ) )
        {
            da.Fill( products );
        }

        using ( var da = new SqlDataAdapter( "SELECT * FROM Categories", conn ) )
        {
            da.Fill( categories );
        }

        using ( var da = new SqlDataAdapter( "SELECT * FROM Suppliers", conn ) )
        {
            da.Fill( suppliers );
        }
    }
}

 

As you can see, no surprises there. I am basically filling up 3 Datatables with information from the Northwind database.

 

I add a little code to load up all the data in the entities I created above, like this:

List<Product> ProductList = new List<Product>();
List<Category> CategoryList = new List<Category>();
List<Supplier> SupplierList = new List<Supplier>();

foreach( DataRow row in products.Rows )
    ProductList.Add( new Product( row ) );
foreach ( DataRow row in categories.Rows )
    CategoryList.Add( new Category( row ) );
foreach ( DataRow row in suppliers.Rows )
    SupplierList.Add( new Supplier( row ) );

Response.Write( string.Format( "{0} products, {1} categories, {2} suppliers<br />",
                ProductList.Count,
                CategoryList.Count,
                SupplierList.Count ) );

 

And for the first time, I run my migration script-to-be.

 

“77 products, 8 categories, 29 suppliers”

 

This is a good time to sanity check your code and make sure the numbers you are getting are more or less what you expect. Comparing with the rowcounts of the source database, I conclude that everything appears to be in order, and continue. It is now time to create some Sitecore Items based on this data.

 

Before I do that, there are a few things to point out:

 

  • The order which you create the Sitecore items in, is not irrelevant. In our case here, Products refer Suppliers and sit beneath Categories. Therefore the first Sitecore Items you create must be Suppliers (or Categories, in this case that won’t matter) followed by Categories and finally followed by Products.
  • I always run test migrations against the “web” database. This has a few advantages:
    • It runs faster (I think…  it should, since there is no publishing queue handling)
    • It allows you to easily undo a failed migration. Just publish your skeleton again, and the whole thing resets back to start

 

I add some code to create Sitecore Content. Don’t worry, we’re almost done now :-)

// Set up a director for "web", standard language and security disabled
SCDirector director = new SCDirector( "web", "en", true );
var suppliersRoot = director.GetObjectByIdentifier( "/sitecore/content/home/suppliers" );
var productsRoot = director.GetObjectByIdentifier( "/sitecore/content/home/products" );

// And with these references in place, it's time to create our Sitecore Content Items

SupplierList.ForEach( delegate( Supplier s )
{
    s.Name = s.CompanyName;        // A name for the Sitecore Item (duh.. :P)
    s.ParentId = suppliersRoot.Id;
    s.Director = director;
    s.Store();
} );

CategoryList.ForEach( delegate( Category c )
{
    // First create the category
    c.Name = c.CategoryName;
    c.ParentId = productsRoot.Id;
    c.Director = director;
    c.Store();

    // And now create all products under the category
    ProductList.FindAll( p => p.CategoryId == c.CategoryId ).ForEach( delegate( Product p )
    {
        p.Name = p.ProductName;
        p.ParentId = c.Id; // .Id was assigned when the category was stored
        p.Director = director;

        // Pay attention to this one. I now map the Supplier Id field with the
        // Sitecore ID of the related supplier (created above)
        p.Supplier = SupplierList.Find( s => s.SupplierId == p.SupplierId ).Id;
        p.Store();
    } );
} );

Response.Write( "All done!<br />" );

 

And here we go. The data is now migrated. On my overworked laptop, this code executes in just a couple of seconds. The results look like this:

 

migration

 

And you’re all done.

 

A few last things to consider are this:

 

  • If migrating large quantities of data, try and disable as many Sitecore event handlers and whatever else you can get away with.
  • Use BulkUpdateContext()
  • Don’t forget your target language
  • If you can, make the fields shared and unversioned. This should help migration execution speed.
  • And probably most importantly: Most of your work will almost always be during analysis. Figuring out
    • How to get to the source data
    • Determining what data needs to be migrated
    • Designing a reasonable way to store the data in Sitecore

 

I don’t know of many shortcuts. This is very rarely as easy a task as this one was. I know there are standards in the making, for how CMS systems could store (or at least interchange data), but I don’t know if those are realistic and (more importantly) will be widely adopted by the various CMS vendors.

 

I hope you found this post useful. Probably my longest post ever. Comments and feedback welcomed :-)

8 comments:

Alex de Groot said...

Great post Mark!

About the standard, guess you're talking about CMIS. That's not for WCM, only for DMS'. Which means that it doesn't make you life easier.

I love your tips and tricks :).

- Alex

Anonymous said...

Great post . . .
May want to do a few things a little different:
Create blank content items as a first stage and build up content tree.
Then Edit and insert content into each item.
Add a lot of logging.
All of which is mainly to help track down any problamatic areas because these things rarely go smoothly.

In general I would have taken a very similar approach.

Anonymous said...

Excellent post.

Raul

Amar Patel said...

Nice one Mark, I've just quoted this article in an response to RFI that asks exactly the same question. Cheers!

Jack McKrack said...

What if I have a template with several different base templates - how do I populate the fields in those other templates? Also, how do I simply edit existing content items rather than always adding new content items (Store())?

n said...

Error
Could not read Sitecore configuration.
Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.

Exception Details: System.InvalidOperationException: Could not read Sitecore configuration.

Source Error:


Line 68:
Line 69: // Set up a director for "web", standard language and security disabled
Line 70: SCDirector director = new SCDirector("web", "en",true);
Line 71: var suppliersRoot = director.GetObjectByIdentifier("/sitecore/content/home/suppliers");
Line 72: var productsRoot = director.GetObjectByIdentifier("/sitecore/content/home/products");

please help

clinton said...

Excellent post, I'm about to start a migration and this post really helped me get a plan together. Thanks

Sandun Perera said...

Thanks for the post, this strategy is worked for me. I have commented out the events section in the web.config, but the performance was just reduced 27 seconds from the 30 second process for this same example app.

And also I have a live DB which have lots of data, can you suggest a good approach for data migration since the RAM is not enough to hold them all.