With Sitecore 9 came a new way of storing contact data, called XDB. XDB, short for the Experience Database, stores all the data related to contacts, such as the contacts themselves, their interactions and the related facets.
To support a large intake and retention of data, Sitecore chose to use sharded databases, which essentially means that the data is divided over several databases. Sitecore uses Microsoft's Elastic Database tools to accomplish this.
When everything is working, there’s no need for a deep understanding of the internals of XDB. When something isn’t working properly, however, it’s nice to have a clear view of what is happening internally. In this blog post I’m going to give you a high-level explanation of what sharding is, how data is divided between the databases, what the tables do, and how contacts are resolved.
Sharding
As mentioned above, the XDB databases are sharded databases, where the complete dataset is divided over several databases (shards), based on a sharding key. When you use a default installation of Sitecore, you’ll get two shard databases (Xdb.Collection.Shard0 and Xdb.Collection.Shard1) and a Shard Map Manager (Xdb.Collection.ShardMapManager).
The Shard Map Manager holds the information about which shard a record lives in, based on the sharding key of the record. This information is kept in Shard Maps. There are three Shard Maps: ContactIdentifiersIndex, ContactId and DeviceProfileId. Each of these maps has one or more Shard Mappings, which determine in which shard a record should be placed.
A sample Shard Mapping can look as follows (some columns are left out):
MappingId | ShardId | ShardMapId | MinValue | MaxValue |
EAC25FCB-EEC1-4AF4-8E70-0BBAB2819D58 | 7DAC49A9-D955-45C9-8063-A1F9A2EDA50A | B903A963-19DE-4A03-B385-05A9C66150C9 | 0x | 0x80 |
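As the example above shows, the Shard Map with the given ShardMapId stores records whose ShardingKey lies between 0x and 0x80 in the shard that corresponds to the ShardId. In Elastic Database range mappings, MinValue is inclusive and MaxValue is exclusive, so this mapping covers the keys 0x00 through 0x7F.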
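To make the range check concrete, here is a minimal C# sketch of how a one-byte ShardingKey could be matched against range mappings. The ShardRange type, the FindShard method and the second mapping (0x80 and up, pointing at Shard1) are assumptions made for illustration; they are not Sitecore or Elastic Database types. The sketch also assumes that the lower bound is inclusive and the upper bound exclusive.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory model of one Shard Mapping row; not a Sitecore type.
public sealed class ShardRange
{
    public ShardRange(string shardName, int minInclusive, int maxExclusive)
    {
        ShardName = shardName;
        MinInclusive = minInclusive;
        MaxExclusive = maxExclusive;
    }

    public string ShardName { get; }
    public int MinInclusive { get; }  // a MinValue of 0x corresponds to 0 here
    public int MaxExclusive { get; }  // 0x100 stands for "no upper bound" on a one-byte key
}

public static class ShardRangeLookup
{
    // Assumed default layout: two mappings that split the one-byte key space in half.
    public static readonly IReadOnlyList<ShardRange> DefaultRanges = new[]
    {
        new ShardRange("Xdb.Collection.Shard0", 0x00, 0x80),
        new ShardRange("Xdb.Collection.Shard1", 0x80, 0x100),
    };

    // Returns the name of the shard whose range contains the sharding key.
    public static string FindShard(IEnumerable<ShardRange> ranges, byte shardingKey)
    {
        foreach (ShardRange range in ranges)
        {
            if (shardingKey >= range.MinInclusive && shardingKey < range.MaxExclusive)
            {
                return range.ShardName;
            }
        }
        throw new InvalidOperationException("No mapping covers sharding key " + shardingKey);
    }
}
```

With these assumed ranges, a ShardingKey of 0x7F resolves to Xdb.Collection.Shard0 and a ShardingKey of 0x80 resolves to Xdb.Collection.Shard1.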
Shard Maps
ContactIdentifiersIndex Shard Map
The ContactIdentifiersIndex Shard Map manages one table:
- ContactIdentifiersIndex.
The value used as ShardingKey for this table is the Identifier.
ContactId Shard Map
The ContactId Shard Map manages several tables:
- Contacts
- ContactFacets
- ContactIdentifiers
- Interactions
- InteractionFacets
The value used as ShardingKey for these tables is the ContactId.
DeviceProfileId Shard Map
The DeviceProfileId Shard Map manages several tables:
- DeviceProfiles
- DeviceProfileFacets
The value used as ShardingKey for these tables is the DeviceProfileId.
Sharding keys
As mentioned several times above, the ShardingKey, in combination with the Shard Map, determines where a record is stored. The ShardingKey isn’t the ID of the record itself, but is derived from it: Sitecore hashes the ID with a 64-bit FNV hash function and takes the most significant byte of the result as the sharding key. Strictly speaking, the code below XORs each byte into the hash before multiplying, which makes it the FNV-1a variant.
The following C# code calculates the ShardingKey for a GUID ID.
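using System;
using System.Collections.Generic;

public static class ShardKeyHelper
{
    // Returns the one-byte sharding key for a GUID ID.
    public static byte[] GetShardKey(Guid id)
    {
        return GetHashedKey(id.ToByteArray());
    }

    // Hashes the bytes and keeps the most significant byte of the 64-bit result.
    public static byte[] GetHashedKey(byte[] bytes)
    {
        return new[] { (byte)(Hash(bytes) >> 56) };
    }

    // 64-bit FNV hash: XOR each byte into the hash, then multiply by the FNV prime.
    private static ulong Hash(IEnumerable<byte> bytes)
    {
        ulong hash = 14695981039346656037UL; // FNV offset basis
        foreach (byte b in bytes)
        {
            hash ^= b;
            hash *= 1099511628211UL; // FNV prime
        }
        return hash;
    }
}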
Retrieving a contact in Sitecore
There are two ways that a contact can be retrieved from within Sitecore. This can either be done by using the ContactId, or by using the Identifier of the contact. Both options use their own logic to retrieve the contact.
ContactId
When a contact is requested using the ContactId, a straightforward lookup is done:
- The ShardingKey is calculated based on the ContactId.
- Sitecore consults the ShardMapManager to determine which shard the ShardingKey belongs to.
- Sitecore retrieves the contact from the Contacts table in the correct shard database.
Identifier
When a contact is requested by their identifier, the flow is somewhat different.
- The ShardingKey is calculated based on the Identifier.
- Sitecore consults the ShardMapManager to determine which shard the ShardingKey belongs to.
- Sitecore retrieves the ContactId from the ContactIdentifiersIndex table in the correct shard database.
- The ShardingKey is calculated based on the retrieved ContactId.
- Sitecore consults the ShardMapManager to determine which shard the ShardingKey belongs to.
- Sitecore retrieves the contact from the Contacts table in the correct shard database.
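Both lookup flows can be sketched in C# as follows. This is an in-memory illustration only: the dictionaries stand in for the Contacts and ContactIdentifiersIndex tables, the two-way split at 0x80 is an assumed default mapping, and ContactLookupSketch and its members are hypothetical names. How Sitecore turns an identifier into bytes before hashing is not covered in this post; hashing the UTF-8 bytes of the identifier is an assumption made here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical in-memory stand-in for the sharded collection databases.
public sealed class ContactLookupSketch
{
    // Index 0 and 1 stand in for Shard0 and Shard1.
    private readonly Dictionary<Guid, string>[] contacts =
        { new Dictionary<Guid, string>(), new Dictionary<Guid, string>() };
    private readonly Dictionary<string, Guid>[] identifiersIndex =
        { new Dictionary<string, Guid>(), new Dictionary<string, Guid>() };

    // Same hash as in ShardKeyHelper: keep the top byte of a 64-bit FNV hash.
    public static byte ShardKey(byte[] bytes)
    {
        ulong hash = 14695981039346656037UL;
        foreach (byte b in bytes)
        {
            hash ^= b;
            hash *= 1099511628211UL;
        }
        return (byte)(hash >> 56);
    }

    // Assumed mapping: keys below 0x80 go to shard 0, all others to shard 1.
    public static int ShardFor(byte shardKey) => shardKey < 0x80 ? 0 : 1;

    private static int ShardForContact(Guid contactId) =>
        ShardFor(ShardKey(contactId.ToByteArray()));

    // Assumption: the identifier is hashed as UTF-8 bytes.
    private static int ShardForIdentifier(string identifier) =>
        ShardFor(ShardKey(Encoding.UTF8.GetBytes(identifier)));

    // Stores the contact and its index entry, each in the shard its own key maps to.
    public void Add(Guid contactId, string identifier, string contactData)
    {
        contacts[ShardForContact(contactId)][contactId] = contactData;
        identifiersIndex[ShardForIdentifier(identifier)][identifier] = contactId;
    }

    // Flow 1: ContactId -> ShardingKey -> shard -> Contacts.
    public string GetByContactId(Guid contactId)
    {
        return contacts[ShardForContact(contactId)].TryGetValue(contactId, out string data)
            ? data
            : null;
    }

    // Flow 2: Identifier -> ShardingKey -> shard -> ContactIdentifiersIndex -> ContactId -> flow 1.
    public string GetByIdentifier(string identifier)
    {
        Dictionary<string, Guid> index = identifiersIndex[ShardForIdentifier(identifier)];
        return index.TryGetValue(identifier, out Guid contactId)
            ? GetByContactId(contactId)
            : null;
    }
}
```

Note that a lookup by identifier may touch two different shards: one for the index entry and one for the contact itself, because the two ShardingKeys are calculated from different values.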
Tables of interest
Most of the tables in the XDB databases speak for themselves, but there are two tables for which I’d like to share some information.
ContactIdentifiersIndex
The ContactIdentifiersIndex table contains almost the same data as the ContactIdentifiers table, but the data in this table is sharded based on the Identifier instead of the ContactId. This table is used to look up a ContactId based on the Identifier.
The Identifiers are stored as varbinary data, which can be made readable by converting them to varchar (please note that this will not work properly for special characters).
SELECT TOP (1000)
    CONVERT(varchar(255), [Identifier]),
    [IdentifierHash],
    [Source],
    [ContactId],
    [LockTime],
    [Version]
FROM [xdb_collection].[ContactIdentifiersIndex]
ContactIdentifiers
The ContactIdentifiers table stores all the identifiers of a given contact and is sharded based on the ContactId. As described above, the identifiers are stored as varbinary and can be made readable by converting them to varchar.