In my last post I explained a little about the Master Data Management (MDM) problem that we have faced on my current project. At its very core building an MDM system is a data modelling exercise and in this post I'm going to talk about the data model iterations that we have gone through in an attempt to solve the problem.
In order to talk about these solutions I want to outline a hypothetical scenario. A scenario that is actually very similar to the real one in which we found ourselves.
The Scenario
Oil Wells
We need to store a list of oil wells. We have already decided what the common identifers for the oil wells are going to be, for simplicity's sake here we are calling them WellA, WellB, WellC & WellD. For each oil well we need to store a set of attributes:
- Well Type, which has a finite set of values {Producer, Injector}
- Well Status, which also has a finite set of values {Up, Down}
Well Type and Well Status can also be empty
These oil wells exist in a hierarchy (called the 'Production' hierarchy that is based on geographical location and contains States, Fields, Areas and Wells. Likewise we have also already decided what the 'Production' hierarchy looks like:
Note that this is a ragged hierarchy - Field #2 is alot smaller than Field #1 so it is not split into areas like Field #1 is.
A State, Field or Area may also have attributes associated with it.
Equipment
We also need to store a list of equipment. For simplicity, Equipment can be considered to be pieces of machinery e.g. A pump.
- Equipment can be moved between different oil wells
- Equipment can break down
- Equipment can be replaced
As with Wells, we also need to store attributes of equipment:
- Date of Manufacture
- Weight
This is a flat list. There is no hierarchy. For simplicity our list consists of only three pieces of equipment: {PumpA, PumpB, PumpC}.
Notice how we have a many-to-many relationship between wells and equipment - at any given time a Well may have a particular pump fitted to it. It is a requirement that we represent that relationship in our MDM system.
Note that there are many many more entities about which we need to store data. In this scenario for the sake of simplicity I am only modelling two of them, Wells and Equipment.
1. The dedicated table model
Our first attempt at a Master Data Management data model was based on PPDM that I discussed previously. (I say our, we actually hired an external data modelling consultant to help us out with this.) In PPDM each entity has its own dedicated table and we implemented the same. In order to store relationships between different entities we used what PPDM calls link tables where each different relationship requires a dedicated link table. Here is the data model that was designed:

This is a fairly familiar model to anyone with a data modelling background. Each entity and each many-to-many relationship has a dedicated table.
There are a number of advantages to this model:
- Simplicity. Its very obvious what each table is for.
- All attributes are strongly typed (i.e. DateOfManufacture, Weight, WellType & WellStatus are all typed accordingly.)
There were also some distinct disadvantages though:
- It is impossible to model the ragged hierarchy of the wells that I outlined in the scenario above. We got around this by inventing artificial values in order to satisfy the relationships (e.g. We create an artificial Area as a parent of WellC and WellD).
- Each time we need to store data about a new entity we have to introduce a new table into the model.
- Each time we need to store a new relationship, that too requires the introduction of a new table into the model.
- Generic services built on top of this model that are queryable by entity type would be difficult to implement.
- All attribute columns have to be nullable which means there is the potential for sparsity in our storage model.
The inflexibility of this model, as highlighted by the disadvantages above, was its downfall. We took a decision to kick the external data modeller out and have a go at it ourselves. I was given the task of coming up with something that was more flexible. The second incarnation of our MDM model was what I shall refer to as 'The Hybrid Model'.
2. The Hybrid Model
The main problem I wanted to solve was that the dedicated table model didn't allow us to store ragged hierarchies. To combat that I employed a self-referential table, but I still had to solve the problem of how to store attributes of all the entities. The following model shows what I came up with:

Some new tables have been introduced here
- EntityType - A categorisation of entities that exist in the same hierarchy. In our scenario this table contains 2 records: {Production, Equipment}
- Entity - All the entities. This table has a foreign key to EntityType. In our scenario this table contains records for {California, Field #1, Field #2, Area #1, Area #2, WellA, WellB, WellC, WellD, PumpA, PumpB, PumpC} with the names of all those entities being stored in Entity.EntityName.
- EntityCrossReference - Used to store relationships between different entity types. The PK of {SourceEntityID, SourceEntityTypeID, TargetEntityTypeID} ensures that we only store one relationship from entity to an entity of another type. In our scenario we need to hold a relationship between a Well and a piece of Equipment.
As I said above the main reason for the changes implemented here was to enable us to store ragged hierarchies and that was achieved through use of Entity.ParentID field. Note that the self-referencing foreign key is from (ParentEntityID,EntityTypeID) to (EntityID,EntityTypeID) in order to ensure entities cannot have a hierarchical relationship to entities in another hierarchy.
So, the advantages here:
- All attributes are strongly-typed
- Ragged hierarchies can be held
- We can hold all relationships (in EntityCrossReference) and we no longer need to introduce a new table for each relationship.
- Generic services on top of this model, queryable by entity type, are alot easier to build due to the generic table Entity.
And the disadvantages:
- We've lost some of the simplicity
- Each time we need to store data about a new entity we still have to introduce a new table into the model in order to store the attributes
- All attribute columns have to be nullable which means there is the potential for sparsity in our storage model.
- I had to make sure that classes of entity that appear in the same entity type (e.g. Area and Well) didn't share the same EntityID because this would cause a PK violation in table Entity. I solved this by defining a range of values that could be used for EntityID for each different class of entity. In other words I was putting some real-world meaning into the EntityID values whereas this field was supposed to be a meaningless surrogate key. This approach made me uncomfortable.
Still, even given the disadvantages we were alot happier with this model than the previous one because of the ability to store ragged hierarchies and the more generic (thus more flexible) manner of storing all of the entities in the same table. Still though we had the problem that if we wanted to store attributes of an entity, it required adding a new table to the model. So back to the design studio again.
3. The Generic Model
As I was thinking through how to solve this I had one of those moments of clarity where the pieces of the jigsaw just drop into place. I realised that I already had everything I needed in the hybrid model -in fact I had too much- I was just misusing it. This epiphanous moment came when I realised that a piece of Equipment could actually be considered to be an attribute of a Well. In other words a piece of equipment, whilst being an entity in its own right, could also be an attribute of another entity type as well. I was already holding this relationship in EntityCrossReference so why not store all attributes in the same way. This thinking gave rise to the following model:

Yes, there are only three tables, with the following usage:
- EntityType - A categorisation of entities that exist in the same hierarchy - just the same as in the hybrid model. Previously for our scenario this contained only 2 entity types {Production, Equipment} but in this new model it contains alot more: {Production, Equipment, WellType, WellStatus, DateOfManufacture, Weight}.
- Entity - The use of this table hasn't changed from the hybrid model except that it now stores all the values that used to be considered attributes of an entity. In this new model all attibutes are considered to be first class entities in their own right.
- EntityCrossReference - Stores the relationships between entities. In this respect it hasn't changed from the hybrid model except that now it stores alot more information.
As before, let's review the advantages and disadvantages of this model.
Advantages:
- Ragged hierarchies can be held
- We can hold all relationships (in EntityCrossReference) and we no longer need to introduce a new table for each relationship.
- We can hold all attributes (and all future attributes) without having to introduce new tables.
- Generic services on top of this model, queryable by entity type, are alot easier to build due to the generic Entity and EntityCrossReference tables.
- No nullable columns other than items that exist at the top of a hierarchy. If no attribute is available for particular entity then simply we just don't store a relationship for it. This is better than having a database packed with NULL values.
Disadvantages:
- We have lost strong typing of the entities. We decided that the advantages justified this backwards step.
- Impact on performance is detrimental
As there are now only three tables in the model I can show you what's in them and hopefully that will illustrate their usage alot better. Observe:
EntityType
Entity
EntityCrossReference 
That may look fairly nonsensical but accessing the production hierarchy is now pretty easy using a recursive operation:
1 --Pull out the Production hierarchy
2 WITH ProductionCTE AS
3 (
4 SELECT EntityName, CAST(EntityName AS VARCHAR(100)) AS EntityPath,
5 0 AS HierarchyLevel, EntityID, EntityTypeID
6 FROM Entity e
7 WHERE EntityTypeID = 1
8 AND ParentEntityID IS NULL
9 UNION ALL
10 SELECT e.EntityName, CAST(c.EntityPath + '\' + e.EntityName AS VARCHAR(100)) AS EntityPath,
11 c.HierarchyLevel + 1 AS HierarchyLevel, e.EntityID, e.EntityTypeID
12 FROM Entity e
13 INNER JOIN ProductionCTE c
14 ON e.ParentEntityID = c.EntityID
15 AND e.EntityTypeID = c.EntityTypeID
16 )
17 SELECT EntityName,EntityPath,HierarchyLevel
18 FROM ProductionCTE
and that returns us the following:

Hopefully you'll be able to see the similarity between this data and the picture of the hierarchy at the top of this page.
Similarly, pulling out an entity and all of its attributes is pretty easy as well. Here's the code to do that, it uses our relational database engine's PIVOT function:
1 --Show the Well entities and all of the attributes
2 SELECT *
3 FROM
4 (
5 SELECT e1.EntityName, e2.EntityTypeID AS AttributeName, e2.EntityName AS AttributeValue
6 FROM Entity e1
7 INNER JOIN EntityCrossReference ec
8 ON e1.EntityID = ec.SourceEntityID
9 AND e1.EntityTypeID = ec.SourceEntityTypeID
10 AND e1.EntityTypeID = 1
11 INNER JOIN Entity e2
12 ON e2.EntityID = ec.TargetEntityID
13 AND e2.EntityTypeID = ec.TargetEntityTypeID
14 )p
15 PIVOT
16 (
17 MAX(AttributeValue)
18 FOR AttributeName IN ([3],[4])
19 )AS pvt
And sure enough:

OK, so I think that's enough for one post. If you've made it down this far - good going. I should caveat this with saying that the final model that we came up with is slightly more complicated than this in order that we could satisfy requirements for data lineage, auditing and further classification of entities; but essentially this is the basic model and its working well for us.
I have attached a zipped script to this blog post that will create the generic model, populate it with data, and uses the same queries that you can see above to extract the useful information. This should serve to demonstrate some of the ideas that I have talked about above. Note that the script will only work when executed against SQL Server 2005.
If you've any questions or comments I would love to hear them. Have you been trying to solve similar problems? How have you solved them? Does your model look anything like this one?
-Jamie
UPDATE 2007-03-27: Mick has provided more information about our implementation pattern for MDM here: http://blogs.conchango.com/mickhorne/archive/2007/03/26/Test-Post.aspx