blogs.conchango.com

welcome to the conchango blogging site
Welcome to blogs.conchango.com Sign in | Join | Help
in Search

SSIS Junkie

Service Orientation in Oil and Gas (Part 5) - Attempting to solve the Master Data problem

In my last post I explained a little about the Master Data Management (MDM) problem that we have faced on my current project. At its very core building an MDM system is a data modelling exercise and in this post I'm going to talk about the data model iterations that we have gone through in an attempt to solve the problem.

In order to talk about these solutions I want to outline a hypothetical scenario. A scenario that is actually very similar to the real one in which we found ourselves.

 


The Scenario

Oil Wells

We need to store a list of oil wells. We have already decided what the common identifers for the oil wells are going to be, for simplicity's sake here we are calling them WellA, WellB, WellC & WellD. For each oil well we need to store a set of attributes:

  • Well Type, which has a finite set of values {Producer, Injector}
  • Well Status, which also has a finite set of values {Up, Down}

Well Type and Well Status can also be empty

These oil wells exist in a hierarchy (called the 'Production' hierarchy that is based on geographical location and contains States, Fields, Areas and Wells. Likewise we have also already decided what the 'Production' hierarchy looks like:

 

Note that this is a ragged hierarchy - Field #2 is alot smaller than Field #1 so it is not split into areas like Field #1 is.

A State, Field or Area may also have attributes associated with it.

 

Equipment

We also need to store a list of equipment. For simplicity, Equipment can be considered to be pieces of machinery e.g. A pump.

  • Equipment can be moved between different oil wells
  • Equipment can break down
  • Equipment can be replaced

As with Wells, we also need to store attributes of equipment:

  • Date of Manufacture
  • Weight

This is a flat list. There is no hierarchy. For simplicity our list consists of only three pieces of equipment: {PumpA, PumpB, PumpC}.

Notice how we have a many-to-many relationship between wells and equipment - at any given time a Well may have a particular pump fitted to it. It is a requirement that we represent that relationship in our MDM system.

Note that there are many many more entities about which we need to store data. In this scenario for the sake of simplicity I am only modelling two of them, Wells and Equipment.


 

1. The dedicated table model

Our first attempt at a Master Data Management data model was based on PPDM that I discussed previously. (I say our, we actually hired an external data modelling consultant to help us out with this.) In PPDM each entity has its own dedicated table and we implemented the same. In order to store relationships between different entities we used what PPDM calls link tables where each different relationship requires a dedicated link table. Here is the data model that was designed:

 

 

This is a fairly familiar model to anyone with a data modelling background. Each entity and each many-to-many relationship has a dedicated table.

There are a number of advantages to this model:

  • Simplicity. Its very obvious what each table is for.
  • All attributes are strongly typed   (i.e. DateOfManufacture, Weight, WellType & WellStatus are all typed accordingly.)

 

There were also some distinct disadvantages though:

  • It is impossible to model the ragged hierarchy of the wells that I outlined in the scenario above. We got around this by inventing artificial values in order to satisfy the relationships (e.g. We create an artificial Area as a parent of WellC and WellD).
  • Each time we need to store data about a new entity we have to introduce a new table into the model.
  • Each time we need to store a new relationship, that too requires the introduction of a new table into the model.
  • Generic services built on top of this model that are queryable by entity type would be difficult to implement.
  • All attribute columns have to be nullable which means there is the potential for sparsity in our storage model.

The inflexibility of this model, as highlighted by the disadvantages above, was its downfall. We took a decision to kick the external data modeller out and have a go at it ourselves. I was given the task of coming up with something that was more flexible. The second incarnation of our MDM model was what I shall refer to as 'The Hybrid Model'.

 

2. The Hybrid Model

The main problem I wanted to solve was that the dedicated table model didn't allow us to store ragged hierarchies. To combat that I employed a self-referential table, but I still had to solve the problem of how to store attributes of all the entities. The following model shows what I came up with:

 

Some new tables have been introduced here

  • EntityType  - A categorisation of entities that exist in the same hierarchy. In our scenario this table contains 2 records: {Production, Equipment}
  • Entity - All the entities. This table has a foreign key to EntityType. In our scenario this table contains records for {California, Field #1, Field #2, Area #1, Area #2, WellA, WellB, WellC, WellD, PumpA, PumpB, PumpC} with the names of all those entities being stored in Entity.EntityName.
  • EntityCrossReference - Used to store relationships between different entity types. The PK of {SourceEntityID, SourceEntityTypeID, TargetEntityTypeID} ensures that we only store one relationship from entity to an entity of another type. In our scenario we need to hold a relationship between a Well and a piece of Equipment.

 

As I said above the main reason for the changes implemented here was to enable us to store ragged hierarchies and that was achieved through use of Entity.ParentID field. Note that the self-referencing foreign key is from (ParentEntityID,EntityTypeID) to (EntityID,EntityTypeID) in order to ensure entities cannot have a hierarchical relationship to entities in another hierarchy.

 

So, the advantages here:

  • All attributes are strongly-typed
  • Ragged hierarchies can be held
  • We can hold all relationships (in EntityCrossReference) and we no longer need to introduce a new table for each relationship.
  • Generic services on top of this model, queryable by entity type, are alot easier to build due to the generic table Entity.

And the disadvantages:

  • We've lost some of the simplicity
  • Each time we need to store data about a new entity we still have to introduce a new table into the model in order to store the attributes
  • All attribute columns have to be nullable which means there is the potential for sparsity in our storage model.
  • I had to make sure that classes of entity that appear in the same entity type (e.g. Area and Well) didn't share the same EntityID because this would cause a PK violation in table Entity. I solved this by defining a range of values that could be used for EntityID for each different class of entity. In other words I was putting some real-world meaning into the EntityID values whereas this field was supposed to be a meaningless surrogate key. This approach made me uncomfortable.

 

Still, even given the disadvantages we were alot happier with this model than the previous one because of the ability to store ragged hierarchies and the more generic (thus more flexible) manner of storing all of the entities in the same table. Still though we had the problem that if we wanted to store attributes of an entity, it required adding a new table to the model. So back to the design studio again.

 

3. The Generic Model

As I was thinking through how to solve this I had one of those moments of clarity where the pieces of the jigsaw just drop into place. I realised that I already had everything I needed in the hybrid model -in fact I had too much- I was just misusing it. This epiphanous moment came when I realised that a piece of Equipment could actually be considered to be an attribute of a Well. In other words a piece of equipment, whilst being an entity in its own right, could also be an attribute of another entity type as well. I was already holding this relationship in EntityCrossReference so why not store all attributes in the same way. This thinking gave rise to the following model:

 

 Yes, there are only three tables, with the following usage:

  • EntityType - A categorisation of entities that exist in the same hierarchy - just the same as in the hybrid model. Previously for our scenario this contained only 2 entity types {Production, Equipment} but in this new model it contains alot more: {Production, Equipment, WellType, WellStatus, DateOfManufacture, Weight}.
  • Entity - The use of this table hasn't changed from the hybrid model except that it now stores all the values that used to be considered attributes of an entity. In this new model all attibutes are considered to be first class entities in their own right.
  • EntityCrossReference - Stores the relationships between entities. In this respect it hasn't changed from the hybrid model except that now it stores alot more information.

As before, let's review the advantages and disadvantages of this model.

Advantages:

  • Ragged hierarchies can be held
  • We can hold all relationships (in EntityCrossReference) and we no longer need to introduce a new table for each relationship.
  • We can hold all attributes (and all future attributes) without having to introduce new tables.
  • Generic services on top of this model, queryable by entity type, are alot easier to build due to the generic Entity and EntityCrossReference tables.
  • No nullable columns other than items that exist at the top of a hierarchy. If no attribute is available for particular entity then simply we just don't store a relationship for it. This is better than having a database packed with NULL values.

Disadvantages:

  • We have lost strong typing of the entities. We decided that the advantages justified this backwards step.
  • Impact on performance is detrimental

 

 

As there are now only three tables in the model I can show you what's in them and hopefully that will illustrate their usage alot better. Observe:

EntityType     

Entity    

EntityCrossReference 

 

That may look fairly nonsensical but accessing the production hierarchy is now pretty easy using a recursive operation:

    1 --Pull out the Production hierarchy

    2 WITH ProductionCTE AS

    3 (

    4     SELECT    EntityName, CAST(EntityName AS VARCHAR(100)) AS EntityPath,

    5             0 AS HierarchyLevel, EntityID, EntityTypeID

    6     FROM    Entity e

    7     WHERE    EntityTypeID = 1

    8     AND        ParentEntityID IS NULL

    9     UNION    ALL

   10     SELECT    e.EntityName, CAST(c.EntityPath + '\' + e.EntityName AS VARCHAR(100)) AS EntityPath,

   11             c.HierarchyLevel + 1 AS HierarchyLevel, e.EntityID, e.EntityTypeID

   12     FROM    Entity e

   13     INNER    JOIN ProductionCTE c

   14     ON        e.ParentEntityID = c.EntityID

   15     AND        e.EntityTypeID = c.EntityTypeID

   16 )

   17 SELECT    EntityName,EntityPath,HierarchyLevel

   18 FROM    ProductionCTE

 

and that returns us the following:

Hopefully you'll be able to see the similarity between this data and the picture of the hierarchy at the top of this page.

 

Similarly, pulling out an entity and all of its attributes is pretty easy as well. Here's the code to do that, it uses our relational database engine's PIVOT function:

    1 --Show the Well entities and all of the attributes

    2 SELECT *

    3 FROM

    4 (

    5     SELECT e1.EntityName, e2.EntityTypeID AS AttributeName, e2.EntityName AS AttributeValue

    6     FROM    Entity e1

    7     INNER   JOIN EntityCrossReference ec

    8     ON     e1.EntityID = ec.SourceEntityID

    9     AND     e1.EntityTypeID = ec.SourceEntityTypeID

   10     AND     e1.EntityTypeID = 1

   11     INNER   JOIN Entity e2

   12     ON     e2.EntityID = ec.TargetEntityID

   13     AND     e2.EntityTypeID = ec.TargetEntityTypeID

   14 )p

   15 PIVOT

   16 (

   17     MAX(AttributeValue)

   18     FOR AttributeName IN ([3],[4])

   19 )AS pvt

 

And sure enough:

 

OK, so I think that's enough for one post. If you've made it down this far - good going. I should caveat this with saying that the final model that we came up with is slightly more complicated than this in order that we could satisfy requirements for data lineage, auditing and further classification of entities; but essentially this is the basic model and its working well for us.

I have attached a zipped script to this blog post that will create the generic model, populate it with data, and uses the same queries that you can see above to extract the useful information. This should serve to demonstrate some of the ideas that I have talked about above. Note that the script will only work when executed against SQL Server 2005.

 

If you've any questions or comments I would love to hear them. Have you been trying to solve similar problems? How have you solved them? Does your model look anything like this one?

 

-Jamie

UPDATE 2007-03-27: Mick has provided more information about our implementation pattern for MDM here: http://blogs.conchango.com/mickhorne/archive/2007/03/26/Test-Post.aspx

 

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

SSIS Junkie said:

For most of the past 18 months I have been working on a series of projects for one of the world's

February 8, 2007 07:07
 

GParkins said:

Hey Jamie,

Firstly, thanks for laying out your problem so well.

Secondly, thanks for all the great help you've posted on your blog, it has helped me a number of times!

Third, and on to the comment... it seems to me that the only hierarchy present in the problem above is with the regions (State, Field, Area).  Couldn't you just create a self referencing table called say 'Region' to build your hierarchy of regions?  You could provide strongly typed region specifc attributes (eg. Area) in other tables that use the PK of the region table to provide element specific attributes.

Wells (can they move between regions?) could hold a FK to the region in which they currently belong, and their equipment would of course refer to the well they are currently assinged to.  So the model would look more similar to the dedicated model but the addition of the region table to handle the hierarchy.  This way you could handle the hierarchy requirement, get strong typing and keep the model simple.

Sorry if I have missed the point or wasted your time, I only had time for a quick look so I may have missed something.  I can elaborate more if you are interested.  Most importantly I wanted to thank you for your site, good luck!

February 8, 2007 15:21
 

jamie.thomson said:

Hi there G,

Thanks for the comment, you raise some good questions. I'm glad you have done as well because clearly I haven't explained this well enough.

It was a design goal of the generic model that all future master data requirements could be handled by our MDM system without (and this is crucial) having to extend the schema. If we introduce a 'Region' table to store this hierarchical relationship then it logically follows that any future hierarchies may need a dedicated table and we want to avoid that. In fact, your suggestion is akin to the 'Dedicated Table Solution' that I have outlined above - and we didn't like that model because it wasn't generic and therefore wasn't flexible.

So, the short answer is "we want to store it in a generic model that suits all future requirements. Generic means that we never have to extend the schema."

I hope that makes sense. Thanks again for your comment.

-Jamie

February 8, 2007 17:58
 

Anthony.Steele said:

I like the hierarchical queries in the third, "Generic" model.

Similar things, called "Dynamic data", "characteristics", "attributes" etc. are to be found in many database schemas. For instance, look at the way asp.net 2.0 stores user-defined property values on the user profile.

However, if done in excess, it becomes the inner patform antipattern (http://thedailywtf.com/Articles/The_Inner-Platform_Effect.aspx). You may not need to "extend the schema", i.e. make a new table for each relationship or attribute, but you do have to add new metadata in order to add one - you just store it in "EntityType" not "sysobjects".

February 13, 2007 13:19
 

David.Tupper said:

Jamie,

Thanks for the post. This is the first article in my search on MDM that provides some explicit data modelling, rather then just concepts.

EOM.

June 4, 2007 14:23
 

SSIS Junkie said:

If you've been reading my blog over the past year or so you'll have noticed that two things I've started

August 29, 2008 20:25

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Powered by Community Server (Personal Edition), by Telligent Systems