I was speaking to a friend the other day and he mentioned he was working on some metadata analysis. He had built an MS Access database to import the metadata. He found the going quite tricky: the analysis he was performing is called “data lineage,” and it was proving difficult. He also wanted to analyze mappings between fields in the database as well as mappings between lists of values (a list of values is like the set of values you see in a dropdown box on a user interface). All of this seemed like social networking to me.
The way to think about it is that (and I could use John Seely Brown’s Social Life of Information book to back me up here) the data lineage problem is just like a social network. You want to track something from its start to the next hop. The “friend,” in this case, is the place where the data is transported to another system. Hence a “friend” of a piece of metadata must be another metadata item in another system or database table. Data lineage is nothing more than social networking. To me, data lineage would probably generate much simpler networks, but I would guess that there are a lot of grey areas in figuring out all the places that data is moved to or converted to along the way. That’s probably what makes it a much harder problem.
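To make the “friend of a friend” idea concrete, here is a minimal sketch that treats lineage as a walk over hops. The field names and the `LINEAGE` table are invented for illustration, and a breadth-first walk is just one reasonable way to follow the hops, not a claim about how any lineage tool works.

```python
from collections import deque

# Hypothetical lineage edges: each key is a field, and its list holds the
# fields in other systems that the data is copied or converted into.
LINEAGE = {
    "crm.customer.email": ["staging.cust.email_addr"],
    "staging.cust.email_addr": [
        "warehouse.dim_customer.email",
        "marketing.contacts.email",
    ],
    "warehouse.dim_customer.email": ["reports.churn.email"],
}


def downstream(start, edges):
    """Return every field reachable from start, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:  # guard against cycles in the lineage
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


hops = downstream("crm.customer.email", LINEAGE)
```

The first entry in `hops` is the direct “friend”; everything after it is a friend of a friend.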
Naturally, not knowing whether it was possible or not, I mentioned how graph databases could capture most of this data fairly easily and let you run very sophisticated queries. I had not thought about it deeply, but I had been reading up on graphs, probability and statistics, and so on, so it seemed reasonable to me.
Of course, just doing a simple import of metadata into MS Access is fairly straightforward. You define some tables that capture a “table” concept, each with a bunch of relationships to “fields.” This can be modeled in an RDBMS using foreign keys and such. But as you normalize out the other concepts, such as categories of tables, or try to describe different types of tables, such as views or other RDBMS-ish structures, the MS Access approach starts getting a bit daunting.
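As a rough sketch of that relational starting point, here it is in SQLite via Python's standard `sqlite3` module rather than MS Access. The table and column names (`meta_table`, `meta_field`, `kind`) are my own placeholders, not my friend's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript(
    """
    CREATE TABLE meta_table (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        kind TEXT NOT NULL DEFAULT 'table'  -- e.g. 'table' or 'view'
    );
    CREATE TABLE meta_field (
        id       INTEGER PRIMARY KEY,
        table_id INTEGER NOT NULL REFERENCES meta_table(id),
        name     TEXT NOT NULL
    );
    """
)

# One described table with two described fields.
conn.execute("INSERT INTO meta_table (id, name, kind) VALUES (1, 'customer', 'table')")
conn.executemany(
    "INSERT INTO meta_field (table_id, name) VALUES (?, ?)",
    [(1, "id"), (1, "email")],
)

rows = conn.execute(
    "SELECT t.name, f.name FROM meta_table t "
    "JOIN meta_field f ON f.table_id = t.id ORDER BY f.id"
).fetchall()
```

Two tables are easy enough. The trouble starts when every new concept (categories, views, mappings) wants its own table and its own set of foreign keys.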
But my friend wanted to deeply analyze the data and have something that could scale to much harder metadata problems. So I dipped into a Neo4j manual and read some blogs. I then ran across a lot of blogs that described classification through taxonomies, ontologies and other very abstract ways of describing data. This became complicated very quickly, and I realized that I wanted to try to do something small but not necessarily simple. I would need a graph model that was highly compact and could change as requirements changed (my friend said metadata requirements change all the time). And I would sacrifice the ease of a dedicated but highly rigid model for one that was general. I was essentially shifting complexity from the model itself to the processing layer that would sit above the model. But that’s fine if it gave me something with exceptional room to grow.
So after reading the manual and the blogs and thinking about it for another hour, I realized that I could do most of what he wanted using a few very simple concepts:
- A DataItem is a description of a data element or a value in a list of values. A DataItem could be part of multiple categories. We will call these categories DataItemSets.
- A DataItemSet is a collection of DataItems. The sets could have a taxonomy (categories of categories) so that a set could be part of another set. I could not imagine sets of sets of sets, but it seemed that a friend could be a friend of a friend so a set could be a parent of a set.
- DataItemRelationship will connect a set of “From” DataItems to a set of “To” DataItems. The From and To could be 1 to 1 but we wanted to keep it general. These are the edges of “LIKES” or “KNOWS” in the social network.
- DataItemRelationshipSet will be the taxonomy for the relationships, just like a DataItemSet. Unlike many social networks, you may need to classify a relationship with more information than just “LIKES.” Facebook gives you “likes,” but a “like” is fairly general: you do not know how strong that like is for any given pair of nodes. So by having a taxonomy for the relationships, we can have categories of categories or whatever you want to more fully describe the relationship.
That’s it. Just 4 main graph node concepts. We will also need to label each node with the concept that it represents and ensure that it has the right set of properties. So a small amount of “infrastructure” is needed to do this labeling and to match a label to the set of properties that should be available on that node. For example, a DataItem that represents metadata will have different properties than a DataItem that represents a value in a list of values.
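The labeling “infrastructure” could be as small as a lookup from label to required properties. This is only a sketch: the labels (`MetadataItem`, `ListValueItem`) and the property names are assumptions I made up to illustrate the idea, not a settled design.

```python
# Hypothetical labels and the properties a node with that label must carry.
REQUIRED_PROPERTIES = {
    "MetadataItem": {"name", "data_type", "source_system"},
    "ListValueItem": {"code", "display_text"},
}


def missing_properties(label, properties):
    """Return the sorted required properties that a node is missing."""
    if label not in REQUIRED_PROPERTIES:
        raise ValueError(f"unknown label: {label}")
    return sorted(REQUIRED_PROPERTIES[label] - set(properties))


# A metadata node that forgot to say which system it came from.
gaps = missing_properties("MetadataItem", {"name": "email", "data_type": "text"})

# A list-of-values node with everything it needs.
ok = missing_properties("ListValueItem", {"code": "US", "display_text": "United States"})
```

Because the rules live in data rather than in the graph schema, adding a new kind of DataItem means adding one dictionary entry.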
I thought that with these simple concepts we could construct everything that was needed. Since metadata is just data and lists of values are just data, it seemed to me that the graph just conceptually holds data and we can treat both the same in the graph, albeit with different node properties.
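Here is how the four concepts might look as plain Python dataclasses, before any graph database is involved. The class names come from the list above; the attribute names, labels and sample values are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class DataItem:
    label: str        # which concept the node represents
    properties: dict  # label-specific properties


@dataclass
class DataItemSet:
    name: str
    items: list = field(default_factory=list)
    parent: "DataItemSet | None" = None  # a set can belong to another set


@dataclass
class DataItemRelationshipSet:
    name: str
    parent: "DataItemRelationshipSet | None" = None


@dataclass
class DataItemRelationship:
    from_items: list  # one or more "From" DataItems
    to_items: list    # one or more "To" DataItems
    category: DataItemRelationshipSet


def ancestors(node):
    """Walk up a taxonomy from a set to its root, collecting names."""
    names = []
    while node is not None:
        names.append(node.name)
        node = node.parent
    return names


# The relationship taxonomy: a value mapping is a kind of mapping.
mappings = DataItemRelationshipSet("Mapping")
value_mappings = DataItemRelationshipSet("ValueMapping", parent=mappings)

# Metadata fields and list values are both just DataItems.
src_field = DataItem("MetadataItem", {"name": "crm.customer.country"})
dst_field = DataItem("MetadataItem", {"name": "warehouse.dim_customer.country_code"})
src_value = DataItem("ListValueItem", {"code": "USA"})
dst_value = DataItem("ListValueItem", {"code": "US"})
country_codes = DataItemSet("CountryCodes", items=[src_value, dst_value])

relationships = [
    DataItemRelationship([src_field], [dst_field], mappings),
    DataItemRelationship([src_value], [dst_value], value_mappings),
]
```

The field-to-field mapping and the value-to-value mapping go through exactly the same structure. Only the labels and properties differ.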
I’ll give it a whirl and report back by trying a very small experiment to see if this design is totally impractical to implement or if it really shines. I’ll also try to hook it up to Cytoscape for visualization. However, it’s clear, just like with MS Access, that if you want a solution quickly, you should just go buy a Global ID-type product.