How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale | High Scalability: "From one world view comments logically belong to a relation binding comments and users together. But if your unit of scalability is the user shard there is no separate relation space. So you go against all your training and decide to duplicate the comments. Nerd heroism at its best. Let inductive rules derived from observation guide you rather than deductions from arbitrarily chosen first principles. Very Enlightenment era thinking. Voltaire would be proud.
In a relational world duplication is removed in order to prevent update anomalies. Error prevention is the driving force in relational modeling. Normalization is a kind of ethical system for data. What happens, for example, if a comment changes? Both copies of the comment must be updated. That leads to errors because who can remember where all the data is stored? A severe ethical violation may happen. Go directly to relational jail :-)"
(Via High Scalability.)
Now this is an interesting one, since it closely resembles several discussions I had a long time ago while trying to explain directory services to database folks.
Most LDAP-based directory services are built on exactly the same kind of scaled-up model that high-traffic web sites deal with. The bulk of the transaction load is read-only, with only a very small portion of the I/O being updates. And in this case you don't even have the ability to do any kind of join. Given this structure, it makes a lot more sense to group as much related data as possible into each individual object, reducing the query load to a single object or a small subset of objects that you then traverse as required. Given the high cost of building a connection and issuing multiple queries, trying to normalize a directory is counterproductive: you end up making a ton of queries to assemble all of the necessary data regarding an object.
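To make the query-count argument concrete, here is a minimal sketch in Python. It does not talk to a real directory: `CountingStore`, the entry names, and the attribute names such as `deptName` are all made up for illustration, and each `get` call stands in for one network round trip.

```python
from dataclasses import dataclass


@dataclass
class CountingStore:
    """Toy in-memory store; each get() stands in for one round trip."""
    entries: dict
    lookups: int = 0

    def get(self, key):
        self.lookups += 1
        return self.entries[key]


# Normalized layout: the person entry only points at other entries.
normalized = CountingStore({
    "uid=alice": {"cn": "Alice", "dept": "ou=eng", "manager": "uid=bob"},
    "ou=eng": {"name": "Engineering"},
    "uid=bob": {"cn": "Bob"},
})

# Denormalized layout: what a reader needs is copied onto the person entry.
denormalized = CountingStore({
    "uid=alice": {"cn": "Alice", "deptName": "Engineering", "managerName": "Bob"},
})


def profile_normalized(store, uid):
    # Without joins, every reference costs another lookup.
    person = store.get(uid)
    dept = store.get(person["dept"])
    manager = store.get(person["manager"])
    return {"name": person["cn"], "dept": dept["name"], "manager": manager["cn"]}


def profile_denormalized(store, uid):
    # One lookup returns everything needed to build the profile.
    person = store.get(uid)
    return {"name": person["cn"], "dept": person["deptName"],
            "manager": person["managerName"]}


a = profile_normalized(normalized, "uid=alice")
b = profile_denormalized(denormalized, "uid=alice")
```

Both functions return the same profile, but the normalized version needs three lookups where the denormalized one needs a single lookup. The price is that the department and manager names now live in more than one place.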
So the question becomes: where do you put the update intelligence, in the database schema or in the application? Years of hard-won experience have always pointed us to the database, but we're now moving into an era where some applications, like Flickr, are no longer scalable in the traditional manner.
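As a rough illustration of what "update intelligence in the application" might look like, the sketch below keeps a copy of each comment in the shard of every user involved, and makes the application responsible for remembering where the copies live. The `CommentStore` class and its methods are hypothetical; this is not a description of how Flickr or any other site actually does it.

```python
from collections import defaultdict


class CommentStore:
    """Toy model of per-user shards that each hold their own copy of a comment."""

    def __init__(self):
        self.shards = defaultdict(dict)     # user_id -> {comment_id: text}
        self.locations = defaultdict(set)   # comment_id -> users holding a copy

    def add(self, comment_id, text, author, target):
        # Duplicate the comment into both the author's and the target's shard.
        for user in (author, target):
            self.shards[user][comment_id] = text
            self.locations[comment_id].add(user)

    def update(self, comment_id, text):
        # The application, not the schema, fans the change out to every copy.
        holders = self.locations.get(comment_id, set())
        for user in holders:
            self.shards[user][comment_id] = text
        return len(holders)


store = CommentStore()
store.add("c1", "Nice photo", author="alice", target="bob")
copies_updated = store.update("c1", "Nice photo!")
```

The `locations` index is the answer to "who can remember where all the data is stored?": the application has to, and a bug there produces exactly the update anomaly that normalization was meant to prevent.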
Also, X.500 and LDAP are the kinds of data structures that map directly to an object-oriented environment. There are no conversion phases to go through, since directory entries are presented in the same manner as objects: each follows a schema containing obligatory and optional attributes.
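A small sketch of that mapping, using the standard LDAP `person` object class, whose obligatory attributes are `cn` and `sn` and whose optional ones include `telephoneNumber` and `description`. The `Person` dataclass and `person_from_entry` helper are illustrative names, and the entry is a plain dict of attribute name to list of values rather than the result of a real directory query.

```python
from dataclasses import dataclass
from typing import Optional

MUST = ("cn", "sn")                        # obligatory attributes
MAY = ("telephoneNumber", "description")   # a subset of the optional ones


@dataclass
class Person:
    cn: str
    sn: str
    telephoneNumber: Optional[str] = None
    description: Optional[str] = None


def person_from_entry(entry):
    """Build a Person from a dict of attribute -> list of values."""
    missing = [attr for attr in MUST if not entry.get(attr)]
    if missing:
        raise ValueError(f"missing obligatory attributes: {missing}")
    # Simplification: take only the first value of each attribute.
    values = {attr: entry[attr][0] for attr in MUST}
    values.update({attr: entry[attr][0] for attr in MAY if entry.get(attr)})
    return Person(**values)


alice = person_from_entry({
    "cn": ["Alice Smith"],
    "sn": ["Smith"],
    "telephoneNumber": ["+1 555 0100"],
})
```

Real LDAP attributes can be multi-valued, so taking only the first value is a simplification made here for brevity.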
Using LDAP as a generic datastore is an exercise in frustration once you go past a certain size, simply because the commercial tools available for the servers are not playing in this space in the same way that database tools are. It's not necessarily a purely technical limitation; it has more to do with the fact that there hasn't been any serious market demand for massively scalable directories of this type. The most massively scalable directory service out there is DNS, and it gets around these limitations by being a distributed solution, something that is awfully hard to implement well in LDAP and X.500 without ending up having to do a ton of traversal across directory instances.
It's worth revisiting some of the implicit wisdom of database architectures when building big systems, since the relational method is not necessarily scalable past a certain point or in certain architectures.