One of the things that interests me almost as much as the technical deep dives on how things work is the introductory material, where the presenters try to put their product in context, as it gives some very useful insight into the thinking of the product designer.
One of the first things that jumped out at me was the way that David Flynn expressed the evolution of storage by defining the locality of the filesystem. I have a number of seminars where I address the topic in a similar but different way, since I tend to be preoccupied with the lower levels of the storage stack, where the filesystem is just the last layer that goes on top.
But from the Hammerspace point of view, the most important thing about the filesystem is that it's the metadata expression of the files that exist in a particular context.
It's all about the metadata
Fundamentally, the underlying architecture of Hammerspace is based on pNFS, which is designed to separate the shared file metadata from the file access path. The result is that a namespace or a declared shared mount point may in fact be physically distributed across different back-end NFS servers. And now that the mount point is pointing at the metadata service rather than a specific physical NAS, we can migrate (or duplicate!) the actual files across different physical systems transparently to the client while they are being used. There's a great article by Justin Parisi on NetApp's use of pNFS to give you some ideas of the potential that this unlocks (using an open protocol!).
For the Windows folks out there, imagine DFS-N, but the granularity of definition is at the file level rather than at the share level. You connect to a share and the files that are served to you are chosen based on their relative proximity on the network.
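To make the split between metadata and data path a bit more concrete, here's a toy model in Python. It is only a sketch of the idea, not how pNFS or Hammerspace is actually implemented: the `MetadataService` class, its method names, the server names and the distance numbers are all invented for illustration. The client only ever knows the path, and the metadata service decides which physical copy it should read from.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Metadata for one file: the path clients see plus where copies live."""
    path: str
    locations: set = field(default_factory=set)  # names of data servers


class MetadataService:
    """Toy stand-in for a pNFS-style metadata server (hypothetical API)."""

    def __init__(self, distances):
        self.files = {}
        # (client_site, data_server) -> network distance; invented numbers.
        self.distances = distances

    def add_file(self, path, server):
        self.files[path] = FileRecord(path, {server})

    def add_replica(self, path, server):
        self.files[path].locations.add(server)

    def remove_replica(self, path, server):
        record = self.files[path]
        if record.locations == {server}:
            raise ValueError("refusing to remove the last copy")
        record.locations.discard(server)

    def layout(self, path, client_site):
        """Return the data server holding the copy closest to the client."""
        record = self.files[path]
        return min(record.locations,
                   key=lambda server: self.distances[(client_site, server)])


# Hypothetical two-site setup.
mds = MetadataService({
    ("paris", "nas-paris"): 1, ("paris", "nas-nyc"): 80,
    ("nyc", "nas-paris"): 80, ("nyc", "nas-nyc"): 1,
})
path = "/projects/report.docx"
mds.add_file(path, "nas-nyc")
print(mds.layout(path, "paris"))   # nas-nyc (the only copy)
mds.add_replica(path, "nas-paris")
print(mds.layout(path, "paris"))   # nas-paris (a closer copy now exists)
mds.remove_replica(path, "nas-nyc")
print(mds.layout(path, "nyc"))     # nas-paris (migrated, same path)
```

The point of the sketch is that the path never changes while copies are added and removed underneath it, which is what makes the migration invisible to the client.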
Where this starts getting really interesting is when you have a pNFS-compliant metadata service that is not limited to the basic file semantics of a traditional filesystem (ACLs, size, location, URI, state, etc.) but offers an arbitrarily extensible metadata set that can include structured key/value pairs or simple tags. It's also really nice that NFSv4 includes performance and access telemetry in the metadata, so your metadata service knows who was accessing a file, from where, producing how many IOPS and consuming how much bandwidth.
But it's the arbitrarily extensible part of this mix that makes it really useful. Imagine a scenario where the system contains back-end placement rules that require every file from a given share to be duplicated to a host that automatically runs some kind of scanning or ML algorithm on all new or modified files that arrive. This is duplication in the technical sense of having a copy of the data on another physical system; from a metadata perspective it is still the same file, just with an additional access path. The scanning tool can then take the results of its scan and add this information to the metadata associated with the file via an API call to the Hammerspace metadata service.
The example given in the presentation was the (now classic) image recognition routine that adds a tag stating that this photo contains a dog (or a cat, or an emu, or whatever). Once this tag has been filled in, the back-end data mover notices and removes the duplicate instance from the server. More useful from a business perspective would be things like scanning for PII and tagging files appropriately so that proper data governance rules will be applied. No more nightly scans of file servers.
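The duplicate, scan, tag, clean-up cycle described above can be sketched in a few lines of Python. This is my own illustration of the workflow, not the Hammerspace API: `FileEntry`, the three functions, the `scanner-01` host and the `contains` tag key are all hypothetical, and the "classifier" is a stand-in lambda rather than a real model.

```python
from dataclasses import dataclass, field

SCANNER_HOST = "scanner-01"  # hypothetical host that runs the scan
RESULT_TAG = "contains"      # hypothetical metadata key for the result


@dataclass
class FileEntry:
    """One file as the metadata service sees it."""
    path: str
    share: str
    locations: set = field(default_factory=set)  # hosts holding a copy
    tags: dict = field(default_factory=dict)     # extensible metadata


def apply_placement_rule(entry, scanned_share):
    """Data mover, step 1: untagged files on the share get a scanner copy."""
    if entry.share == scanned_share and RESULT_TAG not in entry.tags:
        entry.locations.add(SCANNER_HOST)


def scan(entry, classify):
    """Scanner: analyse the local copy and write the result back as a tag."""
    if SCANNER_HOST in entry.locations and RESULT_TAG not in entry.tags:
        entry.tags[RESULT_TAG] = classify(entry.path)


def reclaim_scanned_copy(entry):
    """Data mover, step 2: once the tag exists, drop the scanner's copy."""
    if RESULT_TAG in entry.tags and len(entry.locations) > 1:
        entry.locations.discard(SCANNER_HOST)


photo = FileEntry("/photos/rex.jpg", share="photos", locations={"nas-01"})
apply_placement_rule(photo, scanned_share="photos")
scan(photo, classify=lambda path: "dog")  # stand-in for a real model
reclaim_scanned_copy(photo)
print(photo.locations, photo.tags)  # {'nas-01'} {'contains': 'dog'}
```

In this sketch the tag itself is what drives the data mover: its absence triggers the copy and its presence triggers the clean-up. Handling modified files would additionally require clearing the tag on write, which I've left out.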
Of course, while pNFS is natively available on all modern Linux distributions, this doesn't solve the problem of presenting files and shares to Windows machines, for two reasons. First off, the user community is used to mounting SMB shares and understands the workflow around this. Second, and more importantly, while Windows has had an NFS 4 compatible client for a while now, to the best of my knowledge it doesn't yet support pNFS (granted, I haven't checked on Windows Server 2019 yet).
So Hammerspace provides additional data services that can present the contents of the pNFS shares over SMB so regular desktop users can access files stored in the Hammerspace world.
If some of this stuff sounds familiar, particularly the data mover stuff, that's coming from their history with Primary Data. Hammerspace is not coming to the table with a completely brand-new software stack, but with one whose core pNFS and data management features have already been thoroughly tested. In fact, much of the Hammerspace back end is the evolution of the IP acquired from Primary Data, along with much of the core team, redirected towards the market of data management as opposed to the market of infrastructure management.
I think that they have a fascinating opportunity here in moving the management of unstructured data from the scripting world into a policy-based world that can take into account data locality for performance reasons (is there a copy of this file sufficiently close to the consumer?), application uses (move all of the ingested data into a new data lake for analysis) and governance (keep this type of data on systems within the EU).
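To show what I mean by policy rather than scripts, here's a rough Python sketch that combines a governance rule with a locality rule. None of this is Hammerspace syntax: the `eu-only` tag, the `Site` and `FileMeta` types and the function names are assumptions made up for the example. The idea is that you declare where data is allowed and wanted, and a planner works out the moves.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Site:
    name: str
    region: str  # e.g. "EU" or "US"


@dataclass
class FileMeta:
    path: str
    tags: set = field(default_factory=set)
    consumers: set = field(default_factory=set)  # sites where it is read


def allowed_sites(meta, sites):
    """Governance: files tagged 'eu-only' may live only on EU sites."""
    if "eu-only" in meta.tags:
        return [s for s in sites if s.region == "EU"]
    return list(sites)


def desired_placement(meta, sites):
    """Locality: keep a copy at every permitted site that has consumers."""
    permitted = allowed_sites(meta, sites)
    wanted = {s.name for s in permitted if s.name in meta.consumers}
    if not wanted and permitted:
        wanted = {permitted[0].name}  # fall back to any permitted site
    return wanted


def plan_moves(current, wanted):
    """Return (copies to create, copies to remove) to reach the policy."""
    return sorted(wanted - current), sorted(current - wanted)


sites = [Site("paris", "EU"), Site("frankfurt", "EU"), Site("nyc", "US")]
payroll = FileMeta("/hr/payroll.csv", tags={"eu-only"},
                   consumers={"frankfurt", "nyc"})
wanted = desired_placement(payroll, sites)
print(plan_moves({"nyc"}, wanted))  # (['frankfurt'], ['nyc'])
```

Here governance wins over locality: there are consumers in New York, but the `eu-only` tag means the copy there gets removed and the Frankfurt copy is the one that gets created.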
Oh, and what if Hammerspace threw in object stores as back ends, so that policy-based data movement and access let you collect data using regular file sharing toolsets and make sure that it's also available to your cloud-based ML systems that are optimized for object storage? Or cloud-based applications that export to object stores that need to be made available to on-prem NAS-based applications? All managed by policy, not by scripts.