Tackling that unstructured data mess, practically

Yesterday I wrote a little about the resurgence of phone calls asking about ILM and played a bit of a highlight reel as to a workable strategy from the perspective of a Compliance expert.  The challenge, as I stated, was to tackle this massive problem in unstructured data without trying to solve for world peace in the process.  Today I want to talk more practically about addressing unstructured data growth.  This is timely, since the customer panel at EMC World is talking about this right now.

How We Do It

In Consulting, we help clients tackle the unstructured data estate either through a complete storage and data management strategy or through a tactical, targeted project to size, identify and execute on opportunities to control that data.  The steps are the same either way and assume the client has no tools at their disposal to accelerate the process:

1.) Deploy lightweight discovery tools to capture the disposition of the unstructured data in question such as duplication, aging (last modification, last access) and file types (among other metadata elements)

2.) Talk to the business to understand their requirements for managing those unstructured files such as retention and access requirements, compliance considerations and other legal/regulatory concerns

3.) Review the application and storage infrastructure architecture requirements to support the unstructured estate
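The discovery pass in step one can start very small. Here is a hedged sketch in Python, not any particular EMC tool: it walks a directory tree and records the metadata elements mentioned above (size, last modification, last access, file type), plus a content hash so duplicates can be spotted later. The record field names are my own invention for illustration.

```python
import hashlib
import os
import time

def scan(root):
    """Walk a directory tree and collect basic disposition metadata
    for each file: size, last-modified/last-accessed age in days,
    file type (extension), and a content hash for duplicate detection."""
    now = time.time()
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            # Hashing whole files is fine for a sketch; a production
            # scanner would hash lazily or sample large files.
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            records.append({
                "path": path,
                "bytes": st.st_size,
                "days_since_modified": (now - st.st_mtime) / 86400,
                "days_since_accessed": (now - st.st_atime) / 86400,
                "type": os.path.splitext(name)[1].lower() or "(none)",
                "sha256": digest,
            })
    return records
```

Even a toy scan like this answers the first-order questions: how much is there, how old is it, and how much of it is the same file saved twelve times.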

How That Works

Step one simply tells us if there is a problem in the first place.  If we scan a petabyte of data and find it frequently used and carrying high business value, the goals in steps two and three shift pretty rapidly from purging data to managing it more cheaply.  We are not often surprised.  Out of this data collection effort we know how big the problem is and where we should focus efforts to correct it (e.g., de-duplication versus archive or purge).

Step two is the trickiest of the three.  I want to understand what the big picture looks like: are they a regulated business where ALL data is subject to controls or is only a small fraction of the unstructured estate covered?  Do they have a set of policies in place today?  Are they reasonable?  Do business units routinely try to subvert IT’s data management approach?  These answers help us frame up recommendations that will actually work in practice and usually give IT intelligence about their consumers they did not possess before.  I can’t tell you how many times I’ve heard “If I knew [some critical business driver] I would have done this so much differently…”

Step three is the easiest piece and also the most fun because we get to play with ideas and speculate about the wrench-turning.  We aim to uncover the features and functionality this business needs to keep people productive in their jobs as they create and use that unstructured data.  If we make architecture recommendations, these data points guide our thinking.  We can do that most effectively when performance management tools are deployed but even without them, this is the one area your IT professionals probably have wired.  A few interviews are all it takes.

Putting it all together

Now that I know the size of the unstructured “problem” and where to target my solutions, I can combine that with useful recommendations for tools, process, policy and so forth that don’t conflict with the mission of the business.  As an example, EMC found years ago that e-mail (and attachments, specifically) was the biggest target with a measurable return on investment.  With a few policy changes and implementation of an archive solution that problem was made far less troublesome.  Did we, the users, complain?  Of course we did, but we came around.  We griped mainly because we hate change of any kind.

We might recommend a course of action that includes acquiring a tool or some technology to automate or accommodate the changes we think are needed.  What we won’t do is recommend magic bullet solutions.  Get in, solve the problem, get out.  Like my colleague Sally Dovitz, I just want to keep it simple.  If the business case for widening the net is strong, so be it.  My goal is to keep our guidance manageable and to avoid creating more work and cost than the problem can justify.  I want to tell you to “archive these files right over here and delete those over there.”  If that action demands a tool, I’ll tell you what that tool should be capable of doing.  If not, let’s dive in and clean up the cesspool.
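“Archive these files right over here and delete those over there” can literally start as something this small. A hedged sketch with a dry-run default; the one-year threshold and the `.tmp` cleanup rule are placeholder policy for illustration, not a recommendation:

```python
import os
import shutil
import time

def apply_policy(root, archive_dir, stale_days=365, dry_run=True):
    """Plan (or execute) a minimal disposition policy: move files not
    modified in `stale_days` into `archive_dir`, delete obvious scratch
    files (.tmp), and leave everything else alone.  Returns the list of
    (action, path) tuples so the plan can be reviewed before running."""
    cutoff = time.time() - stale_days * 86400
    actions = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".tmp"):
                actions.append(("delete", path))
                if not dry_run:
                    os.remove(path)
            elif os.stat(path).st_mtime < cutoff:
                actions.append(("archive", path))
                if not dry_run:
                    os.makedirs(archive_dir, exist_ok=True)
                    shutil.move(path, os.path.join(archive_dir, name))
    return actions
```

The dry-run default matters: the plan is something you show the business before any file moves, which is exactly the conversation step two is about.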

And that’s the way you get started dealing with your unstructured data estate: simple, direct, defined.  Anyone with a tool (there are lots of free ones) that can report on the disposition of your files can tell you which ones should go away.  The goal should be to get to a positive action for disposing, migrating, managing or otherwise dealing with the data.  The goal should not be to wrap your unstructured data in overcomplicated policy, controls or rules in the hope that some magic tool will solve the problem for you.  Your end users are crafty; they will find a way to unsolve it.

This entry was posted in Big Data, Future of IT, The Nature of IT. Bookmark the permalink.

3 Responses to Tackling that unstructured data mess, practically

  1. Lukas says:

    Hi Peter,

    I find your article very useful. At the moment, I am assigned to finding out where my company is with its use of unstructured data, and I probably do not have to mention that this is an extremely dirty job.
    My company uses a file server with virtualized drives accessible to various business units. They had been using this for years until they implemented software to handle most of the business processes. This software, however, exports some data onto the file server, and there are also some leftovers from the time when everything was done more or less manually. Some sections of our file server are still heavily used, while others might be simply redundant. We still count data in gigabytes, but it is in a very bad state. Some people cannot express what they do with data files (Word documents, spreadsheets, JPEGs etc.), what they have created in the past and who would be in their collaboration group in terms of data exchange. They prefer to e-mail each other attachments, taking no advantage of shared storage or the intranet SharePoint. Many jobs are ad hoc and job roles are very fluid.
    The company wants me to capture where we are with the data use, propose some recommendations and then work on a Data Management Policy having a future migration to a more robust storage solution in mind.
    My first step was to create an online survey where I grouped people (roughly 50) by their functions, the drives they access, the data files they produce and the apps they use. With this knowledge in mind, I am conducting short interviews with everyone, asking what they use on the server, what they need to access to complete their tasks, what they do with the files and whom they do it for. The majority of people work on their local drives, sharing whatever needs to be shared via e-mail. I mean, this is a very messy undertaking, and I am slowly starting to doubt that my whole approach is right at all. I wanted to work on a Business Data Model, but it was expressed to me: “No, we don’t need that, just model the collaboration groups that you identified.” And this is fair enough since, honestly, the majority of core business processes are handled within the application they use (Manufacturing, Sales, Warehousing, Shipping etc.).

    Peter, what do you think of my approach? I am a graduate learning on my feet while doing this project, and this is invaluable experience for me. What lightweight tools would you recommend to scan our data resources and help me capture the facts about our data? Note that duplication is not much of an issue, except maybe on people’s local drives.
    Your comments are highly appreciated.


    • Peter says:

      I like the approach. It benefits from the fact that your environment and user base are reasonably small. That won’t always be the case but you should use this exercise as a good foundation for when the business grows beyond the ability of one person to act in a meaningful way.

      It’s at this stage I would start encouraging people to tag their documents properly with authorship, subject and content summaries. Metadata makes all the difference. I am sitting on a collection of pictures from the construction of Legoland California. To my knowledge, I am the only one who possesses them. You can imagine how much of a pain it is to identify a random construction site from a 15-year-old picture with no notes. Your company’s data management strategy will at some point need to take this kind of consideration into account.

      In our consulting practice, documents are published to SharePoint, where metadata tagging is not optional. This makes it simpler for the crawlers to identify dated or dead information automatically; so do appliances, massive enterprise archives and big data implementations. You’re not approaching that threshold just yet, but it is inevitable. Unstructured data (files) will grow 40%-60% per year. It just does.

      Again, I like your approach. Any time you actually talk to the business about what they create and where their requirements lie, you advance the mission of changing the conversation. IT and the business are partners; conversations like these show that partnership at its best.

