Julian Ackert, managing director at iDiscovery Solutions, discusses the differences between structured and unstructured data and how database discovery can allow a legal team to conserve resources and get the best outcome for their client.
CCBJ: Databases have been a source of electronically stored information (ESI) for some time, and discovery of structured database content is not new. So, what is database discovery?
Julian Ackert: Traditional discovery looks at and identifies information that resides in individual files – like word-processing documents, spreadsheets and PowerPoint presentations – as well as free-form communications such as emails. But database discovery also considers data points that exist in transactional database systems that can identify the behavior of individuals and help with other kinds of analyses.
What that means, for example, is that if a case revolves around where somebody was when they did something, you can look at the content of unstructured data to understand what was written about the question at hand, but you can also look at the context of structured data, which is essentially a record of their data footprints – where they were at a certain date and time, geolocation coordinates, anything related to that type of information. Content, such as what is written in the body of an email, is something that you would typically have a human being review, but the context can help you go beyond that. Database discovery is looking at both content and context together.
What is the difference between structured data and unstructured data?
Let’s start with the definitions of “structured data” and “unstructured data.” Unstructured data is the content of emails or documents or other individual files, as I was just describing, and structured data is what comes out of a transactional system. Generally, when we’re looking at discovery and fact-finding, we look at the unstructured data, because that’s the easiest for us to digest and understand. Structured data, on the other hand, is millions of rows inside a database, and it’s not an easy exercise to understand what that data represents. One thing to note: all data is structured in some way, shape or form. So the label “unstructured” is a bit of a misnomer, but unstructured data generally has more free-flowing content that requires the eyes of an individual in order to understand its meaning.
For example, when someone writes an email, they’re using natural language – and the words in that email are unstructured data. The natural language needs to be evaluated by someone to understand what is being said. By contrast, if you’re dealing with a product management system that is tracking a product from its source to its destination, each point along the path from point A to point B is recorded in a structured database system – also known as a transactional system – and that record is structured data.
How would you combine both structured data and unstructured data when performing a database discovery?
First, let me say that we’ve come a long way in the discovery process, and this question highlights that fact. When I initially got involved with discovery (really electronic discovery, although in my experience you can drop the word “electronic,” because it’s all discovery; data, information and facts just happen to be recorded electronically in today’s world), emails were just becoming a potentially relevant data source. And at that time, emails, which are considered unstructured data, were not very easy to deal with. They were difficult and complex. I remember some case teams saying, “Let’s not even get into emails.” And that was both sides saying, “Emails are not something we should get into.”
That was a long time ago. Emails are a ubiquitous component of discovery today. It’s actually one of the first places you look, which I think is fine. But there may be a better way to investigate facts – if we are able to review the emails and at the same time understand some of the context around them, we get a better, faster view into the facts.
As an example, let’s look at communication frequency to find parties of interest. Email traffic between parties may be very high or very low, and either can provide important context. For example, in a particular case two parties may be very critical to the facts, but I may have zero or very few emails within a particular timeframe for those parties. If so, that is an indication that email may not be the preferred form of communication between these parties. Or, on the flip side, it may be very interesting context if I have thousands of emails with a particular person in a particular timeframe. We’re taking unstructured data and evaluating it with the context of structure (frequency) to get a better sense of the facts in a matter. We can take this a step further and look even deeper at the structured data. We might even flip the scenario and start to look at the context of the structured data first, before we start looking at the content of unstructured emails.
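To make the communication-frequency idea concrete, here is a minimal sketch in Python. It assumes the email metadata has already been extracted into (sender, recipient, date) tuples; the names, dates and the pair_frequency helper are hypothetical and for illustration only, not a description of any particular tool.

```python
from collections import Counter
from datetime import date

# Hypothetical email metadata: (sender, recipient, date sent).
# All values are made up for illustration.
emails = [
    ("alice", "bob", date(2020, 1, 5)),
    ("bob", "alice", date(2020, 1, 6)),
    ("alice", "carol", date(2020, 1, 7)),
    ("alice", "bob", date(2020, 2, 1)),
    ("bob", "alice", date(2020, 3, 15)),
]

def pair_frequency(messages, start, end):
    """Count messages exchanged between each pair of parties within [start, end]."""
    counts = Counter()
    for sender, recipient, sent in messages:
        if start <= sent <= end:
            # A frozenset ignores direction, so alice->bob and bob->alice
            # are counted as the same pair.
            counts[frozenset((sender, recipient))] += 1
    return counts

january = pair_frequency(emails, date(2020, 1, 1), date(2020, 1, 31))
for pair, n in january.most_common():
    print(sorted(pair), n)
```

A pair that is central to the facts but shows a count of zero in the window is as much a signal as a pair with thousands of messages; a Counter returns 0 for pairs it has never seen, so both cases can be checked the same way.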
For example, going back to the product life cycle, if I am analyzing a product life cycle through the use of structured data, I see trends and patterns in the product life cycle that tell me, from a fact perspective, that there are some interesting time frames of interest. Based on that analysis of structured data, I can now focus on time frames of key interest for the unstructured data.
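As a rough sketch of that structured-first workflow, the Python below flags days with unusually heavy activity in a set of tracking events and then narrows a document list to those days. Everything here is an assumption made for illustration: the event dates, the document names, the days_of_interest helper, and the rule of flagging days above twice the median daily count. A real matter would call for a rule chosen with the case facts in mind.

```python
from collections import Counter
from datetime import date
from statistics import median

# Hypothetical tracking events: one date per recorded product movement.
events = (
    [date(2021, 3, 1)] * 2
    + [date(2021, 3, 2)] * 3
    + [date(2021, 3, 3)] * 12
    + [date(2021, 3, 4)] * 2
)

def days_of_interest(event_dates, factor=2.0):
    """Return the days whose event count exceeds factor times the median daily count."""
    daily = Counter(event_dates)
    baseline = median(daily.values())
    return sorted(day for day, n in daily.items() if n > factor * baseline)

flagged = days_of_interest(events)

# Hypothetical unstructured documents, each tagged with its date.
documents = [("memo A", date(2021, 3, 3)), ("memo B", date(2021, 3, 10))]

# Review only the documents that fall on a flagged day.
to_review = [name for name, day in documents if day in flagged]
print(flagged, to_review)
```

The analysis of the structured records runs first and is cheap to repeat; the human review is then pointed at the much smaller set of documents that falls inside the flagged time frames.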
Databases often have millions or even billions of records. Analyzing that many unstructured documents seems like it would be a very expensive exercise. Won’t database discovery, then, be cost-prohibitive?
This ties in nicely with what I was just talking about. The analysis of hundreds of thousands, millions or even billions of transactions does not equate to the review of that same number of documents. Unstructured content in documents is expensive to review because, predominantly, a human being is examining the content to decide whether and how it is relevant to the issues in scope (perhaps with the help of technology-assisted review).
The costs scale linearly with the volume – the more unstructured data you add, the more human review time you need. In contrast, analyses of structured data can be designed once and then applied to any number of records without a linear increase in cost. It’s widely known that the most expensive part of the discovery process is the review of documents. So spending some time and energy on analyzing structured data first can actually save you significant dollars – relative to the cost of discovery throughout the whole life cycle – because it allows you to home in on the key points of interest related to the claims and defenses of the case.
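The difference between the two cost curves can be sketched with a toy model in Python. The functions and every number below are hypothetical placeholders, not real rates; the only point is the shape of the curves, with review cost growing with volume and analysis cost dominated by a one-time design effort.

```python
def review_cost(num_documents, cost_per_document):
    """Human review: cost grows in direct proportion to the number of documents."""
    return num_documents * cost_per_document

def analysis_cost(num_records, design_cost, cost_per_record=0):
    """Structured analysis: a one-time design cost plus an assumed
    negligible per-record cost (zero by default in this toy model)."""
    return design_cost + num_records * cost_per_record

# Placeholder figures chosen only to show the shape of each curve.
for volume in (1_000, 1_000_000):
    print(volume, review_cost(volume, 2), analysis_cost(volume, 5_000))
```

In this toy model, doubling the document count doubles the review cost, while the analysis cost stays at the design cost no matter how many records the analysis is applied to.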
I often ask, “What are the questions of fact that we’re trying to answer?” Because if you understand what you’re trying to answer, there are oftentimes very interesting structured data sources that can help you find those answers faster and cheaper than traditional unstructured document review.
Databases are complicated. Is the juice worth the squeeze?
I often get this question, and I’ll tell you that, in my experience, it all comes down to understanding the data sets of key importance. There could be thousands of databases related to a particular workflow or an issue. They may all help answer the questions of fact in a case. But not all data is created equal. Oftentimes, unless there’s a good consultative exercise performed up front to focus on the structured data sets of highest value, the juice may not be worth the squeeze after all. A shotgun approach to the structured databases of interest is not going to be the most effective approach. A consultative approach instead allows the case team to be surgical and analyze the databases that are most interesting. When a consultative approach is applied, in almost every case I’ve worked, the juice is absolutely worth the squeeze.
How would a legal team get started with this kind of analysis?
To get started, you need two things. First, you need a good consultant on your team who can understand the facts of the case and guide the team towards the potentially relevant structured data of key interest. There is not a “one-size-fits-all” solution for structured data analysis like we see with email. Second, you need technology that will allow you to analyze structured data content efficiently, so that your work product is forensically sound, your process is repeatable, and you leverage prior experience. One example of that technology is the xIOT® platform that is offered by iDS.