Rob Jasper, Chief Technology Officer at Intelligent Results
Over the last twenty years, the growing size and sophistication of databases coupled with continually decreasing storage prices have helped businesses amass very large quantities of data. No one disputes that there is great value in this new resource. The promise of data is that if we have enough information about the past, and the right tools for analyzing and understanding that data, we can make useful predictions about future customer behavior, or anticipate customer needs with timely offers. But the tools don't create themselves, and the value in data has often proved to be difficult to find, conserve, or exploit.
Where do we look for opportunities to advance data mining and predictive analytics? On the data mining side, we need to find ways to bring more data, and more different kinds of data, into the process. That means tapping into some kinds of data that have traditionally been difficult to obtain or work with: exotic data. On the analytics side, while modeling algorithms have come a long way in the past 10 or 15 years, and while there's still a lot of technical innovation going on, we need to concentrate on effectively delivering all this innovation to the business community. But for this article, the focus will be on the data mining side: finding and utilizing a wider range of data.
Exotic data is anything outside of basic, straightforward, traditional structured data. First and maybe foremost, that means text data, though there are other flavors of unstructured data, including semi-structured data and descriptive attributes, that we'll be touching on.
Where do organizations encounter text data? The classic example is call center notes, jotted down by reps as they interact with customers. But there's also email correspondence, voice-to-text transcriptions, and so forth. Pretty similar types of stuff, you might think, but each has characteristics that you have to be ready to cope with. For example, customer e-mails tend to be fairly coherent and free of non-standard terms because they're written by people who are motivated to get their points across. Call center notes are at the opposite end of the spectrum. They're typically typed by agents who are unlikely to ever read those notes again—the truth is that hardly anyone ever reads call center notes. They're full of abbreviations and shorthand—it's not even English, really, and probably wouldn't parse with an English language parser: "Cust sez will mail check @ 10/15 if SS chk arvs, else init prtcl B11." A third source of text data, voice-to-text transcription, comes with its own set of problems—for example, there are all sorts of corruptions that get introduced during the transcription process, some that can be corrected automatically, some that can't.
Predictive modeling works by analyzing the discrete elements of information available. These elements can go by different names—attributes, features, variables, fields. With structured data, the number of attributes typically is not great. When you're working with a database table, the number of attributes is the number of table columns. If there are twenty columns, that's just twenty attributes, even if there are 20,000 rows.
But when you model with text, each individual word might be an attribute, depending on your approach. So if you have a word "melon" that appears in your document collection, that's one attribute, and your model can track the occurrence of that word—in which documents does it occur, and how often? You might also have reason to keep track of documents in which the term doesn't occur. You could even keep track of which other terms tend to occur in proximity to our term—cantaloupe or honeydew, perhaps.
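To make the idea of words as attributes concrete, here is a minimal sketch in Python. The three documents are made up for illustration, and the function names (tokenize, term_document_counts) are hypothetical; this is one simple way to count terms, not a description of any particular product.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_document_counts(documents):
    """Return one Counter per document, mapping word -> occurrences."""
    return [Counter(tokenize(doc)) for doc in documents]


# Made-up example documents, for illustration only.
documents = [
    "The melon was ripe, and the cantaloupe was ripe too.",
    "Customer asked about honeydew melon prices.",
    "Customer promised to pay next week.",
]

counts = term_document_counts(documents)

# Every distinct word in the collection is a potential attribute.
vocabulary = sorted(set().union(*counts))

# In which documents does "melon" occur, and how often?
melon_occurrences = {i: c["melon"] for i, c in enumerate(counts) if c["melon"] > 0}

# In which documents does it not occur at all?
melon_absent = [i for i, c in enumerate(counts) if c["melon"] == 0]
```

Even this tiny collection produces more attributes than documents, which gives a feel for how quickly the dimensionality grows.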
A large collection of documents might have tens or even hundreds of thousands of distinct words, along with numerous significant clusters of words that you bring to the model's attention. That's a lot of attributes, and the data, in statistical terminology, has high dimensionality. Traditional modeling algorithm types, like linear and logistic regression, can't cope with that many attributes. But the more recent ensemble modeling techniques are able to cope with high dimensionality, and are therefore well suited to dealing with text.
Semi-structured data can come from things like HTML documents, where you have tags interspersed within the unstructured text. You can also have system-generated text interspersed within an agent's text notes. For example, the agent might be able to activate an on-screen menu: "Customer paid $50," or "Customer promised to pay $15." Since it's not text that's keyed in by the agent, it's more regular, and that regularity constitutes an opportunity—you can pull out amounts or enumerate all the possible values for a drop-down. It's more like structured data, except that you're grabbing it from within unstructured text.
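As a rough illustration of pulling regular, system-generated phrases out of free-form notes, the sketch below uses a regular expression. The two menu phrases come from the example above; the surrounding note text, the pattern, and the function name extract_events are assumptions made for the example.

```python
import re

# Matches the two system-generated menu phrases used as examples above.
# A real system would enumerate every menu item; this list is an assumption.
MENU_PATTERN = re.compile(r"Customer (paid|promised to pay) \$(\d+(?:\.\d{2})?)")


def extract_events(note):
    """Return (action, amount) pairs for each menu phrase found in a note."""
    return [(action, float(amount)) for action, amount in MENU_PATTERN.findall(note)]


# A made-up agent note mixing shorthand with system-generated text.
note = "cust called re late fee. Customer promised to pay $15. will cb fri. Customer paid $50."
events = extract_events(note)
```

The agent's shorthand is ignored, while the regular phrases come out as clean, structured values that can be fed to a model like any other column.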
A final type of unstructured data to consider is descriptive attributes. The classic example of a descriptive attribute is sentiment—does a document have a positive, happy sentiment or a negative, unhappy sentiment? You can see why organizations would want to be able to sort documents by sentiment. You start with a set of words like the one below, which defines negative sentiment. The more of those terms that occur in a document—and the more frequently they appear—the more likely that document is to be expressing displeasure.
A lexicon of negative sentiment. Not all words are semantically negative (for example, "satisfactory"), but the presence of such words in text is nonetheless predictive of negative sentiment.
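A lexicon-based score of this kind might be sketched as follows. The word list here is a small hypothetical stand-in, not the lexicon referred to above, and the scoring rule (share of tokens that appear in the lexicon) is just one simple choice.

```python
import re

# Hypothetical stand-in lexicon. "bad", "atrocious" and "satisfactory" are
# mentioned in this article; the remaining words are assumptions.
NEGATIVE_LEXICON = {"bad", "atrocious", "terrible", "unacceptable", "satisfactory"}


def negative_score(text):
    """Return the fraction of tokens in the text that appear in the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in NEGATIVE_LEXICON)
    return hits / len(tokens)


complaint = "The service was bad, really bad and atrocious"
thanks = "Thank you for the quick reply"

complaint_score = negative_score(complaint)
thanks_score = negative_score(thanks)
```

Because repeated terms count each time they appear, a document that says "bad" twice scores higher than one that says it once, which matches the intuition described above.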
You can take the process one step further by noting which terms tend to occur in proximity to terms known to have negative sentiment. So even if a word like "obnoxious" isn't in your list, an algorithm can figure out pretty easily, after scanning a large number of text selections, that it has negative sentiment. The way it does this is by looking at the co-occurrence of "obnoxious" with other words that are known to have negative sentiment, like "bad" or "atrocious." It turns out that words that have negative sentiment tend to co-occur with other such words, so if you have a large enough selection of text, you can look at the semantic spin and determine if a word is oriented toward negative or positive sentiment.
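One very simple way to estimate a word's orientation from co-occurrence is sketched below. The seed word sets, the four documents, and the scoring formula are all assumptions chosen for illustration; real collections would be far larger, and document-level co-occurrence is only one possible definition of "proximity."

```python
import re

# Hypothetical seed words with known sentiment.
NEGATIVE_SEEDS = {"bad", "atrocious"}
POSITIVE_SEEDS = {"good", "excellent"}


def negative_orientation(word, documents):
    """Score a word from -1.0 (co-occurs only with positive seeds) to
    +1.0 (co-occurs only with negative seeds); 0.0 means no evidence."""
    neg = pos = 0
    for doc in documents:
        tokens = set(re.findall(r"[a-z]+", doc.lower()))
        if word not in tokens:
            continue
        if tokens & NEGATIVE_SEEDS:
            neg += 1
        if tokens & POSITIVE_SEEDS:
            pos += 1
    total = neg + pos
    return 0.0 if total == 0 else (neg - pos) / total


# Made-up documents, for illustration only.
documents = [
    "The agent was obnoxious and the service was bad.",
    "Atrocious wait times and an obnoxious recording.",
    "Good service, excellent follow-up.",
    "The rep was helpful and the outcome was good.",
]

obnoxious_score = negative_orientation("obnoxious", documents)
helpful_score = negative_orientation("helpful", documents)
```

With only four documents the scores are extreme; the point is that "obnoxious" lands on the negative side without ever having been added to a list by hand.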
In the specific context of predictive modeling, though, sentiment doesn't seem to add much lift to a model. It's a fascinating concept, and it's a great starting point for all kinds of interesting research, but right now it's not particularly compelling for our predictive modeling technology.
Another technique that has proved disappointing is correcting text data. We've seen no evidence (and this was kind of a shock) that fixing typos or spelling errors actually increases model lift. For example, we built some models that contained the term "attorney," and we found that term to be highly predictive. But there were five different variants of the base term: misspellings, plurals, and so forth. So we fixed them up and mapped them all to a single term, but that clean-up didn't increase the predictive power of the model at all. So should we even bother to do such clean-up? For predictive power, apparently not, except that from the customer perspective, it can be alarming to see that the top five predictive terms in a model include two or three variations on a single word. Intuitively, it seems like the model is failing to recognize something important.
At Intelligent Results, we've built thousands of models over the past five years, and in those models we've tried all kinds of approaches with text. We've combined multiple terms into one, we've taken a single term and divided it into multiple terms, and we've fixed mistakes and applied part-of-speech parsers. More often than not, when you're building a predictive model, you're better off not doing any such preparation. This goes against intuition, because our brains appreciate organization, but we test this stuff, and we pay attention to what works and what doesn't.
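For readers who want to see what mapping variants to a single term looks like mechanically, here is a small sketch. The specific variant spellings and the sample note are hypothetical; the article does not list the actual variants that were found.

```python
import re
from collections import Counter

# Hypothetical variants of the base term; the real variants are not listed here.
VARIANT_MAP = {
    "attorneys": "attorney",
    "atty": "attorney",
    "attorny": "attorney",
    "attourney": "attorney",
}


def term_counts(text, normalize=False):
    """Count terms in a note, optionally collapsing known variants first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if normalize:
        tokens = [VARIANT_MAP.get(token, token) for token in tokens]
    return Counter(tokens)


# A made-up call center note.
note = "cust says atty will call; attorny letter recd; attorneys office closed"
raw = term_counts(note)
merged = term_counts(note, normalize=True)
```

In the raw counts the signal is spread across three separate attributes, while the merged counts put it all under one. Whether that helps the model is an empirical question, and in our experience the answer has usually been no.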
Exotic isn't just unstructured and semi-structured data. We've been able to find new sources of valuable information even in traditional structured data. One such promising area is raw data—data that hasn't been prepared, transformed, or aggregated in any way. We've implemented algorithms in our software that are very good at leveraging raw data directly, whereas if you're building a traditional linear or logistic regression model, you have to do all this work up front to put the data into a really nice form to deal with matters like outliers and distributions. We can deal with the raw stuff and sometimes leverage that in a way that the clever modeling people might not have been able to anticipate. Let's say that a bank is building a model on transformed, groomed data—normalized account balances and things like that. That's fine, but for our software, we'd also like to have the data the bank used as the basis for their transformations. We've found that we can often find more predictive signal—more lift—in that raw data than we can in the data that the modeling guys have groomed.
Ideally, we'd like to have our cake and eat it too—we want the value that comes from preparing data, because modelers know how to eliminate some noise and isolate what's valuable. But we also want data that's unmodified, because the newer modeling algorithms find signal that would get lost during any clean-up. The good news is that we can have both—we can use both groomed data and raw data, sometimes in the same model. Ultimately, the objective is to get the most bandwidth, the most different perspectives on the data.
Moving into even more exotic territory on the structured side, we come to temporal data and latent variables.
Not to keep fracturing the discussion, but there are actually two kinds of temporal data that you have to take into account: Snapshot data is information about data at fixed intervals—like account balances at monthly intervals. It's relatively easy to compare accounts across standard intervals like these. Transactional data is trickier—it's a record of defined events at irregular intervals. Customer A writes a check on April 7, then goes to the ATM on April 9. Some transactions occur repeatedly, others maybe once, or never.
There's a lot of information in temporal data, but it takes work to get at it. One approach is aggregation, which involves flattening the data—taking, say, a series of balances over time and averaging them. Not very exciting. A more intriguing approach is pattern similarity analysis. Let's say a customer tends to build up credit card balance in the winter and then pay it off over the summer. That's a simple pattern. Real data can follow more complex patterns that aren't so easy to spot. What if you could compare such patterns across thousands of accounts, find points of similarity and ways of assigning different accounts to classes, based on pattern similarity?
You wouldn't have to know why these accounts are similar, or even what makes them similar. But if you knew that certain classes of accounts had a higher likelihood of defaulting, or opening additional accounts, or doing something, then you would be able to process new sets of accounts and identify the ones with the same tendency.
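A bare-bones version of assigning a new account to a class by pattern similarity is sketched below as a nearest-neighbor lookup. The balance series, the class names, and the use of plain Euclidean distance are all assumptions for the sake of the example, not a description of a production approach.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_class(series, labeled_examples):
    """Return the class label of the most similar known pattern."""
    closest = min(labeled_examples, key=lambda example: distance(series, example[0]))
    return closest[1]


# Hypothetical reference patterns (six monthly balances) with class labels.
labeled_examples = [
    ([100, 300, 500, 300, 100, 50], "class_a"),   # builds up, then pays off
    ([100, 200, 300, 400, 500, 600], "class_b"),  # steadily growing balance
]

new_account = [120, 280, 520, 310, 90, 60]
assigned = nearest_class(new_account, labeled_examples)
```

If class_a accounts were known to default more often, the new account would inherit that tendency, without anyone having to explain why the pattern matters.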
Two accounts might be similar except they're on a different scale—Account A never dips below a minimum balance of $25,000, but Account B goes as low as $2000. Is it appropriate to normalize away that difference? Maybe, maybe not.
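To see what normalizing away a difference in scale does, consider this sketch using z-normalization (subtract the mean, divide by the standard deviation). The balance series are made up to match the minimums mentioned above, and z-normalization is one common choice among several.

```python
import math


def z_normalize(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0] * len(series)
    return [(x - mean) / std for x in series]


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up monthly balances: same shape, very different scale.
account_a = [25000, 40000, 55000, 40000, 25000, 40000]
account_b = [2000, 3500, 5000, 3500, 2000, 3500]

raw_distance = euclidean(account_a, account_b)
normalized_distance = euclidean(z_normalize(account_a), z_normalize(account_b))
```

After normalization the two accounts look essentially identical, which is exactly the trade-off: the shape survives, but the fact that one customer keeps ten times as much money on hand is thrown away.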
When you're dealing with transactional data and you don't have the luxury of fixed intervals it gets even harder—if Account A is showing a pattern over a 24-month scale, and Account B is showing a similar pattern, but over a 6-month scale, how meaningful is that similarity?
Take a look at the images below. In the first image, the analysis is Euclidean: it compares the patterns and doesn't try to adjust the time scales. As a result, it's failing to recognize just how similar the two patterns actually are. But the second image shows how with a little chronological flexibility, you can find similarities that you'd otherwise miss. The term for this kind of pattern matching technique is "Dynamic Time Warping." It's a useful concept, besides having a nice futuristic sound to it.
Euclidean comparison of two patterns over time fails to recognize similarities obscured by distortions in the time scale.
With dynamic time warping, differences in time scale are dampened, revealing the underlying similarity of the patterns.
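The contrast between the two comparisons can be sketched in a few lines of Python. This is the textbook dynamic-programming formulation of dynamic time warping, using squared differences as the local cost; the two series are made up, and nothing here is meant to describe a specific product implementation.

```python
import math


def euclidean_distance(a, b):
    """Compare two equal-length series point by point, in lockstep."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Dynamic time warping distance: each point may align with one or more
    points of the other series, so stretches in time are tolerated."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b holds
                cost[i][j - 1],      # b advances, a holds
                cost[i - 1][j - 1],  # both advance
            )
    return math.sqrt(cost[n][m])


# Made-up patterns: the same rise and fall, but the second is stretched in time.
fast = [0, 2, 4, 2, 0, 0, 0, 0]
slow = [0, 0, 2, 2, 4, 4, 2, 0]

lockstep = euclidean_distance(fast, slow)
warped = dtw_distance(fast, slow)
```

The lockstep comparison reports a sizable distance, while the warped comparison finds an alignment in which the two patterns match exactly. The table-based version shown here takes time proportional to the product of the two series lengths, which matters when comparing thousands of accounts.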
The final new data source that we can look at is latent variables: attributes that you can derive from the original data. Say you have a variable Balance, and you have snapshots of this variable at monthly intervals. Now you can interrogate the movement of that variable over time and come up with new information. Does Balance increase from month 1 to month 4? The answer to that question could itself become a new yes/no variable. Or, is it true that the value of Balance between month 1 and month 4 is always within some range? Or that the value of Balance is between 1,000 and 30,000 at least once between month 1 and month 4? Which is not even to consider the kinds of inquiries you could make about multiple variables: does Balance increase as Age of Account goes up? There is an infinite realm of possible new variables of this type, which we call the "latent variable space." You can create a crop of these new variables and then use your model to assess their predictive power. Over a series of runs you can refine your conjectures and go back to that latent variable space to find new pathways through the available data.
Latent variables add predictive power by extracting information from the data that the modeling algorithm wouldn't otherwise be able to find. The existing modeling algorithms first of all can't yet deal with temporal data, but let's say that they could. There's still no way the existing model can recognize that there's potential for signal in whether Balance between month 1 and month 4 always remains within a certain range. Somebody has to actually create this abstraction, the concept of "always remains," before the model can search for significance.
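The yes/no questions above translate directly into small functions over a series of monthly snapshots. In this sketch the balance figures, function names, and feature names are all hypothetical; months are numbered from 1 to match the text.

```python
def increases(balances, start_month, end_month):
    """Is the balance higher at end_month than at start_month?"""
    return balances[end_month - 1] > balances[start_month - 1]


def always_within(balances, start_month, end_month, low, high):
    """Does the balance stay within [low, high] for every month in the window?"""
    window = balances[start_month - 1:end_month]
    return all(low <= b <= high for b in window)


def ever_within(balances, start_month, end_month, low, high):
    """Is the balance within [low, high] at least once in the window?"""
    window = balances[start_month - 1:end_month]
    return any(low <= b <= high for b in window)


# Made-up monthly Balance snapshots for one account (months 1 through 6).
balances = [1200, 900, 1500, 2100, 800, 400]

# Each answer becomes a new yes/no variable that a model can evaluate.
latent_variables = {
    "balance_increases_m1_m4": increases(balances, 1, 4),
    "balance_always_1000_30000_m1_m4": always_within(balances, 1, 4, 1000, 30000),
    "balance_ever_1000_30000_m1_m4": ever_within(balances, 1, 4, 1000, 30000),
}
```

Changing the window or the range produces a different variable each time, which is why the space of candidates is effectively unbounded.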
The science of navigating through the latent variable space is what's known as an optimization problem. You can move through the n-dimensional latent variable space in a regular way—go further in one direction or another. It's topological. So what we do—or rather, what the algorithms do—is go a little in one direction, check the result, and then, after analyzing the results, adjust course and go a little further. It's a very powerful technique, because it's able to self-correct. Sometimes you'll be able to refine a model using a particular set of latent variables and then you'll eventually come to a point where you're not finding any improvement with a small incremental move in any direction—or at least not enough to justify any more time or energy. So then you've found what we call a "local maximum," and you have to either take what you've got and stop trying to find further improvement, or else jump to an entirely new area.
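The step-check-adjust loop described above is commonly known as hill climbing, and a minimal version looks like the sketch below. The scoring function is a made-up one-dimensional stand-in with two peaks; in practice the score would come from evaluating a model built with the candidate latent variables.

```python
def hill_climb(score, start, step=0.5, max_iters=1000):
    """Move a little in each direction, keep any move that improves the score,
    and stop when no small move helps (a local maximum)."""
    current = list(start)
    best = score(current)
    for _ in range(max_iters):
        improved = False
        for i in range(len(current)):
            for delta in (step, -step):
                candidate = list(current)
                candidate[i] += delta
                candidate_score = score(candidate)
                if candidate_score > best:
                    current, best = candidate, candidate_score
                    improved = True
        if not improved:
            break
    return current, best


def toy_score(point):
    """Hypothetical score with a small peak at x=2 and a taller one at x=8."""
    x = point[0]
    return max(1 - abs(x - 2), 3 - abs(x - 8))


# Starting near the small peak, the search settles on it and stops.
local_point, local_best = hill_climb(toy_score, [0.0])

# Jumping to an entirely new area finds the taller peak.
better_point, better_best = hill_climb(toy_score, [6.0])
```

The first run stops at the smaller peak because every small move from there makes things worse, which is the situation where you either accept the result or restart somewhere else.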
All these techniques for identifying new types of data go hand-in-hand with finding new modeling algorithms that can consume more data, and more different types of data. It's an exciting area to be working in, and we at Intelligent Results are excited at the prospect of the different kinds of applications we see on the horizon.
As chief technology officer for Intelligent Results, Rob Jasper is responsible for developing the company’s enterprise-level software, including the PREDIGY™ decision management and analytics platform. Rob, who has been with Intelligent Results since its inception in May 2001, has more than 15 years of experience in software research and development. Throughout his career, Rob has focused on applying artificial intelligence and machine learning to real-world problems.
Prior to joining Intelligent Results, Rob was chief scientist at FizzyLab, where he managed the advanced technologies group. Rob’s team led the development of the proprietary text mining and contextual search technology that formed the core platform behind FizzyLab’s product-line, which was used by thousands of people as part of an application-service-provider-based content enrichment service.
Earlier, Rob spent over 10 years with the Boeing Company, where he led research efforts in the areas of ontologies, semantic web, automated reasoning and intelligent computer aided design technologies. While at Boeing, Rob co-authored a company-wide testing guide for use on all business, scientific and engineering applications.
Since 1996, Rob has served on the adjunct faculty of Seattle University’s Master of Software Engineering program where he has taught courses on analysis, design, programming methods and optimization.
Rob holds a bachelor’s degree in economics from Pacific Lutheran University and a master’s degree in software engineering from Seattle University.