As a convenience to our customers, partners and prospects, we provide these terms drawn from the science of predictive analytics, the collections and recovery industry, and Intelligent Results products. We hope you find them interesting and informative.
Account outsourcing; account sale: Accounts deemed not worth working in-house are sold at a discounted rate.
ACD (Automated Call Distribution): A specialized phone system, or the service it provides, for handling incoming calls. The system recognizes and answers calls according to instructions in a database, before sending them to operators or agents. The system also offers management information on the type and volume of calls and on the efficiency of the agents. ACD systems are used extensively in credit and collections call centers.
Action: In PREDIGY, a collection of formulas. See formula.
Additive: A model or data source that contributes unique predictive power is described as additive. If two models are both predictive but cannot be combined to create a new model more powerful than either of the original two, neither model is additive to the other.
Additive Transform Modeling: A predictive modeling technique that builds models iteratively, with each successive model focusing on errors in the previous model.
Aggregation: The process of summing or combining discrete data elements.
AIP (Application Instruction Package): A binary file exported from the PREDIGY Design Environment containing a model, cluster model, or strategy ready to be used for scoring with the Production Engine.
Alias list: A collection of terms to be considered as a single term for purposes of modeling. For example, an alias list could define "cust," "cstr," and "custo" as variants of the base term "customer."
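As an illustration of the alias list idea, the sketch below maps each variant to its base term before modeling. The dictionary layout and the function name apply_aliases are assumptions made for this example; they do not describe the PREDIGY alias list file format.

```python
# Hypothetical alias list: each variant maps to its base term.
ALIASES = {"cust": "customer", "cstr": "customer", "custo": "customer"}

def apply_aliases(terms, aliases=ALIASES):
    """Lower-case each term and replace known variants with the base term."""
    return [aliases.get(term.lower(), term.lower()) for term in terms]

normalized = apply_aliases(["Cust", "called", "cstr", "customer"])
```

After this step, all three spellings count as the single term "customer".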
Analytics: The analysis of data for purposes of optimizing business processes.
Attributes: The structured and unstructured data elements contained in each record of a data set. Attributes are also referred to as features. PREDIGY uses the term variables.
Attrition: When customers or subscribers stop using a product, service or supplier. There are multiple types of attrition, spanning the loss of revenue from a single product to the loss of an entire relationship. See also churn rate.
Bagging: Also known as bootstrap aggregation, bagging is an ensemble modeling technique whereby numerous randomly selected subsamples of a data population are analyzed for their ability to predict a value. The model averages over the subsamples to predict a numeric target, or does a plurality vote to predict a segmented target.
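A minimal sketch of bagging for a numeric target, assuming a one-variable least-squares line as the submodel. The helper names fit_line and bagged_predict, the number of submodels, and the sample data are all hypothetical; PREDIGY's own submodels are not described in this glossary.

```python
import random
from statistics import mean

def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:  # every x in this sample is identical: fall back to a flat line
        return 0.0, y_bar
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return slope, y_bar - slope * x_bar

def bagged_predict(points, x_new, n_models=25, seed=0):
    """Average the predictions of submodels fit on bootstrap samples."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # Draw a sample of the same size as the data, with replacement.
        sample = [rng.choice(points) for _ in points]
        slope, intercept = fit_line(sample)
        predictions.append(slope * x_new + intercept)
    return mean(predictions)

data = [(x, 2 * x + 1) for x in range(10)]
prediction = bagged_predict(data, 4)
```

For a segmented target, the final averaging step would be replaced by a plurality vote across the submodels.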
Baseline: A collection of data used as a basis of comparison against future collections of data.
Binary model: A binary model predicts the value of a binary target. Scores from a binary model often represent the likelihood of an instance being one of the values of the target (for example, the likelihood that the instance will be true).
Boosting: An ensemble modeling technique whereby samples of a data population are analyzed in succession. Instances that the modeling algorithm fails to predict correctly in earlier rounds are weighted more highly or selected more frequently in later rounds; the underlying model focuses more on the highly weighted instances. Based on the work of Robert Schapire, the key insight behind boosting is that a weak classifier (one that can perform slightly better than random) can be boosted over the course of successive rounds to become a stronger classifier.
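The reweighting idea can be sketched with the AdaBoost algorithm, one well-known form of boosting, using one-split stumps as the weak classifier. This is an illustrative sketch under those assumptions, not a description of the PREDIGY implementation; the function names and the sample data are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """A one-split weak classifier: +polarity above the threshold, -polarity otherwise."""
    return polarity if x > threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) for the lowest-error stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    """Fit stumps in succession, raising the weight of misclassified instances."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified instances (y * prediction == -1) gain weight.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps; returns +1 or -1."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score > 0 else -1

# Labels are +1 / -1; no single stump separates this data perfectly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(xs, ys, rounds=3)
```

Each round's stump is only slightly better than chance on the reweighted data, but the weighted vote across rounds can fit patterns that no single stump can.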
Bootstrap samples: In ensemble modeling, the process of drawing a sample from a data population, recording its characteristics, returning the sample to the population, and then drawing subsequent samples without reference to any previous sample. This is also known as sampling with replacement.
Box plot: Also known as a box-and-whiskers chart, a box plot displays a graphical summary of the value distribution for a variable, based on quartiles. In a box plot:
Campaign: A marketing term referring to a series of activities related to a single theme or idea used for the purpose of effectively targeting select audiences to build deeper relationships, generate revenue, or create higher levels of awareness.
Candidate variable: A variable that PREDIGY considers in building a model. You select candidate variables from a list of numeric and grouping variables in the data set.
Categorization: The automatic distribution of data into classes or categories.
CHAID: (CHi-squared Automatic Interaction Detector) A tree-based predictive modeling technique that chooses its splits using chi-squared tests. The results generated by this technique are visually very interesting, but it does not produce a scoring equation and requires very large data sets to get marginally reliable results. Although PREDIGY does use trees for IR Strategy design, it does not use CHAID algorithms to build models or strategies.
Channel: In sales and marketing, the alliance or network of suppliers through which a company's products and services flow to the customer; also referred to as a "sales channel" or a "distribution channel."
Characterization: To characterize a data set is to explore, customize, or annotate its variables.
Charge off: To charge off an account is to not work that account, on the assumption that it is not economically feasible to do so.
Churn rate: As generally applied to analytics, the churn rate is a measure of the number of accounts or records moving into or out of a collection over a specific period of time. As applied to a customer base, churn rate is the proportion of customers who leave a vendor, supplier or service during a given time period. Business decision makers use the churn rate as an indicator of customer dissatisfaction issues, losses due to better competitive sales and marketing offers, or reasons having to do with the customer life cycle. Also see Attrition.
Classifier: A criterion used to sort the data in a data set. Some predictive analytics algorithms rely on harnessing the power of large numbers of classifiers.
Clustering: The partitioning of a data set into subsets (clusters), such that the records in each cluster share common traits. See segmentation.
Cluster model: A structure that analyzes data to determine criteria for clustering, and then segments data by assigning records to clusters. The algorithm PREDIGY primarily uses to create clusters is PAM (Partitioning Around Medoids). See medoid.
Compliance list: A text file containing terms that are not to be considered in building a model.
Contingency matrix: A graphical representation of the correlation between two numeric variables. A typical usage of contingency matrices is to assess the degree of similarity between two models. In the figure below, the Contingency Matrix shows the correlation between one variable (paid_6_months) predicting the likelihood of clients making a payment in the next six months, and another variable (paid_1_month) predicting the likelihood of clients making a payment in the next one month. As you might expect, the correlation turns out to be fairly high. If the rankings correlate well, the highest numbers in the matrix cluster in a diagonal from top left to bottom right.
Continuous model: A continuous model predicts the value of a continuous value target. Scores from a continuous model are predictions of the value of the continuous target.
Correlation matrix: A graphical representation of the correlation between two numeric variables. You can improve the efficiency of some types of models by identifying pairs of highly correlated variables and then removing one of the two variables from the model.
Cross sell: A vendor's strategy for selling other products to an existing customer, for the purpose of strengthening the customer relationship. Cross-selling is designed to increase a customer's reliance on a particular vendor and decrease the likelihood that the customer will switch to a competitor. The more different types of products or services a customer uses, the less likely that customer is to attrite. Cross selling should not be confused with up selling, which is selling additional products of higher value or profitability. See Up sell.
CSV File: A non-proprietary format for exchanging tabular data. CSV stands for "comma-separated value." You can create PREDIGY data sets from CSV files, or score operational data in CSV format with the PREDIGY Production Engine.
Data mining: The science of using automation to search and analyze large volumes of data with the objective of establishing relationships and identifying patterns.
Data extraction template: In IR Discover, an XML document that enables you to define and analyze known, arbitrarily named fields of data that appear consistently throughout a body of text.
Data omnivore: Someone with the ability to use all data sources, including numeric, text, temporal/sequential and relational data. Teaching computers to be data omnivorous is difficult, much as teaching computers to navigate in space is harder than teaching them to play chess. Tasks of perception and analysis have to be defined and then broken into steps. Data omnivores are better business decision makers because they utilize the broadest possible array of data sources. More data = more informed decisions.
Data set: A logically meaningful grouping or collection of similar or related data. In PREDIGY, a data set is a collection of records from which you create variables, samples, models, and strategies.
Decile: A decile is ten percent of the records in a sample or data set, ranked according to the value of one variable in the data set.
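A short sketch of decile assignment. It assumes that decile 1 holds the highest values of the ranking variable, which is a common convention but is not stated in this glossary; the function name is hypothetical.

```python
def assign_deciles(values):
    """Return a decile number (1-10) for each value; decile 1 holds the highest values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    deciles = [0] * len(values)
    for rank, index in enumerate(order):
        # Ranks 0..n-1 are split into ten equal-width bands.
        deciles[index] = rank * 10 // len(values) + 1
    return deciles

scores = list(range(100))
deciles = assign_deciles(scores)
```

With fewer than ten records, some deciles are necessarily empty.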
Decision tree: A decision analysis concept in which collections of rules are represented as branches in a diagram. In PREDIGY, decision trees are known as strategies.
Delta lift curve: A graphical representation of model results that compares the model score to a baseline numeric variable.
Dense vector: A reduced dimensionality vector space technique that PREDIGY uses to reduce the number of variables for the purpose of enhanced modeling, clustering, and similarity matching.
Derived variable: A variable created by the user by transforming one or more system variables.
Dimensionality: A measure of the complexity of the data set for a model. The more variables there are, the higher the dimensionality of a model.
ECOA (Equal Credit Opportunity Act): A federal law requiring lenders to make credit available without discriminating on the basis of race, color, religion, national origin, age, sex, marital status, or receipt of income from public assistance programs. Predictive models that evaluate data pertaining to matters covered by the ECOA must avoid using any data aggregations that cannot be "explained."
Ensemble modeling: Any predictive modeling technique that aggregates a number of submodels as a means of extracting signal from data.
Exotic data: Data that modelers have traditionally considered difficult or impossible to use. Some types of data traditionally considered exotic include text, temporal data, and latent variables.
Expression: A combination of numbers, operators, grouping symbols (such as brackets and parentheses) and identifiers (existing variables, functions, and samples) arranged in a meaningful way that can be interpreted. In PREDIGY, you build expressions that define new variables and formulas using the PREDIGY Expression Language.
Features: The structured and unstructured data elements that are contained in each record of a data set. PREDIGY uses the term variable instead of feature. Also see attributes.
Formula: An expression that you create and then attach to a strategy (decision tree) in PREDIGY. Formulas are useful for evaluating strategy performance and provide a means of aggregating variable values across records (as opposed to transforms, which aggregate values across variable values within a record). Formulas can contain standard mathematical, logical, statistical, and location functions and operators, and can also reference data set variables.
Grouping variables: Variables having a predefined, enumerated set of possible values, with no natural ordering. For example, a grouping variable named State might comprise the elements WA, CA, OR, and ID.
Harvest: To attach to a file or database for purposes of creating a data set.
Hosted software: Any software solution where the user interface is separate from the back end processing. Intelligent Results Lift applications are hosted analytics solutions.
Interrogation mode: An option, in the PREDIGY Production Engine, for discovering the contents of a PREDIGY AIP (application instruction package) file. Interrogation mode returns information in XML format. You can, in some cases, modify the XML and thereby modify the AIP file for a specific scoring operation.
IRM file: A comma-delimited data file prepared (with delimiters and headers) for harvesting into the PREDIGY Design Environment or scoring with the PREDIGY Production Engine.
JRE: Java Runtime Environment. Software from Sun Microsystems that supports Java programs on a computer. This is a prerequisite for all PREDIGY modules.
Key field variable: Every data set must have one and only one key field variable. This is the variable PREDIGY uses to identify and reference records. Every record in a data set must have a unique value for the key field variable.
Key predictor: A variable determined to contribute to the predictive power of a model. PREDIGY distinguishes structured predictors (variables) from unstructured predictors (text elements).
Latent variables: Variables that can be derived from the structured variables (or other data) in a data set.
Leaf: An end point in a decision tree (strategy).
Lift: An expression of the predictive power of a model. In a lift curve, the further the line for the model "lifts" away from the diagonal random assessment line, the better the model is at finding positive values quickly—that is, the more predictive it is.
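The points of a lift curve can be computed as in the sketch below, which assumes a binary target coded as 1 (positive) and 0 (negative) with at least one positive record. The function name and data are hypothetical.

```python
def lift_curve(scores, actuals):
    """Return (fraction of records reviewed, fraction of positives found) pairs,
    reviewing records from the highest score to the lowest."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(actual for _, actual in ranked)
    curve, found = [], 0
    for count, (_, actual) in enumerate(ranked, start=1):
        found += actual
        curve.append((count / len(ranked), found / total_positives))
    return curve

curve = lift_curve([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

A random ordering would on average find positives in proportion to the records reviewed, which is the diagonal line; here the model finds every positive after reviewing half of the records.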
LIFT application: A family of analytic solutions, created with the PREDIGY platform and designed to significantly improve collections and marketing performance for financial institutions and receivables management companies.
Linear regression: A statistical method for estimating the expected value of one variable given the value of some other variable or set of variables. Linear regression can also be defined as the process of fitting the best possible straight line through a series of points.
Listener: In the PREDIGY Production Engine, a process that writes out results from a scoring operation. Listeners are defined in the listeners configuration file (listeners.xml).
Logistic regression: A variant of standard regression that can be used when the dependent variable is binary (for example, success vs. failure).
Loyalty Program: A commercial incentive program designed to increase customer loyalty, purchase volume and frequency. Tracking of program usage provides the vendor an opportunity to collect marketing data on frequent users of a product or service. Loyalty programs will often use personalized loyalty cards, club cards, reward cards, point cards and membership programs as pivotal elements in the customer relationship. Loyalty programs focus on status, defined incentive benefits and heightened levels of service—such as customized statements—to build loyalty. While discounting or reward point redemption opportunities are integral parts of many loyalty programs, status, service and merchandise upgrades tend to deliver greater long-term loyalty.
Medoid: In a cluster model, a medoid is an instance (record) that best represents a cluster. When you build a cluster model, PREDIGY first identifies a medoid for each cluster from the training sample. It then assigns each subsequent instance to the medoid it most closely resembles. In creating a cluster model, you specify the desired number of clusters, a training sample, and the Contributing variables the algorithm is to base the clusters on.
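The two ideas in this entry, choosing the most representative record and assigning records to the nearest medoid, can be sketched as follows. Euclidean distance on numeric tuples is an assumption, the function names are hypothetical, and this is not the full PAM algorithm, which also searches for the best set of medoids.

```python
def distance(a, b):
    """Euclidean distance between two numeric records."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def medoid(records):
    """The record with the smallest total distance to all other records."""
    return min(records, key=lambda c: sum(distance(c, other) for other in records))

def assign(record, medoids):
    """Index of the medoid that the record most closely resembles."""
    return min(range(len(medoids)), key=lambda i: distance(record, medoids[i]))

center = medoid([(0, 0), (1, 0), (10, 0)])
cluster_index = assign((9, 9), [(0, 0), (10, 10)])
```

Unlike a mean, a medoid is always an actual record from the data, which keeps clusters easy to interpret.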
Merger: A PREDIGY utility for combining structured and unstructured data together into a single file for harvest. Use the resulting data set to create mixed-data models.
Mixed-data modeling: The use of both structured and unstructured data in a model. Mixed-data models often have high predictive ability: higher than models created exclusively from structured data, higher than models created exclusively from unstructured data, and higher than models created by stacking an existing structured data model onto an existing unstructured data model. Mixed-data models can also include additional types of exotic data, such as time series data.
Model: A statistically based formula for drawing a conclusion about a particular outcome based on the available data. Frequently, that conclusion is a prediction about future behavior.
Model stacking: The process of using the result (score) from one model as a candidate value in a subsequent model.
N-grams: Sets of words that frequently occur together and may have particular relevance when assessed together as a phrase. PREDIGY discovers and works with n-grams of up to three terms. For example, an n-gram "onetime payment" may produce more predictive signal than the words "onetime" and "payment" on their own.
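A simple way to surface frequently co-occurring terms is to count adjacent two-term and three-term sequences, as sketched below. Counting only adjacent terms and the function name count_ngrams are assumptions for this example; the glossary does not describe how PREDIGY discovers n-grams.

```python
from collections import Counter

def count_ngrams(tokens, max_n=3):
    """Count adjacent sequences of 2 to max_n terms."""
    counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

note = "customer promised onetime payment friday onetime payment"
counts = count_ngrams(note.split())
```

Sequences that recur often, such as "onetime payment" here, are candidates to be treated as a single phrase.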
Non-linear: A non-linear model is one where relationships between variables and the outcome are not linear—that is, the process of predicting the value of the outcome can’t be represented by a linear equation. In PREDIGY, the most powerful and innovative modeling strategies are non-linear.
NPA (non performing assets): Assets such as mortgages are deemed delinquent when their payment obligations have not been met. They are classified as non-performing after such obligations have not been met for a specified period of time.
Normalization: A method for mathematically scaling a set of data to a particular value. Normalization is used in modeling to generate more linear sets of data elements, particularly where the base data has a wide range of outliers.
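One common form of normalization is min-max scaling, which maps values onto the range 0 to 1. The sketch below is offered as an example of the general idea; whether PREDIGY uses this particular method is not stated here.

```python
def min_max(values):
    """Scale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    low, high = min(values), max(values)
    if high == low:  # constant data: nothing to scale
        return [0.0] * len(values)
    return [(v - low) / (high - low) for v in values]

scaled = min_max([10, 20, 30])
```

Because min-max scaling is sensitive to extreme values, data with wide-ranging outliers is often capped or log-transformed first.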
Onboarding: The tactics used to get a new customer using a product or service and engaging in revenue generating activities. One of the most prevalent ways to attract new customers in the financial services market is through offers of free products or services for defined time periods. Once a customer has accepted a free checking account, or a credit card with no interest for the first six months, the bank must quickly onboard the customer by getting them to actually use the account and engage in profit generating transactions, such as using a debit card, attaching a savings account or making additional purchases on the charge card—not just transferring balances without interest for 6 months.
Optimization: In analytics, finding the maximum or minimum value of some objective function, often in the presence of constraints.
Origination: In marketing, origination is the "first use" of a product or service, referring to the origination of transaction revenue in response to a campaign or product offer. A loan origination is the process of a lender obtaining a new loan for a borrower. Depending on credit worthiness, the borrower may qualify for a loan from a lender. Borrower rights in the US are covered under the ECOA. Predictive models that evaluate data pertaining to matters covered by the ECOA must avoid using any data aggregations that cannot be "explained". See also ECOA.
Outlier: A data point so far beyond the normal distribution that its inclusion in a statistical model might impair the accuracy of that model.
Overfit: A model that performs well against training data but not as well against validation samples and operational data is said to be overfit. Overfitting occurs when the model mistakes noise in the training data for signal. Models lacking sufficient data, or models that over-process available data, are prone to overfitting.
Overview map: In IR Discover, the overview map is a tool for graphically representing the concentrations and relationships of different types of terminology in a document collection.
PAM (Partitioning Around Medoids): The primary algorithm used by PREDIGY to create cluster models.
Partial dependence plot: A graphing option for visualizing and comparing values for two variables. In PREDIGY, partial dependence plots show a histogram for one variable along the x-axis, with a red line indicating the average value for the second variable relative to the current value for the first variable. A green line shows the mean value for the second variable. The plot below shows one variable, balance, against a second variable, last_payment_amount, and demonstrates a consistent relationship between the values of these variables.
Pass-through variable: A variable explicitly designed for inclusion in a model or strategy exported from PREDIGY, regardless of whether that variable is required by the model or strategy to score records.
Pivot table: In IR Discover, a tool for viewing the interpenetration of two or more dimensions in a document collection.
Plurality Vote: In ensemble modeling techniques, the process of selecting a value by comparing the results of submodels and finding the mode (the most common value).
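Finding the mode of the submodel results takes only a few lines. The function name and the sample predictions are hypothetical.

```python
from collections import Counter

def plurality_vote(predictions):
    """Return the most common prediction; ties go to the value seen first."""
    return Counter(predictions).most_common(1)[0][0]

winner = plurality_vote(["pay", "no_pay", "pay", "pay", "no_pay"])
```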
Pooled model: A model based on data from multiple sources—for example, from different companies in the same line of business. Pooled models have the advantage of being based on a wider foundation of data; they have the disadvantage that variable definitions may not map precisely from one organization to another.
Portfolio Optimization: In financial services, an analytical approach that evaluates expected return relative to risk. Also known as mean-variance (MV) optimization. Optimization at the portfolio level refers to optimal utilization, loyalty and profitability rates for programs designed to stimulate portfolio growth and value.
PREDIGY: An integrated customer analytics and decision management platform from Intelligent Results. PREDIGY comprises five modules:
Queue: Any group of items, such as computer jobs or messages, ordered for processing. In PREDIGY, you can use the Processing Queue to order a set of tasks (model builds, data set loads) for processing.
Raw data: Data that has not been aggregated, transformed, or prepared for modeling. The signal present in raw data can be lost when the data is prepared. Models based exclusively on raw data can be additive to models created with prepared data.
Reason codes: Information identifying the predictors making the most significant contribution to a record's score. Reason codes are produced by scorecard models.
Recovery rate: The amount a creditor would receive in final satisfaction of a claim on a defaulted debt. Sometimes calculated on the absolute value of the debt but more commonly calculated on the fair market value.
Reporting: A readable summarization of data, which can include tabular and/or graphical representations. Reports can be valuable for communicating the significance of statistical trends. The IR Report module in PREDIGY manages reports created with Crystal Reports™, a Business Intelligence application from Business Objects, Inc.
Residual: The difference between an actual value of a quantity and its estimated value. In predictive modeling, the residual is the difference between the actual value and the value predicted by a particular classifier.
Retention: The degree to which a supplier is able to keep or retain customers. See attrition and churn rate.
Roll rate: The chronological progression of an obligation through various delinquency states. Roll rates are a useful method for segmenting delinquent obligations into various categories of risk. Delinquency statuses are sometimes called cycles—for example, Cycle 1: 0-29 days, Cycle 2: 30-59 days, and so forth.
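A sketch of a roll rate calculation, assuming 30-day cycles as in the example above and a list of (cycle last month, cycle this month) pairs, one per account. The function names and data are hypothetical.

```python
from collections import Counter

def cycle_for_days(days_delinquent):
    """Cycle 1 covers 0-29 days, Cycle 2 covers 30-59 days, and so forth."""
    return days_delinquent // 30 + 1

def roll_rates(transitions):
    """Share of accounts starting in each cycle that moved to each other cycle."""
    starts = Counter(src for src, _ in transitions)
    moves = Counter(transitions)
    return {pair: count / starts[pair[0]] for pair, count in moves.items()}

# Three accounts started in Cycle 1; two of them rolled forward to Cycle 2.
rates = roll_rates([(1, 2), (1, 1), (1, 2), (2, 3)])
```

The rate for (1, 2) is the fraction of Cycle 1 accounts that became one cycle more delinquent during the period.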
Sample: A subset of records in a data set, selected randomly or according to instructions. In PREDIGY you can create and work with the following types of samples:
Sampling with replacement: In ensemble modeling, the process of drawing a sample from a data population, recording characteristics, returning the sample to the population, and then drawing subsequent samples without reference to any previous sample. Also known as bootstrap sampling.
Score: On the design side of predictive analytics, a score is a value assigned to each record in the training sample during a model build, or a strategy code assigned to a record in a strategy. On the operational side (for example, in the IR Production Engine), the score is the value assigned to a record on the basis of a deployed model or strategy.
Scoreband: Narrowing of a range of potential values to facilitate analysis and categorization. The report loader in the PREDIGY IR Report module can sort outputs into scorebands.
Scorecard model: A model that can identify the predictors making the greatest contribution to the score for any record. This information, sometimes described as "reason codes," is available for records scored with the Production Engine.
Segment name: A name that identifies a leaf node in PREDIGY. Each node must have a unique segment name.
Segmentation: In analytics, the division of a data set into two or more subsets according to a criterion. In marketing, segmentation is the process of grouping customers or prospects into smaller subgroups based on population characteristics or established criteria.
Self cure: Accounts that will eventually meet their obligations without additional intervention from a creditor.
Sentiment: A semantic criterion for categorizing text according to mood or emotional intent.
Service: A Windows application that starts when Windows starts and runs in the background for as long as Windows is running, similar to a UNIX daemon. PREDIGY and IR Discover typically run as services under Windows.
Signal: The latent predictive information in data that predictive modeling attempts to find.
Silhouette plot: A graphical representation of the effectiveness of a cluster model. The silhouette for a good cluster extends further to the right and has a more vertical profile. This indicates that relatively more of the records in the cluster are closer to other records in the cluster, as opposed to records in other clusters.
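The values behind such a plot are usually silhouette coefficients. The sketch below assumes the standard definition, (b - a) / max(a, b), where a is a record's mean distance to the other records in its own cluster and b is its mean distance to the nearest other cluster. Whether PREDIGY computes its plot in exactly this way is an assumption, and the function names are hypothetical.

```python
from statistics import mean

def euclidean(a, b):
    """Euclidean distance between two numeric records."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def silhouette_values(clusters):
    """Return one list of silhouette values per cluster (needs at least two clusters)."""
    result = []
    for ci, cluster in enumerate(clusters):
        values = []
        for ri, record in enumerate(cluster):
            peers = [p for pi, p in enumerate(cluster) if pi != ri]
            if not peers:  # a single-record cluster is given a silhouette of 0
                values.append(0.0)
                continue
            a = mean(euclidean(record, p) for p in peers)
            b = min(
                mean(euclidean(record, p) for p in other)
                for oi, other in enumerate(clusters)
                if oi != ci
            )
            values.append((b - a) / max(a, b))
        result.append(values)
    return result

clusters = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
values = silhouette_values(clusters)
```

Values near 1 correspond to bars that extend far to the right; values near 0 or below indicate records that sit between clusters or in the wrong one.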

Simulation data: Data from a specified data set or sample that shows how records flow through a decision tree in PREDIGY—that is, which records flow through which branch nodes to which leaf nodes. Simulation data is also used by formulas, allowing the strategy developer to evaluate, forecast, or simulate the key metrics.
Split point: The value used to sort records at a branch node in a decision tree (strategy).
Stacking: See model stacking.
Stop word; stop list: A word or list of words to be ignored by the modeling algorithm. Also known as a suppression list.
Strategy: In PREDIGY, a decision or series of decisions made to segment a set of records. A strategy can also be described as a rule-based framework for decision management—that is, a decision tree.
Strategy code: In PREDIGY, identifies a course of action associated with a strategy. As records are scored against a strategy in the IR Production Engine, records assigned to a particular node are associated with the node's strategy code. More than one node can have the same strategy code. Strategy codes are a type of score; the actions associated with a strategy code are sometimes described as treatments.
Structured data: Organized data, typically extracted from a database. Typical field names for structured data are "Balance" and "Date last payment."
Stumps: "Boosted stumps" is a model-build option that uses boosting with a very shallow tree depth (one branch). See boosting.
Target variable: When you create a predictive model you use all sorts of information that would be available in a real-time setting, plus at least one item of information (captured in a variable) that contains "future knowledge." It is by comparing the real-time data with the future data (that is, the target variable) that PREDIGY discovers the predictive power of the real-time data. Strictly speaking, the missing data need not be "future" data, though that is the typical case. For example, you could analyze training data about a set of people that includes information about the type of car they own. If you identify the car type as the target variable, you can create a predictive model that could analyze operational data that did not include information about car type, for the purposes of predicting that value.
Temporal data: Data that has a temporal component. This can be a series of data recorded at regularly occurring intervals, known as snapshot data, or a series of data occurring unpredictably (made withdrawal, closed account, etc.). The latter is known as "transactional data." Temporal data is a potent source of signal, but one that is difficult to extract. PREDIGY can process temporal data after it has been prepared with the Transformer utility.
Term: A word or abbreviation in text data; the smallest unit of text data considered by a model.
Text mining: The automated extraction of patterns from text data.
Tomcat: The Apache Tomcat Web Server is a prerequisite for web-based components in PREDIGY and IR Discover.
Transactional data: A record of interactions with a customer occurring over time; a type of temporal data. PREDIGY can process transactional data after it has been prepared with the Transformer utility.
Transform: To modify data to make it usable (or more usable) to a modeling algorithm. PREDIGY provides a wide variety of transformation options for converting data types that cannot be used in building models (date, string) into types that can be used (numeric, grouping).
Transformer: A PREDIGY utility for processing temporal data for loading (harvesting) in the PREDIGY Design Environment, and for scoring with the PREDIGY Production Engine. The Transformer works by converting multiple records with the same key variable (assumed to be snapshots or transactions relating to the same account over time) into a single record with new variables that capture the chronological information.
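The general idea of collapsing several snapshot records into one record per key can be sketched as follows. The field names (account_id, month, balance) and the derived variables are hypothetical and do not represent the Transformer's actual inputs or outputs.

```python
from collections import defaultdict

def flatten_snapshots(rows):
    """Collapse snapshot rows that share a key into one record per key."""
    by_account = defaultdict(list)
    for row in rows:
        by_account[row["account_id"]].append(row)
    flattened = []
    for account_id, snapshots in by_account.items():
        snapshots.sort(key=lambda r: r["month"])  # chronological order
        balances = [s["balance"] for s in snapshots]
        flattened.append({
            "account_id": account_id,
            "snapshot_count": len(balances),
            "first_balance": balances[0],
            "last_balance": balances[-1],
            "balance_change": balances[-1] - balances[0],
        })
    return flattened

rows = [
    {"account_id": "A1", "month": "2006-02", "balance": 400},
    {"account_id": "A1", "month": "2006-01", "balance": 500},
    {"account_id": "B7", "month": "2006-01", "balance": 250},
]
records = flatten_snapshots(rows)
```

Each output record has a unique key, so the result can serve as an ordinary data set with one row per account.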
Treatment: A course of action determined by a strategy or decision-tree. See strategy code.
Twigs: "Boosted twigs" is a model building option that uses boosting with a shallow tree depth (for example, two levels deep). See boosting.
Unstructured data: Text data (call center notes, customer verbatim, e-mail correspondence, and so forth).
Up sell: To market and sell more profitable services or products. Up-selling can also be simply exposing customers to options that they may not have been aware of. See Cross sell.
Validation: Records in the data set that are held back and then used to validate the model—that is, to verify that the model is not overfitting to the training sample used to build the model. See overfit.
Variable: The structured and unstructured attributes contained in each record of a data set.
Variance scatterplot: A graph for visualizing the variability in a set of bootstrapped samples at a specified point on a lift curve.
Web server: An internet server that stores HTML documents, which can be retrieved using a web browser. PREDIGY software uses the Apache Tomcat web server.
Web service: A software component that is described in a WSDL (web service description language) and is capable of being accessed via standard network protocols. The Intelligent Results Production Engine Web Service (IRPEWS) supports requests from client applications to score records against the IR Production Engine.
Weight variable: An option for giving specific variables more consideration during modeling.