Predictive analysis is a branch of data analytics that uses current and historical user data to make predictions about future events. In practice, it consists of building a model that best fits the characteristics of a specific dataset and can predict future events, explain past ones, and group or correlate data records within the dataset.
Although predictive analysis and data modelling have been around for a long time, the rise of big data technology and cloud computing, along with low-cost data processing, has made them viable for almost all organizations. According to recent studies, the global predictive analysis market is expected to reach 11 billion USD by 2022, with an average annual growth rate of around 21%.
Currently, most companies apply predictive analysis to several cross-industry functions and processes, such as credit checks, faulty-material prediction, sales and marketing, and delivery optimization.
Marketing departments use predictive analysis methods through automated marketing and sales processes for:
A predictive analysis process, such as delivering personalized marketing services based on intent data, may consist of several sub-processes:
Input data for predictive analysis, i.e. user intent data in this example, may come from multiple sources. They may be:
Since raw data comes from different data sources, it should be consolidated and transformed after extraction to be usable.
Data analysis and cleansing consist of discovering possible anomalies, such as missing information or noise, consolidating and validating the data, and then storing it in a data warehouse. Depending on the model and required segmentation, patterns may be applied to discover similarities.
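As a rough illustration, the cleansing step might look like the following pandas sketch; the table, column names, and thresholds are invented for the example:

```python
import pandas as pd

# Hypothetical raw intent records; the columns are illustrative only.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "page_views":  [12, 12, None, 7, 9001],  # one missing value, one implausible outlier
    "channel":     ["web", "web", "email", "web", "email"],
})

# Remove exact duplicate records.
clean = raw.drop_duplicates().copy()

# Fill missing numeric values with the column median.
clean["page_views"] = clean["page_views"].fillna(clean["page_views"].median())

# Drop obvious noise: values far outside any plausible range.
clean = clean[clean["page_views"] < 1000]

print(clean)
```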
Depending on the outcome requested, previously performed statistical analysis, and the discovered patterns for a specific dataset, you can apply a prediction model to achieve the best prediction for the probability of an outcome. However, it is important to note that the accuracy and usability of the results will depend greatly on the level of data analysis and the quality of your assumptions.
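For instance, applying a model to score the probability of an outcome could look like this minimal scikit-learn sketch; the features (say, page views and email clicks) and labels are invented for the example:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented features (e.g. page views, email clicks) and a binary outcome
# (1 = the customer converted); real data would come from the warehouse.
X = [[12, 3], [2, 0], [8, 1], [1, 0], [15, 4], [3, 1], [9, 2], [0, 0]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns the probability of each outcome rather than a hard label.
print(model.predict_proba(X_test))
```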
The predictive model lifecycle consists of the following sub-processes:
Along with deploying and running the model, you should run a monitoring process in parallel to drive improvements. This lets you continuously evaluate the model for possible adjustments:
Predictive models are machine learning methods that may produce a prediction directly, imply a prediction through a score, or find patterns and potential data groupings and segmentations. These models come in two main categories:
Statistical regression models use regression analysis methods to find relationships between the values of one or more independent variables (the input) and one dependent variable (the output). Based on the relationships discovered, one can predict the output for any new input.
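A minimal sketch of this idea, fitting a straight line with NumPy; the advertising-spend and sales figures are invented for illustration:

```python
import numpy as np

# Invented data: advertising spend (independent variable) vs. sales (dependent variable).
spend = np.array([10, 20, 30, 40, 50], dtype=float)
sales = np.array([25, 44, 61, 83, 98], dtype=float)

# Fit a degree-1 polynomial, i.e. a straight line: sales ~ slope * spend + intercept.
slope, intercept = np.polyfit(spend, sales, 1)

# Use the discovered relationship to predict the output for a new input.
new_spend = 35.0
print(slope * new_spend + intercept)
```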
Common regression model types include:
Classification models are similar to regression ones, but the dependent variables take discrete values instead of continuous ones.
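For example, a decision tree classifier in scikit-learn predicts a discrete label rather than a continuous value; the features (say, support tickets and monthly usage) and labels below are invented:

```python
from sklearn.tree import DecisionTreeClassifier

# Invented features: [support tickets opened, monthly usage],
# with discrete labels instead of continuous outputs.
X = [[1, 200], [5, 40], [2, 180], [6, 30], [1, 220], [7, 25]]
y = ["stay", "churn", "stay", "churn", "stay", "churn"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# The prediction is a discrete class, e.g. ['churn'].
print(clf.predict([[4, 60]]))
```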
Clustering models find similarities between data based on common attributes and then group them into clusters.
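A minimal sketch with k-means, one common clustering method; the customer attributes are invented for illustration:

```python
from sklearn.cluster import KMeans

# Invented customer attributes: [orders per year, average order value].
customers = [[2, 20], [3, 25], [2, 22], [40, 300], [38, 280], [42, 310]]

# Group the records into two clusters based on attribute similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # cluster assignment per customer
print(kmeans.cluster_centers_)  # the center of each discovered group
```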
Implementing a predictive analysis model is often a difficult project, requiring detailed design and specifications, continuous commitment, and strong governance. It involves several functions within an organization, including information technology, marketing and sales, and others. Beyond the common challenges that come with any project, the challenges of deploying a predictive model relate mainly to the data and the model itself.
User data for predicting intent can come from various data sources:
While loading the above data into a data warehouse, errors such as inconsistent data, duplicates, logic conflicts, and missing data may appear. Data consolidation should remove this noise.
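As an illustration, simple consistency checks can surface duplicates and logic conflicts before the data reaches the warehouse; the staging table below is invented for the example:

```python
import pandas as pd

# Invented staging extract, checked before loading into the warehouse.
staging = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "signup_date": ["2020-01-05", "2020-02-10", "2020-02-10", "2021-03-01"],
    "last_order":  ["2020-06-01", "2020-05-01", "2020-05-01", "2020-12-01"],
})

# Duplicates: the same record extracted from two source systems.
duplicates = staging[staging.duplicated()]

# Logic conflict: a customer whose last order predates their signup.
signup = pd.to_datetime(staging["signup_date"])
last_order = pd.to_datetime(staging["last_order"])
conflicts = staging[last_order < signup]

print(duplicates)
print(conflicts)
```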
On the other hand, if the rules are too strict and the data gets over-cleansed, the result no longer corresponds to real data, making the training data inaccurate and the prediction model unreliable.
Often, user master data, such as customer records, appear several times in backend systems as distinct entries that actually point to the same customer. Outdated data should be marked as such and kept out of scope, since training a model on stale records may lead to wrong predictions.
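One rough way to flag such duplicate entries is fuzzy string matching on customer names, sketched here with Python's standard difflib; the records and threshold are invented and would need tuning in practice:

```python
from difflib import SequenceMatcher

# Invented customer master records from two backend systems.
records = ["Acme Corporation", "ACME Corp.", "Globex Industries", "Globex Inc"]

# Flag pairs whose names are similar enough to be the same customer.
threshold = 0.6  # illustrative value; tune per dataset
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        ratio = SequenceMatcher(None, records[i].lower(), records[j].lower()).ratio()
        if ratio >= threshold:
            print(f"possible duplicate: {records[i]!r} ~ {records[j]!r} ({ratio:.2f})")
```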
Among other factors, model reliability depends strongly on the size of the training dataset: a large training dataset is generally more reliable than a smaller one. A related fundamental machine learning challenge is the cold start, where the model is ineffective in the beginning due to a lack of data. You can find this in cases of a new product, a new customer group, etc.
From a statistical point of view, beyond a certain point, feeding the model with more data does not improve the accuracy of its predictions; it may only waste computational and time resources.
Social media contains valuable information regarding behavior and attitude, often more valuable than the attributes hidden in transactional data. But intent data coming from social media is unstructured and difficult to interpret.
Often, predictive models rely on intent data coming from surveys or submitted forms. But survey completion rates are low, and the customers who submit such information are usually the loyal ones. This may lead to non-representative training data and an unreliable prediction model.
Predicting the behavior of anonymous customers remains a challenge, since you do not have much information about this type of customer; it is usually restricted to geographic location data and site browsing history. A lot of effort has gone into building unsupervised techniques for predicting what a customer group will like, but it still remains a big challenge.
Model type selection, along with the selection of input features (independent variables), is critical in building a predictive model.
To cover a wider range of cases, models may incorporate many input variables, creating a complex and expensive model. Such models are difficult to monitor and adjust, making them an inefficient way to predict outcomes. Instead of quantity, you should focus on the accurate selection of the input variables, as sketched below. This requires a deep understanding of the business domain, the business objectives at hand, and the data on which the model will be run.
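A minimal sketch of scoring and selecting input features with scikit-learn, on synthetic data where only the first feature actually drives the outcome:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic data: 100 samples with 6 candidate features; only feature 0
# actually drives the outcome, the rest are noise.
X = rng.normal(size=(100, 6))
y = (X[:, 0] > 0).astype(int)

# Score each feature against the outcome and keep the 2 most informative,
# instead of feeding every available variable into the model.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices of the kept features
print(X_selected.shape)
```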
There is a trend, and a principle, that the best solution is the simplest one: simple linear models are easy to understand and build, require fewer computational resources, and can be easily adjusted. But reality is often not linear.
Primarily due to a lack of domain knowledge, the model input may include the wrong data features (input variables) or features that contain too much noise, lowering prediction efficiency.
While evaluation methods and metrics are clear in supervised machine learning, this is not the case for unsupervised models, where metrics may not reflect the model outcome. For example, a good F1 score has value only if the evaluation data covers all the features desired in the model.
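To make the contrast concrete, here is a sketch with invented data: a supervised F1 score computed against known labels, next to a silhouette score, a common proxy metric for clustering that measures cluster cohesion rather than any business outcome:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score, silhouette_score

# Supervised case: F1 compares predictions against known true labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(f1_score(y_true, y_pred))

# Unsupervised case: no true labels exist, so proxy metrics such as the
# silhouette coefficient score cluster cohesion and separation instead;
# a high value does not guarantee the clusters are useful in practice.
points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(silhouette_score(points, labels))
```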
Predictive analytics has opened up a new realm of possibilities in many areas within an organization. But since predictive models rely on the chosen models and data solutions, they can present deployment problems and perform inefficiently if they are not optimized.