In this article, you will learn about the different attribution models and their advantages and disadvantages. A special focus is placed on machine learning, which enables media mix modeling (MMM) with a high model quality.
In recent years, media consulting and planning have seen a growing number of requests not only to measure the impact of media campaigns properly but also to predict and plan it efficiently using data-driven solutions. The first step is to define the constraints: how can different media channels be combined as sensibly as possible within a given budget? Or, alternatively, what goal can the campaign achieve in terms of key brand figures or sales, and what budget is necessary to achieve it? Answering these questions requires attribution models that show the contribution of each media channel to the results.
MMM can provide answers to these questions. Here, statistical methods are used to determine the contribution of each media channel to a defined target variable.
In an MMM, aggregated data is used to provide a more generic perspective. The different media channels are considered together and external factors are also taken into account (such as price and promotion campaigns, competitive behaviour or the overall economic situation). This has the advantage that a media channel is not optimised on its own and, for example, no user-related data is needed, but only data on the media performance per channel and the respective target variable over time. However, optimising the media budget so that the campaign result is maximised through the interaction of all channels is a complex multidimensional problem in which aspects such as non-linearity and saturation, time-delayed impact, and interaction effects must be taken into account.
Linear models have long been tried and tested in the implementation of MMM. They achieve good results and are widely used because they are easy to explain, transparent, and can be very precise. However, their limitations are also well known. Conceptually, linear models often represent the non-linear and time-lagged relationships between media and the target variable, as well as interactions between media channels, poorly. They can also be very time-consuming when many iterations are needed to calibrate the model properly. Furthermore, linear models essentially entail a reduction in complexity: for example, digital channels often have to be combined because their individual impact cannot be isolated, even though digital channels in particular are becoming increasingly important in the media mix.
Marginal utility curves show the non-linear relationship between media performance and target variable as well as saturation level and optimum as the point of most efficient media use.
The central results of an MMM are the marginal utility curves per channel. These typically follow a logarithmic relationship: with too little media budget, potential is left unexploited and too little impact is achieved, whereas with too much media budget the campaign is no longer profitable (saturation effect). In between lies the optimum, where the increase in impact is highest and the media spend is therefore used most efficiently. These marginal utility curves are the basis for deciding how to distribute the budget across the different channels as efficiently as possible.
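The link between marginal utility curves and budget distribution can be illustrated with a minimal sketch. It assumes a logarithmic response function per channel; the function names, the channel names, and all parameter values are hypothetical and chosen only for illustration. The budget is spent in small steps, each time on the channel whose curve currently promises the highest additional impact.

```python
import math

def response(spend, scale, shape):
    """Assumed logarithmic response: impact grows with spend but flattens out."""
    return scale * math.log1p(spend / shape)

def marginal_response(spend, scale, shape):
    """Derivative of response(): additional impact per additional unit of spend."""
    return scale / (shape + spend)

def allocate_greedy(channels, budget, step):
    """Spend the budget in steps, always on the channel with the highest marginal response."""
    spend = {name: 0.0 for name in channels}
    remaining = budget
    while remaining >= step:
        best = max(channels, key=lambda n: marginal_response(spend[n], *channels[n]))
        spend[best] += step
        remaining -= step
    return spend

# Hypothetical (scale, shape) parameters per channel, not taken from a real model.
channels = {"tv": (120.0, 50.0), "online_video": (60.0, 20.0), "search": (30.0, 5.0)}
allocation = allocate_greedy(channels, budget=100.0, step=1.0)

for name, amount in allocation.items():
    impact = response(amount, *channels[name])
    slope = marginal_response(amount, *channels[name])
    print(f"{name}: spend={amount:.0f}, impact={impact:.1f}, marginal={slope:.2f}")
```

Because the assumed curves are concave, this step-by-step allocation ends where the marginal responses of all funded channels are roughly equal, which is the condition for an efficient split of a fixed budget.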
Advertising does not necessarily have an immediate effect on the business variables but may continue to have an influence for weeks after the campaign. Brand awareness must first build up among end customers through repeated contacts, or the buying decision may already have been made while the actual purchase happens at a later stage. The model must take this into account: the adstock effect, i.e. the carry-over of advertising impact into later periods, must be determined for each media channel so that it can also be included in the media planning and in the forecasts based on the model.
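One common way to express such a carry-over is a geometric adstock transformation, in which a fixed share of the accumulated advertising pressure is carried into the next period. The sketch below assumes this geometric form; the function name, the decay rate of 0.5, and the weekly spend figures are illustrative assumptions, not values from a fitted model.

```python
def geometric_adstock(spend, decay):
    """Carry a share `decay` of the previous period's adstock into the current period."""
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must be in [0, 1)")
    adstocked = []
    carryover = 0.0
    for x in spend:
        carryover = x + decay * carryover  # current spend plus decayed past pressure
        adstocked.append(carryover)
    return adstocked

# Hypothetical weekly spend for one channel.
weekly_spend = [100.0, 0.0, 0.0, 50.0, 0.0]
print(geometric_adstock(weekly_spend, decay=0.5))
# -> [100.0, 50.0, 25.0, 62.5, 31.25]
```

The weeks without spend still show advertising pressure, which is what allows a model to attribute later sales to an earlier campaign. In practice the decay rate would be estimated per channel rather than set by hand.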
In an MMM, the joint effect of all media channels used should be considered. In a campaign, the different channels are usually planned together and often used at the same time, so end customers come into contact with the brand via various touchpoints. The mutual reinforcement of the channels must therefore be taken into account, for example when digital activity supports the effect of the TV campaign. Depicting these interaction effects correctly can be problematic in linear models when the media channels were used simultaneously, because their individual contributions are then hard to separate. It is also often difficult to correctly identify the effect of underrepresented channels with low budgets.
These complex interrelationships between the variables can, in some cases, be captured better with newer machine learning (ML) models. ML refers to artificial systems or algorithms that build complex models from sample data and, so to speak, generate knowledge by recognizing patterns and correlations that can then be generalized and applied to new, unknown data.¹ ML is a subfield of artificial intelligence (AI), although in this context it is more accurate to speak of a rationally thinking and acting system, which must be distinguished from the conscious, intelligent (and not always rational) decisions of a human being. At present, it is mainly highly specialized AI applications, such as image recognition or speech output, that are actually established in practice.²
The advantage of more complex ML algorithms is that interaction effects, non-linear relationships, and the time-delayed effect of media are taken into account automatically. With linear models, by contrast, it can be very time-consuming to make assumptions about the non-linear functional form of the variables, to define the adstock effect or lag variables for the time-delayed effect, and to test various models with different combinations of variables. Relevant variables have to be identified and redundant ones excluded in order to avoid over-specification on the one hand and to achieve the best possible goodness of fit on the other. If too many characteristics are included in a linear model, the contribution of individual characteristics can no longer be estimated reliably, yet all important characteristics must be included to explain the target variable sufficiently.
Illustration of model fit: the model is fitted too closely to the data (overfitting), the model is not sufficiently fitted to the data (underfitting), and the model has a good fit.
In contrast, algorithms based on decision trees, for example, can be very helpful in the creation of an MMM, as the inclusion of variables is checked by the algorithm and the probability of over-specification is reduced. These methods can show both the individual and joint effects of the included characteristics. Thus, the non-linear correlation of the media channels with the target variable can be realistically represented as well as interaction effects between the included characteristics, for example, the media effect in combination with a simultaneous price campaign or the different effects of media activity over time for seasonally driven products.
In addition, decision trees have proven their worth because of their good model quality. In the modeling process, the data are divided into training and test data sets. The training data are used to build the model. The test data are used to check the forecast and show how robust the model is, i.e. how accurately the forecast works on unknown data. In particular, ensemble algorithms that combine many decision trees are characterized by their high accuracy (e.g. random forest, which averages many independently built trees, or gradient boosted trees, which build trees sequentially). Boosting methods in particular generally lead to very good results. Here, each new tree is trained on the errors of the trees built so far, so that with each round the previously incorrectly forecast values lead to an improvement of the model.
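The boosting idea and the training/test split can be sketched in a few lines. The example below is a deliberately simplified illustration, not a production algorithm: it boosts one-split trees (stumps) on a single synthetic spend variable, and the data, function names, number of rounds, and learning rate are all assumptions made for this sketch. Real projects would typically rely on an established library instead.

```python
import math
import random

def fit_stump(xs, residuals):
    """Find the single split on x that minimises the squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # exclude the maximum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if x <= t else r for t, l, r in stumps)

def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Each round fits a new stump to the errors left by the stumps built so far."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - predict((base, learning_rate, stumps), x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return (base, learning_rate, stumps)

def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data: sales follow a saturating curve of spend plus noise (illustrative only).
rng = random.Random(0)
spend = [rng.uniform(0.0, 100.0) for _ in range(120)]
sales = [200.0 + 40.0 * math.log1p(x / 10.0) + rng.gauss(0.0, 2.0) for x in spend]

# Hold out the last 20 % of observations as test data.
train_x, test_x = spend[:96], spend[96:]
train_y, test_y = sales[:96], sales[96:]

model = fit_boosted_stumps(train_x, train_y)
baseline = (sum(train_y) / len(train_y), 0.0, [])  # always predicts the training mean

print("test MSE, boosted stumps:", round(mse(model, test_x, test_y), 2))
print("test MSE, mean baseline: ", round(mse(baseline, test_x, test_y), 2))
```

Comparing the error on the held-out test data with a naive baseline shows whether the model has learned a relationship that generalizes, rather than merely memorizing the training data.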
The tree diagram shows a simplified example of the structure of a decision tree with sales as the target variable.
While ML algorithms were considered a black box in the past due to their complexity, they can now be explained as well. One criticism used to be that the models are precise, but that it is not apparent how exactly the included characteristics contribute to the forecast. Approaches such as Shapley values or LIME (Local Interpretable Model-agnostic Explanations) now make it possible to interpret the models well and to show which input is decisive for the forecast result. With their help, the contribution of the included characteristics of the different media channels can be determined and visualized, and, for example, the marginal utility curves of the media channels can be derived. Furthermore, the optimization problem of budget distribution can be solved on the basis of a finished model with additional tools (for example gradient-based optimization using an automatic differentiation library such as Autograd). In this way, not only can scenarios be created for media planning, but the optimal or most efficient budget allocation to the media channels can be found automatically.
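How Shapley values split a forecast among the inputs can be shown on a toy example. The sketch below computes exact Shapley values by averaging each channel's marginal contribution over all possible orderings; the model function, its coefficients, and the spend values are hypothetical, and "absent" channels are represented by an assumed baseline of zero spend. Practical tools approximate this calculation, since the exact enumeration becomes infeasible with many inputs.

```python
from itertools import permutations

def toy_sales_model(spend):
    """Hypothetical fitted model: main effects plus a TV x online interaction."""
    tv, online, radio = spend["tv"], spend["online"], spend["radio"]
    return 100.0 + 2.0 * tv + 3.0 * online + 1.0 * radio + 0.5 * tv * online

def shapley_values(model, actual, baseline):
    """Average each input's marginal contribution over all orderings of the inputs."""
    names = list(actual)
    contributions = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)  # start with every channel switched off
        previous = model(current)
        for name in ordering:
            current[name] = actual[name]  # switch this channel on
            value = model(current)
            contributions[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in contributions.items()}

actual = {"tv": 10.0, "online": 4.0, "radio": 5.0}
baseline = {"tv": 0.0, "online": 0.0, "radio": 0.0}
phi = shapley_values(toy_sales_model, actual, baseline)
print(phi)
# -> {'tv': 30.0, 'online': 22.0, 'radio': 5.0}
```

The contributions add up to the difference between the forecast with all channels active and the baseline forecast (57 in this toy case), and the joint TV and online effect is shared equally between the two channels involved.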
The MMT Suite enables simpler, more precise, and faster decisions in media planning. More complex ML algorithms provide the basis for automating the modeling process and thus scalability for many brands, products, or even regions, as statistically robust models can be created time-efficiently.
Which method is ultimately used always depends on the respective application and the quality of the available data. The principle of "as simple as possible, as complicated as necessary" applies to the choice and calibration of models. Linear models therefore still have their place in practice: they can generate initial findings where appropriate, and more complex algorithms can be used in further stages of development. For the creation process, it is important to understand how the different methods perform. In order to select the appropriate algorithm, it is always necessary to assess how well the procedure works for the use case.
Other examples of the use of machine learning in MMT Scope apart from MMM can be found in Test&Scale, MTA, and TV Performance.
¹ Fraunhofer-Gesellschaft (2018): Maschinelles Lernen - Kompetenzen, Anwendungen und Forschungsbedarf, p. 9.