Combatting fraudulent transactions with Machine Learning

Financial crime has always been a major concern for financial institutions[1]. Combatting fraud requires an expensive apparatus of algorithms, technology and equipment that must be constantly updated. Moreover, when fraudulent activity such as money laundering goes undetected, it can lead to sanctions from regulators.

The current systems for the detection and prevention of criminal transactions are based on historical data analysis. Although analyzing the past is believed to provide a solid base for predicting the future, this methodology is incomplete: historical statistics cannot adequately capture technical transformations and settings that have not yet occurred[2]. It is therefore argued that, in order to detect future money laundering threats, “synthetic simulation” models are likely to be the best solution. This method generates artificial data sets which seek to reproduce the statistical characteristics of real-world data. It is likely to provide a more complete spectrum of information, bridging the gap between historical data and unknown future data[3].
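As a rough illustration of what “reproducing the statistical characteristics of real-world data” can mean, the following Python sketch fits a simple distribution to a handful of amounts and then draws artificial amounts from it. The sample amounts, the function names and the choice of a log-normal distribution are assumptions made for this example only; they are not taken from the cited papers, and a real simulator would model far more than a single column of amounts.

```python
import math
import random
import statistics

# Hypothetical "historical" transaction amounts; not taken from any real data set.
real_amounts = [12.5, 40.0, 7.2, 150.0, 33.3, 980.0, 61.0, 25.0]

def fit_lognormal(amounts):
    """Estimate the mean and standard deviation of the log-amounts."""
    logs = [math.log(a) for a in amounts]
    return statistics.fmean(logs), statistics.stdev(logs)

def generate_synthetic(mu, sigma, n, seed=0):
    """Draw n artificial amounts from the fitted log-normal distribution."""
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]

mu, sigma = fit_lognormal(real_amounts)
synthetic_amounts = generate_synthetic(mu, sigma, n=5000)

# The synthetic sample should show roughly the same log-mean as the original.
synthetic_mu = statistics.fmean(math.log(a) for a in synthetic_amounts)
print(f"real log-mean {mu:.2f}, synthetic log-mean {synthetic_mu:.2f}")
```

None of the generated amounts corresponds to a real transaction, yet the sample as a whole follows the shape of the original data.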

Furthermore, machine learning has the potential to radically enhance the performance of synthetic data simulation. There are two major issues which affect systems based on historical data: the incompleteness of the information used for simulating future transactions and the constraints imposed on its collection by privacy regulation. Once these are overcome by synthetic data simulation, greater flows of data of a higher predictive value for financial crime should become available.

The status quo

Today’s detection and prevention systems rely on historical data which banks collect over many years. From this data, banks derive numerous indices which are then used to forecast future scenarios and to monitor their own operations, so as to spot suspicious transactions. Nevertheless, a high number of transactions escape detection by remaining “hidden”: among the transactions which are flagged, the data collected can only help distinguish the so-called false positives from the true positives, the latter being the truly criminal money movements[4]. In other words, there remain false negatives, that is, transactions which appear to be “clean” only because there is no appropriate index that could flag them as potentially criminal.
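The terminology can be made concrete with a small Python sketch. The eight transactions below are invented for illustration, and the function name is hypothetical; the point is only to show how false negatives differ from the false and true positives that a flag-based system can see.

```python
from collections import Counter

def confusion_counts(actual, flagged):
    """Count outcomes for paired booleans: actually criminal vs. flagged by the system."""
    counts = Counter()
    for is_fraud, is_flagged in zip(actual, flagged):
        if is_flagged and is_fraud:
            counts["true_positive"] += 1
        elif is_flagged and not is_fraud:
            counts["false_positive"] += 1
        elif not is_flagged and is_fraud:
            counts["false_negative"] += 1  # criminal, but looks "clean"
        else:
            counts["true_negative"] += 1
    return counts

# Hypothetical toy data: eight transactions, three of which are criminal.
actual = [True, False, False, True, True, False, False, False]
flagged = [True, True, False, False, False, False, True, False]

counts = confusion_counts(actual, flagged)

# Share of criminal transactions that the system actually caught.
recall = counts["true_positive"] / (counts["true_positive"] + counts["false_negative"])
print(dict(counts), f"recall={recall:.2f}")
```

In this toy example two of the three criminal transactions are never flagged, so reviewing the flagged cases alone, however carefully, cannot reveal them.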

Moreover, it is insufficient to have in place a system which, in assessing the transactions, simply qualifies them as true positives or false positives. It is a limited view which does not allow enough space for improvement in money laundering detection methods[5]. The lack of “creativity”, “future wisdom” and “diversity” in historical data analysis precludes its successful deployment in this context, as these features are necessary to anticipate most, if not all, potential future criminal operations.

Machine learning technology and the improvements it offers to synthetic simulation models

In practice, the present “simulation” approach can be described as an automated videogame. The aim is to project future simulated scenarios by integrating the “limited” historical data with synthetic data, generated on a larger scale with the help of machine learning.

In essence, different types of virtual agents with relevant roles in financial transactions are programmed to interact with each other. This is called agent-based modelling. In this way, previously unknown types of fraud can be conceived and included in the data. The data is “synthetic”, as it is derived not from transactions previously made by real individuals, but from the scenarios which the machine learning software is able to generate[6]. This method is enhanced by self-learning algorithms which automatically provide new, consistent data sets from the data previously made available.
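A minimal sketch of agent-based modelling, written in Python, may help to picture the idea. It is not the PaySim or RetSim simulator cited in the footnotes: the agent classes, the amounts and the 10,000 reporting threshold are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass

@dataclass
class Transaction:
    step: int
    sender: str
    receiver: str
    amount: float
    is_fraud: bool

class Customer:
    """Ordinary agent: sends a modest payment to a random counterparty."""
    def __init__(self, name):
        self.name = name

    def act(self, step, others, rng):
        receiver = rng.choice(others)
        amount = round(rng.uniform(5, 500), 2)
        return Transaction(step, self.name, receiver.name, amount, False)

class Fraudster(Customer):
    """Hypothetical criminal agent: sends payments just below a 10,000 reporting threshold."""
    def act(self, step, others, rng):
        receiver = rng.choice(others)
        amount = round(rng.uniform(9000, 9999), 2)
        return Transaction(step, self.name, receiver.name, amount, True)

def simulate(n_customers=20, n_fraudsters=2, steps=50, seed=0):
    """Let every agent act once per step and record the labelled transactions."""
    rng = random.Random(seed)
    agents = [Customer(f"C{i}") for i in range(n_customers)]
    agents += [Fraudster(f"F{i}") for i in range(n_fraudsters)]
    log = []
    for step in range(steps):
        for agent in agents:
            others = [a for a in agents if a is not agent]
            log.append(agent.act(step, others, rng))
    return log

transactions = simulate()
print(len(transactions), "transactions,", sum(t.is_fraud for t in transactions), "fraudulent")
```

Because every transaction is generated by a known type of agent, each record carries a reliable fraud label, which is rarely the case for historical data. New fraud scenarios can be explored by adding further agent types with different behaviour.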

The information and detection indices produced in this way are, in theory, more comprehensive and broader in scope than those produced by historical techniques. Synthetic data models integrate both the past and all the future financial transactions conceivable by the virtual agents’ simulations, whereas historical data analysis encompasses only the transactions which have actually happened at a given time, according to the information available. This process could be further developed by using machine learning[7] to refine the synthetic data created by the agents, so creating more precise scenarios and detection indices. Monitoring systems therefore stand to benefit from agent-based modelling[8].

Machine learning is at the core of Artificial Intelligence (AI). It is based on algorithms (learning algorithms) that improve their performance as they are given data (training data). This process makes the software gradually better at finding specific links between pieces of information as more data is provided.
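By way of illustration, the Python sketch below shows perhaps the simplest possible learning algorithm: it is given labelled training data and derives a rule (a single amount threshold) from it, rather than having the rule written by hand. The training data and the function name are hypothetical, and real detection models are considerably more elaborate.

```python
def fit_threshold(amounts, labels):
    """Learn the amount threshold that misclassifies the fewest training examples.

    Prediction rule: flag a transaction as suspicious when amount >= threshold.
    """
    best_threshold, best_errors = None, len(labels) + 1
    for candidate in sorted(set(amounts)):
        errors = sum((a >= candidate) != y for a, y in zip(amounts, labels))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold, best_errors

# Hypothetical labelled training data (True = known to be fraudulent).
train_amounts = [20.0, 75.0, 310.0, 9200.0, 9500.0, 120.0, 9900.0, 45.0]
train_labels = [False, False, False, True, True, False, True, False]

threshold, errors = fit_threshold(train_amounts, train_labels)
print(f"learned threshold: {threshold}, training errors: {errors}")
```

The learned rule is only as good as the examples supplied: if the training data contained no large fraudulent payments, no useful threshold could be found.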

Therefore, it appears that the quality of the data is key. In general, the more specific, pertinent and comprehensive the initial data set is, the more likely one is to obtain relevant results from the machine learning process. In particular, it is essential to find effective ways of assessing data quality, so that biases stemming from the original inputs are not multiplied and repeated in the new patterns obtained by algorithmic processing. Biased data is not objective and its suitability for detecting financial crime is impaired[9]. Since there are also strong limitations on both the quantity of data gathered by banks and the uses to which it is put, for reasons of property rights protection[10] and privacy[11], it is no surprise that simulated synthetic data could be even more appealing than has so far been stated.

Privacy issues

Companies which handle personal data must put in place a series of measures in order to guarantee the rights of the people to whom the data belongs[12]. This constrains the pooling of transactional information in the real world. The absence of such constraints is a major comparative advantage for the new systems which depend on artificial modelling[13].

It can be argued that methods based on historical data are affected by privacy regulation more than approaches that rely on synthetic figures since, to function well, the former need to be fed with huge amounts of real people’s data[14]. Where sensitive customer data is concerned, this problem is exacerbated as explicit consent must be sought before the data is processed[15]. Moreover, another non-negligible impediment relating to privacy is the limited availability of open-source data streams, which makes it even more expensive, if not sometimes impossible, to get hold of the desired quantity and quality of data[16]. By contrast, although the issue deserves further study, it may be observed that the simulation methodology presented here does not wholly rely on personal information, as the programmed agents are not real people; their interactions are planned on an abstract and theoretical level which produces new, self-generated data flows. The realistic synthetic datasets created through virtual interactions do not contain any additional customer information and require neither legal nor private transaction-related disclosures. Compared with historical data analysis, agent-based simulation would therefore be likely to improve the protection of personal information, because it uses real data only for “getting started”.

In the post-GDPR[17] context, in which many organisations would prefer to reduce their reliance on handling personal data, the use of simulators to produce synthetic data would appear to be the most privacy-compliant and economical method in the long term[18]. This type of data would also be the most suitable for machine learning.


Fraudulent financial transactions are a major concern as they evolve over time, along with technology. Methods for detecting criminal operations, such as money laundering, have also become stronger, yet they remain limited by their reliance on historical data. Thanks to synthetic simulation applied in combination with machine learning, these problems may be overcome, leading to improvements in preventing and detecting financial crime.

This synthetic simulation method reduces reliance on personal big data, thus avoiding major privacy compliance issues. Using synthetic data, which is prima facie consistent with recent privacy regulation, reduces the costs of complying with rules protecting (real) customer information and, perhaps more importantly, of sourcing data when it is not owned by the company itself. If machine learning is then able to work with the data generated in this way, it is possible to increase substantially the quantity and the quality of both the data stream itself and the methods of analysis employed on it.

Author’s note: I would like to thank and acknowledge Edgar Lopez Rojas for his inspiring talk and subsequent contributions to my research.

[1] Digitalization has made money laundering increasingly sophisticated. This has led to a growth in the cost of putting in place measures which can prevent these forms of criminality.

See “Uncovering Hidden Financial Crime Through Advanced Simulation, A Bluepaper from Simudyne” published in July 2019, available at:, accessed on 15 November, 2019.

See also B. Monroe, “Global Cost of Fraud Tops £3 Trillion”, Accountancy Daily, May 2018, available at: 3-trillion, accessed 22 April, 2019.

[2] E. A. Lopez Rojas, A. Sani, C. Barneaud, “Advantages of the PaySim Simulator for Improving Financial Fraud Controls”, Norwegian University of Science and Technology, 2019.

[3] Ibid.

[4] See “Uncovering Hidden Financial Crime Through Advanced Simulation”, fn. 1.

[5] Ibid.

[6] E. A. Lopez-Rojas and E. Zoto, “Triple Helix Approach for Anti-Money Laundering (AML) Research Using Synthetic Data Generation Methods”, in The 10th International Conference on Society and Information Technologies: ICSIT 2019, 2019.

See also: E. A. Lopez-Rojas, S. Axelsson, D. Gorton, “RetSim: A Shoe Store Agent-Based Simulation for Fraud Detection”, in The 25th European Modeling and Simulation Symposium, number c, 2013, Athens, p. 10.

And also E. A. Lopez-Rojas, S. Axelsson, “Money Laundering Detection using Synthetic Data”, in Lars Karlsson and Julien Bidot, editors, The 27th workshop of the Swedish Artificial Intelligence Society (SAIS), 2012, Orebro, Linkoping University Electronic Press, pp. 33–40.

Finally, E. A. Lopez-Rojas, S. Axelsson, “Multi Agent Based Simulation (MABS) of Financial Transactions for Anti Money Laundering (AML)”, in Audun Josang and Bengt Carlsson, editors, Nordic Conference on Secure IT Systems, 2012, Karlskrona, pp. 25–32.

[7] In machine learning, the developer of the simulation software manually creates the initial (learning) algorithms, which form the building blocks from which new rules are inferred. New rules are then derived from the data without further human intervention, adding a new layer of instructions for the computer to perform. Starting from the original data, the computer is thus instructed on new tasks in an iterative process which produces new information consistent with the previous inputs. The basic function of machine learning is to feed training data to a learning algorithm, which then automatically generates new models.

See Internet Society, “Artificial Intelligence and Machine Learning: Policy Paper”, 18 April 2017, available at:, accessed on 20 November 2019.

[8] See E. A. Lopez-Rojas, A. Sani, C. Barneaud, “Advantages of the PaySim Simulator”, fn. 2 and 3.

[9] A technical perspective on this issue may be found in R. J. Mooney, “Comparative Experiments on Disambiguating Word Senses: An Illustration of the Role of Bias in Machine Learning”, December 1996, downloadable at arXiv:cmp-lg/9612001, accessed on 22 November 2019.

[10] The issue of establishing ownership of data highlights how the more valuable the data becomes, the fewer incentives there are for creating open-source platforms. Such platforms would make increasing amounts of data publicly accessible, offering opportunities for collecting big data to benefit new initiatives.

For an insight see H. Varian, “Open source and open data”, 12 September 2019, available at:

[11] See conclusions reached in E. A. Lopez-Rojas and E. Zoto, “Triple Helix Approach for Anti-Money Laundering (AML) Research”, fn. 6.

[12] The General Data Protection Regulation n. 2016/679 (GDPR), which has applied since 25 May 2018, sets uniform protection standards for handling data across all the European Union’s Member States. Furthermore, the Charter of Fundamental Rights of the European Union affirms the right to privacy at art. 7, while distinguishing it from the right to data protection, given at art. 8.

[13] For possible solutions in overcoming the challenges (which are relevant to this paper) set by handling personal big data under the GDPR regime, see: E. A. Lopez-Rojas, D. Gultemen, E. Zoto, “On the GDPR introduction in EU and its impact on financial fraud research”, in The 30th European Modeling and Simulation Symposium (EMSS), 2018, Budapest.

[14] Ibid.

[15] See Art. 9 of the GDPR, entitled “Processing of special categories of personal data”, para. 2(a).

[16] See previous fn. 15.

[17] European Union’s General Data Protection Regulation

[18] See “Uncovering Hidden Financial Crime Through Advanced Simulation”, fn. 1.

About the Author:

Domenico Piers De Martino is President and Fintech Researcher at OFLS, and a Masters in Law and Finance candidate (2020), University of Oxford.

About the Editor:

Vaibhav Manchanda is an Economics graduate from the University of Chicago, and a BA Jurisprudence candidate (2021), University of Oxford.

Oxford Fintech & Legaltech Society —