MANM389 Dissertation for Business Analytics
Executive summary (1000 words; 10% of marks): Set out on its own immediately after the title page. This often takes the form of a series of summary statements, ordered under similar headings to those used within the Dissertation. These summarise the key information or findings. The Executive summary should be written for an intelligent layman.
Declaration of Originality (see Appendix F)
Table of contents: An outline of the entire dissertation in list form, setting out the sequence of the sections with page numbers. It is conventional to number the preliminary pages (executive summary, table of contents) with lower case Roman numerals (i.e. (i), (ii), (iii) etc.) and the main text pages (starting with the first chapter) in Arabic numerals (1, 2, 3 etc.) as shown below:
CONTENTS PAGE
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
CHAPTER 1 (Title) 1
• List of tables and figures: A table is a presentation of data in tabular form; a figure is a diagrammatic representation of data or other material such as graphs, photographs, images or maps. Tables and Figures should be numbered consecutively according to chapter (e.g.: Table 1.3 is the third table in Chapter 1, and Figure 4.2 is the second figure in Chapter 4). Each should be separately listed with page numbers.
• List of abbreviations: Abbreviations should be used sparingly, and those that are not self-evident or in common use should be explained where they first appear in each chapter by giving the full expression and the abbreviation in brackets, e.g. ‘gross domestic product (GDP)’. Abbreviations not in common use should appear at the beginning of the dissertation.
Useful rules for abbreviations:
No full stops in abbreviations consisting of initial capital letters, UK, US (adjective), EEC, OECD, BBC, UN. Note: ‘United Kingdom’ and ‘United States’ should be spelt out when used as nouns;
No full stops after abbreviations ending with last letter of word abbreviated, Dr Mr Mrs St;
• Introduction and/or definition of research problem: The introduction should set out the purpose and scope of the dissertation, clearly explaining what it is about, how it is structured, but more importantly, why the research is necessary and to whom. You need to ensure that the academic and applied rationale is well explained and justified. An academic rationale should answer the questions “Why don’t we know this already? Why is more study on this topic needed?” and an applied rationale should demonstrate the relevance of the topic to contemporary business environments. The section should end with the main aim and objectives of your study.
• Literature review (this may be more than one chapter): This section gives an overview of the context and background to the research problem. It builds on your problem definition and aims and objectives and so is an expansion of the concise arguments you make there. It is probably the section that will give you most scope to show off the wide range of sources you have consulted. Although based on existing literature, you should still present your material critically.
• Methodology: This section evaluates and justifies the research methodology that will be used to obtain the data to answer the research questions. It states the research problem, discusses the operationalisation of hypotheses (where relevant), discusses the research instrument used, the method of collecting the data – including sampling, the analysis of the data and the validity and reliability of data. It should contain enough detail to allow someone else to repeat your study.
• Analysis/ Findings/ Results: You should present your data in an appropriate form, which may include tables, graphs or in the case of qualitative data, verbatim quotes. Select the format that best suits your data, and do not present your data in more than one form. Ensure that the text around your presented data pulls out the key findings, rather than repeats what is already given. A table/figure should never be presented without supporting text. Tables and figures should be clearly and consistently labelled either above or below, and the reader should be able to understand the table/figure from the title without referring to the text for explanations. Units of measurement, the year to which the data refer, geographical area covered, and sources should be clearly stated. The labels in the text and in the lists should correspond exactly.
• Discussion: It can be hard to know which section to discuss your results – this or the preceding one – and you may decide to combine these two sections into one or more chapters based on theme, depending on your topic and your supervisor’s views. However, what is vital is that your Dissertation contains sufficient analytical discussion in addition to the more descriptive ‘scene setting’ material of the literature review sections, and presentation of results. It is here that you will compare and contrast your findings with those already reported in the literature.
• Conclusions: Here you need to answer the “So what?” question. What significance do your research findings have? For whom? Why? and How? In this chapter you link the research problem with literature review and findings, stating what you can conclude based on the work conducted. Based on your conclusions you should comment on managerial implications, the limitations of the research, suggest further work and better ways to resolve the problem.
• Full list of references used in the dissertation: You should provide correctly formatted bibliographic details for every citation included in the dissertation. Do not include material which is not referred to in your text (also see Section 8.0 below on referencing and academic misconduct).
• Appendices: Often misused and misunderstood, an appendix should only be used to include supplementary (but non-essential) material which, if included, would disrupt the flow of the text. Appendices are not marked so do not include any vital information, e.g.: results of analysis, in one if you want the content to be considered as part of the assessment. Appendices do not contribute to the overall word length.
Exploring the Factors Influencing Term Deposit Subscription in Direct Marketing Campaigns: A Case Study of a Portuguese Banking Institution
The banking landscape has been changing rapidly, especially in nations with established financial institutions such as Portugal, the setting of this case study. Given the abundance of similar products now available in the market, the challenge lies not only in attracting new customers but also in retaining the attention of existing ones. Term deposits are a central product in this context because they provide customers with a degree of financial security. The purpose of this research is to investigate the factors that influence a customer's propensity to invest in term deposits, particularly when the customer is approached through direct marketing. Based on the preliminary analysis, factors such as age, past financial commitments, and the general economic situation had a substantial impact on the number of term deposit subscribers. Logistic regression produced a clear linear separation based on the weights of the various factors, which made it simple to interpret which factors were most important. SVM performed remarkably well at handling non-linearities in the data, particularly for complicated customer profiles where conventional analyses could have failed. By combining a large number of decision trees, Random Forest surpassed the other methods in discovering non-obvious patterns and so provided a more nuanced picture of how customers behave. The success of direct marketing efforts depended heavily on the strategy used, and it improved significantly when insights from machine learning models were included in campaign design. Older customers had a greater propensity for term deposits, a pattern that machine learning models, namely SVM, captured more accurately than conventional statistical techniques.
Traditional direct marketing continued to be successful even though the statistics showed an increasing trend toward digital platforms. Because of its capacity to process massive data sets, Random Forest made it possible to delve more deeply into the digital-versus-traditional interaction.
The first step in the three-step modeling process used in this study is logistic regression, which produced encouraging findings. It demonstrated a good capacity to differentiate between those who would and would not subscribe to term deposits, achieving an Area Under the Curve (AUC) score of roughly 0.936. The model achieved an accuracy of 91.54%, which suggests that it was reliable in producing correct predictions overall. However, its recall of 41.37% indicated that it identified only around 41% of actual term deposit subscribers. On the other hand, its precision of 66.30% indicated that when it forecasts a subscription, it is correct about two times in three. Notably, the F1 score was found to be at its maximum. A perfect precision-recall balance is very unlikely to occur in practice and is inconsistent with the precision and recall reported here, so this figure may indicate a computational error. With Random Forest, an ensemble of decision trees, the AUC increased further to 0.9517, a strong indication of its greater capacity for discrimination. Its accuracy of 92.02% was slightly higher than that of Logistic Regression, suggesting that a collection of decision trees may improve prediction reliability. Precision was almost identical to that of Logistic Regression at 66.27%, while recall was considerably higher at 50.74%. This improvement suggests that Random Forest is more capable of detecting true term deposit subscribers. The F1 score, which is the harmonic mean of precision and recall, was 0.5748, indicating a better-balanced performance than Logistic Regression. Finally, SVM, which seeks the best hyperplane dividing the classes, had an AUC of 0.9517, similar to Random Forest. Its accuracy was essentially the same as that of Logistic Regression, at 91.54%.
It is interesting to note that SVM's precision was somewhat lower than Logistic Regression's, at 64.78%, but its recall of 44.57% was an improvement over Logistic Regression, meaning that it recognized a larger share of genuine subscribers. The resulting F1 score of 0.5281 highlights the trade-off that SVM had to negotiate between precision and recall.
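The F1 values quoted above can be checked directly from the reported precision and recall, since F1 is the harmonic mean of the two. The following is a minimal sketch, assuming the 66.30%, 66.27%, and 64.78% figures are precision values; the function name `f1_score` and the dictionary layout are illustrative and are not taken from the study's own code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the text, expressed as fractions.
# Treating the first value in each pair as precision is an assumption.
reported = {
    "Logistic Regression": (0.6630, 0.4137),
    "Random Forest": (0.6627, 0.5074),
    "SVM": (0.6478, 0.4457),
}

f1_by_model = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1:.4f}")
```

Under these assumptions, the Random Forest and SVM results agree with the reported 0.5748 and 0.5281 to within rounding, and the logistic regression inputs imply an F1 of roughly 0.51 rather than a maximal value.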
In conclusion, the research, "Exploring the Factors Influencing Term Deposit Subscription in Direct Marketing Campaigns: A Case Study of a Portuguese Banking Institution," used predictive modeling to determine which factors had the most effect on term deposit subscriptions. Insights into the significance of several characteristics in determining a customer's chance to subscribe were provided by the sophisticated analytical methods used, such as Logistic Regression, Random Forest, and SVM. In particular, the AUC of the Random Forest model was higher than that of the other models, showing that it was better able to account for nonlinear linkages and complicated interactions. Different patterns and outcomes were found while assessing alternative models for term deposit subscription using the Portuguese banking dataset. With and without feature selection, we compared the Logistic Regression, Random Forest, and SVM models. On the training data, the Random Forest model performed well, with an accuracy that was very close to flawless. By using feature selection, the train-test accuracy gap narrowed for all models, indicating improved generalization to new data. The C5.0 model performed well in the examination, with a high percentage of right predictions. Confusion matrix results showed that the model accurately differentiated between subscribers and non-subscribers, suggesting its usefulness in real-world settings. The confusion matrices revealed the prediction abilities and shortcomings of each model. Some models did a better job of identifying real positives, while others did a better job of discouraging false positives. As a whole, these results point Portuguese banks in the right direction, helping them hone their advertising efforts and boost their term deposit subscription rates. This in-depth analysis will allow for better, more specific communication with clients.
1.1 Introduction
Banking institutions in today's highly competitive market are always looking for new ways to attract and retain customers. Term deposit advertising and other forms of direct marketing have long been effective means of accomplishing these goals. In a bank's portfolio, term deposits serve a crucial function as a reliable funding source and as a foundation of trust with customers because of the guaranteed interest rate they offer for a fixed period (Tekouabou et al., 2019). The Portuguese banking industry, with its rich history and rapid development, provides a fascinating setting in which to study the intricacies of direct marketing and evaluate its efficacy. As economic circumstances and consumer habits change, it is becoming more important for banks in Portugal and elsewhere to identify and understand the critical factors influencing a prospective customer's choice to subscribe to a term deposit. This report examines the direct marketing campaigns run by a leading Portuguese bank and, by analyzing their many facets, sheds light on the factors that have a major influence on term deposit subscriptions (Nethala et al., 2022). Such knowledge may help marketers plan more precise and successful campaigns in the future. A firm grasp of these intricacies would not only boost campaign effectiveness but also strengthen the relationship between banks and their customers, which is becoming more important as the financial industry becomes more customer-centric.
1.2 Background
There has been a dramatic shift in the banking industry, particularly in emerging markets like Portugal. With the market already saturated and a myriad of comparable products available, it may be difficult for institutions to stand out and attract new customers. Term deposits are a staple of the banking industry, helping to both build clientele and guarantee a constant supply of funds. By accepting these fixed-term, interest-paying deposits, banks secure a steady flow of cash and provide their clients with a safe investment choice (Borugadda et al., 2021). Direct marketing campaigns have been a traditional channel for advertising financial services. To attract customers to their term deposit products, banks use a variety of methods, including phone calls, emails, and increasingly digital platforms. However, the success of such initiatives depends on a wide variety of variables, from the general economy and interest rates to the specifics of the demographic being targeted, and knowing these context-specific elements is very important in the Portuguese banking industry. Customer confidence is especially important in Portugal because of the country's recent history of navigating economic crises and banking sector consolidations (Gupta et al., 2021). Therefore, the purpose of this study is to investigate, through the lens of direct marketing efforts by a well-known Portuguese bank, the many factors that influence a customer's choice to commit to a term deposit. The findings will help clarify the client decision-making process and inform future marketing tactics in the banking industry.
1.3 Research Aim and Objectives
1.3.1 Aim
The main aim of the study is to develop a predictive model, using machine learning algorithms, that can accurately forecast the likelihood of term deposit subscription for individual customers of Portuguese banking institutions. Based on the model's results, marketing strategies will be recommended to improve interactions with consumers.
1.3.2 Objectives
• To understand the factors that influence consumers to opt for term deposits.
• To focus on the various predictive machine learning algorithms that Portuguese Banking Institutions can utilize to understand the impact of economic indicators.
• To understand the way predictive models can help a Portuguese Banking Institution in understanding the consumer decision to opt for a term deposit subscription.
• To focus on the various marketing strategies that Portuguese Banking Institutions can use for proper interactions with the consumers.
1.4 Research Questions
● What is the demographic profile of clients who subscribe to a term deposit in a Portuguese banking institution?
● How do variables such as age, job type, marital status, and education level influence the likelihood of subscribing to a term deposit in a Portuguese Banking Institution?
● What is the impact of economic indicators, such as employment variation rate, consumer price index, consumer confidence index, and number of employees, on term deposit subscriptions?
● How do the previous campaign's outcome and the number of contacts performed by Portuguese banking institutions before and during the current campaign affect subscription behaviour?
● What is the accuracy of logistic regression, SVM, and Random Forest in predicting the likelihood of term deposit subscription?
● What are the different marketing strategies that can be used by Portuguese banking institutions for proper interactions with customers?
1.5 Research Rationale
The banking industry is ever-changing in reaction to changes in the economy and customers' preferences. Term deposits are a staple in this setting because they provide banks with much-needed liquidity while also giving investors a steady and, typically, attractive rate of return on their money. Because term deposits matter so much, banks must understand what makes customers willing to commit their money for an extended period. Direct marketing initiatives have made it possible to advertise term deposits strategically to a large audience. However, their success varies; some efforts are more effective than others in reaching their intended audience. Banks may benefit from a more informed strategy for their future outreach efforts if they can determine the root causes of this disparity and work to address them. The decision to concentrate on a Portuguese banking institution adds specificity to the investigation. The Portuguese banking system, with its peculiar difficulties resulting from past economic swings and cultural idiosyncrasies, may provide lessons that are relevant outside Portugal. Banks may improve their marketing to better meet the expectations of modern customers by gaining a better grasp of the elements driving term deposit subscriptions as the digital era continues to disrupt old banking paradigms. Therefore, this study is situated at the crossroads of economics, marketing, and consumer psychology, with an emphasis on direct marketing's role in shaping consumers' preferences for term deposits.
2.1 Introduction
The Covid-19 pandemic has caused an unprecedented global catastrophe, impacting the vast majority of countries across the world. Central banks have responded with swift and extensive actions, which have been historically unprecedented. A new research database has been introduced, offering comprehensive data on how central banks from 39 different countries reacted to the Covid-19 crisis. These countries represent a mix of developed and developing economies. The database categorizes the monetary policy announcements into five main areas: interest rate measures, reserve policies, loan operations, asset purchase programs, and foreign exchange operations (Cantú et al., 2021). Each of these categories is further detailed with specific information, such as maturity, qualified counterparties, asset types, and the availability of a fiscal backup. Additionally, the database provides insights into the history of the introduction and deployment of each policy instrument.
2.2 Factors that influence consumers to opt for term deposits
E-banking is becoming more popular as a means for financial institutions to meet their clients' demands. To better serve their clients, many brick-and-mortar banks now provide online banking services (Li et al., 2021). E-banking has grown in popularity and simplified financial transactions as a result of technological developments. However, maintaining happy Internet banking customers is a difficult task. Banks rely heavily on satisfied customers as a means of staying ahead of the competition. Therefore, the purpose of this research is to investigate the elements that affect e-banking users' happiness. The study highlights cloud services, security measures, e-learning, and service quality as the four most important aspects of an e-banking experience that contribute to client happiness. The research applies structural equation modeling to examine these elements, checking the precision and accuracy of the underlying causal model and measurement model (Li et al., 2021). The ultimate goal of the research is to learn how banks may better tailor their e-banking offerings to customers' needs and preferences.
2.2.1 Macroeconomic variables
This research examines the variables that affect the interest rates of banking products using a one-of-a-kind dataset from the Dutch banking industry. This study investigates how macroeconomic variables, bank characteristics, and account characteristics all have a role in determining interest rates. In particular, the analysis shows that interest rates have grown increasingly sensitive to bank risk ever since the global financial crisis (Bikker and Gerritsen, 2018). One major conclusion is that interest rates on time deposits more accurately reflect the state of the economy than do those on savings accounts. Interest rates on deposit products vary not just across financial institutions, as the research shows, but also inside financial institutions, between different account types. Maturity-increasing circumstances, such as withdrawal fees for savings accounts and the product maturity term for time deposits, have a substantial impact on interest rates. Curiously, these factors tend to increase the rate of interest on a given financial instrument. In conclusion, the findings of this study provide new insight into the ever-shifting dynamics of interest rates in the banking sector, suggesting a heightened sensitivity to bank risk ever since the 2008 financial crisis (Bikker and Gerritsen, 2018). In addition, it provides useful information for the Dutch banking sector by highlighting the many ways in which economic circumstances and product characteristics affect interest rates.
The trend wherein nations with high-interest rates often provide high projected returns on short-term deposits is at the core of the interest parity issue as it has been revealed. HSBC Holdings' first-half earnings increased dramatically due to rising interest rates at central banks throughout the world, forcing the company to buy $2 billion in shares and raise its profitability goal. From a target of at least 12% for this year and a reported 9.9% for 2022, the bank has increased its near-term return on tangible equity aim to a minimum of mid-teens for 2023 and 2024 (White, 2023). It is a well-established fact that housing investment drives economic expansion and contraction in the United States. The macroeconomic significance of this investment has received more attention since the housing boom of the 2000s. This piece is an effort to close that void. The research proposes a new index, the "houses' own-rate of interest," which combines the mortgage interest rate with the rate of inflation in home prices. The analysis predicts a negative association between this index and the pace of residential investment growth since it reflects the true cost of purchasing homes. Long-term residential investment growth rate is shown to be inversely related to the rate at which homes earn their own interest, according to a research of the US economy from 1992 to 2019. According to the findings, the pace of increase in residential investment had no discernible impact on homeowners' interest rates throughout the short-term adjusting process (Petrini and Teixeira, 2023).
2.2.2 Interest rates
According to Busch and Memmel (2017), higher interest rates improve banks' net interest margins in the long run. Their study, which looks at a time series centered on the German banking sector, verifies this connection. Specifically, they find that a rise in interest rates of 100 basis points is associated with a 7 basis point rise in banks' net interest margins. Higher interest rates have been shown to increase bank profitability over the long run by increasing net interest revenue. The study also finds the reverse impact in the short term, which is rather intriguing. There seems to be a transitory negative effect on banks' net interest margins immediately after a rise in interest rates. In the long run, however, the positive correlation between interest rates and net interest margins prevails, negating the negative impact in the short term. The repercussions of the current low-interest-rate climate are also explored. The interest margins on retail accounts, especially term deposits, have dropped significantly. To be more precise, a drop of as much as 97 basis points has been seen in these margins (Busch and Memmel, 2017). This finding highlights the difficulties banks have in making a profit and controlling their interest revenue in the face of historically low interest rates. The research sheds light on the intricate interplay of interest rates, net interest margins, and the profitability of the banking industry, illuminating both short- and long-term changes and trends.
As interest rates drop to historically low levels, the ability of monetary policy to stimulate growth in bank lending wanes, according to empirical evidence (Borio and Gambacorta, 2017). After taking into consideration a wide range of other variables, such as business and financial cycle circumstances and individual bank characteristics including liquidity, capitalization, financing costs, risk, and diversified income, this finding still holds. Focusing on a sample of 108 large worldwide banks, this study analyses the effect of monetary policy on bank lending in a low-interest-rate environment. Low interest rates have less of an influence on lending since they diminish the profits that banks make from their more conventional intermediation operations, according to the study's authors. This pattern provides some explanation for the tepid growth in lending seen between 2010 and 2014 (Borio and Gambacorta, 2017). This research shows that even at historically low interest rates, monetary policy can only go so far in encouraging bank lending, and it emphasizes the importance of profitability concerns in determining lending behaviour.
2.3 Data mining for term deposit marketing
Banks and other financial institutions may give their clients the opportunity to deposit a specified amount of money for a set period of time at a fixed interest rate via a financial arrangement known as a term deposit, fixed deposit, or time deposit (Khromov, 2018). The depositor may choose a term lasting anything from a few months to many years, during which access to the money is restricted. Because of the longer commitment involved, interest rates on term deposits are often greater than those on standard savings accounts. At the conclusion of the term, the depositor receives back both the principal and any interest that has accumulated. Term deposits are preferred by savers who want a guaranteed rate of return on their money and the peace of mind that comes with knowing exactly when their money will be completely returned to them (Khromov, 2018). There may be fees associated with withdrawing money from a term deposit before maturity.
Economic constraints and intensive marketing rivalry are creating difficulties for term deposits at present. Even though many studies have highlighted the significance of consumers and customer segmentation in bank and deposit marketing, several challenges prevent this research from being put into practice (Zhuang et al., 2018). These include obsolete data, inadequate mapping, and a lack of particular strategies for deposit market segmentation. This study applies data mining strategies through SPSS Modeller to solve these concerns and boost the efficiency of bank marketing. The goal is to better understand the characteristics of consumers to make predictions about their behavior concerning term deposit subscriptions. The research aims to help banks fine-tune their marketing approaches to better meet the demands of their clients by using data mining. With this method, a bank can better target its audience and tailor their message to stand out in the increasingly competitive deposit market (Zhuang et al., 2018). The study's overarching goal is to help financial institutions better adapt to changing market conditions and fiercer marketing competition in the context of term deposits by bridging the gap between theoretical customer segmentation concepts and their practical implementation.
2.3.1 Various Data Mining models
The repercussions of the financial crisis were examined in research that used data from a Portuguese retail bank covering 2008 to 2013. The study examined 150 characteristics of bank customers, offerings, and community characteristics in depth. Using data up to July 2012, a semi-automatic feature selection approach was used to reduce the number of characteristics to 22, simplifying the modeling process (Gladilin and Saitov, 2019). Four data mining (DM) methods were then evaluated: logistic regression, decision trees, neural networks, and support vector machines. Two criteria were used to assess their performance: the area under the receiver operating characteristic curve (AUC) and the area under the cumulative LIFT curve (ALIFT). The most recent data, from after July 2012, was used for evaluation with a rolling window approach. The neural network (NN) performed the best of the DM models, with an AUC and ALIFT of 0.8. By choosing the top half of better-classified customers, the study was able to correctly identify 79% of subscribers (Gladilin and Saitov, 2019). The research shows that data mining tools, and neural networks in particular, may be used to accurately forecast consumer behavior and more precisely target prospective subscribers. This study's findings may help the Portuguese retail bank improve its marketing efforts and client acquisition while facing economic uncertainty and other difficulties.
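The two evaluation criteria described above can be illustrated on a toy example. The sketch below is a minimal illustration and not the cited study's procedure: it computes AUC as the probability that a randomly chosen subscriber is scored above a randomly chosen non-subscriber, and one point on the cumulative LIFT curve, namely the share of subscribers captured in the top half of ranked customers. The scores and labels are hypothetical, and the function names `auc` and `captured_in_top` are assumptions introduced for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def captured_in_top(scores, labels, fraction):
    """Share of all positives found among the top `fraction` of customers by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = int(len(ranked) * fraction)
    positives_in_top = sum(y for _, y in ranked[:k])
    return positives_in_top / sum(labels)

# Hypothetical model scores for ten customers (1 = subscribed); not study data.
scores = [0.95, 0.80, 0.70, 0.65, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

toy_auc = auc(scores, labels)
toy_capture = captured_in_top(scores, labels, 0.5)
print(f"AUC = {toy_auc:.3f}, subscribers captured in top half = {toy_capture:.0%}")
```

ALIFT, as described in the text, is the area under the full cumulative curve; evaluating `captured_in_top` over a grid of fractions and averaging the results would give a rough approximation of it.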
Many banks now utilize telemarketing, a kind of direct marketing, to pitch consumers on the idea of long-term deposits. Predictive modeling, which is made possible by data mining, provides an answer to this problem. However, the prediction performance may decline when dealing with data that includes several aspects, such as banking client information (Ładyżyński et al., 2019). The success rate of the bank and the prediction model are the key topics of this study. In the first place, it looks for ways to lessen the amount of data characteristics used in the prediction model. Second, the research highlights the significance of training set balance to produce a more trustworthy model. The accuracy rates and metrics, such as true positives and receiver operating characteristics (ROC), of the suggested technique, are compared to those of the original prediction model via performance tests. The findings reveal that the improved method using fewer features exhibits superior prediction performance (Ładyżyński et al., 2019). This study helps improve banks' telemarketing approaches by shedding light on how to balance training sets and improve data intake for higher prediction accuracy. With these upgrades in place, telemarketing may be used by financial institutions with little disruption to customers and maximum success for their campaigns.
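One common way to balance a training set is random undersampling of the majority class. The text does not state which balancing technique was used, so the sketch below is only an assumed illustration with hypothetical records; the function name `undersample` and the field name `y` are not from the cited study.

```python
import random

def undersample(rows, label_key="y", seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))  # sample without replacement
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced

# Hypothetical, imbalanced data: 10 subscribers and 90 non-subscribers.
rows = [{"id": i, "y": 1 if i < 10 else 0} for i in range(100)]
balanced = undersample(rows)
print(len(balanced), sum(r["y"] for r in balanced))
```

Undersampling discards data, so it is usually applied to the training split only, leaving the evaluation data at its natural class ratio.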
2.4 Predictive Models to understand consumer decision
Bank loans in the current financial climate carry several dangers, creating difficulties for both lenders and borrowers. For efficient loan management, it is crucial to evaluate and understand these risks. Increased activity in the banking industry generates massive amounts of data that reveal consumer patterns and magnify loan-related dangers (Hamid and Ahmed, 2016). Data Mining, which seeks to combat this data flood and extract useful insights by analyzing large databases, has emerged as an important and exciting area of study. Using data mining methods, this study discusses the banking industry's pressing need for a revised loan risk categorization model. The model is built using banking sector data to make predictions about loan outcomes. The suggested model is developed using three different algorithms: j48, bayesNet, and naive Bayes (Hamid and Ahmed, 2016). In today's banking system, these algorithms serve a crucial role in assessing data and aiding the precise categorization of loan risks, therefore assisting both banks and borrowers in effectively managing the risks inherent in loans.
2.4.1 Naïve Bayes, Decision Tree, ANN and Support Vector Machine algorithm for prediction
A Portuguese retail bank's data from 2008–2013 were used, which included customer and product information as well as socioeconomic factors such as the effect of the financial crisis. An initial collection of 150 traits was reviewed, and from those, 21 were chosen as being most relevant to the suggested method. The purpose of this work is to provide a unique classification approach to improving the prediction of telemarketing target calls to boost long-term deposits at financial institutions (Koumétio et al., 2018). To correctly categorize clients based on feature categories, the novel suggested method functions in a manner that implicitly accentuates the most relevant qualities. Whether or not the characteristics are normalized, the suggested method is robust and accurate in experimental settings. Results are compared to those of other popular machine learning models including Naive Bayes, Decision Trees, Artificial Neural Networks, and Support Vector Machines to see how well they perform. This study introduces a novel categorization strategy that vastly improves the forecasting of telemarketing lead calls while promoting bank CDs (Koumétio et al., 2018). The technique's promising performance in the area of predictive analytics for financial services stems from its use of a limited number of highly relevant characteristics and its insensitivity to feature normalization.
To anticipate whether a potential new client is likely to have a term deposit or not, the suggested AIN is built to represent a network of consumers who already have such deposits with the bank. The research also presents new formulae for determining the affinity of an antigen to an antibody and an immune system (Lu et al., 2016). The purpose of this research was to investigate the use of Artificial Immune Systems (AIS) and, more specifically, AINs (artificial immune networks) for pattern identification and data processing in a wide range of scientific and technical fields. The major goal is to use an AIN to provide suggestions for bank term deposits by using it as a collaborative filtering and classification model. Feature selection strategies are used to isolate and keep just the most relevant characteristics for use in classification, hence improving the model's performance. The results of many tests show the potential of the AIN model to solve important problems. The suggested model outperforms competing models and achieves the maximum accuracy in testing despite the difficulty of class imbalance in the test dataset (Lu et al., 2016). This demonstrates the value of artificial immune networks as a powerful resource for collaborative filtering and bank term deposit recommendations, opening up new possibilities for better banking sector decision-making and client targeting.
2.4.2 Classification algorithms
It is very necessary to build a Customer Relationship Management (CRM) system to manage relationships with both current and future clients efficiently. To improve the efficiency of this system, a thorough analysis of the most recent published research investigates the many data mining strategies that are now in use in a wide variety of business domains, corporate sectors, and organizations (Rahman and Khan, 2018). After that, a model that is specifically designed for recognizing the behavior of customers in the banking industry is presented. The k-NN (k-Nearest Neighbors) algorithm, decision trees, and artificial neural networks are the three types of classifiers that will be used in the suggested model that will be used to predict the behavior of customers in the banking business. The efficiency of each classifier is analyzed in great detail, and their results are compared to one another to decide which one is the most successful at predicting the actions of banking customers. In the rapidly changing environment of the banking business, the use of these data mining methods and the subsequent analysis of their results may lead to the acquisition of priceless insights, which in turn can facilitate improved decision-making processes and create better client connections.
2.4.3 Classification and mapping of consumer journey for membership services for term deposit
A data mining response model based on random forests is proposed to better identify target consumers for banking campaigns, with an emphasis on the critical role of customers' responses in direct marketing. Class imbalance, a typical problem in telemarketing, might reduce the effectiveness of data mining methods. The study also investigates how class imbalance strategies might be used in the financial sector to provide a solution to this problem (Miguéis et al., 2017). This research contrasts the efficacy of an undersampling approach, the EasyEnsemble algorithm, with an oversampling approach, the Synthetic Minority Oversampling Technique (SMOTE), in improving the performance of the response model. The objective is to identify the best method for addressing class differences. The importance of attribute characteristics in the response model is also investigated. Notably, the model's discriminative performance improves once demographic data, contact information, and socioeconomic factors are included (Miguéis et al., 2017). According to the study's findings, random forests with an undersampling algorithm (EasyEnsemble) provide the best prediction performance compared to the other methods used. This demonstrates the usefulness of the suggested data mining response model, which may aid financial institutions in identifying the most profitable consumers to target with their marketing initiatives.
Direct marketing is becoming commonplace in the corporate world, and it has a lot of potential in the banking and financial services sector. In this study, the researchers provide a new classification method that uses partial training datasets. In this experiment, the researchers used data from a real-world direct marketing campaign, more particularly, a telemarketing effort designed to estimate the likelihood of a customer subscribing to a term deposit plan (Barman et al., 2016). Customers would be divided into subsets according to shared demographic characteristics using this suggested technique. The X-means clustering technique is used to accomplish this sorting. The X-means cluster technique is used to extract useful subsets of consumers from the massive customer database based on their shared demographic traits. After that, the study compares the suggested classification approach to three common classifiers like Naive Bayes, Decision Tree, and Support Vector Machine to gauge its efficacy. The results show that the suggested technique outperforms prior research using the same banking data, demonstrating its superiority in forecasting enrollment in term deposit plans (Barman et al., 2016).
2.4.4 Modelling
The researchers in this study used data mined from a term savings campaign run by a Portuguese bank between May 2008 and November 2010 to build prediction models. The data is distorted because of the low participation rate of 11.27 percent, which might result in high specificity but poor sensitivity in prediction models (Hlongwane, 2018). Researchers use under-sampling, over-sampling, and the Synthetic Minority Over-Sampling Technique to generate three more evenly distributed datasets for modeling purposes. Models for predicting who could sign up for a term savings product are created using random forest, multivariate adaptive regression splines, a neural network, and a support vector machine, and their relative performance is compared. The research uses the ROC curve, the confusion matrix, the GINI, the kappa, the sensitivity, the specificity, the lift, and the gains charts to assess the prediction performance. The most accurate model for forecasting the adoption of long-term savings products is a multivariate adaptive regression splines model constructed using over-sampled data (Hlongwane, 2018). This research sheds light on how to fine-tune direct marketing campaigns for more precise client targeting by emphasizing the significance of having a good mix of data and modifying the parameters.
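The idea behind rebalancing a skewed dataset can be sketched in a few lines of Python. This is a simplified illustration of plain random under-sampling and random over-sampling only (not SMOTE, and not the cited study's implementation); the function names and the 100-row toy dataset with 11 subscribers are assumptions chosen to mimic a participation rate of roughly 11 percent. Both classes are assumed to be present.

```python
import random


def split_by_class(rows, labels):
    """Separate rows into positives (label 1) and negatives (label 0)."""
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    return pos, neg


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows from the larger class until both classes match."""
    pos, neg = split_by_class(rows, labels)
    rng = random.Random(seed)
    n = min(len(pos), len(neg))
    return [(r, 1) for r in rng.sample(pos, n)] + [(r, 0) for r in rng.sample(neg, n)]


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows from the smaller class until both classes match."""
    pos, neg = split_by_class(rows, labels)
    rng = random.Random(seed)
    n = max(len(pos), len(neg))
    pos_out = pos + [rng.choice(pos) for _ in range(n - len(pos))]
    neg_out = neg + [rng.choice(neg) for _ in range(n - len(neg))]
    return [(r, 1) for r in pos_out] + [(r, 0) for r in neg_out]


# Hypothetical dataset: 100 customers, 11 of whom subscribed.
rows = list(range(100))
labels = [1 if i < 11 else 0 for i in rows]

under = random_undersample(rows, labels)
over = random_oversample(rows, labels)
print(len(under), len(over))
```

Under-sampling discards information from the majority class, while over-sampling repeats minority rows; resampling is normally applied to the training split only, so that test performance reflects the original class distribution.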
Knowledge gained via data mining is a powerful resource for making sound choices. Although a large number of characteristics may be used to characterize an issue, smaller subsets of features may have a more significant impact on individual event clusters within that problem (Taj and Kumaravel, 2020). Specifically, the study presented a divide-and-conquer strategy to improve the process of describing telemarketing contacts for bank deposit sales by combining data-driven sensitivity analysis to extract feature importance with expert review. The research identifies call direction (inbound/outbound) as a critical component. Targeting effectiveness was much improved by treating inbound telemarketing as a separate problem. This demonstrates the value of the approach and its potential applications, especially in the highly competitive field of banking marketing (Taj and Kumaravel, 2020). The approach taken in the study shows how combining data-driven insights with expert viewpoints may greatly help in solving difficult decision-making problems.
2.5 Various Marketing Strategies
Bakry et al. (2021) show that age is a critical factor in identifying a company's demographic niche and informing successful advertising campaigns. In particular, millennials have become the dominant demographic in consumer databases owing to their lavish spending habits. Sharia banks in Indonesia are aware of this demographic shift and are eager to attract their business. The purpose of their study is to identify potential strategies that Bank Syariah Indonesia might use to better appeal to millennials. The study uses in-depth interviews and questionnaires to compile its data for a descriptive qualitative analysis of field research. Opportunities for banks to attract and retain millennial customers include educating customers about banking products and contracts, capitalizing on the growing interest in going digital with banking services, and developing targeted marketing campaigns (Bakry et al., 2021). Bank Syariah Indonesia and other banks may successfully increase their market share in Indonesia by seizing these openings, which would allow them to meet the changing demands of Indonesia's millennial customer base.
Consumers' widespread embrace of new digital technology has prompted businesses to rethink their approaches to marketing and customer service. Because of the proliferation of digital resources, businesses must rethink their marketing strategies and get a more nuanced appreciation for their customers' needs. Focusing on deposit money institutions in Nigeria, this research will analyze how digital marketing has affected consumer satisfaction (Onobrakpeya and Mac-Attama, 2017).
Statistical methods such as simple percentage, correlation, and multiple regression analysis were used in this cross-sectional survey study. In a survey of Nigerian deposit money banks, e-mail marketing was shown to have the most beneficial impact on customer satisfaction. This research demonstrates that consistent e-mail contact is highly valued by consumers since it increases their pleasure and adds value to their experience. The research found that businesses whose websites had higher-quality content ranked higher in search engine results and were in a better position to succeed in the marketplace (Onobrakpeya and Mac-Attama, 2017). Adopting digital marketing methods may increase customer satisfaction in the Nigerian banking industry, notably via efficient e-mail communication and high-quality website content. Businesses that take advantage of these shifts in the digital sphere will likely forge deeper bonds with their clients and get an edge in the increasingly competitive marketplace.
Csikósová et al. (2016) analyze the success of marketing initiatives inside one bank and recommend key performance indicators for raising productivity across the board. To achieve good future developments and improved results for individual marketing efforts, the method uses the Balanced Scorecard to set strategic objectives in line with the enterprise's business plan. One of the prerequisites for the business is having a fully operational marketing department with sufficient funds for marketing costs. Key performance indicators are then used to measure the success of the marketing efforts and evaluate them against the competition. The research shows that the company's loan growth rate is lower than the market growth rate and that client profitability has decreased. The report proposes strategic solutions to these problems, with an emphasis on measuring success from the standpoint of consumers. The bank may pinpoint trouble spots and realign marketing initiatives to better serve customers by developing a strategic plan that places a premium on customer-focused performance assessment (Csikósová et al., 2016). This strategy should increase loan growth and customer profitability, improving the bank's overall performance and making it more competitive in the banking industry.
García and Vila (2020) examine whether nudging has any effect on the behavior of financially literate people, such as seasoned finance and pension experts. The research involves a field experiment with participants from a major Spanish life and pension firm. The participants were financially literate and understood the need to save, but the experiment showed that this knowledge was not enough to motivate them to start saving. The results highlight the need to combine two key elements to motivate people to take action. The first is a set of rational tools: financial literacy and an understanding of the financial implications of retirement. However, these elements alone are insufficient to bring about a noticeable shift in behavior. Nudging is the second essential factor; it entails small prompts that help people make the right decisions without limiting their freedom of action. The study reveals a successful strategy for promoting long-term saving practices by combining rational instruments with nudging tactics, especially by using the default choice (García and Vila, 2020). This method is effective even for financially educated people, such as professionals who are well-versed in finance and pensions, since it closes the gap between information and action, resulting in better retirement savings and planning.
The financial sector has been looking at new ways to improve database marketing. Researchers have struggled to develop a robust analytical framework due to the peculiarities of bank marketing data. Despite several efforts to address the problem of skewed data and improve the accuracy with which Artificial Neural Networks can forecast their customers' intents, the problem persists (Ghatasheh et al., 2020). This study aims to improve the predictability of bank customers' desire to apply for a term deposit, particularly on highly skewed data sets. The research suggests enhanced Artificial Neural Network models that take a cost-sensitive approach to tackling the problem of skewed data. The strategy shown here is an attempt to lessen the devastating effects of unbalanced data without compromising the integrity of the primary data sets. The resulting models are extensively evaluated, verified, and compared to other machine-learning models. An actual telemarketing dataset collected from a Portuguese bank is used for the studies. The banking sector is funding this study to improve the accuracy of predictions made by Artificial Neural Networks and to circumvent the problems caused by skewed data (Ghatasheh et al., 2020). The study's overarching goal is to help financial institutions develop more efficient and precise marketing campaigns by increasing the precision with which they estimate consumers' interest in applying for term deposits using cost-sensitive methodologies.
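One common way to make a learner cost-sensitive, without altering the underlying data, is to weight each example's loss by the inverse frequency of its class. The sketch below illustrates that general idea in Python; it is not the cited authors' neural network, and the function names, the weighting formula n / (k * count), and the ten-row toy labels are assumptions made for illustration.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_log_loss(labels, probs, weights):
    """Mean binary cross-entropy with a per-class weight on each example."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(labels)


# Hypothetical skewed sample: nine non-subscribers and one subscriber.
labels = [0] * 9 + [1]
weights = balanced_class_weights(labels)
print(weights)
print(weighted_log_loss(labels, [0.5] * 10, weights))
```

With these weights, an error on the single subscriber costs nine times as much as an error on a non-subscriber, so the two classes contribute equally to the total loss.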
Ilham et al. (2019) examine the effectiveness of a telemarketing approach designed to increase sales of a bank's long-term deposit products. The main goal is to create a model that can identify which consumers are most likely to boost the business's bottom line. The research examines the effectiveness of several classification algorithms by comparing their Area Under Curve (AUC) and Accuracy scores on the Portuguese bank marketing dataset available in the UCI Machine Learning Repository. With an Accuracy of 97.07% and an AUC of 0.925, the experimental findings show that Support Vector Machine (SVM) yields positive results. As a result, SVM seems to be the best option among the classification algorithms compared for identifying potential clients who will be interested in time deposit products promoted through telemarketing (Ilham et al., 2019). When used to promote long-term deposit products through phone or cellular communication channels, SVM may help the bank more precisely target and engage prospective consumers, thus increasing company profitability.
3.1 Introduction
The research “Exploring the Factors Influencing Term Deposit Subscription in Direct Marketing Campaigns: A Case Study of a Portuguese Banking Institution” uses a mixed approach, in which both quantitative and qualitative methods are used to perform the study. In the qualitative method, previously published articles related to the topic are reviewed, and information is gathered from them to gain insight for this study. These articles give an in-depth understanding of the concept of term deposits and the factors that influence consumers to opt for a term deposit subscription, on the basis of which marketing strategies are recommended for the Portuguese banking institution to promote term deposit subscriptions.
3.2 Research Onion
Exploration and discovery are at the core of research, with each new finding bringing us closer to the truth. Research is like an onion; it has many layers that need to be unveiled. The "Research Onion" is a framework for thinking about the many steps involved in planning and carrying out a study. Let's peel back the layers of the Research Onion to have a better grasp of this useful method. The Research Onion is a metaphor for the several steps that must be taken to complete a study (Melnikovas, 2018). Saunders and his colleagues created this approach to help researchers make well-informed choices during their research. Each sub-layer of the Research Onion communicates with the others, shaping the whole investigation. Like peeling an onion, a well-structured study needs researchers to give careful consideration to each layer of the Research Onion. Validity, reliability, and applicability of research results may be ensured by methodical exploration of these levels (Sahay, 2016).
Figure 1: Research Onion
Source: (Sahay, 2016)
3.3 Data collection and analysis
An essential stage of this research project is the gathering and examination of data (Al-Ababneh, 2020). The first part of the research is qualitative and entails a thorough literature evaluation of publications that have already been written on the topic of term deposit subscriptions. The quantitative part of the study now examines the copious customer data gathered from the Portuguese bank. Effective marketing strategies may be informed by the findings of empirical research, which include collecting and systematically analyzing data to identify patterns, correlations, and trends.
The quantitative data collection and analysis process encompasses several key steps:
Data Gathering: The Portuguese financial institution's customer database is mined for information on demographics like age, income, level of education, and more. This information may be used to better understand customer habits and inclinations.
Data Preprocessing: The obtained data must first be cleansed, organized, and processed in preparation for analysis. Addressing missing values, normalizing variables, and checking for data inconsistencies all fall under this category.
Visualization using Power BI: The research makes use of Power BI, a robust data visualization tool, to provide graphical depictions of customer behavior. Generated charts and graphs make it easier to see how various factors influence consumers' decisions to open a term deposit account.
Prediction Modeling with R Programming: Different prediction models, such as Linear Regression, Random Forest, and Support Vector Machine (SVM), are constructed and evaluated using R programming to forecast consumers' actions. The prediction powers of these models are determined by training and testing using consumer data.
3.4 CRISP-DM framework
The CRISP-DM (Cross-Industry Standard Process for Data Mining) framework will be instrumental in conducting the study "Exploring the Factors Influencing Term Deposit Subscription in Direct Marketing Campaigns: A Case Study of a Portuguese Banking Institution" based on the provided dataset.
Figure 2: CRISP-DM framework
Source: (Hotz, 2023)
3.4.1 Business Understanding
In the context of the research topic "Exploring the Factors Influencing Term Deposit Subscription in Direct Marketing Campaigns: A Case Study of a Portuguese Banking Institution," the business understanding phase of the CRISP-DM framework entails learning everything possible about the banking institution's goals and the unique difficulties associated with term deposit subscriptions. Term deposit product research requires an understanding of the banking industry's dynamics, client behavior patterns, and product nuances. This phase lays the groundwork for the succeeding phases of data collection, analysis, and model creation by matching the research objectives with the goals of the institution and addressing their particular requirements. It guarantees that the study's findings aren't only theoretically interesting, but practically useful for improving direct marketing efforts and increasing term deposit signups.
The goals achieved by this project are:
• Data mining and visualization to establish a relationship between a predictor and an outcome.
• Cleaning and reducing the data to a manageable subset for modeling.
• Optimization of default prediction accuracy by application-level modifying and training of three classification algorithms: Linear Regression, Random Forest, and Support Vector Machine.
• Evaluating the performance of the classification algorithms by ROC, AUC, Confusion Matrix, Precision, Recall, and F1 score.
3.4.2 Data Understanding
The dataset containing customer information, campaign details, and subscription outcomes will be thoroughly examined. This involves assessing data quality, identifying relevant attributes, and understanding their meanings.
Source of Data
The data is available in Kaggle from where it has been downloaded and used in this study (kaggle.com, 2022). The information comes from a Portuguese bank's direct-marketing efforts. The marketing efforts relied heavily on telemarketing. It usually took many interactions with the same customer to determine whether the product (a bank term deposit) would be subscribed to.
Data Dictionary
The dataset has 21 columns and 41,188 observations, which record the marketing initiatives undertaken by the Portuguese banking institution for term deposits.
age: Age of the client (numeric)
job: Type of job the client has (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
marital: Marital status of the client (categorical: 'divorced', 'married', 'single', 'unknown')
education: Level of education of the client (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
default: Whether the client has credit in default (categorical: 'no', 'yes', 'unknown')
housing: Whether the client has a housing loan (categorical: 'no', 'yes', 'unknown')
loan: Whether the client has a personal loan (categorical: 'no', 'yes', 'unknown')
contact: Communication type used to contact the client (categorical: 'cellular', 'telephone')
month: Last contact month of the year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
day_of_week: Last contact day of the week (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
duration: Duration of the last contact in seconds (numeric)
campaign: Number of contacts performed during this campaign for this client (numeric)
pdays: Number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
previous: Number of contacts performed before this campaign for this client (numeric)
poutcome: Outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success')
emp.var.rate: Employment variation rate - quarterly indicator (numeric)
cons.price.idx: Consumer price index - monthly indicator (numeric)
cons.conf.idx: Consumer confidence index - monthly indicator (numeric)
euribor3m: Euribor 3-month rate - daily indicator (numeric)
nr.employed: Number of employees - quarterly indicator (numeric)
y: Whether the client subscribed to a term deposit (categorical: 'no', 'yes')
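Two conventions in the dictionary above need care before analysis: the value 999 in pdays is a sentinel rather than a real number of days, and several categorical columns use 'unknown' as a level. The following minimal Python sketch shows one way to handle both; the function name `clean_record`, the derived field `previously_contacted`, and the choice to treat 'unknown' as a missing value are assumptions for illustration, not steps confirmed by the study.

```python
def clean_record(record):
    """Return a copy with sentinel values replaced by None and the target encoded."""
    cleaned = dict(record)
    # Assumption: 'unknown' is treated as a missing value in these columns.
    for column in ("job", "marital", "education", "default", "housing", "loan"):
        if cleaned.get(column) == "unknown":
            cleaned[column] = None
    # 999 means the client was not previously contacted, not "999 days ago".
    pdays = int(cleaned["pdays"])
    cleaned["previously_contacted"] = pdays != 999
    cleaned["pdays"] = None if pdays == 999 else pdays
    # Encode the outcome as 1 (subscribed) or 0 (did not subscribe).
    cleaned["y"] = 1 if cleaned["y"] == "yes" else 0
    return cleaned


# Hypothetical row using column names from the data dictionary.
record = {
    "age": "41", "job": "unknown", "marital": "married",
    "education": "basic.9y", "default": "no", "housing": "yes",
    "loan": "no", "pdays": "999", "y": "no",
}
print(clean_record(record))
```

Keeping a separate `previously_contacted` flag preserves the information carried by the sentinel while preventing 999 from distorting averages or scaled values of pdays.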
EDA
The dataset collected during the direct marketing campaign run by the Portuguese financial organization will be analyzed using exploratory data analysis (EDA). Customers' ages, occupations, levels of education, preferred ways of communication, and the final results of their subscriptions are just some of the demographics, behaviors, and interactions that may be uncovered by using EDA. The variables that affect subscriptions to term deposits may be better understood with the use of EDA's visual representation of distributions, correlations, and trends. Future statistical analysis and model construction will be informed by these findings, allowing for data-driven optimization of marketing tactics and increased efficacy of campaigns.
3.4.3 Data Preparation
Data preprocessing will be performed to clean, transform, and integrate the data. This includes handling missing values, encoding categorical variables, and scaling numerical features (Dåderman and Rosander, 2018).
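Two of the preparation steps named above, encoding categorical variables and scaling numerical features, can be sketched as follows. This is an illustrative Python example under stated assumptions (one-hot encoding and min-max scaling are one reasonable choice among several); the function names and the three-row sample are hypothetical, and the study itself reports using R.

```python
def one_hot(values):
    """Encode a categorical column as 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


def min_max_scale(values):
    """Rescale a numeric column linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample of the 'contact' and 'age' columns.
contact = ["cellular", "telephone", "cellular"]
age = [30, 45, 60]

categories, encoded = one_hot(contact)
print(categories, encoded)
print(min_max_scale(age))
```

Scaling matters most for distance-based or margin-based learners such as SVM, where a feature measured in large units would otherwise dominate the others.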
Feature Selection and Correlation
Feature selection and correlation analysis will play a pivotal role in the study of "Exploring the Factors Influencing Term Deposit Subscription in Direct Marketing Campaigns" using the Portuguese marketing dataset. By employing advanced statistical techniques, we will identify and prioritize the most influential variables, streamlining the model's complexity and enhancing its predictive power. Correlation analysis will unveil interdependencies among variables, highlighting significant associations between client attributes and subscription outcomes. This process will provide actionable insights, aiding in the identification of key drivers and guiding strategic decisions to optimize future marketing campaigns for the Portuguese banking institution.
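As a concrete instance of correlation analysis, the Pearson coefficient between a numeric attribute and the 0/1 subscription outcome can be computed directly from its definition. The Python sketch below is illustrative only; the function name `pearson` and the four-row sample of call durations are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sample: call duration in seconds and subscription outcome.
duration = [100, 200, 300, 400]
subscribed = [0, 0, 1, 1]
print(round(pearson(duration, subscribed), 4))
```

A coefficient near +1 or -1 indicates a strong linear association, while a value near 0 indicates little linear association; it does not by itself establish that the attribute causes the outcome.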
3.4.4 Modeling
Machine learning models will be applied to the prepared data to explore factors affecting term deposit subscriptions. Techniques like Linear regression, Random Forest, and SVM may be used to predict subscription outcomes and identify significant predictors.
Random Forest
The dataset will be analyzed using the Random Forest technique, and the variables that affect subscriptions to term deposits will be investigated. Random Forest builds an ensemble of decision trees to find useful predictors such as client demographics, communication channels, campaign timing, and economic factors. It will evaluate the significance of features and their relationships, shedding light on successful advertising approaches. The algorithm's ability to model complex relationships while limiting overfitting makes it well suited to identifying the factors behind successful term deposit subscriptions in the Portuguese financial institution's direct marketing efforts.
Linear Regression
Linear Regression will be used to examine the relationship between the study's dependent variable (term deposit subscription) and its independent variables (customer age, campaign length, etc.). It will quantify the impact of each factor on subscription rates by fitting a linear equation to the data. By identifying relevant predictors, this approach will shed light on the nature and strength of these relationships. Its ease of use and interpretability make Linear Regression a useful tool for understanding what drives new term deposit subscriptions in the Portuguese financial institution's direct marketing operations.
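Fitting a linear equation with a single predictor reduces to two closed-form least-squares estimates, which the Python sketch below computes. It is an illustration under assumptions, not the study's model: the function name `fit_simple_ols` and the four-row sample (call duration in minutes against a 0/1 outcome) are hypothetical.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for the line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("predictor is constant; slope is undefined")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical sample: call duration in minutes and subscription outcome.
duration_min = [1, 2, 3, 4]
subscribed = [0, 0, 1, 1]

intercept, slope = fit_simple_ols(duration_min, subscribed)
print(intercept, slope)
```

Because the outcome is binary, the fitted values of a linear model are not constrained to lie between 0 and 1 (here the fitted value at one minute is negative), which is worth keeping in mind when the coefficients are interpreted as effects on subscription probability.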
SVM
Using information about customers and campaigns, SVM seeks to locate a hyperplane that separates subscribers from non-subscribers as accurately as possible. It will classify prospective customers as subscribers or non-subscribers, which will aid in determining which characteristics matter most for each group. Owing to its capacity to handle non-linear relationships and high-dimensional data, SVM will help uncover the intricate relationships that affect term deposit subscription. The study's use of SVM will help the Portuguese bank better understand what makes people sign up for new term deposits in response to its direct marketing efforts.
3.4.5 Evaluation
The models' performance will be evaluated using appropriate metrics like accuracy, precision, recall, and F1-score. This step ensures the models' reliability in capturing subscription patterns.
Confusion Matrix
To evaluate the efficacy of a classification model, a confusion matrix may be employed. True positives (successfully predicted positive outcomes), true negatives (correctly predicted negative outcomes), false positives (incorrectly predicted positive outcomes), and false negatives (incorrectly predicted negative outcomes) are all clearly distinguished (Theissler et al., 2022). To better understand the strengths and shortcomings of a model for generating predictions, it may be evaluated across a wide range of metrics using this matrix.
Figure 3: Confusion Matrix
Source: (Theissler et al., 2022)
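The four cells of the confusion matrix can be counted directly from paired actual and predicted labels. The following is a minimal Python sketch; the function name `confusion_counts` and the eight-customer sample are hypothetical, with 1 denoting a subscriber and 0 a non-subscriber.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),
    }


# Hypothetical outcomes for eight customers.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
print(confusion_counts(actual, predicted))
```

Here two subscribers are correctly identified, one is missed (a false negative), and one non-subscriber is wrongly flagged (a false positive).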
ROC and AUC
ROC and AUC are essential tools for evaluating the performance of binary classification models. The ROC curve is a graphical representation that illustrates the trade-off between a model's true positive rate (sensitivity) and false positive rate (1-specificity) at various classification thresholds. A model with a higher true positive rate and lower false positive rate will have an ROC curve closer to the upper-left corner, indicating superior performance.
Figure 4: ROC and AUC
Source: (Muschelli III, 2020)
AUC, on the other hand, quantifies the overall performance of a model by measuring the area under the ROC curve. AUC ranges from 0 to 1, where a higher AUC value signifies better discrimination and classification accuracy. An AUC of 0.5 indicates random guessing, while an AUC of 1 represents a perfect model. AUC is particularly useful for comparing different models and selecting the best-performing one.
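As a minimal sketch of how an ROC curve and its AUC can be obtained, the example below sweeps a threshold over predicted scores and integrates the resulting curve with the trapezoidal rule. The labels and scores are hypothetical and are not taken from the bank dataset; the function names are illustrative only.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Each distinct score, from highest to lowest, is used as a threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = subscribed) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

roc = roc_points(y_true, scores)
roc_auc = auc(roc)
print(round(roc_auc, 3))
```

In this toy example eight of the nine subscriber/non-subscriber pairs are ranked correctly, which is what an AUC of 8/9 expresses.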
Accuracy, precision, recall and F1-score
Fundamental measures used to assess the effectiveness of classification models include accuracy, precision, recall, and F1-score.
Accuracy assesses the overall correctness of a model: it is the percentage of correct predictions relative to the total number of predictions. However, if the classes are not balanced, accuracy can be misleading.
Precision measures the quality of positive predictions: it indicates what percentage of predicted positives were actually positive. It is most useful in situations where false alarms are costly.
Recall, sometimes called sensitivity or true positive rate, measures how well a model finds the actual positives: it is the percentage of actual positives that were correctly predicted. It is essential in situations where missing a true positive is more harmful than raising a false positive.
F1-score is the harmonic mean of precision and recall and so balances the two. It is particularly helpful when there is a significant class imbalance.
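The sketch below illustrates, with hypothetical counts, why accuracy alone can be misleading on an imbalanced dataset. The counts describe an imagined test set with 100 subscribers among 1,000 customers and a model that predicts "no" almost everywhere; they are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 100 actual subscribers, only 10 of them detected.
m = classification_metrics(tp=10, fp=5, fn=90, tn=895)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is above 90% even though only one subscriber in ten is found, which the low recall and F1-score make visible.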
3.4.6 Deployment
Insights gained from the models will be translated into actionable recommendations for the banking institution. Strategies for targeted campaigns, optimal contact channels, and timing adjustments can be suggested based on the study's findings. This research did not cover the CRISP-DM framework's sixth stage.
The final step involves revisiting the initial business objectives and validating whether the study's outcomes align with the institution's goals. Adjustments or additional analyses may be conducted if necessary.
3.5 KDD - Knowledge Discovery in Databases
Knowledge Discovery in Databases (KDD) will play a significant part in the research project named "Exploring the Factors Influencing Term Deposit Subscription in Direct Marketing Campaigns: A Case Study of a Portuguese Banking Institution." In this method, insights are extracted from the dataset via a series of interrelated processes.
Selection: The study starts by selecting a relevant dataset from the direct marketing campaigns of a Portuguese banking institution. It contains the variables age, job, marital, education, default, housing, loan, contact, month, day_of_the_week, duration, campaign, pdays, previous, poutcome, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m and y.
Preprocessing: Data preprocessing involves cleaning and transforming the dataset, handling missing values, and standardizing formats as this step ensures the data's quality and prepares it for analysis (Munir and Anjum, 2018).
Transformation: During this stage, the data will be transformed into a suitable format for analysis. Features such as customer age, job, contact, and campaign are identified and extracted.
Data Mining: Using various data mining techniques, the study will explore correlations and patterns within the dataset. Machine learning algorithms, such as logistic regression, Random Forest and SVM, will be applied to identify key factors influencing term deposit subscriptions.
Interpretation/Evaluation: The results of the data mining process will be interpreted to gain meaningful insights into the factors impacting subscription decisions. Model performance will be evaluated using metrics like accuracy, precision, recall, and F1-score.
Knowledge Representation: The findings will be represented visually through graphs, charts, and tables. This will facilitate a clear understanding of the relationships between different variables and their impact on term deposit subscriptions.
Deployment: The insights gained from the KDD process will be applied to optimize future direct marketing campaigns for the banking institution. Strategies for targeting specific customer segments, refining communication channels, and optimizing campaign timing can be devised based on the study's conclusions.
4.1. Overview
Using data from a Portuguese bank, this chapter applies the CRISP-DM framework to assess the performance of machine learning algorithms developed to forecast subscriptions to term deposits. The analysis begins with data preprocessing. This includes data cleaning by addressing missing values, handling duplicates, and handling outliers in specific numeric columns. Categorical variables (job, education, marital, month, etc.) are encoded into numeric form for machine learning compatibility. Feature selection through PCA identifies and extracts the most significant variation in the data by creating linear combinations of variables, reducing dimensionality while retaining essential information and minimizing multicollinearity. To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was employed to produce a balanced dataset for modeling. The data is then split into training and testing sets using both 80:20 and 70:30 ratios for the machine learning models. The CRISP-DM framework's flexibility in task sequencing allows data preparation activities to continue throughout the modeling phase.
After the models were trained, they were evaluated using pre-set metrics, with a focus on selected criteria such as the area under the curve (AUC). Comparing the models' results objectively revealed their advantages and disadvantages and made it possible to determine which model performed best. By highlighting the flexibility of the CRISP-DM framework, this chapter presents a systematic method for improving and choosing the most effective machine learning algorithms for term deposit subscription prediction.
Finally, a C5 model is run in SPSS to generate a rule set for customers who subscribe to the term deposit. These rules are then used for customer profiling, and tailored marketing strategies are recommended based on these profiles.
4.2. EDA
4.2.1 Demographic Profiles of Customers Who Subscribe to a Term Deposit
Figure 5 shows the demographic profile dashboard of customers who subscribed to a term deposit.
Figure 5: Demographic Profiles dashboard
The proportion of customers who subscribed to a term deposit among the total customer base is 11%. These subscribers exhibit specific profile characteristics.
Type of Job: Term deposit subscribers are primarily in management roles, possibly due to their higher income and financial awareness. Technicians, known for job stability, also show interest. Blue-collar workers, while numerous, may subscribe less due to lower disposable income.
Age Group: The most common age group among term deposit subscribers is the early 30s, followed by young adults and individuals in their late 50s. Individuals in their early 30s are in a phase of life where they may be more focused on saving and investing. Young adults may not yet have substantial savings, while individuals in their late 50s might be preparing for retirement and hence show interest in term deposits.
Marital Status: Term deposit subscribers are mostly married, possibly due to shared financial responsibilities and long-term planning. Singles may prioritize different goals, and divorced individuals could seek financial stability.
Education Level: Term deposit subscribers often have university degrees, which may indicate higher income and financial literacy. Individuals with primary or high school education may have fewer financial resources or less financial awareness.
Month: The preferred months for term deposit subscription are May, August, July, and June, which span the second and third quarters. These months may coincide with bonus periods or improved financial prospects, motivating investments. In Portugal, this trend could be linked to the fact that the 13th and 14th month salaries must be paid in summer and in mid-December. The summer payment period aligns with the increased interest in term deposits in these months (Portugal, May 12, 2023).
Day of Week: Subscribers engage most on Thursdays, possibly because they have more time for financial decisions later in the week. Tuesdays and Wednesdays fall earlier in the workweek, when people are likely to be more occupied.
Contact Mode: The primary mode of contact among term deposit subscribers is cellular, accounting for 83.04% of communications, possibly because of its convenience.
Housing Loan & Personal Loan: Clients with a housing loan and no personal loan have the highest subscription rate (12%), followed closely by those with both a housing and a personal loan. This suggests that those with a housing loan but no personal loan may have more financial stability for investment due to their property commitment and lower immediate financial obligations.
Default Credit: There are 4197 customers with no credit in default who have subscribed to a term deposit. This group is more likely to subscribe due to their responsible financial history, making them suitable candidates for term deposits.
4.2.2 Impact of Social Economic variable on term deposit subscriptions
Figure 6: Impact of Social Economic variable
Figure 6 shows the impact of the socio-economic variables on term deposit subscription.
Upon close examination of this dashboard, it appears that the socio-economic variables may have a substantial impact on term deposit subscription. This observation stems from the fact that both the term deposit Yes and No categories exhibit similar trends.
Figure 7: Impact of Social Economic variable towards TD subscription
Emp.Var.Rate: When emp.var.rate is negative (e.g., -3.4, -3.0, -2.9, -1.8, -1.7, -1.1), it suggests an economic downturn or instability. When emp.var.rate is positive (e.g., 1.1, 1.4), it suggests economic growth and stability. The count of yes responses (TD subscription) is generally higher during periods of positive emp.var.rate.
Cons.Price.Idx: Generally, it appears that when the consumer price index is lower (indicating lower inflation), there is a higher count of yes responses for term deposit subscriptions. This suggests that lower inflation tends to be associated with a higher likelihood of people subscribing to term deposits.
Euribor3m (Interest Rate): euribor3m is an important indicator that can influence individuals' decisions, including whether to subscribe to a term deposit. Clients may be inclined to invest in term deposits depending on prevailing interest rates. For example, at a euribor3m rate of 4.875 there are 72 clients who subscribed to a term deposit, but on the whole there is no identifiable trend in this data.
Cons.Conf.Idx: The graph reveals that term deposit subscriptions have occurred at a range of cons.conf.idx values, both higher and lower.
Nr.employed: There does not appear to be a consistent trend between the number of employees and term deposit subscription. While some yes responses occur at higher values of nr.employed, there are also yes responses at lower values.
4.2.3 Key factors influencing Subscription behaviour
Figure 8: Term deposit Subscription by Campaign
For reference, the variables are described as follows:
Campaign - number of contacts performed during this campaign for this client
Poutcome - outcome of the previous marketing campaign
Previous - number of contacts performed before this campaign for this client
For customers contacted once during the campaign (Campaign = 1), there are 2,300 customers who subscribed to a term deposit. As the number of campaign contacts increases, the count of customers who subscribed generally decreases. For instance, when the campaign contact count is 2 (Campaign = 2), there are 1,211 customers who subscribed. This trend continues, with a decreasing count of subscribers as the number of campaign contacts increases, until the last few data points where there are only a small number of subscribers for a high number of campaign contacts.
Figure 9: Campaign by poutcome, TD subscription and Previous
Customers who had a "poutcome" of "failure" in previous campaigns were contacted between 1 and 5 times during the current campaign ("Count of campaign"). Despite multiple contacts, they still subscribed to a term deposit ("TD Subscription = Yes"). Similarly, customers who had a "poutcome" of "success" in previous campaigns were contacted during the current campaign ("Count of campaign") and also subscribed to a term deposit ("TD Subscription = Yes").
Figure 10: poutcome by TD subscription
The graph suggests a strong positive association between the outcome of the previous marketing campaign (“poutcome”) and the likelihood of a customer subscribing to a term deposit in the current campaign.
Figure 11: Previous by TD subscription
Customers who had fewer previous contacts (0 or 1) are more likely to subscribe to a term deposit, with the majority of them doing so. As the number of previous contacts increases (2, 3, 4, 5, 6, or 7), the likelihood of subscription decreases. Customers with more than 4 previous contacts are less likely to subscribe to a term deposit.
4.3. Data Pre-processing: Structure and Cleaning of Dataset
Figure 12: Structure of the Dataset
The dataset consists of 21 variables and 41188 observations.
Figure 13: Dataset after removing Duplicates
In the dataset, after removing duplicates there are 21 variables and 41176 observations.
Figure 14: Missing Values
There are no missing values in the dataset.
Figure 15: Checking outliers.
Figure 16: Removing outlier
With 41188 observations, the dataset is large enough to accommodate outliers without impeding the analysis. Notably, the selected algorithms, namely Random Forest and SVM, are relatively robust to outliers and can handle extreme values without being unduly influenced by them. Outliers can also carry valuable information or represent rare but significant occurrences, so eliminating them might discard important insights. The outliers are therefore retained, allowing a comprehensive exploration of the dataset.
Figure 17: Converting Char columns to numeric
Label encoding transforms categorical character columns into numeric format by assigning a unique integer to each distinct category. This conversion makes non-numeric data compatible with machine learning models, enabling them to process and derive insights from categorical features.
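A minimal sketch of label encoding is given below. It assumes that categories are numbered in alphabetical order starting from zero; the numbering used by the actual analysis software may differ, and the sample values are illustrative.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted (alphabetical) order."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Illustrative values for the "marital" column.
marital = ["married", "single", "divorced", "married"]
codes, mapping = label_encode(marital)
print(mapping)
print(codes)
```

The integer codes carry no real order, so a model may read "single" as larger than "divorced"; this is a known limitation of label encoding for unordered categories.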
Figure 18: Standardization of numeric column
Standardization, in this context, involves scaling numeric columns to have a mean of zero and a standard deviation of one. It helps ensure that all variables are on the same scale, preventing certain features from dominating the modeling process due to their larger magnitude. This process improves the stability and performance of machine learning algorithms by making them less sensitive to the relative scale of input features.
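The scaling described above can be sketched as follows. The ages are hypothetical, and the sample standard deviation (dividing by n - 1) is an assumption about how the scaling was carried out.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sigma for v in values]


# Hypothetical ages used only to illustrate the transformation.
ages = [25, 35, 45, 55]
z = standardize(ages)
print([round(v, 3) for v in z])
```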
The variable “duration” is highly correlated with the target variable “y” (i.e., term deposit subscription) because it represents the duration of the call made to the client, and longer calls may be more likely to result in a positive outcome. However, in practice the duration of a call is not known before the call is made, so including it would be a form of data leakage. The “duration” variable is therefore dropped.
Figure 19: PCA code
Principal Component Analysis (PCA) is a powerful technique used in data analysis for dimensionality reduction and noise reduction, applied here as the feature selection step. It works by combining the original features into new components that capture the most significant variation in the data.
Figure 20: PCA results
In the PCA result, the first principal component (PC1) has the highest standard deviation (1.9612), indicating that it captures the most variance in the dataset: PC1 alone explains 42.74% of the total variance. Cumulatively, the top five components (PC1 to PC5) explain 89.57% of the total variance. This dimensionality reduction simplifies the data while retaining a substantial amount of the essential information, which makes it valuable for model building.
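The reported figures can be related through a short calculation: the variance of a component is the square of its standard deviation, and its share is that variance divided by the total variance. The sketch below uses only the two values quoted above. The interpretation of the result assumes the PCA was run on standardized columns, in which case the total variance equals the number of input variables; that count is an inference and is not stated in the text.

```python
pc1_sd = 1.9612     # reported standard deviation of PC1
pc1_share = 0.4274  # reported proportion of variance explained by PC1

pc1_variance = pc1_sd ** 2
implied_total_variance = pc1_variance / pc1_share

print(round(pc1_variance, 4))
print(round(implied_total_variance, 2))
```

The implied total variance is very close to 9, which would be consistent with nine standardized variables entering the PCA.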
Figure 21: Target variable y before and after SMOTE
The bar plot of the target variable “y” (i.e., term deposit subscription) after applying SMOTE shows a more balanced distribution, where the counts of both classes are approximately equal. This balanced dataset is crucial for training machine learning models that can make fair predictions for both positive and negative outcomes.
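The core idea of SMOTE, creating synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as below. This is a simplified illustration and not the implementation used in the study; the points, the neighbour count k and the random seed are all hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority rows by interpolating between neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical minority-class points with two numeric features.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
majority_count = 10

new_rows = smote(minority, n_new=majority_count - len(minority))
balanced_minority = minority + new_rows
print(len(balanced_minority))
```

Because each synthetic row lies on a line segment between two real minority rows, the new points stay within the range of the observed minority data.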
4.4. Logistic Regression Model
Logistic regression is a statistical technique used to model the association between a binary dependent variable and a set of independent variables. In the context of the Portuguese banking dataset, the aim is to predict whether a customer will subscribe to a term deposit, with two possible outcomes: ‘yes’ or ‘no’. The dataset includes a range of customer attributes, such as age, job, marital status, and education level, which are relevant both to customers and to the bank’s marketing efforts. Logistic regression uses these predictor variables to estimate the binary outcome. Within the model, each attribute is assigned a weight (coefficient) that reflects its significance and its role in shaping the final prediction. The model applies the logistic function to generate probability scores ranging from 0 to 1. To convert these probabilities into binary predictions, a threshold value, typically 0.5, is set: predictions above the threshold are assigned to one outcome and those below it to the other. Evaluation metrics such as accuracy, precision, recall, and the area under the receiver operating characteristic (ROC) curve are then used to measure how accurately the model classifies customers into subscription and non-subscription categories.
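The prediction step described above can be sketched as follows. The coefficients, intercept and feature values are hypothetical and are not the fitted values from this study.

```python
import math


def predict_probability(features, weights, intercept):
    """Apply the logistic function to a weighted sum of the features."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Turn a probability into a binary label using a threshold."""
    return "yes" if probability >= threshold else "no"


# Hypothetical coefficients for two standardized features.
weights = [0.8, -0.5]
intercept = -0.2

p = predict_probability([1.0, 0.4], weights, intercept)
label = classify(p)
print(round(p, 4), label)
```

A positive weight raises the predicted probability as the feature increases, while a negative weight lowers it.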
The result generated for Logistic Regression after splitting the dataset into 70:30
Table 1: Logistic Regression AUC, Accuracy, Precision, Recall, and F1 Score
Figure 22: ROC curve for Test and Train data-70:30 (without feature Selection)
Figure 23: ROC curve for Test and Train data-70:30 (with feature Selection)
The result generated for Logistic Regression after splitting the dataset into 80:20
Table 2: Logistic Regression AUC, Accuracy, Precision, Recall, and F1 Score
Figure 25: ROC curve for Test and Train data-80:20 (with feature Selection)
The ROC curve visually illustrates the trade-off between a model’s ability to correctly identify positive cases and its tendency to incorrectly classify negative cases as positive. A “sharp elbow” on the ROC curve represents a point where the model achieves a significant improvement in sensitivity without a substantial increase in false positives. This elbow often corresponds to a good threshold for making binary predictions, striking an effective balance between correctly identifying true positives and minimizing false positives in a classification task.
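One common way to locate such an operating point numerically is Youden's J statistic, which picks the threshold that maximizes the true positive rate minus the false positive rate. The sketch below uses this criterion as an illustration; it is not stated to be the method used in this study, and the labels and scores are hypothetical.

```python
def best_threshold(y_true, scores):
    """Return the threshold maximizing Youden's J = TPR - FPR, and that J value."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_t, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_t, best_j = threshold, j
    return best_t, best_j


# Hypothetical labels (1 = subscribed) and predicted probabilities.
y_true = [0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.5, 0.6, 0.7, 0.8, 0.9]

threshold, j = best_threshold(y_true, scores)
print(threshold, round(j, 2))
```

When false positives and false negatives carry different costs, a different criterion may be more appropriate than the equal weighting used here.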
4.5. Random Forest Model
Random Forest is an ensemble learning method that addresses several common problems in machine learning:
Overfitting: By averaging or taking a majority vote over many decision trees, the model reduces the likelihood of overfitting to noise or outliers in the training data (Schonlau and Zou, 2020).
Bias-Variance Trade-off: Random Forest balances bias and variance, achieving better generalization on unseen data. Each tree in the forest might have high variance, but when combined, their variances cancel out, resulting in a model with less variance and reasonably low bias (Schonlau and Zou, 2020).
Handling Large Data with Many Features: Random Forest can efficiently handle datasets with a large number of features, determining feature importance and offering insights into the most influential predictors in the dataset.
When applying the Random Forest algorithm to the Portuguese banking dataset to predict term deposit subscriptions:
Feature Importance: One of the standout advantages of Random Forest is its ability to rank features based on their importance in making accurate predictions. In the context of the Portuguese banking dataset, Random Forest can provide insights into which factors, such as age, job, previous marketing interactions, or other variables, most influence a customer's decision to subscribe to a term deposit.
Figure 26: Feature Importance of Random Forest
“pdays” has the highest Mean Decrease Accuracy (MDA) value, indicating that it is the most influential feature for predicting term deposit subscriptions. This means that the number of days since the client was last contacted (pdays) has a substantial impact on whether they subscribe. It is closely followed by “previous” and “poutcome”, indicating that both the number of contacts performed before this campaign and the outcome of the previous marketing campaign are highly relevant to current subscription predictions.
Features like “poutcome”, “age”, “previous” and “pdays” have relatively high Mean Decrease Gini (MDG) values, suggesting that these variables are important contributors to the model’s predictive accuracy. These features likely play a significant role in distinguishing between the target classes or improving the model’s overall performance.
Model Interpretability: While individual trees can be complex and hard to interpret, by examining the aggregation method and feature importance scores, stakeholders can get a good sense of what's driving predictions.
Reduced Overfitting: As an ensemble of multiple decision trees, where each tree is built using a subset of data and features, the Random Forest algorithm is less prone to overfitting than a single decision tree, making its predictions more reliable and generalizable to unseen data.
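The feature importance discussed above is read here as a mean decrease in accuracy, a measure based on shuffling one feature and observing how much accuracy falls. The sketch below is a simplified version of that permutation idea applied to a whole dataset rather than to the out-of-bag samples of each tree. The toy model stands in for a fitted forest, and all names and values are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the prediction equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, column, seed=0):
    """Drop in accuracy after shuffling the values of one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
    return baseline - accuracy(predict, permuted, labels)


def toy_model(row):
    """Hypothetical stand-in for a fitted model: uses only the first feature."""
    return 1 if row[0] > 0.5 else 0


rows = [(0.1, 5.0), (0.9, 3.0), (0.8, 1.0), (0.2, 4.0), (0.7, 2.0), (0.3, 6.0)]
labels = [toy_model(r) for r in rows]

importance_used = permutation_importance(toy_model, rows, labels, column=0)
importance_ignored = permutation_importance(toy_model, rows, labels, column=1)
print(importance_used, importance_ignored)
```

Shuffling a feature the model ignores leaves accuracy unchanged, so its importance is zero, while shuffling a feature the model relies on can only lower accuracy.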
The result generated for Random Forest after splitting the dataset into 70:30
Table 3: Random Forest AUC, Accuracy, Precision, Recall, and F1 Score
Figure 27: ROC curve for Test and Train data-70:30 (without feature Selection)
Figure 28: ROC curve for Test and Train data-70:30(with feature Selection)
The result generated for Random Forest after splitting the dataset into 80:20
Table 4: Random Forest AUC, Accuracy, Precision, Recall, and F1 Score
Figure 29: ROC curve for Test and Train data-80:20 (without feature Selection)
Figure 30: ROC curve for Test and Train data-80:20(with feature Selection)
The ROC curve is positioned near the top-left corner of the plot. This means that the model has a high True Positive Rate (TPR) while maintaining a low False Positive Rate (FPR), which is the ideal scenario.
4.6. SVM
SVM is a powerful supervised machine learning algorithm primarily used for classification and regression tasks. It offers several advantages:
High Dimensionality: SVM can handle a high-dimensional space, making it suitable for datasets with numerous features.
Maximum Margin Classifier: SVM works by finding the hyperplane that separates the classes with the maximum margin, which helps the model generalize well to unseen data (Filzmoser and Nordhausen, 2021).
Kernel Trick: SVM's capability to use different kernel functions, like polynomial, radial basis function (RBF), and sigmoid, enables it to classify data that is not linearly separable in its original space.
Handling Features: Given that the Portuguese banking dataset contains multiple features (like age, job type, and others), SVM's capability to operate in high-dimensional spaces becomes beneficial. Each feature adds a dimension, and SVM can find hyperplanes in these multi-dimensional spaces to differentiate between subscribers and non-subscribers.
Non-linear Boundaries: Banking datasets might exhibit complex relationships where a linear boundary may not suffice to separate the classes. Using the kernel trick, SVM can map the data to a higher-dimensional space, enabling the identification of non-linear decision boundaries.
Robustness: Financial datasets often contain noise, possibly due to data entry errors, outliers, or anomalies. SVM's inherent robustness ensures that the model doesn't get unduly influenced by these outliers, delivering consistent predictions.
Model Interpretability: Though SVM models, especially with non-linear kernels, aren't as interpretable as some other models, the support vectors identified can provide insights into critical data points or instances that are more challenging to classify.
Optimization: SVM focuses on optimization, ensuring that the model finds the best possible boundary (with the largest margin) between classes, leading to better generalization on unseen data.
In conclusion, these features make the SVM algorithm an attractive method for forecasting term deposit subscriptions in the Portuguese banking dataset. It can work through the data and pick out subtle patterns and connections, which supports reliable and precise forecasts. This has the potential to improve financial organizations' marketing, customer service, and subscription base.
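The kernel trick mentioned above can be illustrated with the radial basis function (RBF) kernel, where a new point is classified by its kernel similarity to the support vectors. The sketch below is not a trained model: the support vectors, coefficients, bias and gamma are hypothetical values chosen only to show the mechanics.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: similarity decays with the squared distance between x and z."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, support_vectors, coefficients, bias, gamma=0.5):
    """Weighted sum of kernel similarities; the sign gives the predicted class."""
    return bias + sum(
        c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, coefficients)
    )


# Hypothetical support vectors; each coefficient combines a weight and a class
# label (-1 for non-subscriber, +1 for subscriber).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
coefficients = [-1.0, 1.0]
bias = 0.0

near_positive = decision_value((1.8, 2.1), support_vectors, coefficients, bias)
near_negative = decision_value((0.1, -0.2), support_vectors, coefficients, bias)
print(near_positive > 0, near_negative > 0)
```

A point close to the subscriber support vector receives a positive decision value and a point close to the non-subscriber support vector a negative one, which yields a non-linear boundary in the original feature space.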
The result generated for SVM after splitting the dataset into 70:30
Table 5: SVM AUC, Accuracy, Precision, Recall, and F1 Score
Figure 31: ROC curve for Test and Train data-70:30(without feature Selection)
Figure 32: ROC curve for Test and Train data-70:30(with feature Selection)
The result generated for SVM after splitting the dataset into 80:20
Table 6: SVM AUC, Accuracy, Precision, Recall, and F1 Score
Figure 33: ROC curve for Test and Train data-80:20 (without feature Selection)
Figure 34: ROC curve for Test and Train data-80:20 (with feature Selection)
The sharp elbow in the ROC curve suggests that there is a specific threshold where the SVM model achieves an excellent balance between true positives and false positives. This threshold maximizes the model’s ability to correctly classify positive instances while keeping false alarms to a minimum. It is a valuable insight for selecting an appropriate operating point in scenarios where the costs of false positives and false negatives differ.
4.7. Profiling the Customer by running C5 model in SPSS
Customer profiling with C5 models in SPSS is a powerful analytical approach that enables organizations to gain a comprehensive understanding of their clientele. By extracting valuable insights and patterns from data, this method facilitates the creation of distinct customer profiles, allowing businesses to tailor their marketing strategies and services to individual preferences and needs.
Figure 35: C5 model
Figure 36: Predictor Importance
The figures show the predictor importance and the rule set of the C5 model.
Each rule in the rule set includes a confidence score that indicates its accuracy or reliability. A higher confidence score suggests that the rule is more dependable in making predictions.
Patterns and commonalities among the rules are also informative: specific variables or conditions that appear frequently across multiple rules provide insights into the factors that influence customer behavior.
Figure 37: Results
The analysis of the C5.0 model's performance reveals that it achieved an overall classification accuracy of 91.47%, denoting a commendable level of predictive accuracy. Nonetheless, it is imperative to acknowledge instances where the model exhibited errors, classifying 6,236 data points as "no" when they should have been labeled as "yes," and 3,900 data points as "yes" when they should have been designated as "no." In more technical terms, the model's precision, measuring its ability to accurately identify "yes" when it is indeed "yes," yielded a value of 0.624. Simultaneously, its recall, signifying the proportion of actual "yes" instances correctly classified as such, yielded a value of 0.914.
4.12. Summary
This chapter highlights the adaptability of the CRISP-DM framework and its potential to optimize predictive outcomes, offering a systematic approach to selecting effective machine learning algorithms.
5.1. Model comparison and Evaluation of model
5.1.1 Logistic Regression
All Features: The Logistic Regression model when using all features provides an accuracy of 80% on both the training and test datasets. The consistency in precision (79% and 79%) and recall (81% and 82%) between training and test datasets respectively, shows that the model generalizes well and does not suffer from overfitting. The F1-Score, which represents the balance between precision and recall, is at 80% and 81% for the training and test sets respectively, indicating that the model provides a consistent performance across different metrics.
With Feature Selection: The accuracy drops to 71%, showing that some of the information discarded by feature selection is significant for the prediction. The reduced precision and recall values further indicate that the model performs better with the full feature set.
5.1.2 Random Forest
All Features: The Random Forest model achieves a perfect accuracy of 100% in the training dataset, which might be indicative of overfitting. However, its performance on the test dataset remains strong at 92%. The high AUC of 97% emphasizes the model's capability to distinguish between the positive and negative classes.
The very low FP and FN counts confirm its classification accuracy.
With Feature Selection: The model retains its perfect score on the training data and drops marginally on the test dataset to 91%. This marginal change suggests that the Random Forest algorithm is robust and can handle feature variability well.
5.1.3 SVM
All Features: SVM shows a commendable accuracy of 90% on the training data and 89% on the test data, showcasing its generalizing capability. While the precision and recall values are balanced, the lower TN value compared to other models highlights its lesser capability in identifying true negatives.
With Feature Selection: SVM's performance decreases slightly, showcasing sensitivity to feature changes.
Figure 38: Comparison of Models
5.2 ROC Explanation
5.2.1 Logistic Regression
ROC Curve of Logistic Regression without feature Selection
The ROC curve of logistic regression without feature selection, trained and tested in a 70:30 ratio, leans towards the left side of the plot but does not reach the upper-left corner. This means that the logistic regression predictions are accurate but not highly accurate, as the graph shows high specificity but low sensitivity.
Figure 39: ROC curve of Logistic Regression without Feature Selection (70:30)
The ROC curve of logistic regression without feature selection, trained and tested in an 80:20 ratio, also leans towards the left side of the plot but does not reach the upper-left corner. The result is quite similar to that of the 70:30 split without feature selection, with the graph showing high specificity and low sensitivity.
Figure 40: ROC curve of Logistic Regression without Feature Selection (80:20)
ROC Curve of Logistic Regression with feature Selection
The ROC curve of logistic regression with feature selection, trained and tested in a 70:30 ratio, leans much less towards the left side of the plot. This means that the logistic regression predictions are not very accurate, as the graph shows low specificity and low sensitivity.
Figure 41: ROC curve of Logistic Regression with Feature Selection (70:30)
The ROC curve of logistic regression with feature selection, trained and tested in an 80:20 ratio, does not lean towards the left side of the plot. The result is similar to that of the 70:30 split with feature selection, and the graph shows low specificity and low sensitivity. For accurate predictions, the curve should show both high sensitivity and high specificity.
Figure 42: ROC curve of Logistic Regression with Feature Selection (80:20)
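To make the reading of these curves concrete, the following minimal sketch shows how an ROC curve is traced: each threshold on the predicted score yields one point (1 − specificity, sensitivity), and the area under the curve (AUC) summarises how close the curve comes to the upper-left corner. The labels and scores are hypothetical illustration values, not outputs of the models in this study, and the function names `roc_points` and `auc` are assumptions introduced here.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Sweep the decision threshold from the highest score downwards
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = subscribed) and predicted scores, for illustration only
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 4))
```

A curve that hugs the upper-left corner gives an AUC close to 1, while a curve along the diagonal gives an AUC close to 0.5; the hypothetical data above lie in between.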
Confusion Matrix
For the Logistic Regression model without feature selection:
Test Data:
True Negative (TN) = 3393: Predicted as negative and are negative.
False Positive (FP) = 757: Wrongly predicted as positive but are negative.
False Negative (FN) = 880: Wrongly predicted as negative but are positive.
True Positive (TP) = 3206: Predicted as positive and are positive.
Train Data:
TN = 13393, FP = 3076, FN = 3472, TP = 12999.
For the Logistic Regression model with feature selection:
Test Data:
TN = 3127, FP = 1023, FN = 1349, TP = 2737.
Train Data:
TN = 12259, FP = 4210, FN = 5492, TP = 10979.
The matrix helps in understanding the model's accuracy, precision, recall, and F1 score. The diagonal values (TN, TP) represent correct predictions, while off-diagonal (FP, FN) indicate errors.
Figure 43: Confusion matrix of Logistic Regression
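As an illustration of how these measures follow from the matrix, the sketch below derives accuracy, precision, recall and F1 score from the test-data counts reported above for Logistic Regression. The helper name `classification_metrics` is an assumption introduced for this example; the counts are taken directly from the text.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Test-data counts reported above for Logistic Regression
without_fs = classification_metrics(tn=3393, fp=757, fn=880, tp=3206)
with_fs = classification_metrics(tn=3127, fp=1023, fn=1349, tp=2737)

for name, metrics in [("without feature selection", without_fs),
                      ("with feature selection", with_fs)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

With these counts, test accuracy falls from roughly 80% without feature selection to roughly 71% with it, and the F1 score falls with it.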
5.2.2 Random Forest
ROC Curve of Random Forest without Feature Selection
The ROC curve for random forest without feature selection, trained and tested on a 70:30 split, bends towards the upper-left corner. The prediction accuracy of the random forest is high, since the curve shows high sensitivity and consistently high specificity.
Figure 44: ROC curve of Random Forest without Feature Selection (70:30)
The ROC curve for random forest without feature selection, trained and tested on an 80:20 split, bends towards the upper-left corner in the same way as for the 70:30 split. The prediction accuracy of the random forest is again high, since the curve shows high sensitivity and consistently high specificity.
Figure 45: ROC curve of Random Forest without Feature Selection (80:20)
ROC Curve of Random Forest with Feature Selection
The ROC curve for random forest with feature selection is the same as the curve without feature selection.
Figure 46: ROC curve of Random Forest with Feature Selection (70:30)
From the ROC curves, it can be stated that Random Forest has a higher prediction rate than Logistic Regression.
Figure 47: ROC curve of Random Forest with Feature Selection (80:20)
Confusion Matrix
For the Random Forest model without feature selection:
Test Data:
True Negative (TN) = 3709: Actual negatives correctly predicted.
False Positive (FP) = 450: Actual negatives wrongly predicted as positive.
False Negative (FN) = 441: Actual positives wrongly predicted as negative.
True Positive (TP) = 3636: Actual positives correctly predicted.
Train Data:
TN = 16468, FP = 1, FN = 0, TP = 16471.
For the model with feature selection:
Test Data:
TN = 3762, FP = 388, FN = 347, TP = 3739.
Train Data:
TN = 16468, FP = 1, FN = 0, TP = 16471.
Figure 48: Confusion Matrix of Random Forest
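The near-perfect training counts alongside a weaker test result are a common sign of overfitting. As a rough check, the sketch below compares training and test accuracy for the Random Forest without feature selection, using the counts reported above. The helper name `accuracy` is an assumption introduced for this example, and the size of gap that should be considered problematic is a judgement call, not something established by this study.

```python
def accuracy(tn, fp, fn, tp):
    """Share of correct predictions in a confusion matrix."""
    return (tn + tp) / (tn + fp + fn + tp)


# Counts reported above for Random Forest without feature selection
train_acc = accuracy(tn=16468, fp=1, fn=0, tp=16471)
test_acc = accuracy(tn=3709, fp=450, fn=441, tp=3636)

# A large positive gap means the model fits the training data
# far better than it fits unseen data
gap = train_acc - test_acc
print(f"train={train_acc:.4f} test={test_acc:.4f} gap={gap:.4f}")
```

Here the training accuracy is almost 100% while the test accuracy is about 89%, a gap of roughly eleven percentage points.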
5.2.3 SVM
ROC Curve of SVM with and without Feature Selection (70:30)
The ROC curve for SVM with and without feature selection, trained and tested on a 70:30 split, bends towards the left corner with high specificity, which indicates a balanced prediction capability.
Figure 49: ROC curve of SVM with and without Feature Selection (70:30)
ROC Curve of SVM with and without Feature Selection (80:20)
The ROC curve for SVM with and without feature selection, trained and tested on an 80:20 split, bends towards the left corner with high specificity, which indicates a balanced prediction capability.
Figure 50: ROC curve of SVM with and without Feature Selection (80:20)
Confusion Matrix
For the SVM model without feature selection:
Test Data:
True Negative (TN) = 3709: These are correctly predicted negatives.
False Positive (FP) = 450: These are wrongly predicted as positives.
False Negative (FN) = 441: These are wrongly predicted as negatives.
True Positive (TP) = 3636: These are correctly predicted positives.
Train Data:
TN = 14755, FP = 1605, FN = 1714, TP = 14866.
For the model with feature selection:
Test Data:
TN = 3857, FP = 605, FN = 293, TP = 3481.
Train Data:
TN = 15412, FP = 2161, FN = 1057, TP = 14310.
Figure 51: Confusion Matrix of SVM
5.3. C5.0 Model Evaluation
The confusion matrix provides a clear breakdown of the true positive, true negative, false positive, and false negative predictions.
True Negatives (TN): 34,212 instances were accurately predicted as class 0. These represent cases where the model correctly identified the absence of a specific outcome or condition.
True Positives (TP): 32,648 instances were accurately predicted as class 1. This means that the model correctly identified these cases as meeting a specific outcome or condition.
False Positives (FP): 2,336 instances were falsely predicted as class 1. In the context of this problem, the model mistakenly predicted these cases as meeting the outcome or condition when they did not.
False Negatives (FN): 3,900 instances were falsely predicted as class 0, suggesting the model failed to identify these cases as meeting the specified outcome or condition.
From the confusion matrix, several metrics can be derived:
Accuracy: (TP + TN) / Total = (32648 + 34212) / (34212 + 2336 + 3900 + 32648) = 0.91 or 91%. This indicates that the C5.0 model correctly predicted the outcome 91% of the time.
Precision: TP / (TP + FP) = 32648 / (32648 + 2336) = 0.93 or 93%. This showcases the model's accuracy in predicting positive instances.
Recall or Sensitivity: TP / (TP + FN) = 32648 / (32648 + 3900) = 0.89 or 89%. This indicates that the model correctly identified 89% of all actual positive instances.
Specificity: TN / (TN + FP) = 34212 / (34212 + 2336) = 0.94 or 94%. This emphasizes the model's accuracy in predicting negative instances.
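The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the four counts reported for the C5.0 model; the variable names are assumptions introduced for illustration.

```python
# Confusion-matrix counts reported above for the C5.0 model
tn, fp, fn, tp = 34212, 2336, 3900, 32648

total = tn + fp + fn + tp
metrics = {
    "accuracy": (tp + tn) / total,
    "precision": tp / (tp + fp),
    "recall": tp / (tp + fn),
    "specificity": tn / (tn + fp),
}

# Print each metric as a proportion and as a rounded percentage
for name, value in metrics.items():
    print(f"{name}: {value:.4f} ({value:.0%})")
```

Rounded to whole percentages, the results agree with the figures quoted above: 91% accuracy, 93% precision, 89% recall and 94% specificity.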
In conclusion, the C5.0 model shows promising results, with an accuracy of 91%. It provides a well-rounded approach to classification, with good accuracy and recall. The specificity score highlights the model's competence in identifying negative cases, which may matter more or less depending on the problem domain. The remaining false positives and false negatives, however, suggest that further development is required. The costs of the different types of error, as they pertain to a given application, must always be taken into account.
5.4. Customer Profiling based on rule set
Based on the study's extracted rule set, customer profiling indicates clear tendencies among those most likely to react favourably. There were three main types of consumers who emerged:
The first group consists of highly educated people in their early 30s in administrative, managerial, or technical roles, who are often reached by phone in June and August, on Tuesdays or Thursdays. Customers who fit this profile often own their homes outright and have not taken out any loans. They are also associated with certain economic indicators, such as a Euribor 3-month rate between 4.961 and 4.968.
The second group consists of people aged 37 to 65.5 who are typically retired and have completed only elementary school. They can be reached by phone on Mondays or Wednesdays; they have a place to live and no debts or credit problems.
The third group consists of housewives who are married and above the age of 30.5. They have housing and typically prefer mobile phones over other forms of contact.
By breaking consumers down into distinct groups, marketers are able to create messages that reach and resonate with each subset more effectively.
5.5. Recommended Marketing Strategies
Segmented Marketing Campaigns:
As gleaned from the provided customer profiling, there are distinctive segments such as young professionals, retirees, and housemaids. Tailoring marketing campaigns specific to these demographics can increase engagement. For instance, retirees might be more interested in long-term security, while young professionals could be enticed by competitive interest rates and flexibility.
Educational Workshops and Webinars:
Organize sessions to educate customers on the benefits of term deposits, addressing common misconceptions. For younger demographics, focus on financial literacy, emphasizing the role of saving.
Leverage Technology:
Mobile apps and online platforms can offer personalized suggestions based on user behavior. An interactive, user-friendly app can also feature term deposit calculators, FAQs, and virtual assistant support. SMS and email reminders for term deposit renewals, special offers, or interest rate changes can keep customers informed in real-time.
Personalized Customer Service:
Employ relationship managers or customer service reps to maintain personalized interactions. They can offer advice tailored to individual financial situations and goals.
Loyalty Programs and Incentives:
Offer tiered interest rates based on the deposit amount or loyalty points that can be redeemed for other banking services.
Community Engagement:
Host community events, sponsor local initiatives, or collaborate with local businesses. This not only reinforces the bank's community ties but also provides a platform to introduce banking services, including term deposits.
5.6. Summary
Overfitting concerns aside, the Random Forest model with all features clearly outperforms the other two. Despite their consistency, SVM and Logistic Regression fall short of the Random Forest on the performance measures. Notably, feature selection did not noticeably improve the performance of any of the models. Therefore, the Random Forest with all features seems to be the most reliable option for predicting term deposit subscriptions using the Portuguese banking dataset. Still, care must be taken to avoid overfitting, which may require cross-validation or hyperparameter tuning.
6.1 Research Limitations
This study, "Exploring the Factors Influencing Term Deposit Subscription in Direct Marketing Campaigns: A Case Study of a Portuguese Banking Institution", provides a helpful understanding of term deposit subscriptions and their particular characteristics. Like every piece of research, however, it has its shortcomings. First, since the study focuses only on Portugal, its findings may not apply to banks in other countries or cultures, and it may not account for regional differences in consumer behavior or the impact of external influences. Secondly, there is the possibility of retrospective bias when using past data for analysis. There is no guarantee that customers' historical preferences will continue to be reflected in their banking habits, since the industry and consumers themselves undergo constant change. There may also be external economic, political, or sociological factors, not examined here, that affected term deposit subscriptions during the study period. Thirdly, some important factors may have been excluded. Although several aspects were taken into account, subtler influences on a customer's final decision may not have been captured, such as personal financial advisors, word-of-mouth recommendations, and customers' prior experiences with the bank. Finally, the research relies heavily on quantitative data, which, although statistically solid, may overlook qualitative aspects such as client moods or views that might give a fuller insight into their decisions.
6.2 Research Significance and Contribution
In the dynamic field of banking and financial advertising, the study "Exploring the Factors Influencing Term Deposit Subscription in Direct Marketing Campaigns: A Case Study of a Portuguese Banking Institution" is of crucial importance. In a highly competitive market, financial institutions need to increase client engagement and revenue by learning what factors impact a customer's choice to subscribe to a term deposit. The study's concentration on Portugal allows for an in-depth examination of the cultural, economic, and behavioral peculiarities of its target audience. It highlights which direct marketing initiatives connect best with prospective subscribers and gives information on the success of different strategies used by banks. Banks may improve their marketing strategies and reach their target demographic more effectively by identifying and measuring the aspects that have the greatest impact on subscription rates.
Furthermore, in an age when data-driven decision-making is crucial, the results of this study might form the basis for future advertising initiatives, not just in Portugal but perhaps in other locations with comparable economic landscapes. The study's results may help financial institutions better allocate their resources, attract new customers from the correct demographics, and address their specific needs and concerns. The importance of this study goes beyond the immediate implications of the results; it has the potential to direct more well-informed marketing choices, increase financial institutions' profits, and better serve their customers in the arena of term deposits.
6.3 Implications for management practice
The results of the research "Exploring the Factors Influencing Term Deposit Subscription in Direct Marketing Campaigns: A Case Study of a Portuguese Banking Institution" provide useful information for banking management in the area of direct marketing. Managers may improve the efficacy of their term deposit marketing campaigns if they have a firm grasp of the factors most strongly associated with customers' decisions to subscribe. Where people with certain demographic or behavioral characteristics are more likely to respond positively to a term deposit campaign, the bank can direct its efforts and resources more effectively. Managers may also improve their communication strategies by learning which direct marketing channels and messages are most successful. This will help them contact the appropriate people with a message that connects with them and motivates them to take action.
Furthermore, the results may be used to direct training modules and customer engagement protocols in the field of customer relations and service. It is possible to improve customer satisfaction and maybe increase conversion rates by training front-line workers to proactively address common issues and motives among prospective subscribers. The findings of this study, if implemented by management, could improve marketing efforts, resource allocation, customer service, and, ultimately, term deposit subscriptions, driving growth for the banking institution.
6.4 Further Research
Cross-Cultural Comparisons: A compelling avenue for future research could involve comparing the determinants of term deposit subscriptions in different cultural or economic contexts. This would discern if the influencing factors identified in the Portuguese banking scenario hold in other regions or if there are unique local determinants.
Temporal Analysis: Investigate how the influencing factors have evolved, especially given the rapid technological changes and shifts in consumer behavior in recent years. This could provide insights into emerging trends and forecast future patterns.
Impact of Technological Interventions: With the rise of digital banking, studying the impact of technology, like mobile banking apps or AI-driven customer service, on term deposit subscriptions can offer valuable insights.
Psychological and Behavioral Analysis: Delve deeper into the psychological aspects that drive decisions related to term deposit subscriptions. Behavioral economics tools and theories can provide a nuanced understanding of customer choices.
Effect of Macro-Economic Indicators: Researching how broader economic factors, such as interest rates, inflation, or economic recessions, influence term deposit subscriptions can add another layer of depth to the understanding.
Alternative Financial Products: Study the interplay between term deposits and other financial products. For instance, do customers who prefer term deposits shy away from more volatile investments like stocks?
Segmented Analysis: Break down the customer base further into specific segments (e.g., age groups, professional categories) and explore if there are unique determinants within these smaller groups.
In summary, although this research on a Portuguese banking institution gives important insights, there are many additional angles from which the topic may be approached, each of which offers a deeper understanding of the dynamics of term deposit subscriptions.
Al-Ababneh, M.M., 2020. Linking ontology, epistemology and research methodology. Science & Philosophy, 8(1), pp.75-91.
Bakry, M., Masse, R.A., Arake, L., Amiruddin, M.M. and Syatar, A., 2021. How to attract millennials? Indonesian sharia banking opportunities. WSEAS Transactions on Business and Economics, 18, pp.376-385.
Barman, D., Shaw, K.K., Tudu, A. and Chowdhury, N., 2016. Classification of bank direct marketing data using subsets of training data. In Information Systems Design and Intelligent Applications: Proceedings of Third International Conference INDIA 2016, Volume 3 (pp. 143-151). Springer India.
Bikker, J.A. and Gerritsen, D.F., 2018. Determinants of interest rates on time deposits and savings accounts: Macro factors, bank risk, and account features. International Review of Finance, 18(2), pp.169-216.
Borio, C. and Gambacorta, L., 2017. Monetary policy and bank lending in a low interest rate environment: diminishing effectiveness?. Journal of Macroeconomics, 54, pp.217-231.
Borugadda, P., Nandru, P. and Madhavaiah, C., 2021. Predicting the success of bank telemarketing for selling long-term deposits: An application of machine learning algorithms. St. Theresa Journal of Humanities and Social Sciences, 7(1), pp.91-108.
Busch, R. and Memmel, C., 2017. Banks' net interest margin and the level of interest rates. Credit and Capital Markets–Kredit und Kapital, 50(3), pp.363-392.
Cantº, C., Cavallino, P., De Fiore, F. and Yetman, J., 2021. A global database on central banks' monetary responses to Covid-19.
Csikósová, A., Čulková, K. and Janošková, M., 2016. Evaluation of quantitative indicators of marketing activities in the banking sector. Journal of Business Research, 69(11), pp.5028-5033.
Dåderman, A. and Rosander, S., 2018. Evaluating frameworks for implementing machine learning in signal processing: A comparative study of CRISP-DM, SEMMA and KDD.
García, J.M. and Vila, J., 2020. Financial literacy is not enough: The role of nudging toward adequate long-term saving behavior. Journal of Business Research, 112, pp.472-477.
Ghatasheh, N., Faris, H., AlTaharwa, I., Harb, Y. and Harb, A., 2020. Business analytics in telemarketing: Cost-sensitive analysis of bank campaigns using artificial neural networks. Applied Sciences, 10(7), p.2581.
Gladilin, P. and Saitov, I., 2019. Data-Driven Approach for Dynamic Pricing for Decision Making Systems in Marketing and Finance. In 2019 25th Conference of Open Innovations Association (FRUCT) (pp. 102-108). IEEE.
Gupta, A., Raghav, A. and Srivastava, S., 2021. Comparative study of machine learning algorithms for Portuguese bank data. In 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (pp. 401-406). IEEE.
Hamid, A.J. and Ahmed, T.M., 2016. Developing prediction model of loan risk in banks using data mining. Machine Learning and Applications: An International Journal, 3(1), pp.1-9.
Hlongwane, R.W., 2018. Selecting the best model for predicting a term deposit product take-up in banking (Master's thesis, University of Cape Town).
Ilham, A., Khikmah, L., Indra, Ulumuddin and Bagus Ary Indra Iswara, I., 2019, March. Long-term deposits prediction: a comparative framework of classification model for predict the success of bank telemarketing. In Journal of Physics: Conference Series (Vol. 1175, p. 012035). IOP Publishing.
Khromov, M., 2018. Retail Bank Deposits: Sluggish Dynamics. Monitoring of Russia's Economic Outlook, pp.9-10.
Koumétio, C.S.T., Cherif, W. and Hassan, S., 2018, October. Optimizing the prediction of telemarketing target calls by a classification technique. In 2018 6th International conference on wireless networks and mobile communications (WINCOM) (pp. 1-6). IEEE.
Ładyżyński, P., Żbikowski, K. and Gawrysiak, P., 2019. Direct marketing campaigns in retail banking with the use of deep learning and random forests. Expert Systems with Applications, 134, pp.28-35.
Li, F., Lu, H., Hou, M., Cui, K. and Darbandi, M., 2021. Customer satisfaction with bank services: The role of cloud services, security, e-learning and service quality. Technology in Society, 64, p.101487.
Lu, X.Y., Chu, X.Q., Chen, M.H., Chang, P.C. and Chen, S.H., 2016. Artificial immune network with feature selection for bank term deposit recommendation. Journal of Intelligent Information Systems, 47, pp.267-285.
Melnikovas, A., 2018. Towards an Explicit Research Methodology: Adapting Research Onion Model for Futures Studies. Journal of futures Studies, 23(2).
Miguéis, V.L., Camanho, A.S. and Borges, J., 2017. Predicting direct marketing response in banking: comparison of class imbalance methods. Service Business, 11, pp.831-849.
Munir, K. and Anjum, M.S., 2018. The use of ontologies for effective knowledge modelling and information retrieval. Applied Computing and Informatics, 14(2), pp.116-126.
Muschelli III, J., 2020. ROC and AUC with a binary predictor: a potentially misleading metric. Journal of classification, 37(3), pp.696-708.
Nethala, V.J., Pathan, M.F.I. and Sekhar, M.S.C., 2022. A Study on Cooperative Banks in India with Special Reference to Marketing Strategies. Journal of Contemporary Issues in Business and Government Vol, 28(04).
Onobrakpeya, A. and Mac-Attama, A., 2017. Improving customer satisfaction through digital marketing in the Nigerian deposit money banks. Open Access International Journal of Science and Engineering, 2(7), pp.15-24.
Petrini, G. and Teixeira, L., 2023. Determinants of residential investment growth rate in the us economy (1992–2019). Review of Political Economy, 35(3), pp.702-719.
Rahman, A. and Khan, M.N.A., 2018. A Classification Based Model to Assess Customer Behavior in Banking Sector. Engineering, Technology & Applied Science Research, 8(3).
Sahay, A., 2016. Peeling Saunder's research onion. Research Gate, Art, 3(2), pp.1-5.
Taj, S.M. and Kumaravel, A., 2020. INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY FUZZY PETRI NETS CONSTRUCTION FOR PREDICTING THE BANK-DEPOSIT PROFILES.
Tekouabou, S.C.K., Cherif, W. and Silkan, H., 2019. A data modeling approach for classification problems: application to bank telemarketing prediction. In Proceedings of the 2nd International Conference on Networking, Information Systems & Security (pp. 1-7).
White, L., 2023. HSBC rides rising rates to double its income, launches $2 billion share buyback. [Online]
Available at: https://www.reuters.com/business/finance/hsbc-launches-up-2-bln-buyback-235-first-half-profit-jump-2023-08-01/
Zhuang, Q.R., Yao, Y.W. and Liu, O., 2018. Application of data mining in term deposit marketing. In Proceedings of the International MultiConference of Engineers and Computer Scientists (Vol. 2).
HOTZ, N., 2023. What is CRISP DM?. [Online]
Available at: https://www.datascience-pm.com/crisp-dm-2/
kaggle.com, 2022. Portuguese Bank Marketing. [Online]
Available at: https://www.kaggle.com/datasets/aakashverma8900/portuguese-bank-marketing
Theissler, A., Thomas, M., Burch, M. and Gerschner, F., 2022. ConfusionVis: Comparative evaluation and selection of multi-class classifiers based on confusion matrices. Knowledge-Based Systems, 247, p.108651.
Hou, S., Cai, Z., Wu, J., Du, H. and Xie, P., 2022. Applying Machine Learning to the Development of Prediction Models for Bank Deposit Subscription. International Journal of Business Analytics (IJBAN), 9(1), pp.1-14.
Schonlau, M. and Zou, R.Y., 2020. The random forest algorithm for statistical learning. The Stata Journal, 20(1), pp.3-29.
13th month pay in Portugal (2023) Horizons. Available at: https://joinhorizons.com/countries/portugal/hiring-employees/13th-month-pay/ (Accessed: 13 September 2023).