Transaction clustering based on text description


Compile | VK
Source | Analytics Vidhya


We live in the age of digital technology. When was the last time you walked into a store and paid without a digital transaction?

Digital transaction technologies have rapidly become a key part of our daily lives.

This is true not only for individuals: these technologies are at the heart of every financial institution. With a variety of payment options (e.g. online banking, ATMs, credit or debit cards, UPI, POS machines) and reliable systems running in the background, payment transactions and fund transfers have become very smooth.

Each transaction generates an appropriate description message.

In this article, we use clustering, a popular machine learning technique, to walk through a real use case in which a financial institution customizes its products for its customer base.

The motivation behind this case study

As a financial institution, it is always important to provide customized services to existing customers according to their different interests. Capturing customers’ intentions is a major challenge for any financial institution.

Social media platforms such as Twitter, WhatsApp and Facebook have become the main sources of information for analyzing customers’ interests and preferences.

Financial institutions often incur heavy costs to obtain this data from third parties. Even then, mapping a social media account to a unique customer is very difficult.

So how can we solve this problem?

Part of this problem can be solved with the internal transaction data the institution already holds.

We can classify the transactions performed by customers into different categories according to the transaction description message.

This method can label each transaction as food, sports, clothing, bill payment, home furnishing, and so on. If most of a customer’s transactions fall into one category, we can better estimate that customer’s preferences.
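As a rough illustration of the last step, the sketch below counts categories per customer and reports the most frequent one. The customer IDs, the labeled pairs and the function name `dominant_category` are all hypothetical; the article does not say how the institution aggregates categories.

```python
from collections import Counter, defaultdict

# Hypothetical (customer_id, category) pairs, as if every transaction had
# already been labeled from its description message.
labeled_transactions = [
    ("C001", "food"),
    ("C001", "food"),
    ("C001", "sports"),
    ("C002", "bill payment"),
    ("C002", "clothing"),
    ("C002", "bill payment"),
]


def dominant_category(transactions):
    """Return {customer_id: (top category, share of that customer's transactions)}."""
    per_customer = defaultdict(Counter)
    for customer_id, category in transactions:
        per_customer[customer_id][category] += 1

    result = {}
    for customer_id, counts in per_customer.items():
        category, n = counts.most_common(1)[0]
        result[customer_id] = (category, n / sum(counts.values()))
    return result


preferences = dominant_category(labeled_transactions)
```

The share is kept next to the category so that a weak majority (say, 40% food) can be treated differently from a strong one.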

This is our approach

The key to solving this problem lies in how we approach it.

Determine the number of topics

We start with all transactions and map their description messages to each customer. Our first important task is to determine the number of clusters (also called categories or topics). To do this, we use a topic model.

Topic modeling is a method for the unsupervised classification of documents. It can find natural groups of items when we are not sure what to look for. We fit the topic model with latent Dirichlet allocation (LDA).

It treats each document (that is, a transaction) as a mixture of topics, and each topic is a mixture of words.

For example, the word "budget" may appear in both a movies topic and a politics topic. The basic assumption of LDA is that every observation in the sample is drawn from an unknown distribution that can be explained by a generative statistical model.

Let’s see how this applies to our problem.

We assume that a generative statistical model produces all the words in a transaction description from an unknown distribution (i.e. an unknown group or topic). We try to build a statistical model that can predict the probability that a word belongs to a particular topic.

Topic coherence

One way to determine the total number of topics is to inspect the keywords of each topic by hand.

But this leads to disagreement among reviewers, so we need an objective way to evaluate the number of topics. We use the topic coherence measure to choose it.

Topic coherence is computed on the top n words of a topic. It is defined as the average or median of the pairwise similarity scores of the topic's words. A good model produces coherent topics, that is, topics with a high coherence score.

A good topic is one that can be described with a short label, and that is what the topic coherence measure is meant to capture.
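The article does not say which pairwise similarity score was used. As one possibility, the sketch below averages a UMass-style score, based on how often two words co-occur in the same description; the function name and the sample documents are assumptions made for illustration.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Average UMass-style pairwise score over the top words of one topic.

    top_words must be ordered from most to least probable.
    """
    doc_sets = [set(doc.split()) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # word never seen in the reference documents
        scores.append(math.log((doc_freq(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Made-up reference documents.
documents = [
    "pos purchase dominos pizza",
    "pizza delivery dominos order",
    "online payment netflix subscription",
    "netflix monthly subscription renewal",
]

coherent = umass_coherence(["pizza", "dominos"], documents)
incoherent = umass_coherence(["pizza", "netflix"], documents)
```

To pick the number of topics, one would fit models for several candidate values, average this score over the topics of each model, and prefer the value with the highest mean coherence.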


We can then determine the total number of topics / clusters (in our case, 7 topics). Next, we assign each transaction description message to a topic. When assigning documents to topics, a topic model alone may not produce accurate results.

So we combine the output of the topic model with other features and cluster the transaction description messages using k-means. Here, we focus on building the feature set for k-means clustering.

  • Basic features
    • Word count, number count, special symbol count
    • Maximum number sequence length, number character ratio
    • Average and maximum word length, etc.
    • The week, day and month of the transaction, whether there is a date, whether it is a weekend transaction, etc.
    • Transactions executed on or before the last 5 days of the month
    • Public holiday and festival transactions
  • Lookup features: top industry brands and common terms are used as lookup names, and we count the number of words in the transaction description that relate to a specific industry.
    • Food: vegetables, domino, fresh direct, etc.
    • Sports: baseball, Adidas, football, football shoes, etc.
    • Health: pharmacy, hospital, gym, etc.
    • Billing and EMI: policies, powers, statements, schedules, withdrawals, calls, etc.
    • Entertainment: Netflix, prime shows, Spotify, SoundCloud, bar
    • E-commerce: Amazon, Walmart, eBay, Ticketmaster, etc.
    • Other: Uber, Airbus, packer, etc.
  • Topic model features
    • The topic model is fitted on the unigram and bigram document-term matrices (DTMs) of the transaction descriptions, both weighted by TF-IDF. From the unigram and bigram DTMs we get two sets of seven topic probabilities for each transaction description.
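The basic and lookup features above can be sketched with the standard library alone. The function names, the shortened lookup lists, the regular expression and the sample description are all assumptions, and the "last 5 days" flag is only one reading of the calendar features in the list.

```python
import calendar
import re
from datetime import date

# Hypothetical lookup lists; real ones would be much longer.
LOOKUP_TERMS = {
    "food": ["vegetables", "domino", "fresh direct"],
    "sports": ["baseball", "adidas", "football"],
    "entertainment": ["netflix", "spotify", "soundcloud"],
}


def basic_features(description, txn_date):
    """Character, word and calendar features for one transaction."""
    words = description.split()
    digit_runs = re.findall(r"\d+", description)
    n_digits = sum(ch.isdigit() for ch in description)
    days_in_month = calendar.monthrange(txn_date.year, txn_date.month)[1]
    return {
        "word_count": len(words),
        "digit_count": n_digits,
        "special_count": sum(
            not ch.isalnum() and not ch.isspace() for ch in description
        ),
        "max_digit_run": max((len(r) for r in digit_runs), default=0),
        "digit_char_ratio": n_digits / len(description) if description else 0.0,
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
        "max_word_len": max((len(w) for w in words), default=0),
        "day_of_week": txn_date.weekday(),  # Monday = 0
        "month": txn_date.month,
        "is_weekend": int(txn_date.weekday() >= 5),
        "last_5_days": int(txn_date.day > days_in_month - 5),
    }


def lookup_features(description):
    """Count whole-word hits (with an optional plural "s") per industry."""
    text = description.lower()
    return {
        f"{industry}_hits": sum(
            len(re.findall(r"\b" + re.escape(term) + r"s?\b", text))
            for term in terms
        )
        for industry, terms in LOOKUP_TERMS.items()
    }


description = "POS 4521 DOMINOS PIZZA #12"
features = {
    **basic_features(description, date(2020, 3, 28)),
    **lookup_features(description),
}
```

The topic probabilities from the unigram and bigram models would be appended to this dictionary to form the full feature vector of one description.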

Final thoughts

Each transaction description has about 30 features. We perform k-means clustering to assign each transaction description to one of the seven clusters.

The results show that most of the observations near a cluster center are labeled with the correct topic. A few observations far from their cluster center are given the wrong topic label.
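A minimal sketch of this clustering step, assuming scikit-learn and NumPy. The feature matrix is synthetic (210 rows, 30 columns, 7 loose groups) and merely stands in for the real one; standardizing the features first is a choice made for this sketch and is not stated in the article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real feature matrix: 7 groups of 30 rows each,
# with 30 features per row.
group_centers = rng.normal(scale=5.0, size=(7, 30))
X = np.vstack([c + rng.normal(size=(30, 30)) for c in group_centers])

# Put all features on a comparable scale before computing distances.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=7, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_

# Euclidean distance of every row to the center of its assigned cluster.
dist_to_center = np.linalg.norm(
    X_scaled - kmeans.cluster_centers_[labels], axis=1
)

# Rows that lie farthest from their center are candidates for manual review.
review_order = np.argsort(dist_to_center)[::-1]
```

Because labels far from a center are the least reliable, sorting by `dist_to_center` gives a natural order in which to check descriptions by hand.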

Of the 350 transaction descriptions checked manually, about 240 (with an accuracy of about 69%) are correctly marked as appropriate topics.

Now we have at least a basic estimate of the preferences and interests of our existing customers. We can send them customized offers and options to keep them engaged and grow the business.

Although the method of using topic model is relatively new, the method of classifying customers by transactions is mainly used by credit card issuers.

American Express, for example, has been using this approach to create interest maps for customers. This interest map not only divides transactions into major groups such as food and tourism, but also creates micro market segments for Thai food lovers and wildlife lovers, all of which only come from rich transaction data!
