What is machine learning?

Machine learning is a way of analysing large amounts of data to search for patterns which would otherwise be difficult or impossible to detect due to their complexity. The machine learning software ‘learns’ from the data and then presents this information in a format that enables you to solve problems, make informed decisions and create smarter applications.

What is it used for?

Machine learning can be applied to any task that requires predicting the future based on historical data. This includes forecasting revenues, identifying buyer behaviour, recommending products such as music, DVDs or holidays, detecting fraud, deciding when an engine needs servicing, or indeed anything that depends on analysing a large amount of data to arrive at insights.

How does it work?

Take fraud detection as an example (although machine learning is applicable to much more than this). Let’s imagine you’re trying to find the best way of detecting credit card fraud, but all you have to work with is the historical data shown in Figure 1 below:

Name      | Amount     | Fraudulent
Brown     | £12,400.27 | No
Jones     | £420.76    | Yes
Severina  | £9,400.12  | No
Ogbona    | £1,200.00  | Yes

Figure 1: With limited data, it’s difficult to identify meaningful patterns.

Of course, with such a small amount of data it may be possible to find a pattern just by looking at it. The problem with working with limited data, however, is that any pattern you uncover is likely to be wrong. Given the data in Figure 1, for instance, you might conclude that fraudulent transactions only occur with relatively small sums, which certainly isn’t true.

The more data you have at your disposal, the more accurate the pattern becomes, although identifying the pattern becomes more difficult. Now let’s look at the credit card transaction data shown in Figure 2:

Name      | Amount     | Where issued | Where used | Age of cardholder | Fraudulent
Lee       | £17,000    | UK           | RUS        | 26                | Yes
Ferguson  | £4,986.56  | USA          | RUS        | 23                | No
Trong     | £1,832.06  | GER          | JAP        | 23                | No
James     | £22,000    | USA          | USA        | 41                | Yes
Khan      | £14,230.14 | USA          | RUS        | 26                | Yes
Rodrigues | £16,500    | UK           | USA        | 41                | Yes
Patel     | £29,723.14 | UK           | FIN        | 40                | Yes
Williams  | £13,123    | UK           | RUS        | 25                | Yes
Smith     | £15,456.10 | UK           | RUS        | 27                | Yes
Jenson    | £6,560.89  | UK           | FRA        | 42                | Yes
Rogers    | £12,307    | UK           | RUS        | 29                | Yes
Blaker    | £15,340.12 | USA          | FRA        | 41                | No

Figure 2: With expanded data, we can begin to identify meaningful patterns.

The extra data convincingly disproves our first attempt at finding a pattern (that fraudulent transactions only occur with relatively small sums). We can see this just by looking at the account holders named Patel and James.

The data shown in Figure 2 might suggest that most incidences of fraud happen to account holders in their forties, but the account holder named Smith, aged 27, doesn’t follow that pattern. Another emerging pattern is that fraud involves credit cards issued in the UK and used in Russia. But once again, the account holder named Khan, whose card was issued in the USA and used fraudulently in Russia, contradicts the pattern.

Could it be, then, that the transactions most likely to be fraudulent belong to account holders whose cards were issued in the UK, were used in Russia, and who are in their 20s? Not quite: the cardholder named Jenson, a 42-year-old whose UK-issued card was used fraudulently in France, shows that fraud isn’t confined to that profile.

The real underlying pattern in the data, of course, is that a transaction is most likely to be fraudulent if the cardholder is in their 20s, the card was issued in the UK and used in Russia, and the amount is for more than £10,000.

Since the amount of data we’re working with here isn’t very large, it’s possible to arrive at answers manually by a simple process of elimination.
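That process of elimination can be sketched in a few lines of code. The sketch below (pure Python; the function name and the numeric encoding of the amounts are illustrative) encodes the Figure 2 rows and flags every transaction that matches the hand-found rule:

```python
# Each row from Figure 2: (name, amount, where issued, where used, age, fraudulent)
transactions = [
    ("Lee",       17000.00, "UK",  "RUS", 26, True),
    ("Ferguson",   4986.56, "USA", "RUS", 23, False),
    ("Trong",      1832.06, "GER", "JAP", 23, False),
    ("James",     22000.00, "USA", "USA", 41, True),
    ("Khan",      14230.14, "USA", "RUS", 26, True),
    ("Rodrigues", 16500.00, "UK",  "USA", 41, True),
    ("Patel",     29723.14, "UK",  "FIN", 40, True),
    ("Williams",  13123.00, "UK",  "RUS", 25, True),
    ("Smith",     15456.10, "UK",  "RUS", 27, True),
    ("Jenson",     6560.89, "UK",  "FRA", 42, True),
    ("Rogers",    12307.00, "UK",  "RUS", 29, True),
    ("Blaker",    15340.12, "USA", "FRA", 41, False),
]

def matches_rule(amount, issued, used, age):
    """The pattern found by elimination: a UK-issued card used in Russia
    by a cardholder in their 20s, for a sum over £10,000."""
    return issued == "UK" and used == "RUS" and 20 <= age <= 29 and amount > 10000

# Every transaction matching the rule is indeed marked fraudulent:
flagged = [name for name, amount, issued, used, age, fraud in transactions
           if matches_rule(amount, issued, used, age)]
print(flagged)   # -> ['Lee', 'Williams', 'Smith', 'Rogers']
```

Running the check confirms that every row the rule flags was in fact fraudulent, which is exactly what a manual elimination over the table shows.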

But now let’s suppose we have not just tens of records to work with, as in our illustration above, but tens of millions. And there aren’t just six columns of data, but 60 or more columns. There’s a strong likelihood that hidden amongst this mountain of data are patterns that will help us identify which transactions are most likely to be fraudulent.

However, to work through such a large amount of data simply by sifting through it manually would be impossible. Instead, you have to find ways of extracting the data and running it through a computer to identify any underlying patterns. This is exactly what the machine learning process does. It scours large amounts of data, applying statistical techniques to uncover patterns residing in the data. Vitally, it then generates code that can be used to recognize these patterns.
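To make the idea concrete, here is a deliberately tiny sketch of the kind of search a learning algorithm automates: score every candidate rule against a labelled history and keep the one that classifies it best. The feature names and the miniature history below are hypothetical, and real systems use statistical models rather than brute-force enumeration, but the principle is the same:

```python
from itertools import combinations

FEATURES = ["issued_in_UK", "used_in_Russia", "holder_in_20s", "over_10k"]

# Hypothetical labelled history: (feature values) -> fraudulent?
history = [
    ((True,  True,  True,  True ), True),
    ((True,  True,  True,  True ), True),
    ((True,  True,  True,  True ), True),
    ((True,  True,  True,  False), False),   # small amount: not fraud
    ((True,  True,  False, True ), False),   # holder in 40s: not fraud
    ((True,  False, True,  True ), False),   # not used in Russia: not fraud
    ((False, True,  True,  True ), False),   # not a UK card: not fraud
    ((False, False, False, False), False),
    ((False, False, False, False), False),
]

def accuracy(rule):
    """Fraction of records the rule classifies correctly. A rule is a
    tuple of feature indices; it predicts fraud when all of them hold."""
    correct = sum(
        (all(features[i] for i in rule) == fraud)
        for features, fraud in history
    )
    return correct / len(history)

# Exhaustively score every conjunction of features and keep the best.
candidates = [
    combo
    for r in range(1, len(FEATURES) + 1)
    for combo in combinations(range(len(FEATURES)), r)
]
best = max(candidates, key=accuracy)
print([FEATURES[i] for i in best])
```

On this toy history the search recovers the four-condition rule because no smaller conjunction classifies every record correctly; with millions of rows and dozens of columns, automating exactly this kind of search is the point of machine learning.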

This generated code, referred to as a ‘model’, can then be called by any application that needs to solve this specific problem. In our example, the calling application must provide information such as the cardholder’s age, the sum involved, the country in which the card was issued and the country in which it was used. The model then assesses the likelihood of a fraudulent transaction.
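As an illustration of how a calling application might use such a model, the sketch below defines a hypothetical predict_fraud function. Its name, inputs and risk weights are invented for this example (they are not produced by any real system); it simply shows the shape of the interface, where the application passes in transaction details and gets back a likelihood:

```python
def predict_fraud(amount, issued_in, used_in, age):
    """Return an estimated likelihood that a transaction is fraudulent.
    The weights below are illustrative stand-ins for a learned model."""
    risk = 0.05                                  # baseline risk
    if issued_in == "UK" and used_in == "RUS":   # UK card used in Russia
        risk += 0.40
    if 20 <= age <= 29:                          # cardholder in their 20s
        risk += 0.25
    if amount > 10000:                           # sum over £10,000
        risk += 0.25
    return min(risk, 1.0)

# The calling application supplies the transaction details:
score = predict_fraud(amount=15456.10, issued_in="UK", used_in="RUS", age=27)
print(f"Fraud likelihood: {score:.0%}")   # -> Fraud likelihood: 95%
```

The application never needs to know how the pattern was found; it only calls the model with the details of the transaction at hand.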

How GIROUX can help

The explosion in information technology means that businesses have never had more data at their disposal – data which, if used intelligently, has the capacity to yield a wealth of profitable information. But without the means to interrogate this ‘big data’, the secrets buried in its depths are destined to remain that way. The rise of big data means that we are entering the era of machine learning. By making machine learning more accessible, easier and less expensive to use, GIROUX is committed to addressing this challenge. Using technologies such as Microsoft Azure Machine Learning (Azure ML) and other processes, our data scientists are committed to bringing machine learning into the mainstream.

How can GIROUX make a difference to the value of your organisation through the application of machine learning technologies?

To find out more or to discuss your needs in detail, contact us on +44 20 3287 7620.