Data mining: when what you see is NOT what you get

I was in the four-member Singapore unit, testing internal processes at the bank. Our New York unit had a staff strength of about 25, a part of which was a dedicated data mining unit (DMU) that crunched various reports.

One day, the Legal unit (global) head asked for a report that seemed simple enough: How many people from each APAC country bought into each family of products in all of 2005? For example, if Jane of Singapore invested in Apple stock (Securities family) and a Franklin Templeton fund (Mutual Funds family), Mary of Singapore bought Apple stock, and John of Thailand bought Microsoft stock, it would show:

(Note: Whether Mary or Jane bought Microsoft stock in addition to or instead of Apple, the report would read the same, because both products belong to the same family: Securities.)

The problem was that neither our New York unit nor I had direct access to the raw set of transactions. Because of geographical restrictions, the guys with the raw data said they could pass it on to a US-based unit alone, so Jay of the DMU was nominated. He promptly sent the APAC Legal head (copying my boss and me) a glamorous first cut of the summary report — a PDF replete with logos and borders — when what I had requested was a spreadsheet I could sort. [Had I been in Jay’s place, I would have run it by at least one other person before sending it off to top management, and focused more on the content than the format at this stage!]

A few versions later, I was puzzled to find that each successive version contained higher figures. When I asked Jay why, he said that he had forgotten to include some earlier months’ data. This made me wary of the entire report, so I asked him a different question: For a moment, forget about who invested in what. Tell me how many accounts we have registered in any one country, say, Thailand.

Jay: Four.

Me: But your report shows that thirteen accounts from Thailand have invested in mutual funds. How?

Flummoxed, he sent a note to my boss: “I must stress that this report is NOT to be used for making business decisions.” Huh? Then exactly what do you use data mining for?

Now convinced there was a mistake, I asked him to send me all the raw data. I studied it for a day and — eureka!

Here is what had happened. Jay had received the data in batches (one for each month). He had prepared a summary for each month and then added up all months’ summaries to get the year’s summary. So, for example, let’s say you got data for February, where (to keep the example simple) everyone bought the same instruments as in January, with John entering a Franklin Templeton fund, in addition. The summary would be very similar, with an additional row for Thailand and “Mutual Funds”, as shown below.

Therefore, if you wanted the summary for January and February, you would add up the two, right?

WRONG! On the face of it, this seems logical, but here is a case where the sum of the parts could become more than the whole. Often, the same investors had invested in similar products every month, and they would not necessarily get a separate entry each time. For example, if Jane bought Apple stock in January and Microsoft stock in February, she should have still got just one entry (Securities) for the entire year.

Lesson learnt: When dealing with numbers that could impact a business, don’t hesitate to ask for a sanity check!

[See below to compare the wrong and right answers.]

[Disclaimer: I have used the names of these instruments as examples; I do not hold any of them.]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s