Challenges With Data Procurement And Analysis In Real Businesses

This post has been written by Naveen Prashanth, a second year PGP student at IIM Bangalore who was part of the just-concluded Spreadsheet Modelling for Business Decision Problems course.

Throughout this course, we have enjoyed the experience of playing with data and deriving interesting models / solutions to business problems. The path to making a good decision based on data seemed to involve mostly model making and driving insights. However, having worked for 4 years in analytics space, I have observed a wide variety of challenges that make driving decisions through data analysis a lot more complex. I would like to share some of those insights on this forum, so we can guard against them and ensure they do not come in the way of us fully utilizing our learnings from this course at work.

Firstly, data procurement is a time consuming and tricky business. In my first job, I was working with several retail store giants in the US that banked with Citibank. Though the data for each of these portfolios was supposed to be very similar, given similar economic conditions and business variables, there was stark variation in how the database of each portfolio was stored, the ‘level’ of each dataset as well as the definition of various fields. For example, the field ‘Revenue’ in one portfolio was defined differently from the other, though one would imagine there ideally is no business justification for the same. Further, data owners exhibit their own idiosyncrasies and can be fairly protective about the data they wish to share.

I have observed that the remedy for this situation is to make clear why the data you need is actually needed. It is best to start with the hypotheses you wish to test, how they make sense for the business, the data needed to validate these hypotheses and then delve into specific fields needed. This way, data owners feel empowered and involved in the analysis, area more likely to share data with you and most importantly also offer their valuable suggestions to improve your analysis.

The second issue is in the area of validation. Typically, we used to double check code and ensuring the data is being pulled right. Even in the area of model building, the common practice was to re-visit the logic and formulae to ensure accuracy. However, to accurately analyse data, one should adopt the process of cross-validation, where a brand new angle is used to validate the data. For instance, if A and B are drawn from different databases, and we know that A divided by B has to be between 50% and 60% by business sense, then this check can be applied once the datasheet is up. The advantage of this method is that we are mimicking how an end user would validate the model, while also being able to derive insights in this process. Tying up summarized database information to existing MI reports in the organization, with a certain tolerance level for the match, is also a good practice.

The third issue pertains to timelines. Most people who are not involved with data analytics have limited idea, not entirely their fault, on how long data analysis takes. It is important to clearly outline the steps needed to finally derive the decision, starting from data collection stage, and assign a precise timeline for each of the tasks. Socializing this schedule with other relevant stakeholders in the firm will also ensure no issues are raised later.

The final issue rests with the complexity appetite of the end user. In my previous organization, we built a killer NPV model that cut decision process time by almost half, and also freed up some very strained resources in the firm. However, senior management did not adopt this model since it was considered ‘too complex’ by them. It is important to be aware of the inclinations of the end user, since it is of course better to have as simple solution implemented than a complex one on the shelf. Clear formatting and visualization techniques we learnt in the course can further help break down prejudices on data complexity.

by Naveen Prashanth

Advertisements

Who will rate the raters

by Ashwin Ravikumar

Editor’s Note: We discussed the mechanics of building a rating system, and aggregating ratings, in yesterday’s class. This post is in reference to that. 

IMDB ratings: Mission critical to life

When I hear of a movie, the first thing I invariably do is pull up its IMDB page and look at the rating- an action many of us resort to. These ratings have helped me discover many wonderful movies. Just a couple weeks back, I heard of a movie called Birdman, the reflex action led me to the IMDB website where I saw a respectable rating of 8.1. Now interested, I promptly went to the theater and caught the movie, and I hear today that it won the Oscar for the Best Movie. IMDB for the win! 

The IMDB dual rating system with a separate formula for top 250

Spurred by today’s class on review systems, let’s take a peek beneath the hood of the IMDB ratings. The formula for calculating the Top Rated 250 Titles is as below:

Weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C where:

R = average for the movie (mean) = (Rating)

v = number of votes for the movie = (votes)

m = minimum votes required to be listed in the Top 250 (currently 25000)

C = the mean vote across the whole report (currently 7.0)

The second term above is interesting: it pulls the movie rating towards the mean vote of 7.0. Further, for rating the top 250, only “regular voters’” votes are considered.

The formula for any movie (non-250) is just a simple weighted average of all votes, after application of “various filters”. You and I can contribute to this “regular” rating of movies. IMDB does not disclose its exact methods of weighting.

A consequence is that every movie has 2 ratings- a regular rating and a top 250 rating. For example, Shawshank Redemption (#1) has a rating of 9.2 on the top 250 list and 9.3 on its separate movie page (see below). Some movie may have high ratings on their individual page but do not make it to the top 250. The implication is that for choosing the best of the crop, reviews from the accredited folk matter the most.

imdb2imdb1

The “Class”ic issues crop up

IMDB uses mean ratings above, which could suffer from a bias stemming from both the issues discussed in class:

  1. User behavior: My 7 is different from your 7; for instance an American’s 7 is different from a German’s 7 leading to higher ratings for an American movie.
  2. Cuisine behavior: A drama movie rater (serious critic) might be much more conservative in his scoring than say an Animation movie rater (a childlike adult)

The solution is weighted means, as discussed in class. We need to give some “power users” more weightage than others, when calculating scores. While IMDB claims “In order to avoid leaving the scheme open to abuse, we do not disclose the exact methods used”, it has provided some sketchy details. The weighting scheme apparently has been developed over 10 years and is tweaked regularly. What they are saying in essence is: “Trust us, we know what we’re doing”.

In IMDB we have to blindly trust; Rotten Tomatoes more transparent but different methodology altogether

IMDB is keeping its rating system as a closely guarded secret, like Coca Cola does for its formula. So we movie buffs have no option but to blindly trust IMDB’s word, and number. Rotten Tomatoes is another popular movie rating website, and IMDB’s closest rival. RT differs from IMDB in two ways; firstly it provides two ratings, and secondly it has a different methodology for calculation.

Rotten Tomatoes provides two movie ratings: Critic rating and Audience rating. The Critic rating is called the “Tomatometer” and is the official, trademarked rating of the website. The methodology used here has not been discussed in class: the Tomatometer is simply the “percentage of professional critic reviews that are positive for a given film or show”. The Audience rating, meanwhile, is simply the percentage of regular RT users who’ve rated a movie 3.5 or above on a 5 point scale. This is akin to saying a positive rating is equivalent to 1 upvote, while a negative rating is equivalent to a vote of 0, with the total score is divided by the total number of votes. This is akin to a binary scale for rating- either a like or a dislike.

Differing philosophies; rate RT better overall because of option but IMDB better on comparability

For their flagship scores, RT and IMDB follow different philosophies- while the Tomatometer is based wholly on critic scores, IMDB takes into account scores of all users. Therefore, IMDB is a crowdsourced score while the Tomatometer is not.IMDB appears to be losing some information due to its insistence on a “1 score” system. On RT, there are some movies which clearly the average movie audience loves while the critics hate, and vice versa (see Lucy below). As a viewer, on RT, you can judge whose rating you want to use. On IMDB, though, you don’t know how the weighted average is calculated – it’s a secret formula – and are therefore unaware if a movie is a hit with the average cinema goer or with the “power user” critics. However, there are 4 ratings numbers one can use in RT: The Tomatometer, the “Average Tomatometer rating”, the audience score and the Average Audience Rating. It is difficult to compare two movies given four numbers for each. Therefore, if I want to watch a movie, I’d prefer to use RT’s scores (Audience and Tomatometer), but if I want to compare the ratings of two movies, I’d prefer to use IMDB because there are 4 scores for each movie on RT, which makes difficult the process of comparing.

rotten

Next time when you reach the IMDB page of a movie and you think about reading this post, then please mentally doff your hat to me. For more such dope, visit https://ashwinwins.blogspot.com. And, enjoy that movie!

Sources:

IMDB Rating formula: http://www.imdb.com/chart/top?ref_=nb_mv_3_chttp

IMDB Rating details: http://www.imdb.com/help/show_leaf?votes

Rotten Tomatoes rating formula: http://www.rottentomatoes.com/about/

Ashwin Ravikumar is a second year PGP student at IIMB, and part of the Spreadsheet Modelling for Business Decision Problems course. 

Three-dimensional formula in MS Excel

by Anantha Krishnan S

To begin with, I would like to share one of the tips which I have come across in the recent times. It’s on setting up 3-dimensional formula in excel. This simple-yet-effective operation will help you save time and maintain consistency across the workbook.

In many business situations, you may be required to set up a workbook with multiple worksheets of same template. Examples ranges from a sales head tracking sales data across different regions to a production planner accessing the inventory of several items across production plants etc. Here the key is the use of same template and summarizing all in a single worksheet.

For better understanding, let’s assume that you are a statistician at ESPNCRICINFO.com. You are supposed to compile data and produce some interesting content with the help of data on number of fours, sixes, dot balls, overs played and wickets in the upcoming India-Australia-England series.

The tournament involves 7 matches. It would be easier to compile data if we set up the workbook with same template.

ananth1

Let’s create 7 worksheets first. For this open seven empty worksheets. (By the way, you can change the default number of worksheets through ‘File -> Options -> General’ menu). Also, create one more summary sheet as shown here.

ananth2

Then, for replicating the template to all 7 sheets, select Sheet 1, press & hold down the SHIFT key and press the last worksheet – sheet 7.

ananth3

With all sheets selected, enter the template you want to replicate in Sheet1.

ananth4

Thus whatever you enter in Sheet1 will be replicated to other selected sheets as well. Use different sheets (till Sheet7) for capturing data for subsequent matches.

By the end of the tournament, we can use 3-dimensional formula to summarize the tournament statistics. For example, to calculate the total attendance of the tournament, type =SUM( ,move your cursor to the first cell (cell B1 of sheet1), hold down the Shift key and click the last cell (cell B1 of sheet7). Lastly, enter a right parentheses in the formula bar, and you will see the formula SUM(sheet1:sheet7!B1). This formula tells Excel to sum cell B1 in all worksheets, starting with Sheet1 and ending with Sheet7. Also, you can copy this formula and extend this till B6 as shown here.

ananth5

I just used this example to give you an idea about 3-dimensional formula. You can use all your complex excel functions maintaining same template to extend the use for capturing complex real life scenarios.

Anantha Krishnan S is a second year PGP student at IIMB, and part of the Spreadsheet Modelling for Business Decision Problems course.