When you want to impress a customer with a demo of a BI solution, you may run into issues with what datasets to use. Customer data may not be accessible, or there may be security concerns with using real data; and external sample data may not be relevant, or the dataset may not be formatted correctly for your model. In this blog, we explain how to generate your own dataset so that you can build a compelling demo where your customer can picture what insights are possible with their own data.
As a business intelligence professional, there’s occasionally a need to demo a business intelligence tool for a new or existing customer. To conduct this demo, you first need a dataset to use with the BI tool. You want to provide an engaging demo where the customer can see what the tool would look like with their own data, but soon encounter problems when using their data, like:
Undeterred, you turn to the internet to find an appropriate external dataset, only to encounter the following problems:
What can you do?
Build your own dataset! Construct fake data that closely mimics the real-world data of your customer. For this example, we will consider a property and casualty mutual insurance customer. Your customer provides various coverages to its member companies. Although members pay premiums annually, the revenue is recognized on a monthly basis. Each month, managers from each line of coverage submit their budgeted revenue based on new or lost members and premium adjustments.
Through conversations with your customer you also learn the following facts:
Using this information, you construct a simple data model that you will base your demo dataset on. Your dataset will have member, line of coverage, and date dimensions with monthly revenue and budget facts.
A date dimension will help us build our fact tables. Thankfully, code already exists for many databases to build a date dimension. In this example, we will be using MySQL.
Next, we create our line of coverage dimension, which includes the coverage name and the start and end dates for when the coverage was offered. We use an arbitrary high date of 2099-12-31 to represent coverages that are currently being offered.
To build our member dimension, we will start with an existing list of companies with various attributes about those companies. For our member dimension we will keep the company name, city, state, type (public/private), and category (sector). Then we will create additional attributes which will allow us to build our fact tables. We will use Excel to build these attributes, though we could instead use the mathematical functions in MySQL.
When building our custom attributes, we will typically use two techniques:
RANDBETWEEN()functions. To generate random values within a given range, we can use the expression
LOW + RAND()*(HIGH-LOW), where
HIGHare the range bounds. This expression even works for decimal and negative bounds. Alternatively we can use the
RANDBETWEEN()function to generate random integers within a given range.
RANDBETWEEN()to assign each member a random integer between 1 and 100. We can then use this random number to apply attributes to a certain percentage of members. Let’s use our member join year as an example. We know that our customer started business in 1991, and that 20% of their members joined in the last 5 years. Assuming our demo date is in late 2017, the formula for our join year would be
RANDBETWEEN(1991,2012)). Using this formula, approximately 20% of our members will be randomly assigned a join year in the past five years, while the remaining members will be randomly assigned a join year between 1991 and 2012.
Using the two techniques described above, we add the following the following attributes:
budget_error_factor, will be used to generate our fact tables. This represents the member’s annual premium at a fixed point in time, and is a random value between $30k and $120k.
We will leverage attributes from our dimensions to generate our monthly premium revenue allocation fact. The query below will create a fact table that has one record per member per month. There will only be records for when a member was active and when their respective coverage was active. Additionally, the revenue will grow or decline over time, which will produce more interesting charts in your BI tool demo.
Finally, we build upon our revenue fact to create our budgeted premium fact. The budget numbers will be off from the revenue numbers by the budget_error_factor on the member dimension.
A good demo with realistic data should result in an engaging discussion with the customer, where they start to picture what insights are possible with their own data and how the tool can improve their decision making. Even with our simple demo data model, when coupled with a modern BI solution, users can now see how easy it would be for them to determine relevant metrics such as premium revenue by industry or line of coverage, budget variance to actual, member retention rates, and lost revenue.
Some additional benefits of our demo data are that it can be reused for user training before the data warehouse is built, or it can be used to compare multiple tools simultaneously. Probably the biggest benefit, however, is that users will be excited about the implementation of the tool, evangelize what they’ve seen, and help drive adoption throughout the organization.
Here are some tips and tricks to keep in mind when building your dataset: