How to Generate Your Own Datasets for a Powerful Demo

When you want to impress a customer with a demo of a BI solution, you may run into issues with what datasets to use. Customer data may not be accessible, or there may be security concerns with using real data; and external sample data may not be relevant, or the dataset may not be formatted correctly for your model. In this blog, we explain how to generate your own dataset so that you can build a compelling demo where your customer can picture what insights are possible with their own data.

Problem

As a business intelligence professional, there’s occasionally a need to demo a business intelligence tool for a new or existing customer.  To conduct this demo, you first need a dataset to use with the BI tool.  You want to provide an engaging demo where the customer can see what the tool would look like with their own data, but soon encounter problems when using their data, like:

  • The customer may have a low level of analytical maturity, with no data warehouse.  Data exists in disparate spreadsheets with no easy way to join them together.
  • There is a data warehouse, but due to the wide demo audience, there are sensitivity issues as to who is allowed to see the data.
  • There are security concerns with bringing existing data out of the current environment.

Undeterred, you turn to the internet to find an appropriate external dataset, only to encounter the following problems:

  • The readily available test datasets associated with your BI tool or database of choice are not relevant to your customer.
  • The dataset is not relational and may be a single, wide table.  Relational datasets are helpful for demonstrating the powerful drill down and aggregation capabilities of modern BI solutions.
  • The dataset requires a lot of cleansing or transformation to be useful.
  • The dataset does not have a license that allows for commercial use.

What can you do?

Solution

Build your own dataset!  Construct fake data that closely mimics the real-world data of your customer.  For this example, we will consider a property and casualty mutual insurance customer.  Your customer provides various coverages to its member companies.  Although members pay premiums annually, the revenue is recognized on a monthly basis.  Each month, managers from each line of coverage submit their budgeted revenue based on new or lost members and premium adjustments.

Through conversations with your customer you also learn the following facts:

  • The company was founded in 1991
  • Member premiums are typically between $30k and $120k
  • Member retention hovers around 95%
  • Due to recent growth, 20% of members were acquired in the past 5 years

Using this information, you construct a simple data model that you will base your demo dataset on.   Your dataset will have member, line of coverage, and date dimensions with monthly revenue and budget facts.

Generating the Data

Date Dimension

A date dimension will help us build our fact tables.  Thankfully, code already exists for many databases to build a date dimension.  In this example, we will be using MySQL.

Line of Coverage Dimension

Next, we create our line of coverage dimension, which includes the coverage name and the start and end dates for when the coverage was offered.  We use an arbitrary high date of 2099-12-31 to represent coverages that are currently being offered.

Member Dimension

To build our member dimension, we will start with an existing list of companies with various attributes about those companies.  For our member dimension we will keep the company name, city, state, type (public/private), and category (sector).  Then we will create additional attributes which will allow us to build our fact tables.  We will use Excel to build these attributes, though we could instead use the mathematical functions in MySQL.

When building our custom attributes, we will typically use two techniques:

  • Generating random numbers:  In Excel, to generate random numbers we will use either the RAND() or RANDBETWEEN() functions.  To generate random values within a given range, we can use the expression LOW + RAND()*(HIGH-LOW), where LOW and HIGH are the range bounds.  This expression even works for decimal and negative bounds.  Alternatively we can use the RANDBETWEEN() function to generate random integers within a given range.
  • Randomly distributing attributes across records:  To randomly distribute attributes across members, we will use RANDBETWEEN() to assign each member a random integer between 1 and 100.  We can then use this random number to apply attributes to a certain percentage of members.   Let’s use our member join year as an example.  We know that our customer started business in 1991, and that 20% of their members joined in the last 5 years.  Assuming our demo date is in late 2017, the formula for our join year would be IF(random_integer<21, RANDBETWEEN(2013,2017), RANDBETWEEN(1991,2012)).  Using this formula, approximately 20% of our members will be randomly assigned a join year in the past five years, while the remaining members will be randomly assigned a join year between 1991 and 2012.

Using the two techniques described above, we add the following the following attributes:

  • join_date: The join year was assigned as mentioned above, with a random join month and day.
  • exit_date: With the average member retention rate hovering around 95%, we give 5% of members an exit date with the rest receiving the high date id of 2099-12-31.
  • coverage_id: For the sake of simplicity, each member will only belong to one line of coverage.  Using our join dates and knowledge of the business, we designate coverage ids to our members.  In this case, a majority of members will get the oldest products, general liability and worker’s compensation coverage, with the least number of members getting the short-lived equipment breakdown coverage.
  • base_premium: This attribute, along with premium_growth_rate and budget_error_factor, will be used to generate our fact tables.  This represents the member’s annual premium at a fixed point in time, and is a random value between $30k and $120k.
  • premium_growth_rate: As member premiums are rarely static over time, we give members a random premium growth rate between -2% and +5%.
  • budget_error_factor: Since line of coverage managers’ premium budgets are never 100% accurate, we assign a random budget error factor between -5% and +10%.

Revenue Fact

We will leverage attributes from our dimensions to generate our monthly premium revenue allocation fact.  The query below will create a fact table that has one record per member per month.  There will only be records for when a member was active and when their respective coverage was active.  Additionally, the revenue will grow or decline over time, which will produce more interesting charts in your BI tool demo.

Budget Fact

Finally, we build upon our revenue fact to create our budgeted premium fact.  The budget numbers will be off from the revenue numbers by the budget_error_factor on the member dimension.

Outcome

A good demo with realistic data should result in an engaging discussion with the customer, where they start to picture what insights are possible with their own data and how the tool can improve their decision making. Even with our simple demo data model, when coupled with a modern BI solution, users can now see how easy it would be for them to determine relevant metrics such as premium revenue by industry or line of coverage, budget variance to actual, member retention rates, and lost revenue.

Some additional benefits of our demo data are that it can be reused for user training before the data warehouse is built, or it can be used to compare multiple tools simultaneously.  Probably the biggest benefit, however, is that users will be excited about the implementation of the tool, evangelize what they’ve seen, and help drive adoption throughout the organization.

Things to Consider

Here are some tips and tricks to keep in mind when building your dataset:

  • Use integer primary keys on all your tables, and add foreign key constraints to improve performance
  • Throw in a few outliers to make things more interesting
  • Avoid using ranges that will average out to zero, such as -10% to +10% budget error factor
  • The goal is to make a realistic, usable demo in a short time, not build the entire company’s data model
  • Don’t forget to remind the customer that the data is fake!
After careers in proprietary trading and business valuation, Sean finally found his calling as a data analytics consultant. He graduated the University of Illinois with a degree in Engineering and is a certified Looker data analyst. In his free time, Sean enjoys working on cars, watching the Green Bay Packers, and exploring the local restaurants and pubs of Chicago.

 

Contact Us

National Office Telephone | Mon-Fri 8:30am-5:30pm CT