Learning Data Science From A BI Background

Data science, a continuation of the predictive analytics and data mining spaces, is an interdisciplinary science that applies engineering, business-systems, and statistical methods, along with techniques such as supervised learning, to the art of discovering patterns in data.

From the perspective of a business intelligence expert, this can seem complicated and overwhelming. While both areas follow similar methodologies and add value by improving decision making and decision management, they differ in the skills, tools, and knowledge required. In terms that we understand: instead of reporting on and analyzing the past, predictive analytics forecasts the likelihood of future events.

This is becoming increasingly popular because we now have the hardware and processing power to complement the statistics, theories, and expert knowledge we’ve had for a long time. This post will serve as a basic overview of what I know, what I’ve learned, what motivates me to learn more, and the areas where we need to grow as BI experts to move into this space.

What You Need To Get Started

Since data science is a science, not just data analysis, it relies on the scientific method. The scientific method is an iterative process focused on the ability to reproduce findings. It involves formulating a question, generating hypotheses, gathering data, testing hypotheses, and communicating results. For many of us, this is a familiar concept because of the iterative nature of the agile methodology we have all come to know and love.

The industry-standard data science/data mining methodology is the Cross-Industry Standard Process for Data Mining (CRISP-DM). This methodology looks very familiar to anyone acquainted with the general agile methodology:

Figure 1: Phases of CRISP-DM

The methodologies described above, when put in the data science context, require many skills.
A thorough knowledge of the business, data, and audience is needed to generate questions, formulate hypotheses, gather data, and effectively communicate results or take action. Equally important, strong technical and programming knowledge is required to prepare data and create models. And last but not least, a strong understanding of statistics and math is vital to correctly analyze and test data to ensure that the models created are reproducible and significant.

Taking action on any model will only be beneficial if the right question was asked, the right hypothesis was tested, the right data was selected and prepared in the right way, and (arguably most importantly) the right methods were used to test the model. See below for an excellent breakdown of the scientific method vs. data science skills. I also highly recommend reading the article Figure 2 comes from.

Figure 2: Scientific Method and Data Science

Skills To Acquire

Many of the skills we have practiced as business intelligence experts lend themselves well to the data science realm, such as product design, programming, structured and unstructured data, big data, visual display of information, cloud management, and general business development. From what I have found, if you’re ready to dive into the predictive world, there are a few skills you need in your tool belt that fall outside the usual business intelligence tool belt:

Programming & Technology
– R
– Python
– MapReduce
– Hadoop
– Machine learning
– NLP

Math & Modeling
– Algorithms
– Bayesian Statistics

Statistics
– Data Mining
– Scientific Method and Experimental Design
– Statistical modeling
– Forecasting models

Table 1: A breakdown of the skills summarized in Figure 2, from a survey of 490 data science professionals.

A couple of months ago I was fortunate to attend John Elder’s training, Introduction to Data Mining, and I have woven many of the concepts he presented into this blog.
The diagram below helped me visualize not only the data mining skills that he kept referencing but also the distribution of knowledge from each sector of the interdisciplinary approach.

Figure 3: Discipline Interlock
Source: Elder Research Introduction to Data Mining Training

First Steps To Take – As a company or as an individual

1. UNDERSTAND THE METHODOLOGY
It’s really easy to get absorbed in data and models and tools. CRISP-DM or another iterative methodology should be a constant reminder of the underlying question or hypothesis.

2. MAKE AN INTERDISCIPLINARY TEAM
Not all of us can have all the skills outlined in Table 1 (although I would not object to being a clairvoyant polymath!). Together we can create a good profile; see Figure 3 for the areas of expertise that can be represented.

3. START WITH FREE TOOLS
Perhaps start with some of the open source and/or free tools and languages out there, like R, Python, WEKA, and KNIME.

4. ATTEND AND PARTICIPATE IN CONFERENCES
Check out some of the upcoming ones listed below. There are so many!

The Data Science Conference – Chicago, IL – Nov 12-13 2015 AND Apr 21-22 2016
The first and only vendor-free, sponsor-free, and recruiter-free data science conference. This conference is for business analytics professionals working on data science, big data, data mining, machine learning, artificial intelligence, or predictive modeling who want to attend an event without being prospected by other attendees.

Predictive Analytics Innovation Summit – Chicago, IL – Nov 11-12 2015
Achieve Actionable Data Insight.
With 40+ Industry Speakers & 250+ delegates, the Predictive Analytics Innovation Summit is the largest gathering of business executives at the forefront of predictive analytics initiatives.

Predictive Analytics World for Business – San Francisco, Chicago, and New York – 2016
Predictive Analytics World is the leading cross-vendor event for predictive analytics professionals, managers and commercial practitioners.

Big Data Innovation Summit – Las Vegas, NV – January 28-29 2016
Making Your Data Actionable. With internationally renowned speakers and world-class delegates in attendance, this summit boasts some of Big Data’s most influential individuals to educate you, inspire you, and give you unprecedented access to incredible networking opportunities.

Open Data Science Conference West – San Francisco, CA – Nov 14-15 2015

5. PRACTICE, PRACTICE, PRACTICE, AND LEARN!
The Open Source Data Science Masters references tons of online courses as well as books.

Two of the larger data science specific courses are:
– Coursera’s introduction to data science
– Harvard’s CS109 Data Science
Coursera also has an entire Data Science Specialization series.

Books:
– Applied Predictive Modeling by Max Kuhn and Kjell Johnson
– R Programming for Data Science by Roger D. Peng
– Python Data Science Essentials by Alberto Boschetti and Luca Massaron

Participate in data challenges or projects:
– Kaggle has new competitions added daily, and some even have a $$ prize. Although we novices likely will not win, we can look at the winners’ code and learn from it!
– DataKind brings together top data scientists with leading social change organizations to collaborate on cutting-edge analytics and advanced algorithms to maximize social impact. Get involved!
Check out their list of current projects.

A Fun (Small) Example

Way back in the day, I worked at a coffee shop in college, and for a project in one of my classes I wanted to use stochastic forecasting to determine the optimal brewing volumes for the new brewing equipment we had just received. The model forecasted coffee demand by volume, accounting for the fluctuation of customers by time of day, day of the year, weather, etc., using about 2 years of data from the point-of-sale system.

The optimization portion of the project took into account the opportunity loss of a false negative (running out of coffee and giving a customer a more expensive drink for the price of a coffee) and a false positive (having too much coffee and dumping it out). As you may have noticed, the coffee industry is VERY particular about coffee. A few of the constraints of this problem: coffee could only be out for less than 2 hours, and it took anywhere from 4 to 8 minutes to brew a pot of coffee (depending on the volume).

I delivered a recommendation of the 3 pre-set volume amounts for the new brewing equipment, and they still use them to this day. If I were to do this again today, knowing more BI principles and having learned more about predictive analytics, I would turn this into not just a single deliverable but a model that continues to grow as their data, customer base, and demand change.

Inspiring (And Fun) Ideas That Keep Me Interested

Predict fall foliage peak days in different areas of the world
Data: Use weather data with actual rain totals, sunshine days, altitudes, types of trees, and historical peak days.

Predict colorful sunsets
Data: Using current cloud coverage and season of the year, predict when a colorful and vibrant sunset will occur in any area (could also be a fun app).
Simulate and predict the effect of streamlined composting and recycling in all major US cities, including the impact on the environment 5 and 10 years ahead. Determine the effect of charging people for not recycling (it would also add jobs, so determine the effect on the economy 5+ years out).
Data: Not sure where to start with this one! Any suggestions?
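
Coming back to the coffee shop example above: the trade-off between running out of coffee and dumping it is commonly framed as a "newsvendor" problem. The sketch below shows that framing in Python. It is not the model from the original class project. The function names, the simulated demand, and the per-cup costs are all hypothetical, chosen only to illustrate the idea that the best brew volume is the demand quantile at shortage_cost / (shortage_cost + waste_cost).

```python
import math
import random


def optimal_brew_volume(demand_samples, shortage_cost, waste_cost):
    """Pick a brew volume with the newsvendor rule (illustrative only).

    demand_samples: demand observed (or simulated) for one holding window.
    shortage_cost:  cost per unit of running out (the "false negative").
    waste_cost:     cost per unit dumped (the "false positive").
    """
    if not demand_samples:
        raise ValueError("need at least one demand sample")
    if shortage_cost <= 0 or waste_cost <= 0:
        raise ValueError("costs must be positive")
    critical_ratio = shortage_cost / (shortage_cost + waste_cost)
    ordered = sorted(demand_samples)
    # Smallest sample whose empirical CDF reaches the critical ratio.
    index = max(0, math.ceil(critical_ratio * len(ordered)) - 1)
    return ordered[index]


def expected_cost(volume, demand_samples, shortage_cost, waste_cost):
    """Average cost of brewing `volume` against the sampled demand."""
    total = 0.0
    for demand in demand_samples:
        if demand > volume:
            total += shortage_cost * (demand - volume)  # ran out
        else:
            total += waste_cost * (volume - demand)  # dumped the rest
    return total / len(demand_samples)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical demand in cups for one two-hour window; a real model
    # would forecast this from point-of-sale history, weather, and so on.
    samples = [max(0.0, rng.gauss(60, 12)) for _ in range(5000)]
    # Hypothetical costs per cup, not figures from the original project.
    volume = optimal_brew_volume(samples, shortage_cost=1.50, waste_cost=0.25)
    cost = expected_cost(volume, samples, shortage_cost=1.50, waste_cost=0.25)
    print(f"Suggested brew volume: {volume:.1f} cups")
    print(f"Expected cost per window: {cost:.2f}")
```

Because running out is assumed to cost more than dumping, the rule brews toward the high end of expected demand. Re-running it as new sales data arrives is one simple way a recommendation like this could keep adapting instead of staying a one-time deliverable.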