CRISP-DM: why use the methodology for your Data Science Projects

Published: December 20, 2024

The complexity of data mining projects, or data science projects as the field has come to be known, can be explained by analogy with a well-known puzzle: the word search.

A word search, also called a word hunt, is a rectangular grid of letters. The objective is to identify and highlight all the words hidden within the grid. Often, these words are thematically linked by a given context, such as a title or a paragraph, which groups them into a coherent scenario.

Imagine we are given a word search puzzle with just a list of words to find. One could eventually derive a scenario by grouping these words. For example, words like “Sun”, “Sand”, “Beach”, “Mountains”, “Airplane”, “Family”, and “Joyful” can be linked to a “Vacation” theme.

To make things more interesting, let’s remove the list of words and provide only the grid of letters. Our brain can find known words in any direction, including diagonally, upside down, or backward. By looking for commonly used combinations of letters and syllables, we can rule out sequences like XH, YY, or ZZ. Once we find an initial set of words and assume the remaining words share a theme, it becomes easier to find the others and derive that theme. We might still make mistakes and derive a wrong or misleading theme, and that risk must be mitigated.
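
The search described above can be sketched in a few lines of Python. This is only an illustration of the analogy: the grid, the vocabulary, and the function names find_word and find_known_words are all invented for this example, and the vocabulary stands in for the words our brain already knows.

```python
from typing import Dict, List, Tuple

# (row step, column step) for the eight reading directions.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

Hit = Tuple[int, int, Tuple[int, int]]  # start row, start column, direction


def find_word(grid: List[str], word: str) -> List[Hit]:
    """Return every (row, col, direction) where `word` can be read in `grid`."""
    rows, cols = len(grid), len(grid[0])
    hits = []
    for r in range(rows):
        for c in range(cols):
            for dr, dc in DIRECTIONS:
                end_r = r + dr * (len(word) - 1)
                end_c = c + dc * (len(word) - 1)
                if not (0 <= end_r < rows and 0 <= end_c < cols):
                    continue  # the word would run off the grid
                if all(grid[r + dr * i][c + dc * i] == ch for i, ch in enumerate(word)):
                    hits.append((r, c, (dr, dc)))
    return hits


def find_known_words(grid: List[str], vocabulary: List[str]) -> Dict[str, List[Hit]]:
    """Keep only the vocabulary words that actually occur in the grid."""
    found = {word: find_word(grid, word) for word in vocabulary}
    return {word: hits for word, hits in found.items() if hits}


# Hypothetical 4x4 grid and vocabulary, invented for this example.
GRID = ["SUNX", "AQZK", "NKQZ", "DAES"]
VOCABULARY = ["SUN", "SAND", "SEA", "SKI"]

found = find_known_words(GRID, VOCABULARY)
print(found)
```

Here “SUN” is read left to right, “SAND” top to bottom, and “SEA” backward along the last row, while “SKI” is not in the grid. The words that are found could then hint at a theme, which is the step where a wrong guess becomes possible.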

A data mining project usually starts with a business need, much like finding the theme in our word puzzle, and extends to deeply technical work, such as understanding the advanced algorithms (summarization, regression, clustering, neural networks, deep learning, etc.) that can be used to solve the customer’s problem. In our word puzzle analogy, studying the probability of given letters appearing together could make it easier to find valid words.

Complex data mining problems arose in the late 1990s, leading practitioners to seek a standardized and repeatable way of tackling these projects. In 1999, version 1.0 of a document called CRISP-DM was released, and it has since been used as a standard for developing data mining and data science projects.

What is CRISP-DM?

Published in 1999 to standardize data mining processes across industries, the Cross-Industry Standard Process for Data Mining – CRISP-DM – has become the most common methodology for data mining, analytics, and data science projects.

A poll conducted by Data Science Process Alliance in early 2024, with 109 respondents, showed that CRISP-DM is still the most commonly used methodology.

Whichever methodology is chosen, maintaining high quality in a data science team requires enforcing it as a standard. This applies both to traditional data analytics projects and to more advanced work, including recommenders, text, image, and language processing, deep learning, and AI projects.

CRISP-DM comprises six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Together, these phases describe the data science life cycle.

CRISP-DM distinguishes between a reference model and a user guide. The reference model presents an overview of phases, tasks, and their outputs, describing what to do in a data mining project. The user guide provides detailed tips and hints for each phase and task, describing how to conduct a data mining project.

Before explaining each phase, it is important to note that the methodology is flexible and can be performed iteratively. Data science projects are now heavily based on software, which makes methodologies such as Agile suitable for them. Combining Agile with CRISP-DM helps ensure consistency and quality in delivering data science projects. Conversely, projects that need to follow a sequential flow and iterate less can still benefit from CRISP-DM when it is combined with a traditional waterfall methodology.

CRISP-DM phases

Business Understanding 

This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan.

  • Determine Business Objectives: Understand what the customer wants to accomplish. Tools like Design Thinking or Value Proposition Design can be useful here.
  • Assess Situation: Detailed fact-finding about resources, constraints, assumptions, and other factors.
  • Determine Data Mining Goals: Convert business requirements into technical data science terms.
  • Produce Project Plan: Combine this task with your project management methodology, whether Agile or a more detailed project plan.

Data Understanding 

This phase starts with initial data collection and proceeds with activities that enable familiarity with the data, identification of data quality problems, and discovery of first insights.

  • Collect Initial Data: List sources, acquisition methods, and any problems encountered.
  • Describe the Data: Include format, storage, access time, security, consistency, etc.
  • Explore Data: Use tools to visualize data and extract initial information.
  • Verify Data Quality: Iteratively assess feasibility with the customer.
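
As a minimal sketch of the “Describe the Data” and “Verify Data Quality” tasks, the snippet below profiles a small set of records for missing values and duplicated keys. The records, the field names, and the profile function are hypothetical, and a real project would more likely use a dedicated data analysis library.

```python
from collections import Counter

# Hypothetical sample records, invented for this example.
RECORDS = [
    {"customer_id": "c1", "age": 34, "country": "BR"},
    {"customer_id": "c2", "age": None, "country": "US"},
    {"customer_id": "c3", "age": 51, "country": ""},
    {"customer_id": "c1", "age": 34, "country": "BR"},
]


def is_missing(value):
    """Treat None and empty strings as missing (an assumption of this sketch)."""
    return value is None or value == ""


def profile(records, key_field):
    """Summarize row count, missing values per field, and duplicated keys."""
    fields = sorted({field for rec in records for field in rec})
    missing = {
        field: sum(1 for rec in records if is_missing(rec.get(field)))
        for field in fields
    }
    key_counts = Counter(rec.get(key_field) for rec in records)
    duplicate_keys = sorted(key for key, count in key_counts.items() if count > 1)
    return {"rows": len(records), "missing": missing, "duplicate_keys": duplicate_keys}


report = profile(RECORDS, "customer_id")
print(report)
```

A report like this gives the team something concrete to review with the customer when assessing whether the data can support the project.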

Data Preparation 

This phase involves cleaning, transforming, and organizing raw data to make it suitable for analysis.

  • Select the Data: Collect and join data from various sources, perform correlation tests, and decide on fields to use.
  • Clean the Data: Address noise in the data and correct or remove it.
  • Construct Data: Derive new or transformed attributes and add them to the dataset used for modeling.
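
The cleaning and construction tasks can be illustrated with a short sketch. Everything here is an assumption made for the example: the records, the field names, the choice to keep the first copy of a duplicated customer, and the choice to fill a missing age with the median. The right cleaning rules depend on the project.

```python
from statistics import median

# Hypothetical raw records, invented for this example.
RAW = [
    {"customer_id": "c1", "age": 34, "total_spend": 120.0, "orders": 4},
    {"customer_id": "c2", "age": None, "total_spend": 80.0, "orders": 2},
    {"customer_id": "c1", "age": 34, "total_spend": 120.0, "orders": 4},  # duplicate
    {"customer_id": "c3", "age": 52, "total_spend": 0.0, "orders": 0},
]


def prepare(records):
    """Return cleaned copies of the records with one derived attribute added."""
    # Clean: keep the first record seen for each customer_id.
    seen = set()
    unique = []
    for rec in records:
        if rec["customer_id"] in seen:
            continue
        seen.add(rec["customer_id"])
        unique.append(dict(rec))  # copy, so the raw data is left untouched

    # Clean: fill missing ages with the median of the known ages.
    fill_age = median(r["age"] for r in unique if r["age"] is not None)
    for rec in unique:
        if rec["age"] is None:
            rec["age"] = fill_age
        # Construct: derive average order value, guarding against zero orders.
        rec["avg_order_value"] = (
            rec["total_spend"] / rec["orders"] if rec["orders"] else 0.0
        )
    return unique


prepared = prepare(RAW)
for row in prepared:
    print(row)
```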

Modeling 

In this phase, proper modeling techniques are selected and applied, and their parameters are calibrated.

  • Select Modeling Technique: Choose the appropriate technique for the data science problem.
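  • Generate Test Design: Decide how the model will be tested (for example, how the data is split into training and test sets), then build and assess the model, revising parameters as necessary.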
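
One common test design is a holdout split: part of the data is set aside, the model is fitted on the rest, and the error is measured only on the held-out part. The sketch below assumes a toy dataset that follows y = 2x + 1 exactly and uses a one-variable least-squares line as a stand-in for whichever modeling technique was selected. The function names and the 25% test fraction are illustrative choices.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle with a fixed seed and hold out a fraction of rows for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Ordinary least squares for a single input: returns (slope, intercept)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(rows, slope, intercept):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in rows) / len(rows)


# Hypothetical toy data that follows y = 2x + 1 exactly.
DATA = [(float(x), 2.0 * x + 1.0) for x in range(12)]

train, test = train_test_split(DATA)
slope, intercept = fit_line(train)
test_mae = mean_absolute_error(test, slope, intercept)
print(len(train), len(test), slope, intercept, test_mae)
```

Fixing the seed keeps the split repeatable, so later parameter revisions are compared against the same held-out rows.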

Evaluation 

This phase aims to thoroughly evaluate the model and review the steps executed to create it, ensuring that it achieves the business objectives.

  • Identify Business Issues: Ensure no important factors have been overlooked.
  • Trust and Use the Results: Confirm the data science results are reliable.
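
One way to make this check explicit is to compare the model’s measured metrics against success criteria agreed during Business Understanding. The sketch below is illustrative only: the metric names, thresholds, and measured values are invented, and the Criterion class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """A business success criterion expressed as a threshold on one metric."""
    name: str
    threshold: float
    higher_is_better: bool = True

    def is_met(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def evaluate(metrics, criteria):
    """Map each criterion name to whether the measured metric satisfies it."""
    return {c.name: c.is_met(metrics[c.name]) for c in criteria}


# Hypothetical criteria and measured values, invented for this example.
CRITERIA = [
    Criterion("recall", threshold=0.80),
    Criterion("false_positive_rate", threshold=0.05, higher_is_better=False),
]
METRICS = {"recall": 0.84, "false_positive_rate": 0.07}

results = evaluate(METRICS, CRITERIA)
ready_for_deployment = all(results.values())
print(results, ready_for_deployment)
```

In this example the recall target is met but the false positive rate is not, so the model would go back for another iteration rather than on to deployment.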

Deployment 

Once the model is built and the evaluation is complete, the final deliverable is deployed.

  • Deploy the Model: This can involve static reports, real-time queries, or background analysis for real-time reporting.

Why consider CRISP-DM for your operation?

The Data Science Process Alliance lists several other methodologies that companies commonly use, including SEMMA, KDD, OSEMN (pronounced “awesome”), and the Team Data Science Process (TDSP).

However, CRISP-DM remains popular and is used for both traditional data mining and AI-based data science applications. It provides a flexible methodology that aligns with agile principles and practices, enabling data scientists to deliver high-quality results.

AI and out-of-the-box tools are increasingly used in data science projects. Integrating business-specific information into these tools to enhance the models is key to a successful outcome. Using a methodology like CRISP-DM ensures consistency and helps meet customer business goals effectively.

References

  • CRISP-DM 1.0 – Step-by-step data mining guide. Pete Chapman (NCR), Julian Clinton (SPSS), Randy Kerber (NCR), Thomas Khabaza (SPSS), Thomas Reinartz (DaimlerChrysler), Colin Shearer (SPSS) and Rüdiger Wirth (DaimlerChrysler) – offline, but still accessible through Web Archive.
  • Data Science Process Alliance articles.
  • Data Science Central
  • Microsoft Team Data Science Process (1) and (2)
  • KDD, SEMMA and CRISP-DM: A Parallel Overview. Ana Azevedo and M.F. Santos
  • KDNuggets

Software Developer at e-Core
