
Data Science Workflow

Building a minimal Data Science Project in DataCards.

DataCards is a real-time collaboration platform where data experts and business stakeholders co-create dynamic, interactive dashboards. Built on interconnected Python notebooks that pull data from all your sources and APIs, it cuts through complex, fragmented data and bridges the gap between analysts and decision-makers.

This white paper outlines a lightweight, reproducible workflow for data science projects using DataCards, with a focus on modularity, scalability, and stability in a collaborative environment based on Python Notebooks. The approach is designed to reduce memory footprint, prevent redundancy, and improve project maintainability by dividing responsibilities across focused notebooks.

  • Context: In many data science workflows, kernels die, environments reset, or cloud instances time out, leading to productivity loss and frustration.
  • Problem: Repeated environment setup, data imports, and memory-heavy processes slow down iteration and hinder reproducibility.
  • Solution: We propose a minimal DataCards project structure that separates concerns and leverages DataCards for efficient data usage and persistence between notebooks.

High-Level Data Science Workflow Overview

Every successful data science project follows a structured approach that transforms raw data into actionable insights. The standard workflow typically includes:

  1. Environment Setup & Dependencies - Establishing a reproducible environment with all necessary tools and libraries
  2. Data Acquisition & Preparation - Loading, cleaning, and structuring data from various sources
  3. Exploratory Data Analysis - Understanding data patterns, relationships, and quality through filtering and analysis
  4. Modeling & Business Logic - Applying statistical models, machine learning algorithms, or business rules to extract insights
  5. Results Communication - Visualizing findings and presenting results to stakeholders in an accessible format

DataCards Process Overview

DataCards revolutionizes this traditional workflow by enabling real-time collaboration between data experts and business stakeholders through interconnected Python notebooks. The platform addresses common pain points in data science projects:

  • Memory Management: Prevents kernel crashes and reduces RAM consumption through modular design
  • Reproducibility: Ensures consistent results across team members and environments
  • Collaboration: Bridges the gap between technical analysts and business decision-makers
  • Persistence: Maintains data and variables across notebook sessions using DataCards variables

The DataCards approach divides the traditional workflow into specialized, interconnected notebooks that work together seamlessly while maintaining separation of concerns. This modular structure allows teams to iterate quickly, recover from failures efficiently, and scale their analysis as projects grow.

Figure: Process view of a standard DataCards project (whole overview)

1. Installation/Data Upload Notebook

This notebook installs all external libraries and uploads the static files needed for the project.

Purpose:

  • Centralize environment setup
  • Ensure reproducibility
  • Avoid repetitive setup steps
  • Act as a one-stop restart recovery when kernels die or VMs reset

Steps:

  • Install libraries and packages in this single notebook if necessary (not shown in the screenshot, but best practice)
  • Import libraries (this happens per notebook)
  • Save files to the project’s filesystem and publish them as DataCards variables so they can be reused in later notebooks
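
The sketch below illustrates what such a notebook could look like. The file name `raw_sales.csv`, the `data/` and `shared/` folders, and the printed path are purely illustrative assumptions; in a real project the path would be published as a DataCards variable rather than printed or hard-coded.

```python
import subprocess
import sys
from pathlib import Path

# Install all external libraries once; the other notebooks only import them.
subprocess.check_call([sys.executable, "-m", "pip", "install", "pandas", "matplotlib"])

# Prepare the project filesystem folders that the later notebooks rely on.
Path("data").mkdir(exist_ok=True)    # uploaded static input files live here
Path("shared").mkdir(exist_ok=True)  # local stand-in for DataCards variables

# Register where the uploaded static file lives. A real project would publish
# this path as a DataCards variable instead of hard-coding it downstream.
RAW_SALES = Path("data") / "raw_sales.csv"  # illustrative file name
print(f"Expecting raw input at {RAW_SALES}")
```

When a kernel dies or a VM resets, rerunning only this notebook restores the environment, so none of the downstream notebooks need to repeat the setup.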

2. Main Data Handling Notebook

This notebook is used to load, clean, and prepare the data for further analysis or modeling.

Purpose:

Why separate it from the installation notebook:

  • Keeps the environment setup and data logic modular
  • Enables faster reruns after crashes or kernel restarts

Note: If the data handling is too complex for one notebook, split the sequence into separate notebooks for data loading, cleaning, and augmentation. Another alternative is one notebook per data source.
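
A minimal sketch of such a data handling notebook is shown below. The CSV path, the column names (`order_date`, `revenue`), and the pickle file under `shared/` are illustrative assumptions; the pickle merely stands in for publishing the cleaned frame as a DataCards variable.

```python
from pathlib import Path
import pandas as pd

# Load the raw file registered by the installation notebook
# (illustrative path and column names, not part of the DataCards docs).
raw = pd.read_csv("data/raw_sales.csv", parse_dates=["order_date"])

# Minimal cleaning: normalize column names, drop duplicates, fill gaps.
clean = (
    raw.rename(columns=str.lower)
       .drop_duplicates()
       .assign(revenue=lambda df: df["revenue"].fillna(0.0))
)

# Hand the prepared frame to downstream notebooks. The pickle file is only a
# local stand-in for publishing a DataCards variable.
Path("shared").mkdir(exist_ok=True)
clean.to_pickle("shared/sales_clean.pkl")
```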

3. Input/Filter Cards Notebook

This notebook contains all the filtering setup for the main data sources.

Purpose:

Why separate filter cards:

  • Separation of concerns: Filtering logic is distinct from raw data cleaning and final presentation. Keeping it in its own notebook keeps everything modular and easier to debug.
  • Enables dynamic logic for card generation: Cards might be grouped by status, region, project, etc. This notebook becomes the bridge between raw data and personalized outputs, which is especially useful if the filters evolve.
  • RAM concerns: Keeping all filter cards in a single notebook (rather than one notebook per filter) saves memory.
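
A minimal sketch, assuming the cleaned data from the previous notebook and an illustrative `region` column; the pickle files again stand in for DataCards variables.

```python
import pickle
from pathlib import Path
import pandas as pd

# Read the prepared data published by the data handling notebook
# (the pickle file stands in for the DataCards variable).
clean = pd.read_pickle("shared/sales_clean.pkl")

# Define the filter views the cards are built from, e.g. one subset per
# region; "region" is an illustrative column name.
views = {region: clean[clean["region"] == region]
         for region in clean["region"].unique()}

# Publish the filtered views so the business-logic notebooks can pick them up.
Path("shared").mkdir(exist_ok=True)
with open("shared/sales_by_region.pkl", "wb") as fh:
    pickle.dump(views, fh)
```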

4. Business Logic/Model Notebook(s)

This notebook contains the main transformations needed for the end goal. Using several notebooks at this stage can also improve performance, since calculations can run in parallel in separate notebooks.

Purpose:

  • Isolate computational complexity: Holds the core analytical/predictive code, separate from data preparation and visualization.
  • Enable iterative modeling: Lets you try different modeling approaches without repeating data loading or plotting.
  • Improve maintainability: Allows changes to business logic or model architecture without affecting other workflow components.
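
The sketch below uses a simple monthly revenue aggregation as a stand-in for the actual business logic or model; the column names and pickle files are the same illustrative assumptions as in the earlier sketches.

```python
from pathlib import Path
import pandas as pd

# Load the prepared data (local stand-in for a DataCards variable).
clean = pd.read_pickle("shared/sales_clean.pkl")

# Core business logic, kept apart from data prep and plotting: here a simple
# monthly revenue aggregation per region; replace with your model/simulation.
monthly = (
    clean.assign(month=clean["order_date"].dt.to_period("M").astype(str))
         .groupby(["region", "month"], as_index=False)["revenue"]
         .sum()
)

# Publish the aggregated result for the visualization notebook.
Path("shared").mkdir(exist_ok=True)
monthly.to_pickle("shared/monthly_revenue.pkl")
```

Because this notebook neither loads raw data nor plots anything, swapping in a different model only requires rerunning this one step.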

5. Visualization Notebook/Result Display

This notebook presents the results in whatever form is most useful. In the process shown below, one notebook aggregates the simulation results and publishes the aggregation as a variable; that variable is then used by a visualization notebook.

Purpose:

  • Communicate findings effectively: Turn complex insights into clear visuals that stakeholders can grasp quickly.
  • Support decision-making: Present results in business-ready formats without requiring technical context.
  • Separate presentation from computation: Update visuals independently of heavy processing steps.
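
As a minimal sketch, the visualization notebook only needs to read the published aggregate and plot it; the pickle file and column names are the same illustrative assumptions as above, standing in for the DataCards variable published by the aggregation notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Read the aggregated results published by the business-logic notebook
# (again, a local stand-in for the DataCards variable mentioned above).
monthly = pd.read_pickle("shared/monthly_revenue.pkl")

# A simple stakeholder-facing view: monthly revenue per region.
fig, ax = plt.subplots()
for region, group in monthly.groupby("region"):
    ax.plot(group["month"], group["revenue"], label=str(region))

ax.set_xlabel("Month")
ax.set_ylabel("Revenue")
ax.set_title("Monthly revenue by region")
ax.legend()
plt.show()
```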

Trade-off between the number of notebooks, RAM, and card positioning

Note: Depending on what each notebook does, roughly 20 notebooks (even without user code execution) can already consume around 4 GB of RAM, so clean separation is essential for stability and reproducibility.

| Approach | Position set in FE (automatically set in card store) | Position set in code (set in notebook) |
| --- | --- | --- |
| Many notebooks (≈1 card per notebook) | Pros: max isolation, easy to reorder in the frontend. Cons: highest RAM (more notebooks). Use when: experiment-heavy work, designing projects. | Pros: per-card logic in code. Cons: highest RAM, and positions fixed in code are harder to change and find. Use when: a deterministic, code-driven layout per card is needed, with clear separation between cards at the logic level. |
| Few notebooks, grouped (many cards per notebook) | Pros: low RAM. Cons: the FE cannot set per-card positions for cards emitted by a single notebook. Use when: not implemented / cannot be used at the moment. | Pros: lowest RAM, positions repeatable from code. Cons: least flexible for design, drag & drop is not useful, and code changes are needed to reorder. Use when: stable layouts and strict determinism are required. (Default) |