Developing an AI system is not a single act — it is a structured, iterative lifecycle that spans from a business problem on a whiteboard to a production system that runs continuously, generates real outputs, and must be governed, maintained, and improved over its operational life. Understanding this lifecycle end-to-end is essential for anyone involved in commissioning, building, governing, or evaluating AI systems.
This article presents a widely applicable process for AI system development, drawing on established frameworks including CRISP-DM, Google's ML guidelines, Microsoft's Responsible AI framework, and the ISO/IEC 42001 AI management system standard. It covers every phase from initial business problem framing through model development, deployment, and ongoing governance, with the tools, practices, and governance considerations relevant to each stage.
The AI Development Lifecycle — An Overview
The ten phases described in this article follow a broadly sequential flow, but in practice AI development is highly iterative. Phase 6 (model training) often reveals gaps in Phase 4 (feature engineering) that require returning to the data layer. Phase 7 (evaluation) may reveal that the Phase 1 problem was too ambiguously defined. Phase 9 (monitoring) may trigger a complete restart from Phase 2 when the production data distribution has shifted significantly from the training data. This iterative reality is not a failure — it is the expected pattern of AI development that teams must plan and govern for.
Phase 1: Business Problem Definition & Feasibility
- Problem translation: Convert the business need into a precise, measurable AI problem statement. "Improve customer retention" becomes "predict which customers are likely to churn in the next 90 days with sufficient accuracy to enable cost-effective intervention." The machine learning framing requires specificity that business requirements rarely include.
- AI appropriateness assessment: Determine whether the problem actually requires AI. Many problems that organisations want to solve with AI are better solved with conventional analytics, rules-based automation, or process redesign. AI is appropriate when: there is sufficient data, the problem is too complex for explicit rule-writing, and the value of a learned predictive function exceeds its development cost.
- Success criteria definition: Define exactly what "success" means in measurable terms before the project begins. What accuracy metric? What performance threshold? What business outcome metric (revenue impact, cost reduction, error rate reduction)? Define the minimum viable performance threshold below which the system should not be deployed.
- Data availability preliminary assessment: Is sufficient, relevant, labelled data available or obtainable? Data availability is the most common make-or-break constraint on AI feasibility. Many technically excellent AI project designs fail because the data required to train them does not exist at the required quality, volume, or recency.
- Risk and ethical pre-screening: Identify upfront whether the AI system will make decisions affecting people, whether it operates in a regulated domain, what the consequences of errors are, and whether there are potential bias or fairness concerns. This pre-screening determines the level of governance required.
- Stakeholder alignment: Ensure all key stakeholders — business sponsors, data owners, legal/compliance, affected operational teams, and potential end users — are aligned on what the system will do and the constraints it must respect.
Key deliverables:
- AI project charter with defined problem statement, success criteria, and constraints
- AI feasibility assessment document
- Initial AI risk assessment (required by ISO/IEC 42001 and, for systems classified as high-risk, by the EU AI Act)
- Data availability and quality initial assessment
- Go/no-go decision with documented rationale
Common pitfalls:
- Skipping this phase and jumping directly to data collection or model building
- Defining success in technical model terms (accuracy %) rather than business outcome terms (revenue impact, error reduction)
- Failing to involve domain experts, legal/compliance, or affected end users in the problem definition
- Underestimating the data requirements — "we must have enough data" is not the same as having labelled, representative, quality-controlled data at the required volume
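The churn example in the problem-translation step asks for "sufficient accuracy to enable cost-effective intervention". One way to turn that phrase into a minimum viable performance threshold is a break-even calculation. The sketch below is illustrative only: the function names, the cost and value figures, and the `save_rate` parameter (the share of contacted churners who actually stay) are assumptions, not figures from any real project.

```python
def break_even_precision(cost_per_contact, value_per_saved_customer, save_rate):
    """Lowest precision at which contacting flagged customers pays for itself."""
    gain_per_true_churner = value_per_saved_customer * save_rate
    if gain_per_true_churner <= 0:
        raise ValueError("expected gain per true churner must be positive")
    return cost_per_contact / gain_per_true_churner


def campaign_value(n_flagged, precision, cost_per_contact,
                   value_per_saved_customer, save_rate):
    """Expected net value of contacting every flagged customer."""
    expected_saves = n_flagged * precision * save_rate
    return expected_saves * value_per_saved_customer - n_flagged * cost_per_contact


# Hypothetical figures: 20 per retention offer, 400 retained per saved
# customer, and one in four contacted churners actually staying.
threshold = break_even_precision(20.0, 400.0, 0.25)
net_value = campaign_value(1000, 0.35, 20.0, 400.0, 0.25)
print(f"minimum viable precision: {threshold:.2f}")
print(f"expected net value at 35% precision: {net_value:.0f}")
```

A model whose precision on flagged customers sits below the break-even value loses money on every campaign, which gives the go/no-go decision a concrete number to test against.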
Phase 2: Data Discovery & Strategy
- Data source inventory: Catalogue all internal and external data sources that may be relevant to the problem. Internal sources: transactional databases, CRM systems, ERP records, sensor data, logs, documents. External sources: public datasets, purchased data, open-source training datasets, third-party APIs.
- Data quality assessment: Evaluate each data source on: completeness (are there significant gaps?), accuracy (is the data factually correct?), consistency (is the same concept represented consistently across sources?), timeliness (is the data current and regularly updated?), and representativeness (does the data represent all relevant subpopulations?).
- Data gap analysis: Identify the gap between the data you have and the data you need. What labels are missing? What demographic groups are underrepresented? What time periods are absent? What historical events are missing from the training data?
- Data governance and legal review: Assess the legal basis for using each data source in AI training. Personal data requires a lawful basis under GDPR. Third-party data requires licence terms that permit AI training use. Copyrighted content raises training data IP issues. This review should be completed before data is collected or used, not after.
- Data strategy design: Define the data pipeline architecture — how data will flow from source to training, including collection, storage, preprocessing, versioning, and access controls.
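Two of the quality dimensions above, completeness and representativeness, are easy to measure early. The following is a minimal sketch using plain dictionaries as records; the field names, the reference population shares, and the `tolerance` rule for flagging a group are all hypothetical choices made for illustration.

```python
from collections import Counter


def completeness(records, fields):
    """Share of records holding a non-missing value for each field."""
    if not records:
        raise ValueError("no records to profile")
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in fields
    }


def group_shares(records, field):
    """Share of records falling into each value of a grouping field."""
    counts = Counter(r.get(field) for r in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(shares, reference, tolerance=0.5):
    """Groups whose share in the data is below tolerance * reference share."""
    return sorted(g for g, ref in reference.items()
                  if shares.get(g, 0.0) < tolerance * ref)


# Hypothetical sample and reference population shares.
records = [
    {"customer_id": 1, "region": "north", "tenure_months": 12},
    {"customer_id": 2, "region": "north", "tenure_months": None},
    {"customer_id": 3, "region": "south", "tenure_months": 30},
    {"customer_id": 4, "region": "north", "tenure_months": 7},
]
reference = {"north": 0.3, "south": 0.6, "west": 0.1}

print(completeness(records, ["customer_id", "region", "tenure_months"]))
print(underrepresented(group_shares(records, "region"), reference))
```

The output of a profile like this feeds directly into the gap analysis: fields with low completeness and groups flagged as underrepresented are the gaps to close or document.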
Phase 3: Data Collection, Ingestion & Labelling
- Existing internal data: Extract from operational systems — the most cost-effective source if data quality is adequate and appropriate consent or legitimate interest exists
- Public and open-source datasets: Many ML tasks have established public datasets (ImageNet for computer vision, SQuAD for question answering, Common Crawl as a web-scale corpus for language modelling). Using established benchmark datasets enables comparison with published baselines, but they may not match your specific domain.
- Synthetic data generation: Where real data is insufficient, scarce, or privacy-constrained, synthetic data generated by generative models or rule-based simulators can supplement training data. Synthetic data is increasingly used in healthcare, finance, and security where real data is sensitive.
- Active learning: Rather than labelling all data upfront, active learning techniques identify the most informative unlabelled examples for human annotation — dramatically reducing the labelling cost to achieve a given performance level.
- Annotation guidelines: Written instructions for annotators that define the labelling task precisely, include examples of edge cases, and address potentially ambiguous scenarios. Without clear guidelines, different annotators produce inconsistent labels for the same examples.
- Inter-annotator agreement (IAA): Have multiple annotators label the same examples independently, then measure agreement (Cohen's Kappa, Fleiss' Kappa). Low IAA signals ambiguous guidelines or genuinely difficult labelling tasks where a single "ground truth" may not exist.
- Gold standard examples: A set of examples with known-correct labels, used to quality-control annotators during the labelling process — flagging annotators whose accuracy on gold standards falls below threshold.
- Bias-aware annotation: Ensure annotator diversity and implement guidelines that explicitly address sensitive attributes to prevent label bias from being introduced by annotators' unconscious prejudices.
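Inter-annotator agreement for two annotators is commonly measured with Cohen's Kappa, which corrects the raw agreement rate for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below implements that formula directly; the label values are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on four items.
annotator_1 = ["churn", "churn", "stay", "stay"]
annotator_2 = ["churn", "stay", "stay", "stay"]
print(cohens_kappa(annotator_1, annotator_2))
```

Here the annotators agree on three of four items (75%), but half of that agreement is expected by chance, so kappa is noticeably lower than the raw agreement rate. Fleiss' Kappa generalises the same idea to more than two annotators.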
Phase 4: Data Preparation & Feature Engineering
- Missing value handling: Decide how to address missing data — imputation (filling with mean, median, mode, or model-predicted values), exclusion (removing records with missing values), or flagging (creating an indicator variable that signals missingness). The choice depends on the missingness mechanism: missing completely at random, missing at random, or missing not at random.
- Outlier treatment: Identify and address extreme values — whether true anomalies that represent errors, or genuine extreme observations that the model must handle. Blindly removing outliers can introduce bias if extreme values are genuinely present in the deployment distribution.
- Deduplication: Remove duplicate records that would artificially inflate the influence of certain observations on model training.
- Type normalisation: Ensure consistent data types, formats, and encodings across the dataset — particularly for date/time fields, categorical encodings, and text normalisation.
- Numeric transformations: Log transformations for skewed distributions; normalisation/standardisation for scale-sensitive algorithms; polynomial features for capturing non-linear relationships
- Categorical encoding: One-hot encoding, target encoding, embedding representations for high-cardinality categoricals. The choice matters significantly for model performance and interpretability.
- Temporal features: Extracting meaningful patterns from timestamps — day of week, time since last event, rolling averages, trend indicators — that capture temporal dynamics the raw timestamp alone doesn't express
- Text features: Tokenisation, TF-IDF representations, word embeddings (Word2Vec, GloVe) or contextual embeddings (BERT, transformers) for natural language inputs
- Domain-specific features: Features derived from domain knowledge — the most valuable features are often those that encode expert understanding of what signals matter for the prediction task
- Feature selection: Removing irrelevant and redundant features through correlation analysis, mutual information, or model-based importance scores to reduce dimensionality and prevent overfitting
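Two of the preparation steps above can be sketched in a few lines: median imputation combined with a missingness flag, and a log transformation for a skewed numeric field. The income values are hypothetical. The fill value is fitted on training data only and then reused on new data, so that no information from evaluation or production data leaks into preprocessing.

```python
import math
from statistics import median


def fit_median(values):
    """Median of the observed (non-missing) training values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot fit a median with no observed values")
    return median(observed)


def impute_with_flag(values, fill_value):
    """Return (value, was_missing) pairs; the flag keeps the missingness signal."""
    return [(fill_value if v is None else v, 1 if v is None else 0) for v in values]


def log1p_transform(values):
    """log(1 + x), which compresses a long right tail and is defined at zero."""
    return [math.log1p(v) for v in values]


# Hypothetical skewed income field with gaps.
train_income = [30_000, None, 45_000, 1_200_000, None, 38_000]
new_income = [None, 52_000]

fill = fit_median(train_income)            # fitted on training data only
train_rows = impute_with_flag(train_income, fill)
new_rows = impute_with_flag(new_income, fill)
train_log_income = log1p_transform([value for value, _ in train_rows])
print(fill, new_rows)
```

Whether a flag column is worth keeping depends on the missingness mechanism: when data is missing not at random, the flag itself can be predictive.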
Phase 5: Model Selection & Architecture Design
- Problem type mapping: Classification (predict a category), regression (predict a value), clustering (find natural groupings), ranking (order items by relevance), generation (produce new content), anomaly detection (identify unusual patterns), recommendation (suggest items). Each maps to a different class of algorithms.
- Interpretability requirements: High-stakes regulated domains (credit, healthcare, criminal justice) often require interpretable models whose decisions can be explained to affected individuals and regulators. Linear models, decision trees, and rule-based systems are inherently interpretable. Neural networks and ensemble methods are more powerful but require post-hoc explanation methods (LIME, SHAP).
- Data volume and velocity: The available training data size influences architecture choice. Deep learning requires large datasets; gradient boosting (XGBoost, LightGBM) often performs better on tabular data with moderate training sets; transformer architectures require massive pre-training data but can be fine-tuned on smaller domain-specific datasets.
- Inference latency requirements: Real-time applications (fraud detection, content moderation) typically need a response within tens to hundreds of milliseconds, with sub-100ms a common target. Batch applications (monthly customer segmentation) can tolerate minutes. Architecture complexity and deployment infrastructure must be matched to latency requirements.
- Transfer learning vs. training from scratch: For most organisations, fine-tuning a pre-trained foundation model (Llama, BERT, ViT) on domain-specific data is far more efficient than training from scratch. Training from scratch is only appropriate when pre-trained models don't cover the domain, when the data distribution is fundamentally different from pre-training data, or when proprietary data constraints prevent using public models.
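Latency requirements are usually stated as a tail percentile rather than an average, because a small share of slow requests is what users and downstream systems notice. The sketch below checks a candidate model against a latency budget using a nearest-rank percentile. The `predict` callable, the sample latencies, and the 100 ms budget are assumptions for illustration.

```python
import math
import time


def measure_latencies_ms(predict, inputs):
    """Time each call to a predict function, in milliseconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        predict(item)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_budget(latencies_ms, budget_ms, q=95):
    """True when the q-th percentile latency is within the budget."""
    return percentile(latencies_ms, q) <= budget_ms


# Hypothetical measurements with two slow outliers.
observed_ms = [12, 15, 11, 90, 14, 13, 16, 120, 12, 14]
print(percentile(observed_ms, 50), percentile(observed_ms, 95))
print(meets_latency_budget(observed_ms, budget_ms=100))
```

In this made-up sample the median looks comfortable while the 95th percentile breaks the budget, which is the situation an average would hide.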
| Problem Type | Common Architectures | Best For |
|---|---|---|
| Tabular Classification/Regression | XGBoost, LightGBM, CatBoost, Random Forest, Logistic Regression, Neural Tabular | Enterprise data (CRM, ERP, financial), fraud, churn, credit |
| Natural Language | Transformer LLMs (BERT, RoBERTa, Llama fine-tuned), GPT variants, Sentence Transformers | Classification, NER, QA, summarisation, generation |
| Computer Vision | CNN (ResNet, EfficientNet), Vision Transformers (ViT), YOLO series, SAM | Image classification, object detection, segmentation |
| Time Series | LSTM, Transformer-based (Temporal Fusion Transformer), N-BEATS, Prophet | Demand forecasting, anomaly detection, financial prediction |
| Recommendation | Matrix Factorisation, Two-Tower Neural, Graph Neural Networks | Product recommendation, content personalisation |
| Generative | Diffusion Models, VAEs, LLMs (GPT-style), GANs | Image/text/audio generation, synthetic data, summarisation |
Phase 6: Model Training & Optimisation
- Loss function selection: The mathematical function that quantifies how wrong the model's predictions are. Cross-entropy for classification, mean squared error for regression, custom loss functions for domain-specific objectives. The loss function defines what the model optimises for — it must be carefully chosen to align with the business objective, not just mathematical convenience.
- Optimiser selection: Adam, AdamW, and SGD with momentum are the most common gradient descent variants. Learning rate scheduling (reducing learning rate over training) and gradient clipping (preventing unstable large gradients) are critical for training stability.
- Batch training and epochs: Training on minibatches of data rather than the full dataset; an epoch is a complete pass through the training data. Early stopping — halting training when validation performance stops improving — prevents overfitting while reducing compute waste.
- Regularisation: Techniques to prevent overfitting (memorising training data rather than learning generalisable patterns): L1/L2 weight regularisation, dropout (randomly disabling neurons during training), data augmentation (creating modified training examples), batch normalisation.
- Transfer learning and fine-tuning: For large pre-trained models, fine-tuning adjusts the model's weights on domain-specific data. Techniques include full fine-tuning (adjusting all weights) and parameter-efficient fine-tuning (LoRA, prefix tuning), which trains a small number of added parameters while the base weights stay frozen. Retrieval-augmented approaches are an alternative to fine-tuning: the base model stays frozen and is supplied with retrieved context at inference time.
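Early stopping, mentioned above, can be sketched independently of any training framework. The class below tracks the best validation loss and signals a stop after `patience` epochs without improvement. The class name, the `min_delta` parameter, and the simulated loss values are illustrative assumptions; a real loop would compute the validation loss after each epoch and restore the checkpoint saved at the best epoch.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, validation_loss):
        """Record one epoch; return True when training should stop."""
        if validation_loss < self.best_loss - self.min_delta:
            self.best_loss = validation_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Simulated per-epoch validation losses (hypothetical values).
simulated_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(simulated_losses):
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break

print(stopper.best_epoch, stopper.best_loss, stopped_at)
```

The simulated run also shows the trade-off in choosing `patience`: training halts before the final, slightly better epoch is reached, so a low value saves compute at the risk of stopping during a temporary plateau.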
- Grid search: Exhaustive search over a defined parameter grid — thorough but computationally expensive for large spaces
- Random search: Random sampling of hyperparameter combinations — often more efficient than grid search for high-dimensional spaces (Bergstra & Bengio, 2012)
- Bayesian optimisation: Uses a probabilistic model to guide search toward promising hyperparameter regions — most efficient for expensive training runs (Optuna, Hyperopt)
- Neural Architecture Search (NAS): Automated search over model architectures — used primarily for deep learning where architecture choices (depth, width, connection patterns) matter significantly
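Random search is simple enough to sketch in full. In the example below, `toy_validation_loss` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation loss", and the search space bounds are arbitrary. The learning rate is sampled log-uniformly, a common choice when a hyperparameter spans several orders of magnitude.

```python
import math
import random


def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials configurations at random and keep the lowest-loss one."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def toy_validation_loss(params):
    # Hypothetical stand-in for "train a model, return its validation loss".
    log_lr = math.log10(params["learning_rate"])
    return (log_lr + 2.5) ** 2 + 0.1 * (params["dropout"] - 0.2) ** 2


search_space = {
    # Learning rates are sampled log-uniformly between 1e-5 and 1e-1.
    "learning_rate": lambda rng: 10 ** rng.uniform(-5, -1),
    "dropout": lambda rng: rng.uniform(0.0, 0.5),
}

best_params, best_loss = random_search(
    toy_validation_loss, search_space, n_trials=50, seed=1
)
print(best_params, round(best_loss, 4))
```

Every trial draws a fresh value for each hyperparameter, so 50 trials test 50 distinct learning rates, whereas a grid with the same budget would test only a handful. Libraries such as Optuna or Hyperopt replace the random sampler with a model-guided one.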
Phase 7: Model Evaluation & Testing
- Classification metrics: Accuracy, precision, recall, F1-score, ROC-AUC, precision-recall curves. Choose metrics based on the cost of different error types — false negatives in cancer detection are more costly than false positives; false positives in fraud detection must be balanced against false negative costs.
- Regression metrics: MAE (Mean Absolute Error), RMSE (Root Mean Square Error), MAPE (Mean Absolute Percentage Error), R². Choose based on whether large errors should be penalised disproportionately (RMSE) or treated linearly (MAE).
- Business metric evaluation: Translate model performance metrics into business outcomes. How much revenue is saved per additional percentage point of fraud detection recall? What does each false positive cost in a recruitment model? The business metric is the ultimate evaluation criterion.
- Confidence calibration: If the model produces probability scores, verify that predicted probabilities correspond to actual event frequencies (calibration). A model that says "70% probability" should be correct approximately 70% of the time when it says that.
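The calibration check described above can be sketched by binning predictions and comparing the mean predicted probability with the observed event rate in each bin. The example also computes an expected calibration error (ECE), the bin-size-weighted average of those gaps. The equal-width binning, the bin count, and the sample data are illustrative assumptions.

```python
def reliability_table(probabilities, outcomes, n_bins=5):
    """Per-bin (mean predicted probability, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        rows.append((mean_predicted, observed_rate, len(members)))
    return rows


def expected_calibration_error(probabilities, outcomes, n_bins=5):
    """Weighted average gap between predicted and observed rates."""
    total = len(probabilities)
    return sum(
        count / total * abs(mean_predicted - observed_rate)
        for mean_predicted, observed_rate, count
        in reliability_table(probabilities, outcomes, n_bins)
    )


# Hypothetical scores: one well-calibrated model, one overconfident model.
calibrated = expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3)
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
print(round(calibrated, 3), round(overconfident, 3))
```

The first model says "70%" and is right 7 times in 10, so its error is close to zero. The second says "90%" but is right only half the time, and the gap shows up directly in the ECE.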
- Disaggregated performance evaluation by all relevant protected characteristics (race, gender, age, disability, geography)
- Fairness metric evaluation (demographic parity, equal opportunity, equalised odds) — see the companion article on AI Bias for detail
- Proxy variable audit — identify features that may serve as proxies for protected characteristics
- Adversarial testing — actively probe the model for discriminatory outputs by systematically varying protected characteristics
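Disaggregated evaluation and two of the fairness metrics named above can be sketched together. Demographic parity compares selection rates (the share of positive predictions) across groups, and equal opportunity compares true positive rates. The group labels and predictions below are fabricated for illustration, and reporting the largest gap between any two groups is one convention among several.

```python
def group_rates(y_true, y_pred, groups):
    """Selection rate and true positive rate for each group."""
    stats = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        s = stats.setdefault(
            group, {"n": 0, "predicted_pos": 0, "actual_pos": 0, "true_pos": 0}
        )
        s["n"] += 1
        s["predicted_pos"] += pred
        s["actual_pos"] += truth
        s["true_pos"] += 1 if (truth == 1 and pred == 1) else 0
    return {
        group: {
            "selection_rate": s["predicted_pos"] / s["n"],
            # Undefined when a group has no actual positives.
            "true_positive_rate": (
                s["true_pos"] / s["actual_pos"] if s["actual_pos"] else None
            ),
        }
        for group, s in stats.items()
    }


def largest_gap(rates, metric):
    """Largest difference in a metric between any two groups."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    return max(values) - min(values)


# Hypothetical binary labels and predictions for two groups.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_rates(y_true, y_pred, groups)
demographic_parity_gap = largest_gap(rates, "selection_rate")
equal_opportunity_gap = largest_gap(rates, "true_positive_rate")
print(rates)
print(demographic_parity_gap, equal_opportunity_gap)
```

Equalised odds extends the second check by also comparing false positive rates. What size of gap is acceptable is a policy decision for the governance process, not something the code can settle.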
- Distribution shift testing: Evaluate performance on data from time periods, geographies, or subpopulations that differ from the training distribution, to assess how performance degrades outside the training domain
- Adversarial robustness testing: For systems where adversarial inputs are a concern (computer vision, NLP, security), test model resistance to adversarially crafted inputs
- Out-of-distribution detection: Test whether the model can recognise inputs that are fundamentally different from its training distribution — and ideally flag them for human review rather than producing confident wrong predictions
- A/B testing framework: For production deployments, plan the A/B test structure that will validate business impact before full rollout
Phase 8: Deployment & MLOps
- REST API serving: The model is exposed as an HTTP endpoint — the most common enterprise deployment pattern. Any application can call the endpoint with input data and receive predictions. Implementations: FastAPI, Flask, BentoML, Triton Inference Server.
- Batch inference: The model runs periodically on a batch of records (nightly customer scoring, weekly demand forecasting) rather than responding to real-time requests. Latency requirements are far looser than for real-time serving, and batch runs are often more cost-effective for large-scale inference.
- Edge deployment: The model runs on the device (phone, IoT sensor, embedded system) rather than in the cloud. Requires model compression (quantisation, pruning, distillation) to fit within device memory and compute constraints. Enables offline operation and privacy (data never leaves the device).
- Embedded inference: The model is embedded directly within an application binary rather than served via a separate service — used for latency-critical applications and simple deployment environments.
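- Shadow mode / canary deployment: In shadow mode, the new model runs in parallel with the existing system, receiving the same inputs but with its outputs not yet used for actual decisions. This allows comparison of new vs. old model behaviour in production before cutover. A canary deployment goes one step further: the new model serves a small share of live traffic, and that share is increased only if its metrics hold up.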
- CI/CD for ML: Continuous integration and deployment pipelines adapted for ML — automated testing of data pipelines, model training, evaluation against quality thresholds, and deployment upon passing gates
- Containerisation: Docker containers packaging the model, its dependencies, and serving code — ensuring consistent behaviour across development, staging, and production environments
- Model registry: The central store of all model versions with their metadata, performance metrics, and deployment status — the single source of truth for what is running in production
- Infrastructure as Code: Terraform, Helm charts, or CDK templates defining the model serving infrastructure — enabling consistent, reproducible deployment and disaster recovery
- Blue-green deployment: Maintaining two identical production environments and switching traffic atomically between them — enabling zero-downtime model updates and instant rollback on failure
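The "deployment upon passing gates" step in a CI/CD pipeline for ML often reduces to a small function that compares a candidate model's evaluation metrics with absolute minimums and with the current production model. The sketch below is one possible shape for such a gate. The metric names, thresholds, and the `max_regression` allowance are hypothetical, and every metric is assumed to be higher-is-better.

```python
def evaluate_gate(candidate, production, minimums, max_regression=0.01):
    """Return (passed, reasons). All metrics are assumed higher-is-better."""
    failures = []
    # Absolute floors agreed as success criteria before development began.
    for metric, floor in minimums.items():
        value = candidate.get(metric)
        if value is None:
            failures.append(f"{metric}: not reported")
        elif value < floor:
            failures.append(f"{metric}: {value:.3f} is below the minimum {floor:.3f}")
    # Relative check: the candidate must not regress from production.
    for metric, current in production.items():
        value = candidate.get(metric)
        if value is not None and value < current - max_regression:
            failures.append(
                f"{metric}: {value:.3f} regresses from production {current:.3f}"
            )
    return (not failures, failures)


# Hypothetical metrics for the running model and two candidates.
minimums = {"recall": 0.80, "precision": 0.60}
production = {"recall": 0.84, "precision": 0.66}

passed, reasons = evaluate_gate({"recall": 0.86, "precision": 0.66}, production, minimums)
blocked, block_reasons = evaluate_gate({"recall": 0.82, "precision": 0.55}, production, minimums)
print(passed, reasons)
print(blocked, block_reasons)
```

In a pipeline, a failed gate would typically stop the deployment job and attach the reasons to the model registry entry, so the decision stays auditable.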
Phase 9: Monitoring, Maintenance & Retraining
- Data drift: Detecting when the statistical distribution of incoming data has shifted significantly from the training data distribution. Common measures: PSI (Population Stability Index), KL divergence, Wasserstein distance, and Jensen-Shannon divergence, computed on feature distributions.
- Concept drift: The relationship between inputs and outputs has changed even when input distributions are stable. The factors that predicted fraud last year may no longer predict it this year because fraudsters adapt. Requires monitoring outcome labels (when available with lag) against model predictions.
- Model performance metrics: Ongoing tracking of accuracy, precision, recall, and calibration metrics in production. Requires access to ground truth labels — either via delayed feedback (loan outcome known months after credit decision) or via sampling strategies.
- Fairness metrics: Continuous monitoring of performance disparities across demographic groups — not just at deployment but throughout the model's operational life. Fairness can degrade as populations shift even when aggregate performance is stable.
- Infrastructure and operational metrics: Inference latency, throughput, error rates, resource utilisation — the operational health of the serving infrastructure.
- Business outcome metrics: Ultimately, the most important monitoring is whether the AI system is continuing to deliver the business value it was deployed to create.
- Scheduled retraining: Retrain on a defined schedule (monthly, quarterly) with new production data — appropriate when drift is gradual and predictable
- Trigger-based retraining: Retrain when monitoring metrics cross defined thresholds — more responsive than scheduled but requires well-calibrated trigger definitions
- Continuous learning: Online learning systems update model parameters continuously as new data arrives — most applicable to recommendation systems and other rapidly evolving domains
- Human-in-the-loop retraining: Incorporating human corrections and feedback into model retraining — particularly important for NLP tasks where model outputs are reviewed by experts
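Data drift monitoring and trigger-based retraining connect naturally: compute a drift measure for a feature, then compare it with a threshold. The sketch below uses the Population Stability Index, defined per bin as (actual share - expected share) * ln(actual share / expected share) and summed over bins. The bin edges, the sample values, and the 0.2 trigger threshold are assumptions for illustration; thresholds should be calibrated per feature rather than copied.

```python
import math


def bin_shares(values, bin_edges, epsilon=1e-6):
    """Share of values in each bin; empty bins are floored at epsilon."""
    counts = [0] * (len(bin_edges) + 1)
    for v in values:
        index = sum(v > edge for edge in bin_edges)  # edges strictly below v
        counts[index] += 1
    return [max(c / len(values), epsilon) for c in counts]


def population_stability_index(expected, actual, bin_edges):
    """PSI between a training-time sample and a production sample."""
    expected_shares = bin_shares(expected, bin_edges)
    actual_shares = bin_shares(actual, bin_edges)
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_shares, actual_shares)
    )


def should_retrain(psi_value, threshold=0.2):
    """Trigger rule: retrain when drift exceeds the agreed threshold."""
    return psi_value > threshold


# Hypothetical feature values at training time and in production.
training_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
production_values = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
edges = [2.5, 5, 7.5]  # bin edges fixed from the training distribution

stable_psi = population_stability_index(training_values, training_values, edges)
shifted_psi = population_stability_index(training_values, production_values, edges)
print(stable_psi, should_retrain(stable_psi))
print(round(shifted_psi, 2), should_retrain(shifted_psi))
```

The epsilon floor avoids division by zero for empty bins but inflates the PSI when a bin empties completely, so very large values are better read as "this bin vanished" than as a precise magnitude. PSI only detects input drift; concept drift still requires comparing predictions with outcome labels once they arrive.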
Phase 10: Governance, Ethics & Responsible AI
- Phase 1 governance: AI risk classification (EU AI Act risk tiers), AI impact assessment initiation, legal basis review for personal data use, ethical review for high-risk applications
- Phases 2–4 governance: Data protection impact assessment (DPIA) for personal data, data provenance and lineage documentation, bias audit of training data, legal review of data acquisition
- Phases 5–6 governance: Model development documentation (architecture decisions, training configuration, dataset versions), human oversight design, explainability approach selection
- Phase 7 governance: Independent bias and fairness evaluation, safety testing, regulatory compliance pre-deployment review, deployment gate sign-off by authorised individuals
- Phase 8 governance: Model card publication, user transparency notifications, human override mechanisms implementation, deployment documentation
- Phase 9 governance: Ongoing bias monitoring, incident response procedures, complaints handling mechanism, periodic governance audit