How much did DeepSeek V3 cost to train compared to GPT-4?

DeepSeek reported a training cost of roughly $6 million for V3. That figure covers the final training run and excludes earlier research and ablation experiments. OpenAI's GPT-4 is estimated at around $100 million in compute costs. DeepSeek achieved this through architectural efficiency: multi-head latent attention (MLA), mixture-of-experts routing, and FP8 mixed-precision training. V3 also used roughly one-tenth the training compute of Meta's Llama 3.1 405B.

What is the difference between DeepSeek and Qwen?

DeepSeek is a Chinese AI startup focused on reasoning efficiency. Its R1 model (January 2025) was the first open reasoning model to match OpenAI's o1. Qwen (Tongyi Qianwen) is Alibaba's model family, ranging from 0.5B to 72B+ parameters. Qwen focuses on breadth: multilingual capability, coding, math, and multimodal tasks. Both families include open-weight models that have outperformed Western models on multiple benchmarks.

What does continual learning mean for enterprise AI products?

Today, updating an enterprise AI model means full retraining at significant cost. Continual learning would allow incremental updates: add new product catalog items without retraining the recommendation model, update policy documents without rebuilding the compliance assistant. The 2025 breakthroughs (MESU, Meta FAIR sparse memory, Google nested learning, Neural ODE extensions) each report a reduction in forgetting of 24% or more. Production-ready enterprise tooling is still 12-18 months away.

Should companies use DeepSeek or Qwen instead of OpenAI?

For many use cases, yes. DeepSeek V3 and Qwen 2.5-Max match GPT-4o on coding and reasoning benchmarks at a fraction of the API cost. The main considerations: data privacy (where models are hosted), reliability of the provider (uptime, SLAs), and whether your use case is censorship-sensitive. For self-hosted deployments where you control the infrastructure, open-weight Chinese models are a serious option today.

AI Frontier 2025: Continual Learning & DeepSeek

Q: What is catastrophic forgetting in AI?

Catastrophic forgetting happens when a neural network, trained on new tasks, overwrites the weights that encoded previous knowledge. Fine-tune a medical model on legal documents and it forgets medicine. The network has no mechanism to preserve old learning while acquiring new learning. This is why production models get retrained from scratch rather than incrementally updated.

DeepSeek reported training V3 for about $6M. GPT-4 is estimated at around $100M. That's roughly a 16x difference. And it changed the economics of enterprise AI overnight.

In January 2025, DeepSeek released R1, a reasoning model built on the V3 base model. It matched OpenAI's o1 on reasoning benchmarks. The reported training cost of the underlying V3 model was about $6M. Most enterprise AI teams assumed frontier-level capability required frontier-level budgets. That assumption no longer holds.

63% of new fine-tuned models now use Chinese base models like Qwen and DeepSeek. That number was close to 0% two years ago. Open-weight models have closed the gap with closed models faster than anyone predicted.

What This Means for Enterprise AI

You don't need to pay OpenAI rates for every inference. Self-hosted open-weight models running on cloud GPUs can hit 80-90% of GPT-4o performance at 10-30% of the cost. The trade is ops overhead. For high-volume use cases, that trade usually makes sense.
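
Whether the trade makes sense depends on volume, and a rough break-even calculation can make that concrete. The sketch below is illustrative only: every price, throughput figure, and overhead number in it is a made-up assumption, not a quote from any provider, and the names (ApiPricing, SelfHosted, and so on) are hypothetical. Substitute your own measurements before drawing conclusions.

```python
import math
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average hours in a month


@dataclass
class ApiPricing:
    usd_per_million_tokens: float  # blended input/output price (assumed)


@dataclass
class SelfHosted:
    gpu_usd_per_hour: float    # rental price for one GPU node (assumed)
    tokens_per_second: float   # sustained throughput of one node (assumed)
    ops_usd_per_month: float   # engineering and on-call overhead (assumed)


def api_monthly_cost(api: ApiPricing, tokens_per_month: int) -> float:
    """Pay-per-token cost: scales linearly with volume."""
    return api.usd_per_million_tokens * tokens_per_month / 1_000_000


def self_hosted_monthly_cost(host: SelfHosted, tokens_per_month: int) -> float:
    """Fixed ops overhead plus enough always-on nodes to cover the volume."""
    node_capacity = host.tokens_per_second * 3600 * HOURS_PER_MONTH
    nodes = max(1, math.ceil(tokens_per_month / node_capacity))
    return nodes * host.gpu_usd_per_hour * HOURS_PER_MONTH + host.ops_usd_per_month


# Hypothetical inputs for illustration only.
api = ApiPricing(usd_per_million_tokens=5.0)
host = SelfHosted(gpu_usd_per_hour=10.0, tokens_per_second=2000.0,
                  ops_usd_per_month=8000.0)

for volume in (1_000_000_000, 5_000_000_000, 20_000_000_000):
    a = api_monthly_cost(api, volume)
    s = self_hosted_monthly_cost(host, volume)
    cheaper = "self-hosted" if s < a else "managed API"
    print(f"{volume:>14,} tokens/month: API ${a:,.0f} vs self-hosted ${s:,.0f} -> {cheaper}")
```

With these assumed inputs the managed API wins at low volume and self-hosting wins once the fixed ops and GPU costs are spread over enough tokens. The crossover point moves a lot with real throughput and staffing costs, so treat the model as a template for your own numbers.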

We now have a standard recommendation: start with managed APIs like OpenAI or Anthropic to validate your use case fast. Once volume justifies it, evaluate open-weight alternatives. The migration path exists. Plan for it from the architecture phase.

Model	Training Cost	Parameters	License	Benchmark standing (vs GPT-4o unless noted)
DeepSeek V3	~$6M	671B (37B active)	MIT	Comparable or better on coding/math
Qwen 2.5-Max	Not disclosed	Not disclosed (MoE)	Proprietary, API only (most smaller Qwen 2.5 models are Apache 2.0)	Outperforms on multilingual + coding
DeepSeek R1	Not disclosed	671B (37B active)	MIT	Matches o1 on reasoning benchmarks
Llama 3.1 405B	~$40-60M est.	405B	Llama 3.1 Community License	Comparable on most tasks

Continual Learning: Why It Matters

AI models have a catastrophic forgetting problem. Train them on new data and they forget old skills. Four research breakthroughs in 2025 made real progress here: Google's nested learning, MESU's Bayesian approach, Meta FAIR's sparse memory technique, and Neural ODE extensions each reported cutting forgetting rates by 24% or more.

For enterprise AI this means models that can update continuously from new data without full retraining. A fraud detection model that learns from new fraud patterns as they emerge. A customer support model that absorbs product updates without a full retraining cycle. That's the end state. We're not fully there yet.

The research is real. The enterprise tooling is 12-18 months behind. What you can act on now: track Meta FAIR's sparse memory approach. It's the most compatible with existing fine-tuning workflows and the most likely to ship in Hugging Face tooling first.

What to Do With This

Watch open-weight model releases. Test DeepSeek V3 and Qwen 2.5 against your specific use case before committing to API spend. The cost savings at scale can be 3-5x. Build your architecture to support model swapping. Don't hardcode to a single provider.
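
One way to avoid hardcoding a provider is to put a thin routing layer between application code and model clients. The following is a minimal sketch under stated assumptions: ModelRouter, the task name, and the two stub providers are hypothetical names invented for illustration, and the stubs stand in for real SDK or HTTP clients rather than calling any actual API.

```python
from typing import Callable, Dict

# Hypothetical interface: a provider is any callable mapping a prompt to text.
Provider = Callable[[str], str]


class ModelRouter:
    """Maps task names to providers so a model can be swapped by configuration."""

    def __init__(self) -> None:
        self._providers: Dict[str, Provider] = {}
        self._routes: Dict[str, str] = {}

    def register(self, name: str, provider: Provider) -> None:
        """Make a provider available under a short name."""
        self._providers[name] = provider

    def route(self, task: str, provider_name: str) -> None:
        """Point a task at a registered provider; fails fast on unknown names."""
        if provider_name not in self._providers:
            raise KeyError(f"unknown provider: {provider_name}")
        self._routes[task] = provider_name

    def complete(self, task: str, prompt: str) -> str:
        """Send the prompt to whichever provider the task currently routes to."""
        return self._providers[self._routes[task]](prompt)


# Stand-ins for real clients; a real version would wrap an SDK or HTTP call.
def managed_api_stub(prompt: str) -> str:
    return f"[managed-api] {prompt}"


def self_hosted_stub(prompt: str) -> str:
    return f"[self-hosted] {prompt}"


router = ModelRouter()
router.register("managed", managed_api_stub)
router.register("self_hosted", self_hosted_stub)

# Start on the managed API, then swap without touching the calling code.
router.route("support_reply", "managed")
first = router.complete("support_reply", "Where is my order?")

router.route("support_reply", "self_hosted")
second = router.complete("support_reply", "Where is my order?")

print(first)
print(second)
```

Application code only ever calls router.complete with a task name, so moving a task from one model to another is a one-line routing change. The same seam is a natural place to run side-by-side comparisons on your own prompts before committing to a provider.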

The cost floor for frontier AI dropped significantly in 2025. Your AI product budget goes further than it did in 2024. The question isn't "can we afford good AI" anymore. It's "are we choosing the right models for each task?"

Frequently Asked Questions

What is catastrophic forgetting in AI?

When you fine-tune a neural network on new tasks, gradient updates overwrite the weights that encoded previous knowledge. Fine-tune a medical model on legal documents and it forgets medicine. This is why production models get retrained from scratch rather than incrementally updated.
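
The overwrite can be seen in the smallest possible case. The sketch below is a toy illustration, not a model of any production system: a single-weight linear model is trained on one task, then fine-tuned on a conflicting task, and its error on the first task is measured before and after. The task definitions, learning rate, and epoch count are all assumptions chosen for the demo.

```python
# Toy illustration only: one weight, two tasks trained in sequence.

def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w, data, lr=0.05, epochs=200):
    """Plain gradient descent on the MSE; returns the updated weight."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


xs = [0.5, 1.0, 1.5, 2.0]
task_a = [(x, 2.0 * x) for x in xs]    # "old" knowledge: y = 2x
task_b = [(x, -3.0 * x) for x in xs]   # "new" task: y = -3x

w = train(0.0, task_a)
loss_a_before = mse(w, task_a)         # task A has been learned

w = train(w, task_b)                   # fine-tune on task B only
loss_a_after = mse(w, task_a)          # task A error after the weight moved

print(f"task A loss before fine-tuning on B: {loss_a_before:.4f}")
print(f"task A loss after fine-tuning on B:  {loss_a_after:.4f}")
```

Because the model has one weight and the two tasks want different values for it, learning task B necessarily destroys task A. Large networks have far more capacity, but the same gradient updates still move weights that earlier knowledge depended on, which is the effect continual learning methods try to limit.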

How much did DeepSeek V3 cost to train?

DeepSeek reported approximately $6 million for the final training run. OpenAI's GPT-4 is estimated at around $100 million. The efficiency came from multi-head latent attention, mixture-of-experts routing, and FP8 mixed-precision training.

Should we use DeepSeek or Qwen instead of OpenAI?

Benchmark them against your actual tasks first. For coding, math, and multilingual workflows, open-weight models are competitive today. Main considerations: where models are hosted (data privacy), provider SLAs, and whether your use case is censorship-sensitive.

What does continual learning mean for enterprise products?

Right now, updating an enterprise model means full retraining. Mature continual learning means incremental updates: add new catalog items without retraining the recommendation model, update compliance policies without rebuilding the assistant. Production tooling is 12-18 months behind the papers.

Sources

Nature Communications: Bayesian Continual Learning (MESU), 2025
Scientific Reports: Neural ODEs and Memory-Augmented Transformers, 2025
DeepSeek Technical Report: DeepSeek-V3, 2025
Alibaba Cloud: Qwen 2.5 Technical Report, 2025

