
Rule-of-thumb data scale

Data scale, examples, and when to use which

| Architecture | Rough labeled data scale | Good example use cases | When it’s a good choice vs. others |
|---|---|---|---|
| RNNs | ~10³–10⁵ sequences (small–medium). Past that, Transformers usually win if compute allows. | Short time‑series forecasting (sensor data), small chatbots, sequence tagging where sequences are not very long. | Use when sequences are short/medium, data is limited, you want simple recurrent code, and you don’t need huge context or pretraining; otherwise consider an LSTM or Transformer. |
| LSTMs | ~10⁴–10⁶ sequences; can do well on moderate data without billion‑token corpora. | Speech recognition for a specific domain, log/anomaly detection, small/medium NLP tasks where long‑term dependencies matter but you don’t have Transformer‑scale data. | Use when you need better memory than vanilla RNNs but don’t want the full complexity of Transformers, or when you have streaming/online inputs and moderate data. |
| CNNs | ~10³–10⁶ images (or similar grid data); strong even at 10³–10⁴ with good augmentation. | Image classification (medical images, defect detection), object detection/segmentation, audio spectrograms, some text tasks (char‑CNNs) where locality is key. | Use when data is spatial/local (images, 2D signals) and you want strong inductive bias and efficiency, especially on edge/mobile or when the dataset is not huge; CNNs often outperform Transformers on small/medium vision data. |
| Transformers | From ~10⁵ labeled examples (small tasks) up to 10⁸–10¹² tokens for large LLMs; they shine when you can pretrain or reuse a pretrained model. | NLP (LLMs, translation, summarization), large‑scale vision (Vision Transformers), multimodal (text + image/video). | Use when you have large data or a good pretrained checkpoint, need long‑range/global context, or want a single backbone across text/vision/multimodal. On small data without pretraining they can overfit and be overkill. |
| GNNs | Often 10³–10⁶ nodes/graphs; data is usually structurally rich but not huge in sample count. | Molecules and chemistry (nodes = atoms, edges = bonds), social/recommendation graphs, knowledge graphs, program graphs. | Use when data is naturally a graph (you care about relationships more than sequence or grid) and you need per‑node/edge/graph predictions; otherwise a CNN/RNN/Transformer is simpler. |
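To make the RNN/LSTM rows concrete, here is a minimal sketch of an LSTM sequence classifier in PyTorch; swapping `nn.LSTM` for `nn.RNN` gives the vanilla recurrent variant. The name `SequenceClassifier` and all sizes (16 input features, hidden size 64, etc.) are illustrative assumptions, not a prescribed setup.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """LSTM-based sequence classifier; use nn.RNN instead for a vanilla RNN."""
    def __init__(self, input_size=16, hidden_size=64, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):               # x: (batch, seq_len, input_size)
        _, (h_n, _) = self.lstm(x)      # h_n holds the final hidden state
        return self.head(h_n[-1])       # logits: (batch, num_classes)

# Usage: 8 sensor sequences, 50 timesteps, 16 features each.
logits = SequenceClassifier()(torch.randn(8, 50, 16))
```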
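For the CNN row, a small convolutional classifier is often enough at the 10³–10⁴ image scale with good augmentation. A minimal sketch, assuming PyTorch and 32×32 single‑channel inputs (e.g., defect patches); the channel counts and the name `SmallCNN` are illustrative.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Two conv blocks plus a linear head for 32x32 grayscale images."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x16 -> 8x8
        )
        self.head = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):                                # x: (batch, 1, 32, 32)
        return self.head(self.features(x).flatten(1))

logits = SmallCNN()(torch.randn(8, 1, 32, 32))
```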
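For the Transformer row, the table's advice usually means fine-tuning a pretrained checkpoint rather than training from scratch; still, a from‑scratch sketch shows the moving parts. This uses PyTorch's built-in encoder; the vocabulary size, depth, mean-pooling head, and the name `TinyTransformer` are illustrative, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Two-layer Transformer encoder with a mean-pooled classification head."""
    def __init__(self, vocab_size=10_000, d_model=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, tokens):                # tokens: (batch, seq_len) int ids
        h = self.encoder(self.embed(tokens))  # (batch, seq_len, d_model)
        return self.head(h.mean(dim=1))       # pool over positions, classify

logits = TinyTransformer()(torch.randint(0, 10_000, (8, 64)))
```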
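For the GNN row, the core idea is message passing: each node aggregates its neighbors' features before a learned transform. Below is a minimal GCN‑style layer over a dense adjacency matrix, assuming plain PyTorch; real projects would typically reach for a library such as PyTorch Geometric, and the normalization here is deliberately simplified.

```python
import torch
import torch.nn as nn

class SimpleGraphLayer(nn.Module):
    """One message-passing step: average neighbor features, then transform."""
    def __init__(self, in_dim=8, out_dim=16):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        a = adj + torch.eye(adj.size(0))        # add self-loops
        a = a / a.sum(dim=1, keepdim=True)      # row-normalize (simplified)
        return torch.relu(self.linear(a @ x))   # aggregate, transform, activate

x = torch.randn(5, 8)                    # 5 nodes, 8 features each
adj = (torch.rand(5, 5) > 0.5).float()   # random dense adjacency (toy graph)
h = SimpleGraphLayer()(x, adj)           # (5, 16) updated node embeddings
```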