Rule of thumb: data scale
Tags: rule-of-thumb, datascale, architectures
Data scale, example use cases, and when to use which architecture; rough code sketches follow the table.
| Architecture | Rough labeled data scale | Good example use cases | When it’s a good choice vs others |
|---|---|---|---|
| RNNs | ~10³–10⁵ sequences (small–medium). Past that, Transformers usually win if compute allows. | Short time‑series forecasting (sensor data), small chatbots, sequence tagging where sequences are not very long. | Use when sequences are short/medium, data is limited, you want simple recurrent code, and you don’t need huge context or pretraining; otherwise consider LSTM/Transformer. |
| LSTMs | ~10⁴–10⁶ sequences; can do well on moderate data without billion‑token corpora. | Speech recognition for a specific domain, log/anomaly detection, small/medium NLP tasks where long‑term dependencies matter but you don’t have Transformer‑scale data. | Use when you need better memory than vanilla RNNs but don’t want the full complexity of Transformers, or when you have streaming/online inputs and moderate data. |
| CNNs | ~10³–10⁶ images (or similar grid data); strong even at 10³–10⁴ with good augmentation. | Image classification (medical images, defect detection), object detection/segmentation, audio spectrograms, some text tasks (char‑CNNs) where locality is key. | Use when data is spatial/local (images, 2D signals), you want strong inductive bias + efficiency, especially on edge/mobile or when dataset is not huge; they often outperform Transformers on small/medium vision data. |
| Transformers | From ~10⁵ labeled examples (small tasks) up to 10⁸–10¹² tokens for large LLMs; they shine when you can pretrain or reuse a pretrained model. | NLP (LLMs, translation, summarization), large‑scale vision (Vision Transformers), multimodal (text+image/video). | Use when you have large data or a good pretrained checkpoint, need long‑range/global context, or want a single backbone across text/vision/multimodal. On small data without pretraining they can overfit and be overkill. |
| GNNs | Often 10³–10⁶ nodes/graphs; data is usually structurally rich but not huge in samples. | Molecules and chemistry (nodes=atoms, edges=bonds), social/recommendation graphs, knowledge graphs, program graphs. | Use when data is naturally a graph (you care about relationships more than sequence or grid), and you need per‑node/edge/graph predictions; otherwise CNN/RNN/Transformer is simpler. |
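As a first code sketch, the table can be condensed into a small decision helper. This is a rough, non-authoritative encoding of the rules of thumb above: the function name, the `DataKind` enum, and the exact thresholds are illustrative assumptions, not a standard API.

```python
# Rough encoding of the "which architecture" rules of thumb above.
# Names and thresholds are illustrative assumptions, not a library API.
from enum import Enum, auto


class DataKind(Enum):
    SEQUENCE = auto()   # time series, text, speech, logs
    GRID = auto()       # images, spectrograms, other grid-like signals
    GRAPH = auto()      # molecules, social/knowledge graphs


def suggest_architecture(kind: DataKind,
                         n_labeled: int,
                         pretrained_available: bool = False) -> str:
    """Return a rough architecture suggestion from data type and labeled-data scale."""
    if kind is DataKind.GRAPH:
        # Data is naturally relational: per-node/edge/graph predictions.
        return "GNN"
    if kind is DataKind.GRID:
        # CNNs' locality bias is strong on small/medium vision data;
        # Transformers (ViT) pay off with large data or a pretrained backbone.
        if pretrained_available or n_labeled >= 10**6:
            return "Transformer (e.g. ViT) or CNN"
        return "CNN (+ augmentation)"
    # Sequences: scale and pretraining decide between RNN, LSTM, and Transformer.
    if pretrained_available or n_labeled >= 10**5:
        return "Transformer"
    if n_labeled >= 10**4:
        return "LSTM"
    return "RNN (or LSTM)"


if __name__ == "__main__":
    print(suggest_architecture(DataKind.GRID, n_labeled=5_000))          # CNN (+ augmentation)
    print(suggest_architecture(DataKind.SEQUENCE, n_labeled=2_000_000))  # Transformer
    print(suggest_architecture(DataKind.SEQUENCE, n_labeled=3_000,
                               pretrained_available=True))               # Transformer
```

The helper only captures the coarse heuristics in the table; in practice the choice also depends on sequence length, latency/compute budget, and whether a good pretrained checkpoint exists for your domain.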
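The Transformers row hinges on reusing a pretrained checkpoint: with one, a few thousand labeled examples can be enough for a downstream task. A minimal sketch, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (both assumptions, not something prescribed by the table):

```python
# Minimal sketch: reuse a pretrained Transformer for text classification,
# so that ~10^3-10^4 labeled examples can suffice for fine-tuning.
# Assumes the `transformers` library and the `bert-base-uncased` checkpoint.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

batch = tokenizer(
    ["the part passed inspection", "casing cracked on arrival"],
    padding=True, truncation=True, return_tensors="pt",
)
outputs = model(**batch)      # one pair of class logits per example
print(outputs.logits.shape)   # torch.Size([2, 2])
```

Without such a checkpoint, training a Transformer from scratch on small data tends to overfit, which is when the CNN/LSTM rows become the better default.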