Table 3 Select Large Language Models (LLMs)
From: Can AI help with the hardest thing: pro health behavior change
| Product Acronym | Released | Company | Transformer Capabilities | Applications | LLM Training Parameters |
|---|---|---|---|---|---|
| BERT | 2017 | Google | Bidirectional encoder representations from transformers | Encoder reads text, learning contextual relationships for next-sentence prediction | 110–340 M |
| GPT-3 | 2020 | OpenAI | Generative pre-trained transformer 3 | Decoder with a 2,048-token context of word fragments for predicting the next token | 175 B |
| LaMDA | 2020 | Google | Language model for dialogue applications | Transformer-based neural model trained on 2.8 T tokens predicts the next token in a given context | 137 B |
| DALL-E | 2021 | OpenAI | Transformer language model (focused version of GPT-3) | Decoder model generates images from 1,280-token-long streams of text caption and image tokens | 12 B |
| PaLM | 2021 | Google | Pathways language model (6,144-chip TPU accelerator cluster) | Autoregressive decoder; parallel task computation for logical inference, explaining jokes | 540 B |
| ChatGPT | 2022 | OpenAI | General-purpose chatbot version of GPT-3.5 | Trained on 300 B tokens to emulate human text writing; intuitive user interface across many topics | 175 B |
| Galactica (GAL) | 2022 | Meta AI | Open-source LLM for scientific knowledge (5 model sizes) | Tokenizes scientific information from a curated corpus to write papers, solve equations, etc. | 250 M–120 B |
| GPT-4.0 | 2023 | OpenAI | Multimodal; 60% less likely than ChatGPT to hallucinate | Accepts prompts composed of both images and text, returning textual responses | 175 B |
| Llama 3.3 | 2024 | Meta AI | Multimodal, open source on the cloud (business workflows) | Interprets charts, maps, and text in images; multilingual understanding (customer service, marketing) | 70 B |
| Qwen-QwQ-32B | 2025 | Alibaba | Mixture-of-experts (MoE) model; 32K-token context window | Enterprise applications; mathematical reasoning and coding; efficient (less computing) | 32 B |
| GPT-4.5 | 2025 | OpenAI | Advanced unsupervised learning; hierarchical token processing | High “EQ” for creative insights; follows user intent in problem solving, writing, etc. | 6 T (?) |
| DeepSeek-R1 | 2025 | DeepSeek | MoE sub-models activated by chain-of-thought inputs | Understands long-form content; rapidly performs complex math, finance, and coding tasks | 671 B |
| Definitions | |
|---|---|
| Transformer | Attention mechanism that learns contextual relationships between words (and sub-words) in text; generalizes across domains and tasks |
| Encoder | Reads an entire sequence of words at once; the input sequence of tokens is embedded into vectors to be processed by the neural network |
| Decoder | Receives the encoder output and the decoder output from the prior timestep; replaces text with tokens (masking) to train the predictive model for the task |
| Bidirectional | Text-reading models that learn the context of a word based on all of its surroundings (not by right-to-left or left-to-right directional reading) |
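The definitions above can be made concrete with a small numerical sketch. The Python/NumPy code below is illustrative only, not taken from the source article or from any listed model's implementation; the function name, toy dimensions, and random embeddings are assumptions chosen for clarity. It shows scaled dot-product attention, with the encoder/bidirectional case attending over the whole sequence and the decoder case masking out future tokens so that each position can use only the words to its left.

```python
# Illustrative sketch only (not any listed model's code): scaled dot-product
# attention, contrasting bidirectional (encoder) and causal (decoder) masking.
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # token-to-token relevance scores
    if causal:
        # Decoder-style: mask positions to the right, so each token attends
        # only to itself and earlier tokens (next-token prediction setup).
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

# Toy "sentence" of 4 token embeddings with dimension d_k = 8 (random values).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# Bidirectional / encoder reading: every token attends to the whole sequence.
_, w_bidirectional = scaled_dot_product_attention(x, x, x, causal=False)

# Causal / decoder reading: the upper triangle of the weight matrix is zero.
_, w_causal = scaled_dot_product_attention(x, x, x, causal=True)

print(np.round(w_bidirectional, 2))
print(np.round(w_causal, 2))
```

The zeroed upper triangle in the causal weight matrix is what the "predicts the next token" entries in the table rely on: during training, each position is asked to predict the token immediately to its right without being allowed to see it.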
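Similarly, the two prediction styles that recur in the table, masked bidirectional prediction (BERT-style encoders) and autoregressive next-token generation (GPT-style decoders), can be tried directly. The sketch below assumes the open-source Hugging Face `transformers` library and the small public checkpoints `bert-base-uncased` and `gpt2`; these checkpoints and the example prompts are stand-ins chosen for illustration, not the commercial models listed above.

```python
# Minimal sketch (assumes `pip install transformers torch`): the two prediction
# styles from the table, using small public checkpoints as stand-ins.
from transformers import pipeline

# BERT-style bidirectional encoder: fill in a masked token using context from
# both sides of the blank.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("Regular exercise can [MASK] the risk of heart disease.")[:3]:
    print(candidate["token_str"], round(candidate["score"], 3))

# GPT-style autoregressive decoder: repeatedly predict the next token,
# conditioning only on the tokens already in the context window.
generate = pipeline("text-generation", model="gpt2")
print(generate("Three small habits that improve sleep are", max_new_tokens=30)[0]["generated_text"])
```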