by Michele Laurelli
A comprehensive dictionary of artificial intelligence terms and concepts
Proportion of correct predictions out of total predictions made.
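As an illustrative sketch (not part of the original dictionary), the definition above reduces to a few lines; the function name `accuracy` is my own choice:

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 3 of the 4 predictions match the true labels
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```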
A mathematical function applied to a neuron's output to introduce non-linearity into the network.
A layer that applies a non-linear activation function element-wise to its input.
Optimizer that adapts the learning rate for each parameter individually, based on the history of its gradients.
An adaptive learning rate optimization algorithm combining momentum and RMSprop.
Training technique improving model robustness by including adversarial examples.
Deep CNN that won ImageNet 2012, pioneering deep learning in computer vision.
A step-by-step procedure or formula for solving a problem or performing a task.
Field of computer science focused on creating systems capable of performing tasks requiring human intelligence.
A technique allowing models to focus on specific parts of the input when producing output.
Computed similarity between query and key vectors before softmax normalization in attention.
A scalar value indicating how much focus to place on a specific part of the input when producing output.
Area under the ROC curve, measuring binary classifier quality across all decision thresholds.
A neural network trained to reconstruct its input, learning compressed representations in the process.
A pooling operation that computes the average value from each window of the feature map.
An algorithm for training neural networks by calculating gradients of the loss function with respect to weights.
A gradient descent variant that computes gradients using the entire training dataset in each iteration.
Normalizes layer inputs using batch statistics to stabilize and accelerate training.
The number of training examples processed together in one forward/backward pass during model training.
Decoding algorithm keeping top B most likely sequences at each step.
An additional learnable parameter in neural networks that allows shifting the activation function.
Metric for machine translation quality comparing n-gram overlap between generated and reference translations.
Masking technique preventing attention to future positions in autoregressive models.
Prompting technique encouraging LLMs to show intermediate reasoning steps before answering.
A supervised learning task where the goal is to predict discrete class labels for input data.
A multimodal model trained to understand relationships between images and text.
An unsupervised learning task that groups similar data points together based on their features.
A deep learning architecture specialized for processing grid-like data such as images, using convolutional layers.
Large-scale dataset for object detection, segmentation, and captioning with 330k images.
A field of AI enabling computers to derive meaningful information from visual inputs like images and videos.
A table used to evaluate classification model performance by showing true vs predicted classes.
Self-supervised learning contrasting positive pairs against negative pairs.
A mathematical operation that slides a filter/kernel over input data to extract features.
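A minimal sketch of the sliding operation described above, in one dimension for clarity (the 2D case over images works the same way per window); `conv1d` is a hypothetical helper, not a library function:

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal (valid padding, stride 1)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# An edge-detecting kernel [-1, 1] responds where the input changes.
print(conv1d([0, 0, 1, 1, 0], [-1, 1]))  # [0, 1, 0, -1]
```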
A layer in CNNs that applies convolution operations to extract spatial features from input data.
Resampling technique evaluating model performance by splitting data into multiple train-test folds.
Techniques to artificially increase training data size by creating modified versions of existing data.
A collection of data examples used for training, validation, or testing machine learning models.
A subset of machine learning using neural networks with multiple layers to learn hierarchical representations of data.
Factorizes standard convolution into depthwise and pointwise convolutions for efficiency.
Loss function based on Dice coefficient, commonly used for segmentation tasks.
Generative models that learn to create data by reversing a gradual noising process.
Convolution with gaps between kernel elements, increasing receptive field without adding parameters.
Techniques to reduce the number of features in data while preserving important information.
A regularization technique that randomly deactivates neurons during training to prevent overfitting.
A regularization technique that stops training when validation performance stops improving.
A dense vector representation of discrete entities (words, images) in a continuous space.
An architecture where an encoder processes input into a representation and a decoder generates output from it.
One complete pass through the entire training dataset during model training.
The difference between the predicted output and the true label, indicating the model's mistakes.
A problem where gradients become extremely large during training, causing unstable updates and divergence.
The harmonic mean of precision and recall, providing a single balanced metric.
An individual measurable property or characteristic of data used as input to a model.
The process of transforming raw data into numerical features that machine learning models can process.
The output of applying a convolution filter to an input, representing detected features.
A model's ability to learn from a small number of examples, typically 1-10 examples per class.
A small matrix of learnable weights that slides over input during convolution to detect specific features.
The process of adapting a pre-trained model to a specific task by continuing training on task-specific data.
Loss function addressing class imbalance by down-weighting easy examples.
Large-scale models trained on broad data that can be adapted to a wide range of downstream tasks.
A neural network layer where every neuron is connected to every neuron in the previous and next layers.
A framework where two neural networks compete: a generator creates fake data and a discriminator tries to distinguish real from fake.
A family of large language models developed by OpenAI that use transformer architecture for text generation.
Direction and magnitude of steepest increase in loss function with respect to parameters.
An optimization algorithm that iteratively adjusts parameters to minimize a loss function by following the gradient.
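The iterative update described above can be sketched in a few lines; minimizing the toy function f(w) = (w - 3)^2 is my own example, not from the dictionary:

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to minimize a function."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # move opposite the direction of steepest increase
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```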
Normalizes by dividing channels into groups and normalizing within each group.
Simplified RNN variant with gating mechanisms, similar to LSTM but fewer parameters.
Large-scale image dataset with 14M images across 20k categories, used for ILSVRC competition.
CNN building block applying multiple filter sizes in parallel and concatenating results.
The process of using a trained model to make predictions on new data.
Data fed into a model or neural network for processing.
The first layer of a neural network that receives raw input data.
Normalizes each sample independently, commonly used in style transfer.
The target output or ground truth associated with a training example in supervised learning.
A hyperparameter controlling how much model weights are adjusted during training.
A strategy for adjusting the learning rate during training to improve convergence and performance.
A neural network with billions of parameters trained on massive text datasets to understand and generate human language.
Parameter-efficient fine-tuning adding trainable low-rank matrices to frozen weights.
A measure of how wrong the model's predictions are, used to guide training.
A function that measures the difference between predicted and actual values, guiding model optimization.
Automatic translation of text from one language to another.
Loss function measuring average absolute difference between predicted and actual values.
Pre-training task where random tokens are masked and model predicts them from context.
A 2D array of numbers arranged in rows and columns.
A pooling operation that takes the maximum value from each window of the feature map.
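A toy sketch of the windowed maximum described above, assuming a non-overlapping 2x2 window with stride 2 (the common configuration; other sizes work analogously):

```python
def max_pool_2x2(fmap):
    """Take the max over non-overlapping 2x2 windows (stride 2)."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
print(max_pool_2x2(fmap))  # [[4, 2], [2, 8]]
```

The spatial size halves in each dimension while the strongest activation in each window is retained.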
Data augmentation creating synthetic examples by mixing pairs of training samples.
Efficient CNN architecture using depthwise separable convolutions for mobile deployment.
A mathematical representation learned from data that makes predictions or decisions.
An optimization technique that accelerates gradient descent by accumulating a velocity vector in directions of persistent reduction in the loss.
Loss function measuring average squared difference between predicted and actual values.
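The squared-error average above (and its absolute-error counterpart defined earlier) can be sketched directly; the function names are my own:

```python
def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Errors are 0, 0.5, and 1: MSE = (0 + 0.25 + 1) / 3, MAE = 1.5 / 3
print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
print(mae([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
```

Squaring penalizes large errors more heavily than MAE does, which is why the two losses behave differently on outliers.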
AI systems that can process and relate information from multiple modalities like text, images, audio, and video.
NLP task identifying and classifying named entities (persons, organizations, locations) in text.
A computational model inspired by biological neural networks, consisting of interconnected nodes (neurons) that process information.
Basic computational unit in neural networks that receives inputs, applies weights and activation, produces output.
Pre-training task predicting whether sentence B follows sentence A.
Text generation sampling from smallest set of tokens whose cumulative probability exceeds threshold P.
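A sketch of the nucleus (top-p) rule above over a toy next-token distribution; the vocabulary and helper name are invented for illustration:

```python
import random

def nucleus_sample(probs, p=0.9):
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        total += prob
        if total >= p:
            break  # the nucleus is complete; drop the long tail
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights)[0]

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
# With p = 0.9 the nucleus is {"the", "a", "cat"}; "zebra" is never sampled.
```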
A computer vision task that identifies and localizes objects within an image using bounding boxes.
The process of adjusting model parameters to minimize the loss function and improve performance.
The result produced by a model after processing input data.
The final layer of a neural network that produces predictions or outputs.
When a model learns training data too well, including noise, resulting in poor generalization to new data.
Adding extra pixels around the border of input data to control output size in convolution operations.
Learnable values in a model that are optimized during training (weights and biases).
The simplest neural network: a single-layer binary classifier introduced by Frank Rosenblatt in the 1950s.
Measurement of how well a probability model predicts a sample, for evaluating language models.
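As a hedged sketch of the definition above: perplexity is the exponential of the average negative log-probability the model assigned to the observed tokens (my own minimal formulation):

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4,
# as if it were choosing uniformly among 4 options.
print(perplexity([0.25, 0.25, 0.25]))  # 4.0
```

Lower perplexity means the model found the text less "surprising".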
A downsampling layer in CNNs that reduces spatial dimensions while retaining important features.
A technique to inject position information into transformer inputs since transformers lack inherent sequence order.
Metrics for classification: precision is true positives / predicted positives; recall is true positives / actual positives.
The output produced by a trained model when given new input data.
The practice of designing effective text prompts to guide large language models toward desired outputs.
Removing unnecessary weights/neurons to reduce model size and computational cost.
Reducing precision of weights/activations to lower memory and computation.
Three vectors used in attention mechanisms to compute weighted combinations of input elements.
NLP task where model extracts or generates answers to questions based on context.
A technique that enhances LLM outputs by retrieving relevant information from external knowledge bases.
The region of input space that affects a particular neuron's activation in a neural network.
A supervised learning task where the goal is to predict continuous numerical values.
Techniques to prevent overfitting by adding constraints or penalties to the model during training.
A machine learning paradigm where agents learn by interacting with an environment and receiving rewards or penalties.
An activation function that outputs the input if positive, otherwise zero: f(x) = max(0, x).
A CNN architecture that uses residual connections (skip connections) to enable training of very deep networks.
Adaptive learning rate optimization algorithm using moving average of squared gradients.
A neural network architecture designed for sequential data, with connections that loop back to previous states.
A single numerical value, a zero-dimensional tensor.
An attention mechanism used in deep learning models that allows a neural network to weigh the importance of different parts of an input relative to each other.
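A minimal sketch of scaled dot-product attention, the core computation behind the mechanism above (one head, no learned projections; all names are my own):

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Each query scores every key; the scores weight a sum over the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # how much each position attends to the others
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Two identical keys get equal weight, so the output averages the two values.
out = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```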
Learning paradigm where models create supervision signal from unlabeled data.
A computer vision task that assigns a class label to every pixel in an image.
NLP task determining emotional tone or opinion expressed in text.
A gradient descent variant that updates weights using gradients from a single random training example at a time.
An activation function that maps inputs to values between 0 and 1: f(x) = 1/(1 + e^(-x)).
An activation function that converts a vector of values into a probability distribution summing to 1.
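The sigmoid and softmax definitions above translate directly into code (a sketch; the max-subtraction in softmax is a standard numerical-stability trick, not part of the definition):

```python
import math

def sigmoid(x):
    """Squash a single value into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

def softmax(xs):
    """Turn a vector of scores into a probability distribution summing to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0))                      # 0.5
print(sum(softmax([2.0, 1.0, 0.1])))   # 1.0
```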
Reading comprehension dataset with 100k+ questions on Wikipedia articles.
The number of pixels by which a filter moves across the input during convolution or pooling operations.
A machine learning paradigm where models learn from labeled training data with input-output pairs.
Parameter controlling randomness in text generation by scaling logits before softmax.
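A sketch of the logit scaling described above (my own helper; the softmax step is folded in so the effect on the distribution is visible):

```python
import math

def apply_temperature(logits, temperature):
    """Scale logits by 1/T before softmax: T < 1 sharpens, T > 1 flattens."""
    scaled = [l / temperature for l in logits]
    exps = [math.exp(s - max(scaled)) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = apply_temperature(logits, 0.5)  # concentrates mass on the top logit
hot = apply_temperature(logits, 5.0)   # moves the distribution toward uniform
```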
Data held out for final evaluation of a trained model, never seen during training.
NLP task condensing long text while preserving key information.
The process of breaking text into smaller units (tokens) like words, subwords, or characters for processing.
Text generation technique sampling from only the K most likely next tokens.
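A sketch of the top-k rule above over the same kind of toy next-token distribution used for nucleus sampling; vocabulary and helper name are invented:

```python
import random

def top_k_sample(probs, k=2):
    """Sample only from the k most likely tokens, with their probabilities renormalized."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*top)
    return random.choices(tokens, weights=weights)[0]

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
# With k = 2, only "the" and "a" can ever be sampled.
```

Unlike nucleus sampling, the candidate set here has a fixed size regardless of how the probability mass is spread.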
The subset of data used to train a machine learning model.
A technique where knowledge learned from one task is applied to a different but related task, reducing training time and data requirements.
A neural network architecture based entirely on attention mechanisms, without recurrent or convolutional layers.
Loss function that learns embeddings by minimizing the anchor-positive distance while maximizing the anchor-negative distance.
A CNN architecture designed for biomedical image segmentation, featuring an encoder-decoder structure with skip connections.
A machine learning paradigm where models find patterns in unlabeled data without explicit supervision.
A generative model that learns a probabilistic latent space representation of data.
A portion of data held out during training to tune hyperparameters and prevent overfitting.
A problem in deep networks where gradients become extremely small, preventing effective learning in early layers.
An ordered array of numbers representing a point in multi-dimensional space.
Deep CNN architecture using small 3x3 filters throughout, emphasizing depth.
Transformer architecture adapted for computer vision by treating image patches as tokens.
Learnable parameters in neural networks that determine the strength of connections between neurons.
Methods for setting initial values of neural network weights before training begins.
Dense vector representations of words that capture semantic and syntactic relationships.