AI Blog

by Michele Laurelli

Missing but Necessary Software Infrastructure for AI on Apple Silicon: the Triton Language

apple · mlx · triton

"Apple computers are now widely used in software and AI development, but they lack certain tools that would help standardize work and facilitate migration to platforms better suited for training."


The advent of Apple Silicon SoCs (M1, M2, etc.) has given developers powerful, unified hardware resources for artificial intelligence, but it has also highlighted a significant gap in the software ecosystem. In particular, there is no flexible, low-level GPU programming infrastructure comparable to what exists on NVIDIA platforms. The research community laments that the Mac has no equivalent of CUDA, nor of tools like Numba/Triton for writing custom kernels directly in Python – after all, “Metal is not a popular framework among researchers, nor are Swift/Xcode. If [Apple] had something like Numba's JIT CUDA, we could do more research on Mac [...] and say that having a MacBook Pro does not mean giving up running research models on the laptop” (github.com/ml-explore). In this context, OpenAI's Triton language emerges as the emblematic example of absent yet necessary infrastructure: a domain-specific language (DSL) capable of bridging the gap between Apple hardware and the optimization needs of AI. In this technical article, we analyze what Triton is, why it matters in the machine learning landscape, and why its absence from the Apple Silicon ecosystem is a problem worth solving, examining documentation, technical challenges, ongoing attempts, and future prospects.

The Apple Silicon ecosystem for AI: hardware potential and software limits

Apple Silicon introduces a heterogeneous architecture with high-performance/efficiency CPUs, integrated GPUs, and even a dedicated Neural Engine, all with unified memory (UMA) shared between CPUs and accelerators. This design offers significant advantages for AI: wide memory bandwidth and reduced latency in data access between CPU and GPU, eliminating the need for explicit host-device copies (Apple Metal vs NVIDIA CUDA). Apple has built a proprietary software ecosystem around this hardware, with APIs like Metal for GPU programming and high-level frameworks like Core ML and Accelerate/BNNS to transparently utilize the GPU and the Neural Engine. Additionally, Apple has released the new open-source framework MLX (Machine Learning eXperience), a highly optimized NumPy-like library for Apple Silicon (Machine Learning Apple), aimed at making local execution of deep learning models on the Mac more seamless.

However, despite these solid hardware foundations and the libraries provided, the Apple ecosystem lacks a crucial element: the ability for researchers and developers to fully exploit the Apple GPU by writing custom, optimized kernels.

CUDA is not available on macOS (NVIDIA GPUs have not been supported on Mac since 2018), and Apple has also deprecated OpenCL/OpenGL in favor of Metal. 

Deep learning frameworks like PyTorch and TensorFlow have introduced support for the Metal (MPS) backend to utilize the Apple GPU in typical tensor operations; for example, PyTorch since version 1.12 allows training on M1 GPUs via the MPS API. 
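
For concreteness, this is how a PyTorch user targets the Apple GPU today through the MPS backend – predefined operators only, no custom kernels. A minimal sketch using PyTorch's public API (shapes are arbitrary):

```python
import torch

# Select the Apple GPU via the Metal Performance Shaders (MPS) backend
# when available, falling back to the CPU otherwise.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randn(1024, 1024, device=device)
w = torch.randn(1024, 1024, device=device)

# Predefined operators (matmul, softmax, ...) dispatch to Apple-provided
# Metal kernels; there is no way to define a new operator at this level.
y = torch.softmax(x @ w, dim=-1)
print(y.device)  # mps:0 on an Apple Silicon Mac
```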

This has been an important step, but it remains confined to the use of predefined operators. If a developer wanted to implement a new operator or a custom variant of a neural layer on Mac, they currently lack tools equivalent to those available on NVIDIA GPUs. 

In summary, the Apple Silicon environment offers hardware power but does not (yet) provide an open, flexible way to program that hardware at the lowest levels. This contrasts with the NVIDIA ecosystem, where a rich software infrastructure for AI exists: the CUDA language and a myriad of libraries, as well as DSLs like Triton, which let developers write custom GPU kernels with near-peak performance without mastering every hardware detail.

To contextualize the difference, let's consider the underlying GPU architectures.

Figure - Simplified architecture of an NVIDIA GPU (CUDA): multiple SMs (Streaming Multiprocessors) with ALUs (green squares) and local memories (cache, shared memory) are connected via Load/Store units to a global memory of the device separate from system memory.
As shown above, in NVIDIA GPUs each SM is the fundamental compute unit, with its own cores and caches, and all SMs access a global VRAM (distinct from CPU RAM) over a high-bandwidth interconnect. Threads are organized into blocks, each scheduled onto an SM, and the threads of a block share a fast local shared memory on that SM, while global memory has higher latency. This architecture requires explicit management of data transfers between host (CPU) and device (GPU) and careful data locality to maximize the use of the fast on-chip memories.

Figure - Simplified architecture of an Apple GPU (Metal): multiple graphics cores equivalent to SMs (each with ALUs, “Control,” and local memory) are connected to a single Unified System Memory, that is, the unified memory shared with the CPU.
In the case of Apple, each GPU core (a compute unit analogous to an SM) likewise has vector ALUs and caches, with threads organized into threadgroups (the equivalent of CUDA blocks). There is also a local memory shared within a threadgroup (called threadgroup memory, conceptually similar to NVIDIA's shared memory). The key difference is that there is no separate VRAM: GPU and CPU access the same unified physical memory. This simplifies the host-side programming model (no explicit cudaMemcpy needed), but it means memory coherence and contention must be managed carefully to avoid bottlenecks when CPU and GPU access data concurrently. Apple, by tightly controlling hardware and software, has managed to exploit UMA effectively in many use cases. From a GPU developer's perspective, however, Metal is lower-level than CUDA: launching a kernel requires manually setting up pipelines, buffers, command encoders, and so on, whereas CUDA offers more automated constructs (e.g., the <<< >>> launch syntax). Apple favors fine-grained control over resources at the cost of somewhat more complexity for the developer.

In summary, Apple Silicon hardware has all the ingredients needed to compete with traditional GPUs (parallel cores, a hierarchical memory system with caches and local memory, SIMT execution in 32-wide SIMD-groups, Apple's equivalent of warps). The Achilles' heel lies in the software available to external developers: those working on macOS have no GPU development kit comparable to CUDA. Apple promotes Metal Performance Shaders (MPS) and now MLX for out-of-the-box performance, but these tools do not let users write their own kernels at a lower level. The practical consequence is that many developers and researchers, in order to prototype custom optimizations of deep learning models, still have to rely on Linux systems with NVIDIA GPUs, even if they own a powerful Mac. This inefficiency is what makes a language like Triton “absent but necessary” for AI on Apple Silicon.

The Triton language: high-performance GPU programming in Python

Triton is an open-source language and compiler initially developed by researcher Philippe Tillet and later extended by OpenAI (which released version 1.0 in 2021) to make high-performance GPU programming more accessible[13]. It is a DSL embedded in Python: the developer writes GPU kernels as decorated Python functions (@triton.jit), using the NumPy-like APIs provided by triton.language (for example, vectorized operations on tensors), and the system just-in-time compiles this code into a highly optimized GPU kernel. Triton's goal is to let developers who are not CUDA experts approach the maximum performance of the hardware, automating many of the manual optimizations typical of CUDA development. A frequently cited result is that in ~25 lines of Triton one can write an FP16 matrix-multiplication kernel as efficient as NVIDIA's hand-tuned cuBLAS implementation. In some cases, OpenAI researchers have used Triton to write specialized kernels up to 2× more efficient than the equivalent PyTorch operators. This highlights the value of a flexible tool: new algorithms and ideas can be turned into optimized GPU implementations quickly, without waiting for them to be added to official libraries.

From the perspective of the programming model, Triton adopts an SPMD (Single Program, Multiple Data) model similar to CUDA thread blocks. Each kernel launch spawns many parallel instances (program instances) that execute the same code on different data, much like threads organized in a grid. Within each instance, Triton expresses computation as vector operations on small, fixed-size blocks, which the compiler maps onto groups of threads executing in lockstep. The significant difference from CUDA is that Triton automates a whole series of optimizations and resource-management decisions, leaving the programmer only the higher-level choices (how to tile the algorithm, block sizes, etc.). In a simplified comparison: in CUDA the developer must manually handle aspects like memory coalescing, use of shared memory, instruction scheduling, and synchronization; Triton, by contrast, automatically handles DRAM access coalescing, cache/shared-memory management, and intra-SM (intra-core) scheduling, freeing the programmer from these low-level details. The programmer must still decide how to divide the work among SMs (grid size, etc.), which preserves algorithmic flexibility. In other words, Triton offers a middle ground between high-level frameworks and bare-metal CUDA code: it allows specialized kernels to be written in Python as if they were vector operations on batches of elements, while behind the scenes the compiler translates everything into efficient GPU code.
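
To make the model concrete, here is a vector-addition kernel in the style of the official Triton tutorials – a minimal sketch that runs only where Triton has a backend (i.e., on NVIDIA GPUs, which is precisely the point of this article):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one BLOCK_SIZE-wide tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail of the vector
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # The programmer chooses only the grid (how work splits across SMs);
    # coalescing, shared memory and intra-SM scheduling are automatic.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(98432, device="cuda")
y = torch.rand(98432, device="cuda")
print(torch.allclose(add(x, y), x + y))  # True
```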

On the implementation side, Triton rests on solid compilation foundations. The current version uses LLVM and MLIR (Multi-Level Intermediate Representation) as its backend: the Python code is first converted into a high-level Triton-IR, to which classical optimizations (common subexpression elimination, constant propagation, loop unrolling, etc.) and GPU-specific ones (prefetching, matrix tiling, access coalescing) are applied. The optimized IR is then lowered to LLVM IR and from there translated into the target GPU's ISA. Until recently, Triton officially supported only NVIDIA GPUs: the backend emits PTX (CUDA's intermediate assembly), which the NVIDIA driver then compiles just-in-time into executable machine code (CUBIN). This pipeline leverages the JIT compiler embedded in NVIDIA's drivers, ensuring that the final code is highly optimized for the specific GPU. Essentially, Triton “speaks” natively to NVIDIA GPUs in their own language (PTX), but lets that language be generated from far more concise Python code.

One of the reasons for Triton's success is its integration with deep learning frameworks and its extensibility. Triton was initially used as a standalone library (importable into any Python project to launch kernels on tensors, e.g., alongside PyTorch or CuPy). Today its importance has grown with the advent of in-framework compilers: PyTorch 2.0 adopts Triton in its TorchInductor backend to generate high-performance kernel fusions. In practice, when PyTorch 2 automatically generates an optimized kernel that combines multiple operations, it often uses Triton internally to produce the corresponding GPU code, in various cases outperforming the standard operators. This architectural choice shows that Triton is now considered a staple of the GPU ecosystem: a reusable component for accelerating GPU workloads without hand-writing kernels in CUDA C++.
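
On a CUDA machine this interplay is easy to observe: torch.compile routes a function through TorchInductor, which emits Triton kernels for the fused operations. A hedged sketch (the function is an arbitrary example; the TORCH_LOGS flag is PyTorch's documented way to inspect generated code):

```python
import torch

def fused_op(x, y):
    # Two elementwise operations that Inductor can fuse into a single kernel.
    return torch.nn.functional.gelu(x) * y

compiled = torch.compile(fused_op)  # TorchInductor is the default backend

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = compiled(x, y)

# Running the script as: TORCH_LOGS="output_code" python script.py
# prints the Triton source that Inductor generated for this fusion.
```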

It is important to emphasize that Triton is evolving towards multi-platform support. Although born in a CUDA-centric environment, thanks to MLIR abstraction, there are now work-in-progress efforts for other backends. For example, AMD has worked to integrate it into its ROCm stack: recent news indicates that ROCm 7.0 includes support for Triton 3.3.0. In other words, the Triton compiler can now generate code for AMD GPUs, leveraging the LLVM/ROCm infrastructure (presumably through MLIR dialects specific to GCN or via SPIR-V). This is a fundamental step: if Triton becomes vendor-agnostic, potentially a Triton kernel written in Python could be compiled on various GPU architectures (NVIDIA, AMD… and perhaps others in the future) with minimal changes or entirely automatically. And this is where the discussion about Apple becomes central: will Apple Silicon GPUs ever be a supported target by Triton? What challenges need to be overcome to get there?

Absence of Triton on Apple Silicon: problems and limitations

As of now, Triton does not support Apple GPUs. Developers recognize the situation plainly: “at the moment you cannot run Triton on Apple GPU (Metal) because Triton does not have a Metal/Apple GPU backend”. This limitation has both technical and historical roots. As highlighted, Triton was designed around CUDA and NVIDIA GPUs; its backend was heavily tied to the NVIDIA ecosystem (PTX, architectural specifics like 32-thread warps, Tensor Cores, etc.). Apple Silicon breaks this paradigm: the integrated GPUs of Macs use a proprietary architecture (AGX) with a graphics/compute API (Metal) entirely different from CUDA. There is no detailed public documentation of the Apple GPU ISA, nor an open equivalent of PTX for Metal – the only official way to program the GPU is through the Metal Shading Language (MSL) compiler provided by Apple. This means that to support Apple, Triton would need a backend that generates Metal shaders, or that emits some compatible IR (for example SPIR-V, which could potentially be translated to Metal via MoltenVK – but Apple does not natively support Vulkan/SPIR-V, and MoltenVK exists only as a translation layer for Vulkan applications). In short, this is a non-trivial backend problem: in the words of a developer who attempted to create an Apple backend, Triton is “baked-in” to NVIDIA, its new architecture resting heavily on NVIDIA-centric assumptions.

The practical consequences of this absence are felt in the day-to-day experience of ML developers on Mac. A glaring example is PyTorch 2.0: as mentioned, its compilation engine TorchInductor leverages Triton to generate efficient kernels. On Apple MPS, however, these optimizations are not available. Users trying torch.compile on MPS devices found that the Inductor backend did not work and had to fall back on less efficient modes. A PyTorch engineer explained in 2023: “Inductor support for MPS essentially depends on Triton's support for MPS. Inductor generates Triton kernels that then run on GPU devices; currently, Triton is focused on NVIDIA GPUs, so…”. In other words, as long as Triton does not support Apple GPUs, PyTorch cannot easily optimize those workloads. This has pushed the PyTorch team toward alternative routes: in 2025, experimental support appeared in TorchInductor for generating native Metal kernels (without going through Triton). While promising – torch.compile can now produce kernels directly in MSL – it remains an ad hoc solution confined to PyTorch. It does not give end users the ability to write their own kernels at will; the framework merely performs some optimized fusions on its own. The absence of “pure” Triton on Apple Silicon thus leaves a void: researchers cannot implement new GPU algorithm ideas on a Mac without radically changing ecosystems (rewriting in C++/Metal with all the associated burdens, or switching to alternatives like JAX+TPU, etc.). In practice, those using a Mac for AI development often end up prototyping on the CPU or on data subsets, losing the advantage of GPU parallelism, or must run intensive workloads on remote servers with NVIDIA GPUs.
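
Until Inductor-on-MPS matures, Mac users typically fall back to one of torch.compile's simpler backends, which skip kernel generation entirely. A hedged sketch (the backend names are part of PyTorch's public API, but what actually compiles on MPS varies by version):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
).to("mps")

# "aot_eager" traces and partitions the graph but emits no GPU kernels,
# so it runs on MPS while full "inductor" support is still maturing.
compiled = torch.compile(model, backend="aot_eager")
out = compiled(torch.randn(8, 512, device="mps"))
```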

It is worth delving into why bringing Triton to Apple is not trivial at all, highlighting the main technical challenges:

  • Missing Metal backend: as mentioned, Triton has no code generator for Metal. The compiler currently emits PTX and invokes the NVIDIA JIT; an Apple backend would instead need to generate Metal Shading Language code on the fly and hand it to Apple's runtime compiler (via the Metal framework). This requires developing an entirely new codegen within Triton, capable of translating Triton-IR into syntactically correct, well-optimized MSL kernels for the Apple architecture. The Triton team has not done this so far, partly because Apple GPUs were not a roadmap priority – literally, “Apple's proprietary GPUs are not yet on the roadmap”. Until recently, Triton's user base revolved around Linux/NVIDIA; requests for Apple support were few and came mainly from the Mac community.

  • Integration with proprietary drivers: the NVIDIA ecosystem offers established tools for JIT compilation (CUDA drivers reliably compile PTX). On Apple, one would have to lean on the Metal API: Apple allows compiling MSL strings at runtime via MTLDevice.newLibraryWithSource(...) (see the sketch after this list). This is good news – it means a Metal JIT is possible – but it is less explored territory in the ML world. Moreover, the Metal compiler operates at a higher level (a shading language) and may not expose all the fine-tuning levers that a dedicated backend would want to control. Essentially, it would mean relying on Apple's compiler for machine-code generation: Triton would have to “trust” Metal for the low-level details and focus on generating efficient MSL. This introduces performance uncertainties: the Triton team knows how to optimize for NVIDIA from long experience, whereas on Apple it would have to build up knowledge of microarchitectural characteristics (for example, how best to exploit threadgroup memory, what the effective warp size is and its implications, etc.).

  • Architectural and memory-model differences: although concepts like 32-thread warps and shared memory also exist on Apple, unified memory changes some assumptions. On NVIDIA, Triton assumes separate “slow” global memory and separate host memory – which entails explicit distinctions between device and host pointers. In the Apple environment, a single pointer can refer to unified memory valid for both CPU and GPU; the boundary is blurrier. Moreover, Apple has no dedicated “constant” memory or separate texture caches: constant buffers in Metal also reside in unified memory (albeit with optimized access paths). An Apple backend would therefore have to handle differently (or skip) certain optimizations designed for the NVIDIA hierarchy, while also exploiting Apple-specific features such as coherent CPU-GPU access to data (e.g., in some cases it might be useful for the CPU to prepare data directly in structures aligned for the GPU kernel in shared memory). These differences require careful study and adaptation of the optimization passes.

  • Vendor support and documentation: AMD has managed to get Triton working on its GPUs largely because AMD itself (or an affiliated community) invested in development, providing specifications and integrating it into ROCm. So far, there have been no similar initiatives for Apple. Apple tends to prefer closed solutions (its focus is MLX/CoreML), and may not have a direct interest in investing in Triton (which would effectively promote open-source ML on Mac outside its control). An interesting signal, however, is that Apple has recently sponsored efforts to make MLX portable to CUDA. In other words, Apple seems to want to attract developers to use MLX on Mac and then allow code to be exported to NVIDIA GPUs for production. If we reverse the perspective, analogous support in reverse (running “CUDA-like” code on Apple GPUs) could fall within Apple's interest in expanding Mac adoption in AI. However, as it stands, anyone wanting to work on a Triton backend for Apple would be doing so in the dark, without public documentation on the GPU ISA (unless through reverse engineering like that of some low-level graphics blogs) and without official Apple support. This significantly raises the barrier to entry.
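
To illustrate that a Metal JIT path exists (the point raised in the second item above), here is a hedged sketch of runtime MSL compilation from Python via the PyObjC bindings (package pyobjc-framework-Metal). The kernel source and names are illustrative, and the exact binding signatures may differ across PyObjC versions:

```python
import Metal  # pip install pyobjc-framework-Metal

MSL_SOURCE = """
#include <metal_stdlib>
using namespace metal;
kernel void scale(device float *buf [[buffer(0)]],
                  uint tid [[thread_position_in_grid]]) {
    buf[tid] *= 2.0f;
}
"""

device = Metal.MTLCreateSystemDefaultDevice()

# Runtime compilation of MSL source -- roughly what a hypothetical Triton
# Metal backend would invoke after lowering Triton-IR to MSL. PyObjC maps
# -newLibraryWithSource:options:error: to the call below and returns a
# (library, error) tuple.
library, error = device.newLibraryWithSource_options_error_(MSL_SOURCE, None, None)
if library is None:
    raise RuntimeError(f"Metal JIT compilation failed: {error}")

kernel_fn = library.newFunctionWithName_("scale")
print("JIT-compiled Metal kernel:", kernel_fn.name())
```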

Ultimately, the absence of Triton on Apple Silicon is both a matter of development priorities (no one has yet implemented the necessary backend) and a matter of closed ecosystem. Until Apple opens its GPGPU stack more, or until a sufficient number of dissatisfied developers join forces to achieve a port, Apple GPUs will remain a partially isolated island in the sea of accelerated computing platforms.

Attempts and alternative approaches to bridge the gap

Despite the difficulties, there have been some attempts and parallel developments aimed at mitigating the absence of Triton on Apple or providing similar tools:

  • Experimental Triton builds on macOS: some community members have tried compiling Triton on ARM Macs to explore the possibilities. For example, a user reported successfully building the Triton wheel on a MacBook M2 after fixing minor build bugs (arm64 vs aarch64 target differences). The build succeeds with minimal patches and passes the CPU-side unit tests, but naturally cannot run the GPU tests since no NVIDIA GPU is present. As the user noted, “getting a working build does not unlock all the supported capabilities of the language: Apple silicon GPUs are not supported at the moment. But I wanted at least a native Triton to explore the hardware-agnostic aspects and learn”. Such builds therefore run only in CPU emulation: Triton does include a CPU interpreter (mainly for debugging and testing purposes; see the first sketch after this list) but with no acceleration. They are useful mainly for studying Triton IR or for using Triton on a Mac as an “offline” PTX generator (without executing the result) – worthwhile intellectual exercises, but they do not solve the central problem.

  • Attempts at non-NVIDIA backends in Triton: As mentioned, some have tried to add support for M1. In 2023, a contributor reported: “I am working on an Apple silicon backend, but the project has undergone significant architectural changes; from my tests, I cannot get even ROCm to work, it seems that NVIDIA GPUs are currently the only ones functioning. The new design is quite tailored to NVIDIA... I hope it becomes more abstract, I wouldn’t want all the work on M1 to go to waste”. This GitHub issue (#2048) is still open and underscores how, at least at that time, the refactoring of Triton around MLIR had temporarily rendered alternative backends in the works non-functional. However, this can also be viewed positively: such architectural changes (late 2022) aimed to make Triton more modular and prepared for multiple backends. Indeed, today we see the fruits on AMD. Thus, work on the M1 backend could resume in the future on more solid foundations. So far, however, there have been no substantial pull requests upstream adding Apple support. It is possible that there are non-public experimental branches or internal efforts (for example, some interested research group), but nothing official has emerged.

  • PyTorch's approach: Metal codegen in Inductor: as already mentioned, PyTorch developers have not stood still. Over 2024-2025 they began implementing a native Metal codegen inside TorchInductor. The project is significant because it effectively replaces Triton's role (for the specific case of PyTorch on Mac) with a new dedicated component. The generated code is no longer PTX but MSL: Inductor contains a codegen module (mps.py) that translates the optimized PyTorch graph into a compilable Metal shader. Judging by the comments, this support is still experimental and incomplete, but it has reached the point where a simple model can be compiled and launched on M1 GPUs via Metal. If it matures, PyTorch on Mac will be able to perform operator fusion and other optimizations without Triton. However, this remains limited to PyTorch and offers no generic public API. Moreover, duplicating optimization logic that Triton already provides is inefficient in the long run. It would be more ideal if PyTorch could use Triton on the Mac as well; lacking it, a surrogate had to be built.

  • Apple MLX (Machine Learning eXperience): in 2023, Apple surprised everyone by open-sourcing MLX, a numerical framework similar to NumPy/JAX and optimized for Apple Silicon. MLX relies on highly optimized kernels (also targeting the ANE, the Neural Engine, in addition to GPU/CPU) and in some demos has shown significant performance boosts for deep learning models on Mac (see the second sketch after this list). However, MLX does not directly expose a way to write custom kernels as Triton does – it is a library of predefined operations, albeit a very comprehensive one. A GitHub user asked Apple whether they intend to support “writing Metal kernels via Python” in MLX, along the lines of Numba CUDA. There is currently no indication that this is planned (the issue remains a simple request). Apple seems to aim at providing all the necessary primitives already optimized internally, rather than giving users the freedom to program the GPU. The emerging strategy is nonetheless interesting: as mentioned earlier, Apple is developing a CUDA backend for MLX, so that code written with MLX on a Mac can then be “exported” to NVIDIA clusters for accelerated execution. This inverse-CUDA approach suggests Apple wants to reduce vendor lock-in in order to encourage adoption of its framework (knowing that production models run on NVIDIA GPUs). It is amusing to note that Triton has the opposite goal – letting those who develop in a CUDA-like environment run on other accelerators by writing portable code. In an ideal world the two approaches would converge: if Apple opened the door to Triton, we could write a kernel once and run it anywhere – on a Mac during development and on NVIDIA/AMD GPUs in production, without modification. For now, though, we have MLX for Mac (with potential export) and Triton for NVIDIA/AMD (with potential import, if Apple were ever supported).

  • Multi-platform solutions in other languages: It is worth noting that the idea of a portable GPU programming language is not science fiction – there are already examples. In the Julia realm, for instance, the community has created Metal.jl for direct targeting of Apple GPUs, as well as the KernelAbstractions.jl package that allows writing a kernel in Julia and executing it on different backends (CUDA, Metal, CPU) almost transparently. This demonstrates that a common abstraction over heterogeneous GPUs is possible: Julia does this by leveraging the fact that it can interface with various drivers (CUDA, Metal) with native bindings. In the Python world, projects like SYCL/DPC++ from Intel aim for something similar for CPU/GPU, but have not gained traction in AI like Triton. Mojo, an emerging new language for high-performance computing compatible with Python, also promises portability across devices (including Apple cores) thanks to a powerful unified backend compiler – but it is still in its early and proprietary phase. In summary, the absence of Triton on Apple does not mean that achieving something analogous is impossible: simply, as of today, the implementation is lacking in that context. If Julia can launch kernels on Metal and even on WebGPU, nothing theoretically prevents Triton (or its successor) from including an Apple backend and realizing the vision of a universal language for programming GPUs regardless of the manufacturer.
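
As referenced in the first item of this list, Triton's CPU interpreter gives a flavor of what these macOS builds can and cannot do. A hedged sketch (TRITON_INTERPRET is Triton's documented debugging switch and must be set before import; no GPU is involved):

```python
import os
os.environ["TRITON_INTERPRET"] = "1"  # must be set before importing triton

import torch
import triton
import triton.language as tl

@triton.jit
def double_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2, mask=mask)

# Plain CPU tensors: the interpreter emulates the kernel with no GPU at all,
# which is exactly the limit of Triton on a Mac today.
x = torch.arange(10, dtype=torch.float32)
out = torch.empty_like(x)
double_kernel[(1,)](x, out, x.numel(), BLOCK=16)
print(out)  # tensor([ 0.,  2.,  4., ..., 18.])
```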
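And for the MLX item: a minimal sketch of its NumPy-like, lazily evaluated API (standard mlx.core calls), including the unified-memory trait discussed earlier – the same arrays can be dispatched to CPU or GPU streams without any copy:

```python
import mlx.core as mx

a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

# Operations build a lazy graph; mx.eval() forces computation.
c = mx.matmul(a, b)
mx.eval(c)

# Unified memory in action: the same array feeds both device streams,
# with no host/device transfer.
gpu_sum = mx.sum(c, stream=mx.gpu)
cpu_sum = mx.sum(c, stream=mx.cpu)
mx.eval(gpu_sum, cpu_sum)
print(gpu_sum.item(), cpu_sum.item())
```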

Future prospects and conclusions

The crucial question is: how can this gap in the Apple Silicon ecosystem be bridged? Several paths are possible, not mutually exclusive:

  • Apple directly supports Triton: A desirable scenario would be direct collaboration. Apple could provide resources (engineers, documentation) to help implement the Metal backend in Triton. For example, it could release a specific SDK to compile compute kernels offline, or collaborate with OpenAI/PyTorch to define an optimal pathway. Considering that Apple has sponsored parts of MLX for CUDA, it is not impossible that they also evaluate the reverse path. From Apple's perspective, embracing Triton would mean making Macs more attractive to researchers: an experienced user could develop and optimize models directly on MacBook Pro, fully leveraging the GPU, without having to adapt to different languages. This would increase the value of the Mac platform in open-source ML. Of course, Apple would have to accept losing some control in favor of open source, which is not a given.

  • Community open-source effort: if Apple does not move, the community could proceed anyway. The arrival of AMD support in Triton shows that the codebase can now be extended to new architectures. A motivated team of developers could work on an Apple backend. It would likely require reverse-engineering some details (for example, best practices for distributing threads across Apple cores, how to manage tile memory, etc., based on scant public references and empirical testing). One interesting possibility is to leverage the MoltenVK infrastructure: if Triton generated SPIR-V (as was considered in the past for Intel/AMD), MoltenVK could translate it into Metal calls. However, MoltenVK is designed for graphics workloads and does not guarantee exposing all compute features at peak performance. More likely, MSL would have to be generated directly. In any case, such a project would require months of work and skills spanning both compilers (MLIR, LLVM) and Metal GPU programming – a rather rare combination. It could emerge from academia (for example, research groups wanting to use Mac clusters for parallel computing) or from companies focused on on-device ML.

  • Evolution of high-level frameworks: meanwhile, major frameworks like PyTorch and TensorFlow could keep improving their native Apple support, reducing end users' need for Triton. PyTorch, with its Metal codegen, could progressively cover more operators and use cases, eventually approaching Triton+CUDA performance for many standard neural networks. Apple, for its part, could enhance Core ML Tools to compile models into highly optimized executables (perhaps combining the Neural Engine with the GPU). In practice, the “Apple solution” could be not to give users a Triton directly, but to make Triton less necessary because the framework takes care of everything. This walled-garden philosophy works well for many developers (those who prefer turnkey solutions), but it still leaves the power-user segment unsatisfied – those who want to hand-tune specific optimizations. Even with better automatic backends, pure research would still want a low-level tool.

  • New intermediate abstractions: Another perspective is that a standard or intermediate layer emerges that allows portability of high-performance code across different GPUs. For example, the OpenXLA project (portable ML compiler for various accelerators) or initiatives to define standardized MLIR dialects for accelerated compute could include support for Apple GPUs. If Triton does not fill the void, perhaps a successor or competitor could. As of today, however, Triton is unique in its simplicity and effectiveness, so any alternative would at least need to be inspired by it.

The Triton language represents exactly the type of software infrastructure that is missing in the AI ecosystem on Apple Silicon. Its absence deprives researchers on Mac of the freedom to experiment freely on their hardware, forcing them to circumvent the problem with suboptimal solutions. We have seen how Triton provides a level of fine control and optimization over GPU computing that has been decisive in the NVIDIA world to push the state of the art forward (in terms of performance and prototyping speed). Bringing these capabilities to Apple Silicon would mean unlocking the full potential of the integrated GPUs of Macs in machine learning, preventing them from being underutilized or relegated to secondary roles. There is a whole audience of advanced developers who would benefit: think of teams developing cutting-edge models, who could iterate and tune directly on the laptop while traveling; or university researchers who could leverage Mac Studio pools as programmable mini-GPU clusters. Currently, many of these scenarios are impractical without Triton (or an equivalent).

The hope is that this gap will narrow in the near future. Recent developments – AMD now supported, Apple's MLX open source, PyTorch experimenting with new backends – raise hopes for greater openness and interoperability. Perhaps a “triton-metal” project will emerge, or Apple itself could surprise us by integrating a similar mechanism into its stack (perhaps a Python-facing “Metal JIT compiler”). In the meantime, the discussion itself is useful: surfacing documentation, issues, and ongoing attempts raises awareness of what is missing and why it matters. Apple Silicon has brought a wave of hardware innovation: it would be paradoxical not to exploit it fully because of software shortcomings.
