Understanding the Elixir Machine Learning Ecosystem

An introduction to Machine Learning in Elixir through a glossary of libraries.

Edit (March 23, 2024): Added sections about the instructor_ex and langchain libraries

In my previous article I wrote about the process of transitioning into the Elixir Machine Learning ecosystem and included some arguments as to why I believe now is a good time to make that move. That article generated some buzz, even reaching the #2 spot on Hacker News for a brief time. This prompted some lively discussion about the benefits and drawbacks of using Elixir for your machine learning applications, and I believe that much of the discussion was driven by a lack of understanding of the state of the Elixir machine learning ecosystem, possibly due to a lack of open educational materials on the subject. I also did a poor job setting the stage with that article for readers outside of the Elixir community, evidenced by the fact that some people were confused as to what Nx was.

Others commented on some libraries that I had not mentioned in my previous article, and seeing the feedback it became obvious that there is certainly an appetite for a centralized resource where people can be introduced to these libraries.

With Elixir ML moving at such a rapid pace, it is very likely that many articles or resources are out of date, so I will attempt to keep this one updated to the best of my ability. In this article, I will attempt to bridge the gap by offering a glossary of machine learning libraries and explaining the core technologies that undergird the stack.

This article is NOT meant to be a tutorial on any specific techniques or libraries. For a great (the best?) resource on Machine Learning in Elixir, check out Sean Moriarity's book of the same name published by PragProg.

Elixir-Nx

Elixir-Nx is an organization that houses most of the Elixir core machine learning libraries. It started after José Valim (creator of Elixir) came across Sean Moriarity's first book, Genetic Algorithms in Elixir. Valim explained in a podcast that prior to that book he had not considered using Elixir for machine learning. He admitted that he did not actually read the book, but the title alone intrigued him enough that he reached out to Moriarity about exploring the development of a machine learning ecosystem. After putting together a core team, they decided that the first step would be implementing a numerical computing library to serve as the foundation for the rest of the ecosystem, after which they could build higher-level libraries such as Axon and Scholar, which we will discuss later.

Nx

Nx is the foundational numerical computing library in Elixir. It can be compared to NumPy in Python. Simply put, it is a tensor creation and manipulation library that offers granular linear-algebra operations on those tensors. Nx's primary data structure is the Tensor struct. Nx.Container is a protocol for data structures that can hold tensors and be traversed by Nx; by default Nx implements it for Tuple, Map, Integer, Float, Complex, and Tensor, and it also provides an Any implementation that your own structs can opt into using the @derive attribute.
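To give a feel for the API, here is a minimal sketch of creating and operating on tensors (the values are arbitrary):

```elixir
# Create a 2x3 tensor and apply a few elementwise and linear-algebra operations.
t = Nx.tensor([[1, 2, 3], [4, 5, 6]])

Nx.shape(t)                 #=> {2, 3}
Nx.add(t, 10)               # elementwise addition with broadcasting
Nx.dot(t, Nx.transpose(t))  # matrix multiplication, yielding a {2, 2} tensor
Nx.mean(t, axes: [1])       # mean of each row
```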

Nx currently ships with three backends: the Elixir binary backend, EXLA, and Torchx. The binary backend uses Elixir/Erlang binaries for its underlying storage, while EXLA uses Google's XLA and Torchx uses Facebook's PyTorch / LibTorch. EXLA and Torchx are both supported using Native Implemented Functions (NIFs), enabling GPU support for tensor operations. With the recent release of Huggingface's Candle, an ML framework written in Rust, there very well might be another addition to this list in the future. Nx uses the binary backend by default, so be sure to include the additional library and set it as your backend if you wish to take advantage of the native libraries.
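For example, switching to EXLA looks roughly like this (a sketch, assuming :exla is already in your dependencies):

```elixir
# In config/config.exs: make EXLA the default backend for all tensors.
config :nx, default_backend: EXLA.Backend

# Or set it at runtime instead:
Nx.global_default_backend(EXLA.Backend)
```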

When compiling for one of the native backends, there is the concept of a "numerical definition", written using defn and deftransform (as well as their private counterparts defnp and deftransformp). Functions written as numerical definitions are added to the compiler's computation graph, which imposes certain restrictions on the code you can write inside them. You can still use Nx functions on tensors outside of these definitions, but they will not be compiled into the graph and as such will not get the added performance benefits.
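A minimal sketch of a numerical definition (the softmax function here is just an illustrative choice):

```elixir
defmodule MyMath do
  import Nx.Defn

  # Compiled into the backend's computation graph (e.g. XLA when using EXLA).
  defn softmax(t) do
    e = Nx.exp(t - Nx.reduce_max(t))
    e / Nx.sum(e)
  end
end

MyMath.softmax(Nx.tensor([1.0, 2.0, 3.0]))
```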

Nx also provides very nice abstractions such as Nx.Serving and Nx.Batch which can be used for conveniently serving models in a distributed manner as is expected of Elixir applications.
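A small sketch of the serving API, following the shape of the examples in the Nx documentation (the doubling function is just a stand-in for a real model):

```elixir
# Wrap a jitted function in a serving, then run a batched request through it.
serving = Nx.Serving.new(fn opts -> Nx.Defn.jit(&Nx.multiply(&1, 2), opts) end)

batch = Nx.Batch.stack([Nx.tensor([1, 2, 3])])
Nx.Serving.run(serving, batch)
```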

Axon

Axon is the deep learning (DL) neural network library written by Sean Moriarity, which adds DL-specific abstractions on top of Nx. The design of Axon is largely inspired by PyTorch, with the Axon.Loop construct stemming from PyTorch Ignite. The three high-level APIs exposed by Axon are its Functional API, Model Creation API, and Training API. Axon also includes APIs for model evaluation, execution, and serialization. Axon ships with pre-made layers, loss functions, metrics, etc., but also gives the user the ability to add custom implementations. Hooks are included in the training cycle to allow custom behavior during training.
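A hedged sketch of the model-creation and training APIs; train_data is a hypothetical Enumerable of {input, target} batches:

```elixir
model =
  Axon.input("features", shape: {nil, 4})
  |> Axon.dense(16, activation: :relu)
  |> Axon.dense(3, activation: :softmax)

# Train with a built-in loss and optimizer; `train_data` is assumed to exist.
params =
  model
  |> Axon.Loop.trainer(:categorical_cross_entropy, :adam)
  |> Axon.Loop.run(train_data, %{}, epochs: 10)
```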

Bumblebee

Bumblebee is a library of pre-trained transformer models akin to Python's Huggingface Transformers library. All models in Bumblebee are built on top of Axon and, as such, can be manipulated in the same way you would an Axon model. You can perform inference using the default pre-trained weights, or you can fine-tune a model on your own data for improved performance in a specific domain. You can see the list of all models included in Bumblebee on the sidebar of the documentation, but here is a sample just to name a few: BART, BERT, Whisper, GPT-2, ResNet, and Stable Diffusion.
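For instance, loading a pre-trained model and serving it might look roughly like this (a sketch; the checkpoint name is just one of many available on the Hugging Face Hub):

```elixir
{:ok, model_info} = Bumblebee.load_model({:hf, "bert-base-uncased"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})

# Build an Nx.Serving for the fill-mask task and run a single prediction.
serving = Bumblebee.Text.fill_mask(model_info, tokenizer)
Nx.Serving.run(serving, "The capital of France is [MASK].")
```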

Scholar

Scholar is a traditional machine learning library for Elixir, comparable to much of the functionality found in Python's SKLearn. In the words of its documentation, "Scholar implements several algorithms for classification, regression, clustering, dimensionality reduction, metrics, and preprocessing." Scholar is divided into its Model modules and its Utility modules. It includes models for linear/logistic regression, linear/Bezier/cubic interpolation, PCA, Gaussian/Multinomial Naive Bayes, and more. Some of its utilities include distance/similarity/clustering metrics as well as preprocessing functions such as normalization and encoding.
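A minimal sketch of the fit/predict pattern, using linear regression on made-up data:

```elixir
x = Nx.tensor([[1.0], [2.0], [3.0], [4.0]])
y = Nx.tensor([2.0, 4.0, 6.0, 8.0])

# Fit a linear model, then predict on a new observation.
model = Scholar.Linear.LinearRegression.fit(x, y)
Scholar.Linear.LinearRegression.predict(model, Nx.tensor([[5.0]]))
```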

Explorer

Explorer is a library for exploring and manipulating one-dimensional series and two-dimensional, tabular dataframes, built on top of the Rust Polars library. According to its README, "The API is heavily influenced by Tidy Data and borrows much of its design from dplyr." In Explorer, a Series is one-dimensional and similar to an Elixir List, except that it can only contain items of a single type. A DataFrame is simply a way to work on multiple Series of the same length, much like a CSV file or a spreadsheet. A short example follows the feature list below.

Explorer high-level features are:

  • Simply typed series: :binary, :boolean, :category, :date, :datetime, :float, :integer, :string, and :time.
  • A powerful but constrained and opinionated API, so you spend less time looking for the right function and more time doing data manipulation.
  • Pluggable backends, providing a uniform API whether you're working in-memory or (forthcoming) on remote databases or even Spark dataframes.
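A hedged sketch of basic usage; the column names and values are made up for illustration:

```elixir
require Explorer.DataFrame, as: DF

df = DF.new(name: ["Ada", "Grace", "Alan"], age: [36, 45, 41])

# Filter rows and select a column, dplyr-style.
df
|> DF.filter(age > 40)
|> DF.select(["name"])
```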

Scidata

Scidata houses sample datasets that enable easy training and testing of models on industry-standard datasets such as MNIST, CIFAR, IMDB Reviews, Iris, Wine, and more. In Python, many machine learning libraries have their own datasets API, such as SKLearn's Toy Datasets, Keras Datasets, and PyTorch Datasets. Scidata has a very simple API, and the downloaded data can be loaded into Nx tensors. Scidata separates each dataset into its own module and provides separate download_test functions to download just a test set rather than the whole dataset. Scidata also provides utilities that allow you to use the Scidata API to download custom datasets.
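For example, downloading MNIST and converting it to tensors looks roughly like this (a sketch of the download API):

```elixir
{{image_data, image_type, image_shape}, {label_data, label_type, label_shape}} =
  Scidata.MNIST.download()

# Convert the raw binaries into Nx tensors with the reported types and shapes.
images = image_data |> Nx.from_binary(image_type) |> Nx.reshape(image_shape)
labels = label_data |> Nx.from_binary(label_type) |> Nx.reshape(label_shape)
```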

EXGBoost

EXGBoost is the library I wrote to provide Elixir bindings to the XGBoost API. EXGBoost implements NIF bindings to the C++ API that XGBoost supplies. XGBoost is a C++ gradient-boosted decision tree library. The official XGBoost project provides APIs for the following languages / technologies: Python, JVM, R, Ruby, Swift, Julia, C, C++, and a CLI. Gradient Boosted Decision Trees are a form of ensemble learning mostly used for classification or regression tasks on tabular data. Moriarity wrote an introductory article on the library here, and I wrote a bit about the process of writing the library here.

EXGBoost consumes Nx tensors for training, but the model output from training is represented by a Booster struct and cannot be used with constructs such as Nx.Serving unless you compile the model into tensor operations using the accompanying Mockingjay library I wrote. Once models are compiled to tensor operations, they can only be used for inference.
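A minimal sketch of the train/predict flow, on made-up data and with default options:

```elixir
x = Nx.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = Nx.tensor([0, 1, 0])

# Train a Booster on the tensors, then predict on the same data.
booster = EXGBoost.train(x, y)
EXGBoost.predict(booster, x)
```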

Ortex

The Ortex README summarizes it quite succinctly:

Ortex is a wrapper around ONNX Runtime (implemented as bindings to ort). Ortex leverages Nx.Serving to easily deploy ONNX models that run concurrently and distributed in a cluster. Ortex also provides a storage-only tensor implementation for ease of use.

ONNX models are a standard machine learning model format that can be exported from most ML libraries like PyTorch and TensorFlow. Ortex allows for easy loading and fast inference of ONNX models using different backends available to ONNX Runtime such as CUDA, TensorRT, Core ML, and ARM Compute Library.
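A hedged sketch of loading and running an ONNX model, following the shape of the README example; the model path and input shape are placeholders:

```elixir
model = Ortex.load("./models/resnet50.onnx")

# Run a single dummy input through the model; real inputs would be preprocessed images.
{output} = Ortex.run(model, Nx.broadcast(0.0, {1, 3, 224, 224}))
```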

Livebook

Livebook is Elixir/Erlang's interactive notebook solution. Livebook is comparable to Jupyter Notebooks, although the Livebook project has certainly not confined itself to the same design decisions as Jupyter. Livebook embraces the functional nature of Elixir by allowing Forks within a Livebook, where a new section is derived from a previous section and starts with the same state as the forked section.

Livebook also has the concept of Smart Cells which allow you to write templates for interactive cells that can be reused. Smart Cells are powered by a companion library to Livebook called Kino. As explained in the Livebook tutorial:

In a nutshell, Kino is a library that you install as part of your notebooks to make your notebooks interactive. Kino comes from the Greek prefix "kino-" and it stands for "motion". As you learn the library, it will become clear that this is precisely what it brings to our notebooks.

Kino can render Markdown, animate frames, display tables, manage inputs, and more. It also provides the building blocks for extending Livebook with charts, smart cells, and much more.
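A small sketch of a few Kino building blocks you might call from a Livebook cell:

```elixir
# Render Markdown output directly in the notebook.
Kino.Markdown.new("**Hello** from Kino!")

# Create an input widget and read its current value in a later cell.
name_input = Kino.Input.text("Your name")
Kino.Input.read(name_input)
```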

Livebook also allows you to configure your runtime and even attach to a running Elixir node, such as an IEx session. Lastly, one of my favorite design decisions of Livebook is that notebooks are saved as plain Markdown files, enabling very easy sharing. You can write entire blog posts in Markdown which can be run as Livebooks (refer to the EXGBoost article I linked above for an example)!

Instructor_ex

Instructor_ex is a self-described "spiritual port" of the Python instructor library. It implements the simple yet extremely powerful idea of coercing Large Language Models (LLMs) such as OpenAI's GPT-4 to output answers according to a defined Ecto schema. You provide the structure of the desired output, including types and a description, and it will coerce the LLM's response to conform to that schema, providing validation and automatic retries when an answer deviates from the schema.
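A hedged sketch of the idea, loosely following the style of the project's README (the schema and prompt are made up):

```elixir
defmodule SpamPrediction do
  use Ecto.Schema

  @primary_key false
  embedded_schema do
    field(:class, Ecto.Enum, values: [:spam, :not_spam])
  end
end

# Ask the LLM for a response that must conform to the SpamPrediction schema.
Instructor.chat_completion(
  model: "gpt-3.5-turbo",
  response_model: SpamPrediction,
  messages: [
    %{role: "user", content: "Classify this email: 'Congratulations, you won a prize!'"}
  ]
)
#=> {:ok, %SpamPrediction{class: :spam}}
```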

The documentation showcases many examples of how you can use this, such as performing Text Classification, Question & Answering (with citations), and even Extracting Text from Images using GPT-4 Vision. As you can see from these examples, instructor_ex makes using LLMs for programmatic tasks much more approachable.

Jason Liu, the author of the original Python Instructor library, has a bevy of writings about the underlying motivation and concepts behind instructor which you can read about here.

💡 You can see an example of instructor_ex in action at this Livebook App hosted on HuggingFace Spaces: https://huggingface.co/spaces/acalejos/livebook-apps

Elixir Langchain

Elixir Langchain is a port of the very popular LangChain libraries (Python / TS), which aim to provide modular abstractions over many common Large Language Model (LLM) tasks and features.

The Elixir-LangChain project describes itself as follows:

LangChain is a framework for developing applications powered by language models. It enables applications that are:

  • Data-aware: connect a language model to other sources of data
  • Agentic: allow a language model to interact with its environment

The main value props of LangChain are:

  1. Components: abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not
  2. Off-the-shelf chains: a structured assembly of components for accomplishing specific higher-level tasks

Off-the-shelf chains make it easy to get started. For more complex applications and nuanced use-cases, components make it easy to customize existing chains or build new ones.

LangChain offers off-the-shelf chains for typical LLM tasks, but it also allows you to build custom chains from the library's primitives and to have LLMs invoke custom Elixir functions through its function-calling feature.

This makes it much easier to programmatically interact with LLMs by providing a data structure to represent the conversation from the client-side, allowing you to compose interactions with the LLM.
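A hedged sketch of composing a conversation with the library; the return shape shown here follows the README at the time of writing and may differ between versions:

```elixir
alias LangChain.Chains.LLMChain
alias LangChain.ChatModels.ChatOpenAI
alias LangChain.Message

# Build a chain around an OpenAI chat model, add a user message, and run it.
{:ok, _updated_chain, response} =
  %{llm: ChatOpenAI.new!(%{model: "gpt-4"})}
  |> LLMChain.new!()
  |> LLMChain.add_message(Message.new_user!("Summarize Elixir in one sentence."))
  |> LLMChain.run()

response.content
```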

Mark Ericksen, the author of Elixir LangChain, posted a writeup for an example project using the library, so I suggest you read that if you're interested.


Summary

| Python Library | Elixir Library | Description |
| --- | --- | --- |
| NumPy | Nx | Numerical definitions, tensors and tensor operations |
| TensorFlow / PyTorch | Axon | Deep Learning / Neural Networks |
| Transformers | Bumblebee | Pretrained transformer models |
| SKLearn | Scholar | Traditional Machine Learning |
| Pandas | Explorer | Tabular dataframe manipulation |
| SKLearn Datasets | Scidata | Sample datasets |
| XGBoost | EXGBoost | Gradient-Boosted Decision Trees |
| Jupyter Notebooks | Livebook | Interactive Notebooks |
| ONNX Runtime | Ortex | ONNX Runtime inference |
| Instructor | instructor_ex | Structured, Ecto outputs with OpenAI (and OSS LLMs) |
| LangChain | Elixir Langchain | Framework for developing applications powered by language models |

Conclusion

This was by no means an exhaustive list of all libraries in Elixir that are useful for machine learning tasks, but I did try to cover the most prominent libraries that are currently available. The Elixir ML ecosystem is alive and well, albeit still quite young. It has made great strides over the past few years but still has much room to grow, so you should feel encouraged to contribute yourself! I did not have much experience in Elixir before I began contributing to the ML ecosystem, and I would implore anyone who is looking for ways to get started with open-source contributions to look no further than Elixir. If you're new to Elixir and have made it this far, then you will probably pick it up quickly, but nonetheless you should check out my previous article with 5 Tips for Elixir Beginners. That's all I have for now. Thanks for reading, and consider subscribing to this website if you like this kind of content.
