The Best Way to Use FFmpeg
AI-Powered FFmpeg


FFmpeg is one of the most powerful tools out there, but it can be daunting to use — until now.

Introduction


As I was working on a previous fun little demo of mine that used ffmpeg, I was reminded why, despite how powerful a tool it is, I always find myself wincing at the thought of having to use it. It seems to be a universally shared experience among developers: once in a blue moon you find yourself needing to convert a video to a GIF, downsample the resolution of an image, or perform any of the other tasks that ffmpeg handles. You already have it installed from the last time you needed it, but yet again you've unlearned everything you learned last time. So inevitably you start Googling again and see those purple links serving as a faint reminder of all you've forgotten.

You don't have to take my word for it either. In a recent video, YouTuber ThePrimeagen made similar comments regarding the use of ffmpeg. Now, with the advent of AI and LLM tools, it has become much easier to use command-line tools, and ffmpeg is certainly one of the biggest beneficiaries. Still, the experience remains far from perfect. Your new ffmpeg workflow might look like this:

graph TD
  A[Ask ChatGPT a question about how to perform a certain operation in `ffmpeg`] --> B[Copy the output]
  B --> C[Paste the output to the terminal]
  C --> D[Get a wall of text with output]
  D -.-> E[Maybe ask ChatGPT to parse it since it's so dense]
  D --> F[Open the file]
  E --> F
  F --> G[Check if it worked the way you expected]
  G --> H{Did it work?}
  H -->|Yes| I[End]
  H -->|No| A

It seemed to me that there should be a more integrated solution. iTerm2's new LLM integration features (shown off in the video linked above) are a step in the right direction, but ffmpeg still provides a certain amount of feedback on failures that shouldn't have to be fed back to the LLM and iterated on by hand.

So, let's go ahead and fix that! I'll show you how to make an interactive multimedia editor, powered by AI and ffmpeg, that handles those concerns for you.

You can give this a star on GitHub here: https://github.com/acalejos/CinEx

Here's a demo of the final product in case you're curious:


Install Dependencies

Let's start by discussing the dependencies we'll use. There are two main dependencies and two less important ones. I'll go ahead and describe each in detail:

  • Kino - This is the library that provides the capacity to make interactive experiences in Livebook, which is the platform I'll be using. You can read more about Livebook here.
  • Instructor - Coerces responses from LLMs into JSON where we can also provide a schema and a set of validation functions that the responses must conform to. This makes it easy to use responses from LLMs within a data pipeline.
    • You will need to provide a configuration to tell Instructor which LLM you are using by supplying an adapter. For this, I will be using the OpenAI adapter and using an environment variable to store my API key. Do note that although in the code it refers to the environment variable as LB_OPENAI_TOKEN, Livebook itself will actually prepend LB_ to your created tokens, so when you make a token in the left sidebar you would only need to call it OPENAI_TOKEN.
  • erlexec - Provides a more powerful way (over the standard library's System module) to run executables from Elixir (I should note it is actually an Erlang library, but you can call Erlang directly from Elixir). This is how we will capture stdout and stderr separately, whereas System.cmd/3 at best merges stderr into stdout. There's a short sketch of this right after the Mix.install cell below.
  • Exterval - Supports writing real-valued intervals using a ~i sigil. These intervals implement the Enumerable protocol, and we will use one for a validation with Instructor (there's a small example of the sigil later, right after the Alfred module). You could easily drop this library altogether, but I wrote it and know that it's just a single file, so I'm comfortable leaving it in.
Mix.install(
  [
    {:kino, "~> 0.12.3"},
    {:instructor, "~> 0.0.5"},
    {:erlexec, "~> 2.0"},
    {:exterval, "~> 0.2.0"}
  ],
  config: [
    instructor: [
      adapter: Instructor.Adapters.OpenAI,
      openai: [api_key: System.fetch_env!("LB_OPENAI_TOKEN")]
    ]
  ]
)
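To see why erlexec earns its spot, here's a minimal sketch of my own (not part of the original notebook) showing a synchronous run where stdout and stderr come back as separate entries in the result keyword list. The choice of `ffmpeg -version` is just an example command:

# Illustrative only: run a command synchronously and capture both streams separately.
# Keys are only present if the program actually wrote to that stream.
{:ok, result} = :exec.run("ffmpeg -version", [:sync, :stdout, :stderr])

Keyword.get(result, :stdout, [])  # chunks written to stdout
Keyword.get(result, :stderr, [])  # chunks written to stderr, kept separate from stdout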

Upload Struct

First, we'll start by making the Upload module which is in charge of operations related to the uploaded / generated media.

This has helper functions and guards to determine allowed file types, and includes the function to turn an Upload into a renderable Kino.

defmodule Upload do
  defstruct [:filename, :path]

  @video_types [:mp4, :ogg, :avi, :wmv, :mov]
  @audio_types [:wav, :mp3, :mpeg]
  @image_types [:jpeg, :jpg, :png, :gif, :svg, :pixel]

  defguard is_audio(ext) when ext in @audio_types
  defguard is_video(ext) when ext in @video_types
  defguard is_image(ext) when ext in @image_types
  defguard is_valid_upload(ext) when is_audio(ext) or is_video(ext) or is_image(ext)

  def accepted_types, do: @audio_types ++ @video_types ++ @image_types

  defp to_existing_atom(str) do
    try do
      {:ok, String.to_existing_atom(str)}
    rescue
      _ in ArgumentError ->
        {:error, "#{inspect(str)} is not an existing atom"}

      _e ->
        {:error, "Unknown Error ocurred in `String.to_existing_atom/1`"}
    end
  end

  def ext_type(filename) do
    with <<"."::utf8, rest::binary>> <- Path.extname(filename),
         {:ok, ext} <- to_existing_atom(rest) do
      ext
    end
  end

  def to_kino(upload = %__MODULE__{path: path}) do
    content = File.read!(upload.path)

    case ext_type(path) do
      ext when is_audio(ext) ->
        Kino.Audio.new(content, ext)

      ext when is_video(ext) ->
        Kino.Video.new(content, ext)

      ext when is_image(ext) ->
        Kino.Image.new(content, ext)
    end
  end

  def new(filename, path) do
    %__MODULE__{filename: filename, path: path}
  end

  def generate_temp_filename(extension \\ "mp4") do
    random_string = :crypto.strong_rand_bytes(8) |> Base.encode16()
    temp_dir = System.tmp_dir!()
    Path.join(temp_dir, "temp_#{random_string}.#{extension}")
  end
end
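A couple of quick, purely illustrative calls against the module above (return values shown as comments; the temp path will differ on your machine):

# The extension atoms already exist because they're listed in the module attributes above.
Upload.ext_type("clip.mov")
#=> :mov

Upload.accepted_types()
#=> [:wav, :mp3, :mpeg, :mp4, :ogg, :avi, :wmv, :mov, :jpeg, :jpg, :png, :gif, :svg, :pixel]

# Random name under the system temp dir.
Upload.generate_temp_filename("gif")
#=> "/tmp/temp_9C3C2B6E51D0A4F7.gif" (for example)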

Setup State Management

Next, we'll set up some simple state management Agents.

We will make agents to track the form state for the UI and the history of videos so we can undo, reset, and track previous prompts.

We need to track input values using the FormState Agent since we are not using a Kino.Control.form, which means Kino.Input.read/1 will not work for our use case of repeatedly reading changing input values. So instead we just listen for change events and store the values as state.
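In skeleton form, the pattern for a single input looks roughly like this (a sketch only; prompt_input stands in for any Kino.Input, and the real version later in this post uses a tagged stream so one listener handles every input):

# Sketch: stash the latest value whenever the input changes,
# instead of trying to Kino.Input.read/1 it at submit time.
Kino.listen(prompt_input, fn %{type: :change, value: value} ->
  FormState.update(:prompt, value)
end)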

The EditHistory Agent is a simple queue to track history, which we essentially use as a stack. We store 2-tuples of {upload :: %Upload{}, prompt :: String.t()}, but currently only the uploads are actually used downstream. The original, unmodified media is the first element in the queue, and its prompt is nil since there was no prompt used to generate it.

defmodule FormState do
  use Agent

  def start_link(_init) do
    Agent.start_link(fn -> %{prompt: "", retries: 2, debug: false, explain_outputs: true} end,
      name: __MODULE__
    )
  end

  def update(key, value) do
    Agent.update(__MODULE__, fn state -> Map.put(state, key, value) end)
  end

  def get(key) do
    Agent.get(__MODULE__, fn state -> Map.get(state, key) end)
  end
end

defmodule EditHistory do
  use Agent

  def start_link(_init) do
    Agent.start_link(fn -> :queue.new() end, name: __MODULE__)
  end

  def push(%Upload{} = upload, prompt \\ nil) do
    Agent.update(__MODULE__, fn history ->
      :queue.snoc(history, {upload, prompt})
    end)
  end

  def undo_edit do
    Agent.get_and_update(__MODULE__, fn history ->
      popped = :queue.liat(history)
      {:queue.last(popped), popped}
    end)
  end

  def current do
    Agent.get(__MODULE__, fn history ->
      :queue.last(history)
    end)
  end

  def original do
    Agent.get(__MODULE__, fn history ->
      :queue.head(history)
    end)
  end

  def previous_edit do
    Agent.get(__MODULE__, fn history ->
      popped = :queue.liat(history)

      unless :queue.is_empty(popped) do
        :queue.last(popped)
      else
        nil
      end
    end)
  end

  def reset do
    Agent.get_and_update(__MODULE__, fn history ->
      original = :queue.head(history)
      {original, :queue.from_list([original])}
    end)
  end
end

Now we add the agents to our supervision tree using Kino.start_child!/1 so that they are supervised in a way that lets their state be controlled by the evaluation state of the notebook. Kino's supervision tree is special in that way, since it's meant to work within Livebook.

Enum.each([EditHistory, FormState], &Kino.start_child!/1)
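To make the queue-as-a-stack behavior concrete, here's a purely illustrative session. The paths are made up and nothing is read from disk; if you actually run this, just re-evaluate the cell above afterwards so the agents restart with clean state:

# Illustrative only: fake uploads to show the stack semantics.
original_media = Upload.new("cat.mp4", "/tmp/cat.mp4")
cropped_media = Upload.new("cat.mp4", "/tmp/cat_cropped.mp4")

EditHistory.push(original_media)                     # first entry, prompt is nil
EditHistory.push(cropped_media, "Crop to a square")  # result of an edit

EditHistory.current()    #=> {cropped_media, "Crop to a square"}
EditHistory.undo_edit()  #=> {original_media, nil}, and the edit is dropped from history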

Setup Boilerplate

This module is just in charge of storing templates that we will use to render logs (with levels) or the program output (from stdout and stderr). This uses EEx, a templating library that is part of Elixir's standard library.

Every cell in a Livebook notebook is rendered within its own iframe, so you really have a ton of flexibility in what you can output.
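If you haven't used EEx before, the only piece we rely on is EEx.eval_string/2, which takes a template and a keyword list of bindings (which is also why the helper functions below just pass binding() straight through). A tiny example:

# Minimal EEx example: <%= ... %> interpolates values from the given bindings.
EEx.eval_string("<div class='message-box <%= level %>'><%= message %></div>",
  level: "info",
  message: "Hello from EEx"
)
#=> "<div class='message-box info'>Hello from EEx</div>"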

defmodule Boilerplate do
  def placeholder,
    do: """
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Video Preview Placeholder with Spinner</title>
        <style>
            .video-preview-placeholder {
                width: 100%;
                max-width: 640px;
                height: 0;
                padding-bottom: 56.25%; /* 16:9 aspect ratio */
                border: 2px dashed #ccc;
                display: flex;
                align-items: center;
                justify-content: center;
                background-color: #f9f9f9;
                color: #666;
                font-size: 20px;
                text-align: center;
                position: relative;
                box-sizing: border-box;
                margin: auto;
            }
            .spinner-container {
                display: flex;
                flex-direction: column;
                align-items: center;
                justify-content: center;
                position: absolute;
                top: 50%;
                left: 50%;
                transform: translate(-50%, -50%);
            }
            .spinner {
                border: 4px solid #f3f3f3;
                border-top: 4px solid #3498db;
                border-radius: 50%;
                width: 40px;
                height: 40px;
                animation: spin 2s linear infinite;
                margin-bottom: 10px;
            }
            @keyframes spin {
                0% { transform: rotate(0deg); }
                100% { transform: rotate(360deg); }
            }
            .message {
                font-size: 16px;
                color: #666;
            }
        </style>
    </head>
    <body>
        <div class="video-preview-placeholder">
            <div class="spinner-container">
                <%= if show_spinner do %>
                    <div class="spinner"></div>
                <% end %>
                <div class="message"><%= message %></div>
            </div>
        </div>
    </body>
    </html>
    """

  def log_template,
    do: """
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Log Level Message Box</title>
        <style>
            .message-box {
                width: 100%;
                border: 2px solid;
                padding: 20px;
                box-sizing: border-box;
                margin: 20px 0;
                border-radius: 5px;
                font-size: 18px;
                text-align: left;
            }
            .message-box.error {
                border-color: #f44336;
                background-color: #fdecea;
                color: #f44336;
            }
            .message-box.success {
                border-color: #4caf50;
                background-color: #e8f5e9;
                color: #4caf50;
            }
            .message-box.info {
                border-color: #2196f3;
                background-color: #e3f2fd;
                color: #2196f3;
            }
        </style>
    </head>
    <body>
        <div class="message-box <%= level %>">
            <%= message %>
        </div>
    </body>
    </html>
    """

  def stdout_template,
    do: """
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title><%= device %></title>
    <style>
    body {
        background-color: #1e1e1e;
        color: #c5c8c6;
        font-family: "Courier New", Courier, monospace;
        margin: 0;
        padding: 20px 20px 20px 5px;
    }
    .container {
        border: 1px solid #444;
        border-radius: 5px;
        overflow: hidden;
    }
    .header {
        background-color: #444;
        color: #c5c8c6;
        padding: 10px;
        font-weight: bold;
        text-transform: uppercase;
    }
    .output {
        background-color: #1d1f21;
        border-left: 4px solid <%= border_color %>;
        padding: 12px 12px 12px 5px;
        font-size: 16px;
        color: #c5c8c6;
        white-space: pre-wrap;
        word-break: break-all;
    }
    </style>
    </head>
    <body>
    <div class="container">
    <div class="header">
        <%= device %>
    </div>
    <div class="output">
        <%= output %>
    </div>
    </div>
    </body>
    </html>
    """

  def make_stdout(output, device, border_color \\ "gray") do
    Kino.HTML.new(EEx.eval_string(stdout_template(), binding()))
  end

  def make_log(message, level) do
    Kino.HTML.new(EEx.eval_string(log_template(), binding()))
  end
end

Setup Form

Now we set up the form using built-in Kinos. A Frame in Kino is really just a placeholder where we can render future Kinos, so here we set up a corresponding frame for each thing we want to display over the lifespan of the application. If we don't want something to display at the start, we don't render anything into its frame. So you will notice the pattern here is to create an empty frame and then create a corresponding Kino widget, which will be rendered into its frame at some point throughout the lifespan of the application.

There are a few operations on frames you should know about since you'll see them used throughout this code:

  • Kino.Frame.render - Replaces all of the contents of the frame with the new content you specify (see the short sketch after this list).
  • Kino.Frame.append - Appends the input content to the end of the existing frame. You will notice this is how things such as the logs and output are rendered.
  • Kino.Frame.clear - Clears out the content of the frame. You will mostly see this called either at the beginning or end of event listeners throughout this code.
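Here's that short sketch, using the Boilerplate helper from the previous section. It's purely illustrative; the frame below isn't part of the app's UI:

# A standalone frame just to demonstrate the three operations.
demo_frame = Kino.Frame.new()

# render/2 replaces whatever the frame currently contains
Kino.Frame.render(demo_frame, Boilerplate.make_log("Starting up...", :info))

# append/2 adds content below what's already there
Kino.Frame.append(demo_frame, Boilerplate.make_log("Still here", :success))

# Kino.Frame.clear(demo_frame) would empty it again

demo_frame

With that mental model in place, here is the actual form setup: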
original = Kino.Frame.new()
prompt = Kino.Input.textarea("Prompt")
upload = Kino.Input.file("Upload", accept: Upload.accepted_types())
errors = Kino.Frame.new(placeholder: false)
submit_button = Kino.Control.button("Run!")
submit_frame = Kino.Frame.new(placeholder: false)
undo_frame = Kino.Frame.new(placeholder: false)
reset_frame = Kino.Frame.new(placeholder: false)
undo_button = Kino.Control.button("Undo")
reset_button = Kino.Control.button("Reset")
output = Kino.Frame.new(placeholder: false)
logs = Kino.Frame.new(placeholder: false)
debug_checkbox = Kino.Input.checkbox("Verbose Mode")
debug_frame = Kino.Frame.new(placeholder: false)
explain_checkbox = Kino.Input.checkbox("Explain Outputs", default: true)
explain_frame = Kino.Frame.new(placeholder: false)
retries = Kino.Input.number("# Retries", default: 2)

Kino.Frame.render(
  original,
  Kino.HTML.new(
    EEx.eval_string(Boilerplate.placeholder(),
      message: "Upload Media to Get Started",
      show_spinner: false
    )
  )
)

inputs = Kino.Layout.grid([prompt, retries], columns: 2, gap: 10)

buttons =
  Kino.Layout.grid([submit_frame, undo_frame, reset_frame, explain_frame, debug_frame],
    columns: 7,
    gap: 1
  )

Kino.Layout.grid([original, inputs, upload, buttons, output, logs])

FFMPEG Instructions

Now we will implement the two modules that will be used to interact with the LLM using Instructor.

Remember how I mentioned that oftentimes ffmpeg will return large chunks of text as output and it can be a bit difficult to parse through and interpret which part of it is relevant to what you want?

Well this first module, which I've appropriately named Alfred, is in charge of helping you interpret those results (if you so choose).

Alfred will be called to help explain the contents of stdout and stderr whenever the resulting command (either ffmpeg or ffprobe) writes to them. It will pass along the relevant context, including the original task that you asked to be done, as well as the command that was called that resulted in those outputs.

Alfred also provides you with a confidence metric which tells you how confident it is in the provided explanation. It ranges from 0 to 10 in increments of 0.5, but of course you can tune that as you wish.

Alfred will only be called upon if the explain_outputs form toggle is enabled, AND if the resulting command actually wrote to either stdout or stderr. If it wrote to both, Alfred will receive both and give an explanation that incorporates all of the information.

The main components needed here for Instructor are the embedded_schema, which defines the structure that must be returned; the validate_changeset function, which defines additional validations that the structured response must conform to; and the set of prompts (Instructor even uses the @doc attribute defined for the embedded_schema).

defmodule Alfred do
  use Ecto.Schema
  use Instructor.Validator
  import Ecto.Changeset
  import Exterval
  @confidence_interval ~i<[0,10]//0.5>

  @system_prompt """
  You are the companion Agent to another Agent whose job is to produce execve-styled arguments
  for programs given a specific prompt. Your job is to interpret and explain the output
  of the command after it has been run. You will be given the prompt / task that originally
  generated the command, then you will be given the command that was run, along with the
  output that was generated. You do not need to re-explain what the task was or regurgitate
  what the command was. You only need to explain what the output means within the context
  of the task. If the task / prompt was a question, you should determine whether the provided
  output directly answers the question and if it does not you should answer it based on the
  output. If the output is not relevant to the prompt this should also be noted.


  You will also provide a confidence score about how confident you are about the above explanation.
  The confidence score is separate from the explanation.
  """

  @primary_key false
  @doc """
  ## Field Descriptions:
  - explanation: Explanation of the output given the context of the task and command that was run
  - confidence: Rating from 0 to 10 in increments of 0.5 of how confident you are in your answer,
    with higher scores being more confident.
  """
  embedded_schema do
    field(:explanation, :string)
    field(:confidence, :float)
  end

  @impl true
  def validate_changeset(changeset) do
    changeset
    |> validate_inclusion(:confidence, @confidence_interval)
  end

  def execute(prompt, command, retries, outputs \\ [stdout: nil, stderr: nil]) do
    Instructor.chat_completion(
      model: "gpt-4o",
      response_model: __MODULE__,
      max_retries: retries,
      messages:
        [
          %{
            role: "system",
            content: @system_prompt
          },
          %{
            role: "user",
            content: """
            Here's the prompt that generated the command: #{inspect(prompt)}
            """
          },
          %{
            role: "user",
            content: """
            Here's command: #{inspect(command)}
            """
          },
          Keyword.get(outputs, :stdout) &&
            %{
              role: "user",
              content: """
              stdout: #{inspect(Keyword.fetch!(outputs, :stdout))}
              """
            },
          Keyword.get(outputs, :stderr) &&
            %{
              role: "user",
              content: """
              stderr: #{inspect(Keyword.fetch!(outputs, :stderr))}
              """
            }
        ]
        |> Enum.filter(& &1)
    )
  end
end
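If the ~i sigil is unfamiliar, here's roughly how that confidence interval behaves. This is a small illustration of my own; the important part is that Exterval intervals implement Enumerable, which is exactly what validate_inclusion/3 leans on:

import Exterval

confidence_interval = ~i<[0,10]//0.5>

# Membership is what Ecto's validate_inclusion/3 checks under the hood.
Enum.member?(confidence_interval, 7.5)   #=> true
Enum.member?(confidence_interval, 11.0)  #=> false (outside the interval)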

Now we define the module that is the star of the show. The AutoFfmpeg agent receives a task (prompt) from the user as well as the input type (e.g. mp4, mp3, png), and has to decide three things:

  1. What program to use to accomplish the given task (between ffmpeg and ffprobe)
  2. The set of arguments, formatted as a list of strings, to use to accomplish the task
  3. The output file type given the list of allowed types (or null if the task doesn't write to a new file)

You will also notice the field(:output_path, :string, virtual: true), which is a field the LLM is not required to output; it's set aside for us to use later on. We will use it to store information once a successful response is generated.

Most of the important work here actually happens within the validate_changeset callback. This callback is invoked by Instructor after a response is received and coerced into the schema. Any errors found during validation are stored in the changeset and are used to retry the request to the LLM, providing the error to steer the LLM towards our desired result. This is that feedback loop I mentioned above that is normally lacking.

Whereas most implementations of validate_changeset would use functions provided by Ecto.Changeset, we need to implement a very custom validation since our validation includes actually trying to run the generated ffmpeg/ffprobe command.

Here is a diagram of how the calls to the LLM using Instructor work with the validation function.

graph TD
  A[Call LLM] --> B[Receive schema from LLM]
  B --> C[Format command from schema fields]
  C --> D[Run the command]
  D --> E{Switch on the output code}
  E -->|Success: exit code 0| F[Check output conditions]
  F --> G[Emit stdout if available]
  F --> H[Emit stderr if available]
  F --> I[Call Alfred if explain_outputs]
  G --> J[Return the changeset]
  H --> J
  I --> J
  E -->|Fail: nonzero exit code| K[Extract an error message according to precedence]
  K --> L[stderr]
  K --> M[stdout]
  K --> N[Generic unknown error message]
  L --> O{Retry count < N?}
  M --> O
  N --> O
  O -->|Yes| A
  O -->|No| P[Return the error]

Hopefully now you can see the advantage of this approach over just using ChatGPT or performing the workflow manually. The self-correcting capability gained from using Instructor dramatically increases the speed at which you can go from prompt to desired result.

defmodule AutoFfmpeg do
  use Ecto.Schema
  use Instructor.Validator

  @system_prompt """
  You are a multimedia editor and your job is to receive tasks for multimedia editing and use
  the programs available to you (and only those) to complete the tasks. You will return arguments
  to be passed to the program assuming that the input file(s) has already been passed. You do not need to
  call the binary itself, you are only in charge of generating all subsequent
  arguments after inputs have been passed. Assume the output file path will be appended
  after the arguments you provide.

  You have access to the following programs: ffmpeg and ffprobe

  So assume the command already is composed of something like
  `ffmpeg -i input_file_path [..., args, ...] output_file_path` and you then pass arguments
  to complete the given task. You will also be provided the input file for context, but you
  should not include inputs in your arguments. Use the given file extension to determine how
  to form your arguments. You will also provide the output file
  extension / file type, since depending on the task it could differ from the input type. If the
  given task does not result in an operation that writes to a file, (eg. asking for timestamps
  where it is silent would result in writing to stdout), the extension would be `null`.

  If the command is such that it will output to stdout, you should output as JSON when
  possible.
  """

  @doc """
  ## Field Descriptions:
  - program: the executable program to call
  - arguments: execve-formatted arguments for the command
  - output_ext: The extension (filetype) of the outputted file
  """
  @primary_key false
  embedded_schema do
    field(:program, Ecto.Enum, values: [:ffmpeg, :ffprobe])
    field(:arguments, {:array, :string})

    field(:output_ext, Ecto.Enum,
      values: [
        :mp4,
        :ogg,
        :avi,
        :wmv,
        :mov,
        :wav,
        :mp3,
        :mpeg,
        :jpeg,
        :jpg,
        :png,
        :gif,
        :svg,
        :pixel,
        :null
      ]
    )

    field(:output_path, :string, virtual: true)
  end

  @impl true
  def validate_changeset(
        changeset,
        %{
          upload_path: upload_path,
          debug: debug,
          debug_frame: debug_frame,
          output_frame: output_frame,
          prompt: prompt,
          explain: explain,
          retries: retries
        }
      ) do
    program = Ecto.Changeset.get_field(changeset, :program)
    program_args = Ecto.Changeset.get_field(changeset, :arguments)
    input_args = ["-i", upload_path]
    output_ext = Ecto.Changeset.get_field(changeset, :output_ext)

    output_args =
      cond do
        program == :ffprobe ->
          []

        output_ext == :null ->
          ["-f", "null", "-"]

        true ->
          [Upload.generate_temp_filename(Atom.to_string(output_ext))]
      end

    command =
      Enum.join([Atom.to_string(program) | input_args ++ program_args ++ output_args], " ")

    if debug do
      message = """
      <strong>Prompt:</strong> <em>#{prompt}</em><br><br>
      <strong>Command:</strong> <code>#{command}</code>
      """

      Kino.Frame.append(
        debug_frame,
        Boilerplate.make_log(
          message,
          :info
        )
      )
    end

    case :exec.run(command, [
           :sync,
           :stdout,
           :stderr
         ]) do
      {:ok, result} when is_list(result) ->
        outputs =
          [:stdout, :stderr]
          |> Enum.map(fn device ->
            if Keyword.has_key?(result, device) do
              output = Enum.join(Keyword.fetch!(result, device), "")
              Kino.Frame.append(output_frame, Boilerplate.make_stdout(output, device))
              {device, output}
            else
              {device, nil}
            end
          end)

        if explain do
          case Alfred.execute(prompt, command, retries, outputs) do
            {:ok, %Alfred{explanation: explanation, confidence: confidence}} ->
              Kino.Frame.append(
                output_frame,
                Boilerplate.make_stdout(
                  "<strong>Explanation:</strong> #{explanation}\n\n<strong>Confidence:</strong> #{confidence}",
                  :alfred,
                  "green"
                )
              )

            {:error,
             %Ecto.Changeset{
               errors: [
                 explanation: {error, _extras}
               ],
               valid?: false
             }} ->
              Kino.Frame.append(
                debug_frame,
                Boilerplate.make_log("Trouble providing explanation: #{inspect(error)}", :error)
              )
          end
        end

        if program == :ffmpeg && output_ext != :null do
          [output_path] = output_args
          Ecto.Changeset.put_change(changeset, :output_path, output_path)
        else
          changeset
        end

      {:error, result} when is_list(result) ->
        debug &&
          Kino.Frame.append(
            debug_frame,
            Boilerplate.make_log("Something Went Wrong! Retrying...", :error)
          )

        error =
          cond do
            Keyword.has_key?(result, :stderr) ->
              Keyword.fetch!(result, :stderr) |> Enum.join("")

            Keyword.has_key?(result, :stdout) ->
              Keyword.fetch!(result, :stdout) |> Enum.join("")

            Keyword.has_key?(result, :exit_status) ->
              "Error resulted in exit code #{Keyword.fetch!(result, :exit_status)}"

            true ->
              "Unexpected error occurred!"
          end

        Ecto.Changeset.add_error(
          changeset,
          :arguments,
          error,
          status: Keyword.get(result, :exit_status)
        )
    end
  end

  def execute(prompt, %{upload_path: upload_path} = context, retries) do
    Instructor.chat_completion(
      model: "gpt-4o",
      validation_context: Map.put(context, :prompt, prompt) |> Map.put(:retries, retries),
      response_model: __MODULE__,
      max_retries: retries,
      messages: [
        %{
          role: "system",
          content: @system_prompt
        },
        %{
          role: "user",
          content: """
          Here's the editing task: #{inspect(prompt)}
          """
        },
        %{
          role: "user",
          content: """
          Here's input file type: #{inspect(Upload.ext_type(upload_path))}
          """
        }
      ]
    )
  end
end
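To make that concrete, here's one plausible (entirely hypothetical) structured response for a prompt like "Convert this to a 10 fps GIF, 320 pixels wide", along with the command the validator would assemble from it. The input path is made up, and the real LLM output will of course vary:

# Hypothetical response struct, hand-written for illustration.
response = %AutoFfmpeg{
  program: :ffmpeg,
  arguments: ["-vf", "fps=10,scale=320:-1"],
  output_ext: :gif
}

# Mirror of how validate_changeset/2 assembles the command string.
upload_path = "/tmp/temp_ABC123.mp4"
output_path = Upload.generate_temp_filename(Atom.to_string(response.output_ext))

Enum.join(
  [Atom.to_string(response.program), "-i", upload_path | response.arguments] ++ [output_path],
  " "
)
#=> "ffmpeg -i /tmp/temp_ABC123.mp4 -vf fps=10,scale=320:-1 /tmp/temp_<random>.gif"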

Listeners

The last thing to do is to set up the actual application lifecycle. We simply pass all of the input Kinos into a Kino.Control.tagged_stream, which lets us listen for events and match on which input emitted each event. Then we perform the operations for each event. Let's break down how we handle each one:

  • :explain - The explain_outputs toggle was changed, so we just update the state of that form field
  • :retries - The retries number input was changed, so we just update the state of that form field
  • :debug - The debug_checkbox toggle was changed, so we just update the state of that form field
  • :prompt - The prompt textarea input was changed, so we just update the state of that form field
  • :upload - Accepts a file upload, verifies that its extension / file type is supported, pushes it onto the history state, and renders the media into the original frame, which is just the frame showing the current media.
  • :submit - Gets the prompt from the state. Passes the prompt to AutoFfmpeg to get the commands from the LLM to complete the task. Checks for the return value (which is the result of all retries upon failure), and on success pushes the new output to the history state (if there is a new output since some tasks only output to stdout / stderr) and renders that output. On failure will render an error to the logs frame.
  • :reset - Sets the history state to only the head of the history, which will be the original video, and renders it. This effectively undoes any edits that were applied.
  • :undo - Pops the most recent entry of the history, which just reverts to the previous version of the media, and renders it.
import Upload

[
  upload: upload,
  submit: submit_button,
  reset: reset_button,
  undo: undo_button,
  prompt: prompt,
  debug: debug_checkbox,
  retries: retries,
  explain: explain_checkbox
]
|> Kino.Control.tagged_stream()
|> Kino.listen(fn
  {:explain, %{type: :change, value: value}} ->
    FormState.update(:explain_outputs, value)

  {:retries, %{type: :change, value: value}} ->
    FormState.update(:retries, value)

  {:debug, %{type: :change, value: value}} ->
    FormState.update(:debug, value)

  {:prompt, %{type: :change, value: prompt}} ->
    FormState.update(:prompt, prompt)

  {:upload,
   %{
     type: :change,
     value: %{
       file_ref: file_ref,
       client_name: filename
     }
   }} ->
    Kino.Frame.clear(logs)
    Kino.Frame.clear(output)
    ext_type = Upload.ext_type(filename)

    unless is_valid_upload(ext_type) do
      Kino.Frame.render(
        logs,
        Boilerplate.make_log(
          "File must be of one of the following types: #{inspect(Upload.accepted_types())}",
          :error
        )
      )
    else
      file_path =
        file_ref
        |> Kino.Input.file_path()

      tmp_path = Upload.generate_temp_filename(ext_type)
      _bytes_copied = File.copy!(file_path, tmp_path)
      upload = Upload.new(filename, tmp_path)
      Upload.to_kino(upload) |> then(&Kino.Frame.render(original, &1))
      EditHistory.push(upload)
      Kino.Frame.render(debug_frame, debug_checkbox)
      Kino.Frame.render(explain_frame, explain_checkbox)
      Kino.Frame.render(submit_frame, submit_button)
      Kino.Frame.clear(undo_frame)
      Kino.Frame.clear(reset_frame)
    end

  {:submit, %{type: :click}} ->
    Kino.Frame.clear(logs)
    Kino.Frame.clear(output)

    prompt = FormState.get(:prompt) |> String.trim()

    if prompt == "" do
      Kino.Frame.append(logs, Boilerplate.make_log("Prompt cannot be empty!", :error))
    else
      Kino.Frame.render(
        original,
        Kino.HTML.new(
          EEx.eval_string(Boilerplate.placeholder(), message: "Working...", show_spinner: true)
        )
      )

      {%Upload{} =
         current_upload, _old_prompt} = EditHistory.current()

      num_retries = FormState.get(:retries)

      case AutoFfmpeg.execute(
             prompt,
             %{
               upload_path: current_upload.path,
               debug: FormState.get(:debug),
               debug_frame: logs,
               output_frame: output,
               explain: FormState.get(:explain_outputs)
             },
             num_retries
           ) do
        {:ok, %AutoFfmpeg{output_path: output_path}} ->
          FormState.get(:debug) &&
            Kino.Frame.append(logs, Boilerplate.make_log("Success!", :success))

          unless is_nil(output_path) do
            new_upload = Upload.new(current_upload.filename, output_path)
            EditHistory.push(new_upload, prompt)
            Upload.to_kino(new_upload) |> then(&Kino.Frame.render(original, &1))
          else
            Upload.to_kino(current_upload) |> then(&Kino.Frame.render(original, &1))
          end

          Kino.Frame.render(undo_frame, undo_button)
          Kino.Frame.render(reset_frame, reset_button)

        {:error,
         %Ecto.Changeset{
           changes: %{
             arguments: _arguments,
             output_ext: _output_ext
           },
           errors: [
             arguments: {error, [status: _status]}
           ],
           valid?: false
         }} ->
          Upload.to_kino(current_upload) |> then(&Kino.Frame.render(original, &1))

          Kino.Frame.append(
            logs,
            Boilerplate.make_log("Failed after #{num_retries} attempts!", :error)
          )

          Kino.Frame.append(logs, Boilerplate.make_log(error, :error))

        {:error, <<"LLM Adapter Error: ", error::binary>>} ->
          Upload.to_kino(current_upload) |> then(&Kino.Frame.render(original, &1))
          {error, _binding} = error |> Code.eval_string()

          Kino.Frame.append(
            logs,
            Boilerplate.make_log("Error! Reference the error below for details", :error)
          )

          Kino.Frame.append(logs, Kino.Tree.new(error))

        {:error, <<"Invalid JSON returned from LLM: ", error::binary>>} ->
          Upload.to_kino(current_upload) |> then(&Kino.Frame.render(original, &1))
          Kino.Frame.append(logs, Boilerplate.make_log(error, :error))
      end
    end

  {:reset, %{type: :click}} ->
    {%Upload{} = original_upload, nil} = EditHistory.reset()
    Upload.to_kino(original_upload) |> then(&Kino.Frame.render(original, &1))
    Kino.Frame.clear(logs)
    Kino.Frame.clear(output)
    Kino.Frame.clear(reset_frame)
    Kino.Frame.clear(undo_frame)

  {:undo, %{type: :click}} ->
    Kino.Frame.clear(logs)
    Kino.Frame.clear(output)

    case EditHistory.undo_edit() do
      nil ->
        Kino.Frame.append(logs, Kino.Text.new("Error! Cannot `Undo`. No previous edit."))

      {%Upload{} = previous_upload, _previous_prompt} ->
        Upload.to_kino(previous_upload) |> then(&Kino.Frame.render(original, &1))
        Kino.Frame.clear(logs)

        if EditHistory.previous_edit() == nil do
          Kino.Frame.clear(reset_frame)
          Kino.Frame.clear(undo_frame)
        end
    end
end)

And that's all there is to it! Now you can either interact with the application after manually running all cells in the Livebook or you can deploy it as an application (from the left sidebar) and run it that way. If you choose to deploy it, make sure to click the checkbox in the deployment configuration for Only render rich outputs to make it more like a standalone application.

Conclusion

Now you have a working AI-powered ffmpeg tool to quickly iterate over ffmpeg commands using natural language. I've found myself using this quite a lot now, and with the debug mode turned on you can see the generated ffmpeg commands and learn a bit while you're at it.

I want to draw special attention to how much tools like Livebook and Kino let you deploy usable applications extremely quickly. They take care of many of the concerns you might not want to focus on when deploying an initial version of an application, or when just iterating on an idea.

Now if you wanted to turn this into a full-fledged application you certainly could, but at least for a non-web developer like myself, these tools let me create good, user-friendly tools at a fast pace.

You could realistically get about 85% of this solution using only the Instructor and Kino libraries, with erlexec adding a bit by separating stdout from stderr.

This still has several shortcomings and thus will not be the best fit for all tasks. For those simple tasks mentioned at the top of this post, though, I think it can be a great tool in your kit.

Of course you can also tune the prompts, the chosen models (right now it's using OpenAI's gpt-4o, though some benchmarks show gpt-4-preview outperforming gpt-4o on coding tasks, although this might not classify as a coding task), or how you compose the arguments.

Also, as it stands right now, this can only handle one input and one output at a time, but could be altered to handle multiple of each. As of the time of this writing, Kino.Input.File only allows one upload at a time, which is the main reason only one input is supported. The main reason only one output is supported is because I wanted this to automatically render the output, and didn't want to worry about how to display multiple.

If you want to see these features, feel free to request them as Issues on the repo or make PRs!

That's all I have for now. If you enjoyed this article, please consider subscribing or follow me to see more!
