Inspect cluster workload

The Web UI can be a powerful companion when monitoring and debugging jobs running in Nomad. The Web UI will list all jobs, link jobs to allocations, allocations to client nodes, client nodes to driver health, and much more.

List jobs

The first page you will arrive at in the Web UI is the Jobs List page. Here you will find every job for a namespace in a region. The table of jobs is searchable, sortable, and filterable. Each job row in the table shows basic information, such as job name, status, type, and priority, as well as richer information such as a visual representation of all allocation statuses.

This view will also live-update as jobs get submitted, get purged, and change status.

Filter job view

If your Nomad cluster has many jobs, it can be useful to filter the list of all jobs down to only those matching certain facets. The Web UI has four facets you can filter by:

Type: The type of job, including Batch, Parameterized, Periodic, Service, and System.
Status: The status of the job, including Pending, Running, and Dead.
Datacenter: The datacenter the job is running in. The options are generated dynamically from the jobs in the namespace.
Prefix: A naming prefix shared by several jobs. The options are generated dynamically from each job name up to the first occurrence of -, ., or _. Only prefixes that match multiple jobs are included.
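
To make the Prefix facet rule concrete, here is a minimal sketch of how such a list could be derived from job names. It is not the Web UI's actual implementation; the function name `prefix_facet` and the sample job names are hypothetical.

```python
import re
from collections import Counter

# Delimiters named in the text: "-", ".", and "_".
_DELIMITERS = re.compile(r"[-._]")


def prefix_facet(job_names):
    """Return prefixes shared by at least two job names, sorted alphabetically."""
    counts = Counter()
    for name in job_names:
        match = _DELIMITERS.search(name)
        # Names without a delimiter, or starting with one, have no usable prefix.
        if match and match.start() > 0:
            counts[name[: match.start()]] += 1
    return sorted(prefix for prefix, n in counts.items() if n > 1)


# Hypothetical job names, for illustration only.
names = ["web-frontend", "web-api", "cache.redis", "batch_nightly", "batch_hourly", "solo"]
print(prefix_facet(names))  # ['batch', 'web']
```

In this sample, `cache` is dropped because only one job uses it, and `solo` has no delimiter at all.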

Inspect an allocation

In Nomad, allocations are the schedulable units of work. This is where runtime metrics begin to surface. An allocation is composed of one or more tasks, and the utilization metrics for tasks are aggregated so they can be observed at the allocation level.

Monitor resource utilization

Nomad has APIs for reading point-in-time resource utilization metrics for tasks and allocations. The Web UI uses these metrics to create time-series graphs for the current session.

When you view an allocation, the Web UI automatically starts recording its resource utilization.
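
The idea of turning point-in-time samples into a session-scoped time series can be sketched as follows. This is an illustration, not the Web UI's code. The class `UtilizationSeries` is hypothetical, and the field names (`ResourceUsage`, `CpuStats.Percent`, `MemoryStats.RSS`, `Timestamp`) are assumed to match the payload of the allocation stats endpoint (`/v1/client/allocation/:alloc_id/stats`); verify them against your Nomad version.

```python
from collections import deque


class UtilizationSeries:
    """Keeps the most recent point-in-time samples for one allocation."""

    def __init__(self, max_points=300):
        # A bounded deque discards the oldest sample once the limit is reached.
        self.points = deque(maxlen=max_points)

    def record(self, stats):
        # Field names are assumptions about the stats payload shape.
        usage = stats["ResourceUsage"]
        self.points.append(
            (stats["Timestamp"], usage["CpuStats"]["Percent"], usage["MemoryStats"]["RSS"])
        )

    def peak_memory(self):
        """Largest RSS value (bytes) among the retained samples."""
        return max((rss for _, _, rss in self.points), default=0)


# Hypothetical samples standing in for successive API responses.
series = UtilizationSeries(max_points=2)
for timestamp, rss in enumerate([100, 300, 200]):
    series.record({
        "Timestamp": timestamp,
        "ResourceUsage": {"CpuStats": {"Percent": 1.5}, "MemoryStats": {"RSS": rss}},
    })
print(len(series.points), series.peak_memory())  # 2 300
```

Because the series lives only in memory, it starts empty each session, which mirrors the behavior described above.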

Review task events

When Nomad places, prepares, and starts a task, it emits a series of task events that help you debug a task that fails to start.

Task events are listed on the Task Detail page and live-update as Nomad manages the task.
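
If you read an allocation through the API instead, the same events can be flattened into one timeline. The sketch below assumes the allocation payload carries a `TaskStates` map whose entries hold an `Events` list with `Type`, `Time` (nanoseconds since the epoch), and `DisplayMessage` fields. The helper name `task_events` and the sample data are hypothetical.

```python
from datetime import datetime, timezone


def task_events(allocation):
    """Flatten per-task events into (time, task, type, message) rows, oldest first."""
    rows = []
    for task, state in (allocation.get("TaskStates") or {}).items():
        for event in state.get("Events") or []:
            # Assumed: Time is expressed in nanoseconds since the Unix epoch.
            when = datetime.fromtimestamp(event["Time"] / 1e9, tz=timezone.utc)
            rows.append((when, task, event["Type"], event.get("DisplayMessage", "")))
    return sorted(rows)


# Hypothetical allocation fragment, for illustration only.
alloc = {
    "TaskStates": {
        "web": {
            "Events": [
                {"Type": "Task Setup", "Time": 1_700_000_001_000_000_000,
                 "DisplayMessage": "Building Task Directory"},
                {"Type": "Received", "Time": 1_700_000_000_000_000_000,
                 "DisplayMessage": "Task received by client"},
                {"Type": "Driver Failure", "Time": 1_700_000_002_000_000_000,
                 "DisplayMessage": "failed to start task"},
            ]
        }
    }
}
for when, task, kind, message in task_events(alloc):
    print(when.isoformat(), task, kind, message)
```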

View rescheduled allocations

Allocations will be placed on any client node that satisfies the constraints of the job definition. Some events, however, such as node failures, will cause Nomad to reschedule allocations.

Allocations can be configured in the job definition to reschedule to a different client node if the allocation ends in a failed status. This will happen after the task has exhausted its local restart attempts.

The end result of this automatic procedure is a failed allocation and that failed allocation's rescheduled successor. Since Nomad handles all of this automatically, the Web UI explains the state of allocations through icons and by linking the previous and next allocations in a reschedule chain.
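
A reschedule chain can be reconstructed from allocation data by following the links in both directions. This sketch assumes each allocation exposes `PreviousAllocation` and `NextAllocation` IDs, as the allocation API payload does in recent Nomad versions. The function name and the sample IDs are hypothetical.

```python
def reschedule_chain(allocs_by_id, alloc_id):
    """Return allocation IDs from the oldest known ancestor to the newest successor."""
    # Walk backwards to the first allocation still known to us.
    seen = {alloc_id}
    current = alloc_id
    while True:
        previous = allocs_by_id[current].get("PreviousAllocation")
        # Stop on a missing link (for example a garbage-collected allocation) or a cycle.
        if not previous or previous not in allocs_by_id or previous in seen:
            break
        seen.add(previous)
        current = previous

    # Walk forwards from that ancestor to build the ordered chain.
    chain = [current]
    while True:
        following = allocs_by_id[chain[-1]].get("NextAllocation")
        if not following or following not in allocs_by_id or following in chain:
            break
        chain.append(following)
    return chain


# Hypothetical allocations: a1 failed, a2 replaced it and failed, a3 is running.
allocs = {
    "a1": {"ClientStatus": "failed", "NextAllocation": "a2"},
    "a2": {"ClientStatus": "failed", "PreviousAllocation": "a1", "NextAllocation": "a3"},
    "a3": {"ClientStatus": "running", "PreviousAllocation": "a2"},
}
print(reschedule_chain(allocs, "a2"))  # ['a1', 'a2', 'a3']
```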

Unhealthy driver

Given the nature of long-lived processes, it's possible for the state of the client node an allocation is scheduled on to change during the lifespan of the allocation. Nomad attempts to monitor pertinent conditions including driver health.

The Web UI denotes when a driver an allocation depends on is unhealthy on the client node the allocation is running on.
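
The same check can be approximated from API data. The sketch below assumes a node payload with a `Drivers` map whose entries carry `Detected`, `Healthy`, and `HealthDescription` fields, and it takes the list of drivers an allocation's tasks use as input. The helper name and sample values are hypothetical.

```python
def unhealthy_drivers(node, task_drivers):
    """Map each driver the allocation needs to a problem description, if any."""
    drivers = node.get("Drivers") or {}
    problems = {}
    for name in sorted(set(task_drivers)):
        info = drivers.get(name)
        if info is None or not info.get("Detected", False):
            problems[name] = "driver not detected on this node"
        elif not info.get("Healthy", False):
            problems[name] = info.get("HealthDescription") or "driver reported unhealthy"
    return problems


# Hypothetical node fragment, for illustration only.
node = {
    "Drivers": {
        "docker": {"Detected": True, "Healthy": False,
                   "HealthDescription": "Failed to connect to docker daemon"},
        "exec": {"Detected": True, "Healthy": True, "HealthDescription": "Healthy"},
    }
}
print(unhealthy_drivers(node, ["docker", "exec"]))
```

An empty result means every driver the allocation depends on is detected and healthy on that node.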

Preempted allocations

Much like how Nomad will automatically reschedule allocations, Nomad will automatically preempt allocations when necessary. When monitoring allocations in Nomad, it's useful to know what allocations were preempted and what job caused the preemption.

The Web UI makes sure to tell this full story by showing which allocation caused an allocation to be preempted as well as the opposite: what allocations an allocation preempted. This makes it possible to traverse down from a job to a preempted allocation, to the allocation that caused the preemption, to the job that the preempting allocation is for.
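
The traversal described above can be sketched against allocation data. This assumes allocations expose `PreemptedByAllocation` (a single ID) and `PreemptedAllocations` (a list of IDs) alongside `JobID`. The function name and sample data are hypothetical.

```python
def preemption_story(allocs_by_id, alloc_id):
    """Summarize who preempted this allocation and what it preempted in turn."""
    alloc = allocs_by_id[alloc_id]
    preemptor_id = alloc.get("PreemptedByAllocation") or None
    preemptor = allocs_by_id.get(preemptor_id) if preemptor_id else None
    return {
        "preempted_by": preemptor_id,
        # The job that the preempting allocation belongs to, when known.
        "preempting_job": preemptor["JobID"] if preemptor else None,
        "preempted": list(alloc.get("PreemptedAllocations") or []),
    }


# Hypothetical allocations: "hi" (high priority) evicted "lo" (low priority).
allocs = {
    "lo": {"JobID": "batch-report", "PreemptedByAllocation": "hi"},
    "hi": {"JobID": "payments-api", "PreemptedAllocations": ["lo"]},
}
print(preemption_story(allocs, "lo"))
print(preemption_story(allocs, "hi"))
```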

Review task logs

A task will typically emit log information to stdout and stderr. Nomad captures these logs and exposes them through an API. The Web UI uses these APIs to offer head, tail, and streaming logs from the browser.

The Web UI will first attempt to directly connect to the client node the task is running on. Typically, client nodes are not accessible from the public internet. If this is the case, the Web UI will fall back and proxy to the client node from the server node with no loss of functionality.

Not all browsers support streaming HTTP requests. In the event that streaming is not supported, logs will still be followed using interval polling.
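
Head, tail, and streaming map naturally onto query parameters of the logs endpoint. The sketch below only builds the request URL. It assumes the `/v1/client/fs/logs/:alloc_id` endpoint with `task`, `type`, `origin`, `offset`, `follow`, and `plain` parameters; the helper name, the default tail size, and the address are hypothetical.

```python
from urllib.parse import quote, urlencode


def logs_url(base, alloc_id, task, mode, log_type="stdout", tail_bytes=50_000):
    """Build a logs URL for one of three modes: "head", "tail", or "stream"."""
    params = {"task": task, "type": log_type, "plain": "true"}
    if mode == "head":
        # Read from the beginning of the log.
        params.update(origin="start", offset=0)
    elif mode == "tail":
        # Read the last tail_bytes bytes of the log.
        params.update(origin="end", offset=tail_bytes)
    elif mode == "stream":
        # Start near the end and keep the connection open for new output.
        params.update(origin="end", offset=tail_bytes, follow="true")
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    path = f"/v1/client/fs/logs/{quote(alloc_id, safe='')}"
    return f"{base.rstrip('/')}{path}?{urlencode(params)}"


# Hypothetical address and IDs, for illustration only.
print(logs_url("http://localhost:4646", "abc123", "web", "head"))
```

The same URL can be pointed at a client node or at a server, which is what makes the fallback described above possible without changing the request.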

Restart or stop an allocation or task

Nomad allows for restarting and stopping individual allocations and tasks. When a task is restarted, Nomad will perform a local restart of the task. When an allocation is stopped, Nomad will mark the allocation as complete and perform a reschedule onto a different client node.

Both of these features are also available in the Web UI.
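
For scripting outside the browser, the two operations correspond to two HTTP requests. This sketch builds the requests without sending them. The endpoint paths and the `TaskName` body field are assumptions based on the Nomad HTTP API; the helper names and the address are hypothetical.

```python
import json
from urllib.request import Request


def restart_task_request(base, alloc_id, task_name):
    """Request that restarts a single task in place on its current node."""
    body = json.dumps({"TaskName": task_name}).encode()
    return Request(
        f"{base.rstrip('/')}/v1/client/allocation/{alloc_id}/restart",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def stop_allocation_request(base, alloc_id):
    """Request that stops an allocation so that Nomad reschedules it."""
    return Request(f"{base.rstrip('/')}/v1/allocation/{alloc_id}/stop", method="POST")


# Hypothetical address and IDs; pass the result to urllib.request.urlopen to send it.
restart = restart_task_request("http://localhost:4646", "abc123", "web")
stop = stop_allocation_request("http://localhost:4646", "abc123")
print(restart.get_method(), restart.full_url)
print(stop.get_method(), stop.full_url)
```

On a cluster with ACLs enabled, the requests would also need a token header.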

Force a periodic instance

Periodic jobs are configured like cron jobs. Sometimes, you might want to start the job before its next scheduled run time. Nomad calls this a periodic force, and it can be done from the Web UI on the Job Overview page for a periodic job.

Submit a new version of a job

From the Job Definition page, a job can be edited. After clicking the Edit button in the top-right corner of the code window, the job definition JSON becomes editable. The edits can then be planned and scheduled.

Since each job within a namespace must have a unique name, submitting a job definition from the Run Job screen under the name of an existing job creates a new version of that job rather than a second job. Always review the plan output.
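
The plan-then-run flow can also be expressed as two API payloads. The sketch below assumes the plan endpoint (`POST /v1/job/:job_id/plan`) accepts `Job` and `Diff`, that its response includes `JobModifyIndex`, and that the register call accepts `EnforceIndex` and `JobModifyIndex` so the submission fails if the job changed after you reviewed the plan. The helper names and the job fragment are hypothetical.

```python
import json


def plan_payload(job):
    """Body for a dry run that asks Nomad to include a diff in the response."""
    return json.dumps({"Job": job, "Diff": True})


def register_payload(job, plan_response):
    """Body for the real submission, pinned to the index the plan was computed at."""
    return json.dumps({
        "Job": job,
        "EnforceIndex": True,
        "JobModifyIndex": plan_response["JobModifyIndex"],
    })


# Hypothetical job fragment and plan response, for illustration only.
job = {"ID": "web", "Name": "web", "Type": "service"}
plan_response = {"JobModifyIndex": 42, "Diff": {"Type": "Edited"}}
print(plan_payload(job))
print(register_payload(job, plan_response))
```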

Monitor a deployment

When a system or service job includes the update stanza, a deployment is created upon job submission. Job deployments can be monitored in real time from the Web UI.

The Web UI shows new allocations as they are placed, tallying them towards the expected total, and counts allocations as they become healthy or unhealthy.

Optionally, a job may use canary deployments to allow for additional health checks or manual testing before a full rollout. If a job uses canaries and is not configured to automatically promote the canary, the canary promotion operation can be done from the Job Overview page in the Web UI.
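
The tallies and the promotion decision can be derived from a deployment payload. This sketch assumes a `TaskGroups` map whose entries carry `DesiredTotal`, `PlacedAllocs`, `HealthyAllocs`, `UnhealthyAllocs`, `DesiredCanaries`, `Promoted`, and `AutoPromote`. The function name and sample values are hypothetical.

```python
def deployment_progress(deployment):
    """Summarize placement, health, and pending manual promotion per task group."""
    summary = {}
    for group, state in (deployment.get("TaskGroups") or {}).items():
        canaries = state.get("DesiredCanaries", 0)
        summary[group] = {
            "placed": f'{state.get("PlacedAllocs", 0)}/{state.get("DesiredTotal", 0)}',
            "healthy": state.get("HealthyAllocs", 0),
            "unhealthy": state.get("UnhealthyAllocs", 0),
            # Canaries wait for an operator unless auto-promotion is configured.
            "needs_manual_promotion": (
                canaries > 0
                and not state.get("Promoted", False)
                and not state.get("AutoPromote", False)
            ),
        }
    return summary


# Hypothetical deployment fragment, for illustration only.
deployment = {
    "TaskGroups": {
        "web": {"DesiredTotal": 5, "PlacedAllocs": 3, "HealthyAllocs": 2,
                "UnhealthyAllocs": 0, "DesiredCanaries": 1,
                "Promoted": False, "AutoPromote": False},
    }
}
print(deployment_progress(deployment))
```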

Stop a job

Jobs can be stopped from the Job Overview page. Stopping a job gracefully stops all of its allocations, marks them as complete, and frees up resources in the cluster.

Continue your exploration

Now that you have explored operations that can be performed with jobs through the Nomad UI, learn how to inspect the state of your cluster using the Nomad UI.
