Workflow Resilience | Advanced Workflows

1. Introduction

Workflow resilience in GitHub Actions refers to the ability of a CI/CD pipeline to withstand and recover from failures. This involves designing workflows that can handle errors, retries, and fallbacks gracefully.

2. Key Concepts

**Error Handling:** Mechanisms to catch and manage errors in workflows.
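**Retries:** Re-running a failed command or step a bounded number of times. GitHub Actions has no built-in retry keyword, so retries are implemented in the step's script or through a community action.
**Timeouts:** Setting time limits for jobs or individual steps so that a hung process is cancelled instead of occupying a runner.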
**Job Dependencies:** Ensuring jobs depend on the successful completion of previous jobs.

3. Implementing Resilience

3.1 Error Handling

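A step guarded by `if: failure()` runs only when an earlier step in the job has failed, which makes it a natural place for cleanup or fallback logic. Do not combine it with `continue-on-error: true` on the failing step: that setting reports the step's conclusion as success, so `failure()` never evaluates to true.


jobs:
  example_job:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      # No continue-on-error here, so a non-zero exit marks the step as failed.
      - name: Run script
        run: ./script.sh

      # Runs only if a previous step failed; the job is still reported as failed.
      - name: Handle failure
        if: failure()
        run: echo "Script failed, taking alternative action"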
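
When the job should keep going and finish successfully despite the failure, `continue-on-error: true` can be used, but the follow-up step then has to inspect the failing step's `outcome` instead of calling `failure()`. The following is a minimal sketch of that variant; the step id `run_script` is a name chosen here for illustration.

```yaml
jobs:
  example_job:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Run script
        id: run_script              # illustrative id, referenced below
        run: ./script.sh
        continue-on-error: true     # the job continues even if this step fails

      # outcome holds the result before continue-on-error is applied,
      # so it is 'failure' here even though the job keeps running.
      - name: Fallback
        if: steps.run_script.outcome == 'failure'
        run: echo "Script failed, taking alternative action"
```

With this layout the job ends green even when the script fails, so it suits optional steps rather than ones whose failure should block a merge or deployment.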

3.2 Retries

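GitHub Actions has no built-in `retry` keyword for jobs or steps. A common workaround is a bounded retry loop inside the step's shell script:


jobs:
  retry_job:
    runs-on: ubuntu-latest
    steps:
      - name: Run a command (up to 3 attempts)
        run: |
          for attempt in 1 2 3; do
            # Stop at the first successful attempt.
            ./unstable_script.sh && exit 0
            echo "Attempt $attempt failed"
            sleep 5
          done
          exit 1  # all attempts failed, so fail the step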
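
Retry logic can also be delegated to a community action. The sketch below assumes the third-party `nick-fields/retry` action and its `timeout_minutes`, `max_attempts`, and `command` inputs; it is not part of GitHub Actions itself, so check the action's README for the current version and input names before relying on it.

```yaml
jobs:
  retry_job:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      # Assumption: third-party action, not a built-in feature.
      - name: Run a command with retries
        uses: nick-fields/retry@v3
        with:
          timeout_minutes: 5        # limit for each attempt
          max_attempts: 3           # give up after the third failure
          command: ./unstable_script.sh
```

Pinning a third-party action to a full commit SHA rather than a tag is a reasonable precaution when it runs in a workflow with access to secrets.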

3.3 Timeouts

Set `timeout-minutes` on a job to cancel it once it exceeds the limit. Without this setting, a job is allowed to run for up to 360 minutes by default:


jobs:
  timeout_job:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - name: Long running process
        run: ./long_process.sh
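
`timeout-minutes` is also accepted on individual steps, which helps when only one step is prone to hanging. A minimal sketch combining both levels follows; the limits and the `./publish.sh` script are illustrative placeholders.

```yaml
jobs:
  timeout_job:
    runs-on: ubuntu-latest
    timeout-minutes: 30             # upper bound for the whole job
    steps:
      - name: Long running process
        run: ./long_process.sh
        timeout-minutes: 10         # upper bound for this step only

      - name: Publish results
        run: ./publish.sh           # hypothetical follow-up script
```

A step that exceeds its limit is terminated and counts as failed, so later steps are skipped unless they carry a condition such as `if: always()`.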

3.4 Job Dependencies

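Use `needs` to define job dependencies. A job starts only after every job it lists under `needs` has succeeded; if one of them fails, the dependent job is skipped: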


jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Build the project
        run: ./build.sh

  test:
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Run tests
        run: ./test.sh
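
A dependent job can also be made to run precisely when something upstream went wrong, for example to report the failure. The sketch below adds a hypothetical `notify` job that uses a job-level `if: failure()` and reads each upstream result from the `needs` context.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Build the project
        run: ./build.sh

  test:
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Run tests
        run: ./test.sh

  notify:                           # hypothetical reporting job
    runs-on: ubuntu-latest
    needs: [build, test]
    if: failure()                   # runs only when build or test failed
    steps:
      - name: Report failure
        run: echo "build=${{ needs.build.result }} test=${{ needs.test.result }}"
```

Replacing `if: failure()` with `if: always()` would make the job run regardless of the upstream results, which suits cleanup tasks.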

4. Best Practices

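Add explicit failure handling (`if: failure()` or `if: always()`) to steps whose failure requires cleanup or a fallback.
Cap the number of retry attempts and reserve retries for transient failures such as network errors; retrying a deterministic failure only delays the result.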
Set reasonable timeouts based on expected execution time.
Define clear job dependencies to ensure logical execution order.
Regularly review and update workflows for efficiency and reliability.

5. FAQ

What is the maximum number of retries I can set?

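GitHub Actions does not provide a built-in retry setting for jobs or steps, so there is no platform-defined maximum. The limit is whatever you configure in your own retry loop or in the retry action you use.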

Can I set different timeouts for different jobs?

Yes, you can set individual timeouts for each job in your workflow.

What happens if a job fails after all retries?

The job and the workflow run are marked as failed, and jobs that depend on it through `needs` are skipped unless they set a condition such as `if: always()` or `if: failure()`.
