Jobs
Jobs are computational tasks that run on machines in your Compute Share organization.
Overview
A job represents a containerized workload that gets distributed to available machines in your organization for execution.
Job Lifecycle
Jobs progress through these stages:
- Created - Job is submitted but not yet queued
- Queued - Waiting for an available machine
- Assigned - Assigned to a machine and starting
- Running - Currently executing
- Completed - Finished, either successfully or with an error
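The stages above can be read as a simple state machine. The following is a minimal sketch, assuming each stage leads only to the next one in the list; the names `JobState`, `advance`, and `is_terminal` are hypothetical and not part of Compute Share, and cancellation is not modeled.

```python
from enum import Enum


class JobState(Enum):
    CREATED = "Created"
    QUEUED = "Queued"
    ASSIGNED = "Assigned"
    RUNNING = "Running"
    COMPLETED = "Completed"


# Assumption: each stage leads only to the next one listed in the lifecycle.
# Cancellation and retries are not modeled here.
NEXT_STATE = {
    JobState.CREATED: JobState.QUEUED,
    JobState.QUEUED: JobState.ASSIGNED,
    JobState.ASSIGNED: JobState.RUNNING,
    JobState.RUNNING: JobState.COMPLETED,
}


def advance(state: JobState) -> JobState:
    """Return the stage that follows `state`, or raise if it is terminal."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.value} is a terminal stage")
    return NEXT_STATE[state]


def is_terminal(state: JobState) -> bool:
    """A job in a terminal stage will not change stage again."""
    return state not in NEXT_STATE
```

A client polling for job status could stop polling once `is_terminal` returns true.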
Creating Jobs
Job Configuration
Jobs are defined in YAML files. Here's a basic example:
name: data-processing-job
image: python:3.11
command: python process.py --input data.csv
environment:
  DATA_SOURCE: s3://mybucket/data
resources:
  cpu: 2
  memory: 4096 # MB
timeout: 3600 # seconds
Required Fields
- name - Unique identifier for the job
- image - Docker image to use
- command - Command to execute inside the container
Optional Fields
- environment - Environment variables
- resources - Resource requirements and limits
- timeout - Maximum execution time
- datasets - Input datasets to mount
- outputs - Output artifacts to collect
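Before submitting, it can help to check a job definition for the required fields locally. The sketch below works on a config that has already been loaded into a dictionary; `validate_job_config` is a hypothetical helper, not a Compute Share API, and the service itself may apply further checks.

```python
REQUIRED_FIELDS = ("name", "image", "command")


def validate_job_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means no required field is missing."""
    problems = []
    for field in REQUIRED_FIELDS:
        # Treat an absent key and an empty value the same way.
        if not config.get(field):
            problems.append(f"missing required field: {field}")
    return problems
```

For example, a config with only `name` and `image` would be reported as missing `command`.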
Submitting Jobs
Via CLI
Submit a job using the CLI:
# Submit from a config file
compute-share submit job.yaml
# Submit with inline config
compute-share submit --name "quick-test" --image "ubuntu:latest" --command "echo Hello"
Project-based Jobs
Organize related jobs into projects:
project: ml-training
jobs:
  - name: preprocess-data
    image: python:3.11
    command: python preprocess.py
  - name: train-model
    image: tensorflow/tensorflow:latest
    command: python train.py
    depends_on:
      - preprocess-data
Managing Jobs
Viewing Jobs
Monitor jobs from the Jobs dashboard:
- Active Jobs - Currently running or queued
- Completed Jobs - Finished executions with status
- Job History - Full history with logs and metrics
Job Details
Click on any job to view:
- Current status and progress
- Assigned machine
- Resource usage
- Execution logs
- Output artifacts
Canceling Jobs
Cancel a running or queued job:
compute-share cancel <job-id>
Or from the dashboard:
- Navigate to the job detail page
- Click Cancel Job
Resource Requirements
Specifying Resources
Define CPU and memory needs:
resources:
  cpu: 4 # Number of CPU cores
  memory: 8192 # Memory in MB
  gpu: 1 # Number of GPUs (optional)
  disk: 10240 # Disk space in MB
Resource Matching
Jobs are only assigned to machines that meet the requirements:
- Available CPU cores ≥ requested CPU
- Available memory ≥ requested memory
- GPU count matches (if requested)
- Sufficient disk space
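The matching rules above amount to a per-resource comparison. Here is a minimal sketch of that check; the `Resources` type and `machine_can_run` function are hypothetical, and it assumes that "GPU count matches" means the machine has at least as many GPUs as requested.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cpu: int        # CPU cores
    memory: int     # MB
    disk: int = 0   # MB
    gpu: int = 0    # number of GPUs


def machine_can_run(available: Resources, requested: Resources) -> bool:
    """Return True if every requested resource fits in what is available."""
    return (
        available.cpu >= requested.cpu
        and available.memory >= requested.memory
        and available.disk >= requested.disk
        # Assumption: a machine with more GPUs than requested still qualifies.
        and available.gpu >= requested.gpu
    )
```

A job that requests more than any machine in the organization can offer would never pass this check, which is one reason a job can stay queued.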
Working with Data
Input Datasets
Mount datasets from your organization:
datasets:
  - name: training-data
    mount: /data/input
    mode: ro # read-only
Output Artifacts
Collect results after job completion:
outputs:
  - name: trained-model
    path: /output/model.pkl
  - name: metrics
    path: /output/metrics.json
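For an artifact to be collected, the job has to write it to the configured path before it exits. A minimal sketch of the job side, assuming the `/output` directory from the example above is writable inside the container; `write_metrics` is a hypothetical helper.

```python
import json
from pathlib import Path


def write_metrics(metrics: dict, output_dir: str = "/output") -> Path:
    """Write metrics as JSON to <output_dir>/metrics.json and return the path."""
    out = Path(output_dir)
    # Create the directory if the image does not already provide it.
    out.mkdir(parents=True, exist_ok=True)
    target = out / "metrics.json"
    target.write_text(json.dumps(metrics, indent=2))
    return target
```

Taking the directory as a parameter lets the same script be exercised locally against a temporary directory.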
Environment Variables
Pass configuration via environment:
environment:
  MODEL_TYPE: "transformer"
  BATCH_SIZE: "32"
  LEARNING_RATE: "0.001"
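Environment variables always reach the process as strings, so numeric settings need converting inside the job. A minimal sketch of reading the variables from the example; `load_settings` is a hypothetical helper, and the fallback defaults are an assumption that mirrors the example values.

```python
import os


def load_settings(env=os.environ) -> dict:
    """Read job settings from the environment, converting strings to numbers."""
    return {
        "model_type": env.get("MODEL_TYPE", "transformer"),
        # int() and float() raise ValueError on malformed values,
        # which makes the job fail early instead of training with bad settings.
        "batch_size": int(env.get("BATCH_SIZE", "32")),
        "learning_rate": float(env.get("LEARNING_RATE", "0.001")),
    }
```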
Monitoring Jobs
Real-time Logs
View logs as the job executes:
compute-share logs <job-id> --follow
Metrics
Track job performance:
- Runtime - Execution duration
- CPU Usage - Actual CPU consumption
- Memory Usage - Peak memory usage
- Exit Code - Process exit status
Alerts
Set up notifications for:
- Job completion
- Job failures
- Long-running jobs
- Resource threshold violations
Job Patterns
Batch Processing
Run multiple similar jobs:
# Submit one job per CSV file
for file in data/*.csv; do
  compute-share submit --name "process-$(basename "$file")" \
    --image "python:3.11" \
    --command "python process.py $file"
done
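The same batch pattern can be driven from Python, which makes quoting easier to control. This sketch only builds the argument list, using the `--name`, `--image`, and `--command` flags shown in this guide; `build_submit_command` is a hypothetical helper. It uses the file stem for the job name, and it assumes the command string is interpreted by a shell in the container, so the path is quoted.

```python
import shlex
from pathlib import Path


def build_submit_command(csv_path: str) -> list[str]:
    """Build the argv list for submitting one file as its own job."""
    name = f"process-{Path(csv_path).stem}"
    return [
        "compute-share", "submit",
        "--name", name,
        "--image", "python:3.11",
        # shlex.quote keeps paths with spaces intact inside the command string.
        "--command", f"python process.py {shlex.quote(csv_path)}",
    ]
```

Each list can then be passed to `subprocess.run(cmd, check=True)` so that a failed submission stops the loop.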
Parallel Workflows
Execute independent jobs concurrently:
jobs:
  - name: task-1
    image: worker:latest
    command: process --chunk 1
  - name: task-2
    image: worker:latest
    command: process --chunk 2
  - name: task-3
    image: worker:latest
    command: process --chunk 3
Sequential Pipelines
Chain dependent jobs:
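jobs:
  - name: fetch-data
    image: ubuntu:latest # placeholder; use an image that contains fetch.sh
    command: fetch.sh
  - name: transform-data
    image: ubuntu:latest
    command: transform.sh
    depends_on: [fetch-data]
  - name: analyze-data
    image: ubuntu:latest
    command: analyze.sh
    depends_on: [transform-data]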
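To see which jobs in a `depends_on` graph could run at the same time, the dependencies can be grouped into waves with a topological sort. This is an illustrative sketch using Python's standard `graphlib` module, not a description of how Compute Share schedules jobs; `execution_waves` is a hypothetical helper that takes a mapping from each job name to the jobs it depends on.

```python
from graphlib import TopologicalSorter


def execution_waves(depends_on: dict[str, list[str]]) -> list[list[str]]:
    """Group jobs into waves; jobs within a wave do not depend on each other."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        # Every job returned here has all of its dependencies finished.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For the pipeline above this yields three waves of one job each, while three jobs with no dependencies fall into a single wave.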
Troubleshooting
Job Stays Queued
Possible reasons:
- No machines meet resource requirements
- All suitable machines are busy
- Machine health checks are failing
Check machine availability in the dashboard to see which of these applies.
Job Fails Immediately
Common causes:
- Invalid Docker image
- Command not found in container
- Missing environment variables
- Insufficient resources on assigned machine
Job Times Out
Solutions:
- Increase timeout value
- Optimize job efficiency
- Request more resources
- Split into smaller jobs
Best Practices
Resource Requests
- Request only what you need; don't over-provision
- Test resource usage with small runs
- Monitor and adjust based on metrics
Error Handling
- Include proper error handling in scripts
- Set appropriate timeouts
- Log errors clearly
- Use exit codes meaningfully
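These practices can be combined in a small wrapper around the job's entry point. The following is a minimal sketch; the exit code values and the `run` helper are assumptions chosen for illustration, not codes that Compute Share defines.

```python
import logging

logger = logging.getLogger("job")

# Hypothetical exit codes; pick values that are meaningful for your jobs.
EXIT_OK = 0
EXIT_FAILURE = 1
EXIT_BAD_INPUT = 2


def run(task) -> int:
    """Run `task` and translate its outcome into a process exit code."""
    try:
        task()
    except FileNotFoundError as exc:
        # A missing input is reported separately from other failures.
        logger.error("input not found: %s", exc)
        return EXIT_BAD_INPUT
    except Exception:
        # logger.exception records the full traceback in the job logs.
        logger.exception("job failed")
        return EXIT_FAILURE
    return EXIT_OK
```

A script would typically end with `sys.exit(run(main))` so the exit code is recorded with the job.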
Efficiency
- Minimize container image size
- Cache dependencies when possible
- Use appropriate base images
- Clean up temporary files
Next Steps
- Learn about machine management
- Manage your organization
- Review quickstart for examples