Issue:

  1. Users have trouble getting even a single GPU for interactive sessions, and their jobs sit in the queue for extended periods.
  2. Some jobs use a very large number of nodes (for example, averaging more than 32 nodes per job), which can leave too few resources for other users.
  3. There is a concern that resources are not being distributed fairly among users.
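
To check whether the pattern in item 2 is actually occurring, per-user job sizes can be summarized from accounting data. The following is a minimal sketch, not a definitive tool: the job records, the user names, and the function names are hypothetical, and the threshold of 32 simply echoes the example figure above. In practice the records would come from whatever accounting data the cluster's scheduler provides.

```python
from collections import defaultdict

# Hypothetical job records: (user, nodes_allocated). These values are made up
# for illustration; real data would come from the scheduler's accounting logs.
JOBS = [
    ("alice", 64),
    ("alice", 48),
    ("bob", 1),
    ("bob", 2),
    ("carol", 120),
]

NODE_THRESHOLD = 32  # assumed cutoff, taken from the example figure in item 2


def average_nodes_per_job(jobs):
    """Return a dict mapping each user to their mean nodes per job."""
    totals = defaultdict(lambda: [0, 0])  # user -> [node_sum, job_count]
    for user, nodes in jobs:
        totals[user][0] += nodes
        totals[user][1] += 1
    return {user: node_sum / count for user, (node_sum, count) in totals.items()}


def heavy_users(jobs, threshold=NODE_THRESHOLD):
    """Return users whose average job size exceeds the threshold, sorted by name."""
    averages = average_nodes_per_job(jobs)
    return sorted(user for user, avg in averages.items() if avg > threshold)


if __name__ == "__main__":
    for user, avg in sorted(average_nodes_per_job(JOBS).items()):
        print(f"{user}: {avg:.1f} nodes/job")
    print("above threshold:", heavy_users(JOBS))
```

A report like this only shows who runs large jobs on average; it says nothing about whether those jobs are justified, so it is better treated as a starting point for discussion than as an enforcement rule.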

Possible Solutions:

  1. Priority Scheduling for Interactive Jobs: One suggested solution is to give interactive jobs higher priority so that they can preempt running jobs on other partitions. This helps users who need immediate access for interactive work get resources promptly.
  2. Fair Resource Allocation: Resources should be distributed fairly among users. There is currently a concern that a single user is occupying more than 100 nodes, which may leave too few for others. Consider policies or configurations that cap the number of nodes a single user can hold at once, to prevent resource monopolization.
  3. Resource Usage-Based Priority: Consider a policy under which a user’s priority drops as they submit more jobs or consume more resources. This encourages efficient usage and keeps any one user from consistently dominating the cluster.
  4. User Education: Users should be educated about best practices, such as requesting only the resources a job needs, releasing resources when they are no longer needed, and avoiding overuse.
  5. Resource Monitoring and Reporting: Implement a system that monitors resource usage and generates reports on usage patterns, which can help in optimizing resource allocation and ensuring fair usage.
  6. Communication and Collaboration: Encourage communication among users to coordinate resource usage, especially for large-scale jobs that may require a significant number of nodes. Collaboration and coordination can help minimize resource contention.
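
The first three solutions can be combined into a single ordering rule. The sketch below is illustrative only and does not reflect any particular scheduler's implementation: the `Job` class, the constants, and the function names are all hypothetical. It assumes a per-user cap of 100 nodes (echoing the figure in solution 2), a fixed priority bonus for interactive jobs, and a penalty proportional to a user's recent node-hours. It models queue ordering only, not preemption of jobs that are already running.

```python
from dataclasses import dataclass

# All constants are assumptions chosen for illustration.
MAX_NODES_PER_USER = 100    # cap on nodes one user may hold at once
INTERACTIVE_BOOST = 1000.0  # priority bonus for interactive jobs
USAGE_PENALTY = 0.1         # priority lost per recent node-hour


@dataclass
class Job:
    user: str
    nodes: int
    interactive: bool = False


def fits_user_cap(job, nodes_in_use, cap=MAX_NODES_PER_USER):
    """True if starting the job keeps the user's node count within the cap."""
    return nodes_in_use.get(job.user, 0) + job.nodes <= cap


def priority(job, recent_node_hours, boost=INTERACTIVE_BOOST, penalty=USAGE_PENALTY):
    """Higher is better: heavy recent usage lowers it, interactive jobs get a bonus."""
    score = -penalty * recent_node_hours.get(job.user, 0.0)
    if job.interactive:
        score += boost
    return score


def schedule_order(queue, nodes_in_use, recent_node_hours):
    """Drop jobs that would exceed the per-user cap, then sort by priority."""
    eligible = [job for job in queue if fits_user_cap(job, nodes_in_use)]
    return sorted(eligible, key=lambda job: priority(job, recent_node_hours), reverse=True)


if __name__ == "__main__":
    queue = [
        Job("alice", 16),
        Job("bob", 1, interactive=True),
        Job("carol", 8),
        Job("alice", 90),  # would put alice at 110 nodes, over the cap
    ]
    nodes_in_use = {"alice": 20}
    recent_node_hours = {"alice": 5000.0, "carol": 200.0}
    for job in schedule_order(queue, nodes_in_use, recent_node_hours):
        print(job.user, job.nodes)
```

In this example the interactive single-node job is ordered first, the light user's batch job second, and the heavy user's smaller job last, while the 90-node job is held back by the cap. The weights would need tuning against real workloads; a boost that is too large could starve batch jobs, and a penalty that is too steep could punish legitimate large runs.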

It’s important for cluster administrators and users to work together to implement policies and configurations that promote fair and efficient resource allocation while addressing the specific challenges and requirements of their cluster environment.