I am looking for the right tool to launch VM instances on a cloud platform only when I need them. Once launched, they should run a containerized script (let's call it a job) that runs for about 5 to 15 minutes (depending on the input data and hardware) before returning about 100 MiB of data (which could, for instance, just be uploaded to a bucket somewhere by the script itself).
For now, I have worked something out using AWS SageMaker training and processing jobs or GCP AI Platform training jobs. My problem is that there is a relatively long startup overhead (~5 minutes) before the script actually starts running, which roughly doubles the time it takes to get my results.
Neither GCP Cloud Run nor AWS Lambda is an option for me, since I need a GPU for my jobs to complete in a reasonable time. Renting a VM full-time does not seem like a good solution either, since a single VM wouldn't let me run jobs in parallel and it would sit idle most of the time.
After spending a couple of hours in the Kubernetes docs, it is still unclear to me whether (a) this could be the tool I need and (b) it would be complete overkill for my needs. So my questions are:
- Is Kubernetes the tool I am looking for?
- Have I missed something in the daunting UIs of GCP and AWS that would suit my use case?
- Should I hack something together using the EC2 or GCP Compute Engine APIs to automatically start and stop GPU-powered VMs on demand? I am fairly sure that I could do that, but it feels like it would not be very robust, scalable or cost-efficient...
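
To make the third option concrete, here is a minimal sketch of what I have in mind on the EC2 side, using boto3. It is untested against a real account and rests on several assumptions: the AMI ID, container image, bucket name and job ID are hypothetical placeholders, the AMI is assumed to ship with Docker, NVIDIA drivers and the AWS CLI, and the instance is assumed to have a role that can write to the bucket. The helper names are my own. The idea is that the VM runs the job at boot, uploads the output, then shuts itself down, and the shutdown terminates it.

```python
def build_user_data(image: str, bucket: str, job_id: str) -> str:
    """Boot script: run the containerized job, upload its output, power off."""
    lines = [
        "#!/bin/bash",
        # Assumes the container writes its results to /out.
        f"docker run --gpus all -v /tmp/out:/out {image}",
        f"aws s3 cp --recursive /tmp/out s3://{bucket}/{job_id}/",
        # No 'set -e' above, so the VM powers off even if the job fails.
        "shutdown -h now",
    ]
    return "\n".join(lines) + "\n"


def build_run_instances_params(ami_id: str, instance_type: str, user_data: str) -> dict:
    """Keyword arguments for the EC2 run_instances call (one instance per job)."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": user_data,
        # Terminate (rather than stop) when the boot script shuts the VM down.
        "InstanceInitiatedShutdownBehavior": "terminate",
    }


def launch_job(params: dict, region: str) -> str:
    """Start the instance and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the helpers above work without it

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**params)
    return response["Instances"][0]["InstanceId"]
```

Launching several jobs in parallel would then just mean calling `launch_job` once per job, for example with `build_run_instances_params("ami-PLACEHOLDER", "g4dn.xlarge", build_user_data("my-registry/my-job:latest", "my-results-bucket", "job-001"))`. What worries me is everything this leaves out: retries, quota errors, instances that hang, and the VM boot time itself.
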
Thanks for reading, and thanks in advance for your suggestions.