Learnability with time-sharing computational resource concerns

This article proposes `CoRE-learning', which introduces the `time-sharing' concept and, for the first time, enables `resource scheduling' in intelligent supercomputing facilities to be considered in machine learning theory.

On the one hand, users care about the efficiency of training a satisfactory model within a certain time budget; this corresponds to user efficiency. On the other hand, computational resources should be wisely exploited; this corresponds to hardware efficiency. A learning theory with time-sharing computational resource concerns does not assume that all received data can be handled in time, and thus scheduling becomes crucial.
For this purpose, we define Computational Resource Efficient Learning (CoRE-Learning) and present a theoretical framework.
First, we introduce the notion of machine learning throughput. Throughput is a basic concept in computer networking, defined as the amount of data that can be transferred per second [3]; it is also used in database systems to measure the average number of transactions completed within a given time [5]. The introduction of throughput enables us to theoretically formulate the influence of computational resource and scheduling at an abstract level.
Our proposed machine learning throughput involves two components. The first component is data throughput. As illustrated in Figure 1, data throughput represents the percentage of data that can be learned per time unit. For example, half of the received data can be timely exploited in the time unit t_0 ∼ t_1 in Figure 1, corresponding to a data throughput η = 50%. In the time unit t_1 ∼ t_2, the data volume doubles such that only 25% of the received data can be timely exploited with the current resource, and thus η becomes 25%. In the time unit t_2 ∼ t_3, the resource doubles such that η becomes 50% again. It is evident that the influence of the data volume as well as the computational resource budget can be captured by introducing the notion of data throughput into machine learning studies. The above discussion does not take into account that the difficulty of learning from the data may vary as unknown changes occur; this is related to open-environment machine learning [7], and that consideration can be accommodated in further studies.
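The data-throughput notion above can be sketched in a few lines of Python. The function name and the volume/capacity values are illustrative assumptions chosen to mirror the Figure 1 example, not part of the formal definition.

```python
# A minimal sketch of data throughput: eta is the fraction of received data
# that the available resource can process within one time unit.

def data_throughput(received: int, capacity: int) -> float:
    """Fraction of received data that can be timely exploited (at most 1)."""
    if received == 0:
        return 1.0
    return min(1.0, capacity / received)

# The three time units from Figure 1: data volume doubles, then the
# resource doubles (hypothetical data-unit counts).
volumes = [64, 128, 128]      # data units received per time unit
capacities = [32, 32, 64]     # data units the resource can handle

etas = [data_throughput(v, c) for v, c in zip(volumes, capacities)]
print(etas)  # [0.5, 0.25, 0.5]
```

The output reproduces the η = 50%, 25%, 50% sequence described in the text.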
We call a machine learning task received by the supercomputing facility a thread. It is associated with two time points, a beginning time and a deadline time, specifying the lifespan of the thread. If the thread can be well learned (i.e., the performance reaches the user's demands) within its lifespan, we call it a successful thread, and otherwise a failure thread. Note that if we set the deadline time according to the user's learning rapidity requirement for the thread, then a thread is successful if a satisfactory model can be learned within the given time budget. Now we introduce the second component of machine learning throughput, i.e., thread throughput, defined as the percentage of threads that can be learned well in a time period, calculated as the percentage of successful threads among all threads during that time period. As illustrated in Figure 2, the thread throughput is κ = 60%. Formally, the learner receives a task bundle S = {T_k = (D_k, b_k, d_k)}_{k=1}^K, where D_k is the data of the k-th thread, b_k and d_k are its beginning and deadline times, and N_t is the amount of data that can be handled at time t given the total budget of computational resource. K is the total number of threads in the task bundle, and T is the total number of timeslots. Note that if b_i = b_j (∀i ≠ j), then all task threads arrive at the same time.
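The thread notion and the thread-throughput calculation can be sketched as follows. The `Thread` fields and the success test are illustrative assumptions; they simply encode the lifespan (b_k, d_k), the completion time s_k, and the final error of the learned model.

```python
from typing import Optional
from dataclasses import dataclass

# A hypothetical encoding of a task thread T_k = (D_k, b_k, d_k); field
# names are illustrative, and the data D_k itself is omitted for brevity.
@dataclass
class Thread:
    begin: float                    # b_k, beginning time
    deadline: float                 # d_k, deadline time
    finish: Optional[float] = None  # s_k, set when learning completes
    error: Optional[float] = None   # R_k(M_k), error of the learned model

def thread_throughput(threads, eps):
    """kappa: fraction of threads finished before their deadline with an
    acceptable error level eps."""
    succ = sum(1 for t in threads
               if t.finish is not None and t.finish <= t.deadline
               and t.error is not None and t.error <= eps)
    return succ / len(threads)

# Three of five threads succeed, mirroring the kappa = 60% illustration.
bundle = [Thread(0, 10, finish=8, error=0.01),
          Thread(0, 10, finish=9, error=0.02),
          Thread(2, 12, finish=11, error=0.03),
          Thread(4, 9, finish=12, error=0.01),   # misses its deadline
          Thread(5, 9, finish=8, error=0.50)]    # error too large
print(thread_throughput(bundle, eps=0.05))  # 0.6
```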
A learning algorithm L receives S as input and outputs {(s_k, M_k)}_{k=1}^K, where s_k is the switching time determined by the algorithm and M_k is the learned model for the k-th thread. We use A_t to denote the set of alive threads at time t, i.e., the threads that have begun but have neither completed nor passed their deadlines. The learning process proceeds as follows.
1: for time t = 1, . . . , T, the learner
2:   collects at most η_{k,t} N_t samples for thread k ∈ A_t, where η_{k,t} is the data throughput for thread k at time t;
3:   updates the model M_k for thread k;
4:   if thread k completes, sets s_k ← t;
5: end for

Now we introduce CoRE-learnability, with η and κ denoting data throughput and thread throughput, respectively.
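The learning loop can be made executable with a minimal sketch. The scheduler `psi`, the sample-counter "model", and the completion test are hypothetical stand-ins, not part of the original formulation.

```python
# A minimal executable sketch of the CoRE-learning loop over timeslots.

def core_learning_loop(threads, T, N, psi):
    """threads: {k: (b_k, d_k)}; psi(alive, t) -> {k: eta_kt} quota map."""
    s = {}                                # switching times s_k
    models = {k: 0 for k in threads}      # toy 'models': sample counters
    for t in range(1, T + 1):
        alive = [k for k, (b, d) in threads.items()
                 if b <= t <= d and k not in s]       # the alive set A_t
        quotas = psi(alive, t)
        for k in alive:
            n_k = int(quotas.get(k, 0.0) * N)   # at most eta_kt * N_t samples
            models[k] += n_k                    # stand-in for updating M_k
            if models[k] >= 100:                # stand-in completion test
                s[k] = t                        # thread k completes at t
    return s, models

def even_split(alive, t):
    """Split the full data-throughput budget evenly among alive threads."""
    return {k: 1.0 / len(alive) for k in alive} if alive else {}

s, _ = core_learning_loop({1: (1, 10), 2: (3, 10)}, T=10, N=64, psi=even_split)
print(s)  # {1: 2, 2: 4}
```

Thread 1 is alone from t = 1 and completes at t = 2; thread 2 arrives at t = 3 and completes at t = 4.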
A task bundle S is (η, κ, L)-CoRE learnable if there exists a computational resource scheduling strategy ψ that enables L to output {(s_k, M_k)}_{k=1}^K, running in polynomial time in 1/ϵ and 1/δ, such that for some small ϵ and δ, with probability at least 1 − δ: (1) Σ_{k∈A_t} η_{k,t} ≤ η for every timeslot t; and (2) |I_succ| / (|I_succ| + |I_fail|) ≥ κ, where thread k belongs to I_succ if (2a) s_k ≤ d_k and (2b) R_k(M_k) ≤ ϵ, and I_succ (resp. I_fail) is the set of successful (resp. failure) threads. Condition (1) concerns data throughput, constraining that the overall resource quota of threads in the alive set never exceeds the maximum resource budget. Condition (2) concerns thread throughput, demanding that the scheduling strategy ψ enable L to learn as many threads well as possible: the learning of a thread should be completed before its deadline, as indicated by condition (2a), and the learning performance of the thread should be within a small error level, as indicated by condition (2b). The learning performance is measured by R_k : H_k → R, and R_k(M_k) ≤ ϵ evaluates whether the learning performance is acceptable according to a predetermined ϵ when the algorithm exploits data received in the timeslot (b_k, s_k) and completes learning by the time point s_k. Note that Condition (1) is related to hardware efficiency, while Condition (2) is related to user efficiency; the scheduling strategy should balance the two aspects carefully.
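For a concrete schedule, the two conditions can be checked mechanically. The data layouts below (quota maps per timeslot, per-thread result records) are illustrative assumptions.

```python
# A sketch of checking the two CoRE-learnability conditions for a given
# schedule produced by some strategy psi.

def condition1(quotas_per_t, eta):
    """(1): at every timeslot, the summed quota over alive threads
    stays within the data-throughput budget eta."""
    return all(sum(q.values()) <= eta + 1e-12 for q in quotas_per_t)

def condition2(results, eps, kappa):
    """(2): at least a kappa fraction of threads end with s_k <= d_k (2a)
    and error R_k(M_k) <= eps (2b)."""
    succ = [r for r in results if r["s"] <= r["d"] and r["R"] <= eps]
    return len(succ) / len(results) >= kappa

quotas = [{1: 0.25, 3: 0.25},             # timeslot 1: threads 1 and 3 alive
          {1: 0.25, 2: 0.125, 3: 0.125}]  # timeslot 2: thread 2 joins
results = [{"s": 3, "d": 4, "R": 0.01},
           {"s": 8, "d": 6, "R": 0.02},   # misses its deadline: fails (2a)
           {"s": 7, "d": 8, "R": 0.01}]
print(condition1(quotas, eta=0.5), condition2(results, eps=0.05, kappa=0.6))
# True True
```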
The CoRE-learnability definition employs an (ϵ, δ)-language similar to that of PAC (Probably Approximately Correct) learning theory [6]. It is noteworthy, however, that PAC learning theory focuses on learning from data sampled from an underlying data distribution, assuming that all training data can be exploited in time; thus, it allows for an arbitrarily small error ϵ and an arbitrarily high confidence 1 − δ given that the number of samples is sufficiently large (but still can be well exploited in time). In contrast, CoRE-learning theory considers the influence of the resource scheduling strategy ψ, and demands only acceptable (ϵ, δ) for L with (η, κ) throughput concerns.

Figure 3 presents an illustration, where the task bundle consists of K = 5 threads. For simplicity, assume that in each time unit N_t = N = 64 data units can be handled. Note that CoRE-learning allows the beginning time b_k and deadline time d_k of the task thread T_k = (D_k, b_k, d_k) to be any real value, while in this figure we assume they are rounded up for better illustration. For a given algorithm L, the task bundle is (.5, .6, L)-CoRE learnable, because there exists a scheduling strategy ψ that enables L to successfully learn three out of the total five threads given a data throughput η = 50%. As Figure 3 shows, ψ allocates resource that can handle ηN = 32 data units equally to threads 1 and 3 in t_0 ∼ t_1. Thread 1 continues to get resource that can handle 16 data units until it completes at t_3; the remaining resource that can handle 16 data units is allocated equally to threads 2 and 3 in t_1 ∼ t_3. In t_3 ∼ t_4, threads 2 and 3 each receive resource that can handle 8 more data units because thread 1 no longer requires resource. Thread 4 arrives at t_4, and ψ decides to allocate all resource to threads 3 and 4 as it is pessimistic about thread 2.
At t_5, thread 5 arrives, and because its lifespan is quite short, ψ decides to allocate it as much resource as possible, until the learning of thread 5 fails at t_7. At t_6, ψ is optimistic about thread 3, and therefore decides to give it all remaining resource, at the cost of temporarily sacrificing thread 4. At t_7, only threads 2 and 4 are alive. Finally, threads 2 and 5 fail for different reasons: thread 2 fails because of unsatisfactory learning performance, violating condition (2b), whereas thread 5 fails to complete before the deadline, violating condition (2a).
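The per-time-unit allocations described for t_0 through t_4 can be checked against the budget ηN = 0.5 × 64 = 32. The dictionary layout is an illustrative assumption; the data-unit numbers come from the walkthrough above.

```python
# Checking that the described allocations never exceed the per-time-unit
# budget of eta * N = 0.5 * 64 = 32 data units.

allocations = {                     # time unit -> data units per thread
    "t0~t1": {1: 16, 3: 16},        # 32 split equally between threads 1, 3
    "t1~t2": {1: 16, 2: 8, 3: 8},   # thread 1 keeps 16; 16 split over 2, 3
    "t2~t3": {1: 16, 2: 8, 3: 8},
    "t3~t4": {2: 16, 3: 16},        # thread 1 done; 2 and 3 get 8 more each
}

budget = int(0.5 * 64)
ok = all(sum(units.values()) <= budget for units in allocations.values())
print(ok, budget)  # True 32
```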
The resource scheduling strategy ψ is able to allocate resource adaptively, based on perceiving the learning status and foreseeing the learning progress of the threads. Intuitively, if L is based on gradient calculation, then allocating more computational resource to a task implies that more gradient calculations can be executed for that task. As Figure 4 illustrates, assume that the two task threads are allocated the same amount of resource initially. At iteration τ_1, ψ perceives that thread 1 has arrived at a flat convergence area where its error has not significantly dropped during the past five rounds of gradient calculation, whereas thread 2 has entered a slope area with a faster error drop. Then, ψ decides to reduce the resource for thread 1 and reallocate it to thread 2. At the final iteration τ_3, thread 2 reaches status b rather than b′, at the sacrifice of thread 1, which reaches status a rather than a′, leading to a better overall thread throughput of .5 (i.e., thread 2 is judged to be successful according to the threshold ϵ_0) rather than .0 (i.e., neither thread reaches ϵ_0 if the computational resource continues to be evenly allocated). Indeed, even if one considers another definition of thread throughput, such as defining it according to average error, the helpfulness of ψ is still visible from the improvement from (ϵ_{a′} + ϵ_{b′})/2 to (ϵ_a + ϵ_b)/2. Merely maximizing thread throughput may lead ψ to prefer learning easier threads; this can be remedied by assigning priority or importance weights to threads when needed.
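The reallocation heuristic above can be sketched as follows. The window size, the shifted share, and all names are hypothetical choices for illustration, not a prescribed policy.

```python
# A sketch of adaptive reallocation: shift resource share from the thread
# with the slowest recent error drop to the one with the fastest.

def reallocate(quotas, error_history, window=5, shift=0.1):
    """quotas: {k: resource share}; error_history: {k: errors per iteration}."""
    drops = {k: h[-window] - h[-1] for k, h in error_history.items()
             if len(h) >= window}
    if len(drops) < 2:
        return quotas
    flat = min(drops, key=drops.get)    # slowest recent progress
    steep = max(drops, key=drops.get)   # fastest recent progress
    moved = min(shift, quotas[flat])    # never move more than the thread has
    quotas = dict(quotas)
    quotas[flat] -= moved
    quotas[steep] += moved
    return quotas

history = {1: [0.30, 0.30, 0.29, 0.29, 0.29],   # flat convergence area
           2: [0.60, 0.52, 0.45, 0.38, 0.30]}   # faster error drop
new_quotas = reallocate({1: 0.25, 2: 0.25}, history)
print(new_quotas)
```

With these histories, thread 1's share shrinks to roughly 0.15 and thread 2's grows to roughly 0.35, mirroring the Figure 4 scenario.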
The CoRE-learning discussed in this article enables the influence and scheduling of computational resources to be taken into account in learning theory. One of its fundamental goals is, by introducing scheduling, to enable the computational resource for machine learning to be used in a time-sharing style rather than the current exclusive style. For example, even though the scaling law in training large language models is well known, the resources used to train such models are still used in an exclusive way, leading to considerable waste because it is hard to pre-set a just-right amount. Distributed machine learning [4] tries to partition a learning task for distributed computing, but at each distributed site the resource is still exploited in an exclusive way with a pre-set amount of resources, and the focus is on how to minimize the communication cost and guarantee convergence by adequately synchronizing calculations.
Note that resource scheduling in machine learning is very different from that in other fields such as computer systems and databases. For example, the amount of resources required to accomplish a task in computer systems and databases is generally known once the task is received, whereas in machine learning this information is unknown and can only be estimated by monitoring the learning process online. This raises new research issues that might have been overlooked before, such as how to govern a machine learning process and estimate its status and progress online effectively and efficiently. It is even more complicated once one notices that such online governing and status estimation themselves consume communication and computational resources. Thus, CoRE-learning naturally involves an exploration-exploitation balance in resource scheduling. The CoRE-learnability of concrete CoRE-learning algorithms can be proved once such algorithms are developed.
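The exploration-exploitation balance mentioned above can be illustrated with a toy epsilon-greedy scheduler: a small slice of the budget is occasionally spent probing a thread to refresh its progress estimate, while the rest goes to the thread that currently looks most promising. Every name and the epsilon value are illustrative assumptions.

```python
import random

# A toy epsilon-greedy resource scheduler: explore (probe a random thread
# to re-estimate its status) with probability epsilon, otherwise exploit
# (allocate to the thread with the best estimated progress rate).

def schedule(progress_rates, eta, epsilon=0.2, rng=random.Random(0)):
    """progress_rates: {k: estimated error drop per unit of resource}."""
    threads = list(progress_rates)
    if rng.random() < epsilon:
        probe = rng.choice(threads)     # exploration: spend a small slice
        rest = eta * 0.9                # of the budget on status estimation
        quotas = {probe: eta * 0.1}
    else:
        rest, quotas = eta, {}
    best = max(threads, key=lambda k: progress_rates[k])   # exploitation
    quotas[best] = quotas.get(best, 0.0) + rest
    return quotas

q = schedule({1: 0.01, 2: 0.05}, eta=0.5)
assert abs(sum(q.values()) - 0.5) < 1e-9   # the budget is never exceeded
```

Either branch spends exactly the budget η, so condition (1) of the definition is respected by construction.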

Figure 3. An illustration of CoRE-learning with a task bundle of K = 5 threads, assuming that in each time unit N_t = N = 64 data units can be handled; beginning and deadline times are rounded up for illustration.

Figure 4. An illustration of adaptive resource allocation.