Novel Power-Aware Optimization Methodology and Efficient Task Scheduling Algorithm

The performance of central processing units (CPUs) can be enhanced by integrating multiple cores into a single chip. CPU performance can be improved further by allocating tasks with an intelligent strategy: if small tasks wait or execute for a long time, the CPU consumes more power, so the power consumed by CPUs can be reduced without increasing the frequency. Cores are connected by on-chip lines and organized into a network called a network on chip (NOC). NOCs are widely used in processor design; however, their performance can still be enhanced by reducing power consumption. The main problem lies in task scheduling that fully utilizes the network. Here, we propose a novel random fit algorithm for NOCs based on power-aware optimization. In this algorithm, tasks belonging to the same application are mapped to neighboring cores, whereas tasks belonging to different applications are mapped to processor cores through a series of steps. This scheduling is performed at run time. Experimental results show that the proposed random fit algorithm reduces power consumption and increases system performance through effective scheduling.

Online scheduling has been investigated recently. Run-time scheduling refers to scheduling performed by the operating system; it supports the dynamic scheduling of tasks. The work in [18] is based on the characteristics of NOCs. It also deals with resource allocation and the migration of threads at run time, and it considers a task sequence constructed in accordance with the user's habits, which is then used for predicting and allocating tasks. The work in [19] performs task allocation under the assumption that each processor core is associated with a different level of power consumption; thus, tasks are mapped to the core with the lowest power consumption. The work in [20] performs scheduling based on scenario transitions. According to run-time analysis, these approaches aim to scale down voltage and frequency in such a way that the system's overall power consumption is reduced.
To achieve this target, the tasks in NOCs are associated with certain communication requirements [21]. Task scheduling determines traffic density. An overview of the related work thus shows that the scheduling algorithm plays a vital role in the performance of an NOC. Power consumption can be reduced by an optimized scheduling algorithm [22]. The scheduling algorithm proposed in [23] relies on run-time analysis of the traffic in an on-chip network. The candidate optimization algorithm (COA) achieves minimal network transmission delay with a small amount of consumed resources and minimal power consumption [24].

System Model
NOCs have various designs [25] and new features, depending on the architecture, application, and algorithm used. In this section, we describe the system model and related topics, including the routing policy, network topology, and energy models. These are the basic and principal elements of the scheduling algorithm.

On-Chip Network Topology
NOCs are a new design paradigm for systems-on-chip (SOCs) and have advantages over traditional bus communication. On-chip communication provides better performance and greatly improves efficiency when connecting on-chip devices. Although several topologies, such as ring and mesh [26], have been recommended for NOCs, the mesh is considered the best performing choice for NOC design, as depicted in Fig. 1.
The structure of a mesh is similar to that of a matrix. The tiles, which are considered nodes, are connected by wires, as shown in Fig. 1. Every tile in the mesh includes the following components: a router, a processor core, input/output (I/O) interfaces, and on-chip memory (either a cache or a scratchpad memory, SPM), as depicted in Fig. 1. Routers handle data forwarding, and I/O interfaces handle communication between interfaces. Tiles may also contain processing elements (PEs) used for special purposes and operations. The cache and SPM act as local memory for the cores.
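As an illustration of the tile layout, the mesh adjacency can be sketched in a few lines. The helper below is a hypothetical illustration, not part of the paper's implementation: each tile is addressed by (x, y) coordinates and wired to its horizontal and vertical neighbors.

```python
def mesh_neighbors(x, y, width, height):
    """Return the coordinates of the tiles wired to tile (x, y)
    in a width x height mesh (illustrative helper)."""
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    # Keep only neighbors that fall inside the mesh boundary.
    return [(nx, ny) for nx, ny in candidates
            if 0 <= nx < width and 0 <= ny < height]
```

A corner tile thus has two links, an edge tile three, and an interior tile four, matching the matrix-like structure in Fig. 1.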

Routing Policy
Routing policies determine the path taken from source nodes to destination nodes. A routing algorithm can solve problems such as deadlock and congestion, and it can certainly improve the performance of on-chip communication in an NOC to a certain extent. Routing algorithms generally fall into two categories: deterministic and adaptive.
With a deterministic routing algorithm, the path traveled from the initial node to the destination node in a network is fixed in advance. By contrast, adaptive routing uses the run-time environment to choose the routing direction. However, this greatly affects the design of the router and leads to increased complexity.
Routing in mesh-based NOCs is commonly performed with the X-Y routing algorithm. In X-Y routing, the mesh runs in the X and Y directions, as shown in Fig. 2. Packets are first forwarded in the X direction, and routing then proceeds in the Y direction. This prevents deadlock for a given destination and is highly simple. In this study, the X-Y routing algorithm is chosen to minimize the influence of the routing algorithm on the results. Another important characteristic of transfers on an NOC is the switching mode. One popular and effective switching mode used today is wormhole switching, a packet-based technique. Here, packets are divided into flits, which are smaller chunks that are transferred through the network. A router can forward and buffer a flit as long as it has enough buffer space for a single flit. This method reduces network latency and saves buffer space. Moreover, if a flit is blocked, the router forwards the other flits, which increases network throughput. In this study, wormhole switching is used as the switching mode. Wormhole switching has its own disadvantage, as it may increase latency under heavy traffic; however, such occurrences are rare and have minimal effect on the results. Fig. 1 shows the mesh topology for NOCs.
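The X-Y routing rule described above can be sketched as follows. The function name `xy_route` is an illustrative assumption; the sketch treats cores as (x, y) coordinate pairs and returns the deterministic path, X dimension first, then Y.

```python
def xy_route(src, dst):
    """Deterministic X-Y route on a mesh: move along X until the X
    coordinates match, then along Y. Returns the ordered list of
    tiles visited, inclusive of source and destination."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:              # route in the X direction first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:              # then route in the Y direction
        y += 1 if dy > y else -1
        path.append((x, y))
    return path
```

Because every packet between a given pair of tiles takes the same dimension-ordered path, no cyclic channel dependency can form, which is why this scheme is deadlock-free.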

Task Model
In the case of offline analysis, the application can be represented by an application control graph (ACG) [16]. The application is represented in the form of vertices V and edges E, where an edge connects two vertices; thus, the representation is of the form G(V, E). Each vertex represents a task V_i. Tasks cannot be subdivided into smaller partitions. The connection between two vertices V_i and V_j is represented by the edge E_ij. The traffic between the two tasks is represented by the function F(E_ij), and the total traffic of the individual tasks V_i and V_j is represented as F(E_i) and F(E_j), respectively. The ACG for a given application can be obtained by the CETA method, whereas the real-time traffic of the tasks within the application can be obtained using the SIMICS method. The application set is represented as S = {S_0, S_1, S_2, ..., S_m}, where each application S_i has N(S_i) tasks. The task model is represented as M = {M_0, M_1, M_2, ..., M_n}.
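As a minimal sketch of this task model, the ACG can be stored as a map from edges E_ij to their traffic F(E_ij), from which the per-task totals F(E_i) follow. The names below are illustrative assumptions, not identifiers from the paper.

```python
from collections import defaultdict

def task_traffic(acg_edges):
    """Given an ACG as {(Vi, Vj): F(Eij)}, return the total
    traffic F(Ei) of each task, i.e. the sum of the traffic on
    every edge incident to that task."""
    total = defaultdict(int)
    for (vi, vj), flits in acg_edges.items():
        total[vi] += flits
        total[vj] += flits
    return dict(total)
```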

Energy Model
Several processor cores or elements are present within the NOC, and the tasks communicate through on-chip lines. The PEs consume energy to compute the tasks; this energy is referred to as computation energy. The energy consumed in communication between the tasks V_i and V_j can be represented as

E(E_ij) = R_r(E_ij) + R_l(E_ij),

where R_r represents the energy consumed by the routers, and R_l represents the energy consumed by the lines between the tasks. Fig. 3 shows the coordinate system for the NOC topology. The coordinates corresponding to the core A1 are represented as A1(0, 0); thus, the coordinates corresponding to the tile Ai are represented as Ai(X, Y), where X and Y are the horizontal and vertical coordinates, respectively. The distance between the cores is measured using the Manhattan distance as follows:

D(A_i, A_j) = |X_i - X_j| + |Y_i - Y_j|,

where E_l represents the energy consumed by the lines connecting the tasks, in other words, the energy consumed over the Manhattan distance, and E_r represents the energy consumed by the routers. The value of R_r(E_ij), the energy consumed by the routers, is calculated as

R_r(E_ij) = R_rout × (D(A_i, A_j) + 1) × F(E_ij).

Similarly, the value of R_l(E_ij) is given by

R_l(E_ij) = R_link × D(A_i, A_j) × F(E_ij),

where R_rout indicates the energy a router consumes per bit, and R_link indicates the energy a line connecting the tasks consumes per bit.
By summing the above two equations, the total energy consumed on an edge can be obtained as follows:

E(E_ij) = (R_rout × (D(A_i, A_j) + 1) + R_link × D(A_i, A_j)) × F(E_ij).

The total energy consumed by the applications is given by

E_total = Σ_{S_k} Σ_{E_ij ∈ S_k} E(E_ij),

where S_k represents the set of active applications in the system.
The total energy the system consumes during all the slots is the sum of E_total over all time slots. The above equations reveal that the Manhattan distance and the per-bit energies have a great effect on the total energy consumed by the system. Because the per-bit energy is considered a constant, the focus of this study is to reduce the Manhattan distance.
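A minimal sketch of the energy accounting in this section, assuming the standard per-bit model in which a transfer at Manhattan distance D crosses D + 1 routers and D links; function and parameter names are illustrative assumptions.

```python
def manhattan(a, b):
    """D(Ai, Aj) = |Xi - Xj| + |Yi - Yj| for cores at (x, y)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def comm_energy(src, dst, flits, r_rout, r_link):
    """Communication energy for `flits` bits between two cores:
    router energy over (hops + 1) routers plus link energy over
    `hops` links, each weighted by the traffic volume."""
    hops = manhattan(src, dst)
    return flits * (r_rout * (hops + 1) + r_link * hops)
```

Since r_rout and r_link are constants of the chip, the only mapping-dependent factor is the Manhattan distance, which is why the scheduling algorithm concentrates on minimizing it.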

Proposed Model
The algorithm used for scheduling is discussed in this section. In the case of a single application, the algorithm maps the tasks to their corresponding computational units. On this basis, the optimized algorithm can be defined for multiple applications.

Single-Application Algorithm Design
In the case of a single application, the application is partitioned into multiple tasks. Compared with single-core chips, multitask applications can provide high parallelism. To boost system performance, these tasks must be assigned to corresponding cores. In this study, two scheduling algorithms, namely, the random fit algorithm and the XY routing fit algorithm, are used. The XY routing fit algorithm searches for an idle core in the X direction to assign to the unmapped task; if no such core is found in the X direction, the algorithm searches in the Y direction. The initial scheduling starts at coordinate A0(0, 0). However, when communication is heavy, the algorithm has its own disadvantage, as described below.
In this algorithm, on the basis of the Manhattan distance, the tasks are mapped to nearby cores. The initial task is mapped to the core A0(0, 0). The next task is mapped to the core having the smallest Manhattan distance to the previous core A0(0, 0), and all the remaining tasks are mapped in the same manner. The traffic caused can be reduced in this algorithm by reducing the Manhattan distance. However, the algorithm has its own disadvantages: although optimization can be achieved, communication energy remains high. Hence, the Manhattan distance must be reduced even further. To achieve this, the tasks with the highest traffic must be mapped to nearby cores. The traffic of a task is defined as follows:

F(E_i) = Σ_j F(E_ij).

Thus, the task with the heaviest traffic is mapped to the core A0(0, 0). The task with the next-heaviest traffic is mapped to the core that has the smallest Manhattan distance to the previously mapped core A0(0, 0). However, if the task with the heaviest traffic communicates with a large number of tasks, then the Manhattan distance for those tasks increases under this scheduling. Assume the number of tasks in S_k is N(S_k). The core with the maximum core value A_m is chosen as the initial point for task scheduling, and each core A_i has a set of neighboring cores A(A_i). Thus, the task with the heaviest traffic is mapped to the core A_m.
To map a remaining task to the next core, the core must satisfy the following two conditions: it must have a neighbor relationship with the previously mapped core, and it should be the core with the maximum A_m value.
Similarly, while scheduling, the task with the highest traffic to the previously mapped task is mapped next. Examples of application scheduling are shown in the figures below. The total energy consumed by the three algorithms, namely, the XY routing algorithm, the random fit algorithm, and the energy optimization (EO) algorithm, is calculated by applying the energy model above to each resulting mapping. The results show that energy consumption decreases as the Manhattan distance decreases; as a result, communication efficiency is greatly increased. Fig. 5 shows the proposed online scheduling methodology. As shown in the diagram, traffic analysis is performed by static profiling, which marks the start of the online scheduling process. Initially, the EO algorithm is used to schedule the tasks to the cores. The on-chip network is then divided into regions, which are further divided among tasks. Run-time analysis is used to collect profile information.
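The traffic-first placement described above can be sketched as a greedy loop. This is an illustrative reconstruction, not the authors' code: the heaviest-traffic task is pinned to the starting core, and each subsequent task is the unmapped one with the most traffic to an already-mapped task, placed on the free core nearest (by Manhattan distance) to its partner's core. The sketch assumes the ACG is connected.

```python
from collections import defaultdict

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def eo_map(acg_edges, cores, start):
    """Greedy traffic-first mapping sketch.
    acg_edges: {(Vi, Vj): F(Eij)}; cores: list of (x, y); start: core
    for the heaviest-traffic task. Returns {task: core}."""
    traffic = defaultdict(int)
    for (u, v), f in acg_edges.items():
        traffic[u] += f
        traffic[v] += f
    placement = {max(traffic, key=traffic.get): start}
    free = [c for c in cores if c != start]
    while len(placement) < len(traffic):
        # Unmapped tasks reachable from mapped ones, with their traffic
        # to the mapped partner and that partner's core.
        candidates = []
        for (u, v), f in acg_edges.items():
            if u in placement and v not in placement:
                candidates.append((f, v, placement[u]))
            elif v in placement and u not in placement:
                candidates.append((f, u, placement[v]))
        _, task, anchor = max(candidates)          # heaviest traffic first
        core = min(free, key=lambda c: manhattan(c, anchor))
        placement[task] = core
        free.remove(core)
    return placement
```

Heavily communicating pairs thus end up on adjacent tiles, shrinking the Manhattan distances that dominate the communication energy.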

Multiple-Application Algorithm Design
As the number of applications increases, the number of tasks mapped to the network also increases. In such cases, the tasks belonging to the same application are gathered into clusters by the EO algorithm. For each application S_i, a task set partition, represented as M(S_i), is created.
The network corresponding to the on-chip processor is represented as N(A, D). Here, A represents the set of processor cores, A = {A_0, A_1, ..., A_i}. The total number of cores is represented as N(A), and the set of paths is represented by D_ij. The path between two processors is represented as

D_ij = (A_i, R, ..., R, A_j),

where R represents a router between the two cores A_i and A_j, and A(A_i) represents the set of processor cores connected to the given core A_i.
If N(A) ≥ Σ_{i=0}^{x} N(S_i), then the tasks can be assigned, and partitioning is successful. If N(A) < Σ_{i=0}^{x} N(S_i), then the tasks cannot be assigned to the given network because the number of cores is insufficient. In such cases, partitioning can be performed only if a few tasks are removed from the task list. The tasks to be removed are determined as follows: choose a task M_x and remove all the tasks after M_x in the task list M.
Then, find the application S_k containing the task M_x and check whether M_x is the last task of the application S_k.
If M_x is the last task, then assign the remaining tasks to the network.
Each task is mapped to a core with the help of the EO algorithm. During the scheduling process, already-mapped cores are not mapped again.
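A sketch of the core-budget check behind this partitioning step, under the simplifying assumption that whole applications are admitted in list order until the cores run out; names and the exact truncation policy are illustrative assumptions.

```python
def fit_applications(app_sizes, num_cores):
    """app_sizes[i] = N(S_i), the task count of application S_i.
    Admit applications in order while the total task count stays
    within the core budget N(A); return (admitted, deferred) as
    lists of application indices."""
    admitted, used = [], 0
    for i, n in enumerate(app_sizes):
        if used + n <= num_cores:
            admitted.append(i)
            used += n
        else:
            # Not enough cores: everything from this application on
            # is deferred, so the cut falls on an application boundary.
            return admitted, list(range(i, len(app_sizes)))
    return admitted, []
```

Cutting at an application boundary mirrors the rule above that tasks are only assigned once M_x is the last task of its application.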

Scheduling Based on Energy Optimization
The communication overhead of the system can only be determined at run time. In the case of offline scheduling (Fig. 6), the profiling information is inaccurate. By contrast, in online scheduling, the communication overhead can be determined at run time with the help of profile information. If any changes in the traffic are found, the tasks are remapped. The proposed algorithm becomes complex when the changes occurring under the different circumstances of the application are considered. Static profiling is used to analyze the traffic conditions of the application, and the profile information is collected while the tasks run. The rescheduling process is performed on the basis of the EO algorithm.
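The remap-on-traffic-change rule can be sketched as a simple threshold test over profiled versus observed edge traffic. This is an assumption about how the check is realized; the paper does not give the exact criterion.

```python
def needs_reschedule(profiled, observed, threshold):
    """Return True when the observed run-time traffic on any edge
    drifts from the statically profiled value by more than
    `threshold` flits, triggering an EO-based remap.
    profiled/observed: {(Vi, Vj): flits}."""
    edges = set(profiled) | set(observed)
    return any(abs(observed.get(e, 0) - profiled.get(e, 0)) > threshold
               for e in edges)
```

A larger threshold trades remapping overhead against how closely the mapping tracks the real traffic.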

Output
The tasks of the applications in S are mapped to cores.

Working
• Form a task-core pair such that the task is assigned to the corresponding core.
• Identify the last mapped task.
• Identify the unmapped task with the heaviest traffic to the previously mapped task.
• Identify the corresponding core.
• Remove the core mapped to the task from the core list.
• Repeat the process for the remaining unmapped tasks.

for (a = 0; a <= c; a++) {
    for (b = 0; b <= N(S_k); b++) {
        Form a task-core scheduling pair such that the task is mapped to that core.
        Find the unmapped task having the highest traffic with the previously assigned task.
        If (no such task is found)
            Identify the task with the heaviest traffic in the application set S_k.
        Identify the corresponding core.
        Remove the core mapped to the task from the core list.
    }
}
Do {
    If (a new task appears) {
        Identify the traffic between the new task and the previously mapped tasks.
        If (traffic value > assumed threshold)
            Reschedule the tasks on the basis of the EO algorithm.
    }
}

Results and Analysis
Simulation is used to test the proposed algorithm. The traffic can be obtained on the basis of the SIMICS method. SIMICS settings include CPU number: 16, frequency: 60 GHz, size of the cache: 32 KB, size of the disk: 4 GB, memory size: 1024 MB, and operating system used: Linux version 2.6.
The NOC simulator used for the simulation process is gem5. Its parameter settings include data cache size: 64 KB, instruction cache size: 64 KB, data cache associativity: 16-way, instruction cache associativity: 8-way, hit latency: 4-35 cycles, and memory size: 2 GB.
The traffic of some benchmark applications, such as FFM_3 and MPGenC, is obtained and used for analysis via CETA. The ACGs of the applications are obtained with the help of CETA, and the tasks within each application are obtained with the help of SIMICS. The on-chip network is simulated with the gem5 simulator. In this process, the communication overhead that occurs due to task migration is considered. The results show that the complexity of the EO algorithm is low.

The graph reveals that the EO algorithm outperforms the XY routing and random fit algorithms in terms of energy consumption. This result is due to the balanced communication among the tasks of an application. The number of transmitted flits can be reduced with the help of the EO and random fit algorithms, which assign tasks on the basis of their communication relationships. Approximately 62% of the total flits can be eliminated by the proposed algorithm, which serves as the basis for the optimization algorithm. Fig. 7 shows that as traffic is reduced, energy consumption is also reduced; when traffic remains average, energy consumption remains consistent. The maximum latency of the various algorithms is shown in the figure, which reveals that although the number of flits is reduced, the maximum latency remains high. Mapping tasks on the basis of the Manhattan distance alone is difficult if they do not communicate intensively. Although the random fit algorithm does not provide communication as effective as that of the EO algorithm, the tasks still achieve balanced communication. However, the rescheduling process, which is performed at run time on the basis of the profiling information, does not reduce the number of transmitted flits. Online scheduling is more flexible for system operation, but gathering the information needed to remap the tasks involves extra time loss. Fig. 8 shows that the maximum latency is difficult to reduce despite the reduction in the total number of flits. The average latency of the three algorithms is shown in the graph. If communication is intensive, task efficiency is guaranteed. With the random fit and EO algorithms, the average latency is reduced by approximately 29%-33%.
Thus, the experimental results (Fig. 9) show that the random fit algorithm can outperform the COA by increasing the task scheduling efficiency of an NOC-based multicore system. Even without considering the traffic, tasks can be assigned to their corresponding cores with the help of this algorithm. Despite the overhead, it achieves higher performance and lower power consumption than the COA.

Conclusions
A promising paradigm for breaking the bottleneck of on-chip communication is the NOC, in which cores are connected to form a core network. However, the greatest challenge faced by NOCs in terms of performance and energy consumption is network traffic, and the characteristics of the on-chip network have a remarkable effect on its behavior. The key issue lies in scheduling the tasks onto the associated network. In this study, an optimized scheduling algorithm called the random fit algorithm is used for NOC-based multicore systems. Communication energy in the scheduling process can be saved with the help of the EO algorithm, and the tasks can be remapped to improve performance and power consumption. Experimental analysis shows that performance can be increased with the help of the rescheduling process. However, the proposed scheduling algorithm does not consider migration cost. In the future, the scheduling algorithm can be reframed to take migration cost into account.
Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.