Open Access

ARTICLE

Research on Performance Optimization of Spark Distributed Computing Platform

Qinlu He1,*, Fan Zhang1, Genqing Bian1, Weiqi Zhang1, Zhen Li2

1 School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an, 710054, China
2 Shaanxi Institute of Metrology Science, Xi’an, 710043, China

* Corresponding Author: Qinlu He.

Computers, Materials & Continua 2024, 79(2), 2833-2850. https://doi.org/10.32604/cmc.2024.046807

Abstract

Spark, a distributed computing platform, has developed rapidly in the field of big data. Its in-memory computing model reduces disk read overhead and shortens data processing time, which gives it broad prospects in large-scale computing applications such as machine learning and image processing. However, the performance of the Spark platform still needs to be improved. When a large number of tasks are processed simultaneously, Spark’s cache replacement mechanism cannot identify high-value data partitions, so memory resources are not fully utilized and platform performance suffers. To address the problem that Spark’s default cache replacement algorithm cannot accurately evaluate high-value data partitions, the factors that influence the weight of a data partition are first modeled and evaluated. Then, based on this weight model, a cache replacement algorithm based on dynamically weighted data value is proposed, which takes hit rate and data difference into account and builds on LRU (Least Recently Used) with improved integration and usage strategies. The proposed approach has three parts: the weight update algorithm updates the weight value whenever the data partition information changes, so that the weight accurately measures the importance of the partition in the current job; the cache removal algorithm clears partitions that no longer have useful value from the cache to release memory resources; and the weight replacement algorithm combines partition weights and partition information to replace RDD partitions when the remaining memory space is insufficient. Finally, the proposed algorithm is verified experimentally on a Spark cluster. The experiments show that the algorithm effectively improves the cache hit rate, enhances platform performance, and reduces job execution time by 7.61% compared with existing improved algorithms.
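The three cooperating steps named in the abstract (weight update, cache removal, and weight-based replacement on top of LRU) can be illustrated with a small stand-alone sketch. This is not the authors' implementation and does not use any Spark API: the `Partition` fields, the `WeightedCache` class, and in particular the weight formula (remaining uses × recompute cost ÷ size) are hypothetical placeholders chosen only to make the control flow concrete, since the abstract does not state the paper's weight model.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Partition:
    size: int              # memory footprint, arbitrary units (hypothetical field)
    recompute_cost: float  # cost to rebuild the partition (hypothetical field)
    remaining_uses: int    # later reads still expected in the job (hypothetical field)

    @property
    def weight(self) -> float:
        # Placeholder weight, NOT the paper's model: more future reads and a
        # higher rebuild cost raise the value, a larger footprint lowers it.
        return self.remaining_uses * self.recompute_cost / self.size


class WeightedCache:
    """Toy cache: evicts the lowest-weight partition, ties broken by LRU order."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.entries = OrderedDict()  # key -> Partition, least recently used first

    def _drop(self, key):
        self.used -= self.entries.pop(key).size

    def get(self, key):
        part = self.entries.get(key)
        if part is None:
            return None  # cache miss
        self.entries.move_to_end(key)  # mark as most recently used
        # Weight update: one expected read has been consumed, so the weight drops.
        part.remaining_uses = max(0, part.remaining_uses - 1)
        return part

    def purge_dead(self):
        # Cache removal: partitions with no expected reads hold no useful value.
        for key in [k for k, p in self.entries.items() if p.remaining_uses == 0]:
            self._drop(key)

    def put(self, key, part) -> bool:
        if part.size > self.capacity:
            return False  # can never fit
        if key in self.entries:
            self._drop(key)
        self.purge_dead()
        # Weight replacement: evict lowest-weight partitions until the new one fits.
        # min() scans in LRU order, so equal weights fall back to plain LRU.
        while self.used + part.size > self.capacity:
            victim = min(self.entries, key=lambda k: self.entries[k].weight)
            self._drop(victim)
        self.entries[key] = part
        self.used += part.size
        return True
```

In this sketch a partition that was cached first but is expensive to rebuild and still needed survives, whereas plain LRU would evict it simply because it is the oldest entry. A real integration would have to hook into Spark's block management and use the weight factors defined in the paper.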

Cite This Article

APA Style
He, Q., Zhang, F., Bian, G., Zhang, W., & Li, Z. (2024). Research on performance optimization of Spark distributed computing platform. Computers, Materials & Continua, 79(2), 2833-2850. https://doi.org/10.32604/cmc.2024.046807
Vancouver Style
He Q, Zhang F, Bian G, Zhang W, Li Z. Research on performance optimization of Spark distributed computing platform. Comput Mater Contin. 2024;79(2):2833-2850. https://doi.org/10.32604/cmc.2024.046807
IEEE Style
Q. He, F. Zhang, G. Bian, W. Zhang, and Z. Li, "Research on Performance Optimization of Spark Distributed Computing Platform," Comput. Mater. Contin., vol. 79, no. 2, pp. 2833-2850, 2024. https://doi.org/10.32604/cmc.2024.046807



This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.