Open Access

ARTICLE

On Multi-Thread Crawler Optimization for Scalable Text Searching

Guang Sun 1, Huanxin Xiang 2, Shuanghu Li 1,*

1 Hunan University of Finance and Economics, Changsha, 410205, China.
2 The University of Alabama, Tuscaloosa, 35401, USA.

*Corresponding Author: Shuanghu Li.

Journal on Big Data 2019, 1(2), 89-106. https://doi.org/10.32604/jbd.2019.07235

Abstract

Web crawlers are an important part of modern search engines. As data volumes have exploded, we have entered a “big data era”. Wikipedia, for example, gathers knowledge from all over the world, records the news as it happens each day, and offers users a rich database, but the sheer volume of data makes searching it burdensome. Single-threaded crawling can no longer meet the requirements of text crawling. To improve on the performance and versatility of single-threaded crawlers, a high-speed multi-threaded web crawler is designed to crawl a hyper-scale online text database. The crawler uses multiple threads to process web pages in parallel and combines breadth-first and depth-first algorithms to control the crawl. The accompanying practice project, written in Python, implements a multi-thread-optimized Wikipedia book crawling method for this hyper-large-scale text database; it was inspired by an article about Wikipedia published by the Big Data Digest public account.
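The idea of processing pages in parallel while controlling how far the crawl spreads can be illustrated with a minimal Python sketch. It is not the authors' implementation: it shows only a depth-limited breadth-first crawl in which each level is fetched by a thread pool. The names `FAKE_WEB`, `fetch_links` and `crawl` are hypothetical, and an in-memory link table stands in for real HTTP requests so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": page -> outgoing links.
# This stands in for real HTTP downloads and HTML parsing.
FAKE_WEB = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["A"],
    "E": [],
}


def fetch_links(page):
    """Return the links found on a page (a real crawler would download it)."""
    return FAKE_WEB.get(page, [])


def crawl(seed, max_depth, workers=4, fetch=fetch_links):
    """Breadth-first crawl: each level is fetched in parallel by a thread pool.

    max_depth bounds how many link hops from the seed are followed.
    Returns the pages in the order they were scheduled for fetching.
    """
    visited = {seed}
    frontier = [seed]
    order = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_depth + 1):
            if not frontier:
                break
            order.extend(frontier)
            # pool.map runs fetch concurrently but yields results in input order.
            results = list(pool.map(fetch, frontier))
            next_frontier = []
            for links in results:
                for link in links:
                    if link not in visited:  # skip pages already scheduled
                        visited.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return order


if __name__ == "__main__":
    print(crawl("A", max_depth=2))
```

Only the worker threads call `fetch`; the visited set is updated in the main thread after each level completes, so the sketch needs no locks. A crawler that hands newly found links to workers immediately would need a thread-safe queue and a guarded visited set instead.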

Cite This Article

G. Sun, H. Xiang and S. Li, "On multi-thread crawler optimization for scalable text searching," Journal on Big Data, vol. 1, no. 2, pp. 89–106, 2019.

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.