
Open Access


On Multi-Thread Crawler Optimization for Scalable Text Searching

Guang Sun1, Huanxin Xiang2, Shuanghu Li1,*

1 Hunan University of Finance and Economics, Changsha, 410205, China.
2 The University of Alabama, Tuscaloosa, 35401, USA.

*Corresponding Author: Shuanghu Li. Email: email.

Journal on Big Data 2019, 1(2), 89-106.


Web crawlers are an important part of modern search engines. As data volumes have exploded, society has entered a “big data era”. Wikipedia, for example, gathers knowledge from all over the world, records the news that occurs every day in real time, and offers users a rich database, but the sheer amount of data makes searching it a burden for users. Single-threaded crawling can no longer meet the requirements of text crawling at this scale. To improve on the performance and versatility of single-threaded crawlers, a high-speed multi-threaded web crawler is designed to crawl hyper-scale text databases on the web. Multi-threaded crawling uses multiple threads to process web pages in parallel and combines breadth-first and depth-first algorithms to control the crawl. The practical project uses Python to implement a multi-thread-optimized Wikipedia book crawling method for this hyper-large-scale text database. The project was inspired by an article about Wikipedia published on the Big Data Digest public account.
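The idea of processing pages in parallel while a traversal order controls the crawl can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a plain breadth-first traversal with a depth cap, uses a thread pool from the standard library, and replaces real HTTP requests with a hypothetical `fetch_links` callable backed by a toy in-memory link graph (`GRAPH`, `fake_fetch`, and `crawl` are all illustrative names).

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(start, fetch_links, max_depth=2, workers=8):
    """Visit pages level by level, fetching each level in parallel.

    fetch_links(url) is a hypothetical callable returning the URLs
    linked from a page. Returns a dict mapping every discovered URL
    to the depth at which it was first found.
    """
    seen = {start: 0}
    frontier = [start]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for depth in range(max_depth):
            if not frontier:
                break
            # Worker threads fetch all pages of the current level concurrently.
            results = pool.map(fetch_links, frontier)
            next_frontier = []
            # Only the main thread updates `seen`, so no lock is needed here.
            for links in results:
                for link in links:
                    if link not in seen:
                        seen[link] = depth + 1
                        next_frontier.append(link)
            frontier = next_frontier
    return seen


# Toy link graph standing in for real pages (assumption for illustration).
GRAPH = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "A"],
    "D": ["E"],
    "E": [],
}


def fake_fetch(url):
    # Stand-in for downloading a page and extracting its links.
    return GRAPH.get(url, [])


visited = crawl("A", fake_fetch, max_depth=2, workers=4)
print(visited)
```

With `max_depth=2`, pages found at depth 2 are recorded but not fetched, so the crawl stops expanding there. In a real crawler, `fetch_links` would download and parse each page, and the depth cap and worker count would be tuned to the target site.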


Cite This Article

APA Style
Sun, G., Xiang, H., & Li, S. (2019). On multi-thread crawler optimization for scalable text searching. Journal on Big Data, 1(2), 89-106.
Vancouver Style
Sun G, Xiang H, Li S. On multi-thread crawler optimization for scalable text searching. J Big Data. 2019;1(2):89-106.
IEEE Style
G. Sun, H. Xiang, and S. Li, "On Multi-Thread Crawler Optimization for Scalable Text Searching," J. Big Data, vol. 1, no. 2, pp. 89-106, 2019.


This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

