
Types of Crawling

Apache Nutch

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Bingbot

Bingbot is Microsoft's web-crawling bot for the Bing search engine. It honors the Crawl-delay directive in robots.txt, whether the directive is defined in the group of rules most specific to Bingbot or in the default group.
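
Any polite crawler can check that directive before fetching. The sketch below is a minimal illustration using Python's standard urllib.robotparser module; the domain is a placeholder, not a real endpoint.

```python
# Minimal sketch: read a site's Crawl-delay with Python's standard library.
# "example.com" is a placeholder domain.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse robots.txt

# crawl_delay() returns the delay in seconds for the given user agent,
# falling back to the default ("*") group when no specific group matches;
# None means no Crawl-delay was declared.
print(parser.crawl_delay("bingbot"))
```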

DataparkSearch

DataparkSearch has a RESTful interface, though it is available only in the upcoming 4.54 version (you can try it in the latest snapshot). Using DataparkSearch templates, you can get the search results in practically any text-based format you need to process.

Googlebot

Googlebot is Google's web-crawling bot (sometimes also called a "spider"). Crawling is the process by which Googlebot discovers new and updated pages to be added to the Google index; Google uses a huge set of computers to fetch (or "crawl") billions of pages on the web.

Heritrix

Heritrix is a web crawler designed for web archiving. It was written by the Internet Archive. It is available under a free software license and written in Java. The main interface is accessible using a web browser, and there is a command-line tool that can optionally be used to initiate crawls.

Hierarchical Cluster Engine Project

As a cluster engine, it provides the hierarchical cluster infrastructure: the node connection schema, relations between nodes, node roles, request typification, data-processing sequence algorithms, data sharding modes, and so on. It also provides the network transport layer for client application data and administration management messages.

HTTrack

HTTrack is a free and open source Web crawler and offline browser, developed by Xavier Roche and licensed under the GNU General Public License Version 3. HTTrack allows users to download World Wide Web sites from the Internet to a local computer.
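
On systems where HTTrack's command-line client is installed, the same download can be scripted; the sketch below drives it from Python. It assumes the httrack binary is on the PATH, and the URL and output directory are placeholders.

```python
# Minimal sketch: mirror a site with HTTrack's command-line client by
# shelling out from Python. Assumes "httrack" is installed and on PATH;
# the URL and output directory are placeholders.
import subprocess

subprocess.run(
    [
        "httrack",
        "https://example.com/",        # site to copy
        "-O", "/tmp/example-mirror",   # local directory for the mirror
    ],
    check=True,  # raise if httrack exits with an error
)
```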

OpenSearchServer

OpenSearchServer is an open source search engine and web crawler whose scheduler is a highly customizable tool for setting up recurrent processes such as crawls. An index is the heart of OpenSearchServer: working with a site starts with creating and configuring an index, then setting up the crawler against it.
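
As a rough illustration of that index-then-query workflow driven over HTTP, the sketch below creates an index and runs a search against it. Every host name, path, and parameter in it is an assumption made for illustration, not OpenSearchServer's documented REST API; consult the project's documentation for the real endpoints.

```python
# Purely illustrative sketch of an index-then-query workflow over HTTP.
# The host, paths and parameter names below are assumptions, NOT the
# documented OpenSearchServer REST API.
import requests

BASE = "http://localhost:9090"  # assumed local server address

# Hypothetical call to create an index named "example".
requests.post(f"{BASE}/index/example", timeout=10).raise_for_status()

# Hypothetical call to run a search against that index.
resp = requests.get(
    f"{BASE}/index/example/search",
    params={"query": "web crawler"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```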

PowerMapper

PowerMapper is a web crawler that automatically creates a site map of a website using thumbnails of each web page. A number of map styles are available, although the cheaper Standard edition has fewer styles than the Professional edition.

Scrapy

Scrapy is a fast and powerful open source framework for crawling websites and extracting structured data: write the rules to extract the data and let Scrapy do the rest.

source: scrapy.org
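
A minimal sketch of what those rules look like, modeled on the spider from Scrapy's own tutorial (the quotes.toscrape.com demo site and its selectors):

```python
# A minimal Scrapy spider: start_urls and the parse() callback are the
# "rules"; Scrapy handles scheduling, fetching and politeness.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the pagination link and parse it the same way.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, it can be run with `scrapy runspider quotes_spider.py -o quotes.json` to write the scraped items to a JSON file.
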
Screaming Frog SEO Spider

The Screaming Frog SEO Spider is a desktop website crawler and auditor for PC, Mac or Linux which crawls websites' links, images, CSS, script and apps like a search engine to evaluate onsite SEO.

StormCrawler

StormCrawler is a collection of resources for building low-latency, scalable web crawlers on Apache Storm.

Xenu's Link Sleuth

Xenu's Link Sleuth is a free Windows tool best known for finding broken links, but it's not just a great tool for looking inside your own site; it's also pretty powerful for crawling external resources like directories, particularly if you're looking for a domain to buy. Try crawling dmoz.org, being sure to restrict Xenu's access to "editors.dmoz.org", but allow the crawler to "check external links".

source: moz.com
YaCy

YaCy is a free, distributed search engine built on peer-to-peer principles: all YaCy peers are equal and no central server exists. It can be run either in a crawling mode or as a local proxy server, indexing web pages visited by the person running YaCy on his or her computer.
