How Search Engines Work

To understand how search engines work, we need to understand the processes behind them. There are three main processes:

  1. Web Crawling
  2. Indexing
  3. Searching

We will cover each of these processes in detail to understand how a search engine functions.

A. Web Crawling:

Web crawling is the process wherein web crawlers discover and fetch the content of a website so that it can be indexed. Once our web pages are indexed, the search engine can answer user queries on the basis of the pages' content. A web crawler is also known as a web spider.

If we want only a section of our website to be crawled, we can arrange this with the help of a robots.txt file, in which we make the relevant entries for the web crawlers. Crawlers identify themselves to the server (via a user-agent string) when they come looking for our website, and the robots.txt file, as already mentioned, contains crawl instructions keyed to those identities, as the sketch below shows.
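
As a concrete sketch, Python's standard-library urllib.robotparser can read robots.txt rules and tell a crawler whether a given URL may be fetched. The rules and the bot name below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: block every crawler from
# /private/ but allow everything else.
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved crawler checks before fetching each URL.
print(rp.can_fetch("MyBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyBot", "https://example.com/public/page"))   # True
```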

So, how does a web crawler work?

A web crawler usually starts with URLs that it already has, which we can call seeds. When the crawler visits these URLs, it extracts any new URL links it finds there and adds them to the list of pages still to be visited, called the crawl frontier. If a web crawler archives the web pages it visits, it stores the latest version of each page, forming a repository.

So, the process involves:
Check robots.txt –> Visit seeds –> Crawl Frontier –> Repository
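
A minimal sketch of this loop in Python, assuming placeholder fetch_page and extract_links helpers (both hypothetical; a real crawler would plug in an HTTP client and an HTML parser, and check robots.txt before each fetch):

```python
from collections import deque

def fetch_page(url):
    # Hypothetical fetch: a real crawler would issue an HTTP GET here
    # and return the page body.
    return "<html>...</html>"

def extract_links(html):
    # Hypothetical parser: a real crawler would extract href values,
    # e.g. with html.parser or BeautifulSoup.
    return []

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # crawl frontier: URLs waiting to be visited
    visited = set()           # URLs we have already fetched
    repository = {}           # archive of the latest version of each page

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch_page(url)
        repository[url] = html            # store in the repository
        for link in extract_links(html):  # grow the frontier
            if link not in visited:
                frontier.append(link)
    return repository

pages = crawl(["https://example.com/"])  # start from the seeds
```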

Now, a bit about spider traps. A spider trap is a state wherein a web page causes a web crawler to enter a loop, issuing an effectively infinite number of requests (for example, a calendar page that always links to the next day). Spider traps can cause web crawlers to crash, and they result in lost productivity and wasted system resources.
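
Crawlers typically defend against traps with simple limits rather than trying to detect every trap. A sketch with made-up thresholds:

```python
from urllib.parse import urlparse

MAX_DEPTH = 10           # assumed limit: stop following links this deep
MAX_PER_DOMAIN = 1000    # assumed limit: cap pages fetched per host

def should_visit(url, depth, pages_per_domain):
    # A page that links to itself with ever-growing query parameters,
    # or a calendar that links to "next day" forever, will eventually
    # exceed one of these limits and simply be skipped.
    if depth > MAX_DEPTH:
        return False
    host = urlparse(url).netloc
    return pages_per_domain.get(host, 0) < MAX_PER_DOMAIN
```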

Googlebot, Bingbot, Swiftbot, WebCrawler, and Xenon are a few of the prominent web crawlers.

B. Indexing:

Next is indexing. Website indexing is a process where search engines download data from web pages and the links associated with them, and then store that data in a database. There are a couple of methods that can help us get our website indexed, such as adding it to Google Search Console or Bing Webmaster Tools.

There are two main types of website indexing: forward indexing and inverted indexing.

With forward indexing, web crawlers crawl the web pages and build, for each page, a list of the words appearing on it (page -> words). An inverted index flips that mapping around, storing, for each word, the pages on which it appears (word -> pages). The inverted index is the one consulted at query time: when a user enters a search query, the search engine looks up the query terms to find the pages associated with them and returns the relevant web pages. The sketch below builds both.
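
A toy sketch of both index types over two made-up pages:

```python
from collections import defaultdict

# Two made-up pages standing in for crawled content.
docs = {
    "page1.html": "search engines crawl the web",
    "page2.html": "web crawlers index web pages",
}

# Forward index: page -> words appearing on that page.
forward_index = {url: text.split() for url, text in docs.items()}

# Inverted index: word -> pages the word appears on.
inverted_index = defaultdict(set)
for url, words in forward_index.items():
    for word in words:
        inverted_index[word].add(url)

print(sorted(inverted_index["web"]))  # ['page1.html', 'page2.html']
```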

Indexing helps search engines return results quickly; it is tough to imagine a search engine going through every web page on the internet for each search query. Instead, search engines collect the data in advance, index it properly, and return results drawn from billions of pages in milliseconds. Of course, system processing power is also key here: efficient servers are required.
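
Reusing the inverted_index from the sketch above, answering a query becomes a lookup and set intersection rather than a scan of every page:

```python
def lookup(query, inverted_index):
    # Fetch the posting list (set of pages) for each query term and
    # intersect them; pages containing every term are the candidates.
    postings = [inverted_index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

print(lookup("web pages", inverted_index))  # {'page2.html'}
```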

C. Searching:

Lastly, it is about searches done at the user's end. A user enters a search string to search for something on the web; that search string is called a search query. There are three main types of search queries:

  1. Navigational Search Queries
  2. Informational Search Queries
  3. Transactional Search Queries

These query types are discussed in detail in our next article, Types of Search Related Queries.

In conclusion, a web crawler first visits a website's robots.txt file to check which parts of the site it may crawl. Thereafter, it crawls the site, the pages are indexed, and the data is stored in a database (or repository). When a user then enters a search query, the search engine comes up with the most relevant web pages, which may contain answers to the user's query. That's all for now.
