To understand how a search engine works, we need to understand the processes behind it. There are mainly three processes:
- Web Crawling
- Indexing
- Searching
We will now discuss each of these three processes in detail to understand the functioning of a search engine.
A. Web Crawling:
It is a process wherein web crawlers visit and download the content of a website; the results are later indexed and used by search engines to provide efficient searches. A web crawler is also known as a web spider.
If we don't want the web crawler to access all the pages of our website, or if we want to tell it which parts of our website may be crawled, we can do this with the help of a robots.txt file. Web crawlers are expected to identify themselves to the server (through their user-agent name) when they come looking for our website, and the robots.txt file defines whether we have given a particular web crawler full or limited access.
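To make the robots.txt check concrete, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the crawler names `FriendlyBot` and `ExampleBot`, and the example.com URLs are all made up for illustration; a real crawler would download the file from the site before requesting anything else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this example).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A crawler identifies itself by its user-agent name when asking for access.
allowed_public = parser.can_fetch("FriendlyBot", "https://example.com/blog/post")
allowed_private = parser.can_fetch("FriendlyBot", "https://example.com/private/data")
allowed_blocked_bot = parser.can_fetch("ExampleBot", "https://example.com/blog/post")

print(allowed_public, allowed_private, allowed_blocked_bot)
```

In this sketch, `FriendlyBot` has limited access (everything except `/private/`), while `ExampleBot` is denied access to the whole site.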
A web crawler usually starts with a list of URLs that it already has; we call them seeds. When the crawler visits these URLs and finds more links, it adds the newly found URLs to its list of pages to visit, which is called the crawl frontier. If the web crawler archives web pages, it stores the latest version of each visited page, and together these form a repository.
So, the process involves: Check robots.txt --> Visit seeds --> Crawl Frontier --> Repository
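The flow from seeds to crawl frontier to repository can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `FAKE_WEB` is a made-up in-memory stand-in for the internet, the `fetch` and `is_allowed` parameters are hypothetical hooks (the latter standing in for the robots.txt check), and a real crawler would also handle network errors, politeness delays, and HTML parsing.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page content, outgoing links).
FAKE_WEB = {
    "https://example.com/": ("home", ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("page a", ["https://example.com/b", "https://example.com/c"]),
    "https://example.com/b": ("page b", ["https://example.com/"]),
    "https://example.com/c": ("page c", []),
}


def crawl(seeds, fetch, is_allowed=lambda url: True, max_pages=100):
    """Breadth-first crawl starting from the seed URLs."""
    frontier = deque(seeds)   # crawl frontier: URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, so none is queued twice
    repository = {}           # latest stored version of each visited page
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        if not is_allowed(url):   # stand-in for the robots.txt check
            continue
        page = fetch(url)
        if page is None:          # page could not be fetched
            continue
        content, links = page
        repository[url] = content
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


repository = crawl(["https://example.com/"], FAKE_WEB.get)
print(list(repository))
```

The `seen` set is what keeps the crawler from re-queuing a page it has already discovered, such as the link from page b back to the home page.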
Now, we will discuss a bit about spider traps. A spider trap is a set of web pages that may cause a web crawler to enter a loop by making an endless number of requests, for example a calendar page that always links to the next day. Spider traps can cause a poorly written web crawler to crash, and they result in loss of productivity and waste of precious resources.
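One common way to defend against spider traps is to cap how deep the crawler follows links and how many pages it takes from a single host. The sketch below illustrates that idea and is not how any particular crawler does it. `trap_fetch` is a made-up function that simulates a trap, and the limits `max_depth` and `max_pages_per_host` are arbitrary example values.

```python
from collections import deque
from urllib.parse import urlparse


def trap_fetch(url):
    """Hypothetical trap: every page links to one more never-ending URL."""
    return [url.rstrip("/") + "/next"]


def crawl_with_limits(seed, fetch_links, max_depth=5, max_pages_per_host=50):
    """Crawl from one seed, refusing to go too deep or take too many pages per host."""
    frontier = deque([(seed, 0)])   # (URL, link depth from the seed)
    seen = {seed}
    pages_per_host = {}
    visited = []
    while frontier:
        url, depth = frontier.popleft()
        host = urlparse(url).netloc
        if pages_per_host.get(host, 0) >= max_pages_per_host:
            continue                # per-host budget is used up
        pages_per_host[host] = pages_per_host.get(host, 0) + 1
        visited.append(url)
        if depth >= max_depth:
            continue                # visit the page, but follow no deeper links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited


visited = crawl_with_limits("https://example.com/calendar", trap_fetch, max_depth=3)
print(len(visited), visited[-1])
```

Without either limit, the loop above would never finish on `trap_fetch`, because every visited page adds a new, unseen URL to the frontier.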
Googlebot, Bingbot, Swiftbot, WebCrawler, and Xenon are a few of the prominent web crawlers.
B. Indexing:
Indexing, or website indexing, is a process wherein search engines download data from web pages and the links associated with them, and then store the downloaded data in a database. There are a couple of methods with which we could get our website indexed; we will discuss them in detail in another article.
There are mainly two types of website indexing: Forward Indexing and Inverted Indexing.
With Forward Indexing, web crawlers crawl the web pages and the search engine builds, for each page, a list of the words appearing on it. Inverted Indexing reverses this mapping: for each word, it stores the list of pages that contain the word. The inverted index is what gets used at the user end. When a user enters a search query, the search engine looks up the words of the query in the inverted index and returns the relevant web pages.
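The difference between the two index types is easier to see with a tiny example. The following is a minimal sketch, assuming three made-up pages in `PAGES` and plain whitespace splitting in place of real text processing; the function names are illustrative only. The `search` function returns pages that contain every word of the query.

```python
from collections import defaultdict

# Hypothetical crawled pages: page id -> page text.
PAGES = {
    "page1": "search engines crawl the web",
    "page2": "crawlers index web pages",
    "page3": "users enter a search query",
}


def build_forward_index(pages):
    """Forward index: page -> list of words appearing on that page."""
    return {page_id: text.lower().split() for page_id, text in pages.items()}


def build_inverted_index(forward_index):
    """Inverted index: word -> set of pages that contain the word."""
    inverted = defaultdict(set)
    for page_id, words in forward_index.items():
        for word in words:
            inverted[word].add(page_id)
    return dict(inverted)


def search(inverted_index, query):
    """Return the pages that contain every word of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(inverted_index.get(terms[0], set()))
    for term in terms[1:]:
        results &= inverted_index.get(term, set())
    return results


forward_index = build_forward_index(PAGES)
inverted_index = build_inverted_index(forward_index)

print(sorted(search(inverted_index, "web")))
print(sorted(search(inverted_index, "search query")))
```

Answering a query only touches the entries for the query's words, so the search engine does not have to re-read every page for every search.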
Indexing helps search engines return results quickly. It is tough to imagine search engines going through all the web pages that exist on the internet every single time to get results for our search query. Instead, they collect the data in advance, index it properly, and return search results drawn from billions of pages in a fraction of a second.
C. Searching:
The final process is about searches done at the user's end. When a user enters a search string to get relevant information, that search string is called a search query. There are mainly three types of search-related queries. They are:
Navigational Search Queries, Informational Search Queries, and Transactional Search Queries.
These search-related queries are discussed in detail in our next article - Types of Search Related Queries.
In conclusion, a web crawler visits the robots.txt file to check which parts of a website it can visit. Thereafter, the website's pages are crawled, indexed, and stored in a database (or repository). A user then enters a search query, and the search engine comes up with the most relevant web pages associated with it. So, more or less, this article sums up how search engines work.