There is a city. In the city, there are many blocks, and each block has many firms. Each firm, in turn, has different departments: sales, operations, HR, and so on. Like every other city, the blocks are connected to each other through a wide network of major and minor roads.
One fine day, Mr. Mark decided to meet Mr. Chang. He didn't know Mr. Chang's location, so he approached the employees' association. After looking at their employee database, the authority at the association told him the location and the directions to Mr. Chang: the road leading to the block where his firm is situated and, consequently, the department where his office is situated.
Have you ever wondered how the employee database gathers all this information and, above all, organises it and makes it available instantly? The way Google works and processes millions of websites is quite similar, except that rather than roads, blocks, and firms, Google searches web addresses on the internet.
The Simple Crawl
Every search engine has crawlers: programs that rove around the internet looking for interlinked web pages. After finding each page, crawlers feed it back into the search engine's database. Going back to my question, "How is such a large amount of information gathered?" The answer is crawlers. Crawlers are like the searchers allocated by the employees' association. They visit every road in the city, look into every block and every firm in that block, and finally aggregate information about the employees working in the various departments of each firm. The name of every employee (webpages, in Google's case) is fed back to the database.
Technically speaking, a crawl scheduling system de-duplicates pages and ranks them by significance for later indexing. While it's there, it accumulates a list of all the pages each page links to. Internal links the crawler may follow straight away to other pages; external links are placed into a database for later.
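The link-gathering step above can be sketched in a few lines of Python. The page HTML and URLs here are made up for illustration; a real crawler would fetch pages over HTTP and queue the external links it finds for later visits.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects every <a href> on a page, the way a crawler would."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def classify_links(page_url, html):
    """Split a page's links into internal (same host) and external."""
    parser = LinkCollector(page_url)
    parser.feed(html)
    host = urlparse(page_url).netloc
    internal, external = set(), set()  # sets de-duplicate repeated links
    for link in parser.links:
        (internal if urlparse(link).netloc == host else external).add(link)
    return internal, external

# Hypothetical page: two copies of the same internal link, one external.
page = ('<a href="/about">About</a> <a href="/about">About again</a> '
        '<a href="https://forbes.com">Forbes</a>')
internal, external = classify_links("https://example.com/index.html", page)
```

Note how the duplicate `/about` link collapses into one entry: that is the de-duplication the scheduling system performs before anything is queued for indexing.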
Things were going fine for the employees' association, but as the number of firms, and as a result employees, increased, the need for a better system emerged. Mr. Mark had an easy time searching for Mr. Chang. Not anymore: now a search for "Chang" in the database returns more than a hundred results. To cope with the situation, the authorities at the association decided to prioritise the placement of results for a name. So the Chang who is a director was put in first place, while the one who is a burglar was put last on the list.
Google faces a similar situation, except a way more complicated one, where not hundreds but thousands to millions of results are displayed for a keyword. Therefore, as soon as the link graph gets assembled, the search engine pulls all the links from the database and rates them as per their authority. The rating, let's say, is out of 10: 10 for an authoritative link like Forbes.com, 5 for a mediocre link, and 1 for a spam link. The more authoritative links point to a webpage, the better its ranking in search results. This method of passing link authority is known as link juice.
For example, if a 7-rated website links to a 5-rated site, the former passes juice to the latter. The latter, as a result, will benefit in terms of search rank. The inverse is also true.
If, out of 10 links pointing to a website, 4 are rated in the 7-10 range and the rest in the 1-3 range, then that website will rank better than one with 5 links rated 8 and the rest spam.
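The juice-passing idea can be illustrated with a toy calculation in the spirit of PageRank: every page repeatedly hands a share of its own score to the pages it links to, so links from well-scored pages are worth more. The graph, the damping factor, and the site names below are illustrative assumptions, not Google's actual parameters.

```python
def link_juice(graph, iterations=20, damping=0.85):
    """Toy link-juice scores. `graph` maps each page to its outgoing links."""
    pages = list(graph)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page keeps a small base score...
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in graph.items():
            if outlinks:
                # ...and passes the rest, split evenly among its outlinks.
                share = damping * score[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
        score = new
    return score

# Hypothetical link graph: two sites link to goodsite.com, one to spamtarget.com.
graph = {
    "forbes.com": ["goodsite.com"],
    "blog.com": ["goodsite.com"],
    "spam.com": ["spamtarget.com"],
    "goodsite.com": [],
    "spamtarget.com": [],
}
scores = link_juice(graph)
```

Running this, `goodsite.com` ends up with a higher score than `spamtarget.com`, because it receives juice from two sources rather than one: the same effect the rating example above describes.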
Robots.txt and Sitemaps
Suppose an employee is on leave. The searcher, unaware, will still search for him. To do so, he has to go to the firm and then to the office. There is no point in looking for him in the first place. Thus, the manager of his department put a board outside the department declaring the employee on leave. The searcher visits the firm, but on seeing the board leaves, saving his time.
Robots.txt has a similar function. It tells the search engine not to crawl a page. Although the search engine can still see links to that page and count them, it won't be able to see what pages that page links to; it will, however, still add link value metrics for the page, which affects the domain as a whole.
A sitemap is more like a list pasted beside a firm's gate in the city, listing the names of the employees, department by department.
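Python's standard library can read such a "board" directly. The robots.txt content and URLs below are hypothetical; a real crawler would download the file from the site's root before fetching anything else.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: the "on leave" board telling crawlers
# which paths not to visit.
robots_txt = """\
User-agent: *
Disallow: /on-leave/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

rp.can_fetch("*", "https://example.com/on-leave/chang")  # False
rp.can_fetch("*", "https://example.com/sales/mark")      # True
```

A crawler checks `can_fetch` before requesting each URL; a disallowed path is simply skipped, exactly as the searcher skips the department with the board outside.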
Using 404 or 410 to Remove Pages
What if the employee on leave never comes back? The manager now has to change the board from "absent" to "not working here anymore". "Absent" could have worked, but it would make the searcher come back after some time looking for the employee. "Not working here anymore" ensures that the searcher will never search for that employee again, and he can be deleted from the database.
With a 404, a website gives a clear message that a page isn't there anymore. There is only one conclusive method to halt the flow of link value at the end point: deleting the page. A 410 is more absolute than a 404, and both will eventually cause the page to be dropped from the index.
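The different treatment of 404 ("absent") and 410 ("not working here anymore") can be sketched as a simple decision function. The action names are made up for illustration; real crawlers have far more nuanced retry policies.

```python
def recrawl_action(status: int) -> str:
    """What a crawler might do when revisiting a known URL (illustrative)."""
    if status == 404:          # gone, but may be retried a few more times
        return "drop-after-retries"
    if status == 410:          # gone for good: remove from the index sooner
        return "drop-immediately"
    if 200 <= status < 300:    # page still there: refresh the index entry
        return "reindex"
    return "retry-later"       # server errors, redirects, etc.
```

The only behavioural difference is how quickly the page leaves the index: both codes end in removal, but 410 skips the "maybe it comes back" grace period.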
Removing Pages with NoIndex
Noindex is more definitive than robots.txt, but less so than a 404. Unlike robots.txt, a noindex tag doesn't prevent crawling: the search engine can still access the page, but it is told to leave the page out of the index. The search engine can also still give ratings to the links pointing to that page.
Some employees, citing various reasons, decided not to be associated with the association anymore. They started wearing a badge: "We are no longer associated with the XYZ employees' association." The searcher can see them, but after noticing the badge will not feed their names back to the database.
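A crawler can spot the "badge" by scanning a page's meta tags. This is a minimal sketch using Python's standard HTML parser; real pages can also send noindex in an X-Robots-Tag HTTP header, which this sketch ignores.

```python
from html.parser import HTMLParser

class NoindexDetector(HTMLParser):
    """Looks for <meta name="robots" content="...noindex..."> on a page."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            name = (d.get("name") or "").lower()
            content = (d.get("content") or "").lower()
            if name == "robots" and "noindex" in content:
                self.noindex = True

def has_noindex(html: str) -> bool:
    detector = NoindexDetector()
    detector.feed(html)
    return detector.noindex

has_noindex('<meta name="robots" content="noindex, follow">')  # True
has_noindex('<meta name="robots" content="index, follow">')    # False
```

The page is fetched and read as usual; only at indexing time is it set aside, just as the searcher sees the badge-wearing employee but leaves his name out of the database.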
The way a search engine works is quite similar to human society. Everything is interconnected to form a network. Networks are interlinked, and a problem with any node can disrupt part of the network, much like when a road is closed in a city and a block becomes inaccessible.
Jignesh is a digital marketing analyst and strategist. He helps small businesses and start-ups find the right online marketing tactics. He is passionate about everything digital, and his life revolves around the online marketing world.