How search engines work


How does Search work?
To understand why Google is so fast, it helps to understand some basic fundamentals of search engines. Every search engine has two core components: a crawler and an indexer. The crawler moves across the internet, jumping from one page to another by following hyperlinks. When a new page is found, or an existing page's content changes, the indexer steps in, collects the data and stores it on the search servers.

The data collected by the crawler is processed by the indexer so that it becomes meaningful to the search engine. Keywords, meta descriptions and headers are given priority, because these elements are what the engine uses to show a user possible results for their search.
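As a rough illustration of the crawler/indexer split, here is a minimal Python sketch that crawls a tiny in-memory "web" by following hyperlinks and builds an inverted index of keywords. The URLs and page contents are invented for the example; real crawlers and indexers are vastly more sophisticated.

```python
from collections import defaultdict, deque

# A toy, in-memory "web": page URL -> (text content, outgoing links).
# These pages are purely illustrative.
WEB = {
    "http://example.com/a": ("google search engine basics",
                             ["http://example.com/b"]),
    "http://example.com/b": ("crawlers follow hyperlinks to new pages",
                             ["http://example.com/a", "http://example.com/c"]),
    "http://example.com/c": ("the indexer stores keywords for fast lookup", []),
}

def crawl(seed):
    """Breadth-first crawl: follow hyperlinks starting from the seed page."""
    seen, queue, pages = set(), deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url in seen or url not in WEB:
            continue
        seen.add(url)
        text, links = WEB[url]
        pages[url] = text          # hand the page content to the indexer
        queue.extend(links)        # jump to the linked pages
    return pages

def build_index(pages):
    """Inverted index: keyword -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

index = build_index(crawl("http://example.com/a"))
print(sorted(index["indexer"]))    # pages that mention "indexer"
```

At query time the engine only has to look up the keyword in the prebuilt index instead of scanning pages, which is what makes answering a search fast.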

Why is Google so Fast?
One of the biggest problems when it comes to search speed is the limitation of mechanical hard drives. Solid-state drives are much faster but far more expensive. Servers usually use 10,000 RPM or even 15,000 RPM hard drives, while a regular desktop-grade drive spins at 7,200 RPM at most. To speed things up further, server engineers prefer to run drives in RAID configurations, which can double the read speed.

Basically, two or more hard drives are paired, the data is copied onto both of them, and reads are served from both. For example, half of a file is read from one drive while the other half is read from the second drive in parallel. This roughly doubles the read speed and also creates redundancy: if one hard drive fails, the other still holds the data.
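A minimal Python sketch of that mirrored-read idea, using two copies of a file as stand-ins for the two drives: each half is fetched from a different copy in parallel, and either copy alone still contains all the data. The file names and sizes here are invented for the example.

```python
import os, tempfile
from concurrent.futures import ThreadPoolExecutor

def read_range(path, offset, length):
    """Read `length` bytes starting at `offset` from one mirror copy."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Create two identical "mirror" files standing in for the two drives.
data = os.urandom(1 << 20)               # 1 MiB of sample data
mirrors = []
for _ in range(2):
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    mirrors.append(path)

half = len(data) // 2
with ThreadPoolExecutor(max_workers=2) as pool:
    first = pool.submit(read_range, mirrors[0], 0, half)                   # first half from drive A
    second = pool.submit(read_range, mirrors[1], half, len(data) - half)   # second half from drive B
    result = first.result() + second.result()

assert result == data                    # same bytes, fetched from two copies in parallel

for path in mirrors:                     # clean up the temporary "drives"
    os.remove(path)
```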

This type of technology enables cost-effective server solutions that are much faster than what users have inside their own computers, but it is not enough to make a search engine as fast as Google is today.
Internet speed is also important, but it is less about bandwidth than about latency. What matters is not moving large volumes of data over the internet at once, but sending small chunks of data as quickly as possible so that Google can answer back just as quickly.

When you perform a search, data is sent in real time; the browser does not wait for you to finish typing and hit the search button. The information reaches Google's servers sooner, and so the results come back sooner.
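The idea can be sketched as follows in Python, with an invented in-memory suggestion list standing in for Google's servers: a small request goes out after every keystroke rather than one large request when the user finally hits Enter, so low latency matters more than raw bandwidth.

```python
# Hypothetical suggestion data standing in for Google's servers.
SUGGESTIONS = ["weather today", "weather tomorrow", "web search", "website builder"]

def query_server(prefix):
    """Pretend round trip: return results matching the prefix typed so far."""
    return [s for s in SUGGESTIONS if s.startswith(prefix)]

def type_query(full_query):
    """Fire a small request after every keystroke instead of waiting for Enter."""
    prefix = ""
    for char in full_query:
        prefix += char
        results = query_server(prefix)   # tiny payload, latency dominates
        print(f"{prefix!r:10} -> {results}")

type_query("web")
```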

Role of Data Centers and SDNs

Google has been building its own software-defined data-center networks for 10 years because traditional gear can't handle the scale of what are essentially warehouse-sized computers. The company hasn't said much before about that homegrown infrastructure, but one of its networking chiefs provided some details on Wednesday at the Open Networking Summit and in a blog post.

The current network design, which powers all of Google's data centers, has a maximum capacity of 1.13 petabits per second. That's more than 100 times as much as the first data-center network Google developed 10 years ago. The network is a hierarchical design with three tiers of switches, but they all use the same commodity chips. And it's not controlled by standard protocols but by software that treats all the switches as one.

Networking is critical in Google's data centers, where tasks are distributed across pools of computing and storage, said Amin Vahdat, Google Fellow and networking technical lead. The network is what lets Google make the best use of all those components. But the need for network capacity in the company's data centers has grown so fast that conventional routers and switches can't keep up.


"The amount of bandwidth that we have to deliver to our servers is  outpacing even Moore's Law," Vahdat said. Over the past six years, it's  grown by a factor of 50. In addition to keeping up with computing power,  the networks will need ever higher performance to take advantage of  fast storage technologies using flash and non-volatile memory, he said.
Back when Google was using traditional gear from vendors, the size of the network was defined by the biggest router the company could buy. And when a bigger one came along, the network had to be rebuilt, Vahdat said. Eventually, that approach stopped working.

"We could not buy, for any price, a data-center network that would meet  the requirements of our distributed systems," Vahdat said. Managing  1,000 individual network boxes made Google's operations more complex,  and replacing a whole data center's network was too disruptive.
So the company started building its own networks using generic hardware, centrally controlled by software. It used a so-called Clos topology, a mesh architecture with multiple paths between devices, and equipment built with merchant silicon, the kinds of chips that generic white-box vendors use. The software stack that controls it is Google's own but works through the open-source OpenFlow protocol.
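A rough Python sketch of the multipath property of a Clos (leaf-spine) fabric, with invented switch names: every leaf connects to every spine, so traffic between any two racks can be spread over several equal-length paths, and the whole fabric can be described by one central table rather than per-box configuration. This is only an illustration of the topology, not Google's actual software.

```python
from itertools import product

# A small two-tier Clos (leaf-spine) fabric: every leaf connects to every spine.
SPINES = ["spine1", "spine2", "spine3", "spine4"]
LEAVES = ["leaf1", "leaf2", "leaf3"]

# Central software view: one table describing the whole fabric.
links = {(leaf, spine) for leaf, spine in product(LEAVES, SPINES)}

def paths(src_leaf, dst_leaf):
    """All equal-length paths between two leaves: one per spine they share."""
    return [(src_leaf, spine, dst_leaf)
            for spine in SPINES
            if (src_leaf, spine) in links and (dst_leaf, spine) in links]

# Traffic between two racks can be spread (ECMP-style) over every spine.
for p in paths("leaf1", "leaf3"):
    print(" -> ".join(p))
```

Because capacity scales by adding more spines and leaves rather than buying a bigger router, a fabric like this can grow without the forklift upgrades described above.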


Google started with a project called Firehose 1.0, which it couldn't implement in production but learned from, Vahdat said. At the time, there were no good protocols with multiple paths between destinations and no good open-source networking stacks, so Google developed its own. The company is now using a fifth-generation homegrown network, called Jupiter, with 40-Gigabit Ethernet connections and a hierarchy of top-of-rack, aggregation and spine switches.

The design lets Google upgrade its networks without disrupting a data center's operation, Vahdat said. "I have to be constantly refreshing my infrastructure, upgrading the network, having the old live with the new."
Google is now opening up the network technology it took a decade to develop so other developers can use it.
