How search engines work


How does Search work?
To understand why Google is so fast, it helps to understand some basic fundamentals of search engines. Every search engine has two core components: a crawler and an indexer. The crawler moves across the internet, jumping from one page to another by following hyperlinks. When a new page is found, or an existing page's content changes, the indexer steps in, collects the data and stores it on the search servers.

The data collected by the crawler is processed by the indexer so that it becomes meaningful to the search engine. Keywords, meta descriptions and headers are given priority, because these elements are what the engine uses to show a user possible results for their search.
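As a rough illustration of the crawler/indexer split, here is a minimal Python sketch that crawls a tiny in-memory "web" by following hyperlinks and builds an inverted index of keywords. The URLs and page contents are invented for the example; real crawlers and indexers are vastly more sophisticated.

```python
from collections import defaultdict, deque

# A toy, in-memory "web": page URL -> (text content, outgoing links).
# These pages are purely illustrative.
WEB = {
    "http://example.com/a": ("google search engine basics",
                             ["http://example.com/b"]),
    "http://example.com/b": ("crawlers follow hyperlinks to new pages",
                             ["http://example.com/a", "http://example.com/c"]),
    "http://example.com/c": ("the indexer stores keywords for fast lookup", []),
}

def crawl(seed):
    """Breadth-first crawl: follow hyperlinks starting from the seed page."""
    seen, queue, pages = set(), deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url in seen or url not in WEB:
            continue
        seen.add(url)
        text, links = WEB[url]
        pages[url] = text          # hand the page content to the indexer
        queue.extend(links)        # jump to the linked pages
    return pages

def build_index(pages):
    """Inverted index: keyword -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

index = build_index(crawl("http://example.com/a"))
print(sorted(index["indexer"]))    # pages that mention "indexer"
```

At query time the engine only has to look up the keyword in the prebuilt index instead of scanning pages, which is what makes answering a search fast.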

Why is Google so Fast?
One of the biggest problems when it comes to search speed is the limitation of mechanical hard drives. Solid-state drives are much faster but far more expensive. Servers usually use 10,000 RPM or even 15,000 RPM hard drives, while a regular desktop-grade drive spins at 7,200 RPM at most. To speed things up further, server engineers prefer to run drives in RAID configurations, which can double the read speed.

Basically, two or more hard drives are paired, the data is copied onto both of them, and reads are served from both. For example, half of a file is read from one drive while the other half is read from the second drive in parallel. This roughly doubles the read speed and also creates redundancy: if one hard drive fails, the other still holds the data.
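A minimal Python sketch of that mirrored-read idea, using two copies of a file as stand-ins for the two drives: each half is fetched from a different copy in parallel, and either copy alone still contains all the data. The file names and sizes here are invented for the example.

```python
import os, tempfile
from concurrent.futures import ThreadPoolExecutor

def read_range(path, offset, length):
    """Read `length` bytes starting at `offset` from one mirror copy."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Create two identical "mirror" files standing in for the two drives.
data = os.urandom(1 << 20)               # 1 MiB of sample data
mirrors = []
for _ in range(2):
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    mirrors.append(path)

half = len(data) // 2
with ThreadPoolExecutor(max_workers=2) as pool:
    first = pool.submit(read_range, mirrors[0], 0, half)                   # first half from drive A
    second = pool.submit(read_range, mirrors[1], half, len(data) - half)   # second half from drive B
    result = first.result() + second.result()

assert result == data                    # same bytes, fetched from two copies in parallel

for path in mirrors:                     # clean up the temporary "drives"
    os.remove(path)
```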

This type of technology enables cost-effective server solutions that are much faster than what users have inside their own computers, but it is not enough to make a search engine as fast as Google is today.
Internet speed is also important, but it is less about bandwidth than about latency. What matters is not moving large volumes of data over the internet at once, but sending small chunks of data as quickly as possible so that Google can answer back just as quickly.

When you perform a search, data is sent in real time; the browser does not wait for you to finish typing and hit the search button. The information reaches Google's servers sooner, and so the results come back sooner.
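The idea can be sketched as follows in Python, with an invented in-memory suggestion list standing in for Google's servers: a small request goes out after every keystroke rather than one large request when the user finally hits Enter, so low latency matters more than raw bandwidth.

```python
# Hypothetical suggestion data standing in for Google's servers.
SUGGESTIONS = ["weather today", "weather tomorrow", "web search", "website builder"]

def query_server(prefix):
    """Pretend round trip: return results matching the prefix typed so far."""
    return [s for s in SUGGESTIONS if s.startswith(prefix)]

def type_query(full_query):
    """Fire a small request after every keystroke instead of waiting for Enter."""
    prefix = ""
    for char in full_query:
        prefix += char
        results = query_server(prefix)   # tiny payload, latency dominates
        print(f"{prefix!r:10} -> {results}")

type_query("web")
```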

Role of Data Centers and SDNs

Google has been building its own software-defined data-center networks for 10 years because traditional gear can't handle the scale of what are essentially warehouse-sized computers. The company hasn't said much before about that homegrown infrastructure, but one of its networking chiefs provided some details on Wednesday at the Open Networking Summit and in a blog post.

The current network design, which powers all of Google's data centers, has a maximum capacity of 1.13 petabits per second. That's more than 100 times as much as the first data-center network Google developed 10 years ago. The network is a hierarchical design with three tiers of switches, but they all use the same commodity chips. And it's not controlled by standard protocols but by software that treats all the switches as one.

Networking is critical in Google's data centers, where tasks are distributed across pools of computing and storage, said Amin Vahdat, Google Fellow and networking technical lead. The network is what lets Google make the best use of all those components. But the need for network capacity in the company's data centers has grown so fast that conventional routers and switches can't keep up.


"The amount of bandwidth that we have to deliver to our servers is  outpacing even Moore's Law," Vahdat said. Over the past six years, it's  grown by a factor of 50. In addition to keeping up with computing power,  the networks will need ever higher performance to take advantage of  fast storage technologies using flash and non-volatile memory, he said.
Back when Google was using traditional gear from vendors, the size of the network was defined by the biggest router the company could buy. And when a bigger one came along, the network had to be rebuilt, Vahdat said. Eventually, that approach stopped working.

"We could not buy, for any price, a data-center network that would meet  the requirements of our distributed systems," Vahdat said. Managing  1,000 individual network boxes made Google's operations more complex,  and replacing a whole data center's network was too disruptive.
So the company started building its own networks using generic hardware, centrally controlled by software. It used a so-called Clos topology, a mesh architecture with multiple paths between devices, and equipment built with merchant silicon, the kinds of chips that generic white-box vendors use. The software stack that controls it is Google's own but works through the open-source OpenFlow protocol.
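A rough Python sketch of the multipath property of a Clos (leaf-spine) fabric, with invented switch names: every leaf connects to every spine, so traffic between any two racks can be spread over several equal-length paths, and the whole fabric can be described by one central table rather than per-box configuration. This is only an illustration of the topology, not Google's actual software.

```python
from itertools import product

# A small two-tier Clos (leaf-spine) fabric: every leaf connects to every spine.
SPINES = ["spine1", "spine2", "spine3", "spine4"]
LEAVES = ["leaf1", "leaf2", "leaf3"]

# Central software view: one table describing the whole fabric.
links = {(leaf, spine) for leaf, spine in product(LEAVES, SPINES)}

def paths(src_leaf, dst_leaf):
    """All equal-length paths between two leaves: one per spine they share."""
    return [(src_leaf, spine, dst_leaf)
            for spine in SPINES
            if (src_leaf, spine) in links and (dst_leaf, spine) in links]

# Traffic between two racks can be spread (ECMP-style) over every spine.
for p in paths("leaf1", "leaf3"):
    print(" -> ".join(p))
```

Because capacity scales by adding more spines and leaves rather than buying a bigger router, a fabric like this can grow without the forklift upgrades described above.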


Google started with a project called Firehose 1.0, which it couldn't implement in production but learned from, Vahdat said. At the time, there were no good protocols with multiple paths between destinations and no good open-source networking stacks, so Google developed its own. The company is now using a fifth-generation homegrown network, called Jupiter, with 40-Gigabit Ethernet connections and a hierarchy of top-of-rack, aggregation and spine switches.

The design lets Google upgrade its networks without disrupting a data center's operation, Vahdat said. "I have to be constantly refreshing my infrastructure, upgrading the network, having the old live with the new."
Google is now opening up the network technology it took a decade to develop so other developers can use it.
