What Is The Search Algorithm?

In computer science, a search algorithm is an algorithm for finding an item with specified properties among a collection of items stored in a computer, examining the data (or an index built over it) to return whatever matches what is wanted.
 
More concretely, a search algorithm retrieves information stored within some data structure. Data structures can include linked lists, arrays, search trees, hash tables, and various other storage methods. The appropriate search algorithm often depends on the data structure being searched.
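To make that point concrete, here is a toy sketch in Python (my own illustration, nothing to do with any particular search engine) of how the same "find this item" question is answered differently depending on the data structure: binary search over a sorted list versus a single probe into a hash table.

```python
from bisect import bisect_left

def binary_search(sorted_items, target):
    """Search a sorted list in O(log n) time by repeatedly halving the range."""
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1  # not found

# The same lookup against a hash table is a single O(1) probe on average.
prices = {"apple": 3, "banana": 1, "cherry": 4}

if __name__ == "__main__":
    print(binary_search([2, 5, 7, 11, 13], 11))  # -> index 3
    print(prices.get("banana"))                  # -> 1
```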
 
There is no one answer (and I couldn't make it public even if there were).

Which parts do you want to know about? There's crawling the web, determining the URLs and content of documents that need to be added to the index. There's extracting terms from the document content and adding the URL for the document to the index for each of those terms. There are algorithms related to handling things like multi-word terms vs documents that simply match several words. There are the algorithms and data structures necessary to store the index in a manner which can be searched quickly. There are the algorithms which provide redundancy and fault tolerance for each component in the system. There's the storage system capable of storing a data structure that is measured in petabytes, etc.
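To give a flavor of just the indexing piece, here is a toy sketch (entirely my own simplification, not Google's pipeline) of the step where terms extracted from document content are added to an inverted index mapping each term to the URLs that contain it:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of URLs whose content contains it.

    `documents` is assumed to be a dict of {url: text}; a real indexer
    would also handle tokenization, stemming, term positions, and scale.
    """
    index = defaultdict(set)
    for url, text in documents.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

if __name__ == "__main__":
    docs = {
        "http://example.com/a": "the quick brown fox",
        "http://example.com/b": "the lazy brown dog",
    }
    index = build_inverted_index(docs)
    print(sorted(index["brown"]))  # both URLs contain "brown"
```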

I suppose the piece you are most likely interested in is the part which receives a query from the user, turns that into a set of terms to be looked up in the index, collects the list of documents that match the set of terms in the query, and then scores each document according to relevance, before sending the highest-scoring 10 back to the browser. But even that is a whole bunch of algorithms. The scoring is what interests content publishers the most, because they want their pages to come back in the first page of results for any relevant query. There are papers published about PageRank, and there are some common-sense things that seem pretty obvious to do in order to eliminate irrelevant pages. But there are also the algorithms used to ask a large number of hosts sitting behind the web server to return a set of matching documents, and the algorithms used to merge the lists from each of those hosts. Then there are the load-balancing algorithms used to ensure that content is spread amongst the cluster in a manner that prevents individual hosts from being overloaded when particular searches get popular. And there is the caching of results, of course. That stuff tends to be more interesting to software and systems engineers, with PageRank simply being a single line item in a vastly more complex system, the details of which are almost irrelevant so long as you feed a document in and get a score out.
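If it helps to see the shape of that query path, here is a deliberately naive sketch of it: turn the query into terms, gather candidate documents from an inverted index (same toy format as the sketch above), score them, and return the top k. The "score" here is just query-term frequency, which is purely a placeholder for real relevance scoring, not anything Google actually uses.

```python
import heapq
from collections import defaultdict

def search(index, documents, query, k=10):
    """Toy query path: query -> terms -> candidate docs -> score -> top k.

    `index` is a {term: set(urls)} inverted index and `documents` is
    {url: text}. The scoring is a naive term-frequency count, a
    stand-in for real relevance scoring.
    """
    terms = query.lower().split()
    candidates = set()
    for term in terms:
        candidates |= index.get(term, set())

    def score(url):
        words = documents[url].lower().split()
        return sum(words.count(t) for t in terms)

    return heapq.nlargest(k, candidates, key=score)

if __name__ == "__main__":
    docs = {
        "http://example.com/a": "the quick brown fox",
        "http://example.com/b": "the lazy brown dog dog",
    }
    index = defaultdict(set)
    for url, text in docs.items():
        for term in text.lower().split():
            index[term].add(url)
    print(search(index, docs, "brown dog"))  # /b outscores /a
```

In the real system this candidate gathering and scoring is fanned out across many index hosts and the per-host lists are merged, which is where the merging and load-balancing algorithms mentioned above come in.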

My point, I guess, is that Google Search is a MASSIVELY complicated system and there is no one algorithm that describes it. Google is leveraging several decades of engineering know-how with regard to building high-performance, highly scalable web serving systems, much of which has been pushed to new extremes internally by the sheer scale of Google's requirements. Additionally, there is the established science of information retrieval, which digs into optimizing the mechanisms by which you can structure extremely large datasets and then look up information within them via optimally fast algorithms (there is usually some trade-off between speed, dataset size, and cost; the science is all about finding the sweet spot between them for your particular application). Much of the basic search engine functionality is based on textbook concepts in information retrieval, with significant innovation added by internal constraints driven by the sheer scale of Google's operation. PageRank itself is just one small part of what is necessary to serve up a page of search results. I haven't even addressed all of Google's additional functionality such as placing relevant ads on the search results page, parsing natural language into a formal query language (try asking Google natural-language questions such as "what is the circumference of the earth" and see what happens), pagination, etc.
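For reference, the published PageRank algorithm itself is small enough to sketch. This is the textbook power-iteration formulation from the original paper, not a claim about what Google's internal scoring "black box" does today:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Ranks start uniform and are repeatedly redistributed along links,
    with a damping factor modeling the "random surfer" jumping anywhere.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank across all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(graph))  # "c" ends up with the highest rank
```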

There are really only a couple of ways you can learn the details of what you are asking. First, simply acquire a decade or two of software and systems engineering experience, hopefully mostly in the context of large-scale web serving. If you do that, you can almost certainly make reasonably accurate guesses as to what the architecture within Google probably looks like. Alternatively, go looking for a job at a search engine company. Google is not at all shy about exposing the family jewels to internal engineers, even if they aren't working directly on search. There are small details, such as the internals of the PageRank black box, which are kept to a limited distribution of folks who actually need to make changes to that code, but for the most part, everyone at Google can see the source code and read the documentation for every project within the company. The architecture of search is covered, at least from a high-altitude perspective, in new-employee training within your first week at the company.

Finally, Google publishes a fair bit of its research, including details of some of its more innovative algorithmic and architectural advances. Search for the white papers that describe the Google File System, BigTable, MapReduce, and others.

GFS - The Google File System (paper on googleusercontent.com)
BigTable - Google Research Publication: BigTable
MapReduce - http://static.googleusercontent....

Google Research - Research at Google
Google Research on Algorithms and Theory - Algorithms and Theory
Note that there is a link to videos of TechTalks on those subjects from that page. Those may be easier to consume than academic papers.

Much of the content at the Google Research website is going to be incomprehensible to someone without graduate-level knowledge of specific topics in computer science, but some of it is also quite easy to comprehend with just a basic understanding of CS and math. The GFS, BigTable, and MapReduce papers fall into that category, I think. And while none of those papers are particularly up to date with the state of the art within Google, they give plenty of information to help you understand the basic scalability and redundancy solutions that are in use within the company today, even if they have been significantly refined and added to over the years. They are certainly required reading for anyone thinking about interviewing for a technical position at the company.
 