Interested in learning how Google, Bing, or Yahoo work? Wondering what it takes to crawl the web, and what a simple web crawler looks like? In under 50 lines of Python (version 3) code, here's a simple web crawler! (The full source with comments is at the bottom of this article.)
And let's see how it is run. Notice that you enter a starting website, a word to find, and the maximum number of pages to search through.
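To make the call shape concrete, here's a hedged sketch of what that invocation looks like. The function name `spider` comes from the article, but this stub's body and exact parameter names are illustrative placeholders, not the real crawler (which appears in the full source at the bottom).

```python
# Placeholder illustrating the three arguments the crawler takes:
# a starting website, a word to find, and a maximum number of pages.
# The real spider() does the crawling; this stub only echoes its inputs.
def spider(start_url, word, max_pages):
    return "Searching up to {} pages from {} for {!r}".format(
        max_pages, start_url, word)

print(spider("http://example.com", "python", 100))
```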
Okay, but how does it work?
Let's first talk about what a web crawler's purpose is. As described on the Wikipedia page, a web crawler is a program that browses the World Wide Web in a methodical fashion collecting information. What sort of information does a web crawler collect? Typically two things:
- Web page content (the text and multimedia on a page)
- Links (to other web pages on the same website, or to other websites entirely)
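Both of those collection tasks can be handled by Python's built-in `html.parser` module. Here's a minimal sketch of a parser that pulls out a page's text and its links in one pass; the class and attribute names are my own, not the article's source code.

```python
from html.parser import HTMLParser

class PageScraper(HTMLParser):
    """Collect a page's text fragments and its links in one pass."""
    def __init__(self):
        super().__init__()
        self.links = []   # href values from <a> tags
        self.text = []    # visible text fragments on the page

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

scraper = PageScraper()
scraper.feed('<html><body><p>Hello crawler</p>'
             '<a href="http://example.com/next">next page</a></body></html>')
print(scraper.links)  # ['http://example.com/next']
print(scraper.text)   # ['Hello crawler', 'next page']
```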
Which is exactly what this little "robot" does. It starts at the website that you type into the spider() function and looks at all the content on that website. This particular robot doesn't examine any multimedia; instead, it looks only for "text/html", as described in the code. Each time it visits a web page it collects two sets of data: all the text on the page, and all the links on the page. If the word isn't found in the text on the page, the robot takes the next link in its collection and repeats the process, again collecting the text and the set of links on the next page. Again and again, repeating the process, until the robot has either found the word or has run into the limit that you typed into the spider() function.
Is this how Google works?
Sort of. Google has a whole fleet of web crawlers constantly crawling the web, and crawling is a big part of discovering new content (or keeping up to date with websites that are constantly changing or adding new stuff). However, you probably noticed that this search took a while to complete, maybe a few seconds. On more difficult search words it might take even longer. There's another big component to search engines called indexing. Indexing is what you do with all the data that the web crawler collects. Indexing means that you parse (go through and analyze) the web page content and create a big collection (think database or table) of easily accessible and quickly retrievable information. So when you visit Google and type in "kitty cat", your search word is going straight* to the collection of data that has already been crawled, parsed, and analyzed. In fact, your search results are already sitting there waiting for that one magic phrase of "kitty cat" to unleash them. That's why you can get over 14 million results within 0.14 seconds.
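A toy version of that indexing step makes the difference clear: the expensive work of parsing each page's words happens once, up front, and every later lookup is just a dictionary access. This is my own illustrative sketch (the page contents are invented sample data), not how Google's index actually works.

```python
# Build an "inverted index": a map from each word to the set of URLs
# whose text contains it, so a search never re-reads the pages.
def build_index(pages):
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

# Invented sample data standing in for already-crawled pages.
pages = {
    "http://a.test": "kitty cat pictures",
    "http://b.test": "cat care tips",
}
index = build_index(pages)
print(index["cat"])  # both URLs, retrieved instantly, no crawling needed
```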
*Your search terms actually visit a number of databases simultaneously such as spell checkers, translation services, analytic and tracking servers, etc.
Let's look at the code in more detail!
The following code should be fully functional for Python 3.x. It was written and tested with Python 3.2.2 in September 2011. Go ahead and copy+paste this into your Python IDE and run it or modify it!
If Python is your thing, a good book on the language is a great investment.