
Creating a simple web crawler

When I’m not working on side projects or at work, I like to tinker with small programs to figure out how the things we use every day work. Lately I’ve been curious about a few things at once, two of them being page ranking algorithms and how to use Redis as a persistence layer.

I started by writing a simple scraper using Python and the BeautifulSoup and Requests libraries. The scraper hits Hacker News (http://news.ycombinator.com) and grabs all the stories off the first page.
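As a rough sketch of what such a scraper could look like (this is not the original script), the code below pulls the external links out of a page's HTML and appends them to a Redis list. To keep it dependency-free it swaps BeautifulSoup for the standard library's `html.parser`. The names `LinkCollector`, `external_links`, and `push_links` are made up for illustration, and `client` is assumed to be something like `redis.Redis()` from the redis-py package.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, base_url):
    """Return absolute http(s) links whose host differs from base_url's."""
    base_host = urlparse(base_url).netloc
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parts = urlparse(absolute)
        is_web = parts.scheme in ("http", "https")
        if is_web and parts.netloc != base_host and absolute not in links:
            links.append(absolute)
    return links


def push_links(client, key, links):
    """Append links to the Redis list stored at key (assumed client API: rpush)."""
    if links:
        client.rpush(key, *links)
```

Against the live site, the HTML could come from `requests.get("http://news.ycombinator.com", timeout=10).text`, and the result could be stored with `push_links(redis.Redis(), "hn:links", links)`. The key name `hn:links` is a placeholder.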

As it stands, the scraper simply grabs all links that are external to the Hacker News domain and pushes them to Redis. While it may not be the best way to scrape a web page, it certainly was the easiest way to go about things. Next, we need to gather data from each linked page to try to determine what the page is about. A separate gist gives another simple Python solution to this problem.

This script takes a URL that was scraped by the previous script and parses the text of that page. It grabs all the content inside <p> tags and splits it into words, then tallies the words in a dictionary: if a word is already in the dictionary its count is incremented by 1, and if not it is added with a value of 1. This is a pretty rudimentary solution to the problem, but after running both scripts you have scraped the content of a web page and counted the occurrences of each word on that page. I’ll keep iterating on this solution to start adding tags and rankings to content.
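The counting step described above can be sketched roughly as follows. This is an illustration rather than the original gist: it again uses the standard library's `html.parser` in place of BeautifulSoup, and the lowercasing and the regex used to split words are assumptions about what "splitting it up" might mean. The names `ParagraphText` and `count_words` are hypothetical.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text that appears inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.chunks.append(data)


def count_words(html):
    """Return a dict mapping each word found in <p> tags to its count."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()  # assumption: case-insensitive
    counts = {}
    for word in re.findall(r"[a-z0-9']+", text):
        # Increment an existing entry, or start a new one at 1.
        counts[word] = counts.get(word, 0) + 1
    return counts
```

One caveat of this sketch: a `<p>` that is never closed keeps collecting text until the next `</p>`, so messy HTML can pull in more than the paragraph itself.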

Some things to do to improve this solution:

  • Have Redis persist the results to a file. Right now everything is kept in memory, so once Redis is shut down you lose all your parsing effort.
  • Use machine learning to create tags for the content.
  • Use machine learning to build page ranks for the content based on the web page graph.
  • Somehow map a search query to result retrieval.

Stay tuned for my next post where I tackle some of these problems. Have anything to contribute to the project? Feel free to fork it here!