Web-Scraping overview

  • Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree using Python parsers like lxml. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
  • Beautiful Soup (BS4) is a parsing library that can use different parsers. A parser is simply a program that reads an HTML or XML document and turns it into a structure you can extract data from.
  • Out of the box, Beautiful Soup can use the HTML parser from Python’s standard library (html.parser), which needs no extra install. It’s flexible and forgiving, but a little slow. The good news is that you can swap in a faster parser, such as lxml, if you need the speed.
  • One advantage of BS4 is its ability to automatically detect encodings. This allows it to gracefully handle HTML documents with special characters.
  • In addition, BS4 can help you navigate a parsed document and find what you need.
  • A web driver (I chose ChromeDriver) controls a real web browser, which executes the page’s JavaScript and loads the dynamic content.
  • Sometimes the Requests library is not enough to scrape a website. Some sites out there use JavaScript to serve content. For example, they might wait until you scroll down on the page or click a button before loading certain content.
  • Other sites may require you to click through forms before seeing their content. Or select options from a dropdown. Or perform a tribal rain dance…
  • For these sites, you’ll need something more powerful. You’ll need Selenium (which can handle everything except tribal rain dancing).
  • Selenium is a tool that automates browsers through an interface known as WebDriver. With it, you can actually open a Google Chrome window, visit a site, and click on links. Pretty cool, right?
  • It also comes with Python bindings for controlling it right from your application. This makes it a breeze to integrate with your chosen parsing library.
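
The hand-off between the two tools described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names extract_links and fetch_rendered_html, the sample HTML, and the wait_css and timeout parameters are all hypothetical. It assumes the beautifulsoup4 package is installed and, for the browser part only, Selenium 4 with a local Chrome.

```python
from bs4 import BeautifulSoup


def extract_links(html, parser="html.parser"):
    """Return (text, href) pairs for every <a> tag that has an href.

    "html.parser" is the standard-library parser; pass "lxml" instead
    if that package is installed and speed matters.
    """
    soup = BeautifulSoup(html, parser)
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


def fetch_rendered_html(url, wait_css="body", timeout=10):
    """Load url in headless Chrome and return the HTML after JavaScript ran.

    Hypothetical helper: assumes Selenium 4 and a local Chrome install.
    """
    # Imported lazily so extract_links works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until an element matching wait_css exists in the page.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


# Static sample so the parsing step can be tried without a browser.
sample_html = """
<html><body>
  <h1>Caf\u00e9 menu</h1>
  <a href="/espresso">Espresso</a>
  <a href="/latte">Latte</a>
  <a>No link here</a>
</body></html>
"""
links = extract_links(sample_html)
print(links)  # [('Espresso', '/espresso'), ('Latte', '/latte')]
```

For a JavaScript-heavy page, the two helpers would be chained, for example extract_links(fetch_rendered_html("https://example.com")). For static pages, the HTML could come from the Requests library instead and the browser step could be skipped.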




Amulya Reddy Konda

