Web-Scraping overview

  • Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree. It sits on top of Python parsers such as lxml and html5lib, and it automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
  • Beautiful Soup (BS4) is a parsing library that can use different parsers. A parser is simply a program that reads HTML or XML markup and turns it into a tree of elements that you can extract data from.
  • Beautiful Soup works out of the box with html.parser, the parser in Python’s standard library. It’s flexible and forgiving, but a little slow. The good news is that you can swap it out for a faster one, such as lxml, if you need the speed.
  • One advantage of BS4 is its ability to automatically detect encodings. This allows it to gracefully handle HTML documents with special characters.
  • In addition, BS4 can help you navigate a parsed document and find what you need.
  • For pages that build their content with JavaScript, a web driver (I chose ChromeDriver) controls a real web browser, which executes the page’s scripts and loads the dynamic content.
  • Sometimes the Requests library is not enough to scrape a website. Some sites out there use JavaScript to serve content. For example, they might wait until you scroll down on the page or click a button before loading certain content.
  • Other sites may require you to click through forms before seeing their content. Or select options from a dropdown. Or perform a tribal rain dance…
  • For these sites, you’ll need something more powerful. You’ll need Selenium (which can handle everything except tribal rain dancing).
  • Selenium is a tool that automates browsers through a component called WebDriver. With it, you can actually open a Google Chrome window, visit a site, and click on links. Pretty cool, right?
  • It also comes with Python bindings for controlling it right from your application. This makes it a breeze to integrate with your chosen parsing library.
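
To make the two halves of this overview concrete, here is a minimal sketch of how they might fit together. The sample HTML, the helper names `extract_links` and `fetch_rendered_html`, and the `li.item` markup are all made up for illustration; the Selenium part also assumes Chrome is installed, a recent Selenium 4 release that can locate the driver by itself, and a Chrome version that accepts the `--headless=new` flag.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, including non-ASCII characters.
SAMPLE_HTML = """
<html><head><title>Café menu</title></head>
<body>
  <ul>
    <li class="item"><a href="/espresso">Espresso</a></li>
    <li class="item"><a href="/creme-brulee">Crème brûlée</a></li>
  </ul>
</body></html>
"""


def extract_links(html, parser="html.parser"):
    """Return (text, href) pairs for the first link in each li.item element."""
    # Pass parser="lxml" instead if lxml is installed and speed matters.
    soup = BeautifulSoup(html, parser)
    links = []
    for li in soup.find_all("li", class_="item"):
        a = li.find("a")
        if a is not None and a.has_attr("href"):
            links.append((a.get_text(strip=True), a["href"]))
    return links


def fetch_rendered_html(url, wait_css, timeout=10):
    """Load url in headless Chrome, wait for wait_css to appear, return the HTML."""
    # Imported here so extract_links() still works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # assumption: a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the JavaScript-rendered element exists in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

For a static page, `extract_links` on the downloaded HTML is enough. For a dynamic one, something like `extract_links(fetch_rendered_html(url, "li.item"))` would hand the browser-rendered HTML to Beautiful Soup, where `url` is whatever page you are scraping.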

Amulya Reddy Konda


Consultant
