The application provides an interface that presents data collected from a website as graphs.
It uses bs4, requests, json, lxml, selenium, and a few other packages that I will discuss as we go along.
The requests package establishes the HTTP connection, and the BeautifulSoup package extracts data using HTML tags. The project contains many scrapers, each implemented with a different design, so that you can use the appropriate chart for each kind of data. For example, a bar chart is good at comparing company financials across years, but it is not suitable for showing how parts make up a whole (pie charts are used for that). Once the data is scraped, we store it in a database, and we also need a simple web app that can render the charts or any summary illustration we want to show.
You can collect almost any data from any website you want and organise it however you like.
What is Web scraping?
A web scraping tool is a technology solution to extract data from web sites, in a quick, efficient and automated manner, offering data in a more structured and easier to use format, either for B2B or for B2C processes.
Extraction → Transformation → Reuse
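The three stages can be sketched in a few lines of Python. This is only an illustration using the standard library: the sample rows, the `transform` helper, and the field names are hypothetical and stand in for whatever a real scraper returns.

```python
# Extraction: pretend these raw strings were scraped from a table (hypothetical data).
raw_rows = [
    ("2016", " 1,200 "),
    ("2017", "1,450"),
    ("2018", " 1,610"),
]


def transform(rows):
    """Transformation: strip whitespace and turn '1,200' into the integer 1200."""
    return [
        {"year": int(year), "revenue": int(value.strip().replace(",", ""))}
        for year, value in rows
    ]


# Reuse: reshape the structured records into the labels/values a chart expects.
records = transform(raw_rows)
labels = [record["year"] for record in records]
values = [record["revenue"] for record in records]
print(labels, values)
```

Keeping the transformation in its own function means each scraper can clean its data differently while the charting side always receives the same shape.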
- Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree using Python parsers like lxml. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
- Beautiful Soup (BS4) is a parsing library that can use different parsers. A parser is simply a program that can extract data from HTML and XML documents.
- Beautiful Soup’s default parser comes from Python’s standard library. It’s flexible and forgiving, but a little slow. The good news is that you can swap out its parser with a faster one if you need the speed.
- One advantage of BS4 is its ability to automatically detect encodings. This allows it to gracefully handle HTML documents with special characters.
- In addition, BS4 can help you navigate a parsed document and find what you need.
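As a rough sketch of the points above, the snippet below parses a small table with Beautiful Soup. The HTML string and the `financials` id are made up for illustration; a real scraper would pass in the body of a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page.
html = """
<table id="financials">
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2017</td><td>1,450</td></tr>
  <tr><td>2018</td><td>1,610</td></tr>
</table>
"""

# "html.parser" is the standard-library parser; pass "lxml" instead for speed
# if the lxml package is installed.
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find("table", id="financials").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

print(rows)
```

Swapping the parser only changes the second argument to `BeautifulSoup`; the navigation calls stay the same.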
- The web driver emulates a web browser (I chose ChromeDriver) and executes the page’s JavaScript to load the dynamic content.
- Other sites may require you to click through forms before seeing their content. Or select options from a dropdown. Or perform a tribal rain dance…
- For these sites, you’ll need something more powerful. You’ll need Selenium (which can handle everything except tribal rain dancing).
- Selenium is a tool, also known as a web driver, that automates browsers. With it, you can actually open a Google Chrome window, visit a site, and click on links. Pretty cool, right?
- It also comes with Python bindings for controlling it right from your application. This makes it a breeze to integrate with your chosen parsing library.
Lxml is a high-performance, production-quality HTML and XML parsing library. We call it The Salad because you can rely on it to be good for you, no matter which diet you’re following.
Among all the Python web scraping libraries, we’ve enjoyed using lxml the most. It’s straightforward, fast, and feature-rich.
It’s quite easy to pick up if you have experience with either XPath or CSS selectors. Its raw speed and power have also helped it become widely adopted in the industry.
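To give a feel for the XPath side, here is a small sketch using `lxml.html`. The page content and the `companies` class name are invented for the example.

```python
from lxml import html

# Hypothetical page standing in for a downloaded document.
page = """
<html><body>
<ul class="companies">
  <li><a href="/c/acme">Acme</a></li>
  <li><a href="/c/globex">Globex</a></li>
</ul>
</body></html>
"""

tree = html.fromstring(page)

# XPath expressions select the link text and the href attribute directly.
names = tree.xpath('//ul[@class="companies"]/li/a/text()')
links = tree.xpath('//ul[@class="companies"]/li/a/@href')
print(list(zip(names, links)))
```

Each `xpath` call returns a plain Python list, so the results can be zipped, filtered, or stored without further conversion.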
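The json module exposes an API familiar to users of the standard library marshal and pickle modules: dumps and loads convert between Python objects and JSON text, while dump and load do the same against a file object.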
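A short round trip shows the pickle-like shape of the API. The `record` dictionary is a made-up example of scraped data.

```python
import io
import json

# Hypothetical scraped record.
record = {"company": "Acme", "revenue": {"2017": 1450, "2018": 1610}}

# dumps/loads work on strings, mirroring pickle.dumps/pickle.loads.
text = json.dumps(record, indent=2)
restored = json.loads(text)

# dump/load work on file-like objects, mirroring pickle.dump/pickle.load.
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)
from_file = json.load(buffer)

print(restored == record, from_file == record)
```

Unlike pickle, the output is plain text that a JavaScript front end can read directly, which is handy when the charts are rendered in a web app.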
The package has been set up to fetch and run ChromeDriver on macOS (darwin), Linux-based platforms (as identified by Node.js), and Windows. If you spot any platform weirdness, let us know or send a patch.
Requests is an Apache2 Licensed HTTP library, written in Python, for human beings.
Most existing Python modules for sending HTTP requests are extremely verbose and cumbersome. Python’s built-in urllib2 module (urllib.request in Python 3) provides most of the HTTP capabilities you should need, but the API is thoroughly broken. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.
Requests allows you to send HTTP/1.1 requests. You can add headers, form data, multipart files, and parameters with simple Python dictionaries, and access the response data in the same way. It’s powered by httplib and urllib3, but it does all the hard work and crazy hacks for you.
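The sketch below shows how plain dictionaries map onto a request. The URL, query parameters, and User-Agent value are placeholders; the request is only prepared, not sent, so the example runs without network access.

```python
import requests

# Hypothetical endpoint and values, used only to show how dictionaries map onto a request.
request = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "web scraping", "page": 2},
    headers={"User-Agent": "my-scraper/0.1"},
)
prepared = request.prepare()  # builds the request without sending it

print(prepared.url)
print(prepared.headers["User-Agent"])

# Sending it would need network access:
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
#     response.raise_for_status()
```

In everyday scraping code the same `params` and `headers` dictionaries are usually passed straight to `requests.get`, which prepares and sends the request in one call.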
Consolidate : Template engine
Mustache is a logic-less template syntax. It can be used for HTML, config files, source code — anything. It works by expanding tags in a template using values provided in a hash or object.
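To illustrate the idea of expanding tags from a hash, here is a toy renderer in Python. It is not the Mustache library and handles only simple `{{name}}` variables; real Mustache also HTML-escapes values and supports sections and partials. The `render` function and the sample template are hypothetical.

```python
import re

# Matches {{name}} with optional spaces inside the braces.
TAG = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, values):
    """Replace each {{name}} tag with str(values[name]); unknown tags become ''."""
    return TAG.sub(lambda match: str(values.get(match.group(1), "")), template)


page = render(
    "<h1>{{title}}</h1><p>{{count}} companies</p>",
    {"title": "Financials", "count": 3},
)
print(page)
```

Because the template contains no logic of its own, all decisions about what to show stay in the code that builds the values dictionary.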
Express.js, or simply Express, is a web application framework for Node.js, released as free and open-source software under the MIT License. It is designed for building web applications and APIs. It has been called the de facto standard server framework for Node.js.