3rd, June (Monday):
Objective : To build a text summariser
On day 1, I met my team members and learned about the work they were pursuing. The first task given to me was text summarisation. I tried the gensim summariser first. Since it did not give the expected results, we used nltk to remove stop words, built a sentence similarity matrix from the cosine distance between sentences, and finally generated a summary of the input text.
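The steps above (stop-word removal, similarity matrix, cosine measure, summary) can be sketched roughly as follows. This is a minimal illustration, not the code written during the internship: the stop-word list is a tiny hand-written stand-in for nltk's list, the sentence splitter is naive, and scoring each sentence by the sum of its similarities to the others is an assumption, since the entry does not say how sentences were ranked. Cosine similarity is used here, which is simply 1 minus the cosine distance.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); nltk's list is much larger.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on",
              "it", "for", "with", "as", "by"}


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_vector(sentence):
    """Bag-of-words counts for one sentence, with stop words removed."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def similarity_matrix(sentences):
    """Pairwise sentence similarities; the diagonal is left at 0."""
    vectors = [sentence_vector(s) for s in sentences]
    n = len(vectors)
    return [[0.0 if i == j else cosine_similarity(vectors[i], vectors[j])
             for j in range(n)] for i in range(n)]


def summarise(text, top_n=2):
    """Keep the top_n sentences with the highest total similarity (assumed ranking)."""
    sentences = split_sentences(text)
    scores = [sum(row) for row in similarity_matrix(sentences)]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:top_n])  # restore original sentence order
    return " ".join(sentences[i] for i in chosen)
```

Calling summarise(text, top_n=2) on a short paragraph returns the two sentences that share the most vocabulary with the rest of the text, in their original order.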
4th, June (Tuesday):
Objective : To extract text from PDF
The next day I was assigned to extract data from PDF files. After looking at modules such as PyPDF2, PDFMiner, Doc2PDF and TikaParser, I found PyPDF2 to be the easiest to use. I tested it on DRHP (Draft Red Herring Prospectus) reports, which are taken from the SEBI website. A DRHP includes details about the company's promoters, the reason for raising money, how the money will be used, and the risks involved in investing in the company.
5th, June (Wednesday):
Objective : To segregate topics in PDF
I needed to separate the sections of each PDF and used PDFMiner for this purpose. PDFMiner can identify the bold characters in a PDF and also reports the font size of each character. While testing, however, I found PDFMiner to be a poor choice: the module is not optimised for speed and takes a very long time to parse a single PDF. Since we need to work on a large number of PDFs (around 1000), this idea failed and I had to look for another way around the problem.
6th, June (Thursday):
Objective : To segregate topics in PDF
I started to observe the pattern of every DRHP report. Interestingly, the reports are highly organised and each one has a proper table of contents (TOC). Using the TOC, I could easily segregate the sections of a DRHP. The code generates its output in JSON format.
Note: This method is highly dependent on the input and is written specifically for DRHPs. It is not generic to all PDFs, but it works for DRHPs from all companies.
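A rough sketch of TOC-driven segregation is given here for illustration. It assumes the PDF text has already been extracted into a list of page strings and that TOC lines look like "RISK FACTORS ..... 12"; the function names parse_toc and split_sections, and that line layout, are hypothetical and not taken from the internship code.

```python
import json
import re

# Matches TOC lines such as "RISK FACTORS ........ 12" (assumed layout).
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]+(?P<page>\d+)$")


def parse_toc(toc_text):
    """Return (title, start_page) pairs for every line that looks like a TOC entry."""
    entries = []
    for line in toc_text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:
            entries.append((match.group("title").strip(), int(match.group("page"))))
    return entries


def split_sections(pages, toc_entries):
    """Map each section title to the text of its pages.

    pages[0] is page 1; a section runs up to the page before the next
    section starts, and the last section runs to the end of the document.
    """
    sections = {}
    for index, (title, start) in enumerate(toc_entries):
        if index + 1 < len(toc_entries):
            end = toc_entries[index + 1][1]
        else:
            end = len(pages) + 1
        sections[title] = "\n".join(pages[start - 1:end - 1])
    return sections


def sections_to_json(pages, toc_text):
    """Serialise the segregated sections as a JSON string."""
    return json.dumps(split_sections(pages, parse_toc(toc_text)), indent=2)
```

In practice the page numbers printed in a TOC may be offset from the physical page index (cover pages, roman numerals), so a real implementation would need to account for that offset.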
7th, June (Friday):
Objective : Downloading PDFs by Web-scraping
I wrote a web scraper that downloads PDFs. The scraper does not run headless: because of security restrictions, Selenium supports such downloads only when the browser driver is open. Although running the script with the browser open was the only plan for the day, I faced broken Selenium connections after downloading a few PDFs. Later in the internship I stopped downloading PDFs through Selenium and found another way that works in headless mode. That technique is described below (5th, July).
10th, June (Monday):
Objective : Extending the segregation to Annual Reports
This was the toughest part of the entire internship. Annual reports differ widely from one another in structure. I sought help from my team members, but they did not know how to approach this either.
11th, June (Tuesday):
12th, June (Wednesday):
Objective : Extending the segregation to Annual Reports
I started to observe the trends of a generic report. I found that the TOC is highly unstructured, but two heuristics hold: 1) the TOC is always within the first 10 pages, and 2) a page containing the TOC has a nearly equal number of integer values and sentences. This is only an assumption, but it works very well. I started working on this idea and it was successful. The code generates an intermediate HTML file, which is then converted to a JSON file.
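The two heuristics can be illustrated with a small sketch. It assumes each page is available as plain text, treats every line containing letters as a "sentence", and uses a 30% tolerance for "nearly equal"; these choices, the min_entries threshold and the function names are assumptions made for the example, not details from the original code.

```python
import re


def looks_like_toc(page_text, tolerance=0.3, min_entries=3):
    """Heuristic 2: roughly as many integer values as text lines on the page."""
    # A "sentence" here is any line containing at least one letter (assumption).
    lines = [line for line in page_text.splitlines() if re.search(r"[A-Za-z]", line)]
    integers = re.findall(r"\b\d+\b", page_text)
    if len(lines) < min_entries or len(integers) < min_entries:
        return False
    larger = max(len(lines), len(integers))
    return abs(len(lines) - len(integers)) <= tolerance * larger


def find_toc_page(pages, max_pages=10):
    """Heuristic 1: only search the first max_pages pages.

    Returns the 0-based index of the first page that looks like a TOC,
    or None if no page qualifies.
    """
    for index, text in enumerate(pages[:max_pages]):
        if looks_like_toc(text):
            return index
    return None
```

A page of ordinary prose has many lines but few integers, and a financial table has far more integers than lines, so both are rejected by the ratio check.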
13th, June (Thursday):
Objective : Building Flask-RESTful APIs for DRHPs
I started writing RESTful APIs for the sections. The generated JSON file is formatted according to the request and served by this API. After I completed it, our project manager asked me to document it by the next day.
14th, June (Friday):
Objective : Building Flask-RESTPlus APIs for Annual Reports
Since I did not understand how to document APIs in Flask-RESTful, I made minor changes and converted all the Flask-RESTful APIs to Flask-RESTPlus APIs. I used Swagger UI to provide an interface and to test that the APIs work. This Flask API was deployed by one of the team members. It takes JSON input, formats it according to the request and responds accordingly.
17th, June (Monday):
Objective : Getting familiar with AWS
The immediate work involved using Amazon Web Services (AWS). I referred to online sources and understood AWS well. With the help of one of my team members (Mr. Chanakya), I obtained the pem key file and started running an EC2 instance. I then configured the instance so that I could access an S3 bucket from python3: I made an SSH connection to the instance and added the access key and secret key to the config file/bash profile to set up environment variables. I installed the requirements for running Selenium, which prompted for an update to the Unix EC2 instance. I then installed the linker (lib) files I had downloaded. On most systems these lib files come built in, but they are not provided by the Unix OS on the instance.
18th, June (Tuesday):
Objective : Getting remuneration and subsidiary data from Annual Reports
This was not that simple. Neither the word remuneration nor the word subsidiary is part of the TOC. I completed the task by taking a keyword list and matching it against the report.
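A minimal sketch of keyword-list matching is shown here, assuming the report is available as a list of page texts. The keyword list and the function name find_keyword_pages are hypothetical; the entry does not record which keywords were actually used.

```python
# Hypothetical keyword list; the actual list is not recorded in the entry.
KEYWORDS = ["remuneration", "subsidiary", "subsidiaries"]


def find_keyword_pages(pages, keywords=KEYWORDS):
    """Return {keyword: [1-based page numbers]} for pages mentioning each keyword.

    Matching is a case-insensitive substring test, so "Remuneration" in a
    heading and "remuneration" in running text both count.
    """
    hits = {keyword: [] for keyword in keywords}
    for number, text in enumerate(pages, start=1):
        lowered = text.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                hits[keyword].append(number)
    return hits
```

The pages returned for each keyword can then be passed on for extraction, since the TOC cannot be used to locate them.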
19th, June (Wednesday):
Objective : Web scraper for anchors data and getting Financial Statements
Anchors data is collected from the BSE India website. I could not deliver this task because tabula does not work on scanned PDFs, though I did finish writing the scraper for this site. Financial statements such as the Balance Sheet, Profit and Loss statement and Cash-flow statement are extracted from Annual Reports by converting the page to an image. Both Standalone and Consolidated sheets are checked for.
20th, June (Thursday):
Objective : Web scraping on Money Control
Initially I used Selenium for the first screen. The code generates an intermediate text file that contains links to the details of each company. BS4 (BeautifulSoup 4) then runs on all these URLs. I ran this as a background process on the EC2 instance and saved the files to S3 storage on AWS.
21st, June (Friday):
Objective : Corporate Announcements and Email manager
I wrote a scraper that checks for new announcements. The top 10 entries are checked every 2 seconds, and the Email Managing module is triggered as soon as a new entry is found. It sends subscribed users an email containing information about the update along with the corresponding PDF link.
The initial plan was to use Mailchimp, but I later dropped this idea because of the complexity involved in creating campaigns, tags and users. I used Amazon SES (Simple Email Service) instead. I coded an email template and integrated it to send properly formatted announcements to subscribed users.
A UI is provided for user subscription. It is written in Flask. All users and their corresponding subscription categories are saved to a Postgres database (AWS RDS).
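The polling logic for new announcements might look roughly like the sketch below. The fetch and notify functions are passed in as parameters, so no real website or email service is involved; the entry fields ("id", "pdf_url") and the decision to treat the first poll as a baseline without sending emails are assumptions for illustration.

```python
import time


def watch_announcements(fetch_top_entries, notify, interval_seconds=2.0,
                        max_polls=None, sleep=time.sleep):
    """Poll for announcements and call notify(entry) for each unseen entry.

    fetch_top_entries: callable returning the current top entries as dicts
        with at least an "id" key (hypothetical shape).
    notify: callable that would hand the entry to the email module.
    max_polls: stop after this many polls; None means run forever.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for entry in fetch_top_entries():
            if entry["id"] in seen:
                continue
            seen.add(entry["id"])
            # Entries present on the very first poll only set the baseline
            # (assumption); anything that appears later is new.
            if polls > 0:
                notify(entry)
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
```

Injecting sleep and the two callables keeps the loop testable; in a long-running process the seen set would also need to be trimmed or persisted.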
24th, June (Monday):
Objective : API for Android app and Admin portal
This week I started building an Admin dashboard that adds articles to the database along with their associated images. The admin can also delete posts. Then I wrote Flask-RESTPlus APIs to serve the articles.
25th, June (Tuesday):
Objective : A basic Android App
For this task I worked with View Pager, Recycler View and Grid View. I learnt how to work with fragments, which I had not known how to use before. The API calls that fetch the data to fill the views were also written.
26th, June (Wednesday):
Objective : A basic Android App
This task was left off midway and is not complete, but the major work is done. I felt that someone could complete it by picking up from where I left off.
27th, June (Thursday):
Objective : Chittorgarh IPO
I scraped Initial Public Offer (IPO) details from the Chittorgarh website and collected the company summary, IPO details and Listing Day Trading Information. Both mainline and SME data were collected; the scraper ran on EC2 and the results were stored in S3.
28th, June (Friday):
Objective : To visualise stock time series data
The Daily Adjusted data comes from the Alpha Vantage API. I plotted the data and generated images, all of which carry a watermark. I used OpenCV for both text and image watermarking. Since Alpha Vantage limits the number of requests, I generated 6 API keys and sent requests with each of them in turn.
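Taking the API keys in turns can be sketched with a simple round-robin rotator. The class KeyRotator, the helper build_request_params and the placeholder keys are hypothetical; the query parameters (function, symbol, apikey) follow Alpha Vantage's daily adjusted endpoint, and no network request is made in this example.

```python
from itertools import cycle


class KeyRotator:
    """Hand out API keys in round-robin order to spread requests across keys."""

    def __init__(self, api_keys):
        if not api_keys:
            raise ValueError("at least one API key is required")
        self._keys = cycle(list(api_keys))

    def next_key(self):
        return next(self._keys)


def build_request_params(symbol, rotator):
    """Build query parameters for one Daily Adjusted request with the next key."""
    return {
        "function": "TIME_SERIES_DAILY_ADJUSTED",
        "symbol": symbol,
        "apikey": rotator.next_key(),
    }


# Placeholder keys for illustration only.
rotator = KeyRotator(["KEY_1", "KEY_2", "KEY_3"])
params = [build_request_params(symbol, rotator)
          for symbol in ["INFY", "TCS", "WIPRO", "HDFC"]]
```

Each dictionary in params would then be sent as the query string of a request, with the fourth request wrapping around to the first key.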
1st, July (Monday):
Objective : To get Public Issues data and aligning
A scraper was written for BSE India to get Cumulative Historic Public Issues data. After fetching, the data is processed using pandas.
2nd, July (Tuesday):
Objective : Aligning Financial data and Share Holdings
Aligning Financial Data was a fix I made to data generated and stored by one of the team members. This data was highly unstructured. I made a consolidated, structured chart for each company. I fetched Share Holdings data by connecting to the database where he had stored the scraped data. Then I used Matplotlib to generate a stacked bar plot that distinguishes the Promoter and Public percentages of shares.
3rd, July (Wednesday):
Objective : Handling Telecom data and Plotting
I wrote a scraper that collects Telecom data for all states in India, along with company-wise monthly subscriber totals for companies such as Airtel and BSNL. This data is scraped from PDFs. After merging all the data into one file, I started plotting it. I then generated an animated visualisation (video) that shows the monthly increase or decrease for 4 companies. This makes it easy to compare subscription rates over the months.
4th, July (Thursday):
Objective : Stock time-series prediction model
I used an LSTM to predict the stock opening price. The architecture of the Sequential model includes 4 LSTM layers with Dropout layers, and the final layer is a Dense layer. I used the Adam optimiser and mean squared error as the loss.
5th, July (Friday):
Objective : Pushing all data to Amazon S3, pushing all code to GitHub and documenting
On the final day of the internship, I made a few fixes before running all the code on EC2 so that it runs with very little latency. The main issue was PDF downloading. Instead of downloading the file through the browser, I opened the web file, read its bytes and wrote them to a temp file. This is how I got around the issue. Finally, I pushed the entire code to a private GitHub repo where I am a contributor.
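The workaround of reading the bytes of the web file and writing them to a temp file could look like the sketch below. It uses only the Python standard library; the function name fetch_to_tempfile is hypothetical, and whether the original code used urllib or another HTTP client is not stated in the entry.

```python
import shutil
import tempfile
import urllib.request


def fetch_to_tempfile(url, suffix=".pdf"):
    """Stream the bytes at url into a temporary file and return its path.

    No browser is involved, so this works in a headless environment such
    as an EC2 instance. The caller is responsible for deleting the file.
    """
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
            # Copy in chunks so large PDFs are never held fully in memory.
            shutil.copyfileobj(response, handle)
            return handle.name
```

The returned path can be handed directly to the PDF parsing code and removed once the sections have been extracted.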