Dana Incorporated, a Fortune 500 company, is the world's leading supplier of integrated drivetrain and propulsion systems. Founded in 1904 and based in Maumee, Ohio, the company employs nearly 36,000 people in 33 countries.
For this project, Dana needed support with web scraping and data analysis and chose Deltologic to lead the end-to-end development process. After the initial consultations, we decided to develop multiple web crawlers to gather millions of unique product records from the websites of selected automotive manufacturers and suppliers.
The client can now run 10 Google Compute Engine (GCE) machines on the Google Cloud Platform. Each of them runs 25 scripts that assemble 150 web scrapers to gather product data from multiple websites simultaneously. A full run takes around 24 hours and delivers ~6,000,000 unique product records. We fully automated the entire pipeline, so it can be run whenever our client needs the newest information and reports.
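To illustrate the idea of running many scrapers at once and keeping only unique records, here is a minimal Python sketch. It is not the production code: the names `run_scrapers` and `make_stub_scraper`, the `source` and `sku` fields, and the use of a thread pool are all assumptions made for illustration, and the stub scrapers return in-memory data instead of crawling real websites.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

Record = dict[str, str]
Scraper = Callable[[], Iterable[Record]]


def collect(scrape: Scraper) -> list[Record]:
    """Run one scraper to completion inside a worker thread."""
    return list(scrape())


def run_scrapers(scrapers: list[Scraper], max_workers: int = 25) -> list[Record]:
    """Run all scrapers concurrently and keep one record per (source, sku)."""
    unique: dict[tuple[str, str], Record] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so the merge is deterministic.
        for records in pool.map(collect, scrapers):
            for record in records:
                # Hypothetical uniqueness key: the first record seen wins.
                unique.setdefault((record["source"], record["sku"]), record)
    return list(unique.values())


def make_stub_scraper(source: str, skus: list[str]) -> Scraper:
    """Hypothetical stand-in for a real crawler: yields in-memory records."""
    def scrape() -> Iterable[Record]:
        for sku in skus:
            yield {"source": source, "sku": sku}
    return scrape


scrapers = [
    make_stub_scraper("supplier-a", ["A1", "A2", "A2"]),  # contains a duplicate
    make_stub_scraper("supplier-b", ["B1"]),
]
results = run_scrapers(scrapers)
print(len(results))  # the duplicate "A2" record is dropped
```

In a setup like the one described above, each machine would run its own share of such scripts, and deduplication across machines would more likely happen in the shared database than in memory.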
All of this data is then sorted, aggregated, standardised and archived in cloud databases. When necessary, it can also be automatically converted into visual reports that are shared amongst employees and compared with other datasets.
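The standardise, aggregate and sort steps could look roughly like the following sketch. The field names `manufacturer` and `part_number`, the normalisation rules and the helper names `standardise` and `aggregate` are assumptions chosen for the example; the actual schema and rules used in the project are not described here.

```python
from collections import defaultdict


def standardise(record: dict[str, str]) -> dict[str, str]:
    """Normalise casing and whitespace so equivalent records compare equal."""
    return {
        # Hypothetical rule: trim and upper-case manufacturer names.
        "manufacturer": record["manufacturer"].strip().upper(),
        # Hypothetical rule: remove all whitespace inside part numbers.
        "part_number": "".join(record["part_number"].split()).upper(),
    }


def aggregate(records: list[dict[str, str]]) -> list[dict[str, object]]:
    """Count unique part numbers per manufacturer, sorted by manufacturer."""
    parts_by_manufacturer: dict[str, set[str]] = defaultdict(set)
    for record in map(standardise, records):
        parts_by_manufacturer[record["manufacturer"]].add(record["part_number"])
    return [
        {"manufacturer": name, "unique_parts": len(parts)}
        for name, parts in sorted(parts_by_manufacturer.items())
    ]


raw = [
    {"manufacturer": " acme ", "part_number": "ab-12 3"},
    {"manufacturer": "ACME", "part_number": "AB-123"},  # same part, different formatting
    {"manufacturer": "bolt co", "part_number": "x9"},
]
report = aggregate(raw)
print(report)
```

A summary table of this kind is the sort of input a visual report could be built from.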
While scraping simple static websites on a small scale is straightforward, this project was nothing like that.
We managed to overcome all of those obstacles and deliver an end-to-end product in collaboration with Dana's Data Analytics team.