Data analysis and web scraping for a leading global supplier of integrated drivetrain and propulsion systems.
Dana Incorporated, a Fortune 500 company, is the world's leading supplier of integrated drivetrain and propulsion systems. Founded in 1904 and based in Maumee, Ohio, the company employs nearly 36,000 people in 33 countries.
For this project, Dana needed support with web scraping and data analysis and chose Deltologic to lead the end-to-end development process. After the initial consultations, we decided to develop multiple web crawlers to gather millions of unique product records from selected websites of automotive manufacturers and suppliers.
The client can now run 10 Google Compute Engine (GCE) machines on the Google Cloud Platform. Each of them runs 25 scripts that assemble 150 web scrapers to gather product data from multiple websites simultaneously. The whole operation takes around 24 hours and delivers ~6,000,000 unique product records. We fully automated the entire pipeline, so it can be run whenever our client needs the newest information and reports.
All of this data is then sorted, aggregated, standardised and archived in cloud databases. When necessary, it can also be automatically converted into visual reports that are shared amongst employees and compared with other datasets.
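As a rough illustration of the aggregation step, here is a minimal sketch that groups records by manufacturer, model and year and drops duplicates. The field names (`manufacturer`, `model`, `year`, `part_number`), the normalisation rule and the sample records are assumptions made for this example; they are not taken from the actual pipeline.

```python
from collections import defaultdict


def normalise(value):
    """Collapse whitespace and lowercase so trivially different spellings match."""
    return " ".join(str(value).split()).lower()


def aggregate(records):
    """Group records by (manufacturer, model, year) and drop duplicate parts."""
    groups = defaultdict(dict)
    for rec in records:
        key = (
            normalise(rec["manufacturer"]),
            normalise(rec["model"]),
            int(rec["year"]),
        )
        # Parts are keyed by normalised part number, so a repeated part
        # is ignored and the first record seen is kept.
        groups[key].setdefault(normalise(rec["part_number"]), rec)
    return {key: list(parts.values()) for key, parts in groups.items()}


# Hypothetical sample data: the second record duplicates the first.
records = [
    {"manufacturer": "Acme", "model": "X1", "year": 2020, "part_number": "A-100"},
    {"manufacturer": "acme ", "model": "x1", "year": "2020", "part_number": "a-100"},
    {"manufacturer": "Acme", "model": "X1", "year": 2021, "part_number": "A-100"},
]
result = aggregate(records)
```

Keeping the first record seen is only one possible policy; a real pipeline might prefer the most recently scraped or the most complete record instead.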
While scraping simple static websites on a small scale is straightforward, this project was nothing like that.
• The forms themselves were highly complex, with about 1,000 possible combinations of options to choose from;
• After choosing a combination, we had to scroll through the entire page and expand tree structure dropdowns to get to the deepest level and gather the most crucial data;
• To avoid being blocked, we rotated IP addresses and used proxies;
• We throttled each scraper to follow the rules in robots.txt;
• We had to run ten machines in parallel, each with 25 scrapers, which gave us 250 crawlers running simultaneously;
• If any of the scrapers failed, we had to resume it from a specific checkpoint;
• We aggregated product data by model, year and manufacturer and deleted all of the duplicates;
• We had to make sure the scrapers never exhausted the database's connection pool by opening too many connections at once.
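Two of the points above, enumerating form combinations and resuming a failed scraper, can be sketched together. This is a simplified illustration under stated assumptions, not the code used in the project: the option names in `FORM_OPTIONS`, the JSON checkpoint file and the `scrape` callback are all hypothetical.

```python
import itertools
import json
import os

# Hypothetical form options; the real forms had about 1,000 combinations.
FORM_OPTIONS = {
    "manufacturer": ["A", "B"],
    "year": [2019, 2020],
    "body": ["sedan", "wagon"],
}


def combinations(options):
    """Yield every combination of form options in a stable order."""
    names = sorted(options)
    for values in itertools.product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def load_checkpoint(path):
    """Return the index of the next combination to process (0 if none saved)."""
    if not os.path.exists(path):
        return 0
    with open(path) as fh:
        return json.load(fh)["next_index"]


def save_checkpoint(path, next_index):
    """Write to a temp file and rename, so a crash cannot leave a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"next_index": next_index}, fh)
    os.replace(tmp, path)


def run(path, scrape, options=FORM_OPTIONS):
    """Process combinations from the saved index onward, checkpointing each one."""
    start = load_checkpoint(path)
    for index, combo in enumerate(combinations(options)):
        if index < start:
            continue  # already handled in an earlier run
        scrape(combo)  # may raise; the checkpoint then still points at this index
        save_checkpoint(path, index + 1)
```

Calling `run("scraper_01.json", scrape)` again after a crash skips the combinations that were already completed and retries the one that failed. The approach relies on the combinations being generated in the same order on every run, which is why the option names are sorted.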
We managed to overcome all of those obstacles and deliver an end-to-end product in collaboration with Dana's Data Analytics team.