Case studies

Data scraping

Dana

Dana Incorporated, a Fortune 500 company, is the world's leading supplier of integrated drivetrain and propulsion systems. Founded in 1904 and based in Maumee, Ohio, the company employs nearly 36,000 people in 33 countries.

Gaining competitive advantage through data and web scraping

For this project, Dana needed support with web scraping and data analysis and chose Deltologic to take charge of the end-to-end development process. After the initial consultations, we decided to develop multiple web crawlers to gather millions of unique product data points from the websites of specific automotive manufacturers and suppliers.

The excellent outcomes of the Dana and Deltologic partnership

The client can now run 10 GCE machines on the Google Cloud Platform. Each of them runs 25 scraping scripts, giving 250 web scrapers gathering product data from multiple websites simultaneously. A full run takes around 24 hours and delivers roughly 6,000,000 pieces of unique product data. We managed to fully automate the entire pipeline, so it can be run whenever our client needs the newest information and reports.
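At a high level, the per-machine orchestration can be pictured with the Python sketch below. This is a minimal illustration, not Deltologic's production code: the run_scraper coroutine and its behaviour are hypothetical stand-ins for the real scraping scripts.

    import asyncio

    SCRAPERS_PER_MACHINE = 25  # 10 GCE machines x 25 scrapers = 250 crawlers

    async def run_scraper(scraper_id: int) -> int:
        # Hypothetical placeholder for one scraping script; the real
        # implementation drives a browser, fills forms and stores records.
        await asyncio.sleep(0)  # yield control; real work would happen here
        return scraper_id

    async def main() -> None:
        # Launch all scrapers on this machine concurrently; exceptions are
        # collected rather than raised, so one failed scraper does not
        # bring down the other 24.
        tasks = [run_scraper(i) for i in range(SCRAPERS_PER_MACHINE)]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        failed = [r for r in results if isinstance(r, Exception)]
        print(f"{len(results) - len(failed)} scrapers finished, {len(failed)} failed")

    if __name__ == "__main__":
        asyncio.run(main())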

All of this data is later sorted, aggregated, standardised and archived in cloud databases. When necessary, it can also be automatically converted into visual reports that are shared with employees and compared against other datasets.
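Conceptually, that post-processing step looks something like the following pandas sketch. The column names (manufacturer, model, year, part_number) and file names are illustrative assumptions, not the client's actual schema.

    import pandas as pd

    def standardise(df: pd.DataFrame) -> pd.DataFrame:
        # Normalise text fields so near-identical records deduplicate cleanly.
        for col in ("manufacturer", "model", "part_number"):
            df[col] = df[col].str.strip().str.upper()
        df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")
        return df

    raw = pd.read_csv("scraped_products.csv")  # illustrative input file
    clean = (
        standardise(raw)
        .drop_duplicates(subset=["manufacturer", "model", "year", "part_number"])
        .sort_values(["manufacturer", "model", "year"])
    )
    clean.to_csv("products_clean.csv", index=False)  # archived for reporting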

The goal was clear; the road to reaching it, rather bumpy

While scraping simple static websites on a small scale is straightforward, this project was nothing of the sort.

Why?

      All data was generated dynamically by JavaScript after form options were selected, so a plain HTTP fetch returned nothing useful (see the browser sketch after this list)
      The forms themselves were highly complex, with about 1,000 different permutations of options to choose from
      After choosing a combination, we had to scroll through the entire page and expand tree-structured dropdowns to reach the deepest level and gather the most crucial data
      To prevent access restrictions, we rotated our IP addresses and used proxies (see the politeness sketch after this list)
      We limited the speed of each scraper to follow the rules of robots.txt
      We had to run ten machines asynchronously, with 25 scrapers on each, which gave us 250 crawlers running simultaneously
      If any scraper failed, we had to restart it from a specific checkpoint (see the resumption sketch after this list)
      We aggregated product data by model, year and manufacturer and deleted all of the duplicates
      We had to make sure never to exhaust the database's pool of connections
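To make the first three points concrete: because the pages rendered their data with JavaScript, a plain HTTP fetch returns an empty shell, so each crawler has to drive a real browser. Below is a hedged sketch using Selenium; the URL and CSS selectors are hypothetical, not the actual target sites.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select, WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://example-supplier.com/catalog")  # hypothetical URL

    # Choose one of the ~1,000 form permutations (selectors are illustrative).
    Select(driver.find_element(By.ID, "make")).select_by_visible_text("Example Make")
    Select(driver.find_element(By.ID, "year")).select_by_visible_text("2020")
    driver.find_element(By.ID, "search").click()

    # Wait for the JavaScript-rendered results, then scroll to the bottom
    # so lazily loaded rows appear.
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".results"))
    )
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Keep expanding collapsed tree nodes until none remain, so the crawler
    # reaches the deepest level of the hierarchy.
    while True:
        collapsed = driver.find_elements(By.CSS_SELECTOR, ".tree-node.collapsed")
        if not collapsed:
            break
        collapsed[0].click()  # re-query each pass to avoid stale elements

    rows = driver.find_elements(By.CSS_SELECTOR, ".results .product-row")
    driver.quit()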
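The politeness sketch: proxy rotation and robots.txt-driven throttling can be combined along these lines, using the standard library's robotparser and the requests package. The proxy addresses and user agent are placeholders.

    import itertools
    import time
    import urllib.robotparser

    import requests

    USER_AGENT = "example-crawler"  # illustrative user agent

    # Placeholder proxy pool, cycled so consecutive requests use different IPs.
    proxies = itertools.cycle([
        {"https": "http://proxy1.example.com:8080"},
        {"https": "http://proxy2.example.com:8080"},
    ])

    robots = urllib.robotparser.RobotFileParser("https://example-supplier.com/robots.txt")
    robots.read()
    delay = robots.crawl_delay(USER_AGENT) or 1.0  # fall back to 1 request/s

    def fetch(url: str) -> requests.Response:
        if not robots.can_fetch(USER_AGENT, url):
            raise PermissionError(f"robots.txt disallows {url}")
        time.sleep(delay)  # throttle to the crawl-delay the site asks for
        return requests.get(url, proxies=next(proxies),
                            headers={"User-Agent": USER_AGENT}, timeout=30)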
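The resumption sketch: checkpointing and a capped database connection pool might look like this, assuming a JSON checkpoint file per scraper and psycopg2 for PostgreSQL. The DSN, table and pool size are placeholders.

    import json
    import pathlib

    from psycopg2.pool import ThreadedConnectionPool

    CHECKPOINT = pathlib.Path("checkpoint.json")

    # Cap this scraper's connections; with 250 crawlers running, each pool's
    # maxconn must be small enough that the combined total stays below the
    # database server's connection limit.
    pool = ThreadedConnectionPool(minconn=1, maxconn=2,
                                  dsn="dbname=products user=scraper")  # placeholder

    def load_checkpoint() -> int:
        # Resume from the last permutation this scraper completed.
        if CHECKPOINT.exists():
            return json.loads(CHECKPOINT.read_text())["last_done"]
        return -1

    def save_record(record: tuple, permutation: int) -> None:
        conn = pool.getconn()
        try:
            with conn, conn.cursor() as cur:  # commit on success, rollback on error
                cur.execute(
                    "INSERT INTO products (manufacturer, model, year) VALUES (%s, %s, %s)",
                    record,
                )
        finally:
            pool.putconn(conn)  # always return the connection to the pool
        CHECKPOINT.write_text(json.dumps({"last_done": permutation}))

    start = load_checkpoint() + 1  # skip permutations already scraped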

We managed to overcome all of these obstacles and deliver an end-to-end product, collaborating with Dana's Data Analytics team.

Tech stack

Python, Django, Heroku, PostgreSQL, JavaScript, TailwindCSS

Want to hire the best team for the job?