From Scrapy to Apify: The Real Case Study That Made Me Rethink How I Build Production Scrapers
There’s a specific moment in the life of every scraping project.
You start with Scrapy or Playwright locally. It works. You push it to a server. Still works. Then you need to scale: more URLs, higher frequency, proxy rotation, CAPTCHA handling, structured storage, monitoring…
Suddenly what was a 200-line script becomes infrastructure you have to maintain.
That’s exactly what happened to Daltix, a price intelligence company. They were managing their own EC2 servers, and when they needed to scale to real volumes, the equation stopped making sense. They migrated to Apify. Result: 90% reduction in EC2 costs and the ability to process 5 million resources per day.
Let’s look at why — and when it makes sense for you to make that same switch.
The Real Problem with Scrapy in Production
Scrapy itself isn’t the problem. As a framework it’s powerful, flexible, and backed by a huge community. The problems appear when you reach production and have to solve things that sit outside scraping itself:
- Where do I store data? How do I export to 15 different formats?
- How do I manage URL queues when a node fails?
- Who monitors that the scraper hasn’t died at 3am?
- How do I rotate residential proxies without getting banned?
- How do I scale without provisioning more servers manually?
These questions have no answers within Scrapy itself. You have to build them. And that’s engineering time that isn’t going into your product.
Apify tries to answer all those questions from day one.
What Apify Really Is
Apify is a web scraping and automation platform. It was voted the #1 web scraping tool on Capterra in 2024 and has clients like Intercom and the European Commission itself (which uses it to monitor prices for more than 42,000 products across 720 retailers).
The central unit in Apify is the Actor: a container that encapsulates a scraper with all its logic, configuration, input/output, and lifecycle. Think of it as a serverless function specialized for data extraction.
You can either:
- Use Actors from the Marketplace — over 1,500 pre-built scrapers ready to use without writing a single line of code. The Google Maps Actor alone has over 193,000 active users.
- Build your own Actor — using the JavaScript or Python SDK, powered by the Crawlee library underneath.
Building Your First Actor with Crawlee
Crawlee is Apify’s open-source library that unifies Playwright, Puppeteer, and Cheerio under a single API. It’s what truly differentiates the development experience.
What Crawlee handles automatically: request queue management, retries, concurrency, and integration with Apify’s storage layer.
The Storage Trinity
One area where Apify genuinely shines over custom solutions is storage:
- Datasets — For structured row-based data (like a product table). Exportable to 15+ formats: JSON, CSV, Excel, XML…
- Key-Value Stores — For state, configurations, or screenshots. Works like a simple key-value object.
- Request Queues — The persistent, distributed queue of URLs pending scraping. If a node fails, the queue survives.
This trinity is what allows Daltix to process 5 million resources daily without losing data between restarts.
Proxies, Anti-Bot, and the Cat-and-Mouse Game
In production, the scraping itself is only 50% of the problem. The other 50% is not getting blocked.
Apify has integrated proxy management with two main modes:
- Datacenter proxies — Faster and more cost-effective, but easier to detect.
- Residential proxies — Real user IPs, much harder to block, but at a premium cost.
For JavaScript-heavy sites with active bot detection, Apify includes Puppeteer Stealth and integrated CAPTCHA handling. The Crawlee library has realistic fingerprinting enabled by default.
No solution is foolproof — it’s an ongoing arms race between detection techniques and evasion techniques. But having this solved at the platform level saves you weeks of work.
The Legal Elephant in the Room
This is critical, especially if you operate in Spain or elsewhere in the European Union.
The hiQ Labs v. LinkedIn litigation in the US established that scraping publicly available data is generally not unauthorized access under the CFAA. But in Europe you have to cross-reference that with the GDPR.
Practical rule:
- Public data without personal information → generally safe
- Data including names, emails, or personal identifiers → GDPR red zone
- Terms of Service violations → real legal risk, though not necessarily illegal
The European Commission uses Apify precisely because their use cases are product price monitoring, not personal data extraction. If your use case is similar (prices, inventory, public content), you’re in relatively safe territory. If you’re extracting personal information, you need specific legal counsel.
When NOT to Use Apify
I’ll be direct: if you have a simple scraper that processes 500 pages per month for internal use, Apify is overkill. A Python script with Requests + BeautifulSoup on a cron job is enough.
Apify makes sense when:
- You need to scale to volumes that would make managing your own infrastructure unviable
- Your team doesn’t want to dedicate engineering time to maintaining an infrastructure layer
- You need scheduling, monitoring, and alerts without building it all from scratch
- You want to use one of the 1,500+ pre-built Marketplace Actors without touching code
The Daltix case is the perfect example: the 90% EC2 savings didn’t come from Apify being cheaper per page scraped. It came from not having to pay engineers to maintain servers.
Where to Start
Two paths depending on your situation:
If you just need to extract data without code: Go straight to the Marketplace, find the Actor for your use case (Google Maps, Amazon, public LinkedIn, etc.) and run it with the parameters you need. Apify’s free tier includes enough monthly credits to test without commitment.
If you need a custom Actor: Start with Crawlee locally:
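One way to bootstrap, assuming Node.js is installed (`my-crawler` is a placeholder project name):

```shell
npx crawlee create my-crawler   # pick a starter template interactively
cd my-crawler
npm start                       # runs the example crawler locally
```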
When ready to deploy to Apify:
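Deployment goes through the Apify CLI; `apify login` expects the API token from your Apify Console:

```shell
npm install -g apify-cli
apify login   # paste your API token when prompted
apify push    # builds the Actor image and deploys it to your account
```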
The deployment workflow is integrated. In minutes you have your Actor running in the cloud with scheduling, storage, and monitoring included.
The Conclusion Nobody Usually Mentions
Scraping isn’t a code problem. It’s an infrastructure problem.
Scrapy, Playwright, and any other library solve the code part brilliantly. What they don’t solve is everything else: where data lives, who monitors it, how it scales, how it survives blocks.
In 2026, when development time is scarcer than ever, paying for managed infrastructure instead of building it yourself makes far more sense than it seems at first.
Daltix learned this by scaling. You can learn it before you have to.
What’s your scraping use case? Are you in the local script phase or already dealing with infrastructure? Tell me in the comments.
