
Web Data Extraction with Make.com and AI

Extract and analyze web data using Make.com (formerly Integromat) combined with AI for intelligent data processing, transformation, and enrichment.

What you’ll build: Automated web scraping scenario that extracts product data, analyzes it with AI, and stores structured results.

Use cases: Competitive analysis, price monitoring, content aggregation, lead generation.

Time: 35 minutes

Prerequisites:

  • Make.com account (free tier available)
  • OpenAI API key
  • Target website to scrape (use ethically and legally)

Setup:

  1. Sign up at Make.com
  2. Create a new scenario
  3. Add an OpenAI connection in settings

Goal: Monitor competitor prices and analyze trends

Flow:

HTTP Request → HTML Parser → Iterator → OpenAI Analysis → Google Sheets

1. HTTP Module - Fetch Web Page

URL: https://example.com/products
Method: GET
Headers:
User-Agent: Mozilla/5.0...

2. HTML Parser

  • Parse HTML content
  • Extract elements by CSS selector
  • Example: .product-card (a Python sketch of steps 1-2 follows below)
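
Outside Make.com, these first two steps amount to a plain HTTP fetch followed by a CSS-selector query. A minimal Python sketch, assuming the placeholder URL and selector from above and that `requests` and `beautifulsoup4` are installed (the User-Agent string is also just an example):

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://example.com/products",  # placeholder URL from the config above
    headers={"User-Agent": "Mozilla/5.0 (compatible; PriceMonitorBot/1.0)"},  # identify your bot
    timeout=30,
)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
cards = soup.select(".product-card")  # same CSS selector as the HTML Parser step
print(f"Found {len(cards)} product cards")
```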

3. Iterator

  • Loop through each product

4. OpenAI - Extract Structured Data

Prompt:

Extract product information from this HTML:
{{html_content}}
Return JSON:
{
  "name": "product name",
  "price": numeric_value,
  "currency": "USD",
  "in_stock": boolean,
  "features": ["feature1", "feature2"]
}
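
For reference, the same extraction step outside Make.com might look like the sketch below, using the OpenAI Python SDK. The model name and the `card_html` input are assumptions; any JSON-capable chat model should work:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_product(card_html: str) -> dict:
    prompt = (
        "Extract product information from this HTML:\n"
        f"{card_html}\n"
        "Return JSON:\n"
        '{"name": "product name", "price": numeric_value, "currency": "USD", '
        '"in_stock": boolean, "features": ["feature1", "feature2"]}'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any JSON-capable chat model works
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask the model for strict JSON
    )
    return json.loads(response.choices[0].message.content)
```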

5. Google Sheets - Append Row

  • Spreadsheet: Price Monitor
  • Sheet: Products
  • Values: {{parsed_data}}

Goal: Aggregate and summarize news content into a Notion database

Flow:

RSS Feed → OpenAI Summarize → Filter → Notion Database

1. RSS Module

  • Feed URL: Tech news RSS
  • Limit: 10 items

2. OpenAI - Summarize

Prompt:

Summarize this article in 2-3 sentences:
Title: {{title}}
Content: {{content}}
Focus on key points and implications.

3. Filter

  • Only continue if summary mentions “AI” or “automation”
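
As a rough Python equivalent of the summarize-and-filter pair (steps 2-3), assuming the same OpenAI SDK as above and treating `title` and `content` as placeholders for the RSS fields:

```python
import re
from openai import OpenAI

client = OpenAI()

def summarize(title: str, content: str) -> str:
    prompt = (
        f"Summarize this article in 2-3 sentences:\n"
        f"Title: {title}\nContent: {content}\n"
        "Focus on key points and implications."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def passes_filter(summary: str) -> bool:
    # Whole-word match so "AI" does not trigger on words like "maintain"
    return bool(re.search(r"\b(AI|automation)\b", summary, re.IGNORECASE))
```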

4. Notion - Create Page

  • Database: News Tracker
  • Properties:
    • Title: {{title}}
    • Summary: {{ai_summary}}
    • URL: {{link}}
    • Date: {{published}}

Goal: Enrich inbound leads with AI analysis of their LinkedIn profiles

Flow:

Webhook → Scrape LinkedIn → OpenAI Analyze → CRM Update

1. Webhook Trigger

  • Receives: {"linkedin_url": "..."}

2. HTTP - Scrape Profile

Note: Use the LinkedIn API or an authorized scraping service; scraping LinkedIn directly may violate its Terms of Service.

3. OpenAI - Extract Key Info

Prompt:

From this LinkedIn profile, extract:
Profile text: {{profile_html}}
Return JSON:
{
  "industry": "...",
  "company_size": "...",
  "seniority": "junior|mid|senior|executive",
  "interests": ["..."],
  "likely_pain_points": ["..."]
}

4. HubSpot - Update Contact

  • Find contact by email
  • Update custom fields with AI insights

To handle pagination across multiple result pages, chain the modules:

HTTP Get Page 1
Extract Pagination Links
Iterator (Each Page)
HTTP Get Each Page
Aggregate Results

Add Sleep module between requests:

  • Delay: 2 seconds
  • Prevents IP blocking
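
The pagination flow above, combined with the 2-second delay, can be sketched in Python roughly as follows. The start URL and the `a.next` pagination selector are assumptions about the target site:

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; PriceMonitorBot/1.0)"}

def fetch_all_pages(start_url: str, max_pages: int = 20) -> list[str]:
    pages, url = [], start_url
    while url and len(pages) < max_pages:
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        pages.append(resp.text)

        soup = BeautifulSoup(resp.text, "html.parser")
        next_link = soup.select_one("a.next")  # assumption: site's "next page" link
        url = urljoin(url, next_link["href"]) if next_link else None

        time.sleep(2)  # 2-second delay between requests, as recommended above
    return pages
```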

To validate extracted data, use Filter modules:

Filter: Price > 0 AND Price < 10000
Filter: Name is not empty
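
In code, the same validation rules amount to a simple predicate. A minimal sketch, with field names mirroring the extraction prompt above and illustrative sample records:

```python
def is_valid(item: dict) -> bool:
    price = item.get("price")
    name = (item.get("name") or "").strip()
    return bool(name) and isinstance(price, (int, float)) and 0 < price < 10000

# Example: only the first record survives the filter
products = [
    {"name": "Widget A", "price": 19.99},
    {"name": "", "price": 5.00},       # empty name -> dropped
    {"name": "Widget B", "price": 0},  # price not > 0 -> dropped
]
valid = [p for p in products if is_valid(p)]
```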

Add Error Handler route:

  • On HTTP error: Log to sheet
  • On parse error: Send Slack alert
  • Continue execution
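
The same "log the failure and keep going" pattern, sketched in Python. Where the error gets logged (a sheet, Slack, a log file) is up to you; this example just prints, and the URLs are placeholders:

```python
import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; PriceMonitorBot/1.0)"}

def log_error(stage: str, detail: str) -> None:
    print(f"[{stage}] {detail}")  # swap for a Sheets row or Slack webhook call

results = []
for url in ["https://example.com/p/1", "https://example.com/p/2"]:  # placeholder URLs
    try:
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        results.append(resp.text)
    except requests.RequestException as exc:
        log_error("http", f"{url}: {exc}")
        continue  # keep processing the remaining items
```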

Check the website’s robots.txt:

https://example.com/robots.txt

Only scrape allowed paths.

Add delays between requests:

  • Minimum 1-2 seconds
  • For large operations: 5+ seconds

For JavaScript-rendered content:

  • Use browser automation service (Browserless, Apify)
  • Or APIs when available

Store scraped HTML:

  • Reduces API calls
  • Enables re-processing
  • Makes debugging easier (see the caching sketch below)
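
A minimal caching sketch in Python, assuming a local `html_cache/` directory stands in for a Make.com data store: each URL is fetched once and re-read from disk afterwards, so re-processing never triggers another request.

```python
import hashlib
import pathlib

import requests

CACHE_DIR = pathlib.Path("html_cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_html(url: str) -> str:
    key = hashlib.sha256(url.encode()).hexdigest() + ".html"
    cached = CACHE_DIR / key
    if cached.exists():
        return cached.read_text(encoding="utf-8")  # cache hit: no new request
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    cached.write_text(resp.text, encoding="utf-8")  # keep raw HTML for re-processing
    return resp.text
```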

Make.com free tier: 1,000 operations/month

  • 1 HTTP request = 1 operation
  • 1 OpenAI call = 1 operation
  • 1 Sheet update = 1 operation

Example: monitoring 50 products daily:

  • HTTP requests: 50/day × 30 = 1,500 ops
  • OpenAI analysis: 50/day × 30 = 1,500 ops
  • Sheet updates: 50/day × 30 = 1,500 ops
  • Total: 4,500 ops/month, which exceeds the free tier and fits Make’s entry paid plan (around $9/month)

To reduce operation usage:

  1. Batch processing: Process multiple items per run
  2. Conditional execution: Only run when needed
  3. Use webhooks: Trigger-based runs instead of scheduled ones

Job listing alerts:

Trigger: Schedule (daily 9 AM)
1. HTTP - Get job listings page
2. HTML Parse - Extract job cards
3. Iterator - Loop jobs
4. OpenAI - Extract job details
5. Filter - Match criteria
6. Telegram - Send matching jobs

Brand mention monitoring:

Trigger: Schedule (every hour)
1. Twitter API - Search mentions
2. OpenAI - Sentiment analysis
3. Router:
- Positive → Thank user
- Negative → Alert team
- Neutral → Add to CRM

Price drop alerts:

Trigger: Schedule (twice daily)
1. HTTP - Get product pages
2. Parse prices
3. Google Sheets - Get previous prices
4. Compare - If dropped > 10%
5. Email - Send alert
6. Update sheet with new price
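
The comparison in step 4 is a one-line check: alert when the new price is more than 10% below the previously recorded one. A small Python sketch with illustrative numbers:

```python
def dropped_over_10_percent(previous: float, current: float) -> bool:
    return previous > 0 and (previous - current) / previous > 0.10

previous_price = 49.99  # value read back from the Google Sheets lookup
current_price = 42.50   # value parsed from the fresh page
if dropped_over_10_percent(previous_price, current_price):
    print("Price drop alert: send the email and update the sheet")
```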

Troubleshooting:

Issue: HTTP 403 Forbidden

  • Add proper User-Agent header
  • Respect rate limits
  • Consider using a proxy
  • Check if scraping is allowed

Issue: Parse errors

  • Inspect actual HTML structure
  • Use browser devtools to find selectors
  • Handle missing elements gracefully

Issue: Inconsistent data

  • Add validation filters
  • Use try-catch with error handlers
  • Set default values for missing fields

Issue: Scenario timeouts

  • Break into smaller scenarios
  • Use data stores for intermediate results
  • Optimize selectors

Ethical scraping checklist:

Do:

  • ✅ Check robots.txt
  • ✅ Read Terms of Service
  • ✅ Respect rate limits
  • ✅ Identify your bot (User-Agent)
  • ✅ Use official APIs when available

Don’t:

  • ❌ Scrape personal data without consent
  • ❌ Overwhelm servers with requests
  • ❌ Bypass authentication
  • ❌ Ignore cease & desist notices

Enhance your scenarios:

  • Add data cleaning with OpenAI
  • Implement change detection
  • Create visual dashboards
  • Set up monitoring and alerts

Found an issue? Open an issue!