From Scraping 1,100 Coffee Roasters to Detecting Shadow AI: Why Pattern Recognition at Scale Is Everything

Two years ago, I built Roastguide - a platform to help specialty coffee lovers find great roasters across Europe. The problem seemed straightforward: there were hundreds of small roasters making incredible coffee, but no easy way to discover them. The solution? Scrape, structure, and surface the signal.

Today, I'm building Trustflo with Hanna - a platform that helps companies discover and govern the AI tools their teams are actually using. On the surface, these problems seem unrelated. One is about coffee. The other is about compliance.

But underneath, they're the same problem: finding needles in haystacks at scale.

Roastguide: Finding signal in a sea of coffee

When I started Roastguide, specialty coffee was fragmented. Great roasters existed, but they were hidden - on Instagram, in local directories, mentioned in forum threads. There was no single source of truth.

So I built one. I scraped data from 1,100+ roasters across Europe, normalized it, and made it searchable. Users could filter by region, roast style, or shipping options. The app was featured by Apple's editorial team. It worked.

But the hard part wasn't the UI. It was the data pipeline. Here's what that looked like:

Discovery: Scraping multiple unstructured sources (websites, Instagram, Google Maps) to find roasters.

Normalization: Converting messy, inconsistent data (addresses, opening hours, product catalogs) into a clean schema.

Classification: Tagging roasters by style, certifications, and availability.

Maintenance: Re-scraping periodically because roasters close, move, or change their offerings.
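To make the normalization stage concrete, here is a minimal sketch of what cleaning and deduplicating scraped records could look like. It is not Roastguide's actual code: the `Roaster` schema, the `COUNTRY_CODES` alias table, and the sample records are all made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Roaster:
    name: str
    city: str
    country: str  # two-letter country code, "??" when unknown


# Hypothetical alias table; a real pipeline would need a much larger mapping.
COUNTRY_CODES = {
    "germany": "DE", "deutschland": "DE", "de": "DE",
    "sweden": "SE", "sverige": "SE", "se": "SE",
}


def normalize(raw: dict) -> Roaster:
    """Turn one messy scraped record into the clean schema."""
    name = re.sub(r"\s+", " ", raw.get("name", "")).strip()
    city = raw.get("city", "").strip().title()
    country = COUNTRY_CODES.get(raw.get("country", "").strip().lower(), "??")
    return Roaster(name=name, city=city, country=country)


def dedupe(records: list[dict]) -> list[Roaster]:
    """Normalize every record and keep the first of each duplicate."""
    seen: dict[tuple[str, str], Roaster] = {}
    for raw in records:
        roaster = normalize(raw)
        # Same name (case-insensitive) in the same country counts as one roaster.
        key = (roaster.name.casefold(), roaster.country)
        seen.setdefault(key, roaster)
    return list(seen.values())


# Made-up sample records, as they might arrive from different sources.
scraped = [
    {"name": "  Example   Roastery ", "city": "berlin", "country": "Germany"},
    {"name": "example roastery", "city": "Berlin", "country": "DE"},
    {"name": "Nordic Beans", "city": "helsingborg", "country": "Sverige"},
]
roasters = dedupe(scraped)
```

The dedupe key here is deliberately naive. Real listings would likely need fuzzier matching, since the same roaster rarely spells its name the same way on its website and on Instagram.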

At the time, I thought this was a niche problem specific to specialty coffee. It wasn't.

Fast-forward: The shadow AI problem

When Hanna and I started researching AI governance for mid-market European companies, we kept hearing the same frustration:

"We know our employees are using AI tools. We just don't know which ones, or what data they're feeding them."

This is shadow AI - the use of AI tools outside official procurement or oversight. It's not just ChatGPT. It's Grammarly, Notion AI, GitHub Copilot, Otter.ai, Perplexity, DeepL, and dozens of others. Each one is a potential compliance risk under GDPR and the AI Act.

But here's the thing: most companies don't even know these tools exist in their environment. They're hidden in SaaS subscriptions, browser extensions, Slack integrations, and personal accounts.

Sound familiar?

It's the same fragmentation problem I solved with Roastguide. Except instead of finding coffee roasters, we're finding AI tools. And instead of scrapers hitting public websites, we're parsing SaaS expense receipts, SSO logs, and integration metadata.

The pattern: Discovery → Normalization → Classification → Control

When I look back at Roastguide and forward at Trustflo, I see the same four-stage pipeline:

Discovery: Pulling candidate tools from SaaS expense receipts, SSO logs, and SaaS integrations.

Normalization: Converting unstructured data (email receipts, API logs) into a structured inventory of tools.

Classification: Tagging tools by risk level, data processing location, and AI Act category.

Control: Surfacing this information in a dashboard where Legal, IT, and Finance can approve, block, or monitor usage.
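The normalization and classification stages can be sketched in a few lines. This is an illustration under stated assumptions, not Trustflo's implementation: the tool names (`ExampleChat`, `Transcribo`), the `CATALOG` of billing patterns, and the `review_tag` rule are hypothetical, and the tag is an internal triage label rather than a legal AI Act classification.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    data_region: str  # where the vendor processes data
    review_tag: str   # internal triage label, not a legal classification


# Hypothetical catalog: billing-descriptor pattern -> (tool name, data region).
CATALOG = [
    (re.compile(r"examplechat", re.I), "ExampleChat", "US"),
    (re.compile(r"transcribo", re.I), "Transcribo", "EU"),
]


def review_tag(data_region: str) -> str:
    """Toy triage rule (an assumption): non-EU processing is reviewed first."""
    return "review-first" if data_region != "EU" else "standard"


def build_inventory(receipt_lines: list[str]) -> list[Tool]:
    """Match raw receipt lines against the catalog and return unique tools."""
    found: dict[str, Tool] = {}
    for line in receipt_lines:
        for pattern, name, region in CATALOG:
            if pattern.search(line):
                found.setdefault(name, Tool(name, region, review_tag(region)))
    return sorted(found.values(), key=lambda tool: tool.name)


# Made-up receipt lines; the last one matches nothing and is ignored.
receipts = [
    "2025-03-01 EXAMPLECHAT *TEAM PLAN 60.00 EUR",
    "2025-03-02 Transcribo GmbH invoice 1042",
    "2025-03-09 examplechat.com subscription",
    "2025-03-10 Office supplies 23.10 EUR",
]
inventory = build_inventory(receipts)
```

Two receipts for the same tool collapse into one inventory entry, which is the point of normalizing before classifying.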

The specifics differ, and the last stage shifts from keeping listings fresh to letting people act on what was found. But the structure is the same.

Why this matters for compliance

The AI Act doesn't just regulate AI providers. It also regulates deployers - companies that use AI in their operations. That includes your organization, even if you didn't build the AI yourself.

To comply, you need to:

Know which AI systems you're using
Classify them by risk level
Maintain records of usage and data flows
Ensure human oversight where required
Inform affected individuals (employees, customers)

You can't do any of that if you don't know the tools exist.
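One way to picture that checklist is as a record per AI system, with a function that reports what is still missing. The sketch below is a simplification with hypothetical field names; in particular, it treats human oversight as always needed, whereas the requirement only applies where the rules call for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AISystemRecord:
    """One row in a hypothetical AI inventory; field names are illustrative."""
    name: str
    risk_level: Optional[str] = None
    usage_logged: bool = False
    oversight_owner: Optional[str] = None
    individuals_informed: bool = False


def open_items(record: AISystemRecord) -> list[str]:
    """Return the checklist items this record does not yet satisfy."""
    gaps = []
    if record.risk_level is None:
        gaps.append("classify by risk level")
    if not record.usage_logged:
        gaps.append("record usage and data flows")
    # Simplification: assumes every system needs a named oversight owner.
    if record.oversight_owner is None:
        gaps.append("assign human oversight")
    if not record.individuals_informed:
        gaps.append("inform affected individuals")
    return gaps


# A freshly discovered tool starts with every item open.
new_tool = AISystemRecord(name="ExampleChat")
todo = open_items(new_tool)
```

Note that the first item on the list, knowing the system exists, is what creates the record in the first place. Everything else hangs off it.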

That's the shadow AI problem. And it's not going away. Netskope tracked 1,550+ distinct GenAI apps in 2025 (up from 317 a year earlier). The average mid-market company uses 15+ AI tools. Most are outside formal procurement.

Manual spreadsheets don't scale. IT surveys get outdated the day they're sent. You need continuous discovery.
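At its simplest, continuous discovery means comparing each run against the previous one and flagging what changed. A minimal sketch, assuming each run produces a set of canonical tool names (the names below are made up):

```python
def diff_inventory(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Compare two discovery runs and report what appeared or disappeared."""
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


# Made-up snapshots from two consecutive runs.
last_week = {"ExampleChat", "Transcribo"}
this_week = {"ExampleChat", "SummarizeBot", "SlideGenie"}
changes = diff_inventory(last_week, this_week)
```

The "added" list is what a survey would miss: tools that showed up after the last time anyone asked.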

Why we built Trustflo the way we did

When Hanna and I started Trustflo, we could have built a "compliance dashboard" where companies manually log their AI tools. That's what most vendors do.

But I knew from Roastguide that manual data entry doesn't work at scale. Roasters didn't submit their own listings. I scraped them. Because if you rely on people to self-report, you get 20% coverage and it's out of date by next week.

So we built Trustflo to automatically discover shadow AI. We connect to your SaaS spend data, your SSO logs, your integrations. We parse it, normalize it, and classify it against the AI Act's risk framework. Then we surface it in a dashboard where you can approve, block, or monitor.
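As a rough illustration of combining those sources, the sketch below merges tool sightings from spend data, SSO logs, and integrations into one evidence map, then lists anything not on an approved list. The function names, source labels, and tool names are assumptions for the example, not Trustflo's API.

```python
from collections import defaultdict


def merge_sources(sources: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each tool (case-insensitive) to the set of sources that saw it."""
    evidence: defaultdict[str, set[str]] = defaultdict(set)
    for source_name, tool_names in sources.items():
        for tool in tool_names:
            evidence[tool.strip().casefold()].add(source_name)
    return dict(evidence)


def unapproved(evidence: dict[str, set[str]], approved: set[str]) -> list[str]:
    """Return discovered tools that are not on the approved list."""
    return sorted(tool for tool in evidence if tool not in approved)


# Made-up sightings from three hypothetical connectors.
sources = {
    "spend": ["ExampleChat", "Transcribo "],
    "sso": ["examplechat"],
    "integrations": ["SlideGenie"],
}
evidence = merge_sources(sources)
shadow = unapproved(evidence, approved={"examplechat"})
```

Keeping the set of sources per tool is a deliberate choice: a tool seen in both spend data and SSO logs is a stronger signal than one seen in a single receipt.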

No surveys. No spreadsheets. No asking IT to manually audit 500 employees.

Just continuous, automated discovery. The same way Roastguide found coffee roasters.

The takeaway

I used to think Roastguide was a consumer product and Trustflo was an enterprise compliance tool. But they're both information products. They both solve the same problem: making the invisible visible.

If you're building in compliance, security, or governance, you're not really building compliance software. You're building a discovery engine.

The value isn't in the UI. It's in the data pipeline. Can you find the signal in the noise? Can you keep it current? Can you make it actionable?

That's the hard part. Everything else is just UI work.

Whether you're finding great coffee or shadow AI, the problem is the same: pattern recognition at scale. Get that right, and the rest follows.
