
Pricing
Simple, Transparent Pricing for Every Stage of Your Business
Choose a plan that grows with you, with no hidden fees and unlimited transactions
Plan: Basic
Cost per month
For annual plan: $49
For quarterly plan: $59
For monthly plan: $69
Active promotional cards
Choose from our different promotional card types
1
Stamp cards
Unlimited digital cards - save on ink stamps and printing cards
Reward cards
Unlimited membership digital cards
Card templates
Ready to use card templates for different types of business. Choose the right one or create your own card with an individual design in a few minutes
Custom card design
Create your own card with a unique design in 5 minutes. Customize colors, logo, stamp images, as well as card description. Take full advantage of our custom card builder
Yes. Here is the **full blueprint** to build this yourself: a practical, opinionated pipeline for discovering companies using on‑prem, self‑hosted, private-cloud, or air‑gapped LLM infrastructure, extracting evidence, scoring confidence, and exporting a target list. [firecrawl](https://www.firecrawl.dev/glossary/web-scraping-apis/how-web-scraping-apis-convert-html-to-json)
This is not the fanciest architecture. It is the one most likely to work.
## Architecture
Build a batch pipeline with six stages:
1. Seed and search for candidate URLs.
2. Fetch and clean page content.
3. Extract structured evidence with an LLM.
4. Normalize company identities.
5. Verify each company with targeted follow-up searches.
6. Score, store, and export. [zackproser](https://zackproser.com/blog/extract-structured-data-websites)
The core extraction pattern is schema-first: define the JSON structure you want, then make the model fill it from cleaned page content rather than letting it freestyle. [developers.cloudflare](https://developers.cloudflare.com/browser-run/quick-actions/json-endpoint/)
## Folder structure
Use this exact structure:
```text
onprem-llm-miner/
├── .env
├── README.md
├── requirements.txt
├── config/
│   ├── queries.txt
│   ├── seed_urls.csv
│   ├── keyword_rules.yaml
│   ├── company_aliases.csv
│   └── vertical_map.yaml
├── data/
│   ├── raw/
│   ├── cleaned/
│   ├── exports/
│   └── logs/
├── prompts/
│   ├── extract_signals.txt
│   ├── verify_company.txt
│   └── normalize_company.txt
├── sql/
│   ├── schema.sql
│   └── views.sql
├── src/
│   ├── main.py
│   ├── settings.py
│   ├── db.py
│   ├── models.py
│   ├── search.py
│   ├── fetch.py
│   ├── extract.py
│   ├── verify.py
│   ├── normalize.py
│   ├── score.py
│   ├── enrich.py
│   ├── export.py
│   └── utils.py
└── notebooks/
    └── review.ipynb
```
## Tech stack
Use:
- Python 3.11+
- SQLite first, Postgres later
- `pydantic` for schemas
- `httpx` for HTTP
- `pandas` for exports
- `tenacity` for retries
- `python-dotenv` for secrets
- `trafilatura` for text extraction fallback
- one fetch service that can return cleaned markdown or JSON from URLs, because AI-oriented scraping tools are materially better than raw HTML for this use case. [github](https://github.com/firecrawl/firecrawl)
- one strong LLM that supports structured output / JSON mode, because structured extraction works better when the schema is enforced. [community.databricks](https://community.databricks.com/t5/technical-blog/end-to-end-structured-extraction-with-llm-part-1-batch-entity/ba-p/98396)
### requirements.txt
```txt
pydantic>=2.8.0
pandas>=2.2.2
httpx>=0.27.0
python-dotenv>=1.0.1
tenacity>=8.4.2
sqlalchemy>=2.0.31
trafilatura>=1.10.0
beautifulsoup4>=4.12.3
lxml>=5.2.2
tqdm>=4.66.4
orjson>=3.10.6
pyyaml>=6.0.2
```
If you use SQLite directly, you do not need SQLAlchemy, but I’d still use it because you’ll outgrow ad hoc SQL fast.
## Environment variables
Your `.env` should look like this:
```env
SEARCH_API_KEY=your_search_key
FETCH_API_KEY=your_fetch_key
LLM_API_KEY=your_llm_key
SEARCH_PROVIDER=tavily
FETCH_PROVIDER=firecrawl
LLM_PROVIDER=openai
DATABASE_URL=sqlite:///data/onprem_llm.db
MAX_SEARCH_RESULTS=10
MAX_VERIFY_RESULTS=6
MAX_DOC_CHARS=30000
RUN_MODE=batch
```
The specific vendor can vary; the abstraction should not. The point is to separate search, fetch, and extraction so you can swap providers later. [firecrawl](https://www.firecrawl.dev/blog/ai-powered-data-retrieval)
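One way to keep that separation honest is to read every provider choice and limit in a single place. The sketch below is a possible shape for `src/settings.py`, not a prescribed one: the `Settings` class and `load_settings` function are hypothetical names, and it uses only the standard library (in the real project you would presumably call `load_dotenv()` from `python-dotenv` before this runs).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    search_provider: str
    fetch_provider: str
    llm_provider: str
    database_url: str
    max_search_results: int
    max_verify_results: int
    max_doc_chars: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping; defaults to the process environment."""
    env = os.environ if env is None else env
    return Settings(
        search_provider=env.get("SEARCH_PROVIDER", "tavily"),
        fetch_provider=env.get("FETCH_PROVIDER", "firecrawl"),
        llm_provider=env.get("LLM_PROVIDER", "openai"),
        database_url=env.get("DATABASE_URL", "sqlite:///data/onprem_llm.db"),
        # Numeric limits arrive as strings, so convert them once here.
        max_search_results=int(env.get("MAX_SEARCH_RESULTS", "10")),
        max_verify_results=int(env.get("MAX_VERIFY_RESULTS", "6")),
        max_doc_chars=int(env.get("MAX_DOC_CHARS", "30000")),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to test provider swaps without touching the real `.env`.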
## Step 1: Define your schema
This is the most important piece. Do not improvise it later.
### `src/models.py`
```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Signal(BaseModel):
    company_name: str = Field(..., description="Raw company name as found in source")
    evidence_snippet: str = Field(..., description="Exact supporting snippet from source")
    evidence_keywords: List[str] = Field(default_factory=list)
    deployment_type: str = Field(..., description="One of: on_prem, self_hosted, private_cloud, vpc, air_gapped, unknown")
    model_type: Optional[str] = Field(default=None, description="Open-source, proprietary, unknown")
    source_type: str = Field(..., description="case_study, vendor_page, job_post, article, talk, rfp, forum")
    confidence_rationale: str = Field(..., description="Why this looks like a real signal")
    mentioned_products: List[str] = Field(default_factory=list)


class ExtractionResult(BaseModel):
    signals: List[Signal] = Field(default_factory=list)


class VerificationResult(BaseModel):
    company_name: str
    confirmed: bool
    best_deployment_type: str
    evidence_snippet: str
    evidence_keywords: List[str] = Field(default_factory=list)
    confidence_rationale: str
```
Structured JSON extraction from messy content works best when fields, types, and descriptions are explicit. [firecrawl](https://www.firecrawl.dev/glossary/web-scraping-apis/how-web-scraping-apis-convert-html-to-json)
## Step 2: Define your keyword rules
### `config/keyword_rules.yaml`
```yaml
high_confidence:
- on-prem llm
- on premise llm
- self-hosted llm
- self hosted llm
- air-gapped ai
- air gapped ai
- private cloud llm
- runs in customer vpc
- deployed in customer data center
medium_confidence:
- self-hosted ai
- self hosted ai
- private ai deployment
- local llm
- local deployment
- sovereign ai
- customer-controlled infrastructure
reject_if_only:
- generative ai
- ai platform
- secure ai
- enterprise ai
- model deployment
```
This is your anti-bullshit layer. Without it, the model will happily inflate weak signals.
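A minimal sketch of how these rules might be applied before any scoring happens. The `RULES` dict mirrors a subset of the YAML above so the example stays self-contained; in the project you would likely load the full file with `yaml.safe_load`. The function name `classify_text` and the `"none"` label are assumptions for illustration.

```python
RULES = {
    "high_confidence": ["on-prem llm", "self-hosted llm", "air-gapped ai"],
    "medium_confidence": ["self-hosted ai", "local llm", "sovereign ai"],
    "reject_if_only": ["generative ai", "ai platform", "secure ai"],
}


def classify_text(text: str, rules: dict = RULES) -> str:
    """Return 'high', 'medium', 'reject', or 'none' for a snippet of text."""
    lowered = text.lower()
    # Strong terms win outright.
    if any(term in lowered for term in rules["high_confidence"]):
        return "high"
    if any(term in lowered for term in rules["medium_confidence"]):
        return "medium"
    # Only generic marketing terms were found: treat as a rejection.
    if any(term in lowered for term in rules["reject_if_only"]):
        return "reject"
    return "none"
```

The ordering matters: a snippet that mentions both "generative AI" and "self-hosted LLM" is kept, while one that mentions only "secure AI" is rejected.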
## Step 3: Seed URLs
### `config/seed_urls.csv`
Use columns:
```csv
url,source_type,priority,notes
https://bentoml.com/llm/getting-started/on-prem-llms,vendor_page,1,on-prem handbook
https://www.truefoundry.com/blog/on-prem-llms,vendor_page,1,on-prem deployment
https://www.pryon.com/landing/enterprises-generative-ai-on-premises,vendor_page,1,on-prem landing
https://venturebeat.com/ai/how-enterprises-are-using-open-source-llms-16-examples,article,1,named enterprise examples
https://zedly.ai/blog/on-premise-llm-deployment,vendor_page,2,deployment guide
```
You can and should expand this continuously. [bentoml](https://bentoml.com/llm/getting-started/on-prem-llms)
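Loading that file can be sketched with the standard `csv` module. The function name `load_seed_urls`, the default priority of 9 for blank values, and the choice to take CSV text rather than a path are all assumptions made so the example runs on its own.

```python
import csv
import io


def load_seed_urls(csv_text: str) -> list[dict]:
    """Parse seed CSV text, drop blank or duplicate URLs, sort by priority."""
    rows = []
    seen = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = (row.get("url") or "").strip()
        if not url or url in seen:
            continue
        seen.add(url)
        rows.append({
            "url": url,
            "source_type": (row.get("source_type") or "unknown").strip(),
            # Lower number means higher priority; blank falls to the back.
            "priority": int(row.get("priority") or 9),
            "notes": (row.get("notes") or "").strip(),
        })
    return sorted(rows, key=lambda r: r["priority"])
```

In the project you would call it as `load_seed_urls(pathlib.Path("config/seed_urls.csv").read_text(encoding="utf-8"))`.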
## Step 4: Search query templates
### `config/queries.txt`
```txt
"on-premise LLM deployment case study"
"self-hosted LLM enterprise"
"private cloud LLM regulated industry"
"air-gapped generative AI company"
"local LLM enterprise deployment"
"self-hosted AI model customer case study"
"customer VPC generative AI case study"
"bank self-hosted LLM"
"healthcare private cloud LLM"
"defense air-gapped AI deployment"
```
These queries align with the actual language used in deployment guides and extraction tooling docs around private/on-prem deployment patterns. [zedly](https://zedly.ai/blog/on-premise-llm-deployment)
## Step 5: Database schema
### `sql/schema.sql`
```sql
CREATE TABLE IF NOT EXISTS search_results (
id INTEGER PRIMARY KEY AUTOINCREMENT,
query TEXT NOT NULL,
url TEXT NOT NULL,
title TEXT,
snippet TEXT,
source TEXT,
discovered_at TEXT NOT NULL,
UNIQUE(query, url)
);
CREATE TABLE IF NOT EXISTS pages (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT NOT NULL UNIQUE,
title TEXT,
source_type TEXT,
raw_path TEXT,
cleaned_path TEXT,
cleaned_text TEXT,
fetched_at TEXT NOT NULL,
fetch_status TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS extracted_signals (
id INTEGER PRIMARY KEY AUTOINCREMENT,
page_url TEXT NOT NULL,
company_name_raw TEXT NOT NULL,
evidence_snippet TEXT NOT NULL,
evidence_keywords TEXT,
deployment_type TEXT,
model_type TEXT,
source_type TEXT,
confidence_rationale TEXT,
mentioned_products TEXT,
extracted_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS companies (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_name_normalized TEXT NOT NULL UNIQUE,
domain TEXT,
vertical TEXT,
region TEXT,
evidence_count INTEGER DEFAULT 0,
best_confidence_score INTEGER DEFAULT 0,
best_confidence_label TEXT DEFAULT 'low',
best_deployment_type TEXT DEFAULT 'unknown',
status TEXT DEFAULT 'candidate'
);
CREATE TABLE IF NOT EXISTS company_evidence (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_id INTEGER NOT NULL,
source_url TEXT NOT NULL,
source_title TEXT,
source_type TEXT,
evidence_snippet TEXT,
evidence_keywords TEXT,
deployment_type TEXT,
confidence_score INTEGER,
confidence_label TEXT,
verified INTEGER DEFAULT 0,
created_at TEXT NOT NULL,
FOREIGN KEY(company_id) REFERENCES companies(id)
);
CREATE TABLE IF NOT EXISTS verification_queue (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_name TEXT NOT NULL,
query TEXT NOT NULL,
status TEXT DEFAULT 'pending',
created_at TEXT NOT NULL
);
```
This gives you lineage from final account back to exact source evidence.
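As a rough sketch of what `src/db.py` could contain, here is an insert helper built on the standard `sqlite3` module. It relies on the `UNIQUE(query, url)` constraint from the schema above to skip duplicates. Taking the connection as an explicit argument is an assumption for testability; the search module later in this blueprint calls `insert_search_results(rows)` with rows only, so the real helper would need to own its connection.

```python
import sqlite3

SEARCH_RESULTS_DDL = """
CREATE TABLE IF NOT EXISTS search_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    query TEXT NOT NULL,
    url TEXT NOT NULL,
    title TEXT,
    snippet TEXT,
    source TEXT,
    discovered_at TEXT NOT NULL,
    UNIQUE(query, url)
);
"""


def insert_search_results(conn: sqlite3.Connection, rows: list[dict]) -> int:
    """Insert rows, silently skipping (query, url) duplicates. Returns rows added."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO search_results "
        "(query, url, title, snippet, source, discovered_at) "
        "VALUES (:query, :url, :title, :snippet, :source, :discovered_at)",
        rows,
    )
    conn.commit()
    return conn.total_changes - before
```

`INSERT OR IGNORE` keeps reruns idempotent: running the same query twice does not duplicate rows or raise.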
## Step 6: Search module
### `src/search.py`
Your search module should:
- read `queries.txt`
- call search API
- persist top N results
- dedupe URLs
Pseudo-implementation:
```python
import datetime
import os

import httpx

from src.db import insert_search_results


def run_search(query: str) -> None:
    # Placeholder endpoint and parameter names: replace with your provider's API.
    resp = httpx.get(
        "https://api.example.com/search",
        params={"q": query, "num_results": int(os.getenv("MAX_SEARCH_RESULTS", "10"))},
        headers={"Authorization": f"Bearer {os.getenv('SEARCH_API_KEY')}"},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    discovered_at = datetime.datetime.now(datetime.timezone.utc).isoformat()
    rows = [
        {
            "query": query,
            "url": r["url"],
            "title": r.get("title"),
            "snippet": r.get("snippet"),
            "source": "search_api",
            "discovered_at": discovered_at,
        }
        for r in data.get("results", [])
    ]
    # Deduplication is left to the UNIQUE(query, url) constraint in the schema.
    insert_search_results(rows)
```
Keep this dumb. Search is retrieval, not reasoning. [scrapegraphai](https://scrapegraphai.com/blog/why-scraping-is-more-important-than-search)
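The "dedupe URLs" bullet is worth one small helper, because the same page often comes back with different tracking parameters or a trailing slash. This is a sketch under assumptions: the names `canonical_url` and `dedupe_urls` are hypothetical, and stripping only `utm_` parameters and fragments is a deliberately conservative choice.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)


def canonical_url(url: str) -> str:
    """Lowercase scheme and host, drop fragment, tracking params, trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def dedupe_urls(urls: list[str]) -> list[str]:
    """Return canonical URLs in first-seen order with duplicates removed."""
    seen = set()
    out = []
    for url in urls:
        canon = canonical_url(url)
        if canon not in seen:
            seen.add(canon)
            out.append(canon)
    return out
```

Storing the canonical form in `pages.url` means the `UNIQUE` constraint catches near-duplicates too.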
## Step 7: Fetch module
Use a fetch layer that returns clean markdown or structured extraction from a URL when possible, because these tools are built exactly for AI-friendly downstream consumption. [github](https://github.com/firecrawl/firecrawl)
### `src/fetch.py`
```python
import datetime
import hashlib
import os
import pathlib

import httpx
import trafilatura

from src.db import upsert_page

RAW_DIR = pathlib.Path("data/raw")
CLEAN_DIR = pathlib.Path("data/cleaned")
RAW_DIR.mkdir(parents=True, exist_ok=True)
CLEAN_DIR.mkdir(parents=True, exist_ok=True)


def _now() -> str:
    return datetime.datetime.now(datetime.timezone.utc).isoformat()


def fetch_with_provider(url: str):
    provider = os.getenv("FETCH_PROVIDER", "firecrawl")
    if provider == "firecrawl":
        resp = httpx.post(
            "https://api.firecrawl.dev/v1/scrape",
            headers={"Authorization": f"Bearer {os.getenv('FETCH_API_KEY')}"},
            json={"url": url, "formats": ["markdown"]},
            timeout=60,
        )
        resp.raise_for_status()
        data = resp.json().get("data", {})
        return data.get("metadata", {}).get("title"), data.get("markdown", "")
    raise NotImplementedError(f"Unsupported FETCH_PROVIDER: {provider}")


def fetch_url(url: str, source_type: str = "unknown"):
    max_chars = int(os.getenv("MAX_DOC_CHARS", "30000"))
    try:
        title, cleaned = fetch_with_provider(url)
        raw = cleaned
        if not cleaned:
            # Fallback: fetch the page directly and extract the main text.
            raw = httpx.get(url, timeout=30, follow_redirects=True).text
            cleaned = trafilatura.extract(raw) or raw[:max_chars]
        title = title or url
        # Name files by URL hash so two fetches in the same second cannot collide.
        stem = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
        raw_path = RAW_DIR / f"{stem}.txt"
        clean_path = CLEAN_DIR / f"{stem}.md"
        raw_path.write_text(raw, encoding="utf-8")
        clean_path.write_text(cleaned, encoding="utf-8")
        upsert_page({
            "url": url,
            "title": title,
            "source_type": source_type,
            "raw_path": str(raw_path),
            "cleaned_path": str(clean_path),
            "cleaned_text": cleaned[:max_chars],
            "fetched_at": _now(),
            "fetch_status": "success",
        })
    except Exception:
        # Record the failure so the URL is visible in the pages table.
        upsert_page({
            "url": url,
            "title": None,
            "source_type": source_type,
            "raw_path": None,
            "cleaned_path": None,
            "cleaned_text": None,
            "fetched_at": _now(),
            "fetch_status": "failed",
        })
```
The important part is not the vendor. The important part is: get cleaned content, cap document length, and persist it. [firecrawl](https://www.firecrawl.dev/blog/ai-powered-data-retrieval)
## Step 8: Extraction prompt
### `prompts/extract_signals.txt`
```txt
You are extracting evidence of enterprise use of on-premise, self-hosted, private-cloud, VPC-based, or air-gapped LLM deployments.
Return JSON only, matching the schema exactly.
Definitions:
- on_prem: model runs in the customer's own data center or physical infrastructure
- self_hosted: customer runs the model weights/runtime themselves
- private_cloud: model runs in a dedicated private cloud environment
- vpc: model runs in a customer-controlled VPC
- air_gapped: environment is isolated from public internet
- unknown: none of the above can be determined
Only extract a signal if the page contains textual evidence.
Do not infer from vague security or compliance language alone.
Use exact supporting snippets from the page.
If there are no valid signals, return {"signals": []}.
```
Schema-first structured extraction is the right pattern here, and modern extraction tools explicitly recommend defining response format or JSON schema rather than relying on vague prompts. [developers.cloudflare](https://developers.cloudflare.com/browser-run/quick-actions/json-endpoint/)
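The prompt asks for exact supporting snippets, but nothing forces the model to comply. One cheap guard, sketched here as an assumption rather than part of the original design, is to drop any signal whose snippet does not actually occur in the cleaned page text. The helper names are hypothetical, and matching is whitespace- and case-insensitive to tolerate minor reflow.

```python
import re


def _squash(text: str) -> str:
    """Collapse whitespace runs and lowercase, so reflowed text still matches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def snippet_is_grounded(snippet: str, document: str) -> bool:
    """True only if the non-empty snippet appears verbatim in the document."""
    needle = _squash(snippet)
    return bool(needle) and needle in _squash(document)


def drop_ungrounded(signals: list[dict], document: str) -> list[dict]:
    """Keep only signals whose evidence_snippet is found in the document."""
    return [
        s for s in signals
        if snippet_is_grounded(s.get("evidence_snippet", ""), document)
    ]
```

Running this between the LLM call and the database insert keeps paraphrased or invented quotes out of `extracted_signals`.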
## Step 9: Extraction module
### `src/extract.py`
Pseudo-code:
```python
import datetime
import json
import pathlib

from src.db import get_unprocessed_pages, insert_extracted_signals
from src.models import ExtractionResult

PROMPT_PATH = pathlib.Path("prompts/extract_signals.txt")


def call_llm_structured(prompt: str, document: str, schema: dict) -> dict:
    # Replace with your provider's structured output / JSON schema call.
    raise NotImplementedError


def extract_signals_from_pages():
    # Load the prompt and schema once instead of once per page.
    prompt = PROMPT_PATH.read_text(encoding="utf-8")
    schema = ExtractionResult.model_json_schema()
    for page in get_unprocessed_pages():
        if not page["cleaned_text"]:
            continue  # failed fetches have no text to extract from
        result = call_llm_structured(
            prompt=prompt,
            document=page["cleaned_text"],
            schema=schema,
        )
        # Validate against the pydantic schema before anything is stored.
        parsed = ExtractionResult.model_validate(result)
        extracted_at = datetime.datetime.now(datetime.timezone.utc).isoformat()
        rows = [
            {
                "page_url": page["url"],
                "company_name_raw": s.company_name,
                "evidence_snippet": s.evidence_snippet,
                "evidence_keywords": json.dumps(s.evidence_keywords),
                "deployment_type": s.deployment_type,
                "model_type": s.model_type,
                "source_type": s.source_type,
                "confidence_rationale": s.confidence_rationale,
                "mentioned_products": json.dumps(s.mentioned_products),
                "extracted_at": extracted_at,
            }
            for s in parsed.signals
        ]
        insert_extracted_signals(rows)
```
## Step 10: Normalization
You need a basic alias table.
### `config/company_aliases.csv`
```csv
alias,canonical
VMWare,VMware
JPMorgan,JPMorgan Chase
IBM Consulting,IBM
Google Cloud,Google
Microsoft Azure,Microsoft
```
### `src/normalize.py`
Rules:
- lowercase
- strip punctuation
- apply alias mapping
- if suffix-only difference (`Inc`, `Corp`, `Ltd`), collapse
- optionally enrich via a firmographic API later
Do not over-engineer this at first.
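The rules above fit in a few lines. This is a minimal sketch, assuming the alias table has already been read into a dict; the function names and the exact suffix list are illustrative choices, and the output is a lowercase matching key rather than a display name.

```python
import re

# Legal suffixes to collapse; extend as you meet new ones.
SUFFIXES = {"inc", "corp", "ltd", "llc", "co", "corporation", "incorporated", "limited"}


def company_key(name: str) -> str:
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def normalize_company(name: str, aliases: dict[str, str]) -> str:
    """Map a raw name to its canonical key, applying the alias table if it matches."""
    alias_keys = {company_key(a): company_key(c) for a, c in aliases.items()}
    key = company_key(name)
    return alias_keys.get(key, key)
```

For example, with the alias table above, `normalize_company("VMWare, Inc.", aliases)` and `normalize_company("VMware", aliases)` both resolve to the same key, which is what the `UNIQUE` constraint on `company_name_normalized` needs.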
## Step 11: Verification prompt
### `prompts/verify_company.txt`
```txt
You are verifying whether the named company has evidence of using on-premise, self-hosted, private-cloud, VPC-based, or air-gapped LLM deployments.
Return JSON only.
Rules:
- Be conservative.
- Confirm only if the page has explicit textual support.
- Ignore vague enterprise AI/security language.
- Use the strongest exact snippet.
- If not confirmed, return confirmed=false.
```
This second-pass verification is what makes the dataset commercially credible.
## Step 12: Verification workflow
### `src/verify.py`
For each normalized company:
1. Run 4–6 targeted queries.
2. Fetch top results.
3. Ask the model to verify using the narrower prompt.
4. Write evidence back to `company_evidence`.
Suggested per-company query set:
```python
def company_queries(company: str):
    return [
        f'"{company}" "on-prem" "LLM"',
        f'"{company}" "self-hosted" "AI model"',
        f'"{company}" "private cloud" "generative AI"',
        f'"{company}" "air-gapped" "AI"',
        f'"{company}" "customer VPC" "LLM"',
        f'"{company}" "local LLM"',
    ]
```
This is where you separate real buyers from noise.
## Step 13: Confidence scoring
### `src/score.py`
```python
HIGH_TERMS = {
    "on-prem llm": 5,
    "self-hosted llm": 5,
    "air-gapped ai": 5,
    "private cloud llm": 4,
    "customer vpc": 4,
    "customer data center": 4,
}

MEDIUM_TERMS = {
    "self-hosted ai": 3,
    "local llm": 2,
    "private ai deployment": 2,
    "sovereign ai": 2,
}


def score_signal(snippet: str, keywords: list[str], source_type: str, verified: bool) -> int:
    s = 0
    joined = " ".join([snippet.lower()] + [k.lower() for k in keywords])
    for term, pts in HIGH_TERMS.items():
        if term in joined:
            s += pts
    for term, pts in MEDIUM_TERMS.items():
        if term in joined:
            s += pts
    if source_type in {"case_study", "vendor_page", "rfp"}:
        s += 2
    if verified:
        s += 2
    if "generative ai" in joined and s == 0:
        s -= 3
    return s


def label(score: int) -> str:
    if score >= 7:
        return "high"
    if score >= 4:
        return "medium"
    if score >= 1:
        return "low"
    return "reject"
```
This prevents the model from deciding everything by vibes.
## Step 14: Main orchestration
### `src/main.py`
Run in this order:
```python
from src.search import run_search
from src.fetch import fetch_url
from src.extract import extract_signals_from_pages
from src.normalize import normalize_companies
from src.verify import verify_companies
from src.score import recompute_company_scores
from src.export import export_reviews


def main():
    # 1. search generic queries
    # 2. fetch URLs from search + seed
    # 3. extract signals
    # 4. normalize names
    # 5. verify companies
    # 6. score
    # 7. export CSV
    pass


if __name__ == "__main__":
    main()
```
### Recommended run order
- Day 1:
- fetch seeds
- extract signals
- export preliminary companies
- Day 2:
- search generic queries
- fetch new pages
- extract again
- normalize
- verify
- Weekly:
- rerun generic queries
- rerun verification only for new or low-confidence companies
## Step 15: Export layer
### `src/export.py`
Export at least:
- `confirmed_companies.csv`
- `medium_confidence_companies.csv`
- `rejected_signals.csv`
- `all_evidence.csv`
Columns for `confirmed_companies.csv`:
- company_name_normalized
- domain
- vertical
- region
- best_confidence_label
- best_deployment_type
- evidence_count
- best_source_url
- best_evidence_snippet
The point is reviewability, not elegance.
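A sketch of the `confirmed_companies.csv` export against the schema from Step 5. Two assumptions to flag: "confirmed" is taken to mean `best_confidence_label = 'high'`, and the "best" evidence row is the one with the highest `confidence_score`. It uses the standard `sqlite3` and `csv` modules to stay self-contained; swapping in `pandas.read_sql_query` plus `to_csv` would be the pandas route the stack list suggests.

```python
import csv
import io
import sqlite3

CONFIRMED_SQL = """
SELECT c.company_name_normalized, c.domain, c.vertical, c.region,
       c.best_confidence_label, c.best_deployment_type, c.evidence_count,
       e.source_url AS best_source_url,
       e.evidence_snippet AS best_evidence_snippet
FROM companies c
LEFT JOIN company_evidence e ON e.id = (
    SELECT id FROM company_evidence
    WHERE company_id = c.id
    ORDER BY confidence_score DESC, id ASC
    LIMIT 1
)
WHERE c.best_confidence_label = 'high'
ORDER BY c.best_confidence_score DESC, c.company_name_normalized
"""


def export_confirmed_csv(conn: sqlite3.Connection) -> str:
    """Return the confirmed-companies export as CSV text, header row included."""
    cur = conn.execute(CONFIRMED_SQL)
    header = [col[0] for col in cur.description]
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(cur.fetchall())
    return buf.getvalue()
```

Write the returned text to `data/exports/confirmed_companies.csv`; every row carries the URL and snippet a reviewer needs to check it.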
## Step 16: Review notebook
### `notebooks/review.ipynb`
Build a dead-simple review notebook with filters:
- show top 50 high-confidence accounts
- show companies with conflicting deployment types
- show companies with only one weak signal
- show most common false-positive terms
This is where you improve the rules weekly.
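As one example of a notebook filter, the "conflicting deployment types" view can be sketched in plain Python. The row shape (dicts with `company` and `deployment_type` keys) and the function name are assumptions; in the notebook the rows would come from a join of `companies` and `company_evidence`.

```python
from collections import defaultdict


def conflicting_deployments(evidence_rows: list[dict]) -> dict[str, list[str]]:
    """Map each company to its deployment types when it has more than one."""
    seen = defaultdict(set)
    for row in evidence_rows:
        deployment = row.get("deployment_type")
        # 'unknown' is an absence of information, not a conflict.
        if deployment and deployment != "unknown":
            seen[row["company"]].add(deployment)
    return {
        company: sorted(types)
        for company, types in seen.items()
        if len(types) > 1
    }
```

A company that shows up as both `on_prem` and `vpc` is not necessarily wrong, but it is exactly the kind of row a human should look at.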
## Step 17: Package list by file
Here is the minimal implementation order:
1. `sql/schema.sql`
2. `src/db.py`
3. `src/search.py`
4. `src/fetch.py`
5. `src/models.py`
6. `prompts/extract_signals.txt`
7. `src/extract.py`
8. `src/normalize.py`
9. `prompts/verify_company.txt`
10. `src/verify.py`
11. `src/score.py`
12. `src/export.py`
13. `src/main.py`
If you try to start with “agentic orchestration,” you are doing it wrong.
## Step 18: Exact first test
Your first test should be tiny:
- 5 seed URLs
- 20 fetched pages max
- 1 extraction prompt
- 1 CSV export
Success criteria:
- at least 10 candidate companies
- every company has an auditable snippet
- fewer than 30% obvious false positives after manual review
If that works, then scale.
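The success criteria are simple enough to encode, which keeps the go/no-go decision honest. A sketch, assuming candidates are dicts with an `evidence_snippet` key and that the false-positive count comes from your manual review; the function name is hypothetical.

```python
def first_test_passed(
    candidates: list[dict],
    false_positives: int,
    min_candidates: int = 10,
    max_fp_rate: float = 0.30,
) -> bool:
    """Apply the three first-run criteria: volume, auditability, precision."""
    if len(candidates) < min_candidates:
        return False
    # Every candidate must carry a non-empty snippet a reviewer can audit.
    if any(not (c.get("evidence_snippet") or "").strip() for c in candidates):
        return False
    # 'Fewer than 30%' is a strict inequality.
    return false_positives / len(candidates) < max_fp_rate
```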
## Step 19: Which LLM and why
Use a strong model that supports reliable structured outputs or JSON-mode extraction, because that’s exactly what this workload needs. The extraction pattern itself is model-agnostic, but higher-quality structured extraction is materially easier with better models and schema enforcement. [community.databricks](https://community.databricks.com/t5/technical-blog/end-to-end-structured-extraction-with-llm-part-1-batch-entity/ba-p/98396)
My recommendation:
- Use a frontier model for extraction and verification first.
- Add a cheaper model later for enrichment and classification.
- Do not start with a local OSS model unless you specifically want to spend time debugging extraction drift.
That is the boring answer, and it is the correct one.
## Step 20: What not to do
- Do not let the LLM browse freely without fixed query templates.
- Do not trust “AI company lists” with no evidence snippets.
- Do not merge companies without a canonicalization step.
- Do not store only final companies; store every source and snippet.
- Do not score confidence purely from model output.
- Do not skip manual review in the first few runs.
Those are the failure modes that wreck these systems.
## Minimal pseudocode
Here is the end-to-end control flow:
```python
def pipeline():
    seed_urls = load_seed_urls()
    generic_queries = load_queries()

    for url in seed_urls:
        fetch_url(url, source_type="seed")

    for q in generic_queries:
        run_search(q)

    for result in get_unfetched_search_results():
        fetch_url(result["url"], source_type="search")

    extract_signals_from_pages()
    normalize_companies()
    create_verification_queue()

    for company in get_companies_needing_verification():
        for q in company_queries(company):
            run_search(q)
        for url in get_new_company_urls(company):
            fetch_url(url, source_type="verification")
        verify_company(company)

    recompute_company_scores()
    export_reviews()
```
That is enough to build version one.
Would you like the next step to be the **actual code skeleton** for each Python file, with copy-pasteable starter code rather than just the blueprint?
Free push notifications
Replace text messages from unknown numbers. Send push notifications completely free that link directly to their digital card
Store locations
For annual plan
1
geolocation
Automated push notifications
Create your own automated push notifications to engage customers
Integrated scanner app
Award points to your customers without additional hardware
Referral program
Your customers can get points and rewards for inviting friends to your promotion. Grow your customer base without advertising costs
Duplicate control
Your customers will not be able to issue themselves several loyalty cards for one promotion and get extra benefits. Your customer base contains only unique customer records
Analytics
Built-in statistics and analytics system — judge the effectiveness of your loyalty program online
Manager seats
If you have several points of sale staffed by a number of salespeople or cashiers, this feature is for you: separate accounting for each manager and point of sale. Reward the most effective managers within the company
API
Integration with your software for automatic accrual of stamps, points and awards
Free service setup and onboarding
We'll set up this service for your business and onboard your team
Boost Retention, Engagement, and Revenue

A 35% increase in repeat customer visits results in 40% more revenue

Loyalty members spend 30% more per transaction compared to non-members

Attract referrals to reduce your acquisition costs


