I have updated pii-hound to include a GitHub Actions (SARIF) report, and I have published a GitHub Action to make pii-hound easy to use in your CI/CD workflows.
That is a good question. No, we don't do anything with names at the moment. Names are hard because they don't follow a pattern. The next version will flag columns named first_name, last_name, fullname, or customer_name. That should be published later today.
Beyond that, pii-hound supports custom rules. A user could create some rules to match known names if they wanted.
I am open to ideas of other ways to close that gap.
I don’t know if this is viable, but I wonder if you could package a small open source LLM and feed the data through it in chunks to scrub names. I’m sure it would add to the processing time and raise a bunch of other issues. But just a thought.
I’ve spent a lot of time working on data pipelines, and one of the most frustrating problems is accidentally syncing PII or developer secrets (like AWS keys or SSNs) into a data warehouse or downstream system.
Most of the enterprise tools that solve this are massive Java applications, require complex Python environments, or cost $50k/year. I just wanted a lightning-fast, single binary I could drop into a CI/CD pipeline (--fail-on-pii) or run locally against a Postgres DB to see my exposure. So, I built pii-hound.
A few technical details on how it works under the hood:
Memory Efficiency: Scanning a 50GB CSV file shouldn't cause an OOM error. pii-hound uses a concurrent, streaming architecture and implements Reservoir Sampling, so it can sample a huge dataset in a single sequential pass while keeping the sample random and the memory footprint tiny.
Speed: For the keyword and column-name heuristics, I implemented Aho-Corasick string matching, which is significantly faster than running dozens of individual regexes against every header.
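To show why this beats a loop of regexes, here is a compact Aho-Corasick sketch in Go that finds every keyword in a header in one pass over the text. It is a from-scratch illustration under my own assumptions, not pii-hound's implementation: the matcher type, the find method, and the keyword list are hypothetical.

```go
package main

import "fmt"

// node is one trie state: outgoing edges, a failure link, and the keywords
// that end at this state (including those inherited via failure links).
type node struct {
	next map[byte]int
	fail int
	out  []string
}

type matcher struct{ nodes []node }

func newMatcher(keywords []string) *matcher {
	m := &matcher{nodes: []node{{next: map[byte]int{}}}}
	// Phase 1: insert every keyword into a trie rooted at state 0.
	for _, kw := range keywords {
		cur := 0
		for i := 0; i < len(kw); i++ {
			nxt, ok := m.nodes[cur].next[kw[i]]
			if !ok {
				nxt = len(m.nodes)
				m.nodes = append(m.nodes, node{next: map[byte]int{}})
				m.nodes[cur].next[kw[i]] = nxt
			}
			cur = nxt
		}
		m.nodes[cur].out = append(m.nodes[cur].out, kw)
	}
	// Phase 2: breadth-first pass that sets failure links.
	// Depth-1 states keep fail == 0 (the root).
	var queue []int
	for _, child := range m.nodes[0].next {
		queue = append(queue, child)
	}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for c, child := range m.nodes[cur].next {
			// Follow the parent's failure chain until a state has an edge on c.
			f := m.nodes[cur].fail
			for {
				if t, ok := m.nodes[f].next[c]; ok {
					f = t
					break
				}
				if f == 0 {
					break
				}
				f = m.nodes[f].fail
			}
			m.nodes[child].fail = f
			// Inherit keywords that end at the failure state.
			m.nodes[child].out = append(m.nodes[child].out, m.nodes[f].out...)
			queue = append(queue, child)
		}
	}
	return m
}

// find returns every keyword occurrence in text, scanning each byte once.
func (m *matcher) find(text string) []string {
	var hits []string
	cur := 0
	for i := 0; i < len(text); i++ {
		for {
			if t, ok := m.nodes[cur].next[text[i]]; ok {
				cur = t
				break
			}
			if cur == 0 {
				break
			}
			cur = m.nodes[cur].fail
		}
		hits = append(hits, m.nodes[cur].out...)
	}
	return hits
}

func main() {
	m := newMatcher([]string{"ssn", "email", "first_name", "name"})
	fmt.Println(m.find("customer_first_name_email"))
}
```

The automaton is built once and reused for every header, so the per-header cost depends on the header length and the number of matches rather than on how many keywords are loaded.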
Accuracy: To cut down on false positives, things like Credit Card numbers don't just use regex; they are piped through a Luhn algorithm validation step.
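For reference, the Luhn check itself is only a few lines. This is a minimal sketch of the standard algorithm, assuming the candidate has already been picked out by a regex; the function name luhnValid, the 13 to 19 digit length bounds, and the handling of spaces and dashes are my assumptions, not necessarily what pii-hound does.

```go
package main

import "fmt"

// luhnValid (hypothetical name) reports whether the digits in s pass the Luhn
// checksum. Spaces and dashes are skipped; any other non-digit fails the check.
func luhnValid(s string) bool {
	sum, digits := 0, 0
	double := false // the rightmost digit is not doubled
	for i := len(s) - 1; i >= 0; i-- {
		c := s[i]
		if c == ' ' || c == '-' {
			continue
		}
		if c < '0' || c > '9' {
			return false
		}
		d := int(c - '0')
		if double {
			// Double every second digit from the right; subtracting 9
			// is the same as summing the two digits of the product.
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		digits++
		double = !double
	}
	// Assumed length bounds for card numbers: 13 to 19 digits.
	return digits >= 13 && digits <= 19 && sum%10 == 0
}

func main() {
	fmt.Println(luhnValid("4111 1111 1111 1111")) // true (well-known test number)
	fmt.Println(luhnValid("4111 1111 1111 1112")) // false (checksum is off by one)
}
```

A random 16-digit string passes this check only about one time in ten, which is where most of the false-positive reduction comes from.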
Full transparency: I originally wrote the core of this scanning engine for a larger data management platform I’m building called Saddle Data. But I realized the scanner itself is incredibly useful as a standalone utility, so I extracted it, polished the CLI, and open-sourced it under the MIT license.
It currently supports Postgres, MySQL, Snowflake, BigQuery, SQLite, S3, GCS, and local files (CSV/JSON/Parquet).
I'd love for you to point it at a local database or a messy CSV and let me know how it performs. Happy to answer any questions about the Go implementation, and PRs for new regex rules or source connectors are very welcome!