dbuckman's comments

dbuckman · 2026-05-26T13:48:19 1779803299

I started an open-source Git platform that can be self-hosted. I would call it beta at this point, if you are interested in trying it: https://velogit.com

dbuckman · 2026-05-26T14:11:54 1779804714

I guess a link to the source code would be helpful: https://velogit.com/velogit/velogit

dbuckman · 2026-04-09T13:03:00 1775739780

I have updated pii-hound to include a GitHub Actions (SARIF) report and have published a GitHub Action to make pii-hound easy to use in your CI/CD workflows.

dbuckman · 2026-04-08T16:16:44 1775665004

That is a good question. No, we don't do anything with names at the moment. Names are hard because they don't follow a pattern. The next version will flag columns named first_name, last_name, fullname, or customer_name. That should be published later today.

Beyond that, pii-hound supports custom rules. A user could create some rules to match known names if they wanted.

I am open to ideas of other ways to close that gap.

Finnoid · 2026-04-08T16:33:47 1775666027

I don’t know if this is viable, but I wonder if you could package a small open-source LLM and feed the data through it in chunks to scrub names. I’m sure it would add to the processing time and raise a bunch of other issues. But just a thought.

dbuckman · 2026-04-08T15:11:57 1775661117

Hi HN,

I’ve spent a lot of time working on data pipelines, and one of the most frustrating problems is accidentally syncing PII or developer secrets (like AWS keys or SSNs) into a data warehouse or downstream system.

Most of the enterprise tools that solve this are massive Java applications, require complex Python environments, or cost $50k/year. I just wanted a lightning-fast, single binary I could drop into a CI/CD pipeline (--fail-on-pii) or run locally against a Postgres DB to see my exposure. So, I built pii-hound.

A few technical details on how it works under the hood:

Memory Efficiency: Scanning a 50GB CSV file shouldn't cause an OOM error. It uses a concurrent, streaming architecture and implements Reservoir Sampling so it can sample huge datasets sequentially while maintaining randomness and a tiny memory footprint.

Speed: For the keyword and column-name heuristics, I implemented Aho-Corasick string matching, which finds every keyword in a single pass over each header and is significantly faster than running dozens of individual regexes against it.

Accuracy: To cut down on false positives, things like Credit Card numbers don't just use regex; they are piped through a Luhn algorithm validation step.
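As a rough illustration of the regex-plus-Luhn idea, here is a short Go sketch. The regex, `luhnValid`, and `findCardNumbers` are assumptions for the example and are not pii-hound's actual rules: the regex only proposes 13 to 19 digit candidates, and the checksum decides whether a candidate is reported.

```go
package main

import (
	"fmt"
	"regexp"
)

// candidate is an illustrative pattern: 13 to 19 digits, optionally
// separated by single spaces or hyphens.
var candidate = regexp.MustCompile(`\d(?:[ -]?\d){12,18}`)

// luhnValid reports whether the digits in number pass the Luhn checksum.
// Spaces and hyphens are ignored; any other non-digit fails the check.
func luhnValid(number string) bool {
	sum, digits := 0, 0
	double := false // every second digit from the right is doubled
	for i := len(number) - 1; i >= 0; i-- {
		c := number[i]
		if c == ' ' || c == '-' {
			continue
		}
		if c < '0' || c > '9' {
			return false
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		double = !double
		digits++
	}
	return digits >= 2 && sum%10 == 0
}

// findCardNumbers returns regex candidates that also pass the Luhn check.
func findCardNumbers(text string) []string {
	var found []string
	for _, match := range candidate.FindAllString(text, -1) {
		if luhnValid(match) {
			found = append(found, match)
		}
	}
	return found
}

func main() {
	// 4111 1111 1111 1111 is a well-known test card number.
	text := "card 4111 1111 1111 1111 and order 1234 5678 9012 3456"
	fmt.Println(findCardNumbers(text))
}
```

A random 16-digit string passes the checksum only about one time in ten, so this step filters out most order IDs and phone-like numbers that the regex alone would flag.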

Full transparency: I originally wrote the core of this scanning engine for a larger data management platform I’m building called Saddle Data. But I realized the scanner itself is incredibly useful as a standalone utility, so I extracted it, polished the CLI, and open-sourced it under the MIT license.

It currently supports Postgres, MySQL, Snowflake, BigQuery, SQLite, S3, GCS, and local files (CSV/JSON/Parquet).

I'd love for you to point it at a local database or a messy CSV and let me know how it performs. Happy to answer any questions about the Go implementation, and PRs for new regex rules or source connectors are very welcome!