Has it gotten any better recently? I run a node but I haven’t actually used it a...

Avamander · on March 6, 2024

No.

Either it picks up too much garbage if you allow any P2P data exchange (can't allow only outgoing AFAIK) or it kinda only knows about the sites you know about. Which kinda defeats the purpose.

Even assuming you just want a specific index for yourself of your own content then it struggles to display useful snippets about the results, which makes it really tedious to shift through the already poor results.

If you try to proactively blacklist garbage, which is incredibly tedious because there's no quick "delete from index and blocklist" button under index explorer, then you'll soon run into an unmanageable blocklist, the admin interface doesn't handle long lists well. At some point (around 160k blocked domains) Yacy just runs out of heap during startup trying to load it which makes the instance unusable.

It also can't really handle being reverse proxied (accessed securely by both the users and peers).

It also likes to completely deplete disk space or memory, so both have to be forcefully constrained. But that ends up with a nonfunctional instance you can't really manage. It also doesn't separate functionality enough that you could manually delete a corrupt index for example.

Running (z)grep on locally stored web archives works significantly better.

bobajeff · on March 6, 2024

Those are pretty bad issues. I remember using it along time ago and only remember the results being bad. I've heard that Yacy could be good for searching sites you've already visited but it sounds like even that might not be a good use case for it.

I do understand the taking up of disk space thing. It's hard to store text of all your sites without it talking up a lot of space unless you can intelligently determine which text is unique and desired. Unless you are just crawling static pages it becomes hard to know what needs to be saved or updated.

rahen · on March 6, 2024

I remember trying it for a while in 2012, but the results were essentially worthless, probably because there were so few nodes/crawlers back then. I guess the more users there are, the better the results.

viraptor · on March 6, 2024

Alternatively, ignore the public network (it's still useless) and run it as your own crawler. Seed it with your browsing history, some aggregators like HN, your favourite RSS feeds, etc. and you'll be good.

WarOnPrivacy · on March 6, 2024

> I remember trying it for a while in 2012, but the results were essentially worthless,

I had mine crawling gov, mil, etc sties for pages that Google was starting to delist back then. Inbound requests were heavy with porn until I tweaked - IDK, something.

Brian_K_White · on March 7, 2024

"until I tweaked - IDK, something."

omg so much this.

I got an instance going in a truenas core jail, freebsd and using freebsd java not a linux vm or linux abi compatibility. had to make my own rc script.

Then had to mess with the disk & ram settings to get it to run for more than a day. But the settings are not actually explained at all and whatever they do, they definitely don't do what their names and worthless tooltips say they do.

It seems to be running now indefinitely without killing either itself or the host, in full p2p mode, but I really have no idea why it's working, or really for sur if it actually is fully. I changed "idk, something"

And I don't use it for search myself so far. Maybe some day but for now I'm paying for kagi.

I just like the idea and want it to be a thing, and it seemed a little less "invite a world of shit and attention onto my ip" than running say a tor exit or something. Maybe only a bit less but I'll see how it goes and react if I need to.