The more detail, the better. If `<section>` elements are found, do you chunk those? Do you do it recursively, or do you stop after a certain level? And when section elements don't exist, do you use `<h1>`, `<h2>`, etc. to infer logical chunks?
Having looked at a lot of HTML in the wild, I noticed that sections are not really the default. I rely on headings (h1, h2, ...) to chunk each page. Each chunk has its heading hierarchy attached to it. There are a lot of optimizations that could be done at that level.
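Roughly, the idea looks like this. This is a minimal sketch in Python using BeautifulSoup, not our production code; the chunk structure and function name are just illustrative:

```python
# Sketch of heading-based chunking: walk the document in order,
# start a new chunk at each heading, and attach the current
# heading hierarchy to every chunk. Body text is simplified to
# <p> tags for brevity. (Illustrative only, not production code.)
from bs4 import BeautifulSoup

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def chunk_by_headings(html: str):
    soup = BeautifulSoup(html, "html.parser")
    hierarchy = {}  # heading level (1-6) -> heading text
    chunks = []
    current = {"headings": [], "text": []}

    for el in soup.find_all(sorted(HEADINGS) + ["p"]):
        if el.name in HEADINGS:
            # Flush the previous chunk before starting a new one.
            if current["text"]:
                chunks.append(current)
            level = int(el.name[1])
            hierarchy[level] = el.get_text(strip=True)
            # A new h2 invalidates any h3/h4/... below it.
            for deeper in range(level + 1, 7):
                hierarchy.pop(deeper, None)
            current = {
                "headings": [hierarchy[l] for l in sorted(hierarchy)],
                "text": [],
            }
        else:
            current["text"].append(el.get_text(" ", strip=True))

    if current["text"]:
        chunks.append(current)
    return chunks
```

Each chunk ends up carrying its full breadcrumb (e.g. `["Docs", "API", "Authentication"]`), which is what makes the retrieved text useful without its surrounding page.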
I'm just guessing, but I would think that following whatever semantics leads to the highest search rank in Google's algorithm is what you're most likely to find in the wild.
I appreciate the reply. As someone who runs multiple CMSs, it’s painful to deal with the AI crawlers these days, especially the ones that don’t respect my terms.
Thanks! The chat demo is actually just a small thing I put together as a preview of what can be done; the main product is the API. But seeing that most users seem to like the demo, there's probably something there...
If you want to email me at support at embedding.io with some requirements, I can see how to make that work for you.
You can group as many websites as you want into a collection. Then query that collection.
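In rough pseudo-code terms, the workflow is something like this. This is a purely hypothetical sketch; the endpoint paths and parameters below are made up for illustration, so check the actual API docs:

```python
# Hypothetical sketch of the collection workflow; not the real
# embedding.io API -- endpoints and fields are invented here.
import requests

API = "https://api.embedding.io/v1"  # hypothetical base URL
headers = {"Authorization": "Bearer YOUR_KEY"}

# Group several websites into one collection...
requests.post(f"{API}/collections", headers=headers, json={
    "name": "docs",
    "websites": ["https://example.com", "https://example.org"],
})

# ...then query the collection as a whole.
r = requests.post(f"{API}/collections/docs/query", headers=headers,
                  json={"query": "how do I configure webhooks?"})
print(r.json())
```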
Not sure what you mean by exporting; would you like to export the vectors themselves, or just the chunks of text from the websites?