We're conflating them because there is a strong correlation between "highly processed food" and "designed recklessly". Carlos Monteiro, the pioneer in this domain, operationalized it with the NOVA classification; NOVA group 4 is the closest to what you're describing:
"Industrially manufactured food products made up of several ingredients (formulations) including sugar, oils, fats and salt (generally in combination and in higher amounts than in processed foods) and food substances of no or rare culinary use (such as high-fructose corn syrup, hydrogenated oils, modified starches and protein isolates)..." [1]
This is what my Master's project was about, applied to the case of Wolof. I trained XTTSv2 and got solid results with less than 20h of paired data that wasn't of the highest quality either - hmu: [email protected]
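For anyone curious, here's a minimal inference sketch with the Coqui TTS library's public XTTSv2 checkpoint (file paths are hypothetical; fine-tuning for a new language like Wolof is a separate step done with Coqui's training recipes, not shown here):

    # Minimal XTTSv2 inference sketch using Coqui TTS (pip install TTS).
    # Paths are hypothetical; a fine-tuned Wolof checkpoint would be
    # loaded instead of the stock multilingual model.
    import torch
    from TTS.api import TTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

    # Zero-shot voice cloning: condition on a short reference clip.
    tts.tts_to_file(
        text="Hello, this is a test.",
        speaker_wav="reference_speaker.wav",  # hypothetical path
        language="en",  # base model; Wolof requires fine-tuning first
        file_path="out.wav",
    )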
This is an interesting problem with several challenges. Currently, most tokenizers are trained with byte pair encoding (BPE), where the most frequently co-occurring sequences of characters are merged into single tokens. Because training corpora are dominated by English, the majority of the learned merges are English mappings, meaning your LLM ends up with better (more compact) tokenization for English than for the other languages it's trained on.
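A minimal sketch of BPE training (simplified; real tokenizers like SentencePiece or tiktoken add byte fallback, regex pre-splitting, etc.) shows why an English-heavy corpus yields English-heavy merges:

    # Toy BPE trainer: start from characters, repeatedly merge the most
    # frequent adjacent pair. A mostly-English corpus produces merges
    # that capture English patterns, so English compresses into fewer
    # tokens than underrepresented languages do.
    from collections import Counter

    def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
        words = [list(w) for text in corpus for w in text.split()]
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for w in words:
                for a, b in zip(w, w[1:]):
                    pairs[(a, b)] += 1
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            merged = best[0] + best[1]
            # Apply the chosen merge everywhere in the corpus.
            new_words = []
            for w in words:
                out, i = [], 0
                while i < len(w):
                    if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                        out.append(merged)
                        i += 2
                    else:
                        out.append(w[i])
                        i += 1
                new_words.append(out)
            words = new_words
        return merges

    # English dominates, so the learned merges will be English-biased.
    corpus = ["the cat sat on the mat", "the dog ate the food", "naka nga def"]
    print(train_bpe(corpus, 5))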
When Carlos Monteiro operationalized UPFs by giving them a definition (in layman's terms: a UPF contains ingredients you wouldn't find in a traditional kitchen and comes wrapped in plastic), Kevin Hall from the US (NIH) had the same reaction as you and ran a multi-million-dollar experiment to disprove the definition proposed by Dr. Monteiro. Result: the group eating unprocessed food lost weight, and the other group gained weight. (The groups were swapped after 2 weeks and showed similar effects.)
They could be addressing it with multimodal mixup, a technique for closing the latent-space gap between the two modalities: https://arxiv.org/abs/2203.03897
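Roughly, the idea (a hedged sketch of the plain interpolation variant; the paper's exact formulation, e.g. mixing along the geodesic, may differ) is to interpolate paired embeddings from the two modalities so that the contrastive loss also sees points lying between the two clusters:

    # Sketch of multimodal mixup between paired image/text embeddings.
    # Assumes CLIP-style L2-normalized embeddings of shape (batch, dim).
    import torch

    def multimodal_mixup(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                         alpha: float = 1.0) -> torch.Tensor:
        # Per-example mixing coefficients from a Beta distribution.
        lam = torch.distributions.Beta(alpha, alpha).sample((img_emb.size(0), 1))
        mixed = lam * img_emb + (1 - lam) * txt_emb
        # Re-normalize so the mixed embeddings stay on the unit sphere.
        return mixed / mixed.norm(dim=-1, keepdim=True)

    # The mixed embeddings can serve as extra anchors/hard negatives in
    # the contrastive loss, pulling the two modality clusters together.
    img = torch.nn.functional.normalize(torch.randn(8, 512), dim=-1)
    txt = torch.nn.functional.normalize(torch.randn(8, 512), dim=-1)
    print(multimodal_mixup(img, txt).shape)  # torch.Size([8, 512])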