
I have felt somewhat frustrated with what I perceive as a broad tendency to malign "prompt engineering" as an antiquated approach, superseded by whatever the industry's newest technique for building a request body for a model API happens to be. Whether that's RAG years ago, nuance in a model request's schema beyond simple text (tool calls, structured outputs, etc.), or concepts of agentic knowledge and memory more recently.

While models were less powerful a couple of years ago, there was nothing stopping you at that time from taking a highly dynamic approach to what you asked of them as a "prompt engineer"; you were just more vulnerable to nondeterminism in the contract with the model at each step.

Context windows have grown larger; you can fit more in now, push out the need for fine-tuning, and get more ambitious with what you dump in to help guide the LLM. But I'm not immediately sure what skill requirements fundamentally change here. You just have more resources at your disposal, and can worry less about counting tokens.


I liked what Andrej Karpathy had to say about this:

https://twitter.com/karpathy/status/1937902205765607626

> [..] in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step. Science because doing this right involves task descriptions and explanations, few shot examples, RAG, related (possibly multimodal) data, tools, state and history, compacting... Too little or of the wrong form and the LLM doesn't have the right context for optimal performance. Too much or too irrelevant and the LLM costs might go up and performance might come down. Doing this well is highly non-trivial. And art because of the guiding intuition around LLM psychology of people spirits.


All that work just for stripping a license. If one uses code directly from GitHub, copy and paste is sufficient. One can even keep the license.


Are you using them regularly in physical labor though?


I must say: things from H&M have regularly held up longer than those from far more expensive brands. Sometimes they don’t really fit right anymore, but no holes or stitches that come loose.


I think the point is more around the collective distraction of dressing to a formal (and for kids, unfamiliar) standard, rather than enjoying time with the people around you at a party.


I’m skeptical that established automakers even felt threatened. For one thing, manufacturers seemed to overestimate present demand for high-end, expensive EVs (no way Apple was going to build a lower-tier “economy” car). Also, companies just entering the automotive industry (Tesla, Rivian etc.) seem to face loads of challenges that long-lived ICE manufacturers have solved a long time ago.


Could someone explain exactly what it means to "completely sequence" the human genome when all humans have distinct genetic makeup (i.e., different sequences of nucleobases in their DNA/RNA)?


The public Human Genome Project used a group of people, but most of the sequence library was derived from a single individual in Buffalo, NY. The Celera project also used a group of people, but it was mostly Venter's genome.

https://www.nytimes.com/2002/04/27/us/scientist-reveals-secr...

I believe more recent sequencing projects have used a wider pool of individuals. I think some projects pool all the individuals and sequence them together, while others sequence each individual separately. This isn't really so much of a problem, since the large-scale structure is highly similar across all humans and we have developed sophisticated approaches to model the variation between individuals. See https://www.biomedcentral.com/collections/graphgenomes for an explanation of the "graph structure" used to represent alternatives in the reference, which can capture everything from single-nucleobase differences to more complex variants such as large deletions, rearrangements, and even inversions.


We really should say "a human genome". Reference genomes serve as a Rosetta Stone of genomics. So we can take DNA/RNA sequences from other individuals and align (pattern match) them to the reference as a way of understanding and comparing individuals.

It is not perfect, as a reference can be missing some DNA regions entirely or have large variability in others. The goal of the Human Pangenome Reference Consortium (HPRC) https://humanpangenome.org/ is to sequence individuals from different populations to address this issue. We are also working to develop new computational models to support analysis of data across populations.


They mean they have obtained the complete sequence for a particular Y chromosome that is considered to be a "reference" chromosome. This is similar to what was done for all the other chromosomes.


I've never understood this either. I assume the genome is many megabytes of [ATCG]+. If we have that sequence, what does it tell us? Do we look at it and say "Ah, yes, ...ATGCTACGACTACGACTAGCG... very interesting?"


Many genes are highly conserved, or at least consistent enough. E.g.: if there's a 1% difference between two people, then it's a bit like two long sentences that differ by a couple of small typos. They're still recognisable, and it's also still pretty obvious that they're the "same".

A gene sequence allows researchers to determine the amino acids that are coded for, and from those, which proteins match which genes.
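As a toy sketch of that gene-to-protein step (the codon table below is deliberately abbreviated to a handful of entries, and the input sequence is invented):

```python
# An abbreviated codon table; the real genetic code maps all 64 codons.
CODONS = {
    "ATG": "M",  # methionine (also the start codon)
    "GCT": "A", "GCC": "A",  # alanine: several codons per amino acid
    "TGG": "W",  # tryptophan
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}

def translate(dna):
    """Read three bases at a time; stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODONS.get(dna[i : i + 3], "?")  # '?' = not in our toy table
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCTTGGTAA"))  # -> "MAW"
```

Real translation machinery also has to find the reading frame and handle splicing, which this ignores.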

This can be matched up with genetic diseases. If you know that damage to a certain location in a chromosome causes a problem with a certain biological process, then ergo, the associated protein is needed for that process!

So: genetic illness -> gene sequence -> protein -> role in the body

Without sequencing, that chain can't be built.


But you can only know that by having a large sample of very “stable” (have few genetic irregularities) gene samples compared to a large pool of samples from people with very narrow and pronounced gene irregularities, right?

Is this why it’s so hard? This feels more like a healthcare record-keeping problem and less like an "actually reading the data" problem.

I can’t help but feel like some form of single payer healthcare is truly the way out of this problem. One where all disease record keeping is uniform and complete.


Single payer healthcare here (UK) is still subject to privacy controls in a way which would make it very difficult to do that.

(Also our health system's IT is a hellscape, but one reason for that is that people would literally rather not have a working system at all, than one with less than impeccable privacy controls.

Personally I'd gladly sacrifice a fair bit of medical privacy in return for giving scientists greater insight into disease processes, but the average citizen here wants advanced healthcare without giving their data to research scientists. /facepalm )


I trust the scientists, the problem isn't them. Look at the whole abortion data scandal in the U.S.


The problem there is the US' insane theocrat-conservatives (or just misogynist assholes hiding behind a thin veneer of religious justification, as the case may be).

I'm not saying a health IT system should have no privacy controls either. But the requirements for such controls need to be balanced against having a system that actually works, and that means having some people who actually understand the tech, and the workings of hospitals, having a role in requirements conversations. Instead it was dominated by MPs, "patient advocacy" groups and privacy campaigners, none of whom know or care anything about how to build a workable system.


> But you can only know that by having a large sample of very “stable” (have few genetic irregularities) gene samples compared to a large pool of samples from people with very narrow and pronounced gene irregularities, right?

No. A "gene" isn't a single A/G/C/T; it's a sequence of 1,000-1,000,000 base pairs. Each gene has a well-defined start/stop sequence called a start/stop codon. When people have genetic differences, one base pair (an SNP: single nucleotide polymorphism) of the tens of thousands in that gene is different. Even for genes that are entirely "missing" in some people, they're really just different in a way that makes them nonfunctional.

Does that make it obvious how sequencing all those genes is useful, even if everyone has different genes? It tells us 99.999% of how proteins are coded, even if individual variation is the other 0.001%.
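A toy illustration of the SNP idea, with invented sequences (real SNP calling works on aligned sequencing reads, not naive string comparison):

```python
def find_snps(ref, sample):
    """Positions where two aligned, equal-length sequences differ by one base."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(ref, sample)) if a != b]

ref   = "ATGGCCATTGTA"
other = "ATGGCCGTTGTA"  # one substitution relative to ref
print(find_snps(ref, other))  # -> [(6, 'A', 'G')]
```

Two "different" versions of a gene are still trivially recognisable as the same gene; the variation is a short list of positions, not a rewrite.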


It’s actually about 3 gigabases (ATCG). There are some recurrent features of the genome whose function we’ve worked out. For example, the TATA box is a classic sequence that typically indicates the start of a part of the genome that codes for a protein. The vast majority of the genome doesn’t code for proteins. The function of these genome regions is much more murky. Some of these regions function like scaffolds for proteins to assemble into complexes. These protein complexes then start transcribing the genome into mRNA. So the genome regulates its own expression, in a sense. Many of the sequences that function in this way are known. There are also just a bunch of parts of the genome that probably don’t do anything. And there are many regions of the genome that are basically self-replicating sequences: they code for proteins that are capable of inserting their own genetic sequence back into the genome. These are transposons.

In short, a lot of very painstaking genetics and molecular biology work has gone into characterizing the function of certain sequences.


Also interesting are HERVs - human endogenous retroviruses which integrated into the human or our ancestor species’ genomes. They have degraded over time so none of the human hervs seem to be capable of activating but there are some in other mammals that can fully reactivate.

In humans even though hervs don’t reactivate into infectious viruses they have been implicated in both harmful (senescence during aging[0]) and beneficial (protection from modern retroviruses)[1] activities in the body.

They might be up to 8% of the human genome.

0: https://www.cell.com/cell/pdf/S0092-8674(22)01530-6.pdf

1:https://www.microbe.tv/twiv/twiv-956/


For the same reason Monsanto sequences basically anything: Because we can tell what proteins are encoded in there, and what is near them, and we can have good ideas of what proteins are expressed together. When dealing with genetic modification, we get to see whether our modification went in, and where it landed: Having a protein in a genome isn't enough. Its expression might be having an effect on other things, depending on where it is.

When we have baselines, we can compare different individuals, and eventually make predictions of how they are going to be based solely on the genetic code. If I know that a certain polymorphism is tied to some trait I want, I might not have to even bother spending the time growing a plant: I know that it's not what I want, and discard it as a seed.

With humans we are probably not going to see much modification soon, but just being able to detect genetic diseases, risk factors for other diseases that have genetic components, or allow for selection of embryos in cases of artificial insemination is already quite valuable.

We're not all that good at understanding this source code just yet, but there are already some applications, and we have good reason to think there's a lot more to come.


It's just about 3 gigabytes (each byte a letter). Pretty mind-blowing, if you ask me.


It's a slight exaggeration of the information content to report the data size using an ASCII encoding. Since there are 4 bases, each can be encoded using 2 bits, rather than 8. So we're really talking 750 megabytes. But still mind-blowing.
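The arithmetic can be made concrete with a small packing sketch (assuming an uppercase, ACGT-only string whose length is a multiple of 4; real genomes also contain ambiguity codes like N, which don't fit in 2 bits):

```python
# Pack 4 bases per byte at 2 bits each.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(dna):
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i : i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

print(len(pack("ACGT" * 1000)))  # 4,000 bases -> 1,000 bytes
# Scaled up: ~3e9 bases * 2 bits = ~6e9 bits = ~750 MB, the figure above.
```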


And since the data is highly redundant, the 750MB can be compressed down even further using standard approaches (DEFLATE works well; it uses both Huffman coding and dictionary backreferences).
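A quick way to see the redundancy point, using Python's zlib (DEFLATE) on toy data — the repetitive string here is a stand-in for genomic repeats, not a real genome:

```python
import random
import zlib

random.seed(0)
# A highly repetitive toy "genome" versus the same amount of random bases.
repeat = ("ACGTACGTAC" * 100_000).encode()                    # 1 MB
rand = "".join(random.choices("ACGT", k=1_000_000)).encode()  # 1 MB

for name, data in (("repetitive", repeat), ("random", rand)):
    c = zlib.compress(data, 9)
    print(f"{name}: {len(data):,} -> {len(c):,} bytes")
# The repeats collapse to almost nothing via backreferences; the random
# sequence only shrinks toward its ~2-bits-per-base entropy floor.
```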

Or, you could build an embedding with far fewer parameters that could explain the vast majority of phenotypic differences. The genome is a hierarchical palimpsest of low entropy.

My standard interview question- because I hate leetcode- walks the interviewee through compressing DNA using bit encoding, then using that to implement a rolling hash to do fast frequency counting. Some folks get stuck at "how many bits in a byte", others at "if you have 4 symbols, how many bits are required to encode a symbol?", and other candidates jump straight to bloom filters and other probabilistic approaches (https://github.com/bcgsc/ntHash and https://github.com/dib-lab/khmer are good places to start if you are interested).
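Not the poster's exact interview question, but a minimal sketch of the 2-bit rolling hash idea (ignoring reverse complements and non-ACGT characters, which production tools like ntHash handle):

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_counts(dna, k):
    """Count k-mers with a rolling 2-bit hash: each step shifts in the next
    base and masks off the base that just left the window."""
    mask = (1 << (2 * k)) - 1  # keep only the low 2*k bits
    counts = Counter()
    h = 0
    for i, base in enumerate(dna):
        h = ((h << 2) | CODE[base]) & mask
        if i >= k - 1:  # window is full: h now encodes dna[i-k+1 : i+1]
            counts[h] += 1
    return counts

counts = kmer_counts("ACGTACGTAC", 4)
print(counts[0b00011011])  # hash of "ACGT" -> 2 (it appears twice)
```

For genome-scale counting you'd swap the Counter for a bloom filter or count-min sketch, which is where the probabilistic approaches come in.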


I'm curious if these 750MB + the DNA of mitochondria + the protein metagenomics contain all the information needed to build a human, or if there's extra info stored in the machinery of the first cell.

That is if we transfer the DNA to an advanced alien civilization - would they be able to make a human.


This is a complex question. The cocktail soup in a gamete (sperm or egg) and the resulting zygote contains an awful lot of stuff that would be extremely hard to replace. I could imagine that if the receiving civilization was sufficiently advanced and had a model of what those cells contained (beyond the genomic information) they could build some sort of artificial cell that could bootstrap the genome to the point of being able to start the development process. It would be quite an accomplishment.

If they just received the DNA without some information about the zygote, I don't think it would be practical for even advanced alien civilization (LR5 or LR6) but probably an LR7 and definitely an LR8 could.


I’m just pondering this, and it’s not clear to me that there is anything intrinsic in the genome itself that explicitly ‘says’ “this sequence of DNA bases encodes a protein” or even “these three base pairs equate to this amino acid”.

I wonder if that information could ever really be untangled by a civilisation starting entirely from scratch without access to a cell


If you knew what DNA was and had seen a protein you could easily figure out start/stop codons. If you had only seen something similar it would be harder. If you had nothing similar, I don't know.

Coding DNA and non-coding DNA look very different. Proteins are full of short repetitive sequences that form structural elements like alpha helices: https://en.wikipedia.org/wiki/Alpha_helix

Once you've identified roughly where the protein-coding genes are it would be trivial to identify 3'/5' as being common to all those regions. You could pretty easily imagine a much more complicated system with different transcription mechanisms and codon categories, but earth genomes are super simple in that respect. Once you have those you just have the (incredibly complex) problem of creating a polymerase and bam, you'll be able to print every single gene in the body.

Without the right balance of promoters/factors/polymerase you probably won't get anything close to a human cell, but you'd be able to at least work closer to what the natural balance should be, and once you get closer to building a correct ribosome etc the cell would start to self-correct.


It’s an interesting question. Naively, I would expect it to be about like reverse engineering a CPU from a binary program. Which sounds daunting but maybe not impossible if you understand the fundamentals of registers, memory, opcodes, etc.

But… doing so from first principles without a mental model of how all (human) CPUs work? I guess it comes down to whether the recipients had enough context to know what they’re looking at.


Yes, it's intrinsic in the genome but implemented through such a complicated mechanism that attempting to understand these things from first principles is impractical, not impossible.

In genomic science we nearly always use more cheaply available information rather than attempt to solve the hard problem directly. For example, for decades, a lot of sequencing only focused on the transcribed parts of the genome (which typically encode for protein), letting biology do the work for determining which parts are protein.

If you look at the process biophysically, you will see there are actual proteins that bind to the regions just before a protein, because the DNA sequences there match some pattern the protein recognizes. If you move that signal in front of a non-coding region, the apparatus will happily transcribe and even attempt to translate the non-coding region, making a garbage protein.


What do you mean by "LR"? I queried an LLM but no results there either.


It's likely just a typo. LR5 "civilisation"/"civilization" brings up nothing on Google. I don't know why you would expect an LLM to know more.

Based on the way the person is using it, it does not seem to equate to the Kardashev scale, as my peer stated


Since the cat is out of the bag: no, it's not a typo. It's related to Kardashev but is oriented around the common path most galactic civilizations follow on the path to either senescence (LR8.0) or singularity (LR8.1-4). Each level in LR is effectively unaware of the levels above it, basically because the level above is an Outside Context Problem.

Humans are currently LR2 (food security) and approaching LR3 (artificial general intelligence, self genetic modification). LR4 is generally associated with multiplanetary homing (IE, could survive a fatal meteor strike on the home planet) and LR5 with multisolar homing (IE, could survive a fatal solar incident). LR6 usually has total mastery of physical matter, LR7 can read remote multiverses, and LR8.2 can write remote multiverses. To the best of LR8's knowledge, there is no LR9, so far as their detectors can tell, but it would be hard to say, as LR9 implies existence in multiple multiverses simultaneously. Further, faster than light travel and time travel both remain impossible, so far as LR8 can tell.

“An Outside Context Problem was the sort of thing most civilisations encountered just once, and which they tended to encounter rather in the same way a sentence encountered a full stop.” ― Iain M. Banks, Excession

“Unbelievable. I’m in a fucking Outside Context situation, the ship thought, and suddenly felt as stupid and dumb-struck as any muddy savage confronted with explosives or electricity.” ― Iain M. Banks, Excession

“It was like living half your life in a tiny, scruffy, warm grey box, and being moderately happy in there because you knew no better...and then discovering a little hole in the corner of the box, a tiny opening which you could get your finger into, and tease and pull apart at, so that eventually you created a tear, which led to a greater tear, which led to the box falling apart around you... so that you stepped out of the tiny box's confines into startlingly cool, clear fresh air and found yourself on top of a mountain, surrounded by deep valleys, sighing forests, soaring peaks, glittering lakes, sparkling snow fields and a stunning, breathtakingly blue sky. And that, of course, wasn't even the start of the real story, that was more like the breath that is drawn in before the first syllable of the first word of the first paragraph of the first chapter of the first book of the first volume of the story.” ― Iain M. Banks, Excession


If we're at LR2, and each level is effectively unaware of the levels above it, how do we know what LR3/4/5/6/7/8/9 are or might be?

Or do you mean that a civilization at a particular level will always be unaware of civilizations above? That doesn't seem to make sense either; I see no reason why a LR4 civ couldn't have knowledge of a LR5 civ, for example.


Because it's an authorial construct used as a plot device, and thus has only small value of mapping onto the real world.


I'd imagine it might be beneficial to you to read more than one book


oops i've said too much


Or you've said too little?


The code for how to build a sperm and an egg is inside the human DNA, isn't it?


Yes, but it currently requires developmentally mature individuals to build the gametes, and the "code" is so complex you couldn't really decipher it from first principles.


The code to build mitochondria is not.


Given code written for unknown hardware... can you execute it?


Given that the code contains the instructions for how to make the hardware: if one is very smart, then yes.


It would not necessarily be possible, because it's incremental instructions on how to make the hardware, but based on already existing, unspecified, and very complex hardware. So the first instruction would be something like "take the stuff you have on your left and fuse it with the stuff you have on your right", both being unspecified, very complex proteins assumed to be present.


Imagine a machine shop that has blueprints of components of the machines they use in the shop, and processes to assemble machines from the components. When a machine shop grows large and splits in two, each inherits half of the shop with the ongoing processes and a copy of the blueprints. https://m.youtube.com/watch?v=B7PMf7bBczQ&pp=QAFIAQ%3D%3D

DNA is the blueprints. There are infinite possibilities for what to do with them. The advanced civilization would need additional information, like the fact that they are supposed to create a cell from the components to begin with, and a lot of detailed information about how exactly to do that.

Edit: improved clarity


"if we transfer the DNA to an advanced alien civilization - would they be able to make a human."

I'm really surprised that in all these responses to your question no one's mentioned the womb or the mother, who (at least with current technology) is still necessary for making a human.

That's not to mention the necessity of the egg.

We're not just DNA.


This is a question about theoretical possibilities and what you're saying seems to be a rigid belief in an answer "no". But you provided no evidence or justification, except for "with current technology", which answers nothing about the theoretical question.


Artificial wombs have come quite a long way! It is not inconceivable that you could bring a zygote to term in an artificial womb.


Instructions on how to make a womb and an "egg" are contained within the human DNA.


It is known that that is not true, due to the distinct genetic code of mitochondria and known epigenetic influences of mothers on their children in utero.

You could say “well that's the last 10% of the details, maybe 90% is in the DNA,” but I think I would be suspicious that it's that high, because one of the things we know about humans is that we are born with all of the ova that we will ever have, rather than deferring the process until puberty. I should think that if it could be deferred it would have been, “you will spend the energy to make these 15 years before you need to for no real reason” seems very unlike evolution whereas “my body is going to teach you how to make these eggs, just the same as my mother's body taught me,” sounds quite evolutionarily reasonable.


But maybe you needed a pre-human womb to bootstrap the first human, and we no longer have the blueprint for that...


> That is if we transfer the DNA to an advanced alien civilization - would they be able to make a human.

You'd need a cell to start the process, with the various nucleic acids distributed correctly and proteins/energy with which to create further proteins using the information encoded by the DNA. Thus the civilization would need information about cells and a set of building blocks before being able to use the DNA.


The DNA contains all the code that creates and regulates the proteins.


Including code for the proteins that read DNA to produce proteins. You might hit similar problems trying to understand C given the source code for a C compiler - a non-standard environment could reproduce itself given the source code, meaning the code alone doesn't strictly determine the output.


I'll torture this DNA and C source code analogy a bit.

Epigenetics is missing in this discussion about reproducing a human from just the DNA. These are superficial modifications (e.g. methylation, histone modification, repressor factors) to a strand of DNA that can drastically alter how specific regions get expressed into proteins. These mechanisms essentially work by either hiding or unhiding DNA from RNA polymerases and other parts of the transcription complex. These mechanisms can change throughout your lifetime because of environmental factors and can be inherited.

So it's like reading C source code, except there are so many of these inscrutable C preprocessor directives strewn all throughout. You won't get a successful compilation just by turning all the directives on or off. Instead, you need to get this similarly inscrutable configuration blob that tells you how to set each directive.

I guess in a way, it's like the weights for an ML model. It just works, you can't explain why it works, and changing this weight here produces a program that crashes prematurely, and changing a weight there produces a program with allergic reactions to everything.


There's also postprocessing: varying modification of the RNA, RNA interference, glycosylation and other post transcriptional modifications.


And how will you decode it?


I can’t wait until we can bootstrap a human from a stage 3 tarball.


Yes, there is extra information in the first cells, in particular regulatory elements such as miRNAs. The headline here is epigenetics.


There's also some interesting work on understanding the role of loops in the physical structure of DNA storage on gene expression. [0] The base sequence of the DNA isn't everything; it may also matter how the DNA gets laid out in space---a feature which can be inherited.

[0] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2638769/


Our DNA does not contain the mitochondria nor the gut bacteria so the raw data would most certainly not be enough to build a working copy


It's a bit like: if I have the source code of Linux (think DNA), can I build a machine running Linux (think cell)? No, you can't; you need to have a machine that can run the code.

I.e., "software" without a "machine" to run it on is kind of useless.


Yes, and if you gzip it it's even smaller. But the big takeaway is that the amount of info that fully defines a human, is what we consider "not much data," even in its plainest encoding.


We don't know that it fully defines a human until we can create one without the starting condition of being inside another human. It's prototype-based inheritance.


Some of the research about being able to make simple animals grow structures from other animals in their evolutionary “tree” by changing chemical signaling—among other wild things like finding that memories may be stored outside the brain, at least in some animals—makes me think you need more than just the “code” to get the animal that would have been produced if that “code” were in its full context (of a reproductive cell doing all sorts of other stuff). Even if the dna contains the instructions for that reproductive cell, too, in some sense… which instructions do you “run”? There might be multiple possible variants, some of which don’t actually reproduce the animal you took the dna from.


My favorite trivia here is that flamingos aren't actually "genetically" pink but "environmentally" pink because they pick up the color from eating algae.

Except of course "genetics" and "environment" aren't actually separate things; sure, people's skin color isn't usually affected by their food, but only because most people don't eat colloidal silver.

https://en.wikipedia.org/wiki/Paul_Karason


AFAIK most poisonous frogs also aren’t “naturally” poisonous—they get it from diet. Ones raised in captivity aren’t poisonous unless you go out of your way to feed them the things they need to become poisonous.


bzip2 is marginally better, and then genome-specific compressors were developed, and then finally, people started storing individual genomes as diffs from a single reference, https://en.wikipedia.org/wiki/CRAM_(file_format)

Since FASTQ genome files contain more data than just ATGC (typically a comment line, then a DNA line, then a quality score line), and each of those draws from a different distribution, DEFLATE doesn't reach the full potential of the compressor: the Huffman table ends up having to hold all three distributions, and the dictionary backreferences aren't as efficient either. It turns out you can split the file into multiple streams, one per line type, and then compress those independently, with slightly better compression ratios, but it's still not great.
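The stream-splitting trick can be demonstrated on synthetic FASTQ-like data (the line contents here are random, so the exact ratios won't match real files):

```python
import random
import zlib

random.seed(1)
# Toy FASTQ-like records: a header line, a DNA line, a quality line.
lines = []
for i in range(5000):
    lines.append(f"@read_{i}")
    lines.append("".join(random.choices("ACGT", k=50)))
    lines.append("".join(random.choices("!#$%&'()*+", k=50)))

# Compress everything interleaved, as it sits on disk...
mixed = len(zlib.compress("\n".join(lines).encode(), 9))

# ...versus one stream per line type, so each stream's Huffman table
# only has to model a single symbol distribution.
split = sum(
    len(zlib.compress("\n".join(lines[j::3]).encode(), 9)) for j in range(3)
)
print(mixed, split)  # split comes out smaller on this toy data
```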


You could say exactly the same of all data; it's just 1s and 0s, but when I look I just see blonde, brunette.


If you think like an ML engineer, the genome is a feature vector 3B bases (or 6B binary bits) long that is highly redundant (many sections contain repeats and other regions that are correlated to other regions), and the mapping between that feature vector and an individual's specific properties (their "phenotype", which could be their height at full maturity, or their eye color, or hair properties, or propensity to diseases, etc) is highly nonlinear.

If you had a list of all the genomes of all the people in the world, and all their phenotypes (height, eye color, hair type, etc), you could take all their genomes as input variables and treat all their phenotypes as output variables, and make embeddings or other models that mapped from genomes to phenotypes. The result would be a predictive model that could take a human genome, and spit out a prediction of what that person looks like and other details around them (up to the limits of heritability).

A good example is height. If you take a very large diverse sample of people and sequence them, you will find that about 50% of the variance in height can be traced to the genomic sequence of that individual (other things, such as socioeconomic status, access to health care, pollution, etc., which are non-genomic, contribute as well). Originally, many geneticists believed that a small number of genes- tiny parts of the feature vector- would be the important features in the genome that explained height.

But it didn't turn out that way. Instead, height is a nonlinear function of thousands of different locations (either individual bases, entire genes, or other structures that vary between individuals) in the genome. This was less surprising to folks who are molecular biologists (mainly based on the mental models geneticists and MBers use to think about the mapping of genotype to phenotype), and we still don't have great mechanistic explanations of how each individual difference works in concert with all the others to lead to specific heights.
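The standard additive approximation of this is a polygenic score: a weighted sum over thousands of loci. The sketch below uses invented effect sizes (real ones come from GWAS) and deliberately ignores the nonlinearity described above:

```python
import random

random.seed(42)
n_loci = 5000
# Invented per-locus effect sizes (cm per copy of the alternate allele).
effects = [random.gauss(0, 0.05) for _ in range(n_loci)]

def predicted_height(genotype):
    """genotype[i] in {0, 1, 2}: copies of the alternate allele at locus i."""
    base = 170.0  # assumed population-mean height in cm
    return base + sum(g * e for g, e in zip(genotype, effects))

person = [random.choice([0, 1, 2]) for _ in range(n_loci)]
print(round(predicted_height(person), 1))
```

No single locus matters much here; the prediction emerges from thousands of tiny contributions, which is roughly what the height studies found.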

When I started out studying this some 35 years ago the problem sounded fairly simple, I assumed it would be easy to find the place in my genome that led to my funny shaped (inherited) nose, but the more I learn about genomics and phenotypes, the more I appreciate that the problem is unbelievably complex, and really well suited to large datasets and machine learning. All the pharma have petabytes of genome sequences in the cloud that they try hard to analyze but the results are mixed.

I spent my entire thesis working on ATGCAAAT, by the way. https://en.wikipedia.org/wiki/Octamer_transcription_factor is a family of proteins that are incredibly important during growth and development. Your genome is sprinkled with locations that contain that sequence- or ones like it- that are used to regulate the expression of proteins to carry out the development plan.


> If you had a list of all the genomes of all the people in the world, and all their phenotypes (height, eye color, hair type, etc), you could take all their genomes as input variables and treat all their phenotypes as output variables, and make embeddings or other models that mapped from genomes to phenotypes. The result would be a predictive model that could take a human genome, and spit out a prediction of what that person looks like and other details around them (up to the limits of heritability).

Would such a predictive model really be possible? As far as I'm aware, there is contradictory research on whether a specific phenotype can be traced to a distinct SNP/genotype.


I can't technically say with 100% confidence that it would be possible. It does seem extremely likely based on all the evidence I've seen over the past 30 years.

The model would be highly nonlinear and nonlocal, at the very least.


Fascinating. Are there lots of people looking at genetics through this kind of ML lens?


Sure, although I'm not aware of anybody who is contemplating quite the level I believe is necessary to really nail the problem into the ground. When I worked at Google, I proposed that Google build a datacenter-sized sequencing center in Iowa or Nebraska near its data centers, buy thousands of sequencers, and run industrial-scale sequencing, pushing the data straight to the cloud over fat fiber, followed by machine learning, for health research. I don't think Google wants to get involved in the physical sequencing part, but they did listen to my ideas, and they have several teams working on applying ML to genomics as well as other health research problems. Part of my job today (working at a biotech) is to manage the flows of petabytes of genomic data into the cloud and make it accessible to our machine learning engineers.

The really interesting approaches these days, IMHO, combine genomics and microscopic imaging of organoids, and many folks are trying to set up a "lab in the loop", in which large-scale experiments run autonomously by sophisticated ML systems could accelerate discovery. It's a fractally complex and challenging problem.

Statistics has been key to understanding genetics from the beginning (see Mendel, Fisher), so at a big pharma you will see everything from Bayesian bootstrappers using R to deep learners using PyTorch.


Guys at Verily are working on Terra.bio with the Broad Institute and others. Genomics England in the UK is also experimenting with multimodal data and machine learning applied to whole genome sequences [1].

[1] https://www.genomicsengland.co.uk/blog/data-representations-...


But why Google? This is what big pharma are doing. Also, you can outsource the data collection part. See, for example, UK Biobank: their data become available to multiple companies after some period, which makes the whole effort more cost-efficient.


Why Google? Because this is a big data problem and Google mastered big data and ML on big data a long time ago. Most big pharma hasn't completely internalized the mindset required to do truly large-scale data analysis.


I have spent the better part of the past year looking obsessively over genomics papers for cancer and I've grown very fond of the field.

Are there any positions at Google or other companies you would suggest I look into? I'm coming from algo trading / ML research, with an ML MSc.


You could try Calico. They are an Alphabet company that specifically studies aging, and they have a good number of machine learning roles. However, biotech typically pays less than finance or software.

https://calicolabs.com/careers/


Thanks!


Yes. For example, when word2vec came out, people immediately tried similar approaches on protein sequences. Transformers work better.
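A common preprocessing step in those word2vec-style approaches is to break a protein sequence into overlapping k-mers, which play the role of "words" in the embedding model. A minimal sketch (the sequence here is made up for illustration):

```python
def kmers(seq: str, k: int = 3) -> list[str]:
    """Split a protein sequence into overlapping k-mers, the 'words'
    fed to a word2vec-style embedding model."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmers("MKTAYIAK"))  # ['MKT', 'KTA', 'TAY', 'AYI', 'YIA', 'IAK']
```

The resulting token lists can then be fed to any off-the-shelf embedding trainer; transformer-based protein models instead operate on individual residues with learned positional context.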


The genetic code maps nucleotide sequences (DNA) to amino acid sequences (proteins). Every three bases (say AGT) maps to one amino acid. So you can literally read a sequence of ACGTs and decode it into a protein. A sequence that encodes a protein is called a gene.

Almost all variations that humans have in their genomes (compared to each other or a reference genome) are tiny, mostly one base differences called single nucleotide polymorphisms (SNPs). These tiny changes encode who you are. The rest of it just makes you carbon-based organism, a eukaryote, an animal, a mammal etc, just like a whole load of other organisms.
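The decoding described above can be sketched in a few lines. This uses a tiny hand-rolled codon table covering just this example (the full standard table has 64 entries; real toolkits such as Biopython provide complete translation):

```python
# Minimal codon table for this example only; the standard genetic code
# maps all 64 three-base codons to amino acids or stop signals.
CODON_TABLE = {
    "ATG": "M",  # methionine, the usual start codon
    "AGT": "S",  # serine (the AGT example from the text)
    "GCA": "A",  # alanine
    "TAA": "*",  # stop codon
}

def translate(dna: str) -> str:
    """Read a coding DNA sequence three bases at a time,
    stopping when a stop codon is reached."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGAGTGCATAA"))  # -> "MSA"
```

A SNP is then just a one-character change in the input string, which may or may not change the decoded protein depending on where in the codon it falls.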



I always make this mistake too as a computational biologist; when talking about DNA it’s megabases not megabytes.


What you are getting at is now called a “pangenome assembly”. Several high profile papers earlier this year, one by Guarracino and Garrison in Nature.

A pangenome is a complex graph model that weaves together hundreds or more genomes/haplotypes—usually of one species, but the idea can extend across species too, or even cells within one individual (think cancer pangenomes).

On the idealized human pangenome graph each human is represented by two threads along each autosome, plus threads through Chr X, Y, and the mitochondrial genome.
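A toy illustration of the variation-graph idea behind pangenomes (the segments here are made up; real graphs built with tools like vg span hundreds of genomes): nodes hold sequence segments, and each haplotype is a path, or thread, through them.

```python
# Nodes are sequence segments; a SNP appears as two alternative nodes.
nodes = {
    1: "ACGT",   # shared segment
    2: "A",      # SNP allele A
    3: "G",      # SNP allele G
    4: "TTCA",   # shared segment
}
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}

# Two haplotypes (e.g. one person's maternal and paternal copies)
# represented as paths through the graph
haplotypes = {
    "hap1": [1, 2, 4],
    "hap2": [1, 3, 4],
}

def spell(path: list[int]) -> str:
    """Reconstruct the linear sequence a haplotype threads through the graph."""
    return "".join(nodes[n] for n in path)

print(spell(haplotypes["hap1"]))  # ACGTATTCA
print(spell(haplotypes["hap2"]))  # ACGTGTTCA
```

Shared segments are stored once, so the graph grows slowly as more genomes are added; each new haplotype mostly reuses existing nodes.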


While you are correct, the differences between different people's DNA are tiny, less than 1% at best, so this information is still very valuable. This article is about the first time one person's Y chromosome has been sequenced completely.


> While you are correct, the differences between different people's DNA are tiny, less than 1% at best

How do we know this, if we have only sequenced the chromosome of one individual?


We have been sequencing genomes from many individuals, using various sampling and statistical models, for a long time.


Traditional sequences of the Y chromosome (and other chromosomes) were missing parts, particularly highly repetitive regions such as the telomeres and centromere. This is different from the issue of individual variation (although the authors do provide a map of known variants as well).


The title of the original article and as submitted here seems quite clear that it's of a specific individual: "The complete sequence of *a* human Y chromosome"


Thank you for putting into words exactly the thing I wanted to understand but couldn’t figure out how to ask.


Good question. Practically, they call their complete sequence a "reference sequence", which can be thought of as a baseline for comparison against the complete spectrum of human Y chromosome genetic variation, so at least people have a standard to compare against. The line in the abstract "mapped available population variation, clinical variants" is about the only mention of the issue.

Ideally we'd have hundreds if not thousands of complete genomes, which in total would reveal the population diversity of the human species as it currently exists, but this is a big ask. "Clinical variants" are of particular interest, as those are regions of the genome associated with certain inherited diseases, although the promises of individual genomic knowledge leading to a medical revolution have turned out to be wildly overblown.

Since the paper is paywalled, there's not much else to say, other than that they have a reference sequence (fairly arbitrary in origin; it could have come from any one individual, or possibly even be a chimera of several) to which other specific human Y chromosomes can be compared. Eventually that will lead to a larger dataset from many individuals, revealing the highly conserved and highly variable regions of the chromosome, population-wise.


It means that they are trying to find a baseline from which they can eventually clone a human being.

Let's not pretend that this is not an end goal. It always was.


I had the same question. Perhaps this will help you.

https://en.m.wikipedia.org/wiki/DNA_sequencing


That page describes the human genome as having been sequenced back in 2003. *confusion intensifies*


The project started around 1990; they announced a draft in 2000 and "completion" in 2003 (the latter more a token announcement based on a threshold than a true milestone). Even then, the scientists knew that major parts of the centromeres, telomeres, and highly repetitive regions were not fully resolved, and that was fully admitted. The work by Karen Miga at UCSC and others is more of a mop-up, now that genome sequencing is a mature technology and we have much better ways of getting at those tricky regions.

another "completion" happened 3 years ago, before this announcement. but this is the last one. I promise.


This sort of approach is only really viable in certain climates and with certain starting conditions, ideally away from extremes of temperature and drought, where seedlings can compete with what's already established. I live in the (climate change pending) sun-baked former prairie of East Denver, and a laissez-faire approach to the vegetation in my yard is a virtual guarantee that invasives (kochia, bindweed, curly dock) will completely dominate.

I've found you don't have to be totally organized about it: uprooting the worst of the noxious weeds, scattershot planting of desirables, and watering unpredictably around a busy life schedule all help. But this still takes a lot of time and energy, and at that point, if you value your time and energy, you find yourself needing to make plans and genuinely care about the process.


Not sure I agree: on one hand you have actual, physical potential energy; on the other, numbers on a computer that could become worthless depending on unpredictable economic factors.


The punctuation of the sentence is a little odd, but I take it to mean $1732 + $1200 = $2932.

Less $1100 of current rent, that’s $1832 per month to cover all other costs for two people. Few would elect to be traveling 6 days a week obtaining food assistance if there were easy expenditures to cut.


I don't understand this argument. If 8k employees were laying technical groundwork that can be transformed into product features more efficiently than greenfield work, how were they "doing nothing"? Twitter is a mature product used by millions, not every engineer is going to be shipping user-facing changes regularly.


This was my snarky way of saying that the company was doing nothing to improve its core product while massively overstaffed.


Especially sad given that, in my opinion, this project is doomed to fail anyway: a 21st century Tower of Babel. The plans for a year-round ski resort seem especially ridiculous given the relatively mild climate of the Sarawat mountains.


The tricky thing about calling something a Tower of Babel is that it's one thing to build the first one; it's quite something else to build a second, knowing about the first. This linear city idea is worse than the planned city they built in... Brazil? That was more two-dimensional, at least.

So this one goes up instead of out, but you’ll notice for instance the gap for the stadium. Stadiums are a bottleneck in regular cities. All traffic in the whole city has to drive past the stadium if they want to get somewhere on the other side, and with the gaps it’s a double choke point.

The Architect’s Sketch, Monty Python:

“Are you proposing to slaughter our tenants?” “… does that not fit in with your plan?” “[…] no it’s just that we wanted a block of flats, and not an abattoir” “Yes, well of course, that’s just the sort of blinkered, philistine pig-ignorance I’ve come to expect from you lot of non-creative garbage.”

The other candidate’s scale model breaks twice, and catches fire and then explodes. They decide that thin, sedentary tenants should avoid those problems.


It wouldn't be the first linear city. The idea was popular in the early 20th century for industrial settlements; Magnitogorsk was originally planned as such. None of the planned linear cities have stayed linear, to my knowledge. It only makes sense when the whole point of living there is to operate a massive production line.

