
SEEKING WORK | Netherlands | Remote (flexible timezones)

Technologies:

- Data Science: Pandas, Numpy, sentence-transformers, Matplotlib, Scikit-Learn, LLMs, ElasticSearch, Pinecone, PyTorch, nltk, XGBoost

- Data engineering: Dagster, Airflow, Spark, Kafka, BigQuery, Postgres, ElasticSearch, Great Expectations

- DevOps: Docker, Kubernetes, GCP, AWS

- Frontend: React.js, Next.js, TailwindCSS

- Backend: FastAPI, Express

Resume: https://www.linkedin.com/in/nsv-3b1a54234

Email: [email protected]

Hi! I've been working as a data scientist/MLE for the past 5 years: the last year as a freelancer, and before that on various NLP tasks at a hedge fund. My generalist skillset with a specialisation in NLP makes me a perfect fit for startups working on generative AI products. I work highly autonomously and thrive in uncertainty. I am available for 32 hours per week until May, and full time after that.


Location: The Netherlands

Remote: Yes (flexible timezones)

Willing to relocate: No

Technologies:

- Data Science: Pandas, Numpy, sentence-transformers, Matplotlib, Scikit-Learn, LLMs, ElasticSearch, Pinecone, PyTorch, nltk, XGBoost

- Data engineering: Dagster, Airflow, Spark, Kafka, BigQuery, Postgres, ElasticSearch, Great Expectations

- DevOps: Docker, Kubernetes, GCP, AWS

- Frontend: React.js, Next.js, TailwindCSS

- Backend: FastAPI, Express

Resume: https://www.linkedin.com/in/nsv-3b1a54234

Email: [email protected]

Hi! I've been working as a data scientist/MLE for the past 5 years: the last year as a freelancer, and before that on various NLP tasks at a hedge fund. My generalist skillset with a specialisation in NLP makes me a perfect fit for startups working on generative AI products. I work highly autonomously and thrive in uncertainty. I am available for 32 hours per week until May, and full time after that.


They are probably not silly mistakes. Label encoding can be very useful for tree-based models when the categories are ordinal, or when there is a high number of categories.
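For what it's worth, here's a toy pandas sketch of the two cases I mean (the data and column names are made up):

    import pandas as pd

    # Toy frame with a high-cardinality categorical and a genuinely ordinal one.
    df = pd.DataFrame({
        "merchant_id": ["m_481", "m_007", "m_481", "m_930"],  # thousands of levels in practice
        "size": ["S", "M", "L", "M"],                         # has a natural order
    })

    # High cardinality: integer codes keep the feature matrix narrow, and a tree
    # can still isolate individual merchants through repeated splits, whereas
    # one-hot encoding would add thousands of columns.
    df["merchant_code"] = df["merchant_id"].astype("category").cat.codes

    # Ordinal: encode with an explicit order so the integers carry meaning.
    size_order = ["S", "M", "L"]
    df["size_code"] = pd.Categorical(df["size"], categories=size_order, ordered=True).codes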


They are, most of the time. You end up making predictions from a meaningless float (unless the categories are ordinal, which isn't that common), and categories can change their assigned number at every run since they're not sorted consistently (this happens in lots of analyses). Crawl a few notebooks; I spotted that error quite often.
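To make the instability concrete, here's a toy sketch of the failure mode (made-up data, and assuming the common pattern of re-encoding whatever rows happen to arrive):

    import pandas as pd

    # Same categories, different row order between two "runs" of a notebook.
    run1 = pd.Series(["NL", "DE", "FR", "DE"])
    run2 = pd.Series(["DE", "FR", "NL", "DE"])

    # pd.factorize assigns codes in order of first appearance, not sorted order,
    # so the category -> integer mapping silently changes between runs.
    codes1, labels1 = pd.factorize(run1)
    codes2, labels2 = pd.factorize(run2)

    print(dict(zip(labels1, range(len(labels1)))))  # {'NL': 0, 'DE': 1, 'FR': 2}
    print(dict(zip(labels2, range(len(labels2)))))  # {'DE': 0, 'FR': 1, 'NL': 2}

    # Passing sort=True (or persisting a fitted encoder) keeps the mapping stable.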


AFAIK he never revealed his intent, so he could just be doing it 'for the lulz'... That seems more likely to me, considering his 4chan-y attitude on his other channel, 'Vagrant Holiday'.


Hi everyone,

I'm Nathan, working on subsets.io. We're building a search engine for data, where you search with your internal dataset and can add matches with one click.

— Context —

The datasets we work with during our analytics/modelling jobs are often lacking. They were generated by internal systems, but in many cases important attributes are missing, or the entities in the internal dataset are significantly influenced by external factors that are not captured.

Unfortunately, integrating with external data is a major hassle. External data is scattered across many different places (data dumps, API providers, API marketplaces, open data platforms, the web), in all kinds of formats. Integrating it is very time-consuming and requires significant technical skills, which is usually not an analyst's or scientist's core competency. Also, the fragmentation of external data makes exploration very difficult. Furthermore, it's often unclear whether a given external dataset will add value, so it can be hard to justify the integration investment.

We want to make data exploration and integration easier by treating it as a search problem. Because a dataset contains many values for a given column, we can make much stronger assertions about types than APIs that work with single values. We currently check basic types, and use language models to infer context for more complex string types. For example, if you upload a dataset with countries and their respective alcohol consumption for a year, our system will recommend adding smoking rates, tariff rates, and crime rates.
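To give a feel for the "many values per column" point, here's a rough, hypothetical sketch of the kind of column-level check I mean (not our actual pipeline; the helper name and threshold are made up):

    import re

    ISO2 = {"NL", "DE", "FR", "US", "GB"}  # tiny stand-in for the full ISO-3166 set

    def looks_like(column, predicate, threshold=0.95):
        """Column-level assertion: enough non-null values must satisfy the predicate."""
        values = [v for v in column if v is not None]
        hits = sum(1 for v in values if predicate(v))
        return bool(values) and hits / len(values) >= threshold

    countries = ["NL", "DE", "FR", "US", "GB"]
    years = ["2019", "2020", "2021"]

    print(looks_like(countries, lambda v: v in ISO2))                               # True: country codes
    print(looks_like(years, lambda v: re.fullmatch(r"\d{4}", str(v)) is not None))  # True: year-like

    # A single value like "NL" is ambiguous; a whole column of them is strong evidence.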

It's obviously a very complex problem, but I think that if we communicate uncertainty clearly, we can make the process a lot easier.

Would love to hear your thoughts.

PS: Feel free to drop me a message at [email protected] if you have any questions or would like to chat

Thanks :>


Keepa is a data company though, not an Amazon affiliate, so they shouldn't care about violating that policy.


Unfortunately this data is often inconsistent with the visual representation. For example, webshops often list their products as 'InStock' regardless of the actual stock status. Since products are in stock most of the time, you won't notice this and will likely extract wrong data in the future.

This was especially apparent when I tried to get my hands on some weights early in the COVID pandemic. All webshops were out of stock, but in about 80% of them the schema markup indicated otherwise.
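If you end up scraping this kind of markup anyway, it's worth cross-checking the JSON-LD availability against the rendered page text. A rough sketch (the URL and the 'out of stock' phrasing are hypothetical and site-specific):

    import json
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://shop.example/product/dumbbell-20kg").text  # hypothetical URL
    soup = BeautifulSoup(html, "html.parser")

    # What the schema.org markup claims.
    schema_says_in_stock = False
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "{}")
        except ValueError:
            continue
        if not isinstance(data, dict):
            continue
        offers = data.get("offers", {})
        if isinstance(offers, dict) and "InStock" in str(offers.get("availability", "")):
            schema_says_in_stock = True

    # What a shopper actually sees (phrasing varies per shop).
    page_says_out_of_stock = "out of stock" in soup.get_text().lower()

    if schema_says_in_stock and page_says_out_of_stock:
        print("Markup claims InStock but the visible page disagrees; trust the visible page.")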


I think the conclusion of a death toll estimation exercise may shift if you adjust for expected remaining lifetime.

In Scotland, the mean age of COVID deaths was higher than the average life expectancy[1], which suggests that on average those people would probably have lived fewer than 10 more years had they not contracted the disease. The expected remaining lifetime of the average individual in Scotland (assuming it stays unchanged) is ~37 years, so weighting deaths by expected remaining life-years would shrink the adjusted death toll by a factor of at least 3.7.
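Back of the envelope (rough, illustrative numbers only):

    # Rough, illustrative numbers for Scotland.
    remaining_years_avg_person = 37   # expected remaining life-years of the average individual
    remaining_years_covid_death = 10  # generous bound, given deaths above average life expectancy

    # Weighting deaths by expected remaining life-years instead of counting heads
    # shrinks the toll by roughly this factor:
    print(remaining_years_avg_person / remaining_years_covid_death)  # 3.7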

There have been a significant number of excess deaths that are not attributable to COVID [2]. I suspect that such deaths have a lower average age (e.g. due to suicides) than COVID deaths, further reducing the average life-years saved by the lockdowns.

I'm not proposing that we should optimise for life-years saved, but I do think that the death of an 86-year-old is less tragic than that of a 38-year-old, and that we perhaps shouldn't evaluate lockdown vs no lockdown purely based on the percentage of the population that died.

[1] https://www.bbc.com/news/uk-scotland-54433305

[2] https://www.newswise.com/factcheck/are-a-third-of-the-excess...


> I'm not proposing that we should optimise for life-years saved, but I do think that the death of an 86-year-old is less tragic than that of a 38-year-old, and that we perhaps shouldn't evaluate lockdown vs no lockdown purely based on the percentage of the population that died.

I of course agree with that. Just stating lives lost in isolation is misleading. It's simply much easier to use for back-of-the-envelope math, and generally, back-of-the-envelope math indicating that a course of action would lead to millions of excess deaths is enough to conclude that it probably won't be a great idea.

> which indicates that on average those people would probably live <10 years

This is plausible, but it may not be much less than 10 years either. The remaining life expectancy of someone who reaches the average UK life expectancy is close to another 10 years. And we'd probably both prefer that neither we nor any elderly relatives or friends suffocate 10 years before our time.

> There have been a significant amount of excess deaths that are not attributable to COVID [2].

So this initially sounds plausible (lockdown stress/economic hardship leading to an increase in suicides), but I believe it's bullshit and that these deaths are in fact due to COVID. Since we were talking about the UK, let's look at that: the baseline number of suicides is pretty low (~6k in the UK for 2019, roughly 1e-4 per individual per year). The number of excess deaths in 2020 is at least 12 times higher than that, so the suicide rate would have had to at least double to account for a significant part of it. And nationwide suicide rates are quite stable year-on-year, including, crucially, during times of severe economic hardship.
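Roughly, with approximate UK-scale numbers (illustrative only):

    # All numbers approximate; just the back-of-the-envelope point.
    uk_population = 67_000_000
    baseline_suicides_2019 = 6_000
    excess_deaths_2020 = 12 * baseline_suicides_2019  # "at least 12 times higher"

    print(baseline_suicides_2019 / uk_population)       # ~9e-5, i.e. about 1e-4 per person per year
    print(excess_deaths_2020 / baseline_suicides_2019)  # 12.0

    # Even a doubling of the suicide rate (historically very unusual) would add
    # only ~6k deaths, a small slice of the 2020 excess.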

