It's quite interesting that Anki (from the Japanese 暗記, "memorization", for those who don't know where the name comes from) is mentioned a few times, but nobody has mentioned the "memory palace"[1] or the "Major system"[2] yet. Both methods try to exploit features/bugs of the brains mother nature built over millions of years of evolution. Most memory athletes use some variation of these methods to achieve their astonishing performance. There's a Dr. Yep who can reportedly recall 65,000 English words, along with the location (i.e. page number and position on that page) of each word in a Chinese/English dictionary.
The trick is to invest some time building a personalized memory system, then use that system to remember other things like passport numbers, credit cards, phone numbers, etc. It might not be attractive to most people, because today we have smartphones to help us. But for learning new knowledge, the combination of spaced repetition and a memory system is quite interesting. Here's what I did: I used an Anki alternative called Memrise[3] to build a Major system. (I also have Anki installed, but I just prefer Memrise.) Then I use the Major system to review the concepts I've learned. The Major system here plays the same role as Anki: a short-term placeholder for spaced repetition, not something used directly for learning. I've found the Anki in my brain is better than the Anki/Memrise on my smartphone, even though I built the former with the latter.
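For anyone curious, the core of the Major system is just a digit-to-consonant-sound table. Here's a toy, letter-based sketch (only approximate, since real practitioners map sounds, not spellings):

```python
# A minimal sketch of the classic Major system digit/consonant assignments.
# A real decoder works on phonemes, so this letter-based toy skips
# ambiguous letters like c and g entirely.
MAJOR = {
    's': 0, 'z': 0,
    't': 1, 'd': 1,
    'n': 2,
    'm': 3,
    'r': 4,
    'l': 5,
    'j': 6,
    'k': 7, 'q': 7,
    'f': 8, 'v': 8,
    'p': 9, 'b': 9,
}

def word_to_digits(word: str) -> str:
    """Map each consonant to its Major-system digit; vowels carry no digit."""
    return ''.join(str(MAJOR[ch]) for ch in word.lower() if ch in MAJOR)

print(word_to_digits("tomato"))  # 131 -- so "tomato" can peg the number 131
```

In practice you go the other way: given a number like 131, you hunt for a vivid word whose consonants decode to it, then hang that image in your memory palace.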
Come on, fellow hackers, hacking your own brain is very rewarding.
Good article, but a bit short. Anyway, two books I can recommend for data engineering:
Designing Data-Intensive Applications - will give you a good overview of the tools, and of the theories, algorithms, and data structures behind different types of databases.
High Performance Browser Networking - throwing this in because it extends the book above beyond the data center to the last mile. Sometimes you can save yourself a bunch of headaches on the server side by doing things client side and caching/prefetching more than what the user asked for.
>a lot of the problems that data teams face in my experience are around how to expose data to the organization in a meaningful, regular, digestible, actionable way. This means thinking about how to ingest and transform data and present it to end users for analysis (BI) in such a way that it’s not painful or arduous to consume it, and ensuring that it is clean and accurate
100% this. I basically work in this role in my organization. I'm not sure how typical my experience is compared with other people who call themselves 'data engineers'. I'm usually put into the situation where I'm looking at how to capture a raw signal (out of an instrument, device, or similar), package it up as structured data, and store it in some kind of meaningful format people can use.
Useful skills I think (other than SQL) are:
- Some understanding of the PLC/SCADA/OPC communications world.
- Knowledge of signal analysis/DSP: Kalman filters, Fourier transforms, etc. Especially useful when you need to store a raw signal as "data" of some kind; also increasingly useful when the underlying data is in a format like video or audio.
- An understanding of time-series analysis can help with data transformation, etc. (downsampling a high-frequency signal into a lower-frequency one, for example).
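The downsampling mentioned in the last bullet can be sketched in a few lines. A minimal block-averaging version (in practice you'd reach for numpy/pandas resampling, and a proper anti-aliasing filter for anything serious; the averaging here only acts as a very crude one):

```python
# Downsample a high-frequency signal by block averaging, e.g. turning
# 1 kHz samples into 100 Hz samples with factor=10. Trades time
# resolution for storage; any trailing partial block is dropped.

def downsample(samples: list[float], factor: int) -> list[float]:
    """Average every `factor` consecutive samples into one."""
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, len(samples) - factor + 1, factor)
    ]

signal = [1.0, 3.0, 2.0, 4.0, 10.0, 12.0]
print(downsample(signal, 2))  # [2.0, 3.0, 11.0]
```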
Once you have captured the signal and stored it, you move on to the second part: how to make it available to end users in a meaningful and accessible way. This is where SQL knowledge and working with DBAs come into it. In my experience it has become increasingly common to need to understand cloud architecture as well. For example, in my org more and more of our storage is moving to Azure, so knowing how to navigate that stack and all the terms they use for things can save a lot of time and frustration.
Lastly, it can help if you are at least somewhat familiar with the BI tools the end users will ultimately use to leverage the data.
One really interesting aspect of this, and one I've had some revealing conversations about with the data engineers in my org, is how best to expose the firehose of data to people in BI tooling. We use dbt in my org, and Metabase as the BI tool, and a lot of thought is put into how to create a clearinghouse that serves the needs of the organization. The current pattern of interest is to ELT into what the data engineers call OBT (one big table). The OBT is cleaned, denormalized, and able to be sliced on. An org might have several of these OBTs covering various areas of interest. End users then import the OBT into Metabase to drive their filtering and build dashboards. The goal is to reduce reliance on custom SQL scripts and push all of that custom slicing and dicing of the data into Metabase's front-end logic, where filtering can be applied dynamically, rather than trying to maintain a bazillion SQL variants.
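If it helps, here's a toy sketch of the OBT idea using sqlite3. All table and column names are made up, and in a real setup the transform would live in a dbt model rather than inline SQL:

```python
# The "one big table" (OBT) pattern: join normalized fact and dimension
# tables into one wide, denormalized table that BI users can slice on
# without writing custom SQL per question.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER, name TEXT, region TEXT);
    INSERT INTO orders VALUES (1, 10, 25.0), (2, 11, 40.0);
    INSERT INTO customers VALUES (10, 'Acme', 'EU'), (11, 'Globex', 'US');

    -- The OBT: cleaned, denormalized, ready to be sliced on.
    CREATE TABLE obt_orders AS
    SELECT o.id AS order_id, o.amount, c.name AS customer, c.region
    FROM orders o JOIN customers c ON o.customer_id = c.id;
""")

# The BI tool applies filters dynamically against the one wide table,
# instead of maintaining a bazillion hand-written SQL variants:
rows = db.execute(
    "SELECT customer, amount FROM obt_orders WHERE region = 'EU'"
).fetchall()
print(rows)  # [('Acme', 25.0)]
```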
Eventually I think we will move into a post ChatGPT world where you’ll give ChatGPT (or whatever equivalent) your schema and a question and it will output the dashboards for you. We aren’t quite there yet though
I love all the series about writing your own X in 100 lines of code.
It gives you an understanding of the technology and strips away a lot of unnecessary detail.
Domain Driven Design. The book by Eric Evans lays out a bunch of concepts, and as a developer who had never owned the architecture of a big domain, it was hard for me to see exactly where they fit. But after reading the book a couple of times, and then encountering a few tricky domain-modeling challenges, I started to see where these patterns add value. Also, as I started trying to describe the cohesive domain architecture of the system to a growing engineering organization, the advantage of having a standardized set of terminology for the problem, rather than inventing your own, clicked for me as well. It's nice to be able to link to an existing explanation of what a Repository is for, instead of having to name and document your own ad-hoc architectural patterns (or, more likely, end up with your ad-hoc architecture being under-documented).
Things like Repositories, Aggregates, Bounded Contexts, and so on are going to be a net drag on your system if you only have a few hundred kloc in a monolith. But they really start to shine as you grow beyond that. Bounded Contexts in particular are a gem of an idea, so good that Uber re-discovered them in their microservices design: https://www.uber.com/blog/microservice-architecture/.
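For anyone unfamiliar, a Repository is roughly a collection-like facade over storage, so the domain code never touches persistence details. A minimal sketch (the names are mine, not from the book):

```python
# The Repository pattern from DDD: domain code talks to a collection-like
# interface; swapping the in-memory dict for a DB session doesn't change
# any calling code.
from dataclasses import dataclass

@dataclass
class Order:
    id: int
    total: float

class OrderRepository:
    """In-memory implementation; a real one would wrap a DB session."""

    def __init__(self) -> None:
        self._orders: dict[int, Order] = {}

    def add(self, order: Order) -> None:
        self._orders[order.id] = order

    def get(self, order_id: int):
        """Return the Order, or None if it isn't stored."""
        return self._orders.get(order_id)

repo = OrderRepository()
repo.add(Order(id=1, total=99.5))
print(repo.get(1).total)  # 99.5
```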
Data warehouse: bundles compute & storage and comes at a comparatively high price point. Great option for certain workflows. Not as great for scaling & non-SQL workflows.
Data lake: typically refers to Parquet files / CSV files in some storage system (cloud or HDFS). Data lakes are better for non-SQL workflows compared to data warehouses, but have a number of disadvantages.
Lakehouse storage formats: built on open-source file formats, and they solve a number of data lake limitations. The options are Delta Lake, Iceberg, and Hudi. Lakehouse storage formats offer a ton of advantages and basically no downsides compared to, for example, plain Parquet tables.
You mean Lake 'Shanty' Architecture (think DataSwamp vs. DataLake), am I right?
But in all seriousness, I totally agree with your opinion on LakeHouse Architecture and am especially excited about Apache Iceberg (external table format) and the support and attention it's getting.
Although I don't think that selecting any of these data technologies/philosophies comes down to a mutually exclusive decision. In my opinion, they either build on or complement each other quite nicely.
For those that are interested, here are my descriptions of each...
Data Lake Arch - all of your data is stored on blob storage (S3, etc.), partitioned thoughtfully and easily accessible, along with a meta-index/catalogue of what data is there and where it is.
Lake House Arch - similar to a DataLake, but the data is structured and mutable, and hopefully allows for transactions/atomic ops, schema evolution/drift, time-travel/rollback, and so on... Ideally all of the properties you usually assume you get with any sort of OLAP (maybe even OLTP) DB table. But the most important property, in my opinion, is that the table is accessible through any compatible compute/query engine/layer. Separating storage and compute has revolutionized the Data Warehouse as we know it, and this is the next iteration of that movement, in my opinion.
Data Mesh/Grid Arch - designing how the data moves from a source all the way through each and every target, while storing/capturing this information in an accessible catalogue/meta-database even as things change. As a result it provides data lineage and provenance, potentially labeling/tagging, inventory, data-dictionary-like information, etc. This one is the most ambiguous, probably the most difficult to describe and to design/implement, and to be honest I've never seen a real-life working example. I do think this functionality is a critical missing piece of the data stack, whether the solution is a Data Mesh/Grid or something else. Data engineers have their work cut out for them on this one, mostly because this is where their paths cross with those of application/service developers and software engineers. In my opinion, developers are usually creating services/applications that are glorified CRUD wrappers around some kind of operational/transactional data store like MySQL, Postgres, Mongo, etc. Analytics, reporting, retention, volume, etc. are usually an afterthought and not their problem. Until someone hooks the operational data store up to their SQL IDE or Tableau/Looker and takes down prod. Then along comes the data engineer to come up with yet another ETL/ELT pipeline to get the data out of the operational data store and into a data warehouse, so that reports and analytics can be run without taking prod down again.
Data Warehouse (modern) - Massively Parallel Processing (MPP) over detached/separated columnar (for now) data. Some data warehouses are already somewhat compatible with data lakes, since they can use their MPP compute to index and access external tables. Some are already planning to be even more Lake House compatible by not only leveraging their own MPP compute against externally managed tables, but also managing external tables in the first place. That includes managing schemas and running all of the DDL (CREATE, ALTER, DROP, etc.) as well as DQL (SELECT) and DML (MERGE, INSERT, UPDATE, DELETE, ...). Querying across native DB tables and external tables (potentially from multiple Lake Houses or Data Lakes) all becomes possible with a join in a SQL statement. Additionally, this allows for all kinds of governance-related functionality as well: masking, row/column-level security, logging, auditing, and so on.
As you might be able to tell from this post (and my post history), I'm a big fan of Snowflake. I'm excited for Snowflake-managed Iceberg tables, with the data then consumed by a different compute/query engine. Snowflake (or another modern DW) could prepare the data (ETL/calc/crunch/etc.) and then manage it (DDL & DML) in an Iceberg table. Then something like DuckDB could consume the Iceberg table schema, listen for table changes (oplog?), and read/query the data, performing last-mile analytics (pagination, ordering, filtering, aggs, etc.).
DuckDB doesn't support Apache Iceberg, but it can read Parquet files, which are used internally by Iceberg. Obviously supporting external tables is far more complex than just reading a Parquet file, but I don't see why this isn't in their future. DuckDB guys, I know you're out there reading this :)
Graph nodes are an excellent framework for thinking about what constitutes intelligence, thanks for sharing.
I really don't like the term "smart" because it is vague: anecdotally it typically means memory, but it can also relate to a high level of creativity and problem-solving ability. All of these, I feel, are aspects of intellect.
The main vectors of general intelligence, as I see it, are:
memory/knowledge - here relating to the number of nodes held in the graph
curiosity/learning - the rate at which nodes are added to the graph
abstraction/understanding - the ability to distill knowledge, almost like a hash map, though I think an ML model trained on some subset of nodes is a more accurate (but kinda cheating) approximation.
critical thinking - the window size with which you attempt to traverse the graph
creativity/problem-solving - the ability to formulate rules for a novel traversal of the graph
logic/reasoning - the ability to traverse the graph in an efficient manner and give the right output
astuteness - the ability to recognize nodes that don't belong or are missing on the traversal
self-awareness - indexing the contents of your graph and relating it to all possible nodes, along with your understanding of what "all possible nodes" represents.
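To make the metaphor concrete, here's a toy sketch in which "critical thinking" is the traversal window (a depth limit on how far you explore from a starting concept). The graph contents are made up:

```python
# Knowledge as a graph: nodes are concepts, edges are associations.
# "Critical thinking" in the taxonomy above maps to the window (depth)
# you're willing to traverse from a starting node.
from collections import deque

knowledge = {
    'calculus': ['limits', 'physics'],
    'limits': ['epsilon-delta'],
    'physics': ['mechanics'],
    'mechanics': [],
    'epsilon-delta': [],
}

def reachable(graph, start, window):
    """Breadth-first traversal limited to `window` hops from `start`."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == window:
            continue  # the window caps how far we think outward
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

print(sorted(reachable(knowledge, 'calculus', 1)))
# ['calculus', 'limits', 'physics']
```

A wider window reaches 'epsilon-delta' and 'mechanics' too, which is the sense in which a larger traversal window surfaces less obvious connections.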
Then there are higher level abstractions, wisdom is the confluence of a high level of understanding, critical thinking, reasoning, astuteness and self-awareness. Smarts is some impressive combination of knowledge, critical thinking, problem-solving and reasoning.
Sequence and Activity are my most used - beyond the ones above logical architecture tends to be a Draw.io created abstraction that ends up in a Google Doc... stuff like this: https://camo.githubusercontent.com/f14ac82eda765733a5f2b5200... (not one of mine)
A lot of this can be simplified to three questions:
1. What problem is your company solving?
If you don't get an answer, beware. If the answer sounds vague, beware. If the answer makes no sense, beware. If the answer is multifaceted, beware. This suggests that the company will not even begin the process of becoming profitable.
2. Who has this problem?
You should get a clear picture of an actual person. If not, beware. If that person has no money, beware. If that person has no pull within an organization, beware. If that person is high maintenance or fickle, beware. This suggests that the company will never find the revenue they seek.
3. What's your solution?
If the solution doesn't actually address the problem, beware. If the solution is too expensive for the customer, beware. If the solution can't be differentiated from its competitors, beware. If the solution has no competitors, beware. If there are a dozen solutions, beware. This suggests that no matter how amazing the technology or technical team, the company will not be able to execute on its business plan.
> So, I say to this poster, it's true that things like idle chitchat may not be your cup of tea, I think it's important to realize that many people don't enjoy it either but just use it as a social grease to move into more deep conversation or connection.
it took me probably 7 years after moving out of my parent’s place to learn that “smalltalk” and “idle chitchat” are not necessarily the same thing. the “aha!” moment was attending a conference with a much older coworker and seeing him strike up conversations with the people sitting next to him which in the course of 15 minutes went from surface level to incredibly personal, sometimes philosophical things that i would never have thought a stranger willing to discuss, prior to that.
in my taxonomy, “idle chitchat” is talking about things. “smalltalk” is learning about each other. “the weather sure is nice today, isn’t it?” => idle chitchat. you’re not likely to understand a person from that starting point, except that “wow look, we both like the sun”. “where are you from? what brings you here?” => smalltalk. you’re encouraging the person to reveal some small amounts of information about themselves which you can use to probe further and hopefully find something fascinating (about them, about a topic you haven’t thought much about, or about yourself and how you relate to something in contrast to them).
now i take conversations with strangers (or anyone really) as a sport, as a challenge. “how can i use these precious moments we strangers share to discover something new? to leave one of us pondering something novel later in the evening”. sometimes these lead to lasting friendships, usually not. but still frequently beneficial to my life. important to identify the situations where smalltalk has the possibility of going beyond surface-level things though. an elevator ride — probably not. a conference, a party, a group activity, anything where people have already put themselves out there more than normal — seems to select for people more likely to “get” smalltalk, or maybe it primes them to open up more, idk.
A related issue is that many technical people mistake a passion for their craft for a passion for their work.
For example, I am very passionate about statistics, I spend a tremendous amount of my free time studying it. I also work as a data scientist. For far too long I mistook a passion for statistics as a passion for data science. I know many software engineers make a similar mistake regarding their passion for programming.
This is a surprisingly big issue in my experience because commitment to your craft can lead to friction with your work and vice versa. This is not a problem when you realize these are two distinct things, but can lead to problems if you aren't aware of this difference.
The most obvious one is that confusing work for craft means you can put more energy into your employer's goals than into ones related to bettering your craft (and yourself). For a software engineer, at first, a late-night coding session can benefit both. However, in the long run, if you keep spending time solving your employer's problems, you will have less energy to study and practice software for its own sake. This can also lead to burnout, in which you start to lose your passion for your craft as well.
The reverse of this is also true: being very good at your craft can hurt you professionally. Your employer doesn't care about good code, or the correct statistical models. In the past, whenever I saw fundamentally incorrect statistical tools being used in production at work, I couldn't help but stop and try to correct it. I've seen many software engineers struggle similarly when orgs make bad technical decisions.
I failed many interviews because the interviewer had a mistaken view of things, and rather than just play along, I would try to correct them (I've learned that no matter how sincere and kind you are in your correction, it is always a mistake to correct an interviewer). I distinctly remember the first time an interviewer incorrectly "corrected" me, and instead of justifying my decision, I just said "wow you're right, I was just sketching out some ideas here, but that path is worth investigating". Got that job very easily.
Eventually I realized that I am passionate about statistics and mathematical modeling; these are related, but ultimately tangential, to my day job. It's great that I get paid well to do something closely related to what I love to study, but at the end of the day it's no different than a true coffee lover working at Starbucks.
Does anybody know of something similar, but written in a more elaborate book-like fashion?
It's not about format, of course; it's just that all these model zoos are only half of the story — not useless, but incomplete. That is, if you learn a lot of them, one day you might remember one and find it useful, or you might not. This is a bit abstract, but I want to say you rarely ever need any special tools of thinking (or maybe even tools in general) — you want a technology. That is, a relatively broad (but precisely defined) instruction for how to identify and approach a set of problems in some domain. I.e., it is more about when to use a tool or technique, rather than being an exhibition of such tools. I mean, there is an important difference: it's kinda the difference between going to college and going to a museum.
For example, GTD is a technology (good or bad doesn't matter). It tells you when to use the tools/techniques it suggests (and a better book would have more showcases and exercises to internalize the idea, IMO). This site, or Coursera's famous "Model Thinking" course, aren't.
I've worked at 75% of FANGs. My method was basically just to bone up on algorithms and data structures by going through the Wikipedia list and trying to implement them with pen and paper. Practice thinking out loud; practice space management (no backspacing on pen and paper). Be honest if you've heard a question before. Know how to analyze the runtime of your algorithms.
I chose to interview in Python, even though I know other languages better, because it is fairly dense relative to, say, Java or C++, and easier to write by hand.
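The kind of pen-and-paper exercise described above, using binary search as an example; being able to state "O(log n) time, O(1) space" without prompting is the runtime-analysis part:

```python
# A classic from the data-structures/algorithms list: binary search.
# Runtime: O(log n) time, O(1) extra space.

def binary_search(xs: list[int], target: int) -> int:
    """Return the index of target in sorted xs, or -1 if absent."""
    lo, hi = 0, len(xs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2  # avoids overflow concerns in Python
        if xs[mid] == target:
            return mid
        elif xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([1, 3, 5, 7, 9], 7))  # 3
```

Writing something like this by hand, narrating each branch out loud, is close to what the actual interview loop feels like.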
I read these in 5th grade or so and learned most of middle school and high school math just from those.
They're written as a fantasy novel in which the characters discover the relevant math concepts as part of the story. I know, you're thinking that's a silly gimmick and it can't possibly be any good, but it was actually done quite well. At least until the later parts of Calculus where the situations have to get pretty contrived.
Highly recommended, at least for precocious kids who like math and fantasy novels.
I do enjoy Khalid's explanations, and I applaud anyone who is exploring different mediums and methods to help improve education, especially around Mathematics and Computing (my areas of interest).
Commercial television news is very manipulative, and the intent of the manipulation is to keep eyeballs glued to the screen to sell ads. They use lots of elaborately animated transitions with bright colors, swooshing sound effects, and musical cues to mesmerize viewers, just as slot machines in Vegas use such things to mesmerize gamblers. It's the same sort of graphical bullshit they pad out televised American football with. And the talking-head personalities emotively reading from teleprompters are there for viewers to form parasocial relationships with, to keep viewers coming back to that channel. The term "parasocial" was in fact coined before social media was invented, to describe the asymmetric relationships television viewers have with television personalities. The whole industry has manipulation down to a science, and I think anybody would be better off reading newspapers instead. Publicly funded news channels may be better; I remember PBS's NewsHour was okay. But is there any value in watching CNN instead of reading the NYTimes? I don't see any.