Hacker News

Good article, but a bit short. Anyway, two books I can recommend for data engineering:

Designing Data-Intensive Applications: will give you a good overview of the tools, and of the theories, algorithms, and data structures behind different types of databases.

High Performance Browser Networking: throwing this in because it can extend the book above beyond the data center to the last mile. Sometimes you can save yourself a bunch of headaches on the server side by doing things client side and caching/prefetching more than what the user asked for.



Interesting. I work in the space currently (an ELT company) and I wouldn’t characterize the skill sets you mentioned as critical to a data team or a data driven organization. They are helpful, sure, but a lot of the problems that data teams face in my experience are around how to expose data to the organization in a meaningful, regular, digestible, actionable way. This means thinking about how to ingest and transform data and present it to end users for analysis (BI) in such a way that it’s not painful or arduous to consume it, and ensuring that it is clean and accurate. In other words, how well you move the data has always been a secondary point to how people can access and use said data.

Note that I’m not in a dedicated data engineering role, but have worked adjacent to it for the better part of a decade and have done a fair amount of data engineering in that time.


>a lot of the problems that data teams face in my experience are around how to expose data to the organization in a meaningful, regular, digestible, actionable way. This means thinking about how to ingest and transform data and present it to end users for analysis (BI) in such a way that it’s not painful or arduous to consume it, and ensuring that it is clean and accurate

100% this. I basically work in this role in my organization. I'm not sure how typical my experience is compared with other people who call themselves 'data engineers'. I'm usually put into the situation where I'm looking at how to capture a raw signal (out of an instrument, device, or similar), package it up as structured data, and store it in some kind of meaningful format people can use.

Useful skills, I think (other than SQL), are:

- Some understanding of the PLC/SCADA/OPC communication world.

- Knowledge of signal analysis/DSP: Kalman filters, Fourier transforms, etc. Especially useful when you need to store a raw signal as "data" of some kind, and increasingly useful when the underlying data is in a format like video or audio.

- An understanding of time-series analysis, which can help with data transformation (downsampling a high-frequency signal into a lower-frequency one, for example).
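To make that last point concrete, here's a minimal sketch of downsampling with pandas. The sensor rate, time window, and "vibration" signal are all made up for illustration; the point is keeping mean/min/max per bucket so the downsampled series doesn't silently lose its peaks.

```python
import numpy as np
import pandas as pd

# Hypothetical setup: one minute of a 100 Hz sensor signal (6000 samples).
rng = np.random.default_rng(seed=0)
index = pd.date_range("2024-01-01", periods=100 * 60, freq="10ms")
signal = pd.Series(
    np.sin(np.linspace(0, 8 * np.pi, len(index))) + rng.normal(0, 0.1, len(index)),
    index=index,
    name="vibration",
)

# Resample into 1-second buckets. Aggregating with mean/min/max (instead of
# mean alone) preserves the envelope of the original signal.
downsampled = signal.resample("1s").agg(["mean", "min", "max"])
print(downsampled.shape)  # (60, 3)
```

The same resample-and-aggregate pattern works for storage too: keep the raw signal in cheap blob storage and load only the downsampled table into the database end users query.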

Once you have captured the signal and stored it, you move on to the second part: how to make it available to end users in a meaningful and accessible way. This is where SQL knowledge and working with DBAs come into it. In my experience it has also become increasingly common that you need to understand cloud architecture. For example, in my org more and more of our storage is moving to Azure, so knowing how to navigate that stack, and all the terms they use for things, can save a lot of time and frustration.

Lastly it can help if you are at least somewhat familiar with the BI tools the end users will be using to ultimately leverage the data.


One really interesting aspect of this, which I've had some revealing conversations about with the data engineers in my org, is how to best expose the firehose of data to people in BI tooling. We use dbt in my org, and Metabase as the BI tool, and a lot of thought is put into how to create a clearinghouse that serves the needs of the organization. The current pattern of interest is to ELT into what the data engineers call OBT (one big table). The OBT is cleaned, denormalized, and able to be sliced on. An org might have several of these OBTs covering various areas of interest. End users then import the OBT into Metabase to drive their filtering and build dashboards. The goal is to reduce reliance on custom SQL scripts and push all of that custom slicing and dicing into Metabase's front-end logic, where filtering can be applied dynamically, rather than trying to maintain a bazillion SQL variants.
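For anyone unfamiliar with the OBT idea, here's a tiny pandas sketch (the table names and columns are hypothetical; in practice this join would live in a dbt SQL model, not in Python). The point is that the dimensions are pre-joined onto the facts, so BI users can filter on region or segment without writing their own joins.

```python
import pandas as pd

# Hypothetical normalized sources.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 10, 20],
    "amount": [99.0, 15.0, 42.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20],
    "region": ["EMEA", "AMER"],
    "segment": ["Enterprise", "SMB"],
})

# The "one big table": facts with dimensions denormalized in.
obt = orders.merge(customers, on="customer_id", how="left")
print(obt.columns.tolist())
# ['order_id', 'customer_id', 'amount', 'region', 'segment']
```

The trade-off is the usual one: you pay in storage and refresh time for width, and in return the BI tool only ever needs single-table filters and group-bys.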

Eventually I think we will move into a post-ChatGPT world where you'll give ChatGPT (or whatever equivalent) your schema and a question, and it will output the dashboards for you. We aren't quite there yet, though.


I like this, and I think it's where modern AI will shine the most. Like Clippy, but for data.

The (outside of this scope) question is what happens when you feed that decision back into the system. I think the "recursive AI" question has been exhausted though.


Have you evaluated Superset or Lightdash against Metabase? If so I'd love to hear about your experience. I'll shortly be helping a client company migrate BI off Looker and haven't gotten my hands dirty with the options yet.


I imagine the number of people who fit in the Venn overlap of data engineering and signal-analysis knowledge/skills is pretty small. Is hiring for that kind of role difficult? How would you recommend people with skills in SWE/DE become hireable in the signal processing world?


I currently work in a data engineering role and agree with your assessment. At the business I work for, there was no BI or data-management process automation until I joined and wrote all the pipelines (in Python) and queries (in T-SQL). Being comfortable with Python and pandas/arrow (or some other language or ETL/ELT platform), SQL, and REST APIs is usually necessary or at least helpful, but SQL is ubiquitous. The most challenging problems for the data management team I'm on involve:

(1) getting datasets from historically siloed data management teams to join with each other cleanly and programmatically — which in my case is a political issue, not a technical one

(2) determining the best way to replace old tabular (Excel) reporting with modern interactive (Tableau) dashboards that refresh from a database — which in my case usually means getting a better understanding of many cross-team business processes, understanding how data management is involved in them, and how data is being used to make decisions and take action

(3) determining how to structure and present this data to them with the least possible friction, and in a way that's the most useful and impactful to them

This is all to say that DE roles seem to vary wildly depending on the domain you work in and the kind of business you work for. But at the end of the day the purpose of the role is to make clean and accurate data available to end users with the least friction and in the most impactful way. And most kinds of data businesses use can be represented by, transformed with, and interacted with using SQL.
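As a minimal sketch of the Python-plus-SQL pipeline shape described above: a real pipeline would extract from a REST API (e.g. with requests) and load into a proper warehouse, but here the extracted payload is stubbed out and SQLite stands in for the database so the example runs anywhere. All names are illustrative.

```python
import sqlite3

import pandas as pd

# Extract (stubbed): in a real pipeline this would be an API response.
# Note the count arrives as a string, a typical cleaning chore.
extracted = [
    {"id": 1, "team": "ops", "count": "12"},
    {"id": 2, "team": "finance", "count": "7"},
]

# Transform: light cleaning in pandas before load.
df = pd.DataFrame(extracted)
df["count"] = df["count"].astype(int)

# Load: into a database, where downstream work stays in SQL.
conn = sqlite3.connect(":memory:")
df.to_sql("raw_counts", conn, index=False)

total = conn.execute("SELECT SUM(count) FROM raw_counts").fetchone()[0]
print(total)  # 19
```

Keeping the transform logic thin in Python and pushing the analytical work into SQL matches the point above: SQL is the lingua franca, and the Python layer mostly exists to get the data somewhere SQL can reach it.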

There are tons of SQL resources, but I'd love to find some books that cover these deeper issues of politics and business-process discovery from a data engineering perspective.


I wrote Excel macros that connected to a VAX VMS DB2 database to get updated data, so that management could just open the same Excel files as they always did. This was the early/mid 1990s.


This kinda sounds like higher-level concerns, where the books cover lower-level problems?


I think it's just two sides of the same coin. There's definitely an art to making data consumable for end users. It's even an engineering challenge to open these data lakes to end users without killing the database (think of a dev trying to graph load average for all running servers over the last 3 years, when there are 5,000 servers). You can't really blame them, and realistically they might have a project where they need that data. And then there are data-crazy people who make 400 graphs of various metrics, of which only 20 get actively used. It can lead you down red-herring paths and all kinds of chaos as an SRE, and also blow the budget.

Anyway, I understand what they are saying and unfortunately I have no book recommendations to solve these problems. In larger companies it can be much more of a challenge than actually ingesting data, especially in the observability space.


To add some additional notes - Designing Data Intensive Applications was a heavy read to dive into for me as a junior engineer. I took a step back after the first couple of chapters and read Understanding Distributed Systems[0] which was a fantastic primer.

[0] https://understandingdistributed.systems



