
The takedown of abstraction and software engineers (using Java as an example) is similar to saying "back in the day, to find a prime number we would simply use a sieve, but today it is a tedium, what with all the pi's and e's and thetas that get in the way, and what are geometry and polynomials doing here, and what in God's name is this i; I just want to count the prime numbers, which are nice round whole numbers".

That's what happens when a topic grows from being a curiosity where dilettantes dabble into a proper field that is applied to solve problems. Granted, some of the developments can indeed be tedious and self-indulgent, but otherwise this is the natural progression. It's sad and frustrating when people who ought to know better make such statements. Is it done to provoke critical analysis, positive trolling if you will?

About the role of data scientist, I find it both amusing and disappointing that just about anyone with a three-week MOOC, who otherwise had never dealt with statistics before, gets to work in this field. I mean, statistics is a three-year-long grueling applied maths degree, and condensing it to three weeks is silly. It is actually in this way that it is similar to the programming jobs of the 90s (I don't know how it was in the 70s, I wasn't born yet). Just about anyone who could learn Java or Visual Basic, or the self-taught cowboys who used C, ended up programming professionally. Actually it was not that bad, for coding is not as hard as it's made out to be, at least until they got sucker-punched by n-squared complexity, to say the least, on big data. Coding couldn't help them, and they realized programming was more than learning to code and using some APIs and system calls. (I was one of them in a way, when I started to code in C++ to model and simulate my mechanical engineering project, and it led me to the path of enlightenment.) So today's data scientists who are not bona fide statistics graduates or statisticians have it coming as well, whatever the analogue is, unless they are merely "data monkeys", in which case all is well and as expected.
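That n-squared sucker punch is easy to illustrate. Here is a minimal Python sketch of a made-up deduplication task (not anything from the thread): the same loop with a list for membership tests is O(n^2), while a set makes it roughly O(n), which is exactly the difference that only shows up once the data gets big.

```python
def dedupe_quadratic(items):
    # Membership test on a list is O(n) per item -> O(n^2) overall.
    seen = []
    out = []
    for x in items:
        if x not in seen:
            seen.append(x)
            out.append(x)
    return out

def dedupe_linear(items):
    # Membership test on a set is O(1) on average -> O(n) overall.
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out
```

Both return the same result on small inputs; the difference only bites at scale.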



I'm a scientist (wet lab) by training, a programmer (back end) by profession, and a data scientist by hobby (I have a machine learning project that I'm working on), and most of "data science" is not really stats... There will be a bit of stats in the end product, but really the bulk of the necessary work is data curation. Annoying stuff like making sure my data fit into the right buckets.

I did have to debug a memory leak that only showed up when I deployed my data pipeline on 22 cores.


N == 22 specifically? Or N >= 22? Interesting threshold value


My box has 24 cores; by default I deploy on 22. Actually it fails at 10 cores, but it gets 3/4 of the way through the dataset. At 22 it dies about 1/4 of the way in; at 5 cores it makes it all the way through.

The error is in a string tokenizer, which I wrote as a recursive call. Usually it's fine, but I made a code change which absolutely killed it. Also, I'm writing in Julia, which does not do tail-call optimization (TCO); the back-end stuff I do is in Elixir, which does.
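For readers unfamiliar with the failure mode: without TCO, a recursive tokenizer uses one stack frame per token, so a long enough input blows the stack. Here is a hypothetical Python stand-in (Python, like Julia, does not do TCO; this is not the actual code from the comment above):

```python
def tokenize_recursive(s, tokens=None):
    # One stack frame per token: fine on short strings, fatal on long ones.
    if tokens is None:
        tokens = []
    s = s.lstrip()
    if not s:
        return tokens
    head, _, rest = s.partition(" ")
    tokens.append(head)
    # This tail call is NOT optimized away, so the stack grows with input size.
    return tokenize_recursive(rest, tokens)

def tokenize_iterative(s):
    # Same result, constant stack depth.
    return s.split()
```

On a few thousand tokens the recursive version raises RecursionError while the iterative one is unaffected, which is the kind of bug that only surfaces once the dataset (or core count, via chunk sizes) changes.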


Why is 22 particularly interesting?


It's not. The fact that it doesn't readily have any real-world significance is what makes it an 'interesting' (read: odd, curious) threshold value, which is why I asked OP whether it would only fail at that core count (N == 22) or whether it affected all processor counts higher than that value. I can see that my use of 'interesting' was colloquial and not literal. My bad for any confusion this may have caused ;)


"Data monkey" is an even more fitting name than the corresponding "code monkey". Much of so-called data science leans more toward the data engineer side, where one fits existing solutions to one's specific data. The split between data scientist and data engineer is most unfortunate. It's like splitting programming into program design and development (the opposite of devops) in a specific language. That's done too, but usually the spec is functional (behavior, not in the FP sense) and not algorithmic.

If this pattern works for anyone, great: keep running with it and find its limitations. I just believe that there will be a movement toward better structuring and more selective application.

A true data scientist would be doing research into new solutions or high-level improvements. This can't happen at typically sized companies unless it's the core product and not a feature of one.

The big data and data scientist/engineer bandwagon is a little like blockchain: everyone wants to leverage it, and there are places where it is suitable, but it is applied well beyond them.


The risk I see is overusing data science in circumstances where it is just a product feature. The risk is then to overemphasize the data science part and forget the relevant context, like getting lost in the data itself.

A tendency I saw is that math graduates put everything into probability functions. That reality is composed of people who cannot be predicted is sometimes beyond their horizon. As a result, everybody believes the solution is mathematically correct and thus suited to reality, while it is quite the opposite.

EDIT: Typos, again...


A good programmer is someone who can communicate with the problem domain experts and provide a solution to fit their problem; someone who understands the limitations of the computing environment and can engineer a solution that is adequate for the problem space.

Many who consider themselves programmers produce solution space solutions that just don't get to the core of the problem space problems. This is a function of the simple fact that many programmers never have the opportunity to see what the problem space experts are doing or actually need. This is a real shortcoming in the education of programmers.

We don't have to be subject matter experts in all fields, we just need to become competent in being able to understand the kinds of problems that are being faced by the various subject matter experts that we build systems for.

On the other side of that coin are those who are subject matter experts who think it is easy enough to become competent programmers. What they miss is the essential problem that programming is, itself, a field that requires a subject matter expert. I have come across too many systems that have been developed by the subject matter experts that were just wrong. Wrong in design, wrong in understanding the limitations of the tools being used, wrong in oh so many ways.

To build properly functional and functioning systems requires the cooperation, input and continual communication between those who are subject matter experts facing problem space problems and those who are subject matter experts in computing systems. This is a rare event and so we see the problems in every field with the computing systems that currently exist.


> Many who consider themselves programmers produce solution space solutions that just don't get to the core of the problem space problems. This is a function of the simple fact that many programmers never have the opportunity to see what the problem space experts are doing or actually need. This is a real shortcoming in the education of programmers.

I can't argue with that. I would add that another factor is how the work is structured and presented to the programmers. In many places, programmers are largely disconnected from any users. The programmers are usually given a set of requirements by a third party who themselves derived it from someone other than a user/consumer. Thus, programmers may end up producing lovely programs that don't actually address the needs of users/consumers.


Good points, if a bit verbose. :)

Tangent: regarding "problem space vs solution space" issues, I find that many projects suffer needlessly from too much focus on one of these over the other. Learning to balance them isn't easy, but is critically important.


It was one of my former managers/mentors that introduced me to the concepts of problem space vs solution space. As the decades have passed since then, what I have seen is that most computing solutions that have been offered for the problems people have experienced do not really consider what the problem is that is being faced.

It takes a lot of effort to actually elucidate what the actual problem is that needs solving. Which is why I have made the comment earlier that programmers need to get out and see what the end user (client/customer/whatever you might want to call them) is actually doing and experiencing. When all you have is some design documents, functional specifications and technical specifications, the actual working environment for the solution is then missing.

We need to get out and face the complaints, observations, ire and suggestions of those who use the software we write.

Edit: as for verbosity, my mother has made the statement for many decades that, of her children, I was the one who could talk the legs off a cast iron stove. As my sons and daughters, grandsons and granddaughters, nephews and nieces have all had to learn, to shut me up, they have to talk.


wrt verbosity: Haha, I'm the same way, as in: "Sorry this [email|message|comment|...] is so long, I didn't have time to write a short one."

wrt problem space, yes! In contrast to all the focus on product development and engineering methodologies, somehow customer development generally suffers from a lack of rigor and attention. Ditto marketing -- in the sense of identifying or growing a market for the goods or services on offer.


I have never seen someone with "a 3 week MOOC" getting a data science job. In fact, those jobs are being gatekept to a ridiculous degree, suddenly asking for PhDs for jobs that are barely more than a regular BI job.


A skilled software engineer who is good at math could probably take some MOOCs, build a portfolio, and do well at data science interviews that emphasize coding. Many interviews just ask ISL-level questions (Introduction to Statistical Learning), which can be studied in a few months. On the other hand, it would be significantly more difficult for a new-to-coding statistician to become an excellent coder in a short time... although I've seen some people do it.


> who otherwise had never dealt with statistics before.

And why is statistics required? Let's face it: most companies that need "Data Scientists" are looking for regular BI guys with fancy terms. Most of the problems are solvable using out-of-the-box functionality in Python/Keras etc. Sure, there are places and problems which require hard mathematics and stats, but those are few and far between.


This is true in my experience. Data Scientists seem to run the gamut from "knows SQL" to "has a PhD in Behavioral Psych and spent four years getting scientific results published in peer-reviewed journals".

The company that I work for has changed role titles from Data Analyst to Data Scientist specifically because people who know SQL, but don't program, won't apply to/accept jobs without that title.


Because data science and machine learning are applied statistics, and if you don't understand how it works under the hood (not necessarily a very deep understanding; sometimes just a broad understanding is enough), you will have trouble adjusting things, debugging edge cases, or simply not know why something works and something else doesn't.

(edited for slightly better clarity)


Most companies that think they need a team of data scientists just need a SELECT with a WHERE clause and maybe GROUP BY.
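As a sketch of what that often looks like in practice, here is the whole "data science team" reduced to one query, using a hypothetical orders table and sqlite3 purely for illustration:

```python
import sqlite3

# Hypothetical data: a few orders with a region and an amount.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("EU", 10.0), ("EU", 5.0), ("US", 7.5)])

# SELECT with a WHERE clause and a GROUP BY: revenue per region.
rows = conn.execute("""
    SELECT region, SUM(amount)
    FROM orders
    WHERE amount > 1
    GROUP BY region
    ORDER BY region
""").fetchall()
# rows == [('EU', 15.0), ('US', 7.5)]
```

No model, no pipeline; often this really is the deliverable.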


Did you ever have the pleasure of working with a Six Sigma Black Belt who had at most one week of statistics training? I am one, and honestly that is just enough to do some back-of-the-napkin number crunching for purely operational purposes. That you need a math PhD is maybe exaggerating on the other end of the spectrum as well.

That being said, the ability to talk to domain experts and accept their experience is one of the most important skills for a true data scientist. Without proper context, all the data in the world gets you nowhere.


> statistics is a three year long grueling applied maths degree, and condensing it to three weeks is silly.

I agree. I was in a very prestigious organization, and they didn't know what a statistician really does; they just hired CS machine learning PhDs. Even those people don't know what a statistician does. One person gave me ISLR when I asked for advice on getting hired at this prestigious place (I did the equivalent of it over several graduate courses in a statistics program).

Another person proudly told me that in his project he was using a GLM, stating that he knew GLMs. I asked what the link function was, and that person said he didn't know; it's somewhere in the code...
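For readers wondering what's being asked here: in a GLM, the link function g maps the mean mu of the response to the linear predictor eta = g(mu); for logistic regression the canonical link is the logit. A minimal Python sketch of the logit pair (illustrative only, not from any library mentioned in the thread):

```python
import math

def logit(mu):
    # Link function: maps a mean in (0, 1) to the linear predictor eta.
    return math.log(mu / (1 - mu))

def inv_logit(eta):
    # Inverse link: maps the linear predictor back to a mean in (0, 1).
    return 1 / (1 + math.exp(-eta))
```

Knowing which link a fitted model uses is the difference between interpreting coefficients as log-odds versus something else entirely, which is why it was a fair question.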

I've since doubled down on statistics and will be going into the biostatistics field instead of data science. It feels like there are a lot of impostors in data science, especially in startups and government organizations. I have no clue why, but there is just this culture in the tech industry that made me leave it for a better field. I interned in the biostat field and it is much better; CNN and others have listed it as having a high quality of life.



