What can data scientists learn from engineers?

Or: Things that I am still bad at, but trying to improve

Ewan Nicolson

dataewan.github.io

Why engineering skills?

  • Self-determination, more impact
  • Handovers and teamwork
  • Long-term payoff
  • Mastery
  • You may be smarter, and possibly less likely to embarrass yourself

In roughly the order I was convinced of their value

Code review

A systematic way of getting your code checked by others.

Code review is brilliant. Excellent learning experience. Like peer review in science.

Pointers

Agree on the rules beforehand, for example PEP 8.

Peer review with both data scientists and engineers if you can.

Testing

I was initially very reluctant about testing.

  • What if the answer is complicated, and I need to write a program to find it?
  • What if it is a process that is non-deterministic like a random forest?
  • Isn't this a lot of overhead?

A couple of gateways into testing:

These are the same tests you do on paper

Use simple values

In [2]:
def f1score(y_true, y_pred):
    """F1 score is given by this formula.

    F1 = 2 * (precision * recall) / (precision + recall)
    """
    y_true = set(y_true)
    y_pred = set(y_pred)

    precision = sum([1 for i in y_pred if i in y_true]) / len(y_pred)
    recall = sum([1 for i in y_true if i in y_pred]) / len(y_true)
    
    if precision + recall == 0:
        return 0.0
    else:
        return (2 * precision * recall) / (precision + recall)
In [3]:
import pytest

assert f1score([1, 2, 3], [2, 3]) == 0.8
assert f1score(['None'], [2, 'None']) == pytest.approx(2/3)
assert f1score([4, 5, 6, 7], [2, 4, 8, 9]) == 0.25
assert f1score([1, 2, 3, 4], [1]) == 0.4
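The non-determinism objection above also has an answer: fix the random seed and the process becomes repeatable, or test a statistical property rather than an exact value. A toy sketch (noisy_mean is a made-up stand-in for any stochastic step, like training a random forest):

```python
import random

def noisy_mean(values, n_samples, seed=None):
    """Estimate a mean by random subsampling -- a stand-in for any
    non-deterministic step, like training a random forest."""
    rng = random.Random(seed)
    sample = [rng.choice(values) for _ in range(n_samples)]
    return sum(sample) / n_samples

# With a fixed seed the "non-deterministic" function is repeatable,
# so it can be tested like any deterministic one.
assert noisy_mean([1, 2, 3, 4], 1000, seed=42) == noisy_mean([1, 2, 3, 4], 1000, seed=42)

# Without a seed, test a statistical property rather than an exact value.
estimate = noisy_mean([1, 2, 3, 4], 10000)
assert 2.0 < estimate < 3.0  # true mean is 2.5
```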

Mock anything that has already been tested

For example, don't unit test sklearn modules.

Only test your own logic, as that's the most likely place for problems to occur.
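As a sketch of the mocking idea, here the model is a unittest.mock.Mock with a simplified, sklearn-flavoured predict_proba (returning plain scores rather than sklearn's real 2-D array), so the test exercises only our own thresholding logic. The label_transactions function is invented for illustration:

```python
from unittest.mock import Mock

def label_transactions(transactions, model):
    """Our own logic: threshold model scores into labels."""
    scores = model.predict_proba(transactions)
    return ["fraud" if s > 0.5 else "ok" for s in scores]

# Mock the (already tested) model rather than fitting a real one,
# so the test covers only the thresholding above.
model = Mock()
model.predict_proba.return_value = [0.9, 0.1, 0.6]

assert label_transactions(["a", "b", "c"], model) == ["fraud", "ok", "fraud"]
model.predict_proba.assert_called_once_with(["a", "b", "c"])
```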

Data engineering

The best way to get high-quality data

A very underrated task. I used to complain when the data wasn't right; invest in data engineering instead.

Don't get put off by nomenclature

Star, snowflake schemas. Normalised, denormalised.

You have the knowledge to do this. If you write SQL then you understand this.

Understand how ETL works

Probably the biggest gain for the lowest effort. Productionise your pipelines.

Many ETL technologies already use Python.
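A minimal sketch of the ETL shape in plain Python (the functions and data are made up; tools like Airflow or Luigi structure the same extract, transform, load steps as scheduled, dependent tasks):

```python
def extract():
    # Stand-in for pulling raw rows from a source system.
    return [{"user": "a", "spend": "10.5"}, {"user": "b", "spend": "3.2"}]

def transform(rows):
    # Clean types and derive fields; most of our own logic lives here,
    # which also makes this the step most worth testing.
    return [{"user": r["user"], "spend": float(r["spend"])} for r in rows]

def load(rows, warehouse):
    # Stand-in for writing cleaned rows to the warehouse.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
assert warehouse[0]["spend"] == 10.5
```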

AWS

Have knowledge of the AWS tools available to you.

I'm sure the rest are very good too. I like BigQuery.

Deployment

First, the concept of concept drift.

You have a nice model, trained well, evaluating correctly.

Then the environment changes.

One way to avoid concept drift is to retrain models regularly, before they get stale.

You want this to be a low-cost operation.

Stealing ideas from continuous deployment/delivery

Repeatable, automated pipelines

The whole pipeline should be completely repeatable.

Automate deployment, retraining, evaluation.
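A toy sketch of what an automated retrain-evaluate-deploy loop might look like (all names are hypothetical; the "model" is just a mean, standing in for a real training step):

```python
def retrain(data):
    # Stand-in for model fitting: a constant model predicting the mean.
    return {"prediction": sum(data) / len(data)}

def evaluate(model, holdout):
    # Mean absolute error of the constant model on held-out data.
    return sum(abs(model["prediction"] - y) for y in holdout) / len(holdout)

def deploy_if_better(model, score, best_score, registry):
    # Only promote a retrained model that beats the current live one.
    if score < best_score:
        registry["live"] = model
        return score
    return best_score

registry = {}
best = float("inf")
for batch in ([1, 2, 3], [2, 3, 4]):  # simulated regular retraining runs
    model = retrain(batch)
    score = evaluate(model, [2, 3])
    best = deploy_if_better(model, score, best, registry)
```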

Logging performance

(screenshot: Grafana dashboard)

These dashboards are not just for software engineers!

This idea scales well. You can also log:

  • Training metrics like AUC, RMSE
  • Feature importances; what are the most relevant predictors in the model
  • Performance on evaluation data
  • Distribution of predictions
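One low-tech way to get numbers like these onto a dashboard is to emit them as structured log lines that a tool such as Grafana (via its usual data sources) can plot over time. A sketch, with a made-up log_metrics helper:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
metrics_log = logging.getLogger("model_metrics")

def log_metrics(run_id, **metrics):
    # Emit one structured JSON line per training run; anything that can
    # read these logs can then plot the metrics over time.
    record = {"run": run_id, "timestamp": time.time(), **metrics}
    metrics_log.info(json.dumps(record))
    return record

log_metrics("weekly-retrain", auc=0.91, rmse=1.7, prediction_mean=0.43)
```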

Microservices

Each job is a container.

Each container should do one thing and do it well. Connect them together like Lego.

Components can be parts of data pipeline, presentation of results, or parts of the model.

Excellent talk from Ravelin, who do fraud detection. I love this idea.

They have a library of model components.

Ensemble them together to make classifiers.
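A toy sketch of the component idea (the rules and weights are invented for illustration, not Ravelin's actual system): each component is a small scoring function that could live in its own container, and the ensemble combines them into one classifier:

```python
def amount_rule(transaction):
    # Component 1: flag unusually large transactions.
    return 1.0 if transaction["amount"] > 100 else 0.0

def velocity_rule(transaction):
    # Component 2: flag bursts of transactions in a short window.
    return 1.0 if transaction["txns_last_hour"] > 5 else 0.0

def ensemble(components, weights):
    # Combine small, independently deployable scoring components
    # into one classifier by weighted averaging.
    def score(transaction):
        return sum(w * c(transaction) for c, w in zip(components, weights))
    return score

fraud_score = ensemble([amount_rule, velocity_rule], [0.6, 0.4])
assert fraud_score({"amount": 250, "txns_last_hour": 9}) == 1.0
assert fraud_score({"amount": 50, "txns_last_hour": 0}) == 0.0
```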

Find what programmers get grumpy about

Data science is a very young field. We can learn from experienced programmers.

Find an experienced programmer and see what they get grumpy about.

Code smells

They've seen these before. These things have names!

  • Long Parameter List
  • Don't repeat yourself!
  • Conditional Complexity
  • Speculative Generality
  • Shotgun Surgery

I read this list and I recognised things that I got wrong in my own work.
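For example, the Long Parameter List smell and one common fix, grouping related arguments into a config object (a generic illustration, not from the talk):

```python
from dataclasses import dataclass

# Smell: Long Parameter List -- easy to pass arguments in the wrong order.
def train_model_smelly(data, lr, depth, n_trees, subsample, seed, verbose):
    ...

# One common fix: group related parameters into a config object with
# sensible defaults, so call sites stay readable.
@dataclass
class TrainConfig:
    lr: float = 0.1
    depth: int = 6
    n_trees: int = 100
    subsample: float = 1.0
    seed: int = 0
    verbose: bool = False

def train_model(data, config: TrainConfig):
    ...

train_model([1, 2, 3], TrainConfig(lr=0.05, depth=4))
```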

Getting away from notebooks

Data scientists love notebooks (Jupyter).

A great experience is watching an engineer hate them.

An even better experience is watching an engineer say: "they're a great tool for collaboration and experimentation, but I wouldn't use them for much else".

What can engineers learn from data scientists?

  • Being data aware. Analysis and stats.

  • Dealing with uncertainty.

  • Knowing about the domain applications of our work.

  • If a data scientist who knows engineering is awesome, then an engineer who knows data science is too. Together we can build super cool data-driven products that scale.

In [2]:
print("Many thanks")
Many thanks

Things that I wanted to cover, but didn't have time for

Agile

Version control

Debugger