A systematic way of getting your code checked by others.
Code review is brilliant. It's an excellent learning experience, like peer review in science.
Agree the rules beforehand, e.g. PEP 8 for Python style.
Peer review with both data scientists and engineers if you can.
I was initially very reluctant about testing.
A couple of gateways into testing:
Use simple values, so the expected results are easy to work out by hand:
def f1score(y_true, y_pred):
    """F1 score is given by this formula.

    F1 = 2 * (precision * recall) / (precision + recall)
    """
    y_true = set(y_true)
    y_pred = set(y_pred)
    # Precision: fraction of predicted items that are actually true.
    precision = sum(1 for i in y_pred if i in y_true) / len(y_pred)
    # Recall: fraction of true items that were predicted.
    recall = sum(1 for i in y_true if i in y_pred) / len(y_true)
    if precision + recall == 0:
        return 0.0
    return (2 * precision * recall) / (precision + recall)
import pytest

assert f1score([1, 2, 3], [2, 3]) == 0.8
assert f1score(['None'], [2, 'None']) == pytest.approx(2 / 3)
assert f1score([4, 5, 6, 7], [2, 4, 8, 9]) == 0.25
assert f1score([1, 2, 3, 4], [1]) == 0.4
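From here it's a small step to a real test file. A sketch, assuming pytest and that f1score lives in a module (the module name metrics is made up):

import pytest

from metrics import f1score  # hypothetical module name

@pytest.mark.parametrize("y_true, y_pred, expected", [
    ([1, 2, 3], [2, 3], 0.8),
    (['None'], [2, 'None'], 2 / 3),
    ([4, 5, 6, 7], [2, 4, 8, 9], 0.25),
    ([1, 2, 3, 4], [1], 0.4),
])
def test_f1score(y_true, y_pred, expected):
    # approx() keeps the checks robust to floating-point rounding.
    assert f1score(y_true, y_pred) == pytest.approx(expected)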
For example, don't unit test sklearn modules.
Only test your own logic, as that's the most likely place for problems to occur.
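To illustrate (this function is made up for the example, not from the talk): test the glue code you wrote yourself, and trust the library to have tested its own estimators.

def build_features(record):
    # Your own logic: the most likely place for bugs.
    return [record["age"], int(record["is_returning"]), len(record["items"])]

def test_build_features():
    record = {"age": 30, "is_returning": True, "items": ["a", "b"]}
    assert build_features(record) == [30, 1, 2]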
The best way to get high-quality data.
A very underrated task. I used to complain about the data not being right; invest in this instead.
Star and snowflake schemas. Normalised and denormalised data.
You have the knowledge to do this. If you write SQL then you understand this.
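For example, a star schema is just a central fact table joined to small dimension tables. A toy sketch (table names made up, sqlite3 used so it runs anywhere):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one row per product.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact table: one row per sale, keyed to the dimensions.
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 4.5), (2, 20.0);
""")
# The classic star-schema query: join facts to a dimension, then aggregate.
rows = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)  # [('gadget', 20.0), ('widget', 14.5)]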
Probably the biggest gain for the lowest effort: productionise your pipelines.
Many ETL technologies use Python already.
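Apache Airflow is one well-known example. A minimal sketch (the DAG id and task functions are placeholders, not a recommended pipeline):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting")  # placeholder: pull raw data from your source

def transform():
    print("transforming")  # placeholder: clean and reshape the data

with DAG(
    dag_id="example_etl",  # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # run extract before transform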
Know the AWS tools available to you.
I'm sure the rest are very good too; I like BigQuery.
First, the concept of concept drift.
You have a nice model, trained well, evaluating correctly.
Then the environment changes.
One way to avoid concept drift is to retrain models regularly, before they get stale.
You want retraining to be a low-cost operation.
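You can also watch for drift directly. A minimal sketch of one approach (my illustration, not from the talk): compare live feature values against the training distribution with a two-sample Kolmogorov-Smirnov test.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5_000)  # feature at training time
live_feature = rng.normal(0.5, 1.0, size=5_000)   # the environment has changed

# A small p-value suggests the live data no longer matches the training data.
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # illustrative threshold
    print("possible drift: consider retraining")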
Stealing ideas from continuous deployment/delivery
The whole pipeline should be completely repeatable.
Automate deployment, retraining, and evaluation.
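A sketch of what an automated retrain-and-evaluate gate might look like (the dataset and threshold are illustrative; this is one possible policy, not the only one):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Retrain from scratch: the whole pipeline is repeatable.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
model.fit(X_train, y_train)

# Evaluate, and only promote the new model if it clears a quality bar.
accuracy = model.score(X_test, y_test)
if accuracy >= 0.95:  # illustrative threshold
    print(f"deploying new model (accuracy={accuracy:.3f})")
else:
    print(f"keeping previous model (accuracy={accuracy:.3f})")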
Grafana dashboards aren't just for software engineers!
This idea scales well: you can log your model's metrics too, as sketched below.
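A sketch of a service exposing a model metric for a dashboard to scrape (uses prometheus_client; the metric name and port are made up):

import random
import time

from prometheus_client import Gauge, start_http_server

# Made-up metric name for illustration.
prediction_score = Gauge("model_prediction_score", "Latest model prediction score")

start_http_server(8000)  # Prometheus scrapes :8000; Grafana plots the series
while True:
    prediction_score.set(random.random())  # stand-in for a real prediction
    time.sleep(5)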
Each job is a container.
Each container should do one thing and do it well. Connect them together like Lego.
Components can be parts of data pipeline, presentation of results, or parts of the model.
Excellent talk from Ravelin, who do fraud detection. I love this idea.
They have a library of model components.
Ensemble them together to make classifiers.
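This isn't Ravelin's actual code, but sklearn's VotingClassifier gives a flavour of composing reusable components into a single classifier:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)

# Reusable components, ensembled into one classifier.
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1_000)),
    ("rf", RandomForestClassifier(random_state=0)),
    ("nb", GaussianNB()),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))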
Data science is a very young field. We can learn from experienced programmers.
Find an experienced programmer and see what they get grumpy about.
They've seen these problems before. These things have names!
I read this list and recognised things I had got wrong in my own work.
Data scientists love notebooks (Jupyter).
A great experience is watching an engineer hate them
An even better experience is watching an engineer say: "They're a great tool for collaboration and experimentation, but I wouldn't use them for much else."
Being data aware. Analysis and stats.
Dealing with uncertainty.
Knowing about the domain applications of our work.
If a data scientist who knows engineering is awesome, then an engineer who knows data science is too. Together we build super cool data-driven products that scale.
print("Many thanks")