A systematic way of getting your code checked by others.
Code review is brilliant. It's an excellent learning experience, like peer review in science.
Agree the rules beforehand, e.g. PEP 8 for Python style.
Peer review with both data scientists and engineers if you can.
I was initially very reluctant about testing.
A couple of gateways into testing:
Use simple values, so the expected results are easy to work out by hand:
def f1score(y_true, y_pred):
    """F1 score is given by this formula.

    F1 = 2 * (precision * recall) / (precision + recall)
    """
    y_true = set(y_true)
    y_pred = set(y_pred)
    # Precision: fraction of predicted items that are actually true.
    precision = sum(1 for i in y_pred if i in y_true) / len(y_pred)
    # Recall: fraction of true items that were predicted.
    recall = sum(1 for i in y_true if i in y_pred) / len(y_true)
    if precision + recall == 0:
        return 0.0
    return (2 * precision * recall) / (precision + recall)
import pytest

assert f1score([1, 2, 3], [2, 3]) == 0.8
assert f1score(['None'], [2, 'None']) == pytest.approx(2 / 3)
assert f1score([4, 5, 6, 7], [2, 4, 8, 9]) == 0.25
assert f1score([1, 2, 3, 4], [1]) == 0.4
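From here it's a small step to a real test file. A sketch, assuming pytest and that f1score lives in a module (the module name metrics is made up):

import pytest

from metrics import f1score  # hypothetical module name

@pytest.mark.parametrize("y_true, y_pred, expected", [
    ([1, 2, 3], [2, 3], 0.8),
    (['None'], [2, 'None'], 2 / 3),
    ([4, 5, 6, 7], [2, 4, 8, 9], 0.25),
    ([1, 2, 3, 4], [1], 0.4),
])
def test_f1score(y_true, y_pred, expected):
    # approx() keeps the checks robust to floating-point rounding.
    assert f1score(y_true, y_pred) == pytest.approx(expected)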
For example, don't unit test sklearn modules.
Only test your own logic, as that's the most likely place for problems to occur.
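To illustrate (this function is made up for the example, not from the talk): test the glue code you wrote yourself, and trust the library to have tested its own estimators.

def build_features(record):
    # Your own logic: the most likely place for bugs.
    return [record["age"], int(record["is_returning"]), len(record["items"])]

def test_build_features():
    record = {"age": 30, "is_returning": True, "items": ["a", "b"]}
    assert build_features(record) == [30, 1, 2]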
The best way to get high-quality data.
A very underrated task. I used to complain about the data not being right; invest in this instead.
Star and snowflake schemas. Normalised and denormalised data.
You have the knowledge to do this. If you write SQL then you understand this.
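For example, a star schema is just a central fact table joined to small dimension tables. A toy sketch (table names made up, sqlite3 used so it runs anywhere):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one row per product.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact table: one row per sale, keyed to the dimensions.
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 4.5), (2, 20.0);
""")
# The classic star-schema query: join facts to a dimension, then aggregate.
rows = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)  # [('gadget', 20.0), ('widget', 14.5)]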
Probably the biggest gain for the lowest effort: productionise your pipelines.
Many ETL technologies use Python already.
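Apache Airflow is one well-known example. A minimal sketch (the DAG id and task functions are placeholders, not a recommended pipeline):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting")  # placeholder: pull raw data from your source

def transform():
    print("transforming")  # placeholder: clean and reshape the data

with DAG(
    dag_id="example_etl",  # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # run extract before transform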
Know the AWS tools available to you.
I'm sure the rest are very good too; I like BigQuery.
First, the concept of concept drift.
You have a nice model, trained well, evaluating correctly.
Then the environment changes.
One way to avoid concept drift is to retrain models regularly, before they get stale.
You want retraining to be a low-cost operation.
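You can also watch for drift directly. A minimal sketch of one approach (my illustration, not from the talk): compare live feature values against the training distribution with a two-sample Kolmogorov-Smirnov test.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5_000)  # feature at training time
live_feature = rng.normal(0.5, 1.0, size=5_000)   # the environment has changed

# A small p-value suggests the live data no longer matches the training data.
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # illustrative threshold
    print("possible drift: consider retraining")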
Stealing ideas from continuous deployment/delivery
The whole pipeline should be completely repeatable.
Automate deployment, retraining, and evaluation.
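A sketch of what an automated retrain-and-evaluate gate might look like (the dataset and threshold are illustrative; this is one possible policy, not the only one):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Retrain from scratch: the whole pipeline is repeatable.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
model.fit(X_train, y_train)

# Evaluate, and only promote the new model if it clears a quality bar.
accuracy = model.score(X_test, y_test)
if accuracy >= 0.95:  # illustrative threshold
    print(f"deploying new model (accuracy={accuracy:.3f})")
else:
    print(f"keeping previous model (accuracy={accuracy:.3f})")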
Grafana dashboards aren't just for software engineers!
This idea scales well: you can log your model's metrics too, as sketched below.
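A sketch of a service exposing a model metric for a dashboard to scrape (uses prometheus_client; the metric name and port are made up):

import random
import time

from prometheus_client import Gauge, start_http_server

# Made-up metric name for illustration.
prediction_score = Gauge("model_prediction_score", "Latest model prediction score")

start_http_server(8000)  # Prometheus scrapes :8000; Grafana plots the series
while True:
    prediction_score.set(random.random())  # stand-in for a real prediction
    time.sleep(5)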
Each job is a container.
Each container should do one thing and do it well. Connect them together like Lego.
Components can be parts of data pipeline, presentation of results, or parts of the model.
Excellent talk from Ravelin, who do fraud detection. I love this idea.
They have a library of model components.
Ensemble them together to make classifiers.
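This isn't Ravelin's actual code, but sklearn's VotingClassifier gives a flavour of composing reusable components into a single classifier:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)

# Reusable components, ensembled into one classifier.
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1_000)),
    ("rf", RandomForestClassifier(random_state=0)),
    ("nb", GaussianNB()),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))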
Data science is a very young field. We can learn from experienced programmers.
Find an experienced programmer and see what they get grumpy about.
They've seen these problems before. These things have names!
I read this list and recognised things I had got wrong in my own work.
Data scientists love notebooks (Jupyter).
A great experience is watching an engineer hate them
An even better experience is watching an engineer say: "They're a great tool for collaboration and experimentation, but I wouldn't use them for much else."
Being data aware. Analysis and stats.
Dealing with uncertainty.
Knowing about the domain applications of our work.
If a data scientist who knows engineering is awesome, then an engineer who knows data science is too. Together we build super cool data-driven products that scale.
print("Many thanks")