Doing research with Jupyter Notebooks
Georgi Karadzhov
@G_Karadzhov
https://gkaradzhov.com
What is this all about
● Writing code is not reserved for computer scientists
● Researchers in all fields write software daily
● Researchers leverage interactive computing environments, such as Jupyter and RMarkdown
● In pursuit of “open science”, we share our code/data/models hoping our research can be used as a stepping stone for further advances
Yet poor quality code is a common occurrence
We share everything, yet nothing is reproducible
We strive for rapid prototyping, but sometimes we spend days finding bugs
in our code
About me
● Professional software developer for the past 6 years
● Researcher in the field of natural language processing, focusing on fact-checking and rumour detection
● Heavy notebook user
About me
● Data Scientist @ SiteGround Hosting
Outline
● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code
Let’s address the elephant in the room
It’s OK if you don’t use or like Python! This talk is *mostly* about software
engineering.
Code examples will be in Python and Jupyter, but the concepts are
transferable.
Outline
● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code
Automation of processes
● Preprocessing data
● Data collection
● Calculating result measures
Data analysis
● Calculating correlations in data
● Finding outliers
● Time-series analysis
Data modeling
● Predictive modeling
● Machine learning
● Knowledge extraction
What tools do we need
Fast prototyping
Easy-to-write
Reproducible
What else?
Outline
● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code
Why do I like Python?
1. Easy-to-learn
2. Easy-to-read
3. Batteries included + Third Party Modules
   a. NumPy, SciPy, Pandas, Matplotlib, Seaborn
   b. Sklearn, StatsModels
   c. Tensorflow, Keras, PyTorch
The Zen of Python, by Tim Peters
>>> import this
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Less time for development and more time for experiments!
● Most of the code we write is boilerplate that is tedious to write and easy to mess up
● With Python we try to minimise the code we write, but maximise what it does
● The unexpected effectiveness of Python in science
Java:
class hello {
    public static void main(String[] args) {
        System.out.println("Hello World");
    }
}

C:
#include <stdio.h>

int main(void) {
    printf("Hello World");
    return 0;
}

Python:
print("Hello World")
Jupyter notebooks
● Interactive Python environment
● Open source
● Large community
● Jupyter Lab in development
Jupyter notebooks
● Client-server architecture
● Doesn’t require any installation to use if the server is hosted elsewhere (or if using Google Colab or similar services)
1_jupyter_intro.ipynb
Writing research code
● Exploratory code
● Data processing/modeling pipelines
● Data visualisation
2_research_python.ipynb
Outline
● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code
“What if my code is a bit messy?”
“I just play around with my data here, my code is not supposed to be perfect.”
3_buggy_notebook.ipynb
What if I have bugs in my code?
Preprocessing data
● Inconsistent preprocessing
● Missing data
● Duplicate data
Data collection
● Implicit selection bias
● Wrong data collected
● Slow and inefficient
Modeling pipelines
● Unexpected errors
● Slow research (we don’t want that)
● Suboptimal results
● Too-good-to-be-true results
Calculating result measures
● Inadequate evaluation
● Wrong or misleading results
All this leads to:
● Loss of productivity and time
● Debugging code is frustrating
Finding out that your results are wrong 2 hours before the paper deadline is even more frustrating
Outline
● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code
Fix it now
● Probably the cheesiest advice you will ever receive
● BUT: 5 minutes now typically saves 30 minutes later
Write unit tests
● Traditionally used by software developers
● Ensure that functions have the desired behaviour
● If used properly, they can identify bugs early
● If your code requires a modification, you can verify that no additional bugs are introduced
4_unittests.ipynb
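A minimal sketch of what such a test can look like, using Python's built-in unittest module. The function normalise_tokens is a hypothetical preprocessing helper invented for this illustration; it is not taken from the talk's notebooks.

```python
import unittest


def normalise_tokens(text):
    """Hypothetical preprocessing step: lowercase the text and split it on whitespace."""
    return text.lower().split()


class TestNormaliseTokens(unittest.TestCase):
    def test_lowercases_and_splits(self):
        self.assertEqual(normalise_tokens("Fact Checking"), ["fact", "checking"])

    def test_empty_string_gives_no_tokens(self):
        self.assertEqual(normalise_tokens(""), [])

    def test_extra_whitespace_is_ignored(self):
        self.assertEqual(normalise_tokens("  rumour   detection "), ["rumour", "detection"])


def run_tests():
    """Run the tests in-process; this form also works inside a notebook cell."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormaliseTokens)
    return unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    run_tests()
```

If normalise_tokens is later changed, rerunning run_tests() shows whether the old behaviour still holds.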
Sanity checks in the pipeline
● Print before and after processing
● Check your data at each step
5_sanity_checks.ipynb
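One possible shape for these checks, as a sketch: print the size of the data before and after a step, then assert the properties the step is supposed to guarantee. The cleaning function and the toy records are assumptions made up for the example.

```python
def drop_duplicates_and_missing(records):
    """Hypothetical cleaning step: drop records with missing text and exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        text = record.get("text")
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(record)
    return cleaned


# Toy data, invented for the illustration.
raw = [
    {"text": "claim A", "label": 1},
    {"text": "claim A", "label": 1},  # duplicate
    {"text": None, "label": 0},       # missing text
    {"text": "claim B", "label": 0},
]

print("before:", len(raw), "records")
cleaned = drop_duplicates_and_missing(raw)
print("after: ", len(cleaned), "records")

# Sanity checks: fail loudly instead of passing bad data downstream.
assert 0 < len(cleaned) <= len(raw), "cleaning removed everything or added rows"
assert all(r["text"] for r in cleaned), "missing text survived cleaning"
assert len({r["text"] for r in cleaned}) == len(cleaned), "duplicates survived cleaning"
```

The assertions cost one line each and stop the notebook at the step where the data first goes wrong.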
Validate the output of the notebook
● https://github.com/computationalmodelling/nbval
● Install with:
pip install nbval
● Execute with:
py.test --nbval 3_buggy_notebook.ipynb
If you copy-paste similar code within a notebook more than 3 times, extract it into a function
If you reuse similar code in more than 3 notebooks, extract it into an external file
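As an illustration of the first rule, a metric calculation that might otherwise be pasted after every experiment can live in one function. The helper below and its toy labels are hypothetical.

```python
def precision_recall_f1(true_labels, predicted_labels, positive=1):
    """Hypothetical helper: precision, recall and F1 for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Every experiment now calls the same function instead of a pasted copy.
baseline = precision_recall_f1([1, 0, 1, 1], [1, 0, 0, 1])
new_model = precision_recall_f1([1, 0, 1, 1], [1, 1, 1, 1])
print("baseline:", baseline)
print("new model:", new_model)
```

For the second rule, the same function could be moved into a plain .py file next to the notebooks (for example a hypothetical metrics.py) and imported from each of them.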
Use version control systems
● Keeping track of different code versions
● It will be easier to reproduce previous results
● Track changes
● Enables code sharing
● Easy to set up
● Hosting services have a free tier:
○ https://github.com/
○ https://bitbucket.org/product
Use version control systems
● The learning curve for Git (or another version control system) is steep, but it pays off
● There are a lot of good tutorials online:
○ https://backlog.com/git-tutorial/what-is-git/
○ https://try.github.io/
Code Reviews
● Once you have made a significant change to your code, ask someone to review it
● It may be your friend, colleague or supervisor
● If your code is not proprietary, you can ask someone from outside your lab for help (Reddit/Stack Overflow)
● You can go one step further and contribute to open-source projects on GitHub
Use virtual environments
● Self-contained
● Can execute different projects with different versions of your packages
● Easy to create and use:
python3 -m venv name_of_your_venv
source name_of_your_venv/bin/activate
Save requirements.txt
● One line of code:
pip freeze > requirements.txt
Save requirements.txt
appnope==0.1.0
attrs==19.1.0
backcall==0.1.0
bleach==3.1.0
certifi==2019.3.9
chardet==3.0.4
cycler==0.10.0
decorator==4.3.2
defusedxml==0.5.0
entrypoints==0.3
idna==2.8
ipykernel==5.1.0
ipympl==0.2.1
ipython==7.3.0
ipython-genutils==0.2.0
ipywidgets==7.4.2
jedi==0.13.3
Jinja2==2.10
jsonschema==3.0.1
jupyter==1.0.0
jupyter-client==5.2.4
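A saved requirements.txt can also be checked against the current environment. The sketch below uses the standard library's importlib.metadata (Python 3.8+). It handles only simple name==version lines like the ones pip freeze produces above, and the function names are made up for the illustration.

```python
from importlib import metadata


def parse_pinned(lines):
    """Parse 'name==version' lines; skip blanks, comments and unpinned entries."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Two entries from the listing above, plus lines the parser should skip.
example = ["ipython==7.3.0", "Jinja2==2.10", "", "# a comment"]
pins = parse_pinned(example)
for name, (pinned, installed) in find_mismatches(pins).items():
    print(f"{name}: pinned {pinned}, installed {installed}")
```

To check a real file, pass its lines instead of the example list, for instance open("requirements.txt").read().splitlines().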
Use a rich file directory structure
Before:
After:
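The layout itself is a matter of taste. As a sketch, a small script can create one consistently for every new project. The folder names below are assumptions chosen for the example, not the ones shown in the talk.

```python
import tempfile
from pathlib import Path

# Hypothetical layout; the folder names are assumptions, not the ones from the talk.
LAYOUT = [
    "data/raw",
    "data/processed",
    "notebooks",
    "src",
    "tests",
    "results",
]


def create_layout(root):
    """Create the project folders under root and return the created paths."""
    root = Path(root)
    created_paths = []
    for relative in LAYOUT:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        created_paths.append(folder)
    return created_paths


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = [p.relative_to(tmp).as_posix() for p in create_layout(tmp)]

print(created)
```

Separating raw from processed data and notebooks from reusable source files makes it easier to see what can be regenerated and what cannot.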
Write the docs
● Write README files
● Write system descriptions
Save your code revision (or notebook) + requirements.txt + data + README and
archive it.
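The archiving step can be scripted with the standard library's shutil.make_archive. The sketch below zips a throwaway folder filled with placeholder files; all file and folder names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def archive_experiment(project_dir, archive_base):
    """Zip the whole project folder; returns the path of the created .zip file."""
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(project_dir))


# Demo on a throwaway folder with placeholder files (all names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "experiment"
    project.mkdir()
    for name in ["analysis.ipynb", "requirements.txt", "README.md", "data.csv"]:
        (project / name).write_text("placeholder\n")
    archive_path = Path(archive_experiment(project, Path(tmp) / "experiment_v1"))
    archive_name = archive_path.name
    archive_exists = archive_path.exists()

print(archive_name, archive_exists)
```

In real use, project_dir would be the folder holding the notebook, requirements.txt, data and README, and the archive would be written somewhere permanent.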
Share your code with others
● Open science and whatnot
● High accountability
● Empirically proven: code that is shared has fewer bugs
Fixing issues in our code is hard. We do it anyway!
Questions?
Slides and additional information:
https://gkaradzhov.com/research-code-with-jupyter-notebooks/
Ask now or ping me later at:
@G_Karadzhov or [email protected]