Turbocharge Your Data Science Workbench
Use Web Components with Jupyter Notebooks
If you are a Data Scientist, you have almost certainly worked with a Jupyter Notebook. Notebooks are the simplest interface to Python, especially for Data Science analytics, and they have clearly become one of the most popular tools for Data Scientists. With continuous improvements, more features and an enhanced ability to ingest data from a variety of data sources, this tool is here to stay.
So, let’s dive in.
This is where you start: a blank Jupyter Notebook, ready for action. It is assumed that all necessary software has been installed, including additional packages such as Pandas, NumPy and Matplotlib.
Introduction to Data
For the purpose of demonstration, we will analyse flight delays from public sources of information. The United States Department of Transportation (www.bts.gov) has made its Airline On-Time Statistics available on its web site. Typically, this data is available within a month. We will analyse the flight delays for the month of July 2017 and aim to build something like the figure below.
There are good analyses and graphical representations on the site. However, as Data Scientists, we are always curious to do further analysis on top of the data, using our own algorithms and approaches. The starting point, however, is to match the published results first.
The raw data has one line per airline per airport. The file contains 21 fields, but for now we will focus on the two that are most relevant: the number of flights that landed, and the number of flights that were delayed.
Figure 3: Snapshot of raw data
Prepare for Analysis: Ingest Data into Memory
Python Pandas is the obvious choice to read the data and hold it ready in memory. Since we already have the data in Comma Separated Values (CSV) format, Pandas has ready-made support for it.
The code is rather simple. All it needs is one line of code to read the CSV file into memory. We will come to the use of the “ontime” function a little later. As soon as this cell is executed, a DataFrame representation is available as a variable named “df”.
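To illustrate the idea, here is a hedged sketch of that step. The sample rows are made up (the real BTS file has 21 columns), the column names are assumptions modelled on the BTS download, and the body of the “ontime” helper is my own guess, since the article does not show it:

```python
import io
import pandas as pd

# Made-up sample data in the general shape of the BTS on-time CSV.
# Column names (arr_flights, arr_del15) are illustrative assumptions.
csv_data = """carrier,airport,arr_flights,arr_del15
AA,ATL,1250,260
DL,ATL,11000,1900
UA,ORD,5600,1400
"""

# One line is all it takes to get the data into memory as a DataFrame.
# With a real file this would be: df = pd.read_csv("ontime_july_2017.csv")
df = pd.read_csv(io.StringIO(csv_data))

# A hypothetical "ontime" helper: percentage of flights that arrived on time.
def ontime(frame):
    frame = frame.copy()
    frame["pct_ontime"] = 100 * (1 - frame["arr_del15"] / frame["arr_flights"])
    return frame

print(ontime(df))
```

With the real download, only the `read_csv` argument changes; everything downstream works on the resulting `df` the same way.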
Build the re-usable Web Component
There are two parts to it, and that is the whole idea of bringing the two technologies together.
While we have used a Notebook cell to code the component, it is also possible to keep the entire code in an external JS file and reuse it across multiple notebooks.
The first parameter of the execute call can be a simple Python command, e.g. df.head(), or a function, e.g. ontime(df) as above.
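A minimal sketch of what such a component might look like, held in a Python string so it can be handed to `IPython.display.Javascript(...)` from a notebook cell. The element name `on-time-chart` and the `cmd` attribute are illustrative assumptions, not the article's actual source; `Jupyter.notebook.kernel.execute` is the classic-notebook JavaScript API for running Python from the browser:

```python
# JavaScript for the web component, kept as a Python string.
# In a notebook you would run: display(Javascript(component_js))
# Names below ("on-time-chart", "cmd") are hypothetical.
component_js = """
class OnTimeChart extends HTMLElement {
  static get observedAttributes() { return ['cmd']; }

  attributeChangedCallback(name, oldValue, newValue) {
    // The first parameter of execute can be a plain Python command,
    // e.g. "df.head()", or a function call such as "ontime(df)".
    Jupyter.notebook.kernel.execute(newValue, {
      iopub: {
        output: (msg) => { this.render(msg); }  // draw the kernel's reply
      }
    }, { silent: false });
  }

  render(msg) {
    // Crude rendering of whatever the kernel sent back.
    this.innerHTML = '<pre>' + JSON.stringify(msg.content) + '</pre>';
  }
}
customElements.define('on-time-chart', OnTimeChart);
"""
```

The key design point is the bridge: the component reacts to attribute changes in the browser, and each change triggers a round trip to the Python kernel.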
That’s it. This needs to be done only once. Now we are ready to use the component.
Please note that the code pieces above just illustrate the concepts. The entire working code can be downloaded from here.
Use the component in Jupyter Notebooks
And it all boils down to one line of code
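That one line is simply a custom HTML tag rendered in a cell. A sketch under the same assumption of a hypothetical element named `on-time-chart` (in a notebook, the string would be wrapped in `IPython.display.HTML`):

```python
# The "one line": an HTML tag that hands a Python expression to the
# component. Tag name and "cmd" attribute are illustrative assumptions.
# In a notebook cell: display(HTML(one_liner))
one_liner = '<on-time-chart cmd="ontime(df)"></on-time-chart>'
print(one_liner)
```

Everything else (fetching the data, re-running the query, redrawing the output) happens inside the component itself.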
From here on, you can apply your HTML creativity to build a highly interactive application, one that looks and behaves like a web application while still combining the power of Python and Jupyter Notebook to deliver a Data Science application.
We were able to build an interactive web application under Jupyter in which a user can select options from a dropdown menu and see the results immediately, without clicking any of the IPython buttons such as “Run”. Like other web applications that react to changes in a text input, we could connect the entered text to the query and deliver the results instantly.
Here are some snapshots from the final Notebook.
This blog is one more in a series of blogs related to Web Components. Even though the standards are still being finalized and implemented, the combination of Web Components with Jupyter Notebooks has tremendous potential in today’s Data Science development work.
References
- Custom Elements v1: Reusable Web Components by Eric Bidelman (https://developers.google.com/web/fundamentals/web-components/customelements)
- Airline On-Time Statistics (https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp)
- Install Python 3 (https://www.python.org), Jupyter (http://jupyter.org/), Pandas (http://pandas.pydata.org/) and Matplotlib (https://matplotlib.org/).