Wendelin Exanalytics Libre

Wendelin combines scikit-learn machine learning and NEO distributed storage for out-of-core data analytics in Python.

Overview

  • In this tutorial we will learn how to:
    • Install and configure Wendelin on a Linux machine.
    • Ingest data into Wendelin using Fluentd, an open source data collector (Fluentd Website).
    • Manipulate data in Wendelin using Jupyter Notebook (Jupyter Website).
    • Present data from Wendelin on a web site using the RenderJS and jIO JavaScript libraries.

1. Getting Wendelin on your Linux machine

Requirements

  • For this tutorial, you will need a machine running Debian with at least 4 GB of RAM and 20 GB of free disk space.
  • Optionally, you can set up a virtual machine with: VirtualBox.
  • You need root access.

Install Wendelin from terminal

  • Open terminal.
  • Switch to root with: sudo su , and install Wendelin with:
  • wget http://deploy.nexedi.cn/wendelin-standalone && bash wendelin-standalone

Monitor Installation Process

  • The first time you run the installer, you will probably get an error with the message: "Your software is still building, be patient it can take a while." Do not panic. This is expected: the installation is now running.
  • You can check that the installation is progressing by following the SlapOS software installation log:
  • tail -f /opt/slapos/log/slapos-node-software.log
  • Be patient. Installation can take up to several hours to complete.

Check Software Installation Status

  • Check the status of the Wendelin installation with the command erp5-show -s
  • Once the installation is done, the command returns the internal IPv4 address where you can access Wendelin, together with a username and password. Note them down.
  • Run bash wendelin-standalone again to make sure there are no errors.

Socat Bind

  • The next step is to bind the Wendelin services to the correct ports. Execute:
  • wget https://lab.nexedi.com/nexedi/wendelin/raw/master/utils/wendelin-standalone-bind.sh && bash wendelin-standalone-bind.sh
  • This will activate the URLs used to access Jupyter/ERP5.
  • Check that socat is running: ps aux | grep socat
  • If this does not work, try:
  • apt-get install socat
  • apt-get install net-tools
  • and run bash wendelin-standalone-bind.sh again.
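
Besides looking for the socat processes, you can test whether the forwarded ports accept TCP connections. The following is a minimal Python sketch, assuming the ports used in this tutorial (20001 for ERP5 and 20000 for Jupyter). The helper name port_is_open is illustrative and not part of Wendelin.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

For example, port_is_open("10.0.2.15", 20001) should return True once the bind script has run, where 10.0.2.15 stands for the internal IPv4 reported by erp5-show -s.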

Login

Wendelin-ERP5 Login
  • Now you can start using Wendelin services with your preferred web browser.
  • Access link: http://internal IPv4:20001
  • Click on Zope management interface.
  • Log in using your username and password.
  • Go to: http://internal IPv4:20001/erp5
  • For example, if the internal IPv4 is 10.0.2.15:
  • http://10.0.2.15:20001
  • Here we use the internal IPv4 because the Wendelin services run on this machine.
  • To use the services from outside this machine, we would use the external IPv6 of the machine.

Configure Site

  • Go to: My Favorites > Configure your Site .
  • Your ERP5 instance is now ready, but it has only a very generic core library. This core can be specialized for different purposes through business templates.
  • This can be done manually, by installing the individual business templates that provide your ERP5 instance with the utilities you need.
  • That can be done in My Favorites > Manage Business Templates
  • Wendelin uses many different business templates, so installing them by hand would be a lot of work. A configurator is provided that does this for you and installs just the business templates that Wendelin needs.

Install Configuration

  • Select Wendelin Configuration.
  • If there is no Wendelin Configurator, you have to install it:
    • Go to My Favourites > Manage Business Templates .
    • Click on the Import/Export button (blue and red arrows).
    • Select Exchange > Install Business Templates from Repositories .
    • Check the box in front of the erp5_wendelin_configurator business template and click Install Business Templates from Repositories.
  • Configuration will take some minutes.

Install Configuration

  • On the next screen, make sure that the boxes "set wendelin tutorial" and "set up Data Notebook module for Jupyter integration" are checked.
  • Click Set of data Notebook module and on the next screen Install .

Wait

  • Installation may take a few minutes.
  • Once done, click "Start using your ERP5 System"
  • Login again.

Main Interface

Wendelin ERP5 - All set
  • Your instance is now ready to use
  • The start screen shows a list of directly accessible modules (data types).
  • You can also access them through the Modules selection tab at the top of the screen.
  • You probably noticed that there are many more of them after configuration.
  • Modules can contain anything from Persons and Organizations to Data Streams.
  • Modules prefixed with Portal are not displayed (e.g. portal ingestion policies)

2. Simulate Sensor and ingest data via Fluentd

Create Ingestion Policy

Wendelin-ERP5 Ingestion Policies
  • Now we will set up Wendelin to receive data from Fluentd.
  • Go to: http://internal IPv4:20001/erp5/portal_ingestion_policies/
  • An ingestion policy is a security setting that prevents arbitrary streams from being sent.
  • Currently, Fluentd and Wendelin are set up to handle streams of data.
  • A data stream is a file, created by the ingestion policy, to which data is continually appended.

Fast Input

Wendelin-ERP5 Create Ingestion Policy
  • Hit the "Green Runner" to create a new Ingestion Policy.
  • Enter "pydata" as reference name and "Pydata" as title and click Create Ingestion Policy .
  • This creates a new ingestion policy.
  • The ingestion policy includes default scripts to be called on incoming data.
  • If you want to modify data handling, you could now write your own script.

Wendelin in Production

Wendelin | Windpark
  • The first Wendelin prototype in production was used to monitor wind turbines.
  • Wendelin is used to collect large amounts of data and manage wind parks.
  • This made it possible to use machine learning for failure prediction and structural health monitoring.

Wendelin in Production

Wendelin | Wind Turbine
  • Each wind turbine is equipped with multiple sensors per blade and a PC collecting the sensor data.
  • Each PC runs fluentd to ingest the data into Wendelin (network availability!)

Basic FluentD

Wendelin | Basic FluentD Flowchart
  • FluentD is an open source unified data collector
  • It has a source and a destination, is generic, and is easy to extend
  • Can handle data collection even under poor network conditions
  • More info: FluentD Website

Complex FluentD

Wendelin | Complex FluentD Flowchart
  • Setup of FluentD can be much more complex.
  • In Wendelin production, every turbine has its own FluentD.

Record Audio

Wendelin-ERP5 Audio Recorder

Install fluentD

  • For this part of the tutorial you will need Fluentd (Fluentd documentation page):
  • Open a terminal on your machine and enter:
  • sudo apt-get install ruby ruby-dev
  • sudo gem install fluentd -v 0.14.14 --no-ri --no-rdoc
  • Fluentd is written in Ruby, so you need to install the Ruby interpreter.
  • In addition, you need the ruby-dev package to build the gems with native extensions that the fluentd installation requires.

Install plugins

  • Install the plugins that we will need for the tutorial with:
  • sudo mkdir -p /etc/fluent/plugin
  • cd /etc/fluent/plugin
  • sudo wget https://lab.nexedi.com/nexedi/fluent-plugin-wendelin/raw/master/lib/fluent/plugin/out_wendelin.rb
  • sudo wget https://lab.nexedi.com/nexedi/fluent-plugin-wendelin/raw/master/lib/fluent/plugin/wendelin_client.rb
  • sudo wget https://lab.nexedi.com/nexedi/fluent-plugin-bin/raw/master/lib/fluent/plugin/in_bin.rb
  • In Fluentd, data input and output are managed by plugins. Each plugin knows how to communicate with an external endpoint.
  • In our case the output endpoint is Wendelin, so we use the wendelin output plugin. The input is a binary file, so we use the bin input plugin.
  • Plugins that come with fluentd are located in the Ruby gems directory, but for our custom plugins we create a new folder, /etc/fluent/plugin.

Create configuration for fluentD

  • To set up a default configuration file for fluentd, run:
  • sudo fluentd --setup /etc/fluent
  • You now have a default fluentd configuration in the file /etc/fluent/fluent.conf. You can open it to see what a fluentd configuration file looks like.
  • The source tag configures inputs and the match tag configures outputs. Other tags, such as filter, also exist. Our configuration for the tutorial will be very simple, so we will hardcode it. Create a new file with:
  • vi /etc/fluent/pydata.conf, enter the code from the next slide and save
  • We are now creating a configuration file to pass to fluentd.
  • The file contains all parameters for FluentD regarding the data source and destination.
  • Normally the configuration is set up front in /etc/fluent/fluent.conf. If you open it, you can see that it can be quite complex.
  • For the tutorial we will hardcode a simple example configuration file.

FluentD Configuration File (Gist)

<source>
    @type bin
    format none
    path /!!path/to/audio file!!/sample.wav
    pos_file /!!path/to/audio file!!/sample.pos
    enable_watch_timer false
    read_from_head true
    tag pydata
</source>
<match pydata>
    @type wendelin
    @id wendelin_out
    streamtool_uri http://!!internal IPv4!!:20001/erp5/portal_ingestion_policies/pydata
    user zope
    password !!your erp5 password!!
    buffer_type memory
    flush_interval 1s
    disable_retry_limit true
</match>
  • Copy this script to the new file that you created and edit the parts inside exclamation marks.
  • In the source part, edit the path to your home directory.
  • In the output part, edit the URL of your Wendelin instance and the password.
  • A Fluentd configuration file has two parts: an input (source) part and an output part.
  • The source part contains the path to the file that you wish to upload, the path to a pos file that is used to track the position while sending the data, and a tag that is used to match the source to the output.
  • The output part is a match tag. It must match the tag given to the data in the source part. It contains the URI of the external endpoint (in our case the ingestion policy that we created in Wendelin), and the username and password used to authenticate with Wendelin.
  • Both parts also have a type, which is just the plugin that should be used.
  • Note that in the tutorial we can use the internal IPv4, since Wendelin and Fluentd run on the same machine. To send data to Wendelin on another machine, we would have to use an external IP address.
  • For more information on writing Fluentd configuration files, check out: Fluentd configuration documentation.
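
Filling in the placeholders by hand is easy to get wrong, so the file can also be generated. The following is a minimal Python sketch that uses the same keys as the configuration above. The function name render_pydata_conf and the example values are hypothetical, and deriving the pos file name from the audio file name is an assumption made for convenience.

```python
from string import Template

# Same keys as the tutorial configuration; $-placeholders replace the !!...!! parts.
PYDATA_CONF = Template("""\
<source>
    @type bin
    format none
    path $wav_path
    pos_file $pos_path
    enable_watch_timer false
    read_from_head true
    tag pydata
</source>
<match pydata>
    @type wendelin
    @id wendelin_out
    streamtool_uri http://$host:20001/erp5/portal_ingestion_policies/pydata
    user zope
    password $password
    buffer_type memory
    flush_interval 1s
    disable_retry_limit true
</match>
""")


def render_pydata_conf(wav_path, host, password):
    """Return the configuration text with all placeholders filled in."""
    # Assumption: keep the pos file next to the audio file, same base name.
    pos_path = wav_path.rsplit(".", 1)[0] + ".pos"
    return PYDATA_CONF.substitute(
        wav_path=wav_path, pos_path=pos_path, host=host, password=password
    )


# Example values only; use your own path, internal IPv4 and ERP5 password.
conf = render_pydata_conf("/home/user/sample.wav", "10.0.2.15", "secret")
```

The resulting text can then be written to /etc/fluent/pydata.conf instead of typing it in an editor.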

Save and send from Terminal

  • Save the configuration, go back to the terminal and run fluentd with:
  • fluentd -c /etc/fluent/pydata.conf
  • Notice that we use the -c option to pass our custom configuration to fluentd.
  • This will create a new Data Stream in your Wendelin instance and append the sent data to it.
  • Notice that Fluentd keeps running and waiting for new input until you interrupt it.
  • When you see that the data is on your Wendelin instance, you can interrupt it with the keyboard interrupt Ctrl+C
  • In our case we had just a small .wav file, but we could also have a continuous stream of data from a sensor, and fluentd would continuously append it to the Data Stream.

Check created Data Stream

Wendelin-ERP5 - Verify File was received
  • Head back to Wendelin/ERP5.
  • In the Data Stream Module, check the file size of the pydata stream.
  • It should show a file size larger than 0.

3. Work with Ingested Data

Out-of-Core

Wendelin Out of Core Computing
  • Wendelin.Core enables computation beyond the limits of the available RAM
  • We have integrated Wendelin and Wendelin.Core with Jupyter
  • In Jupyter we can use the ERP5 Kernel (out-of-core compliant) or the Python 2 Kernel (default Jupyter)

Enable Data Notebook

  • Head to My Favourites > Preferences.
  • Click Default System Preferences and open the Data Notebook tab.
  • Check the box to enable Data Notebook and click the save icon.

Head to Jupyter

  • The Jupyter service is on port 20000. In a new browser tab open: https://internal IPv4:20000/tree
  • For the password, check the file knowledge0.cfg on the partition where Jupyter is installed in your KVM (first use: slapos node | grep jupyter to find the partition, and then: cat /srv/slapgrid/slappart__/knowledge0.cfg). Authenticate.
  • Start a new ERP5 Notebook
  • This will make sure you use the ERP5 Kernel
  • The Python 2 Kernel is the default Jupyter Kernel
  • Using Python 2 will disregard Wendelin and Wendelin.Core, so it's basic Jupyter.
  • Using ERP5 Kernel will use Wendelin.core in the background.
  • To make good use of it, all code written should be Out-of-core "compatible"
  • For example you should not just load a large file into memory (see below).

Learn ERP5 Kernel

  • Passing login/password will authenticate Jupyter with Wendelin/ERP5
  • The reference you set will store your notebook in the Data Notebook Module

Getting Started

  • Authenticate and set an arbitrary reference for your notebook.
  • This is always the first step when you start a new Notebook with the ERP5 Kernel.
  • It makes sure that you are connected to the ERP5/Wendelin instance and that you can work with objects on it.
  • It also creates a Data Notebook object on your ERP5/Wendelin instance.
  • You can go to Data Notebook Module and see that your Data Notebook object is now saved there.

Accessing Objects

  • Type context. This will give you the Wendelin/ERP5 object.
  • Type context.data_stream_module["1"] to get your uploaded sound file.
  • Accessing data works the same way throughout: [IPv6]:30002/erp5/[module_name]/[id].
  • All modules you see on the Wendelin/ERP5 start page can be accessed like this.
  • Once you have an object you can manipulate it.
  • Note that accessing a file by its internal id (1) is only one way.
  • The standard way is to use the reference of the respective object, which also allows you to use portal_catalog to query.
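
The URL pattern above can be sketched as a small helper. This is illustrative Python, not a Wendelin API: the function name object_url and the example addresses are assumptions, and the scheme and port must be taken from your own instance.

```python
import ipaddress


def object_url(host, port, module_name, object_id, scheme="http"):
    """Build <scheme>://<host>:<port>/erp5/<module_name>/<id>.

    Literal IPv6 addresses are wrapped in brackets, as URLs require.
    """
    try:
        if ipaddress.ip_address(host).version == 6:
            host = f"[{host}]"
    except ValueError:
        pass  # not a literal IP address (e.g. a hostname); use it unchanged
    return f"{scheme}://{host}:{port}/erp5/{module_name}/{object_id}"


# Example with the internal IPv4 used earlier in this tutorial.
print(object_url("10.0.2.15", 20001, "data_stream_module", "1"))
```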

Import libraries

  • Import necessary libs.

Accessing Data Itself

  • Try to get the length of the file using getData and via iterate
  • Note that when using the ERP5 kernel, all manipulations should be "Big Data aware".
  • Just loading a file via getData() works for small files, but will break as the volume grows.
  • It is important to understand that manipulations outside of Wendelin.Core need to be Big Data "compatible".
  • Internally, Wendelin.Core runs all manipulations "context-aware".
  • An alternative way to work would be to create your scripts inside Wendelin/ERP5 and call them from Jupyter.
  • Scripts/manipulations are stored in the Data Operations Module.
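
The difference between loading everything and iterating can be illustrated without Wendelin. In this sketch an in-memory io.BytesIO stands in for the Data Stream. The helper names and the chunk size are assumptions, and the actual getData and iterate calls of the Data Stream are not shown here.

```python
import io


def length_by_loading(stream):
    """Read the whole payload at once; memory use grows with the file size."""
    stream.seek(0)
    return len(stream.read())


def length_by_chunks(stream, chunk_size=64 * 1024):
    """Walk the payload chunk by chunk; at most chunk_size bytes are held at a time."""
    stream.seek(0)
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return total


stream = io.BytesIO(b"\x00" * 1_000_000)  # stand-in for an ingested file
print(length_by_loading(stream), length_by_chunks(stream))
```

Both functions return the same number, but only the second keeps its memory use bounded as the file grows.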

Compute Fourier

  • Proceed to fetch data using getData for now
  • Extract one channel, save it back to Wendelin and compute FFT
  • We can call methods from Wendelin/ERP5.
  • Wendelin/ERP5 has a system of method acquisition. Every module can come with its own module-specific methods, and method names are always context specific ([object_name]_[method_name]). Base methods, on the other hand, are core methods of Wendelin/ERP5 and apply to more than one object.
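
The channel extraction and spectrum computation can be tried outside the notebook. The sketch below is not the notebook code of this tutorial: it uses only the Python standard library, generates a synthetic stereo tone instead of reading the ingested recording, and swaps in a naive O(n²) discrete Fourier transform where the notebook would use an FFT routine. Saving the result back to Wendelin is left out. All function names are illustrative.

```python
import cmath
import io
import math
import struct
import wave


def make_stereo_wav(freq_hz, rate=8000, n_frames=256):
    """Build a small 16-bit stereo WAV in memory (left: sine tone, right: silence)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)
        w.setframerate(rate)
        frames = bytearray()
        for n in range(n_frames):
            left = int(20000 * math.sin(2 * math.pi * freq_hz * n / rate))
            frames += struct.pack("<hh", left, 0)
        w.writeframes(bytes(frames))
    return buf.getvalue()


def read_channel(wav_bytes, channel=0):
    """Return (samples, frame rate) for one channel of a 16-bit PCM WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        n_channels = w.getnchannels()
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    # Samples are interleaved: L R L R ... so take every n_channels-th value.
    return list(samples[channel::n_channels]), rate


def dft_magnitudes(samples):
    """Naive DFT magnitudes for bins 0..n/2 (a stand-in for an FFT)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


wav_bytes = make_stereo_wav(freq_hz=500)
left, rate = read_channel(wav_bytes, channel=0)
mags = dft_magnitudes(left)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * rate / len(left)
print(peak_bin, peak_hz)
```

The strongest bin corresponds to the frequency of the generated tone, which is a quick way to check that the channel was extracted correctly before plotting.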

Display Fourier

  • Check the rendered Fourier graphs of your recorded sound file

Save Image

  • Save the image back to Wendelin/ERP5.
  • Close figure with plt.close() function, otherwise it will show on all outputs

Create BigFile Reader

  • Add a new class, BigFileReader
  • It allows passing out-of-core objects to code that expects a file
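
The idea behind such a reader can be sketched as follows. This is not the BigFileReader class from the notebook: RangedStore is a hypothetical stand-in for an out-of-core object that only supports ranged reads, and BigFileReaderSketch adapts it to the read/seek/tell interface that file-consuming code usually expects.

```python
class RangedStore:
    """Hypothetical out-of-core object: data is only available through ranged reads."""

    def __init__(self, payload):
        self._payload = payload
        self.largest_request = 0  # track how much was asked for at once

    def size(self):
        return len(self._payload)

    def read_range(self, start, end):
        self.largest_request = max(self.largest_request, end - start)
        return self._payload[start:end]


class BigFileReaderSketch:
    """File-like view on a RangedStore that never fetches more than requested."""

    def __init__(self, store):
        self.store = store
        self.pos = 0

    def read(self, n=-1):
        size = self.store.size()
        end = size if n is None or n < 0 else min(self.pos + n, size)
        data = self.store.read_range(self.pos, end)
        self.pos = end
        return data

    def seek(self, offset, whence=0):
        base = {0: 0, 1: self.pos, 2: self.store.size()}[whence]
        self.pos = max(0, min(base + offset, self.store.size()))
        return self.pos

    def tell(self):
        return self.pos


# Consume the store in 1024-byte steps, as a chunk-aware consumer would.
demo_store = RangedStore(bytes(range(256)) * 40)
demo_reader = BigFileReaderSketch(demo_store)
total = sum(len(chunk) for chunk in iter(lambda: demo_reader.read(1024), b""))
print(total, demo_store.largest_request)
```

The point of the design is that the consumer decides the chunk size, so memory use stays bounded regardless of how large the underlying object is.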

Rerun using Big File Reader

  • Rerun using the Big File Reader
  • Now one more step is out-of-core compliant
  • Verify that the graphs render the same
  • We are showing, step by step, how to convert our code to be Out-of-Core compatible
  • This is only possible for code we write ourselves
  • Whenever we have to rely on 3rd party libraries, there is no guarantee that data will be handled in the correct way. The only way to be truly Out-of-Core is either to make sure the 3rd party methods used are compatible (fixing them and committing the fixes back where needed) or to reimplement the 3rd party library completely.

Check the graphs

  • Verify graphs render the same
  • Don't forget to close the figure.

Redraw from Wendelin

  • This is the way to redraw the plot directly from data stored in Wendelin/ERP5
  • You cannot redraw content immediately after you create it. You must wait for the object to be catalogued.
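
Waiting for the catalog can be handled with a small polling loop instead of retrying by hand. The sketch below is generic Python: wait_until and the simulated lookup are hypothetical names, and the actual catalog query you pass in depends on your instance.

```python
import time


def wait_until(check, timeout=60.0, interval=1.0):
    """Call check() repeatedly until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("object was not catalogued in time")
        time.sleep(interval)


# Simulated lookup that only succeeds on the third attempt.
attempts = []


def fake_lookup():
    attempts.append(1)
    return "found" if len(attempts) >= 3 else None


print(wait_until(fake_lookup, timeout=5.0, interval=0.01))
```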

Verify Images are Stored

  • Head back to Wendelin/ERP5
  • Go to Image module and verify your stored images are there.

Verify Data Arrays are Stored

  • Switch to the Data Array module
  • Verify all computed files are there.

4. Visualize, Display computed data

Running Web Sites from Wendelin

Wendelin-ERP5 - Start Page
  • The last step is to display the results in a web app
  • Head back to the main section in Wendelin/ERP5
  • Go to the Website Module
  • One of the modules in ERP5 is the Web Site Module.
  • We will use it to create a simple Web Site to present our result.

WebSite Module

Wendelin-ERP5 - Website Module
  • Website Module contains websites
  • Open renderjs_runner - ERP5 gadget interface
  • Front end components are written with two frameworks, jIO and renderJS
  • jIO (Gitlab) is used to access documents across different storages
  • Storages include: Wendelin, ERP5, Dropbox, webDav, AWS, ...
  • jIO includes querying, offline support, synchronization
  • renderJS (Gitlab) allows building apps from reusable components
  • Both jIO/renderJS are asynchronous using promises

Renderjs Runner

  • Parameters for website module
  • see ERP5 Application Launcher - base gadget
  • Open new tab: http://internal IPv4:20001/erp5/web_site_module/renderjs_runner/
  • It is important not to forget the / at the end of the URL, otherwise the link will not work!
  • Apps are built from gadgets as a tree structure; the application launcher is the top gadget
  • All other gadgets are child gadgets of this one
  • RenderJS allows gadgets to publish/acquire methods from other gadgets to keep functionality encapsulated

Renderjs Web App

  • ERP5 interface as responsive application
  • We will now create an application like this to display our data

Clone Website

  • Go back to renderjs_runner website
  • Clone the website

Rename Website

  • Change id to pydata_runner
  • Change name to PyData Runner
  • Save

Publish Website

  • Select action Publish and publish the site
  • This changes object state from embedded to published
  • Try to access: http://internal IPv4:20001/erp5/web_site_module/pydata_runner/
  • Every object in ERP5 has a state (For example draft, published,...).
  • Workflows are used to change the state of objects.
  • A workflow in this case is to publish a webpage, which means changing its status from Embedded to Published.
  • Workflows (among other properties) can be security restricted. For example, everybody can see Web Site in published state, but only its creator can see it while it is still in draft state.
  • This concept applies to all documents in ERP5.

Layout Properties

  • Change to the "Layout Properties" tab
  • Update Front Page Gadget to pydata
  • Refresh your app (disable cache). It will be broken, as the pydata page gadget doesn't exist yet
  • One advantage of working with an async, promise-chain based framework like renderJS is the ability to capture errors
  • It is possible to capture errors on the client side, send a report to ERP5 (stack trace, browser) and not fail the app
  • Much more fine-grained control is possible; we currently just dump to screen/console

Web Page Module

  • Now we will create the pydata page gadgets. As with the website, we will clone existing default gadgets and modify them.
  • Change to web page module
  • Search for reference %worklist%
  • The web page module includes html, js and css files used to build the frontend UI
  • The usual way of working with static files is to clone a file, rename its reference and publish it alive (still editable)

Clone Worklist gadgets

  • Open both files in new tabs, clone, change title.
  • Replace "worklist" in references and titles with "pydata", save and publish alive.
  • When published alive, objects are still editable.
  • We will now edit both files to display our graph.

Pydata Gadget HTML

  • Go to edit tab on html gadget.
  • Copy and paste this script to the contents of the gadget.
  • This is a default gadget setup with some HTML.
  • Gadgets should be self-contained, so they always include all their dependencies
  • RenderJS is using a custom version of RSVP for promises (we can cancel promises)
  • The global gadget includes promisified event binding (single, infinite event listener)
  • We are using RenderJS and jIO javascript libraries.
  • RenderJS Website
  • jIO Website

Pydata Gadget JS

  • Same with javascript gadget.
  • Copy and paste this script to the javascript gadget (in the edit tab).

Save, refresh web app

  • Once you have saved your files, go back to the web app and refresh
  • You should now have a blank page with the header set correctly
  • These are just default template gadgets.
  • We will now update our gadgets to fetch our graph and display it

Update Pydata Gadget HTML

  • Update the html gadget with this script.
  • Taken from an existing project; the HTML was created to fit a responsive grid of graphs
  • Added JS library for multidimensional arrays: NDArray
  • Added JS library for displaying graphs: Dygraph

Pydata Gadget JS (1)

  • Update the js gadget with this script (screenshots on this and the next slides).
  • First we only defined options for the Dygraph plugin
  • In production system these are either set as defaults or stored along with respective data

Pydata Gadget JS (2)

  • Add methods outside of the promise chain
  • Simplified (removed actual creation of date objects)

Pydata Gadget JS (3)

  • Edit url variable for your instance and for id of your spectrum2 Data Array!
  • "ready" triggered once gadget is loaded
  • define gadget specific parameters
  • "render" called by parent gadget or automatically
  • we hardcode url parameter, by default it would be URL based

Pydata Gadget JS (4)

  • Orchestrated process starting with a cancellable promise queue
  • First step requesting the full file (NOT OUT-OF-CORE compliant - we load the whole file)
  • Return file converted into ndarray
  • Convert data into graph compatible format, store onto gadget
  • "declareService" triggered once UI is built
  • Graph will be rendered there.

Refresh Web Application

  • The example computes client-side because the project requires working offline "in the field"

Summary: What did we do?

Wendelin-ERP5 - Hyperconvergence
  • We installed Wendelin on our Linux machine and configured it.
  • We ingested data with Fluentd.
  • We worked with data using Jupyter Notebook.
  • We presented data on a website in Wendelin using the RenderJS and jIO JavaScript libraries.