WENDELIN combines scikit-learn machine learning and NEO distributed storage for out-of-core data analytics in Python
Analyse: Work with Ingested Data
- Wendelin.Core enables computation beyond limits of existing RAM
- We have integrated Wendelin and Wendelin.Core with Jupyter
- ERP5 Kernel (out-of-core compliant) vs. Python 2 Kernel (default)
- Head to Jupyter
- Start a new ERP5 Notebook
- This will make sure you use the ERP5 Kernel
- The Python 2 Kernel is the default Jupyter Kernel
- Using Python 2 will disregard Wendelin and Wendelin.Core, so it's basic Jupyter
- Using ERP5 Kernel will use Wendelin.core in the background
- To make good use of it, all code written should be Out-of-core "compatible"
- For example you should not just load a large file into memory (see below)
- Note you have to connect to Wendelin/ERP5
- The reference you set will store your notebook in the Data Notebook Module
- Passing login/password will authenticate Jupyter with Wendelin/ERP5
- Note that your ERP5_URL in this case should be your internal URL
- You can retrieve it by running erp5-show -s in your webrunner terminal
- Note that outside of the tutorial we would set the external IPv6 address of Zope
- Connect, set arbitrary reference and authenticate
- Import necessary libs
- context gives you the Wendelin/ERP5 object
- context.data_stream_module["1"] gets your uploaded sound file
- Accessing data works the same way throughout
- All modules you see on the Wendelin/ERP5 start page can be accessed like this
- Once you have an object you can manipulate it
- Note that accessing a file by internal id (1) is only one way
- The standard way would be to use the reference of the respective object, which also allows you to use portal_catalog to query
Todo: Accessing Data Itself (Notebook)
- Try to get the length of the file using
getData and via
- Note that when using the ERP5 kernel all manipulations should be "Big Data Aware"
- Just loading a file via getData() works for small files, but will break with volume
- It's important to understand that manipulations outside of Wendelin.Core need to be Big Data "compatible"
- Internally Wendelin.Core will run all manipulations "context-aware"
- An alternative way to work would be to create your scripts inside Wendelin/ERP5 and call them from Jupyter
- Scripts/Manipulations are stored in Data Operations Module
- Proceed to fetch data using getData for now
- Extract one channel, save it back to Wendelin and compute FFT
- Note, that ERP5 kernel at this time doesn't support
- Note the way to call methods from Wendelin/ERP5 (
- Wendelin/ERP5 has a system of method acquisition. Every module can come with its own module-specific methods, and their names are always context specific ([object_name]_[method_name]). Base methods, on the other hand, are core methods of Wendelin/ERP5 and apply to more than one object.
- Check the rendered Fourier graphs of your recorded sound file
- Save the image back to Wendelin/ERP5.
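The "Big Data Aware" note above can be illustrated with a small sketch that extracts one channel chunk by chunk instead of loading the whole file with getData(). This is not the tutorial's own code: the function name iter_channel_chunks and its parameters are hypothetical, the input is assumed to be raw interleaved 16-bit PCM in native byte order (no WAV header), and the source is assumed to be any object that supports slicing. It is written for Python 3; under the Python 2 kernel, array.fromstring would take the place of frombytes.

```python
from array import array


def iter_channel_chunks(source, total_size, channel=0, n_channels=2,
                        chunk_frames=1024):
    """Yield the samples of one channel, one bounded chunk at a time.

    Hypothetical helper: `source` is any sliceable container of raw
    interleaved 16-bit PCM bytes, `total_size` is its length in bytes.
    Only `chunk_frames` frames are held in memory at once.
    """
    frame_bytes = 2 * n_channels              # 2 bytes per 16-bit sample
    chunk_size = chunk_frames * frame_bytes   # always a whole number of frames
    usable = total_size - total_size % frame_bytes  # ignore a trailing partial frame
    for offset in range(0, usable, chunk_size):
        end = min(offset + chunk_size, usable)
        samples = array("h")                  # signed 16-bit, native byte order
        samples.frombytes(bytes(source[offset:end]))
        # Interleaved layout: every n_channels-th sample belongs to one channel.
        yield samples[channel::n_channels]


# Tiny in-memory stand-in for a large data stream: 3 stereo frames.
pcm = array("h", [1, -1, 2, -2, 3, -3]).tobytes()
left = [list(c) for c in iter_channel_chunks(pcm, len(pcm), channel=0,
                                             chunk_frames=2)]
```

Each yielded chunk could be appended to the stored result right away, so memory use depends on chunk_frames rather than on the size of the recording.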
Todo: Create BigFile Reader (Notebook)
- Add a new class BigFileReader
- Allows passing out-of-core objects
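One way such a class might look is sketched below. Only the name BigFileReader comes from this tutorial; the constructor arguments and method bodies are assumptions, not the notebook's actual implementation. The idea is to give a sliceable, possibly out-of-core object the small read/seek/tell interface that file-based libraries expect, so that only the requested slice is pulled into memory. A plain bytes object stands in for the out-of-core data here.

```python
import io
import wave


class BigFileReader(object):
    """Read-only, file-like view over a sliceable byte container (sketch)."""

    def __init__(self, source, size):
        self.source = source  # assumed: supports source[start:stop]
        self.size = size      # total length in bytes
        self.pos = 0

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self.pos + offset
        elif whence == io.SEEK_END:
            new_pos = self.size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        # Simplification: clamp the position into [0, size].
        self.pos = max(0, min(new_pos, self.size))
        return self.pos

    def read(self, n=-1):
        if n is None or n < 0:
            n = self.size - self.pos
        # Only this slice is materialised in memory.
        chunk = bytes(self.source[self.pos:self.pos + n])
        self.pos += len(chunk)
        return chunk


# Build a tiny mono WAV in memory as a stand-in for the uploaded sound file.
buf = io.BytesIO()
writer = wave.open(buf, "wb")
writer.setnchannels(1)
writer.setsampwidth(2)
writer.setframerate(8000)
writer.writeframes(b"\x01\x00\x02\x00\x03\x00")
writer.close()
raw = buf.getvalue()

# The standard wave module accepts any object with read/seek/tell.
wav = wave.open(BigFileReader(raw, len(raw)), "rb")
n_frames = wav.getnframes()
frames = wav.readframes(n_frames)
```

Because the reader never copies the whole source, the same code path should work whether the backing object is a small bytes string or a much larger array, as long as the consuming library itself reads in bounded pieces.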
Todo: Rerun using Big File Reader (Notebook)
- Rerun using the Big File Reader
- Now one more step is out of core compliant
- Verify graphs render the same
- This shows how to convert our code, step by step, to be Out-of-Core compatible
- This will only be possible for code we write ourselves
- Whenever we have to rely on 3rd party libraries, there is no guarantee that data will be handled in the correct way. The only options to be truly Out-of-Core are either to make sure the 3rd party methods used are compatible, fixing them and committing the fixes back where needed, or to reimplement the 3rd party library completely.
- Redraw the plot directly from data stored in Wendelin/ERP5
Todo: Verify Images are Stored
- Head back to Wendelin/ERP5
- Go to Image module and verify your stored images are there.
Todo: Verify Data Arrays are Stored
- Switch to the Data Array module
- Verify all computed files are there.