Have AI. Need Data: How Big Data App Stores close the Data Divide between Startups and Industry

A Big Data App Store allows a startup company to run its data analysis and A.I. software against the data sets of large corporations, with revenues shared between both. It offers industry and startups an opportunity to compete with GAFA (Google Apple Facebook Amazon) or BAT (Baidu Alibaba Tencent) by closing the Data Divide that results from compliance constraints or strategic mismatch, and it can be implemented in three months by a small team using Wendelin technology.
  • Last Update: 2016-08-30
  • Version: 001
  • Language: en

A Store for Big Data?

A Big Data App Store solves the following problem: large corporations with great data sets have few data scientists to work on them while IT startups with great data scientists have little data to analyse ("Data Divide"). Large corporations usually refuse to grant startup companies access to their data, either for legal compliance or for strategic reasons. And so thousands of great ideas and potential businesses based on Big Data never see the light of day.

Because of this Data Divide, the few companies with both huge data sets and great data scientists, such as Google or Alibaba, eventually end up capturing the entire market. Worse, young scientists aware of this situation prefer to work for GAFA (Google Apple Facebook Amazon) or BAT (Baidu Alibaba Tencent) rather than for a company without data or without great mentors.

The Big Data App Store changes this situation by enabling startups to run their data analysis algorithms against the data lakes of large corporations. Results can be accessed through an API strictly controlled by the large corporation to eliminate any risk of data leaks, while revenues resulting from the use of the API can be shared between the startup company owning the algorithm source code and the corporation owning the data.

This way, a Big Data App Store can bridge the Data Divide between startups and industry. What would this look like in real life? Imagine the following scenarios.

Data Divide Photo: fotolia.fr - CYCLONEPROJECT

Scenario 1: Health Big Data App Store

A group of hospitals with 1,000 TB of PET scan images was looking for a machine learning algorithm to predict lung cancer but had no in-house data scientists. The group considered hiring a world-class team, but the salary of known experts was well above the salary of doctors - not acceptable under hospital rules. It also tried hiring a team of young graduates, but the team produced nothing in two years. One day, a startup company abroad, created by a famous mathematician, offered to research and develop a suitable algorithm. To create the algorithm, the startup company asked for a copy of all the data (which would soon fit on 8 SSD disks). But the group refused, because national laws did not allow it to provide a copy of such data.

Instead, the group proposed a different approach: rather than providing a copy of the data, it asked the startup company to provide a copy of the source code within a revenue sharing agreement. The mathematician accepted, because he had not been able to find any suitable data sets in many years. Also, he knew that Google was stealthily preparing something similar. So the best option for him was to share rather than eventually lose his competitive advantage.

The hospital group provided the mathematician with an open source Big Data development environment called "Wendelin", which also included a small set of sample data that some patients had agreed to share. The mathematician could adapt his algorithm by writing a Python script based on the scikit-learn and scikit-image libraries. Once everything had been tested and worked, he submitted his script for review to the Big Data App Store set up by the hospital group.

The group's development team reviewed the script, ensuring that it neither included malware nor tried to steal data. Once the review passed, the script was approved and published in the Big Data App Store, allowing the algorithm to run on the complete data set. After a week of computation, a first machine learning model was created. Each time a new PET scan image was added to the data lake, the algorithm was able to detect lung cancer. After a few years in operation, it was also discovered that the algorithm could predict lung cancer with good accuracy three years in advance.
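Part of such a review could be automated before a human reads the script. The following is a minimal sketch, not a description of how Wendelin or the hospital group actually reviews submissions: it parses a submitted script with Python's standard `ast` module and reports imports outside an allow-list and calls to a few sensitive built-ins. The names `ALLOWED_MODULES`, `FLAGGED_CALLS` and `screen_script` are hypothetical and chosen for illustration only.

```python
import ast

# Hypothetical allow-list: modules a submitted script may import.
ALLOWED_MODULES = {"numpy", "sklearn", "skimage", "math"}
# Hypothetical list of built-in calls that should trigger a manual review.
FLAGGED_CALLS = {"eval", "exec", "open", "__import__", "compile"}


def screen_script(source):
    """Return a list of human-readable findings for a submitted script."""
    findings = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Collect the module names referenced by import statements.
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            top_level = name.split(".")[0]
            if top_level not in ALLOWED_MODULES:
                findings.append(
                    f"line {node.lineno}: import of '{name}' is not allowed")
        # Flag direct calls to sensitive built-ins such as open() or eval().
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FLAGGED_CALLS:
                findings.append(
                    f"line {node.lineno}: call to '{node.func.id}' needs manual review")
    return findings


submitted = """import socket
from sklearn import svm
data = open('/etc/passwd').read()
"""
for finding in screen_script(submitted):
    print(finding)
```

A static check of this kind only narrows down what a reviewer has to look at; it cannot prove that a script is harmless, which is why the scenario also relies on human review and on access restrictions at run time.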

Today, most hospitals in the world utilize this algorithm by calling an API of the Big Data App Store. Both the source code and data remain secret. Each prediction is sold for 1€ with revenue being shared by the mathematician and hospital group. With over ten million API calls per year, the group is now starting to create new Big Data applications in other fields of health.
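The arithmetic behind this scenario can be sketched in a few lines. The price of 1€ per prediction and the volume of ten million calls come from the scenario; the 50/50 split is purely an assumption made for illustration, since the scenario does not state the agreed ratio, and `split_revenue` is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_PREDICTION = Decimal("1.00")  # euros, as stated in the scenario


def split_revenue(api_calls, algorithm_owner_share):
    """Split gross API revenue between the algorithm owner and the data owner.

    algorithm_owner_share is a fraction between 0 and 1. The value used
    below is an assumption; the scenario does not give the actual ratio.
    """
    share = Decimal(str(algorithm_owner_share))
    if not Decimal("0") <= share <= Decimal("1"):
        raise ValueError("share must be between 0 and 1")
    gross = PRICE_PER_PREDICTION * api_calls
    to_algorithm_owner = (gross * share).quantize(Decimal("0.01"))
    # The data owner receives the remainder, so both parts always add up.
    to_data_owner = gross - to_algorithm_owner
    return gross, to_algorithm_owner, to_data_owner


gross, mathematician, hospitals = split_revenue(10_000_000, 0.5)
print(gross, mathematician, hospitals)
```

With ten million calls per year, gross revenue is ten million euros per year; under the assumed equal split, each party would receive five million.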

Scenario 2: Automotive Big Data App Store

An automotive company is afraid that the combination of Google Maps, Open Source Vehicle (OSV) and A.I. could lead to an industry in which value added moves entirely to the data economy: an industry in which cars are produced by small workshops and algorithms are provided by GAFA or BAT.

This company had already tried twice to build its own team of data scientists but, due to the lack of progress after three years of trying best-of-breed open source solutions (OpenStack, Hadoop, Cloudera, Spark, Docker, etc.), it outsourced all its Big Data to a large IT corporation. However, the best data scientists of that corporation had also already moved to GAFA or BAT. The company thus became a playground for the sales team of the IT corporation and an exhibition center for legacy proprietary software with high licensing costs. All telematics services had also been outsourced, with the result that a significant part of the data was no longer owned by the company and was only available in mutually incompatible formats.

An engineer in the automotive company discovered a way to create a new algorithm that can predict vehicle failure and, by doing so, lead to a measurable increase in sales. He tried to implement it with the Big Data system provided by the large IT corporation but it failed: due to poor architecture and high-priced licenses, operating costs were higher than the profits generated by this new algorithm.

He created a startup company and completed a first implementation in a couple of weeks using a Python script based on the scikit-learn machine learning library. Initial data was collected using a low cost telematics device purchased on Alibaba. Proprietary vehicle data was cracked by a team of engineers in India in less time than it had previously taken to contact the legal department of the automotive company.

But although a working algorithm now existed, there was no way to access data from a large fleet of vehicles. The engineer was approached by Google and Tesla, but tried one last time to convince the automotive company. Discussions started with the newly appointed Chief Digital Officer but, due to the strict privacy laws in Europe or Japan, it was not possible to provide a copy of car data to the startup company, even though there was no problem of trust. This led both the engineer and the automotive company to search for ways to work around the legal issues.

The engineer eventually suggested that the automotive company build a "Big Data App Store" and copy all data into it using embulk. In order to run the algorithm efficiently, data needed to be stored in a data structure called ndarray that was not natively supported in the Big Data lake provided by the IT corporation. And in order to run efficiently using scikit-learn, Python had to be used natively. The code of the algorithm could then be uploaded into the "Big Data App Store", removing the need for any data to be taken "out" of the company.

The Big Data App Store was created in three months using 8 servers with 16 TB SSD disks each. The Python code was run in the automotive company's Big Data App Store using the data available and eventually generated a 1% increase in sales as well as higher customer satisfaction. Revenues were then shared between the startup company and the automotive company.

The Anatomy of a Big Data App Store

A Big Data App Store can be launched in less than three months, with a budget of less than 50k€, using Wendelin technology.

It requires the following components:

  1. Reliable data collection and aggregation (to ingest data into the data lake)
  2. High performance scalable storage (to process data with data analysis libraries)
  3. Data analysis libraries (including machine learning)
  4. Parallel processing (to process huge data quickly)
  5. Out-of-core processing (to handle large machine learning models)
  6. Rule based data access restrictions (to isolate applications)
  7. Application submission workflow (to submit applications)
  8. API registration workflow (to let applications publish an API)
  9. Accounting (to count CPU usage, data usage and API calls of applications)
  10. Billing (to charge the different stakeholders and users)
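To make components 7 to 9 more concrete, here is a minimal sketch of a submission workflow combined with usage accounting. It is an illustration under stated assumptions, not Wendelin's actual workflow: the states, the `Application` class and the counter names (`api_calls`, `cpu_seconds`, `bytes_read`) are all hypothetical.

```python
from collections import Counter
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    APPROVED = "approved"
    REJECTED = "rejected"
    PUBLISHED = "published"


# Allowed transitions of the (hypothetical) submission workflow.
TRANSITIONS = {
    State.SUBMITTED: {State.APPROVED, State.REJECTED},
    State.APPROVED: {State.PUBLISHED},
    State.REJECTED: set(),
    State.PUBLISHED: set(),
}


class Application:
    def __init__(self, name):
        self.name = name
        self.state = State.SUBMITTED
        # Accounting: running totals of API calls, CPU time and data read.
        self.usage = Counter()

    def move_to(self, new_state):
        """Advance the workflow, refusing transitions that skip the review."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}")
        self.state = new_state

    def record_call(self, cpu_seconds, bytes_read):
        """Count one API call; only published applications may serve calls."""
        if self.state is not State.PUBLISHED:
            raise RuntimeError("only published applications may serve API calls")
        self.usage.update(api_calls=1, cpu_seconds=cpu_seconds,
                          bytes_read=bytes_read)


app = Application("failure-predictor")
app.move_to(State.APPROVED)
app.move_to(State.PUBLISHED)
app.record_call(cpu_seconds=2, bytes_read=4096)
app.record_call(cpu_seconds=3, bytes_read=8192)
print(app.state.value, dict(app.usage))
```

The usage counters are what a billing component (item 10) would later read in order to charge API users and distribute revenue between stakeholders.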

With Wendelin, all components are based on the same technology and the same language: Python. Wendelin leverages the largest data science community: PyData. Data is handled in its native format, ndarray, without requiring format conversions. And with wendelin.core, there are no limits imposed on the size of the data set.
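The general idea of out-of-core processing, working on data larger than memory by touching only a bounded part of it at a time, can be illustrated with the Python standard library alone. The sketch below does not use wendelin.core and makes no claim about its API; `chunked_mean` is a hypothetical helper that computes the mean of a file of 64-bit floats while holding only one chunk in memory.

```python
import array
import os
import tempfile


def chunked_mean(path, chunk_items=1024):
    """Mean of a file of native-endian float64 values, read chunk by chunk."""
    total = 0.0
    count = 0
    item_size = array.array("d").itemsize
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_items * item_size)
            if not raw:
                break
            # Only this chunk is held in memory, never the whole file.
            chunk = array.array("d")
            chunk.frombytes(raw)
            total += sum(chunk)
            count += len(chunk)
    if count == 0:
        raise ValueError("file contains no values")
    return total / count


# Write a small sample file; a real data set would be far larger than RAM.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    array.array("d", range(10_000)).tofile(tmp)
    sample_path = tmp.name

mean_value = chunked_mean(sample_path, chunk_items=256)
os.remove(sample_path)
print(mean_value)
```

Memory use depends on `chunk_items` rather than on the size of the file, which is the property that matters when the data set or the machine learning model no longer fits in RAM.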

If one tried to build such an App Store with heterogeneous technologies (e.g. Java, Python, Spark, HDFS, etc.), the high complexity of the system would result in a longer time to deliver, higher maintenance costs, frequent instability due to API changes and possibly lower performance due to overhead. It is for example well known that combining Python and Spark leads to various types of processing overhead and suboptimal memory management that can create serious issues in a mission critical system.

Moreover, embedding "rule based data access restrictions" directly into the Python programming language used in Wendelin is still a unique feature, without which it would not be possible to operate an app store. "Restricted Python" technology ensures that every data access in every line of source code published in the Big Data App Store has previously been granted and is traceable. Applications in the app store are thus blocked from stealing secrets from another application, although they share the same environment and the same database.
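The two properties described here, access granted in advance and every access traceable, can be sketched with a small guarded data store. This is only an illustration of the rule check and the audit trail; it does not show how "Restricted Python" enforces such rules inside the language, and `GuardedStore`, `AccessDenied` and the sample application names are hypothetical.

```python
class AccessDenied(Exception):
    """Raised when an application reads a data set it was not granted."""


class GuardedStore:
    """Data store that checks a grant table and logs every read attempt."""

    def __init__(self, data, grants):
        self._data = data        # data set name -> values
        self._grants = grants    # application name -> set of data set names
        self.audit_log = []      # (application, data set, allowed) tuples

    def read(self, application, dataset):
        allowed = dataset in self._grants.get(application, set())
        # Log first, so that denied attempts are traceable too.
        self.audit_log.append((application, dataset, allowed))
        if not allowed:
            raise AccessDenied(f"{application} may not read {dataset}")
        return self._data[dataset]


store = GuardedStore(
    data={"pet_scans": [0.1, 0.7], "billing": [120, 80]},
    grants={"cancer-predictor": {"pet_scans"}},
)
print(store.read("cancer-predictor", "pet_scans"))
try:
    store.read("cancer-predictor", "billing")
except AccessDenied as error:
    print("denied:", error)
print(store.audit_log)
```

In this sketch the check lives in a single accessor that applications are expected to call. The point made in the text is stronger: when the restriction is embedded in the language itself, submitted code has no way of reaching the data without passing through such a check.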

Conclusion

In less than three months, any large corporation can create a "Big Data App Store" and invite thousands of startup companies to leverage its data and create new A.I. and machine learning applications. Revenue can easily be shared between data owners and creators of data analysis algorithms. "Big Data App Stores" thus close the Data Divide between startups and industry and allow both to compete efficiently with GAFA (Google Apple Facebook Amazon) or BAT (Baidu Alibaba Tencent).

Contact

  • Jean-Paul Smets
  • jp (at) nexedi (dot) com
  • Jean-Paul Smets is the founder and CEO of Nexedi. After graduating in mathematics and computer science at ENS (Paris), he started his career as a civil servant at the French Ministry of Economy. He then left government to start a small company called “Nexedi” where he developed his first Free Software, an Enterprise Resource Planning (ERP) designed to manage the production of swimsuits in the not-so-warm but friendly north of France. ERP5 was born. In parallel, together with Hartmut Pilch (FFII), he led the successful campaign to protect software innovation against the dangers of software patents. The campaign eventually succeeded by rallying more than 100,000 supporters and thousands of CEOs of European software companies (both open source and proprietary). The proposed directive on the patentability of computer-implemented inventions was rejected on 6 July 2005 by the European Parliament by an overwhelming majority of 648 to 14 votes, showing how small companies in Europe can together defeat the powerful lobbying of large corporations. Since then, he has helped Nexedi to grow either organically or by investing in new ventures led by bright entrepreneurs.
  • Sven Franck
  • sven (dot) franck (at) nexedi (dot) com
  • Cédric Le Ninivin
  • cedric (dot) leninivin (at) nexedi (dot) com