Can OpenStreetMap provide routing information in real time, or even simply based on the time of day? Probably not. But Google Maps can.
Does ERP5 CRM automatically suggest incoming leads based on past track record? Probably not. But Salesforce.com can, or could through data.com.
Do medical students have access to the large annotated cancer data sets at the core of A.I.-based prevention? Probably not. But some private A.I. medical companies do.
Free Software is no longer sufficient to regulate the digital economy through open and interoperable standards: it is no longer in a position to compete with data-augmented applications. Open Content Text Books, the Free Software of education, are no longer sufficient to transfer knowledge to future generations, because this knowledge is now increasingly hidden in secret data owned by private companies and protected by privacy laws. In the case of medicine, the impossibility of transferring knowledge to future generations is a violation of the second principle of the Hippocratic Oath ("I will respect the hard-won scientific gains of those physicians in whose steps I walk, and gladly share such knowledge as is mine with those who are to follow").
The market does not seem to provide any valid solution to the new issues posed by the data economy.
The first theorem of welfare economics does not apply to data, or more generally to any non-rival good (software, culture, army, police, etc.). One consequence is that data welfare is maximised if and only if access to data is free. Unless access to data is free, sub-optimal outcomes eventually occur, such as the formation of monopolies. But free access to personal data is prohibited by privacy laws, and free access to commercial data is prohibited by trade secret laws.
Current legislation is also biased in favour of large companies.
Whenever a data-driven company with a data set A acquires a company with a data set B, the value of the merged company equals the value of each data set plus the value of the possible correlations between A and B. Since a data-driven company faces virtually no diminishing returns, the market eventually leads to data monopolies, because correlating A and B is only legal after the merger. Small companies operating independently cannot benefit from such correlations, since any collaboration or data exchange among them would violate the laws that protect privacy or trade secrets.
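The merger argument can be made concrete with a toy calculation (all figures below are invented for illustration): the synergy term exists for the merged company but is legally forced to zero for two independent companies.

```python
# Toy illustration of the merger argument: the value of a merged
# data-driven company exceeds the sum of its parts by the value of the
# correlations between the two data sets, which only become legal to
# exploit after the merger. All figures are hypothetical.

def merged_value(value_a: float, value_b: float, correlation_value: float) -> float:
    """Value of the merged company: each data set plus their correlations."""
    return value_a + value_b + correlation_value

# Two independent companies: exploiting correlations between A and B is
# illegal (privacy / trade secret laws), so the synergy term is zero.
independent_total = merged_value(100.0, 60.0, 0.0)

# After a merger, correlating A and B becomes legal.
merged_total = merged_value(100.0, 60.0, 45.0)

print(independent_total)  # 160.0
print(merged_total)       # 205.0
```

The gap between the two totals is exactly the incentive that pushes the market towards concentration.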
Some companies may consider that the protection of trade secrets has a value that justifies not using data services that could disclose those secrets, even though avoiding such services also means being less efficient.
For example, a European company wishing to keep its prospects secret may require its sales workforce to use OpenStreetMap with self-hosted OSRM routing rather than Google Maps, and to use smartphones that do not connect to Google, Apple or any US-based company. This could be useful for a European company in defence or aerospace, considering that under US law any US company has an obligation to deliver specific data to the NSA (no matter where its data centres are located), and that there is a fairly long track record of US government actions to defeat European, Japanese and other competitors in foreign export markets (e.g. by providing a copy of competing offers to US companies bidding on the same market).
For this type of case, it would make sense to build a repository of routing information with average times for each route based on the day of the week, the month, weather conditions, etc. Even though this information is less precise than real-time updates, it helps offer a service similar to Google Maps navigation that can operate offline and protects trade secrets. It also empowers many researchers and small businesses to contribute new routing algorithms that might even be better than Google's.
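Such a repository could be as simple as a lookup table of observed travel times keyed by route and conditions. The sketch below assumes an invented schema (route identifiers, weekday codes, weather labels); it is not the format of any actual repository, only a minimal illustration of the idea.

```python
# Minimal sketch of an offline routing-statistics store: average travel
# times per route, keyed by day of week, month and weather condition.
# Less precise than real-time data, but it leaks nothing about who
# travels where. Schema and identifiers are hypothetical.
from collections import defaultdict
from statistics import mean

class RouteTimeRepository:
    def __init__(self):
        # (route_id, weekday, month, weather) -> observed durations in seconds
        self._samples = defaultdict(list)

    def record(self, route_id, weekday, month, weather, duration_s):
        """Contribute one observed travel time to the shared repository."""
        self._samples[(route_id, weekday, month, weather)].append(duration_s)

    def average_time(self, route_id, weekday, month, weather):
        """Average travel time for these conditions, or None if no data."""
        samples = self._samples.get((route_id, weekday, month, weather))
        return mean(samples) if samples else None

repo = RouteTimeRepository()
repo.record("A6-paris-lyon", "mon", 7, "rain", 16200)
repo.record("A6-paris-lyon", "mon", 7, "rain", 17400)
print(repo.average_time("A6-paris-lyon", "mon", 7, "rain"))  # 16800.0
```

Because the table contains only aggregate statistics, it can be published in a public repository and queried entirely offline.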
We can generalise this example to any situation in which collecting, in a public repository, freely accessible data sets that are smaller or less "fresh" than those of the data monopolies is compensated by a competitive advantage resulting from better privacy or more competition. We call this approach "Free Big Data". It is similar to Free Software: anyone can download, process or contribute to any big data set, and each big data set is governed by a process of merge requests.
A first implementation of a Free Big Data repository will be presented on November 12, 2018 by the "Wendelin IA" project, sponsored by the French government's FUI programme, following the release in 2010 by Nexedi and partners of Data Publica, one of the first repositories of downloadable public open data. All source code of "Wendelin IA" is licensed as Free Software.
There are currently two limitations to the "Free Big Data" approach: the industry's lack of awareness of trade secret protection risks, and governments' lack of will to open their data.
On the one hand, some large European companies tend to underestimate the risks involved in sharing their data through US-based cloud platforms. A typical example is a European aerospace company that started a corporate-wide "Google Only" policy shortly after its CEO hired an ex-Googler as CTO, and later adopted Google Suite and Palantir. It is now nearly certain that the NSA has access to some parts of its corporate data (due to extra-territorial effects of US law that no contract can mitigate). Some government experts who witnessed how the Concorde programme was eventually killed by US government actions fear that the adoption of US-based cloud platforms will ease similar attacks in the future, and question why the company rejected equivalent European cloud solutions licensed as Free Software.
Governments, on the other hand, have been trying to monetise their public data rather than provide free access to it. Legal obligations voted in favour of open data are still not always met, and the governments in charge of enforcing open data may even try to ignore it. Yet the need to share big data as a key factor for the development of A.I. outside the big platforms has been recognised: the French government published a Request for Information on A.I. data sharing, and some governmental organisations famous for their long history of rejecting open data are now providing initial access to their data.
Not all big data sets can be published or downloaded for free by anyone. Health data, for instance, has to remain secret, in application of the Hippocratic Oath among other reasons, unless a proper method exists to anonymise it before publishing it in a Free Big Data repository.
Until that happens, this data could be shared in another way: through a Big Data Appstore (see "Have AI. Need Data: How Big Data App Stores close the Data Divide between Startups and Industry"). A Big Data Appstore lets users upload scripts that process an entire data set, searching for correlations or building A.I. models. The models produced by these scripts can be downloaded, but not the data used to build them. Models can also be executed online.
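The contract of a Big Data Appstore can be sketched in a few lines (all names below are hypothetical, not an actual API): user scripts execute next to the data, and only the model they produce ever leaves the platform.

```python
# Minimal sketch of the Big Data Appstore contract: a user-supplied
# script runs against the private data set inside the platform; only the
# model it returns is handed back, never the raw data. All class and
# method names are hypothetical.

class BigDataAppstore:
    def __init__(self, data_set):
        self._data_set = data_set  # never exposed to users directly

    def run_script(self, script):
        """Execute a user script against the private data set.

        `script` is a callable that receives the data set and returns a
        model (e.g. fitted parameters). Only the model is returned to
        the user, who can then download it or execute it online.
        """
        return script(self._data_set)

# Example user script: fit a trivial "model" (a mean) from data the
# script may read in place but can never export.
store = BigDataAppstore(data_set=[2.0, 4.0, 6.0])
model = store.run_script(lambda data: {"mean": sum(data) / len(data)})
print(model)  # {'mean': 4.0}
```

A production platform would of course sandbox the scripts and meter their processing time; the point here is only the asymmetry between what goes in (code) and what comes out (a model).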
The business model of a Big Data Appstore is clear: developers can upload their scripts for free but must pay for processing time on the platform.
Typical organisations that can run a Big Data Appstore are the owners of large data sets that cannot legally be transferred to anyone: every such data set is a potential candidate for a Big Data Appstore operated by its owner. The Big Data Appstore is thus a very effective, self-financed solution to open research on all the types of data that do not fit the "Free Big Data" model.
Every person or company is a potential source of very useful data published intentionally and explicitly. We could thus consider aggregating all the data intentionally published by companies and individuals into one large data set. But could we still maintain high quality?
This is where the idea of curation comes in: a company or person is entitled to publish any information it is willing to share. This can be an address, a phone number, a list of that person's diseases, a place of birth, etc. It could even be the log of IP addresses that have accessed their web site (if this is legal).
Each piece of data published by that person or company can be curated on demand, at a modest price. If a search engine is built from this data, curated data will often be displayed first, with non-curated data afterwards. Any person willing to provide precise information for some reason (usually advertising, but not only) is thus going to pay for extensive curation.
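The ranking rule just described can be sketched directly (the record schema and relevance scores below are invented for illustration): curated entries come first, then results are ordered by relevance within each group.

```python
# Toy sketch of the paid-curation ranking rule: in a search engine built
# from published data, curated entries are displayed before non-curated
# ones, then by descending relevance within each group. The record
# schema and scores are hypothetical.

def rank_results(entries):
    """Sort search results: curated first, then by descending relevance."""
    return sorted(entries, key=lambda e: (not e["curated"], -e["relevance"]))

results = rank_results([
    {"name": "Shop A", "curated": False, "relevance": 0.9},
    {"name": "Shop B", "curated": True,  "relevance": 0.4},
    {"name": "Shop C", "curated": True,  "relevance": 0.7},
])
print([r["name"] for r in results])  # ['Shop C', 'Shop B', 'Shop A']
```

Note that a highly relevant but uncurated entry (Shop A) still ranks below every curated one: visibility is what the curation fee buys.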
The Paid Curation model is a suitable alternative to the "pay-per-click" model of today's large platforms for aggregating data and financing its curation.