Many companies use the SFTP protocol and an SFTP server (S) to exchange files between two systems and thus interface two applications (A and B).
The interface logic can be one-way:
A -> S -> B
Or can be two-way:
A <--> S <--> B
In most cases we are aware of, this type of interface is unreliable for various reasons. In this article, we will list various approaches to making an SFTP interface reliable, based on experience with other reliable interface techniques.
There are numerous ways to create a reliable interface:
CFT is a well-known tool that solves many of the problems found in file transfer between A and B. It provides guarantees that make it possible to ensure that all files sent have been received.
An exchange table in a relational database has similar advantages.
A transactional HTTP protocol (e.g. Zope transactions) implements HTTP in an application server in a way that provides transactionality similar to a database: data is either transferred entirely or not at all. Each upload leaves a trace in a log file, and each download leaves a trace as well. If the data is in addition self-certified (using a SHA hash or a signature for consistency), downloaded data can be checked for integrity.
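The self-certification idea can be sketched in a few lines: the producer publishes a digest alongside the data, and the consumer recomputes it after download. This is a minimal illustration using SHA-256; the function names are ours, not part of any specific product.

```python
import hashlib

def certify(data: bytes) -> str:
    # Producer side: publish the SHA-256 digest alongside the data.
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, expected_digest: str) -> bool:
    # Consumer side: recompute the digest after download and compare.
    return hashlib.sha256(data).hexdigest() == expected_digest

payload = b'{"order": 42, "qty": 3}'
digest = certify(payload)
intact = verify(payload, digest)            # True: transfer complete
truncated = verify(payload[:-1], digest)    # False: partial transfer detected
```

A signature scheme would work the same way, with the added benefit of authenticating the sender.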
Synchronisation protocols such as SyncML, JIO and, to some extent, embulk ensure that all parties share the same view of the same data subset at any time. By running synchronisation recurrently, the states of A and B eventually become consistent, even though synchronisation itself is not transactional.
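The eventual-consistency property can be shown with a toy sketch (ignoring conflict resolution, which real protocols such as SyncML must handle): one sync pass copies records missing on either side, and repeated passes make both states converge.

```python
def synchronise(a: dict, b: dict) -> None:
    # One hypothetical sync pass: copy records missing on either side.
    # Real synchronisation protocols also resolve conflicting updates.
    for key, value in list(a.items()):
        b.setdefault(key, value)
    for key, value in list(b.items()):
        a.setdefault(key, value)

state_a = {"d1": "invoice"}
state_b = {"d2": "order"}
synchronise(state_a, state_b)
# Both sides now hold the same data subset.
```

Even if a pass is interrupted, re-running it later still drives both states toward the same result, which is exactly why recurring synchronisation tolerates non-transactional transfers.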
Protocols such as git, or any kind of shared, read-only log file or graph, ensure that all parties share the same data consistently. All parties can thus verify that the state of an application (A or B) is consistent with the common log file or graph.
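What makes a git-style log verifiable is that each entry commits to the previous one through a hash, so any tampering breaks every later entry. A minimal sketch of such a hash-chained, append-only log:

```python
import hashlib

def append(log: list, data: bytes) -> None:
    # Each entry stores a hash covering the previous entry's hash plus its
    # own data, git-style: history cannot be rewritten unnoticed.
    prev_hash = log[-1][0] if log else ""
    entry_hash = hashlib.sha256(prev_hash.encode() + data).hexdigest()
    log.append((entry_hash, data))

def consistent(log: list) -> bool:
    # Replay the chain and check every stored hash.
    prev_hash = ""
    for entry_hash, data in log:
        if hashlib.sha256(prev_hash.encode() + data).hexdigest() != entry_hash:
            return False
        prev_hash = entry_hash
    return True

log = []
append(log, b"D0")
append(log, b"D1")
valid_before = consistent(log)               # True
log[0] = (log[0][0], b"tampered")
valid_after = consistent(log)                # False: chain broken
```

A and B can each keep a copy of such a log and compare only the latest hash to know whether they agree on the entire history.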
All the cases we described ensure the integrity and traceability of the data transferred between A and B. However, this does not ensure that the interface between A and B is fully consistent.
Each piece of data D that is transferred should in addition be well formed, for example by matching a JSON schema or an XML DTD. Beyond being well formed, the semantics of the data exchange should be validated using validation functions FA, FB which compare the state of an application (e.g. FA for A) to the history of all data that was transferred:
FA(A(t), D(0), D(1), ..., D(t)) == True
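As a concrete illustration, here is a deliberately simple FA where the application state is the set of record identifiers it holds, and validation consists of replaying the transfer history D(0)..D(t). The data model is hypothetical; a real FA depends on the semantics of A.

```python
def FA(state: set, history: list) -> bool:
    # Hypothetical validation function: the application state must equal
    # the state obtained by replaying every transferred record D(0)..D(t).
    replayed = set()
    for record in history:
        replayed.add(record["id"])
    return state == replayed

history = [{"id": 1}, {"id": 2}, {"id": 3}]
consistent_state = FA({1, 2, 3}, history)    # all transfers were applied
missing_record = FA({1, 2}, history)         # record 3 was lost somewhere
```

When FA returns False, the history tells both teams exactly which record is missing, instead of leaving them to argue about whose side lost it.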
It is mandatory to define a schema and the validation functions FA and FB to ensure that an interface is consistent.
Without them, the project teams behind A and B endlessly spend their time trying to prove that the other party is wrong, instead of both having a way to prove that they are right. Schemas and validation functions play the same role in interfaces as unit or functional tests in application development.
The role of the validation function FA is essential in particular to prevent a common problem: B never receives data D from A because A never transferred it in the first place. Without FA, this problem can never be solved.
SFTP servers have a few issues that lead to unreliable interfaces:
There is thus no way to ensure a reliable interface between A and B by using only SFTP without extra protocols.
SFTP is a bit like UDP compared to TCP: it does not guarantee any kind of integrity in the communication.
Obviously, the first step towards reliability is to design a well specified interface:
This step is not specific to SFTP. Without it, it is impossible to ever reach reliability.
Let us now solve problems specific to SFTP.
This problem can be solved with different techniques:
Compare SHA(day1.data) with the content of day1.sha.
Solution 1: remove the "erase after download" feature of the SFTP server.
Solution 2: add to the interface
Solution 1: rename each file after download (ex. day1.data -> day1.received)
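The two techniques above can be combined on the consumer side: verify the `.sha` sidecar before accepting a file, then rename the data file to mark it as received. The following sketch uses a local directory to stand in for the SFTP area; file names and the routine itself are illustrative.

```python
import hashlib
import tempfile
from pathlib import Path

def download_and_verify(directory: Path, name: str) -> bytes:
    # Hypothetical consumer routine: check the .sha sidecar, then rename
    # the data file (ex. day1.data -> day1.received) to leave a trace.
    data = (directory / f"{name}.data").read_bytes()
    expected = (directory / f"{name}.sha").read_text().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError(f"{name}.data is incomplete or corrupted")
    (directory / f"{name}.data").rename(directory / f"{name}.received")
    return data

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "day1.data").write_bytes(b"payload")
    (tmp / "day1.sha").write_text(hashlib.sha256(b"payload").hexdigest())
    result = download_and_verify(tmp, "day1")
    renamed = (tmp / "day1.received").exists()
```

Because the producer writes the `.sha` file last, a consumer that finds the sidecar knows the data file is complete; the rename then guarantees the same file is never processed twice.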
Defining a schema for Excel files or for COBOL-style data can be difficult. Yet it is not impossible, and it is mandatory.
Here are some suggestions of approaches:
As for the validation function, it is implemented in ERP5 in a simple way: we store all files we receive and all files we send in dedicated modules, and each document that relies on them (ex. a Packing List) has an aggregate relation to them. This way, we can ensure that ERP5 received everything, and ERP5 can also resend everything. We even keep a trace of all access logs to the SFTP server, so that we can prove, for example, that an SFTP directory was empty at a given date.
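On systems without such modules, the same idea can be approximated with a minimal transfer log (this sketch is not ERP5 code, just an illustration of the principle): keep every file sent and received, so that the history can be proven and anything can be resent.

```python
class TransferLog:
    """Minimal sketch: keep a copy of every file exchanged over the interface."""

    def __init__(self):
        self.received = {}  # filename -> file content
        self.sent = {}

    def record_received(self, name: str, data: bytes) -> None:
        self.received[name] = data

    def record_sent(self, name: str, data: bytes) -> None:
        self.sent[name] = data

    def resend_all(self) -> dict:
        # Everything that was sent once can be sent again.
        return dict(self.sent)

log = TransferLog()
log.record_sent("day1.data", b"payload")
log.record_received("ack1.data", b"ok")
```

With such a log on both sides, disputes about whether a file was ever transferred reduce to comparing two stored histories.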
On systems that do not keep a trace of what they send or receive, the validation function needs to be developed. Without it, there is no way to ensure that the interface works reliably.
Some people might consider blockchains a kind of log file or graph. Actually, only the Merkle tree is really useful here to ensure that data is consistent and equal on both sides, not the proof of work. Blockchains without proof of work are actually very similar to git; it is thus easier to use git.