Batch / offline processing design book / documentation

Question

Is there a book or some documentation that describes best practices for developing batch (stand-alone) processes for exchanging data between two parties?

I found some useful information on the Spring website, but it's pretty low level: batch processing strategies and general batch principles.

There are many considerations for batch processing, for example:

data transfer method (e.g. files)
control protocol between two parties
error processing
file naming conventions (when using files for transfer)
cut-off time synchronization between two sides
etc.

It would be nice if there were some kind of authoritative document or checklist for ensuring that projects are in line with best practices in this area.

UPDATE:

I will add answers to this section as I stumble upon them.

General information about batch / offline processing

This section is taken from @user1813068's answer.

You can find some architectural design patterns in this link, as well as in this one, that describe approaches to partner integration and data synchronization.

This Wikipedia page also provides an overview of high-level architectural patterns and includes patterns for data integration: architectural patterns.

The book Data Integration Blueprint and Modeling is also very good.

Data files

Most of the content in this section is taken from here: source

Using headers and footers when exchanging flat files is considered best practice. Flat files can also be exchanged without headers and footers, in which case the file name may carry part of the information that the header would otherwise contain. When using a delimited file, a field list header is always required.

Headers

When exchanging data between systems, it is very important that the receiving party knows exactly what type of data is being sent. One way to ensure this is to provide a header line that contains relevant information about the data content and how it should be processed.

When working with flat files, the file name itself can also be used to inform the receiving party of the contents of the file. However, a header line provides better support for all available options.
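As an illustration of putting content information into the file name, here is a minimal Python sketch. The naming convention itself (sender, content type and a UTC creation timestamp, separated by underscores, with a .tsv extension) is purely an assumption made for this example and does not come from the source; the two parties would have to agree on their own pattern.

```python
import re

# Hypothetical convention: <sender>_<content>_<YYYYMMDDTHHMMSSZ>.tsv
NAME_PATTERN = re.compile(
    r"(?P<sender>[a-z0-9]+)_(?P<content>[a-z0-9]+)_(?P<created>\d{8}T\d{6}Z)\.tsv"
)


def build_file_name(sender, content, created):
    """Build a file name and make sure it follows the agreed pattern."""
    name = f"{sender}_{content}_{created}.tsv"
    if NAME_PATTERN.fullmatch(name) is None:
        raise ValueError(f"file name does not follow the convention: {name!r}")
    return name


def parse_file_name(name):
    """Return the parts encoded in a file name, or raise ValueError."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return match.groupdict()
```

Rejecting names that do not match the pattern lets the receiver refuse unexpected files before reading any data.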

When working with an API, these header fields can be provided in a similar way. The implementation will be determined by the developer of the API service.

If a header is included, it consists of a single record and must always be the first line in the file.
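A header line could be produced and checked roughly as in the Python sketch below. The source does not prescribe a layout, so everything here is an assumption: the "HDR" marker, the tab-delimited key=value layout and the field names used in the test are hypothetical, and field values are assumed to be strings.

```python
HEADER_TAG = "HDR"  # hypothetical marker that identifies the header record


def build_header(fields):
    """Serialize header fields as one tab-delimited line of key=value pairs."""
    parts = [HEADER_TAG]
    for key, value in fields.items():
        # Keys may not contain "=", and nothing may contain tabs or line breaks.
        if "=" in key or any(ch in key + value for ch in "\t\r\n"):
            raise ValueError(f"invalid header field: {key!r}")
        parts.append(f"{key}={value}")
    return "\t".join(parts)


def parse_header(first_line):
    """Parse the first line of a file; raise ValueError if it is not a header."""
    parts = first_line.rstrip("\r\n").split("\t")
    if parts[0] != HEADER_TAG:
        raise ValueError("the first line of the file is not a header record")
    fields = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed header field: {part!r}")
        fields[key] = value
    return fields
```

The receiver would call parse_header on the first line only and stop processing the file if it raises.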

Footers

In file-based formats, a footer can be provided to indicate that there is no more data to process.

When processing a file, ignore any data found after the footer line. Likewise, when producing a file, keep in mind that the receiver will ignore anything written after the footer line.
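The footer rule can be sketched in Python as follows. The "FTR" marker is a hypothetical name chosen for this example. Reporting whether a footer was seen at all is also an assumption on my part: it lets the receiver treat a file without a footer as possibly incomplete.

```python
FOOTER_TAG = "FTR"  # hypothetical marker that identifies the footer record


def read_until_footer(lines):
    """Collect records up to the footer line.

    Returns (records, footer_seen). Anything after the footer is ignored.
    """
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        # The footer is recognized by its first tab-delimited field.
        if line.split("\t", 1)[0] == FOOTER_TAG:
            return records, True
        records.append(line)
    # The input ended without a footer, so the file may have been cut short.
    return records, False
```

Because the function accepts any iterable of lines, it can be fed an open file object directly.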

Data formats

Delimited Files

Delimited files are the de facto industry standard.

Comma-delimited files (CSV, or comma-separated values) usually require encapsulating the data, typically in double quotes ("); double quotes inside the data must then be escaped with either a backslash (\) or a doubled double quote (""). Because CSV implementations are inconsistent, it is recommended that you use tabs as the delimiter, without encapsulation. In this case, tabs must be removed from the data. Delimited files are usually faster to process than XML files.
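A minimal Python sketch of the tab-delimited approach without encapsulation is shown below. Replacing tabs with a space (rather than dropping them) and also replacing line breaks are assumptions made for this example; the source only says that tabs must be removed from the data.

```python
def clean_field(value):
    """Replace tabs and line breaks with spaces so they cannot split a record."""
    return str(value).replace("\t", " ").replace("\r", " ").replace("\n", " ")


def to_tsv_line(fields):
    """Join cleaned fields with tabs; no quoting or escaping is applied."""
    return "\t".join(clean_field(field) for field in fields)


def from_tsv_line(line):
    """Split a record on tabs; every field comes back as a string."""
    return line.rstrip("\r\n").split("\t")
```

Since no quoting is involved, double quotes in the data pass through unchanged and the reader only ever has to split on tabs.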

XML files

Some in the industry prefer XML files. XML allows for a clearer presentation of information because it supports nested data. However, many companies have limited or no support for this format, so it is not recommended.

Encoding

UTF-8 Encoding

All data must be encoded in UTF-8 to ensure maximum compatibility between all systems.
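In Python this comes down to always naming the encoding explicitly instead of relying on the platform default. The sketch below is one way to do it; the function names are made up for this example, and writing "\n" line endings is an assumption about what the two parties agreed on.

```python
def write_utf8(path, lines):
    """Write lines as UTF-8 with "\n" line endings on every platform."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")


def read_utf8(path):
    """Read lines as UTF-8; invalid bytes raise UnicodeDecodeError."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        return [line.rstrip("\r\n") for line in f]
```

Letting the reader fail on invalid bytes, rather than silently replacing them, makes a file sent in the wrong encoding show up as an error instead of as corrupted data.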

Dates & Time

To prevent confusion, it is recommended that you use UTC for all date and time fields.
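As a sketch of what this could look like in Python: convert every timestamp to UTC before writing it, and attach UTC again when reading it back. The text format used here (ISO 8601 style with a trailing "Z") is an assumption; the source only recommends UTC and does not name a format.

```python
from datetime import datetime, timezone

UTC_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed wire format, e.g. 2012-11-08T14:44:00Z


def to_utc_string(dt):
    """Convert an aware datetime to UTC and format it for the file."""
    if dt.tzinfo is None:
        # A naive datetime has no known offset, so it cannot be converted safely.
        raise ValueError("naive datetime: the time zone is unknown")
    return dt.astimezone(timezone.utc).strftime(UTC_FORMAT)


def from_utc_string(text):
    """Parse a timestamp written by to_utc_string into an aware UTC datetime."""
    return datetime.strptime(text, UTC_FORMAT).replace(tzinfo=timezone.utc)
```

Refusing naive datetimes forces the sending side to decide which time zone a value is in before it leaves the system.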

Some other recommendations: EDI planning and file transfer

+4

design architecture batch-processing

Chris snow Nov 08 '12 at 14:44

2 answers

Depending on your requirements, you can look at data replication systems to transfer the data as is. There are many commercial and open source tools. You can look at the source code and documentation of SymmetricDS.

If you need to do some transformation and processing, you can take a look at ETL (Extract, Transform, Load) tools. Most data warehousing books have chapters on this subject, for example, here.

+1

ali köksal Nov 09 '12 at 18:42

Chris snow · Accepted Answer · 2012-11-10T14:58:34+0000

You can find some architectural design patterns in this link, as well as in this one, that describe approaches to partner integration and data synchronization.

This Wikipedia page also provides a high-level overview of architectural patterns and includes patterns for data integration: architectural patterns.

The book Data Integration Blueprint and Modeling is also very good.
