Apache Beam - Integration Test with Unbounded PCollection

We are building an integration test for an Apache Beam pipeline and are facing some problems. Context below.

Details of our pipeline:

  • We use PubsubIO as our data source (unbounded PCollection)
  • Intermediate transforms include a custom CombineFn and a very simple windowing strategy
  • Our final transform is JdbcIO, using org.neo4j.jdbc.Driver to write to Neo4j

Current Testing Approach:

  • Run the Google Cloud Pub/Sub emulator locally, so the tests can run against it
  • Create an in-memory Neo4j database and pass its URI into our pipeline options
  • Start the pipeline by calling OurPipeline.main(TestPipeline.convertToArgs(options))
  • Use the Google Cloud Java Pub/Sub client library to publish messages to a test topic (via the Pub/Sub emulator), which PubsubIO will read from
  • Data should then flow through the pipeline and eventually land in our in-memory Neo4j instance
  • Make simple assertions about the presence of that data in Neo4j

This is intended to be a simple integration test that verifies that our pipeline as a whole behaves as expected.
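The publishing step above can be sketched with the Google Cloud Java client pointed at the emulator. Everything here is illustrative rather than the asker's actual code: the emulator host "localhost:8085", the project "test-project", the topic "test-topic", and the helper class name are all assumptions.

```java
import com.google.api.gax.core.NoCredentialsProvider;
import com.google.api.gax.grpc.GrpcTransportChannel;
import com.google.api.gax.rpc.FixedTransportChannelProvider;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class EmulatorPublishSketch {
  public static void publishTestMessage() throws Exception {
    // Plaintext gRPC channel to the local emulator (no TLS, no credentials).
    ManagedChannel channel =
        ManagedChannelBuilder.forTarget("localhost:8085").usePlaintext().build();
    try {
      Publisher publisher =
          Publisher.newBuilder(TopicName.of("test-project", "test-topic"))
              .setChannelProvider(
                  FixedTransportChannelProvider.create(GrpcTransportChannel.create(channel)))
              .setCredentialsProvider(NoCredentialsProvider.create()) // emulator needs no auth
              .build();
      // Publish one message that PubsubIO in the pipeline under test can read.
      publisher.publish(
          PubsubMessage.newBuilder().setData(ByteString.copyFromUtf8("hello")).build());
      publisher.shutdown();
    } finally {
      channel.shutdown();
    }
  }
}
```

Note that publish() is asynchronous; in a real test you would wait on the returned future before asserting anything downstream.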

The problem: with the DirectRunner, pipeline.run() (and likewise pipeline.run().waitUntilFinish()) blocks indefinitely, because the source PCollection is unbounded (Pub/Sub never signals the end of input), so the pipeline keeps running and the test never completes.

So, my questions are:

1) How can I run the pipeline without blocking, and how do I shut it down once the test has made its assertions?

2) How do I know when the messages have actually been processed? Publishing to Pub/Sub is asynchronous, and so is delivery from Pub/Sub.

3) Is this a reasonable way to integration-test a Beam pipeline at all? Any advice / best practices are welcome.

Any help or pointers are greatly appreciated - thanks in advance!


On the DirectRunner you can set isBlockOnRun to false. pipeline.run() then returns a PipelineResult immediately, and you can call cancel() on it once your assertions have passed.
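A minimal sketch of that non-blocking approach, assuming Beam's Java SDK and DirectRunner; the class and method names and the elided polling logic are illustrative, not the asker's code:

```java
import java.io.IOException;

import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class NonBlockingRunSketch {
  static void runIntegrationTest() throws IOException {
    PipelineOptions options = PipelineOptionsFactory.create();
    // Tell the DirectRunner not to block inside run().
    options.as(DirectOptions.class).setBlockOnRun(false);

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline: PubsubIO -> CombineFn -> JdbcIO ...

    PipelineResult result = pipeline.run(); // returns immediately now

    // ... publish test messages, then poll Neo4j (with a timeout)
    //     until the expected data shows up ...

    result.cancel(); // shut the unbounded pipeline down when done
  }
}
```

Because there is no "end of input" signal, the polling loop with a timeout is what answers question 2 in this style of test.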

That said, this is generally not the recommended way to test a Beam pipeline. Rather than invoking main() and exercising the real sources and sinks, factor the logic you care about into a PTransform. A PTransform has well-defined inputs and outputs, so it can be tested in isolation, independent of where the data comes from or goes.

You can then feed that PTransform a bounded input built with Create (a fixed, in-memory set of elements) or TestStream (a deterministic, scripted stream of elements and watermark advances; the DirectRunner supports TestStream), apply the PTransform, and use PAssert on the output PCollection to assert on its contents. Because the input is bounded, the test terminates on its own.
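A test along those lines might look like the following sketch. The transform under test is replaced here with a stand-in Sum.integersGlobally() (since the asker's CombineFn is not shown), and the values 1, 2, 3 are made up:

```java
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.testing.TestStream;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;

public class SumTransformTest {
  @Rule public final transient TestPipeline p = TestPipeline.create();

  @Test
  public void sumsAllElements() {
    // A bounded, deterministic "stream": three elements, then end of input.
    TestStream<Integer> input =
        TestStream.create(VarIntCoder.of())
            .addElements(1, 2, 3)
            .advanceWatermarkToInfinity();

    PCollection<Integer> sums =
        p.apply(input)
         .apply(Sum.integersGlobally()); // stand-in for the custom CombineFn

    PAssert.that(sums).containsInAnyOrder(6);
    p.run().waitUntilFinish(); // terminates: the input is exhausted
  }
}
```

advanceWatermarkToInfinity() is what makes the test finish: it closes the global window so the combine can fire and the pipeline can drain.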

The Apache Beam blog has a post introducing TestStream.


Source: https://habr.com/ru/post/1679996/

