Reading very large files (~ 1 TB) in sequential blocks

I need to read a large file in Scala and process it in blocks of k bits (k is typically 65536). As a simple example (but not what I want):

the file is split into blocks (f1, f2, ..., fk).

I want to calculate SHA256(f1) + SHA256(f2) + ... + SHA256(fk).

Such a calculation can be performed incrementally, using only constant storage and the current block, without needing the other blocks.

What is the best way to read the file? (Perhaps something that uses continuations?)

EDIT: The linked question solves the problem, but not always, since the file I am reading contains binary data.
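To make the summation concrete: by "+" I mean, for example, treating each 256-bit digest as an unsigned integer and adding them up. A minimal sketch (digestValue is just an illustrative name):

 import java.security.MessageDigest

 // One interpretation of SHA256(f1) + SHA256(f2) + ...: treat each
 // digest as an unsigned 256-bit integer and sum them.
 def digestValue(block: Array[Byte]): BigInt =
   BigInt(1, MessageDigest.getInstance("SHA-256").digest(block))

 // e.g. blocks.map(digestValue).sum, where blocks are read one at a time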

2 answers

Here is an approach using Akka Streams. It uses constant memory and can process the chunks of the file as they are read.

See "IO Stream File" at the bottom of this page for more information. http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0-RC3/scala/stream-io.html

Start with a simple build.sbt file:

 scalaVersion := "2.11.6" libraryDependencies ++= Seq( "com.typesafe.akka" %% "akka-stream-experimental" % "1.0-RC3" ) 

The interesting parts are the Source, the Flow, and the Sink. The Source is a SynchronousFileSource that reads the large file with a block size of 65536. A ByteString of block size is emitted from the Source and consumed by the Flow, which computes a SHA-256 hash for each chunk. Finally, the Sink consumes the output of the Flow and prints the byte arrays. You will want to convert them and sum them with a fold to get a total.

 import akka.stream.io._
 import java.io.File
 import scala.concurrent.Future
 import akka.stream.scaladsl._
 import akka.actor.ActorSystem
 import akka.stream.ActorFlowMaterializer
 import java.security.MessageDigest

 object LargeFile extends App {
   implicit val system = ActorSystem("Sys")
   import system.dispatcher
   implicit val materializer = ActorFlowMaterializer()

   val file = new File("<path to large file>")
   // Emit the file as ByteString chunks of 65536 bytes
   val fileSource = SynchronousFileSource(file, 65536)
   // Hash the raw bytes of each chunk (not its toString representation)
   val shaFlow = fileSource.map(chunk => sha256(chunk.toArray))

   // TODO - convert the byte arrays and sum them using a fold
   shaFlow.to(Sink.foreach(println(_))).run()

   def sha256(bytes: Array[Byte]) = {
     val messageDigest = MessageDigest.getInstance("SHA-256")
     messageDigest.digest(bytes)
   }
 }
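The TODO above could be filled in with a fold over the stream, for example by interpreting each digest as an unsigned BigInt and summing. A sketch against the same 1.0-RC3 API (sumSink and totalFuture are names introduced here):

 // Replace the printing Sink with a summing fold.
 // BigInt(1, bytes) interprets the digest bytes as an unsigned integer.
 val sumSink = Sink.fold[BigInt, Array[Byte]](BigInt(0)) { (acc, digest) =>
   acc + BigInt(1, digest)
 }
 val totalFuture: Future[BigInt] = shaFlow.runWith(sumSink)
 totalFuture.foreach(total => println(total)) // uses system.dispatcher imported above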

BYTE ARRAYS!

 > run
 [info] Running LargeFile
 [B@3d0587a6
 [B@360cc296
 [B@7fbb2192
 ...
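Those are the default toString of Array[Byte]. To print readable hashes instead, the arrays can be hex-encoded, e.g. (a sketch; toHex is a name introduced here):

 // Hex-encode a digest for display
 def toHex(bytes: Array[Byte]): String = bytes.map("%02x".format(_)).mkString
 // e.g. Sink.foreach(digest => println(toHex(digest)))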

Creating the digest using Stream.continually, which I believe creates an iterator:

 import java.io.File
 import java.io.FileInputStream
 import java.security.MessageDigest

 val file = new File("test.in")
 val is = new FileInputStream(file)
 val md = MessageDigest.getInstance("SHA-256")
 val bytes = Array.fill[Byte](65536)(0)

 // Read up to 65536 bytes at a time and feed them into the digest
 Stream
   .continually((is.read(bytes), bytes))
   .takeWhile(_._1 != -1)
   .foreach { x => md.update(x._2, 0, x._1) }

 println(md.digest())
 // println(md.digest().map("%02X" format _).mkString) // if you want a hex string
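Note that this computes a single SHA-256 over the whole file. To match the question's per-block hashes, the same loop can allocate a fresh digest per block and sum the results, e.g. (a sketch; summing the digests as unsigned BigInts is my interpretation of the question):

 import java.io.{File, FileInputStream}
 import java.security.MessageDigest

 // One SHA-256 per 65536-byte block, summed as unsigned integers
 val in = new FileInputStream(new File("test.in"))
 val buf = Array.fill[Byte](65536)(0)
 val total = Stream
   .continually(in.read(buf))
   .takeWhile(_ != -1)
   .map { n =>
     val md = MessageDigest.getInstance("SHA-256") // fresh digest for each block
     md.update(buf, 0, n)
     BigInt(1, md.digest()) // unsigned value of this block's hash
   }
   .sum
 in.close()
 println(total)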

Source: https://habr.com/ru/post/988845/

