Spark-shell tool to list all tables and their disk consumption
A typical, illustrative spark-shell tool: it loops over all databases, tables and partitions under the Hive warehouse, collects their sizes, and reports them as a CSV file:
// sshell -i script.scala > ls.csv
import org.apache.hadoop.fs.{FileSystem, Path}

def cutPath(thePath: String, toCut: Boolean = true): String =
  if (toCut) thePath.replaceAll("^.+/", "") else thePath

val warehouse = "/apps/hive/warehouse"   // the Hive default location for all databases
val fs = FileSystem.get(sc.hadoopConfiguration)

println(s"base,table,partitions,bytes")
fs.listStatus(new Path(warehouse)).foreach(x => {
  val b = x.getPath.toString                // database directory
  fs.listStatus(new Path(b)).foreach(x => {
    val t = x.getPath.toString              // table directory
    var parts = 0; var size = 0L            // var size3 = 0L
    fs.listStatus(new Path(t)).foreach(x => {
      // partition path is x.getPath.toString
      val p_cont = fs.getContentSummary(x.getPath)
      parts = parts + 1
      size = size + p_cont.getLength
      // size3 = size3 + p_cont.getSpaceConsumed
    }) // t loop
    println(s"${cutPath(b)},${cutPath(t)},${parts},${size}")
    // display opt: org.apache.commons.io.FileUtils.byteCountToDisplaySize(size)
  }) // b loop
}) // warehouse loop

System.exit(0) // get out from spark-shell
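For illustration, the resulting file contains one CSV row per table, in the base,table,partitions,bytes format printed above; the database and table names below are hypothetical:

base,table,partitions,bytes
sales.db,orders,365,1474560000
sales.db,customers,1,52428800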
PS: I checked; size3 (from getSpaceConsumed) is always 3 * size, i.e. the raw length multiplied by the HDFS replication factor, so it gives no additional information.
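If human-readable sizes are preferred over raw byte counts, the commented-out "display opt" helper from the script can be used instead. A minimal sketch, assuming commons-io is available on the spark-shell classpath (Spark normally ships it):

import org.apache.commons.io.FileUtils

val size = 734003200L                                  // example byte count for one table
val pretty = FileUtils.byteCountToDisplaySize(size)    // "700 MB"
// Inside the table loop, the CSV line could then be printed as:
// println(s"${cutPath(b)},${cutPath(t)},${parts},${pretty}")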