Iterate files in folder using Spark Scala

This script loops through an HDFS folder, reads the first row of each file, and writes it to the console. Most of it is self-explanatory.

The read passes the pipe delimiter "|" because the sample files use it; the option is optional and can be skipped (Spark's CSV reader defaults to a comma).

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Folder to scan; every entry directly under this path is visited.
val path = "/hdfspath/"

// Picks up the cluster settings from the classpath / HADOOP_CONF_DIR.
val conf = new Configuration()
val fs = FileSystem.get(conf)
val p = new Path(path)

// listStatus returns one FileStatus per entry in the folder.
val ls = fs.listStatus(p)

ls.foreach { x =>
  val f = x.getPath.toString
  println(f)
  // Read the file as pipe-delimited CSV and print its first row.
  val content = spark.read.option("delimiter", "|").csv(f)
  content.show(1)
}

// Close the shell when the script is run with spark-shell -i.
System.exit(0)
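
If you only want the literal first line of each file and don't care about the column layout, the delimiter option can be dropped entirely. A minimal variant is sketched below, assuming it runs in spark-shell where spark is already in scope, with the same placeholder path as above; the isFile filter is an addition to skip subdirectories.

import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the configuration Spark itself was started with.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

fs.listStatus(new Path("/hdfspath/"))
  .filter(_.isFile)  // skip subdirectories
  .foreach { status =>
    val f = status.getPath.toString
    println(f)
    // textFile reads each line as a plain string, so no delimiter is needed.
    spark.read.textFile(f).take(1).foreach(println)
  }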

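Note that listStatus only looks at the top level of the folder. If the files sit in nested subdirectories, Hadoop's listFiles with recursive = true walks the whole tree; a sketch under the same spark-shell assumption:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// listFiles returns a RemoteIterator over every file under the path,
// descending into subdirectories and skipping the directories themselves.
val files = fs.listFiles(new Path("/hdfspath/"), true)
while (files.hasNext) {
  println(files.next().getPath.toString)
}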