Iterate files in folder using Spark Scala

This script loops through the files in an HDFS directory, reads each one, and prints its first row to the console. It is mostly self-explanatory.

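The script reads each file with the pipe character "|" as the delimiter. The delimiter option is optional and can be skipped, in which case Spark falls back to its default, the comma.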

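import org.apache.hadoop.fs.Path
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Directory to scan
val path = "/hdfspath/"
val conf = new Configuration()
val fs = FileSystem.get(conf)
val p = new Path(path)
// listStatus is not recursive: it returns only the direct children of the path
val ls = fs.listStatus(p)

ls.foreach { x =>
  val f = x.getPath.toString
  println(f)
  // `spark` is the SparkSession that spark-shell provides
  val content = spark.read.option("delimiter", "|").csv(f)
  // Print only the first row
  content.show(1)
}

// Exit the shell once the script has finished
System.exit(0)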
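If you only need the raw first line and not a parsed row, a lighter variant is to read it through the Hadoop FileSystem API without creating a DataFrame per file. The following is a minimal sketch, not part of the original script: the helper names firstLine and printFirstLines are made up for illustration, and the files are assumed to be UTF-8 text.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: returns the first line of a file, or None if the file is empty.
// Assumes the file is UTF-8 encoded text.
def firstLine(fs: FileSystem, file: Path): Option[String] = {
  val reader = new BufferedReader(
    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))
  try Option(reader.readLine()) finally reader.close()
}

// Hypothetical helper: prints "<path>: <first line>" for every regular file
// directly under dir. Subdirectories are skipped because fs.open only works on files.
def printFirstLines(fs: FileSystem, dir: Path): Unit =
  fs.listStatus(dir)
    .filter(_.isFile)
    .foreach { status =>
      val file = status.getPath
      println(s"$file: ${firstLine(fs, file).getOrElse("<empty file>")}")
    }
```

In spark-shell this could be called as printFirstLines(FileSystem.get(new Configuration()), new Path("/hdfspath/")). It prints the line as-is, so the "|" delimiter is not interpreted.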
