
Iterate files in folder using Spark Scala


This script loops through an HDFS directory, reads each file, and prints its first line to the console. Most of it is self-explanatory.

The files are read as pipe-delimited ("|") CSV. The delimiter option is optional and can be skipped if your data uses the default comma.

import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration that the Spark session already carries,
// so the same cluster settings apply to the FileSystem client.
val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)

val path = new Path("/hdfspath/")
val ls = fs.listStatus(path)

ls.foreach { status =>
  val f = status.getPath.toString
  println(f)
  // Read each file as pipe-delimited CSV and show only its first row
  val content = spark.read.option("delimiter", "|").csv(f)
  content.show(1)
}

System.exit(0)
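Note that listStatus returns only the direct children of the directory. If the folder contains subdirectories and you want every file underneath, a minimal sketch using the FileSystem.listFiles API with recursion enabled (the /hdfspath/ path is a placeholder, and a running spark-shell session providing the spark value is assumed) might look like this:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Recursively walk /hdfspath/ and print every file path found.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val it = fs.listFiles(new Path("/hdfspath/"), true) // true = descend into subdirectories
while (it.hasNext) {
  val status = it.next()
  if (status.isFile) println(status.getPath.toString)
}
```

listFiles returns a RemoteIterator rather than an array, which avoids loading the full listing into memory at once on very large directories.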
