bzip2 using HDFS commands

When your Hadoop distribution lacks NFS Gateway support (i.e. the ability to run plain Unix commands against HDFS files), simple tasks become complicated.

For example, bzipping a file is a simple task if the NFS Gateway is enabled:

$ bzip2  /hdfspath/file.csv

but when NFS is not present, it becomes a little more challenging:

$  hdfs dfs -cat /hdfspath/file.csv | bzip2 | hadoop fs -put - /hdfspath/file.bz2 && hdfs dfs -rm /hdfspath/file.csv

The above command first streams the .csv file with hdfs dfs -cat, pipes the output to the bzip2 command, and writes the compressed stream to the .bz2 file via hdfs put. Finally, once all of that succeeds, we manually remove the original .csv file.
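If you prefer to verify the compressed output before removing the source, the same pipeline can be split into separate steps; this is only a sketch and the paths are placeholders:

# stream the csv out of HDFS, compress it, and write the .bz2 back (placeholder paths)
$ hdfs dfs -cat /hdfspath/file.csv | bzip2 | hadoop fs -put - /hdfspath/file.bz2

# confirm the .bz2 file is present before deleting the original
$ hdfs dfs -ls /hdfspath/file.bz2
$ hdfs dfs -rm /hdfspath/file.csv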

 

Hive / Sqoop : change scratchdir

In the post Hive Job Submission failed with exception (below), I mentioned deleting the .Trash folder. If for any reason you are not able to delete that folder, you can go with Option B:

Change the scratch directory that Hive / Sqoop uses.

Create a new folder in the Hadoop file system, for example /db/tmphive.

Grant read/write access to the other users on the /db/tmphive folder.
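On the HDFS side, creating the folder and opening it up might look like the sketch below; the 1777 mode mirrors a typical /tmp-style setup and is only an example, so tighten it for your environment:

# create the shared scratch folder and make it world-writable with the sticky bit (example mode)
$ hdfs dfs -mkdir -p /db/tmphive
$ hdfs dfs -chmod 1777 /db/tmphive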

Then point Hive at the new location with this SET property in your Hive script:

hive> set hive.exec.scratchdir=/db/tmphive/;

If the same is needed in Sqoop, pass the property with the -D option:

$ sqoop import -Dhive.exec.scratchdir=/db/tmphive/ …
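For context, a fuller invocation might look like this sketch; the JDBC URL, table name, and user are hypothetical placeholders, not values from the original post:

# hypothetical example: all connection details below are placeholders
$ sqoop import -Dhive.exec.scratchdir=/db/tmphive/ \
    --connect jdbc:mysql://dbhost/salesdb \
    --username dbuser -P \
    --table orders \
    --hive-import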

 

 

Hive Job Submission failed with exception

Job Submission failed with exception ‘org.apache.hadoop.hdfs.protocol.DSQuotaExceededException(The DiskSpace quota of /user/username is exceeded: quota = xx’

This is a common problem when you are working in a multi-tenant environment with a limited quota.

Reasons:

  1. When a large quantity of data is processed via Hive / Pig, deleted intermediate data accumulates in the .Trash folder, which pushes the home directory (/user/username) over its quota limit.
  2. When users delete large files without the -skipTrash option.

 

Why it gets stored in .Trash:

Like a Recycle Bin, /user/username/.Trash is the place where deleted files are kept. If someone deletes a file accidentally, it can be recovered from there.
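For example, a file deleted by mistake can usually be moved back out of trash like this; the paths are placeholders, and trash normally keeps files under a Current subfolder that mirrors the original path:

# restore an accidentally deleted file from trash (placeholder paths)
$ hdfs dfs -mv /user/username/.Trash/Current/hdfspath/file.csv /hdfspath/file.csv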

Solution:

  1. View the contents and sizes of the /user/username folder

$ hdfs dfs -ls /user/username
$ hdfs dfs -du -h /user/username

  2. Find the folder with the largest size (usually it will be .Trash). If it is something else, work on that folder instead.

  3. Find the large folder inside .Trash

$ hdfs dfs -du -h /user/username/.Trash

  4. After making a note of the folder to delete, remove it using the hdfs command with the -skipTrash option

$ hdfs dfs -rm -r -skipTrash /user/username/.Trash/foldername_1

  5. Run the du command again to check the status of the /user/username folder.

$ hdfs dfs -du -h /user/username
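If you also want to see how much quota is left at any point, the -q flag of hdfs dfs -count reports the configured quotas alongside the remaining space:

$ hdfs dfs -count -q /user/username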

Now that .Trash is cleared, Hive should work without issues.

Happy Hiving 🙂

SQOOP : mapred.FileAlreadyExistsException : Output directory

Sometimes when you import data from an RDBMS into Hadoop via Sqoop, you will see this error:

org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://hadoopcluster/user/username/importtable already exists

Solution:

$ hdfs dfs -rm -r -skipTrash  hdfs://hadoopcluster/user/username/importtable

Reason:

When Sqoop is used for importing data, it creates a temporary directory under the home directory and deletes it once the job finishes. Sometimes, due to some issue, Sqoop exits without deleting that folder; the next import then fails with this error. The solution is to manually delete the folder using the hdfs dfs -rm -r command, as shown above.
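As an alternative, Sqoop 1.4.x imports also accept a --delete-target-dir flag that clears the target directory before the import starts (check that your Sqoop version supports it); a hypothetical example:

# hypothetical example: connection string and table are placeholders
$ sqoop import --connect jdbc:mysql://dbhost/salesdb --table importtable \
    --target-dir /user/username/importtable --delete-target-dir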

 

Hive : Create database in a different folder

When a database is created in Hive, it is usually stored in the default warehouse folder /user/hive/warehouse/yourdatabase.db

If an alternate location is needed, then:

create database <yourdatabase> location '/db/path/yourdatabase.db';    [the path goes in single quotes]
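A concrete sketch with a made-up database name, plus a quick way to confirm where it was created:

hive> create database salesdb location '/db/path/salesdb.db';
hive> describe database salesdb;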

 

Linux : Compress entire folder to .gz or .bz2

gzip / bzip2 can only compress individual files. To compress an entire folder structure, the tar command comes to our rescue.

Compress to .gz 

tar -zcf </path/output.gz>  /sourcefolder/path

Compress to .bz2

tar -jcf </path/output.bz2> /sourcefolder/path

tar options :

-z : compress with gzip
-c : create an archive
-f : archive file name
-j : compress with bzip2
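To extract such an archive later, swap -c (create) for -x (extract); the destination path below is just a placeholder:

# extract into an existing destination folder (placeholder paths)
$ tar -zxf /path/output.gz -C /destination/path
$ tar -jxf /path/output.bz2 -C /destination/path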