All books and guides about HDFS mention that files in the filesystem are stored in blocks, with a default size of 128MB. All modern operating systems and databases also store data in blocks. Today we will explore HDFS blocks.
Let’s do a little experiment:
First, we will create a directory in HDFS.
hdfs dfs -mkdir -p /tmp/blocktest
Then we prepare a small text file and move it into HDFS:
$ echo "This is a text file" > test.txt
$ ls -l test.txt
-rw-r--r-- 1 hdfs hadoop 20 Apr 19 12:36 test.txt
$ hdfs dfs -put test.txt /tmp/blocktest
Now we look under the hood for the actual blocks. We go to one of the datanodes, into the data directory, and find a directory named “current”. Under it we find a directory with a name similar to “BP-1035990879-127.0.0.1-1452421284676” and a VERSION file.
We go deeper and enter the BP-1035990879-127.0.0.1-1452421284676 directory, where we find another “current” and enter it. There we find these directories and files:
dfsUsed finalized rbw VERSION
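Rather than walking the tree by hand, a `find` one-liner can list every block file under the data directory. This is only a sketch: the real data directory path varies per install, so the snippet below builds a tiny mock of the layout described above just to show the pattern (the block and meta file names are made up for the demo).

```shell
# Mock of the datanode layout (real data dirs vary, e.g. /hadoop/hdfs/data)
BP=BP-1035990879-127.0.0.1-1452421284676
mkdir -p data/current/$BP/current/finalized/subdir0/subdir0
touch data/current/$BP/current/finalized/subdir0/subdir0/blk_1073741877
touch data/current/$BP/current/finalized/subdir0/subdir0/blk_1073741877_1053.meta

# List block files, skipping the .meta checksum files:
find data -type f -name 'blk_*' ! -name '*.meta'
```

The same `find` pattern, pointed at a real datanode's data directory, saves the manual digging.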
We are interested in the “finalized” directory and we enter it. There we find two nested directories called “subdir0”. Inside we find the actual blocks:
If we order them from oldest to newest we find this block to be the newest: blk_1073741877
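Ordering by age is just a matter of modification time; `ls -t` sorts newest first. A hypothetical demo (the block names and timestamps below are invented for illustration):

```shell
# Two fake block files with different modification times:
mkdir -p subdir0
touch -t 202001010000 subdir0/blk_1073741875
touch -t 202001020000 subdir0/blk_1073741877

# ls -t lists the most recently written block first:
ls -t subdir0 | head -1
```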
We look inside and hey, this is the file we just moved to HDFS! Just as it was, in plain simple text:
So we see that HDFS just copied the original file into its data directory without any transformation.
What about large files that span more than one block? Let's try.
We will use this simple bash script to generate a large text file, larger than 128MB so it will span more than one block:
#!/bin/bash
# Stop with Ctrl-C once the file is large enough
for ((i=0; ; i++))
do
  echo "Line number "$i >> test.txt
done
This way I created a 475MB text file containing line numbers (stopping the script once the file passed that size):
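For reference, `seq` can generate the same kind of file much faster than the loop. This demo writes only 100,000 lines to stay small; for the experiment you would scale the count up until the file passes 128MB:

```shell
# Same output as the loop above ("Line number 0", "Line number 1", ...),
# but generated by seq; raise the upper bound for a multi-block file.
seq -f "Line number %.0f" 0 99999 > test.txt
wc -c test.txt
```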
Now we will copy this file into HDFS and examine the blocks created.
You can see that HDFS has created four blocks for the file and that it is still plain text. Each time a 128MB block fills up, the next block is created and starts to fill with the next lines of text.
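This chopping is conceptually the same thing `split(1)` does locally (HDFS, of course, does it itself on write). A small-scale sketch with a 1MB file cut into 256KB "blocks":

```shell
# 1MB file, cut into four 256KB chunks (stand-ins for 128MB HDFS blocks):
head -c 1048576 /dev/zero > bigfile
split -b 262144 bigfile chunk_
ls chunk_*

# Concatenating the chunks in order restores the original byte-for-byte:
cat chunk_* > rebuilt
cmp bigfile rebuilt && echo "files match"
```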
What about binary data? Does it work the same way?
First we will take as an example a simple image file. We will take this lovely puppy’s picture and copy it to HDFS:
This is how the file looks on the local disk (not HDFS):
Now, after copying it to HDFS, we look at the file in HDFS:
You can see that the file size, 760178 bytes, is the same. If, again, we go to the data directory and dig our way into the subdirectories, we will find our block. We can see that it is also the same size:
If we extract the block and add a .jpg extension, we can open it with any viewer and see the picture.
And the last experiment will be a binary file which is larger than a single block. We create a zip file containing a directory tree and copy it to HDFS. We expect HDFS to split it into several 128MB blocks, and concatenating them should restore the original file.
So I zipped Adobe Digital Editions to produce a 340MB file and moved it into HDFS:
The file is split into three blocks:
We now use cat to concatenate those blocks:
And now we try to unzip the file and see if it unzips correctly:
You can see that the file is not corrupted and unzips without any problems.
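The whole round trip can be re-created locally without HDFS. This sketch uses tar/gzip instead of zip purely for portability; the idea is identical: archive a tree, chop the archive into pieces as HDFS would into blocks, concatenate, and verify:

```shell
# Build a small directory tree and archive it:
mkdir -p tree/a tree/b
echo "hello" > tree/a/file1
echo "world" > tree/b/file2
tar czf archive.tar.gz tree

# Chop the archive into pieces and stitch them back together:
split -b 100 archive.tar.gz part_
cat part_* > restored.tar.gz

# The restored archive is byte-identical and lists cleanly:
cmp archive.tar.gz restored.tar.gz && echo "archives are identical"
tar tzf restored.tar.gz
```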
HDFS blocks aren’t really blocks in the sense of OS or database blocks. You can’t pre-allocate them and they are not a container for data. HDFS simply copies the files into its data directory, and if a file is larger than the block size it is simply chopped into block-sized chunks. The files are just moved from one location on the local FS to another, and the block locations are kept in the namenode.