When we use the SASHDAT LIBNAME engine, files are placed on HDFS using the PATH=<HDFS path> option, with COPIES= controlling the number of replications. The replication factor for SASHDAT tables is 2 by default, whereas on HDFS the default replication factor is 3. So if a SAS table is loaded to HDFS through SASHDAT, it will have 2 copies, while a native HDFS file will have 3 copies. How is that possible? I'm a bit confused. Can anyone explain this to me, please?
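For context, here is a minimal sketch of the kind of LIBNAME statement in question. The host, install path, and HDFS path are hypothetical placeholders, and COPIES= is shown here as a data set option; substitute your own cluster values:

```sas
/* Hypothetical host and paths -- substitute your own cluster values. */
libname hdat sashdat host="grid001.example.com"
                     install="/opt/TKGrid"
                     path="/user/sasdemo";

/* COPIES= requests replications beyond the original blocks; the
   engine applies its own default here rather than the HDFS default. */
data hdat.cars (copies=1);
   set sashelp.cars;
run;
```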
Accepted Solutions
I tested as LinusH suggested. He is correct. HDFS has a default replication factor and the SASHDAT engine overrides that when it creates files in HDFS. The LIBNAME engine for SASHDAT has a default value for copies= even if you don't specify it on the LIBNAME statement. This is what I found in the doc:
COPIES=n
specifies the number of replications to make for the data set (beyond the original blocks). The default value is 2 when the INNAMEONLY option is specified and otherwise is 1. Replicated blocks are used to provide fault tolerance. If a machine in the cluster becomes unavailable, then the blocks needed for the SASHDAT file can be retrieved from replications on other machines. If you specify COPIES=0, then the original blocks are distributed, but no replications are made and there is no fault tolerance for the data.
Here is the link to that part of the documentation: http://support.sas.com/documentation/cdl/en/inmsref/70021/HTML/default/viewer.htm#p0kn1b8a7yt44fn1qw...
Also, here is how I determined the replication factor of HDFS files using HDFS commands:
https://www.systutorials.com/qa/1297/how-to-check-the-replication-factor-of-a-file-in-hdfs
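A quick sketch of that check, assuming a hypothetical HDFS path (these commands require a Hadoop client configured against your cluster):

```shell
# Hypothetical path -- substitute your own HDFS directory.
# The second column of 'hadoop fs -ls' output is the replication factor.
hadoop fs -ls /user/sasdemo

# Or query it directly with -stat (%r prints the replication factor):
hadoop fs -stat %r /user/sasdemo/cars.sashdat
```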
I don't do this a lot, but from my perspective I would guess that the SAS default overrides the HDFS default. So no, I don't think there will be three copies. Have you checked in the file system?