AK23
Obsidian | Level 7

Hi,

 

My understanding is that when we drop a managed table, Hive deletes the data in the table.

 

By default, the SAS option DBCREATE_TABLE_EXTERNAL= is set to NO, which means a SAS data step writing through a Hive library, like the one below, creates a "managed table".

 

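A minimal sketch of such a step, assuming a HADOOP libref like the ones used later in this thread (the server, port, and schema macro variables are placeholders):

libname hive_lib hadoop server="&hive_server" port=&hive_port schema="&hive_schema";

/* DBCREATE_TABLE_EXTERNAL=NO is the default, so this creates a managed Hive table */
data hive_lib.ext_test_default;
  set sashelp.cars;
run;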

 

When using PROC SQL DROP TABLE, PROC DELETE, or PROC DATASETS with DELETE, the Hive metadata is deleted, i.e. the table structure is dropped from the schema, but the underlying HDFS file is not. When the same data step is run again (after the deletion), i.e. the same table is re-created in the schema, the number of records ingested is incorrect.

  Steps (a condensed sketch follows the list):

  1. Create a Hive table using a SAS data step and note the number of rows.
  2. Drop the table using PROC SQL DROP TABLE / PROC DATASETS DELETE.
  3. Run the create-table step again.
  4. Count the number of rows.
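A condensed sketch of these steps, reusing the illustrative hive_lib libref from above (the row counts come from the discussion below):

/* 1. create the Hive table; sashelp.cars has 428 rows */
data hive_lib.ext_test_default;
  set sashelp.cars;
run;

/* 2. drop the table from SAS */
proc sql;
  drop table hive_lib.ext_test_default;
quit;

/* 3. re-create the same table */
data hive_lib.ext_test_default;
  set sashelp.cars;
run;

/* 4. count the rows; the issue reported here is 856 instead of 428 */
proc sql;
  select count(*) as nobs from hive_lib.ext_test_default;
quit;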

Source: http://support.sas.com/documentation/cdl/en/acreldb/69580/HTML/default/viewer.htm#n12r2tbfrrrsgdn1fa...

 

Thanks.

Oligolas
Barite | Level 11

Hi,

 

I have no knowledge about Hive.

Nevertheless, I would try different ways of deleting that table to see if it resolves the issue:

 

proc sql;
  drop table hive_lib.ext_test_default;
quit;

 

or

 

proc datasets lib=hive_lib nolist;
  delete ext_test_default;
quit;

 

Or, if it looks like you only need to empty your table rather than delete it entirely:

 

data hive_lib.ext_test_default;
  if 0 then set hive_lib.ext_test_default; /* copy the table structure without reading any rows */
  delete;                                  /* write out no observations */
run;

 

________________________

- Cheers -

AK23
Obsidian | Level 7

As mentioned in the steps above, PROC DELETE, PROC DATASETS DELETE, and PROC SQL DROP TABLE: none of them deletes the underlying HDFS file.

 

BrunoMueller
SAS Super FREQ

By default, tables created in Hive are managed tables, so if you delete a table from SAS using the methods you mentioned, the actual data in HDFS is also deleted.

 

Here is a code sample that will create a Hive table, show the metadata information from Hive, list the contents of the HDFS directory, delete the table, and list the HDFS directory again. You need to set the appropriate macro variables.

 

LIBNAME ahd HADOOP PORT=&hd_port SERVER="&hd_server" SCHEMA=&hd_schema;

proc delete data=ahd.mycars;
run;

data ahd.mycars;
  set sashelp.cars;
run;

proc sql;
  connect to hadoop (
    PORT=&hd_port SERVER="&hd_server" SCHEMA=&hd_schema
  );

  select * from connection to hadoop (
    describe formatted mycars
  );
  disconnect from hadoop;
quit;

options set=SAS_HADOOP_RESTFUL 1;
proc hadoop;
  hdfs ls="&hive_db_location/mycars" ;
run;
options set=SAS_HADOOP_RESTFUL 0;

proc delete data=ahd.mycars;
run;

options set=SAS_HADOOP_RESTFUL 1;
proc hadoop;
  hdfs ls="&hive_db_location" ;
run;
options set=SAS_HADOOP_RESTFUL 0;
AK23
Obsidian | Level 7

Can you please replace the describe formatted with a select count(*) on the table and note the result from your second run?

 

Since you are using the sashelp.cars data set, the total should always be 428, but on the second run you will see the count increase.

 

In a nutshell, the second run appends the entire sashelp.cars again, since the underlying HDFS file is not deleted.
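For reference, a sketch of that count check, reusing the connection settings from the sample above (macro variables assumed to be set):

proc sql;
  connect to hadoop (
    PORT=&hd_port SERVER="&hd_server" SCHEMA=&hd_schema
  );
  /* expected: 428 on a clean run; the issue reported here is 856 on the second run */
  select * from connection to hadoop (
    select count(*) from mycars
  );
  disconnect from hadoop;
quit;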

BrunoMueller
SAS Super FREQ

In my sample code, the Hive table mycars is deleted in the Hive metadata as well as in HDFS.

 

Please show the output of "describe formatted tableName" after you created the Hive table using the DATA step.

 

Also, the output of the two HDFS LS= statements should show whether the actual data in HDFS gets deleted.

AK23
Obsidian | Level 7
data hivlib.ext_test_default_ak;
	set sashelp.cars;
run;

proc sql;
	connect to hadoop(server="&hive_server" 
		port=&hive_port 
		schema="&hive_schema" 
		properties="hive.warehouse.data.skiptrash=true;" 
		hdfs_tempdir="&hdfs_tmpdir" 
		uri="jdbc:hive2://&hive_server:&hive_port/&hive_schema;transportMode=http;httpPath=cliservice;principal=&principal?hive.fetch.task.conversion=minimal;hive.fetch.task.conversion.threshold=1;hive.exec.scratchdir=&hdfs_tmpdir;");
		select * from connection to hadoop (
		select count(*) from ext_test_default_ak
		);
/*		select * from connection to hadoop (*/
/*		dfs -ls="&hive_db_location/ext_test_default_ak"*/
/*		);*/
		select * from connection to hadoop (
		describe formatted ext_test_default_ak
		);
	disconnect from hadoop;
quit;

proc datasets lib=hivlib nodetails nolist;
	delete ext_test_default_ak;
quit;

 

I am not able to run dfs commands due to permissions, but using Unix hdfs commands I can see the HDFS files for the table, and they are not getting deleted. The table is dropped fine from the Hive metadata.

 

hdfs dfs -ls -h hdfs://location/ext*
Found 3 items
-rw-r--r--    39.3 K 2017-06-15 10:05 ext_test_default_ak/sasdata-2017-06-15-10-05-07-268-e-00007.dlv
-rw-r--r--    39.3 K 2017-06-15 10:12 ext_test_default_ak/sasdata-2017-06-15-10-12-56-846-e-00008.dlv
-rw-r--r--    39.3 K 2017-06-15 10:24 ext_test_default_ak/sasdata-2017-06-15-10-24-08-882-e-00009.dlv

 

Check the two attachments. The first run has 428 observations, and even after the deletion, the second run (after 5 minutes) has 856 observations.


[Attachments: first_run_Capture.PNG, second_run_Capture.PNG]
ChrisNZ
Tourmaline | Level 20

The official Hadoop documentation reads:

The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

AK23
Obsidian | Level 7

I have re-run after a 5-minute interval and still see the same issue. As a matter of fact, the file is just 33 K. Since the HDFS file is not getting deleted, only the table structure is, the next data step run appends both files into the table.

 

Please check my last reply with code and screenshots. Thanks.

BrunoMueller
SAS Super FREQ

Looks strange. What happens if you execute the drop table directly through SQL pass-through, using:

drop table <tablename> purge
AK23
Obsidian | Level 7

Same outcome. I have used the purge option and there is no error or warning, just a note that "No rows were selected".

 

I have even set skipTrash to true in the LIBNAME properties.
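A note like "No rows were selected" typically means the statement was sent as a query rather than with EXECUTE; hypothetically, something like this (connection options abbreviated):

proc sql;
  connect to hadoop (server="&hive_server" port=&hive_port schema="&hive_schema");
  /* a DROP issued inside the query component returns no result set,
     hence "No rows were selected" */
  select * from connection to hadoop (
    drop table ext_test_default_ak purge
  );
  disconnect from hadoop;
quit;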

BrunoMueller
SAS Super FREQ

Did you use:

execute (drop table ...) by hadoop
AK23
Obsidian | Level 7
data hivlib.ext_test_default_ak;
	set sashelp.cars;
run;

proc sql;
	connect to hadoop(server="&hive_server" 
		port=&hive_port 
		schema="&hive_schema" 
		properties="hive.warehouse.data.skiptrash=true;" 
		hdfs_tempdir="&hdfs_tmpdir" 
		uri="jdbc:hive2://&hive_server:&hive_port/&hive_schema;transportMode=http;httpPath=cliservice;principal=&principal?hive.fetch.task.conversion=minimal;hive.fetch.task.conversion.threshold=1;hive.exec.scratchdir=&hdfs_tmpdir;");
		select * from connection to hadoop (
		select count(*) from ext_test_default_ak
		);
		select * from connection to hadoop (
		describe formatted ext_test_default_ak
		);
		execute (
		drop table ext_test_default_ak purge
		)by hadoop;
	disconnect from hadoop;
quit;

proc datasets lib=hivlib nodetails nolist;
	delete ext_test_default_ak;
quit;

 

Yes, same outcome.

 

I used PROC DELETE / PROC DATASETS DELETE / PROC SQL DROP TABLE / pass-through EXECUTE (drop table with the purge option).

AK23
Obsidian | Level 7

Sorry, I corrected a typo, and SQL pass-through with the purge option deleted both the metadata and the HDFS file.

 

But when using the data step (without SQL pass-through) and then dropping via PROC SQL DROP / PROC DELETE / PROC DATASETS DELETE, it is still the same issue: the metadata in the Hive warehouse is deleted but not the underlying HDFS file.

 

The REPLACE option in the data step is not supported for Hive libraries either, AFAIK.
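The working pattern, then, appears to be to drop the table through explicit pass-through with PURGE before re-running the data step. A sketch, reusing the connection macro variables from the code above (connection options abbreviated):

proc sql;
  connect to hadoop (server="&hive_server" port=&hive_port schema="&hive_schema");
  /* EXECUTE passes the statement to Hive as-is; PURGE removes the HDFS data as well */
  execute (drop table ext_test_default_ak purge) by hadoop;
  disconnect from hadoop;
quit;

/* then re-create the table cleanly */
data hivlib.ext_test_default_ak;
  set sashelp.cars;
run;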

BrunoMueller
SAS Super FREQ

What happens if you do the drop table without the purge, again using SQL pass-through?

 

You can use the following options to see which HiveQL is passed from SAS to Hive:

 

options
  sastrace=",,,d"    /* trace the SQL that SAS/ACCESS sends to the DBMS */
  sastraceloc=saslog /* write the trace output to the SAS log */
  nostsuffix         /* suppress timestamp suffixes on trace lines */
;

