How to delete hive tables using datastep

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 14
Accepted Solution

How to delete hive tables using datastep

Hi,

 

When we drop a managed table , Hive deletes the data in the table is my understanding.

 

By default, SAS data step option DBCREATE_EXTERNAL is set to NO which means SAS data step using hive libraries like below creates a “managed table”.

 

enter image description here

 

When using proc sql - drop table / proc delete / proc dataset - delete, the hive metadata is deleted i.e. table structure is dropped from the schema but the underlying HDFS file is not. While running the same data step again (after deletion), i,e. creating the same table in the schema - the number of records ingested is incorrect.

  Steps

  1. Create a hive table using SAS data step and note the no of rows.
  2. Drop the table using proc sql delete / proc dataset delete.
  3. Run the create table step again.
  4. Count the number of rows.

Source : http://support.sas.com/documentation/cdl/en/acreldb/69580/HTML/default/viewer.htm#n12r2tbfrrrsgdn1fa...

 

Thanks.


Accepted Solutions
Solution
‎07-13-2017 06:28 AM
Occasional Contributor
Posts: 14

Re: How to delete hive tables using datastep


All Replies
Frequent Contributor
Posts: 129

Re: How to delete hive tables using datastep

Hi,

 

I have no knowledge about hive.

Nevertheless I would try different ways of deleting that table and see if it resolves the issue:

 

proc sql;

drop table hive_lib.ext_test_default;

quit;

 

or

 

proc datasets lib=hive_lib nolist;

delete ext_test_default;

run;quit;

 

or: looks like you only need to empty your table and not deleting it entirely right?

 

data hive_lib.ext_test_default;

if 0 then set hive_lib.ext_test_default;

delete;

run;

 

________________________

- Cheers -

Occasional Contributor
Posts: 14

Re: How to delete hive tables using datastep

As mentioned in the steps above, proc delete / proc dataset - delete / proc sql drop table - none of them are deleting the underlying hdfs file.

 

SAS Super FREQ
Posts: 708

Re: How to delete hive tables using datastep

By default, tables created in Hive are managed tables. so if you delete a table from SAS using the methods you mentioned also the actual data in the HDFS is deleted.

 

Here is a code sample that will create a Hive table, show the metadata information from Hive, list the contents of the HDFS directory, delete, show HDFS directory info. You need to set the appropriate macro variables.

 

LIBNAME ahd HADOOP PORT=&hd_port SERVER="&hd_server" SCHEMA=&hd_schema;

proc delete data=ahd.mycars;
run;

data ahd.mycars;
  set sashelp.cars;
run;

proc sql;
  connect to hadoop (
    PORT=&hd_port SERVER="&hd_server" SCHEMA=&hd_schema
  );

  select * from connection to hadoop (
    describe formatted mycars
  );
  disconnect from hadoop;
quit;

options set=SAS_HADOOP_RESTFUL 1;
proc hadoop;
  hdfs ls="&hive_db_location/mycars" ;
run;
options set=SAS_HADOOP_RESTFUL 0;

proc delete data=ahd.mycars;
run;

options set=SAS_HADOOP_RESTFUL 1;
proc hadoop;
  hdfs ls="&hive_db_location" ;
run;
options set=SAS_HADOOP_RESTFUL 0;
Occasional Contributor
Posts: 14

Re: How to delete hive tables using datastep

Posted in reply to Bruno_SAS

Can you please replace the describe formatted to select count(*) from the table and note the result from your second run ?

 

Since you are using the sashelp.cars dataset - the total should always be 428 but from the second run you will see the count getting increased.

 

In a nutshell, it will be appending the entire saslhelp.cars from the second run since the underlying HDFS file is not deleted.

SAS Super FREQ
Posts: 708

Re: How to delete hive tables using datastep

In my sample code the Hive table mycars is deleted in Hive metadata as well as in the HDFS.

 

Please show the output of the "describe formatted tableName" after you created the Hive table using the DATA Step

 

Also the output of the two HDFS LS= should show, that the actual data in HDFS gets deleted.

Occasional Contributor
Posts: 14

Re: How to delete hive tables using datastep

[ Edited ]
Posted in reply to Bruno_SAS
data hivlib.ext_test_default_ak;
	set sashelp.cars;
run;

proc sql;
	connect to hadoop(server="&hive_server" 
		port=&hive_port 
		schema="&hive_schema" 
		properties="hive.warehouse.data.skiptrash=true;" 
		hdfs_tempdir="&hdfs_tmpdir" 
		uri="jdbc:hive2://&hive_server:&hive_port/&hive_schema;transportMode=http;httpPath=cliservice;principal=&principal?hive.fetch.task.conversion=minimal;hive.fetch.task.conversion.threshold=1;hive.exec.scratchdir=&hdfs_tmpdir;");
		select * from connection to hadoop (
		select count(*) from ext_test_default_ak
		);
/*		select * from connection to hadoop (*/
/*		dfs -ls="&hive_db_location/ext_test_default_ak"*/
/*		);*/
		select * from connection to hadoop (
		describe formatted ext_test_default_ak
		);
	disconnect from hadoop;
quit;

proc datasets lib=hivlib nodetails nolist;
	delete ext_test_default_ak;
quit;

 

Not able to run dfs commands due to permissions but using unix hdfs commands I can see the hdfs file for the table and it is not gettting deleted. Table is dropped fine from the hive metadata.

 

hdfs dfs -ls -h hdfs://location/ext*
Found 3 items
-rw-r--r--    39.3 K 2017-06-15 10:05 ext_test_default_ak/sasdata-2017-06-15-10-05-07-268-e-00007.dlv
-rw-r--r--    39.3 K 2017-06-15 10:12 ext_test_default_ak/sasdata-2017-06-15-10-12-56-846-e-00008.dlv
-rw-r--r--    39.3 K 2017-06-15 10:24 ext_test_default_ak/sasdata-2017-06-15-10-24-08-882-e-00009.dlv

 

Check the two attachments. First run has 428 observations and even after deletion, the second run (after 5 mins) has 856 observations.


first_run_Capture.PNGsecond_run_Capture.PNG
PROC Star
Posts: 1,759

Re: How to delete hive tables using datastep

The Official documentation of Hadoop reads :

The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

Occasional Contributor
Posts: 14

Re: How to delete hive tables using datastep

I have run in 5 minutes interval and still the same issue. Matter of fact it is just 33K. Since the hdfs file is not getting deleted and only the table structure - the next datastep run appends both files into the table.

 

Please check my last reply with code and screenshots. Thanks.

SAS Super FREQ
Posts: 708

Re: How to delete hive tables using datastep

Looks strange what happens if you execute the drop table directly through SQL passthrough using

drop table <tablename> purge
Occasional Contributor
Posts: 14

Re: How to delete hive tables using datastep

Posted in reply to Bruno_SAS

Same outcome. I have used the purge option and there is no error / warning. Just a note that "No rows were selected".

 

I have even set the skipTrash to true in libname properties.

SAS Super FREQ
Posts: 708

Re: How to delete hive tables using datastep

did you use

execute (drop table ...) by hadoop
Occasional Contributor
Posts: 14

Re: How to delete hive tables using datastep

[ Edited ]
Posted in reply to Bruno_SAS
data hivlib.ext_test_default_ak;
	set sashelp.cars;
run;

proc sql;
	connect to hadoop(server="&hive_server" 
		port=&hive_port 
		schema="&hive_schema" 
		properties="hive.warehouse.data.skiptrash=true;" 
		hdfs_tempdir="&hdfs_tmpdir" 
		uri="jdbc:hive2://&hive_server:&hive_port/&hive_schema;transportMode=http;httpPath=cliservice;principal=&principal?hive.fetch.task.conversion=minimal;hive.fetch.task.conversion.threshold=1;hive.exec.scratchdir=&hdfs_tmpdir;");
		select * from connection to hadoop (
		select count(*) from ext_test_default_ak
		);
		select * from connection to hadoop (
		describe formatted ext_test_default_ak
		);
		execute (
		drop table ext_test_default_ak purge
		)by hadoop;
	disconnect from hadoop;
quit;

proc datasets lib=hivlib nodetails nolist;
	delete ext_test_default_ak;
quit;

 

Yes, Same outcome.

 

I used proc delete / proc dataset - delete / proc sql - drop table / pass through - execute (drop table with purge option).

Occasional Contributor
Posts: 14

Re: How to delete hive tables using datastep

Sorry corrected a typo and sql pass through with purge option deleted both metadata and hdfs file.

 

But using the datastep (without sql pass thorugh) - using proc sql drop / proc delete / proc dataset delete - still the same issue where metadata in hive warehouse is deleted but not the underlying hdfs file.

 

The replace option in the datastep is not supported either for hive libraries afaik.

SAS Super FREQ
Posts: 708

Re: How to delete hive tables using datastep

What happens if you do the drop table without the purge? Again using SQL Passthrough

 

You can use the follwoing options to see which HiveQL is passed from to SAS to Hive:

 

options
  sastrace=",,,d"
  sastraceloc=saslog
  nostsuffix
;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 25 replies
  • 589 views
  • 0 likes
  • 4 in conversation