skruz83
Calcite | Level 5

Hi,

 

I am fairly new to Hadoop and am using SAS EG to access the data.

 

I want to run a series of data checks on data stored in Hadoop: for a particular table (or library/database), identify for each column the min, max, number of missing values, number of records, etc.

 

I tried the traditional PROC CONTENTS / PROC DATASETS, but they take ages given the volume of data.

 

Is there a better way to run the equivalent of these two procedures in Hadoop via Hive SQL?

 

Effectively I am after a table which shows:

 

table name, column name, column type, number of records, number of missing values, number of distinct values, min value, max value, min length, max length

 

Regards

1 REPLY
LinusH
Tourmaline | Level 20
PROC CONTENTS should give you the structure, and it shouldn't take much time.
The other statistics aren't available from the CONTENTS or DATASETS procedures. For those I suggest that you use SQL, but note that your requirement needs a full table scan. Just make sure that the SQL is sent to Hive. Try with a small table first, and use
Options msglevel = I sastrace = ',,,d' SASTRACELOC = saslog;
for verification.
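To illustrate the SQL suggestion, here is a minimal sketch using explicit pass-through, which is one way to make sure the query text is handed to Hive as written. It assumes SAS/ACCESS Interface to Hadoop is licensed. The server, port, schema, table name (my_table) and column name (my_col) are placeholders, not values from this thread, and your site may need additional connection options such as credentials. It profiles a single column and does not cover the column type.

```sas
/* Sketch only: server, port, schema, my_table and my_col are hypothetical. */

/* Write the SQL that SAS sends to the DBMS into the log for verification. */
options msglevel=i sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc sql;
  /* Add user=/password= or other options here if your site requires them. */
  connect to hadoop (server="hive-host.example.com" port=10000 schema=mydb);

  /* Everything inside the inner parentheses is HiveQL and runs in Hive. */
  /* Only the one-row summary comes back to SAS.                         */
  create table work.col_profile as
  select * from connection to hadoop (
    select 'my_table'                                      as table_name,
           'my_col'                                        as column_name,
           count(*)                                        as n_records,
           sum(case when my_col is null then 1 else 0 end) as n_missing,
           count(distinct my_col)                          as n_distinct,
           min(my_col)                                     as min_value,
           max(my_col)                                     as max_value,
           min(length(cast(my_col as string)))             as min_length,
           max(length(cast(my_col as string)))             as max_length
    from my_table
  );

  disconnect from hadoop;
quit;
```

Each column needs its own set of expressions like these, so for a wide table the select list is usually generated rather than typed by hand. The query still scans the whole table, as noted above.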
Data never sleeps
