Gorzyne
Calcite | Level 5

Hi all, this is my first message in the community; I hope I express my needs in the right way.

 

I recently discovered the VCOLUMN view of SASHELP and like it a lot. I also understood I could get something similar with an SQL equivalent.
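Something like this sketch, if I understood the dictionary views correctly (ORACLE is my libname):

proc sql;
  select memname, name, type, length
    from dictionary.columns
    where libname = 'ORACLE';  /* dictionary LIBNAME values are stored uppercase */
quit;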

 

However, I would need more information about my data, like:

- how many lines does my table have?
- how many different values does each of my fields take?
- for numeric fields, what are the mean, min, max and, why not, the standard deviation?

 

But here I'm stuck.

I know how to retrieve the information for one field of one table, but I'd really like to have it for my N tables and M(i) fields (i in [1-N]).

 

I tried a couple of loops, using SQL, proc contents and macros, but my level is too weak to write this properly, let alone elegantly.

 

So I would be delighted if some of you could give me tips on how to retrieve this extra information about my data, ideally for ALL fields of ALL tables.

 

Thanks a lot

Gorzyne

15 Replies
RW9
Diamond | Level 26

- how many lines does my table have?

No such thing as "lines"; you mean observations. You can get this most simply by looking at the sashelp.vtable view, which contains table-level information (like vcolumn, but for tables).
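For example, a minimal sketch with the SQL counterpart, DICTIONARY.TABLES; note that NOBS can come back missing for an external database libname, since SAS cannot always obtain row counts from the engine:

proc sql;
  select memname, nobs
    from dictionary.tables
    where libname = 'ORACLE';  /* dictionary LIBNAME values are stored uppercase */
quit;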

 

- how many different values does each of my fields take?

You would use a procedure for this; proc freq is the obvious choice here.
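For a quick count of distinct values across all variables, the NLEVELS option is handy; a minimal sketch on sashelp.class (ODS OUTPUT captures the report as a dataset):

ods output nlevels=work.levels;  /* save the NLevels report as a dataset */
proc freq data=sashelp.class nlevels;
  tables _all_ / noprint;        /* count levels without printing the tables */
run;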

 

- for numeric fields, what are the mean, min, max and, why not, the standard deviation?

You would use a procedure for this; proc means is the obvious choice here.
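A minimal sketch, again on sashelp.class:

proc means data=sashelp.class n mean std min max;
  var _numeric_;  /* summarise every numeric variable at once */
run;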

 

You can find plenty of examples of each procedure in the manual, on Google, or on this forum.

Gorzyne
Calcite | Level 5

Thank you for your quick reply; it does sound simple. I'm new to the SAS way of thinking.

 

I tried this sequence

 

PROC FREQ DATA=ORACLE._ALL_ OUT=LISTE_TABLE3 NOPRINT;
RUN;

However, I get an error; I guess I'm rather close to the correct syntax.

I get a

 

ERROR: A database error occurred. The database specific error follows:
       DATABASE error: ORA-00911: caractère non valide (invalid character)

 

 

It looks like my _ALL_ isn't accepted by the MEANS and FREQ procedures (but it's okay with CONTENTS...).

 

And yeah, I meant "observations"; sorry, I was searching for the word.

 

 

RW9
Diamond | Level 26

Yes, you can't use _all_ in that scenario; it will only take one dataset as input, as will most procedures. Please avoid coding all in uppercase. Something like:

data _null_;
  set sashelp.vtable (where=(libname="ORACLE"));  /* one row per dataset in the libname */
  call execute(cats('proc freq data=oracle.', memname,
                    '; tables _character_ / out=', memname, '_freq; run;'));
run;

This will create a proc freq step for each dataset in the oracle libname, tabulating all the character variables. Output goes to the work library, named after the dataset plus _freq.

I would really question why you want to do this, though; I don't see any use for the information.

PaigeMiller
Diamond | Level 26

I agree with @RW9 who said

 

I would really question why you want to do this, though; I don't see any use for the information.

 

It seems like a total waste of time. I have been through a bazillion different projects in my career and I have never had the need to do this. The closest thing I can think of is checking for outliers, which can probably be done a lot more efficiently.

--
Paige Miller
Gorzyne
Calcite | Level 5

Thank you for this code, I would never have figured it out myself; I'll try it in a couple of minutes and let you know.

I need this because all the data is going to migrate towards another data lake, so I must make sure that the data is still the same; well, it's not exactly that, but that's the general idea.

But in general it's good to know what's in your data when checking for anomalies; for example, you expect some fields to take multiple values, so you can easily check whether your data is good.

 

 

Gorzyne
Calcite | Level 5

Okay, I've been greedy indeed; the instruction has been running for minutes...

 

So, what if I'd like to run the same thing on a subset of tables, how would I proceed?

I created a table containing the names of the tables I am working on, but perhaps something smarter could be done, like creating a kind of sub-libname of some sort, if that is possible?

 

best regards

Gorz

RW9
Diamond | Level 26

It will do; you are asking for a frequency of all the character data in that libname, which from the name would indicate that the data sits in an Oracle database. Are you migrating the database? If so, why are you using SAS here?

As for how, just pop your dataset in rather than sashelp.vtable:

data _null_;
  set yourdata;
  /* memname holds one dataset name per row */
  call execute(cats('proc freq data=oracle.', memname,
                    '; tables _character_ / out=', memname, '_freq; run;'));
run;

I assume here it's called yourdata and has a memname field with the dataset names in it.

You're still going to hit the slowness of communicating with the DB if that is a connected libname.

Gorzyne
Calcite | Level 5

OK, I'll try that one.

 

Your question is a good one. In fact, I'm not in charge of the migration, or even of checking that the migration went well from an IT point of view. But from a user point of view: they have a couple of WPS/SAS programs running on the data warehouse, so after the migration they want to make sure they still have all the data they need in the proper order. There might have been a few changes in table architecture, and names of fields and tables might change as well, so the idea is to get some kind of "signature" for each field; a min/max/deviation/freq would help figure that out. In fact, I don't have to check the whole warehouse, only around a few dozen tables.

 

I tried to stop the last routine; it's still running, I hope the server won't crash.

 

Gorz

Reeza
Super User

@Gorzyne wrote:

OK, I'll try that one.

 

Your question is a good one. In fact, I'm not in charge of the migration, or even of checking that the migration went well from an IT point of view. But from a user point of view: they have a couple of WPS/SAS programs running on the data warehouse, so after the migration they want to make sure they still have all the data they need in the proper order. There might have been a few changes in table architecture, and names of fields and tables might change as well, so the idea is to get some kind of "signature" for each field; a min/max/deviation/freq would help figure that out. In fact, I don't have to check the whole warehouse, only around a few dozen tables.

 

I tried to stop the last routine; it's still running, I hope the server won't crash.

 

Gorz


If your data sets are large, the statistics can match even when the data does not, so this method will not work. For example, for a normally distributed variable with 1 million observations, if you add 1,000 observations you're looking for differences in the 4th/5th decimal place.
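A rough sketch with simulated data to show the scale (the dataset names are just examples):

data big extra;
  call streaminit(42);
  do i = 1 to 1001000;
    x = rand('normal');
    if i <= 1000000 then output big;  /* first million rows */
    output extra;                     /* same rows plus 1,000 more */
  end;
  keep x;
run;

/* the two means typically differ only around the 4th/5th decimal place */
proc means data=big mean maxdec=6;   var x; run;
proc means data=extra mean maxdec=6; var x; run;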

 

Given that these are on different machines, with large numbers you may also run into numeric precision issues.

I'm not sure what the appropriate method would be in this case.

ballardw
Super User

@Gorzyne wrote:

Thank you for this code, I would never have figured it out myself; I'll try it in a couple of minutes and let you know.

I need this because all the data is going to migrate towards another data lake, so I must make sure that the data is still the same; well, it's not exactly that, but that's the general idea.

But in general it's good to know what's in your data when checking for anomalies; for example, you expect some fields to take multiple values, so you can easily check whether your data is good.

 

 


Caution Will Robinson: Danger approaches.

 

 

Formats applied to variables could make the results of your summary tables very misleading. If a variable has a format such as F4.0 applied, then an actual value of 4.1 would appear as 4 in a proc freq table. Custom formats applied to categorical variables may be even more problematic. Example: I have data that contains site codes, but I also have a format that groups sites into regions for some reporting needs, so the table for that report contains site codes formatted with the region code. Your proc freq on that table would show results with the Region format applied, not the actual site codes.
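A small sketch of that masking effect (hypothetical data; proc freq groups levels by their formatted values):

data have;
  input x;
  format x f4.0;  /* display format rounds to whole numbers */
  cards;
4.1
4.4
;
run;

proc freq data=have;
  tables x;  /* shows a single level "4" with a count of 2 */
run;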

 

Currency values often get a format with two decimals applied, but the actual values may have more decimal places, depending on the data.

 

Some additional consideration is going to be needed before using simple procedures to verify data migration.

 

Which also brings up the possible migration of formats.

Gorzyne
Calcite | Level 5

Thank you for these very interesting considerations.

As a matter of fact, it's IT's business to make sure the data has been loaded correctly.

We are supposed to work on rather raw stuff that has been processed through the same copybook, so the data is likely to be rather close.

 

The complementary information I'm trying to retrieve should let me 1) identify characteristic fields and 2) do a fast signature check: is the number of observations the same? If I have a list of 14321 different names, do I find the same names on the other side? If the total cash is 287987897287.85 knuts, then it's OK if we only find 287987897000 knuts on the other side; I mean, slight differences would be acceptable, I guess. We want to check for outliers, but maybe you guys have better suggestions on how to identify the needs.

Reeza
Super User

@Gorzyne wrote:

If the total cash is 287987897287.85 knuts, then it's OK if we only find 287987897000 knuts on the other side; I mean, slight differences would be acceptable, I guess. We want to check for outliers, but maybe you guys have better suggestions on how to identify the needs.


Are you sure about that? Usually that's not the case, because it also means that somewhere a record is wrong for someone. You may want to check with the business whether that's an acceptable difference. Small differences may not be significant to totals or means, but if that data is used for other things, e.g. filtering people for a specific program, it could be quite problematic. And then you'll also lose the trust of the end users as soon as that starts happening.

Gorzyne
Calcite | Level 5

Yeah, I know, that's what I was thinking, but that's what I was told: end users wouldn't bother about a few knuts. It's for accounting purposes, so if there's a slight difference in the global reporting, all in all they will presume the newest is the better (the current datamart is known for having some errors). I agree that I would feel more comfortable with strict equality, but in fact this is not on my critical path; my aim is to get the final reporting looking normal.

 

Anyway, I had issues with the latest version of the code you gave me.

 

I tried

 

data MYTABLES;
  input var1 $30.;
  cards;
MYTABLE1
MYTABLE2
;
run;
data _null_;
  set MYTABLES;
  call execute(cats('proc freq data=oracle.',var1,'; tables _character_ / out=',var1,'_freq; run;'));
run;

 

It executes correctly and gives me the two output tables in my work library, but when I open the output I only have something like this:

 

[screenshot: Capture.PNG]

 

But my tables have many fields, so something went wrong... because in the first version with _ALL_ I was able to catch a glimpse of the first results and everything looked correct...

ballardw
Super User

@Gorzyne wrote:

Yeah, I know, that's what I was thinking, but that's what I was told: end users wouldn't bother about a few knuts. It's for accounting purposes, so if there's a slight difference in the global reporting, all in all they will presume the newest is the better (the current datamart is known for having some errors). I agree that I would feel more comfortable with strict equality, but in fact this is not on my critical path; my aim is to get the final reporting looking normal.

 

Anyway, I had issues with the latest version of the code you gave me.

 

I tried

 

data MYTABLES;
  input var1 $30.;
  cards;
MYTABLE1
MYTABLE2
;
run;
data _null_;
  set MYTABLES;
  call execute(cats('proc freq data=oracle.',var1,'; tables _character_ / out=',var1,'_freq; run;'));
run;

 

It executes correctly and gives me the two output tables in my work library, but when I open the output I only have something like this:

 

[screenshot: Capture.PNG]

 

But my tables have many fields, so something went wrong... because in the first version with _ALL_ I was able to catch a glimpse of the first results and everything looked correct...


The OUT= option only creates a table for ONE of the table requests.

You need to give each variable its own TABLES request to use the OUT= option of proc freq:

proc freq data=sashelp.class;
   tables age / out=work.ages ;
   tables sex / out=work.sexes;
   tables name/ out=work.names;
run;

An uglier output data set can be created with

 

proc summary data=sashelp.class;
  class _character_ / missing;
  ways 1;
  output out=work.charsum;
run;

but you'll want to read up on the _type_ automatic variable and options such as the WAYS statement.
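To combine that with the earlier CALL EXECUTE approach, here is a sketch that generates one TABLES/OUT= request per character variable from sashelp.vcolumn; the memname_name output naming is illustrative and may exceed the 32-character dataset-name limit, in which case it needs shortening:

data _null_;
  set sashelp.vcolumn (where=(libname="ORACLE" and type="char"));
  /* one proc freq step per character variable */
  call execute(cats('proc freq data=oracle.', memname,
                    '; tables ', name, ' / out=', memname, '_', name,
                    '; run;'));
run;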

 
