Gorzyne
Calcite | Level 5

Hi all, this is my first message in the community; I hope I express my needs in the right way.

 

I recently discovered the VCOLUMN view of SASHELP and like it a lot. I also understood I could get something similar with an SQL equivalent.
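Something like this sketch, if I understood the dictionary views correctly (ORACLE is my libname):

proc sql;
  select memname, name, type, length
    from dictionary.columns
    where libname = 'ORACLE';  /* dictionary LIBNAME values are stored uppercase */
quit;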

 

However, I would need more information about my data, like:

- how many lines does my table have?
- how many different values does each of my fields take?
- for numeric fields, what are the mean, min, max and, why not, the standard deviation?

 

But here I'm stuck.

I know how to retrieve the information for one field of one table, but I'd really like to have it for my N tables and M(i) fields (i in [1-N]).

 

I tried a couple of loops, using SQL, proc contents and macros, but my level is too weak to write this properly, let alone elegantly.

 

So I would be delighted if some of you could give me tips on how to retrieve this extra information about my data, ideally for ALL fields of ALL tables.

 

Thanks a lot

Gorzyne

15 Replies
RW9
Diamond | Level 26

- how many lines does my table have?

No such thing as "lines"; you mean observations. You can get this most simply by looking at the sashelp.vtable view, which contains table-level information (like vcolumn, but for tables).
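For example, a minimal sketch with the SQL counterpart, DICTIONARY.TABLES; note that NOBS can come back missing for an external database libname, since SAS cannot always obtain row counts from the engine:

proc sql;
  select memname, nobs
    from dictionary.tables
    where libname = 'ORACLE';  /* dictionary LIBNAME values are stored uppercase */
quit;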

 

- how many different values does each of my fields take?

You would use a procedure for this; proc freq is the obvious choice here.
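For a quick count of distinct values across all variables, the NLEVELS option is handy; a minimal sketch on sashelp.class (ODS OUTPUT captures the report as a dataset):

ods output nlevels=work.levels;  /* save the NLevels report as a dataset */
proc freq data=sashelp.class nlevels;
  tables _all_ / noprint;        /* count levels without printing the tables */
run;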

 

- for numeric fields, what are the mean, min, max and, why not, the standard deviation?

You would use a procedure for this; proc means is the obvious choice here.
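A minimal sketch, again on sashelp.class:

proc means data=sashelp.class n mean std min max;
  var _numeric_;  /* summarise every numeric variable at once */
run;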

 

You can find plenty of examples of each procedure in the manual, on Google, or on this forum.

Gorzyne
Calcite | Level 5

Thank you for your quick reply; it does sound simple. I'm new to the SAS way of thinking.

 

I tried this sequence

 

PROC FREQ DATA=ORACLE._ALL_ OUT=LISTE_TABLE3 NOPRINT;
RUN;

However, I get an error; I guess I'm rather close to the correct syntax.

I get a

 

ERROR: A database error occurred. The database specific error follows:
       DATABASE error: ORA-00911: caractère non valide (invalid character)

 

 

It looks like my _ALL_ isn't accepted by the MEANS and FREQ procedures (but it's okay with CONTENTS...).

 

And yeah, I meant "observations"; sorry, I was searching for the word.

 

 

RW9
Diamond | Level 26

Yes, you can't use _all_ in that scenario; it will only take one dataset as input, as will most procedures. Please avoid coding all in uppercase. Something like:

data _null_;
  set sashelp.vtable (where=(libname="ORACLE"));  /* one row per dataset in the libname */
  call execute(cats('proc freq data=oracle.', memname,
                    '; tables _character_ / out=', memname, '_freq; run;'));
run;

This will create a proc freq step for each dataset in the oracle libname, tabulating all the character variables. Output goes to the work library, named after the dataset plus _freq.

I would really question why you want to do this, though; I don't see any use for the information.

PaigeMiller
Diamond | Level 26

I agree with @RW9 who said

 

I would really question why you want to do this, though; I don't see any use for the information.

 

It seems like a total waste of time. I have been through a bazillion different projects in my career and I have never had the need to do this. The closest thing I can think of is checking for outliers, which can probably be done a lot more efficiently.

--
Paige Miller
Gorzyne
Calcite | Level 5

Thank you for this code, I would never have figured it out myself; I'll try it in a couple of minutes and let you know.

I need this because all the data is going to migrate towards another data lake, so I must make sure that the data is still the same; well, it's not exactly that, but that's the general idea.

But in general it's good to know what's in your data when checking for anomalies; for example, you expect some fields to take multiple values, so you can easily check whether your data is good.

 

 

Gorzyne
Calcite | Level 5

Okay, I've been greedy indeed; the instruction has been running for minutes...

 

So, what if I'd like to run the same thing on a subset of tables, how would I proceed?

I created a table containing the names of the tables I am working on, but perhaps something smarter could be done, like creating a kind of sub-libname of some sort, if that is possible?

 

best regards

Gorz

RW9
Diamond | Level 26

It will do; you are asking for a frequency of all the character data in that libname, which from the name would indicate that the data sits in an Oracle database. Are you migrating the database? If so, why are you using SAS here?

As for how, just pop your dataset in rather than sashelp.vtable:

data _null_;
  set yourdata;
  /* memname holds one dataset name per row */
  call execute(cats('proc freq data=oracle.', memname,
                    '; tables _character_ / out=', memname, '_freq; run;'));
run;

I assume here it's called yourdata and has a memname field with the dataset names in it.

You're still going to hit the slowness of communicating with the DB if that is a connected libname.

Gorzyne
Calcite | Level 5

OK, I'll try that one.

 

Your question is a good one. In fact, I'm not in charge of the migration, or even of checking that the migration went well from an IT point of view. But from a user point of view: they have a couple of WPS/SAS programs running on the data warehouse, so after the migration they want to make sure they still have all the data they need in the proper order. There might have been a few changes in table architecture, and names of fields and tables might change as well, so the idea is to get some kind of "signature" for each field; a min/max/deviation/freq would help figure that out. In fact, I don't have to check the whole warehouse, only around a few dozen tables.

 

I tried to stop the last routine; it's still running, I hope the server won't crash.

 

Gorz

Reeza
Super User

@Gorzyne wrote:

OK, I'll try that one.

 

Your question is a good one. In fact, I'm not in charge of the migration, or even of checking that the migration went well from an IT point of view. But from a user point of view: they have a couple of WPS/SAS programs running on the data warehouse, so after the migration they want to make sure they still have all the data they need in the proper order. There might have been a few changes in table architecture, and names of fields and tables might change as well, so the idea is to get some kind of "signature" for each field; a min/max/deviation/freq would help figure that out. In fact, I don't have to check the whole warehouse, only around a few dozen tables.

 

I tried to stop the last routine; it's still running, I hope the server won't crash.

 

Gorz


If your data sets are large, the statistics can match even when the data does not, so this method will not work. For example, for a normally distributed variable with 1 million observations, if you add 1,000 observations you're looking for differences in the 4th/5th decimal place.
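A rough sketch with simulated data to show the scale (the dataset names are just examples):

data big extra;
  call streaminit(42);
  do i = 1 to 1001000;
    x = rand('normal');
    if i <= 1000000 then output big;  /* first million rows */
    output extra;                     /* same rows plus 1,000 more */
  end;
  keep x;
run;

/* the two means typically differ only around the 4th/5th decimal place */
proc means data=big mean maxdec=6;   var x; run;
proc means data=extra mean maxdec=6; var x; run;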

 

Given that these are on different machines, with large numbers you may also run into numeric precision issues.

I'm not sure what the appropriate method would be in this case.

ballardw
Super User

@Gorzyne wrote:

Thank you for this code, I would never have figured it out myself; I'll try it in a couple of minutes and let you know.

I need this because all the data is going to migrate towards another data lake, so I must make sure that the data is still the same; well, it's not exactly that, but that's the general idea.

But in general it's good to know what's in your data when checking for anomalies; for example, you expect some fields to take multiple values, so you can easily check whether your data is good.

 

 


Caution Will Robinson: Danger approaches.

 

 

Formats applied to variables could make the results of your summary tables very misleading. If a variable has a format such as F4.0 applied, then an actual value of 4.1 would appear as 4 in a proc freq table. Custom formats applied to categorical variables may be even more problematic. Example: I have data that contains site codes, but I also have a format that groups sites into regions for some reporting needs, so the table for that report contains site codes formatted with the region code. Your proc freq on that table would show results with the Region format applied, not the actual site codes.
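A small sketch of that masking effect (hypothetical data; proc freq groups levels by their formatted values):

data have;
  input x;
  format x f4.0;  /* display format rounds to whole numbers */
  cards;
4.1
4.4
;
run;

proc freq data=have;
  tables x;  /* shows a single level "4" with a count of 2 */
run;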

 

Currency values often get a format with two decimals applied, but the actual values may have more decimal places, depending on the data.

 

Some additional consideration is going to be needed before using simple procedures to verify data migration.

 

Which also brings up the possible migration of formats.

Gorzyne
Calcite | Level 5

Thank you for these very interesting considerations.

As a matter of fact, it's IT's business to make sure the data has been loaded correctly.

We are supposed to work on rather raw stuff that has been processed through the same copybook, so the data is likely to be rather close.

 

The complementary information I'm trying to retrieve should let me 1) identify characteristic fields and 2) do a fast signature check: is the number of observations the same? If I have a list of 14321 different names, do I find the same names on the other side? If the total cash is 287987897287.85 knuts, then it's OK if we only find 287987897000 knuts on the other side; I mean, slight differences would be acceptable, I guess. We want to check for outliers, but maybe you guys have better suggestions on how to identify the needs.

Reeza
Super User

@Gorzyne wrote:

If the total cash is 287987897287.85 knuts, then it's OK if we only find 287987897000 knuts on the other side; I mean, slight differences would be acceptable, I guess. We want to check for outliers, but maybe you guys have better suggestions on how to identify the needs.


Are you sure about that? Usually that's not the case, because it also means that somewhere a record is wrong for someone. You may want to check with the business whether that's an acceptable difference. Small differences may not be significant to totals or means, but if that data is used for other things, e.g. filtering people for a specific program, it could be quite problematic. And then you'll also lose the trust of the end users as soon as that starts happening.

Gorzyne
Calcite | Level 5

Yeah, I know, that's what I was thinking, but that's what I was told: end users wouldn't bother about a few knuts. It's for accounting purposes, so if there's a slight difference in the global reporting, all in all they will presume the newest is the better (the current datamart is known for having some errors). I agree that I would feel more comfortable with strict equality, but in fact this is not on my critical path; my aim is to get the final reporting looking normal.

 

Anyway, I had issues with the latest version of the code you gave me.

 

I tried

 

data MYTABLES;
  input var1 $30.;
  cards;
MYTABLE1
MYTABLE2
;
run;
data _null_;
  set MYTABLES;
  call execute(cats('proc freq data=oracle.',var1,'; tables _character_ / out=',var1,'_freq; run;'));
run;

 

It executes correctly and gives me the two output tables in my work library, but when I open the output I only have something like this:

 

[screenshot: Capture.PNG]

 

But my tables have many fields, so something went wrong... because in the first version with _ALL_ I was able to catch a glimpse of the first results and everything looked correct...

ballardw
Super User

@Gorzyne wrote:

Yeah, I know, that's what I was thinking, but that's what I was told: end users wouldn't bother about a few knuts. It's for accounting purposes, so if there's a slight difference in the global reporting, all in all they will presume the newest is the better (the current datamart is known for having some errors). I agree that I would feel more comfortable with strict equality, but in fact this is not on my critical path; my aim is to get the final reporting looking normal.

 

Anyway, I had issues with the latest version of the code you gave me.

 

I tried

 

data MYTABLES;
  input var1 $30.;
  cards;
MYTABLE1
MYTABLE2
;
run;
data _null_;
  set MYTABLES;
  call execute(cats('proc freq data=oracle.',var1,'; tables _character_ / out=',var1,'_freq; run;'));
run;

 

It executes correctly and gives me the two output tables in my work library, but when I open the output I only have something like this:

 

[screenshot: Capture.PNG]

 

But my tables have many fields, so something went wrong... because in the first version with _ALL_ I was able to catch a glimpse of the first results and everything looked correct...


The OUT= option only creates a table for ONE of the table requests.

You need to give each variable its own TABLES request to use the OUT= option of proc freq:

proc freq data=sashelp.class;
   tables age / out=work.ages ;
   tables sex / out=work.sexes;
   tables name/ out=work.names;
run;

An uglier output data set can be created with

 

proc summary data=sashelp.class;
  class _character_ / missing;
  ways 1;
  output out=work.charsum;
run;

but you'll want to read up on the _type_ automatic variable and options such as the WAYS statement.
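To combine that with the earlier CALL EXECUTE approach, here is a sketch that generates one TABLES/OUT= request per character variable from sashelp.vcolumn; the memname_name output naming is illustrative and may exceed the 32-character dataset-name limit, in which case it needs shortening:

data _null_;
  set sashelp.vcolumn (where=(libname="ORACLE" and type="char"));
  /* one proc freq step per character variable */
  call execute(cats('proc freq data=oracle.', memname,
                    '; tables ', name, ' / out=', memname, '_', name,
                    '; run;'));
run;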

 
